# Carapace Technical Design Specification (SPEC)
## 1. Purpose
Carapace is a triage orchestration engine for large contribution queues. It detects similarity across pull requests and issues, selects canonical candidates, and routes noisy submissions out of maintainer focus views.
This specification defines a modular, extensible architecture that:
- Works both as a live service and as an offline proof of concept (POC).
- Uses fast local-first embedding and similarity approaches.
- Supports low-pass filtering (labels, stale age, actor behavior, etc.).
- Can swap source systems (GitHub now, Jira or others later) without rewriting core logic.
## 2. Scope
### In Scope
- Similarity detection for PRs and issues.
- Cluster formation and canonical candidate selection.
- Low-pass noise filtering before expensive scoring.
- Pluggable embeddings (local and API-backed).
- Extensible connector and hook system.
- Rule-driven routing actions (label, comment, quarantine queue).
- Offline replay mode and live webhook mode.
### Out of Scope (Initial)
- Full autonomous merge/close decisions by default.
- Full code review replacement (reuse existing review signals).
- UI-heavy platform in v1 (GitHub-native views first).
## 3. System Overview
### 3.1 High-Level Components
1. Ingestion Layer
   - Receives events from connectors (GitHub webhooks, batch importers, replay logs).
2. Normalization Layer
   - Maps provider payloads into a canonical Carapace domain schema.
3. Low-Pass Filter Layer
   - Suppresses noise early to reduce compute and maintainer cognitive load.
4. Fingerprint + Embedding Layer
   - Builds text, structural, and lineage fingerprints.
   - Produces embeddings via a pluggable provider abstraction.
5. Similarity + Clustering Layer
   - Candidate retrieval + pair scoring + cluster assignment.
6. Canonical Selection Layer
   - Scores candidates in each cluster and chooses the canonical entity.
7. Action + Policy Layer
   - Applies labels/comments/routes through connector sinks.
8. Storage Layer
   - Relational store for entities/events/decisions.
   - Vector and signature index for retrieval.
9. Hook/SDK Extension Layer
   - Lifecycle hooks and plugin APIs for custom logic.
### 3.2 Deployment Modes
- Live mode: webhook-driven, near real-time updates.
- Offline mode: snapshot/event replay; identical core pipeline.
## 4. Architecture Requirements
### Functional Requirements
- FR-1: Ingest PR/issue/update events from GitHub.
- FR-2: Support connector abstraction for alternate systems (Jira, GitLab, Bitbucket).
- FR-3: Apply configurable low-pass rules pre-fingerprint.
- FR-4: Generate hybrid fingerprints (text, structural, lineage, quality signals).
- FR-5: Compute similarity with scalable candidate retrieval.
- FR-6: Cluster related submissions incrementally.
- FR-7: Rank and mark canonical entry per cluster.
- FR-8: Emit routing decisions (labels/comments/queues) with confidence and reason traces.
- FR-9: Expose SDK hooks for custom features and decision overrides.
- FR-10: Operate in offline POC mode with deterministic replay.
- FR-11: Load repository-level configuration from `.carapace.yaml` at repository root.
### Non-Functional Requirements
- NFR-1: Handle 10k+ open entities per repo/org.
- NFR-2: P95 online event processing under 3s excluding external API delays.
- NFR-3: Support horizontal worker scaling.
- NFR-4: Full auditability of decisions and feature contributions.
- NFR-5: Idempotent event processing with effectively exactly-once application of actions.
- NFR-6: Python services must use Pydantic models for boundary validation and typed contracts.
## 5. Domain Model
### Core Entities
- `SourceEntity`: PR, issue, ticket, task.
- `Fingerprint`: normalized feature bundle for an entity.
- `EmbeddingVector`: dense semantic vector(s) by provider/model.
- `SimilarityEdge`: pairwise score with feature breakdown.
- `Cluster`: group of related entities.
- `CanonicalDecision`: winning entity + ranked alternatives.
- `RoutingDecision`: actions to apply (labels/comments/quarantine).
- `PolicyRule`: low-pass/decision rule with version.
### Key Tables (Relational)
- `entities`
- `entity_events`
- `fingerprints`
- `embeddings`
- `similarity_edges`
- `clusters`
- `canonical_decisions`
- `routing_decisions`
- `policy_rules`
- `connector_state` (cursor/checkpoint state per provider)
## 6. Connector and SDK Design
### 6.1 Connector Interfaces
Define provider-specific connectors behind stable contracts:
```text
SourceConnector
- subscribe_events()
- list_open_entities()
- get_entity(id)
- get_diff_or_change_set(id)
- get_reviews_and_checks(id)
SinkConnector
- apply_labels(id, labels[])
- post_comment(id, body)
- set_status(id, state, context)
- route_to_queue(id, queue_key)
```
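As an illustration, the `SinkConnector` contract above could be expressed as a Python `typing.Protocol`. This is a minimal sketch rather than the project's actual SDK: the type annotations and the `InMemorySink` recorder are assumptions added for illustration, and a real service would validate payloads with Pydantic models as required by NFR-6.

```python
from dataclasses import dataclass, field
from typing import Protocol


class SinkConnector(Protocol):
    """Typed rendering of the SinkConnector contract (annotations are assumed)."""

    def apply_labels(self, id: str, labels: list[str]) -> None: ...
    def post_comment(self, id: str, body: str) -> None: ...
    def set_status(self, id: str, state: str, context: str) -> None: ...
    def route_to_queue(self, id: str, queue_key: str) -> None: ...


@dataclass
class InMemorySink:
    """Hypothetical sink that records actions instead of calling a provider."""

    calls: list[tuple] = field(default_factory=list)

    def apply_labels(self, id: str, labels: list[str]) -> None:
        self.calls.append(("apply_labels", id, tuple(labels)))

    def post_comment(self, id: str, body: str) -> None:
        self.calls.append(("post_comment", id, body))

    def set_status(self, id: str, state: str, context: str) -> None:
        self.calls.append(("set_status", id, state, context))

    def route_to_queue(self, id: str, queue_key: str) -> None:
        self.calls.append(("route_to_queue", id, queue_key))


def mark_duplicate(sink: SinkConnector, entity_id: str, canonical_id: str) -> None:
    """Core logic depends only on the contract, not on a specific provider."""
    sink.apply_labels(entity_id, ["triage/duplicate"])
    sink.post_comment(entity_id, f"Possible duplicate of {canonical_id}.")
```

A recording sink like this may be useful in offline replay, where decisions should be inspectable without writing to the source system.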
### 6.2 Hook Lifecycle
Hooks are invoked with immutable context and mutable decision envelope.
```text
before_normalize
after_normalize
before_low_pass
after_low_pass
before_fingerprint
after_fingerprint
before_similarity
after_similarity
before_canonical
after_canonical
before_action
after_action
on_error
```
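The "immutable context, mutable decision envelope" contract can be sketched as follows. The class names, the envelope fields, and the choice to record a failing hook's error and continue are assumptions for illustration. Per-hook timeouts (the `timeout_ms` argument mentioned in section 13) are omitted to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Callable

HOOK_NAMES = {
    "before_normalize", "after_normalize", "before_low_pass", "after_low_pass",
    "before_fingerprint", "after_fingerprint", "before_similarity",
    "after_similarity", "before_canonical", "after_canonical",
    "before_action", "after_action", "on_error",
}


@dataclass(frozen=True)
class HookContext:
    """Read-only view passed to hooks; frozen so hooks cannot reassign fields."""
    entity_id: str
    stage: str


@dataclass
class DecisionEnvelope:
    """Mutable state that hooks are allowed to change (fields are hypothetical)."""
    labels: list[str] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)


Hook = Callable[[HookContext, DecisionEnvelope], None]


class HookRegistry:
    def __init__(self) -> None:
        self._hooks: dict[str, list[Hook]] = {}

    def register_hook(self, name: str, callback: Hook) -> None:
        if name not in HOOK_NAMES:
            raise ValueError(f"unknown hook: {name}")
        self._hooks.setdefault(name, []).append(callback)

    def run(self, name: str, ctx: HookContext, envelope: DecisionEnvelope) -> None:
        for callback in self._hooks.get(name, []):
            try:
                callback(ctx, envelope)
            except Exception as exc:
                # Assumed fallback: record the failure and keep the pipeline going.
                envelope.errors.append(f"{name}: {exc}")
```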
### 6.3 SDK Principles
- Provider-neutral schema.
- Versioned hook contracts.
- Deterministic replay support.
- Safe extension boundaries (timeouts + fallback behavior per hook).
- Language-agnostic API transport (HTTP/JSON now, gRPC optional).
## 7. Low-Pass Filter Design
### 7.1 Goals
- Drop or down-prioritize predictable noise early.
- Reduce pairwise comparison cost.
- Preserve recoverability and transparency.
### 7.2 Rule Types
1. Hard skip
   - Ignore `stale`, `invalid`, `wontfix`, archived targets, bot-only formatting PRs.
2. Soft suppress
   - De-prioritize low-signal entities into quarantine queues.
3. Priority boost
   - Elevate labels like `security`, `regression`, `release-blocker`.
### 7.3 Rule Inputs
- Labels allowlist/denylist.
- Age windows (`updated_at`, stale duration).
- Actor type/trust tier (maintainer, external, automation).
- CI status stability.
- File/path filters (vendor/docs-only/config-only).
- Historical behavior (repeat low-signal patterns).
### 7.4 Output
- `filter_state`: `pass`, `suppress`, `skip`.
- `filter_reason_codes`: array of deterministic reason ids.
- `priority_weight`: numeric multiplier for downstream ranking.
### 7.5 Example Policy Snippet
```yaml
low_pass:
  hard_skip_labels: ["invalid", "wontfix", "duplicate", "stale"]
  soft_suppress_labels: ["question", "discussion"]
  suppress_if:
    - condition: "is_docs_only && actor_is_new && ci_state == 'none'"
      reason: "LOW_SIGNAL_DOCS_ONLY"
  boost_if:
    - condition: "has_label('security') || has_label('regression')"
      weight: 1.5
```
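To make the output contract from 7.4 concrete, here is a minimal sketch of how the label-based parts of the policy above might be evaluated. It covers only the label lists and the label-based boost. The `suppress_if`/`boost_if` condition expressions are not parsed, and the reason-code strings are hypothetical.

```python
from dataclasses import dataclass, field

# Label sets copied from the example policy above.
HARD_SKIP_LABELS = {"invalid", "wontfix", "duplicate", "stale"}
SOFT_SUPPRESS_LABELS = {"question", "discussion"}
BOOST_LABELS = {"security", "regression"}
BOOST_WEIGHT = 1.5


@dataclass
class FilterResult:
    filter_state: str = "pass"
    filter_reason_codes: list[str] = field(default_factory=list)
    priority_weight: float = 1.0


def low_pass(labels: set[str]) -> FilterResult:
    result = FilterResult()
    # Hard skip wins outright; sorting keeps reason codes deterministic.
    hard = sorted(labels & HARD_SKIP_LABELS)
    if hard:
        result.filter_state = "skip"
        result.filter_reason_codes = [f"HARD_SKIP_LABEL:{name}" for name in hard]
        return result
    soft = sorted(labels & SOFT_SUPPRESS_LABELS)
    if soft:
        result.filter_state = "suppress"
        result.filter_reason_codes = [f"SOFT_SUPPRESS_LABEL:{name}" for name in soft]
    # Boosts still apply to suppressed entities so they rank higher if recovered.
    if labels & BOOST_LABELS:
        result.priority_weight = BOOST_WEIGHT
        result.filter_reason_codes.append("PRIORITY_BOOST")
    return result
```

Whether a boost should apply to a suppressed entity is a design choice the policy does not settle; the sketch assumes it does.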
### 7.6 Configuration Discovery and Precedence
- Primary repo config path: `.carapace.yaml` at repository root.
- Optional org/global defaults may be provided by service config.
- Precedence order: entity/runtime override > repo `.carapace.yaml` > org defaults > system defaults.
- Missing repo config must not fail processing; system defaults apply.
- Config schema validation must run at load time and produce actionable errors.
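The precedence order can be sketched as a layered merge in which later layers override earlier ones. Two details here are assumptions, since the rules above do not specify them: nested mappings are merged key by key, and lists are replaced wholesale rather than concatenated. Schema validation is out of scope for this sketch.

```python
from typing import Any, Mapping, Optional


def merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> dict[str, Any]:
    """Return a new dict where `override` wins; nested mappings merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def effective_config(
    system_defaults: Mapping[str, Any],
    org_defaults: Optional[Mapping[str, Any]] = None,
    repo_config: Optional[Mapping[str, Any]] = None,
    runtime_override: Optional[Mapping[str, Any]] = None,
) -> dict[str, Any]:
    """Apply layers from lowest to highest precedence; missing layers are skipped."""
    config = dict(system_defaults)
    for layer in (org_defaults, repo_config, runtime_override):
        if layer:
            config = merge(config, layer)
    return config
```

With `repo_config=None` (no `.carapace.yaml` found), the result falls back to org and system defaults, which matches the "must not fail processing" rule.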
## 8. Fingerprinting and Feature Engineering
### 8.1 Text Features
- Title/body normalized tokens.
- Linked issue ids.
- Extracted intent phrases from templates.
- External reviewer summaries (CodeRabbit/Greptile) when present.
### 8.2 Structural Features
- File path set + module buckets.
- Hunk signatures (path + context + normalized token hashes).
- Churn metrics: files changed, additions, deletions.
### 8.3 Lineage Features
- Commit SHAs.
- Patch-id set (when available from cloned refs).
- Base branch/head branch ancestry.
### 8.4 Quality Signals
- CI status history.
- Review approvals/comments.
- External reviewer score/risk/test-gap.
## 9. Embedding Strategy
### 9.1 Requirements
- Local-first for speed/cost/privacy.
- High retrieval quality for intent-level similarity.
- API fallback for quality/capacity bursts.
### 9.2 Provider Abstraction
```text
EmbeddingProvider
- embed_texts(texts[], mode) -> vectors[]
- model_id()
- dimensions()
- max_batch_size()
- health_check()
```
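One way to express this abstraction in Python is a `typing.Protocol`, shown below with a deterministic stand-in provider. The type annotations are assumptions, and `HashingProvider` is a hypothetical test double (a hashed bag of words). It is not a semantic model and is only meant for offline tests or replay runs where no model server is available.

```python
import hashlib
import math
from typing import Protocol


class EmbeddingProvider(Protocol):
    def embed_texts(self, texts: list[str], mode: str) -> list[list[float]]: ...
    def model_id(self) -> str: ...
    def dimensions(self) -> int: ...
    def max_batch_size(self) -> int: ...
    def health_check(self) -> bool: ...


class HashingProvider:
    """Hypothetical stand-in: hashes tokens into buckets and L2-normalizes."""

    def __init__(self, dims: int = 64) -> None:
        self._dims = dims

    def embed_texts(self, texts: list[str], mode: str = "document") -> list[list[float]]:
        return [self._embed(text) for text in texts]

    def _embed(self, text: str) -> list[float]:
        vec = [0.0] * self._dims
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % self._dims] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
        return [v / norm for v in vec]

    def model_id(self) -> str:
        return f"hashing-bow-{self._dims}"

    def dimensions(self) -> int:
        return self._dims

    def max_batch_size(self) -> int:
        return 256

    def health_check(self) -> bool:
        return True
```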
### 9.3 Recommended Local Models
Primary (balanced accuracy/speed):
- `BAAI/bge-m3` served via TEI/Infinity.
Fast path:
- `nomic-embed-text-v1.5` or equivalent compact model.
High-accuracy local option:
- `jina-embeddings-v3` class model where hardware allows.
### 9.4 API Option
- OpenAI-compatible embedding endpoint contract:
  - `POST /v1/embeddings`
  - Request includes `model` and `input`.
  - Response is normalized into the provider abstraction.
### 9.5 Multi-Vector Strategy (Optional)
- Store two vectors per entity:
  - `semantic_text_vector`
  - `review_summary_vector`
- Use weighted late fusion in pair scoring.
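Weighted late fusion here means scoring each vector pair separately and then combining the scores, rather than concatenating vectors before comparison. A minimal sketch follows. The 0.7/0.3 split and the fallback to text-only similarity when a review summary is missing are assumptions, not values from this spec.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fused_semantic_score(
    text_a: Sequence[float],
    text_b: Sequence[float],
    review_a: Optional[Sequence[float]] = None,
    review_b: Optional[Sequence[float]] = None,
    text_weight: float = 0.7,  # assumed weight; tune per deployment
) -> float:
    text_sim = cosine(text_a, text_b)
    # Review summaries are optional, so fall back to the text vector alone.
    if review_a is None or review_b is None:
        return text_sim
    return text_weight * text_sim + (1.0 - text_weight) * cosine(review_a, review_b)
```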
### 9.6 Operational Tuning
- Batch size auto-tuning by latency target.
- Quantization support (`int8`/`fp16`) for local serving.
- Model cache warm-up on startup.
## 10. Similarity Engine
### 10.1 Candidate Retrieval (Stage 1)
- Inverted indices on module buckets, linked issue ids, label classes.
- MinHash LSH on diff shingles.
- SimHash on normalized text.
Returns top-K candidates per entity for Stage 2 scoring.
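As a rough illustration of the MinHash LSH step, the sketch below builds signatures over token shingles and derives band keys for bucketing. Two entities become Stage 2 candidates when they share at least one band key. The shingle size, the 64 hash functions, the 16 bands, and the use of salted BLAKE2b as the hash family are all assumptions chosen for brevity.

```python
import hashlib


def shingles(tokens: list[str], k: int = 3) -> set[str]:
    """Overlapping k-token windows; short inputs yield a single shingle."""
    return {" ".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """One minimum per salted hash function; `items` must be non-empty."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")  # BLAKE2b accepts salts up to 16 bytes
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_band_keys(signature: list[int], bands: int = 16) -> list[tuple]:
    """Split the signature into bands; each (band index, rows) tuple is a bucket key."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]
```

More bands with fewer rows each raises recall at the cost of more false candidates, so the split would need tuning against the retrieval latency target.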
### 10.2 Pair Scoring (Stage 2)
Score uses weighted ensemble:
- Lineage score.
- Structure score.
- Semantic score.
- Size penalty.
Reference formula:
```text
S = 0.45*Lineage + 0.40*Structure + 0.15*Semantic - 0.10*SizePenalty
```
### 10.3 Edge Gating
- Strong edge if lineage overlap exceeds threshold.
- Medium edge if structure overlap + semantic alignment passes joint threshold.
- No edge otherwise.
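The reference formula and the gating rules can be sketched together as below. All component scores are assumed to lie in [0, 1]. The three threshold values are placeholders, and reading "joint threshold" as "both structure and semantic must pass their own threshold" is one possible interpretation, not a confirmed rule.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PairFeatures:
    lineage: float
    structure: float
    semantic: float
    size_penalty: float


def pair_score(f: PairFeatures) -> float:
    """Reference formula from 10.2."""
    return 0.45 * f.lineage + 0.40 * f.structure + 0.15 * f.semantic - 0.10 * f.size_penalty


def gate_edge(
    f: PairFeatures,
    lineage_threshold: float = 0.5,    # placeholder value
    structure_threshold: float = 0.6,  # placeholder value
    semantic_threshold: float = 0.7,   # placeholder value
) -> Optional[str]:
    """Return "strong", "medium", or None (no edge)."""
    if f.lineage > lineage_threshold:
        return "strong"
    if f.structure >= structure_threshold and f.semantic >= semantic_threshold:
        return "medium"
    return None
```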
### 10.4 Clustering
- Incremental union-find with strong/weak tiers.
- Weak edges require shared strong neighbor to avoid chain bridging.
## 11. Canonical Selection Engine
### 11.1 Scoring Features
- Cluster centrality (mean similarity).
- Coverage of cluster touched surface.
- CI health.
- External reviewer score.
- Approvals/review traction.
- Size penalty.
- Filter priority weight from low-pass layer.
### 11.2 Canonical Formula (Initial)
```text
CanonScore =
5.0*Coverage +
4.0*Centrality +
3.0*CI +
2.0*ReviewerScore +
1.5*Approvals +
1.0*PriorityWeight -
1.0*SizePenalty
```
### 11.3 Decision States
- `canonical`
- `duplicate_of:<id>`
- `related_non_duplicate`
- `needs_human_tie_break`
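The initial formula and the decision states can be combined in a small sketch. The feature scaling, the `tie_margin` value, and the simplification that every non-winning cluster member becomes `duplicate_of:<id>` (rather than being checked for `related_non_duplicate`) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanonFeatures:
    coverage: float
    centrality: float
    ci: float
    reviewer_score: float
    approvals: float
    priority_weight: float
    size_penalty: float


def canon_score(f: CanonFeatures) -> float:
    """Initial canonical formula from 11.2."""
    return (
        5.0 * f.coverage
        + 4.0 * f.centrality
        + 3.0 * f.ci
        + 2.0 * f.reviewer_score
        + 1.5 * f.approvals
        + 1.0 * f.priority_weight
        - 1.0 * f.size_penalty
    )


def decide(candidates: dict[str, CanonFeatures], tie_margin: float = 0.25) -> dict[str, str]:
    """Map each entity id in a cluster to a decision state."""
    if not candidates:
        return {}
    ranked = sorted(candidates, key=lambda eid: canon_score(candidates[eid]), reverse=True)
    if len(ranked) > 1:
        gap = canon_score(candidates[ranked[0]]) - canon_score(candidates[ranked[1]])
        if gap < tie_margin:
            # Too close to call automatically; defer the whole cluster to a human.
            return {eid: "needs_human_tie_break" for eid in ranked}
    winner = ranked[0]
    states = {eid: f"duplicate_of:{winner}" for eid in ranked[1:]}
    states[winner] = "canonical"
    return states
```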
## 12. Actioning and Routing
### Label Taxonomy (Baseline)
- `triage/canonical`
- `triage/duplicate`
- `triage/related`
- `triage/quarantine`
- `triage/noise-suppressed`
- `triage/ready-human`
### Action Policies
- Default safe mode: label + comment only.
- Optional strict mode: hide/quarantine routing.
- Auto-close disabled by default; enable only with explicit policy.
## 13. APIs
### Internal Service APIs
- `POST /events/ingest`
- `POST /entities/backfill`
- `GET /config/effective/{repo}`
- `GET /clusters/{id}`
- `GET /entities/{id}/similar`
- `POST /decisions/recompute`
- `GET /health`
### SDK/Hook APIs
- `register_hook(name, callback, timeout_ms)`
- `register_feature_extractor(name, extractor)`
- `register_connector(source|sink, impl)`
## 14. Storage and Indexing
- SQLite is the default relational backend for v1 (single-file local and hosted deployment baseline).
- Storage access must go through backend interfaces so PostgreSQL can replace SQLite with minimal application-layer changes.
- PostgreSQL is the first planned production replacement backend for scale-out and multi-writer deployments.
- Vector index (`pgvector`, Qdrant, or FAISS-backed sidecar).
- Signature stores for MinHash/SimHash keys.
- Blob storage for raw event snapshots in replay mode.
## 15. Reliability, Security, and Compliance
- Idempotency keys per event delivery.
- Replay-safe action deduplication.
- Signed webhook verification.
- Principle-of-least-privilege connector scopes.
- PII-safe logs and retention controls.
- Full decision traceability for maintainer audits.
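Idempotency keys and replay-safe deduplication could be backed by the default SQLite store, as in the sketch below. The `processed_events` table and the function names are hypothetical and not part of the table list in section 5. The handler is assumed to perform only work that is safe to retry if it raises.

```python
import sqlite3
from typing import Callable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (idempotency_key TEXT PRIMARY KEY)"
    )
    return conn


def process_once(
    conn: sqlite3.Connection, idempotency_key: str, handler: Callable[[], None]
) -> bool:
    """Run `handler` only for the first delivery of a key; return True if it ran."""
    # The connection context manager commits on success and rolls back on error,
    # so a failed handler leaves the key unrecorded and the delivery can be retried.
    with conn:
        cursor = conn.execute(
            "INSERT OR IGNORE INTO processed_events (idempotency_key) VALUES (?)",
            (idempotency_key,),
        )
        if cursor.rowcount == 0:
            return False  # duplicate delivery or replay
        handler()
    return True
```

Rolling back the key does not undo side effects the handler already caused outside the database, so sink actions would still need their own deduplication.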
## 16. Performance Targets
- 10k open entities processed in < 30 min batch refresh.
- Incremental event update P95 < 3s.
- Candidate retrieval P95 < 150ms for top-K query.
- Pair scoring throughput >= 2k pairs/sec per worker (baseline target).
## 17. Observability
- Metrics:
  - ingest lag
  - filter drop/suppress rates
  - candidate retrieval recall proxy
  - cluster churn
  - canonical flip rate
  - action failure rate
- Structured logs with correlation ids.
- Decision explanation payloads persisted.
## 18. Rollout Plan
1. Phase A: Offline POC
   - Snapshot ingest, clustering, canonical report output.
2. Phase B: Live Read-Only
   - Webhooks, decisions visible, no action writes.
3. Phase C: Live Assistive
   - Labels/comments + quarantine routing.
4. Phase D: Optimization
   - Learned pair scorer, active-learning thresholds, connector expansion.
## 19. Semantic Commit Strategy
- Use semantic commit prefixes during implementation:
  - `feat:`, `fix:`, `perf:`, `refactor:`, `docs:`, `test:`, `chore:`
- Keep commits scoped to single subsystem changes.
- Final cleanup pass:
  - Squash noisy interim commits only at a release boundary, if needed.
## 20. Open Technical Questions
- Which local embedding model meets target latency on available hardware?
- Should canonical formulas remain deterministic or adopt learned ranker in v2?
- Which vector backend is preferred for first production deployment?
- How aggressive should low-pass suppression be by default for new repos?
## 21. Implementation Standards
- Primary implementation language is Python.
- Use Pydantic `BaseModel` for all external payloads, connector DTOs, event envelopes, config schemas, and decision outputs.
- Enable strict validation defaults to catch schema drift early.
- Keep business logic separate from transport models for testability.
- Use typed model fixtures in tests to avoid unvalidated dict-based test inputs.