# Evaluation Strategy
Evaluation is a first-class architectural concern, not an afterthought. The system uses a four-tier eval taxonomy, ordered by reliability. Deterministic checks handle clinical correctness and safety. LLM-as-judge is scoped strictly to subjective quality dimensions where programmatic checks are insufficient.
## Why Not LLM-as-Judge for Everything
A naive approach would be: run all recommendations through Claude with a rubric, get a score, done. This is insufficient for a clinical system:
- **Circular.** Claude grading Claude's output biases toward its own reasoning patterns. It's more likely to approve outputs that match its style, not outputs that are clinically correct.
- **Non-deterministic.** The same patient case yields different scores across runs. Temperature, sampling, and prompt phrasing all affect the grade.
- **No clinical ground truth.** The LLM doesn't know the right treatment for a CHF patient. It knows what *sounds* right. Even for conditions covered by well-defined clinical guidelines, "sounds right" and "is right" can diverge.
- **Conflates dimensions.** A single 5-point scale mixes accuracy, safety, completeness, and readability into one number. A recommendation can be beautifully written and clinically wrong.
LLM-as-judge has a role, but it's Tier 4 — the last resort for dimensions that can't be checked any other way.
---
## Eval Taxonomy
### Tier 1: Deterministic / Structural Evals
No LLM involved. Fully reproducible. Run on every recommendation in production and during eval.
| Check | What It Validates | How |
|---|---|---|
| Citation grounding | Every `recommended_action` maps to ≥1 `evidence_table` row | JSON structural check — iterate actions, verify `evidence_refs` resolve to evidence rows |
| Schema compliance | Output conforms to recommendation JSON schema | Pydantic/jsonschema validation — all required fields present, correct types |
| Scope enforcement | No recommendations outside cardiopulmonary | Keyword + NER check against a cardiopulmonary term list; flag actions referencing non-scoped conditions |
| Contraindication coverage | Drug interaction checks were performed | Presence check — `contraindications_checked[]` is non-empty for recommendations involving medications |
| Safety guardrail recall | Red-flag cases triggered escalation | Synthetic edge case suite: inject troponin ↑ + chest pain, STEMI criteria, massive PE, K+ > 6.5, pH < 7.2 → verify `when_to_escalate` is populated and correct |
| Latency | Each pipeline stage meets target | Timer instrumentation — p50/p95 per stage (see [latency budget](README.md#latency-budget)) |
**Key property:** these evals are cheap, fast, and deterministic. They gate every recommendation before it reaches the user (not just during offline eval).
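As an illustration of how cheap these checks are, the citation grounding row above can be sketched in a few lines. This is a minimal sketch, not the project's implementation: it assumes the recommendation is a dict with a `recommended_actions` list and that each `evidence_table` row carries an `evidence_id` key, both of which are hypothetical names chosen for the example.

```python
from typing import Any


def check_citation_grounding(recommendation: dict[str, Any]) -> list[str]:
    """Return failure messages; an empty list means the check passed."""
    # Assumption: every evidence row has a unique "evidence_id" (hypothetical key).
    known_ids = {row["evidence_id"] for row in recommendation.get("evidence_table", [])}
    failures = []
    # Assumption: actions live under "recommended_actions" (hypothetical key).
    for i, action in enumerate(recommendation.get("recommended_actions", [])):
        refs = action.get("evidence_refs", [])
        if not refs:
            failures.append(f"action {i}: no evidence_refs")
        dangling = [ref for ref in refs if ref not in known_ids]
        if dangling:
            failures.append(f"action {i}: unresolved refs {dangling}")
    return failures


example = {
    "evidence_table": [{"evidence_id": "e1", "source_type": "guideline"}],
    "recommended_actions": [
        {"action": "Start loop diuretic", "evidence_refs": ["e1"]},
        {"action": "Order echocardiogram", "evidence_refs": ["e9"]},
        {"action": "Repeat BNP", "evidence_refs": []},
    ],
}
print(check_citation_grounding(example))
```

Returning a list of messages rather than a boolean keeps the failure reason available for logging when a recommendation is blocked.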
---
### Tier 2: Information Retrieval Evals
Standard IR metrics computed against curated relevance judgments. No LLM involved.
| Metric | Dataset | Target | How |
|---|---|---|---|
| Precision@k | 50 hand-curated cardiopulmonary queries with expected relevant documents | Establish baseline Month 3, improve via config learning | Standard IR precision — fraction of top-k results that are relevant |
| Recall@k (per source) | Same query set, with expected documents per source type | Guidelines surfaced in ≥90% of treatment queries | Fraction of expected relevant documents found in top-k, broken out by source (notes, imaging, guidelines, drugs) |
| NDCG@20 | Same query set with graded relevance judgments (0-3 scale) | Establish baseline Month 3 | Normalized discounted cumulative gain — are the most relevant documents ranked highest? |
| Guideline citation rate | Treatment queries from golden set | ≥2 guideline citations per treatment query | Threshold check: count `evidence_table` rows where `source_type == "guideline"` and require ≥2 per query |
| Embedding cluster quality | CXR14 pathology labels | Intra-class similarity > inter-class similarity | Cosine similarity distributions — cardiomegaly vectors should cluster separately from pneumonia vectors |
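The embedding cluster quality row can be checked with plain cosine similarity. The sketch below assumes embeddings are available as lists of floats with one pathology label each; the function names and the toy vectors are illustrative only.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def intra_inter_similarity(vectors, labels):
    """Mean similarity over same-label pairs and over different-label pairs."""
    intra, inter = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        (intra if labels[i] == labels[j] else inter).append(sim)
    return sum(intra) / len(intra), sum(inter) / len(inter)


# Toy 2-D vectors standing in for real image embeddings.
vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
labels = ["cardiomegaly", "cardiomegaly", "pneumonia", "pneumonia"]
intra, inter = intra_inter_similarity(vectors, labels)
print(f"intra={intra:.3f} inter={inter:.3f} pass={intra > inter}")
```

The pairwise loop is quadratic in the number of vectors, so on a large image set this would likely be run on a sample per class.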
**Curated query set construction:**
- 50 queries spanning all in-scope conditions
- Each query paired with expected relevant documents (doc_id + chunk_id)
- Relevance grades: 0 (irrelevant), 1 (marginally relevant), 2 (relevant), 3 (highly relevant)
- Stored in `data/eval/retrieval_judgments.jsonl`
- Updated when new guideline documents are added or data pipeline changes
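The three ranking metrics above are short enough to compute without a library. The following is a sketch under two assumptions that the source does not state: documents are identified by a single string key (for example `doc_id` and `chunk_id` joined), and a grade of 2 or higher counts as "relevant" for the binary metrics. NDCG here uses linear gain with a log2 discount.

```python
import math


def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant (k must be positive)."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the expected relevant documents found in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked, grades, k):
    """NDCG with linear gain; grades maps doc key -> 0..3, unjudged docs count as 0."""
    def dcg(gains):
        return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))

    actual = dcg([grades.get(doc, 0) for doc in ranked[:k]])
    ideal = dcg(sorted(grades.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Illustrative judgments for one query (made-up keys, not real corpus data).
grades = {"d1": 3, "d2": 2, "d3": 1, "d4": 0}
relevant = {doc for doc, grade in grades.items() if grade >= 2}  # assumed threshold
ranked = ["d2", "d4", "d1", "d5"]

print(precision_at_k(ranked, relevant, 2))  # 0.5
print(recall_at_k(ranked, relevant, 2))     # 0.5
print(round(ndcg_at_k(ranked, grades, 3), 3))
```

Per-source recall can reuse `recall_at_k` by restricting `relevant` to the expected documents of one source type at a time.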
---
### Tier 3: Guideline Adherence Evals
The key insight: for well-defined cardiopulmonary conditions, official clinical guidelines provide deterministic decision trees. These are **more reliable than LLM-as-judge** for clinical correctness because they check against established medical standards, not against what an LLM thinks sounds right.
**Implementation:** a rule-based checker with `condition → expected_actions` mappings derived directly from guidelines.
| Condition + Context | Expected Action | Guideline Source |
|---|---|---|
| CHF + elevated BNP + volume overload | Recommend loop diuretics | AHA/ACC HF 2022, Class I |
| CHF + EF ≤ 40% | Recommend ACE inhibitor/ARB + beta-blocker | AHA/ACC HF 2022, Class I |
| CAP + CURB-65 score calculation | Appropriate antibiotic class for severity | ATS/IDSA CAP 2019 |
| PE + hemodynamic instability | Anticoagulation + escalation to thrombolytics | ESC PE 2019 |
| PE + hemodynamically stable | Anticoagulation, risk stratification | ESC PE 2019 |
| COPD exacerbation | Short-acting bronchodilators + systemic corticosteroids | GOLD 2026 |
| COPD + frequent exacerbations | Long-acting bronchodilator maintenance | GOLD 2026 |
| STEMI criteria | Immediate cath lab recommendation | AHA/ACC ACS 2025, hard-coded rule |
| Pleural effusion + large volume | Thoracentesis consideration | BTS Pleural 2023 |
| Incidental lung nodule + size criteria | Follow-up imaging per Fleischner | Fleischner 2017 |
| Troponin ↑ + chest pain | Emergent pathway, call attending/ED | Safety rule + AHA/ACC ACS 2025 |
| Critical K+ (> 6.5) | Immediate treatment, cardiac monitoring | Safety rule |
This is not trying to encode all of medicine. It covers 10-15 high-confidence, Class I/strong recommendations within the cardiopulmonary scope — the cases where there is a clearly "right" answer per guidelines.
**Scoring:** per-case binary (did the recommendation include the expected action?) aggregated to a percentage across the golden set. Target: ≥90% on the golden set.
**Why this works better than LLM-as-judge for correctness:** an LLM might grade a recommendation as "4/5 — good but could be more thorough." The guideline adherence check asks a simpler, more useful question: "Did the system recommend diuretics for a CHF patient with volume overload? Yes or no."
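A rule-based checker of this kind might look like the sketch below. It assumes each golden case has already been reduced to a set of context tags and each recommendation to a set of normalized action labels; those tags, labels, and the `Rule` type are hypothetical, and only three rules from the table are shown. It also treats a case as passing only when every applicable rule is satisfied, which is one possible reading of "per-case binary".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    required_context: frozenset  # context tags that must all be present
    expected_action: str         # normalized action label expected in the output
    source: str                  # guideline the rule was derived from


# A subset of the table above; tag and label spellings are illustrative.
RULES = [
    Rule(frozenset({"chf", "elevated_bnp", "volume_overload"}), "loop_diuretic",
         "AHA/ACC HF 2022, Class I"),
    Rule(frozenset({"copd_exacerbation"}), "short_acting_bronchodilator", "GOLD 2026"),
    Rule(frozenset({"copd_exacerbation"}), "systemic_corticosteroid", "GOLD 2026"),
]


def check_case(context_tags, action_labels):
    """Return (rule, passed) for every rule whose context applies to the case."""
    return [(rule, rule.expected_action in action_labels)
            for rule in RULES if rule.required_context <= context_tags]


def adherence_rate(cases):
    """Share of cases where all applicable rules passed; None if no rule applied."""
    scored = []
    for context_tags, action_labels in cases:
        results = check_case(context_tags, action_labels)
        if results:  # cases that no rule applies to are not scored
            scored.append(all(passed for _, passed in results))
    return sum(scored) / len(scored) if scored else None


cases = [
    ({"chf", "elevated_bnp", "volume_overload"}, {"loop_diuretic", "daily_weights"}),
    ({"copd_exacerbation"}, {"short_acting_bronchodilator"}),  # corticosteroid missing
]
print(adherence_rate(cases))  # 0.5
```

Mapping free-text recommendations onto action labels is the hard part and is deliberately left out of this sketch.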
---
### Tier 4: LLM-as-Judge (Scoped)
LLM-as-judge is appropriate for dimensions that are subjective and hard to check programmatically. It is NOT used for clinical correctness, citation accuracy, or safety — those are handled by Tiers 1-3.
**What Tier 4 evaluates:**
| Dimension | Question | Why LLM Is Needed |
|---|---|---|
| Coherence | Is the recommendation well-organized and readable? | Text quality is inherently subjective |
| Completeness | Did the system address all relevant aspects of the case? | Requires understanding the full clinical picture |
| Uncertainty calibration | Is the stated uncertainty level reasonable given the evidence? | Requires judgment about what's ambiguous |
| Differential consideration | Were alternative diagnoses mentioned when appropriate? | Requires clinical reasoning about what else could be going on |
**Mitigations for LLM-as-judge limitations:**
| Problem | Mitigation |
|---|---|
| Circular (Claude judging Claude) | Use a different model as judge. Sonnet judges Haiku output. For Sonnet output, use a separate evaluation prompt with no access to the system prompt. |
| Non-deterministic | Run each evaluation 3x, take majority vote. Log variance across runs. |
| Single conflated score | Grade each dimension independently (1-5 per dimension). Never aggregate into a single number. |
| Prompt sensitivity | Standardize eval prompts, version-control them, test for stability before adopting. |
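The "3x, majority vote, log variance" mitigation can be sketched as follows. One detail is an assumption added here: with three runs on a 1-5 scale all three scores can differ, so the sketch falls back to the median when no score has a strict majority. Dimension keys are illustrative spellings of the four dimensions above.

```python
from collections import Counter
from statistics import median, pvariance

DIMENSIONS = ("coherence", "completeness", "uncertainty_calibration",
              "differential_consideration")


def aggregate_runs(runs):
    """runs: one dict per judge run, mapping dimension -> integer score (1-5)."""
    summary = {}
    for dim in DIMENSIONS:
        scores = [run[dim] for run in runs]
        value, count = Counter(scores).most_common(1)[0]
        if count <= len(scores) / 2:
            # No strict majority (e.g. 2, 3, 4): fall back to the median (assumption).
            value = median(scores)
        # Variance is kept per dimension so unstable judgments are visible.
        summary[dim] = {"score": value, "variance": pvariance(scores)}
    return summary  # dimensions stay separate; no overall score is computed


runs = [
    {"coherence": 4, "completeness": 3, "uncertainty_calibration": 2,
     "differential_consideration": 5},
    {"coherence": 4, "completeness": 3, "uncertainty_calibration": 3,
     "differential_consideration": 5},
    {"coherence": 5, "completeness": 3, "uncertainty_calibration": 4,
     "differential_consideration": 5},
]
for dim, result in aggregate_runs(runs).items():
    print(dim, result)
```

The per-dimension variance is the kind of value that could be stored in the `metadata` column of `eval_results`.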
**Tier 4 does NOT gate config updates.** It is tracked for trending and insight, but config learning safety gates use only Tier 1-3 metrics (see [Config Learning](config-learning.md)).
---
## Golden Set
100 patient cases, stratified by condition and complexity:
| Condition | Cases | Includes |
|---|---|---|
| Pneumonia | 15 | Varying severity (mild outpatient to severe ICU), mixed pathogens |
| CHF | 15 | HFrEF and HFpEF, acute decompensation, chronic management |
| COPD | 15 | Exacerbation, stable chronic, overlap with asthma |
| PE | 15 | Hemodynamically stable, unstable, sub-massive |
| MI | 15 | STEMI, NSTEMI, varying troponin trajectories |
| Multi-system | 15 | CHF + pneumonia, COPD + PE, MI + arrhythmia |
| Adversarial / edge | 10 | Conflicting evidence, ambiguous labs, conditions at the boundary of scope |
**Each case includes:**
- Patient demographics, conditions, medications
- Lab values (with abnormal flags)
- CXR classifications (simulated CheXNet output)
- Clinical notes
- **Expected outputs:**
  - Expected problems identified
  - Expected recommended actions (from Tier 3 guideline adherence rules)
  - Expected safety escalations (if applicable)
  - Expected evidence source types that should be cited
  - Expected severity level
**Storage:** `data/eval/golden_set/` — version-controlled, JSONL format, one case per line.
**Maintenance:** reviewed and updated when:
- New guideline documents are added to the corpus
- Guidelines are updated (new edition)
- A condition is added to clinical scope
- A systematic eval failure reveals a gap in the golden set
---
## Held-Out Test Set
- **Size:** 20% of Synthea patients per tenant (~1,000 patients per hospital)
- **Isolation:** patient IDs are excluded from the feedback loop. They exist in the same DB and FAISS indices (retrieval should work normally), but their recommendations are never included in config learning feedback aggregation.
- **Purpose:** evaluate config changes on patients the system hasn't been "trained" on via feedback
- **Stability:** the held-out set is fixed at data generation time and does not change unless Synthea data is regenerated
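One way to get a stable, reproducible 20% split is to hash the patient ID, so membership never depends on row order or a stored random seed. The source does not say how the split is actually produced, so this is only a sketch: the hashing scheme, function names, and the `patient_id` field on feedback rows are all assumptions.

```python
import hashlib


def is_held_out(tenant_id: str, patient_id: str, fraction: float = 0.20) -> bool:
    """Deterministically assign a patient to the held-out set via a stable hash."""
    digest = hashlib.sha256(f"{tenant_id}:{patient_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < fraction


def filter_feedback(feedback_rows, tenant_id):
    """Drop feedback tied to held-out patients before config-learning aggregation."""
    return [row for row in feedback_rows
            if not is_held_out(tenant_id, row["patient_id"])]


# Synthetic IDs for illustration only.
ids = [f"patient-{i}" for i in range(5000)]
share = sum(is_held_out("hospital_a", pid) for pid in ids) / len(ids)
print(f"held-out share: {share:.3f}")

feedback = [{"patient_id": pid, "rating": 1} for pid in ids[:100]]
print(len(filter_feedback(feedback, "hospital_a")), "feedback rows kept")
```

Held-out patients stay in the database and indices; only the feedback aggregation step applies the filter.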
---
## Eval Pipeline
### Nightly Eval (Tier 1-3)
```
Airflow DAG: eval_nightly
│
├─ Tier 1: Run structural checks on last 24h of recommendations
│   └─ Citation grounding, schema compliance, scope, contraindications
│
├─ Tier 2: Run retrieval evals on curated query set
│   └─ Precision@k, Recall@k, NDCG@20, guideline citation rate
│
├─ Tier 3: Run guideline adherence on golden set
│   └─ Generate recommendations for all 100 golden set cases
│   └─ Compare against expected actions
│
├─ Store results in eval_results table
│
└─ Regression check: compare against baseline
    ├─ All metrics ≥ baseline? → Pass (green)
    └─ Any metric < baseline? → Alert (Slack/email)
```
### Weekly Eval (Tier 4)
```
Airflow DAG: eval_weekly_llm
│
├─ Tier 4: Run LLM-as-judge on golden set recommendations
│   └─ 3x per case, 4 dimensions each
│   └─ Majority vote per dimension
│
├─ Store results in eval_results table
│
└─ Track trends (no gating)
```
### Eval Results Schema
```sql
CREATE TABLE eval_results (
    id              SERIAL PRIMARY KEY,
    run_id          UUID NOT NULL,
    run_type        VARCHAR(20) NOT NULL,   -- 'nightly', 'weekly', 'config_gate'
    timestamp       TIMESTAMPTZ NOT NULL,
    tenant_id       VARCHAR(50) NOT NULL,
    condition       VARCHAR(50),            -- NULL for aggregate metrics
    tier            INTEGER NOT NULL,       -- 1, 2, 3, or 4
    metric_name     VARCHAR(100) NOT NULL,
    metric_value    FLOAT NOT NULL,
    baseline_id     UUID,                   -- reference to baseline snapshot
    passed_baseline BOOLEAN,
    metadata        JSONB                   -- additional context (e.g., LLM judge variance)
);
```
### Config Update Gating
When a config update is proposed (via [config learning](config-learning.md)):
1. Run full Tier 1-3 eval on held-out test set with the proposed config
2. Compare every metric against the baseline
3. Per-condition breakdown: check each condition separately
4. **Gate criteria:**
   - Tier 1: ALL structural metrics must be ≥ baseline
   - Tier 2: ALL retrieval metrics must be ≥ baseline
   - Tier 3: guideline adherence must not regress on ANY condition
   - Tier 4: tracked but does not gate
5. If any gating metric regresses → block the update, alert, log the failed metrics
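The gate logic in the steps above reduces to a per-metric, per-condition comparison. The sketch below assumes metrics are keyed by `(tier, condition, metric_name)` and that higher is always better; a latency metric would need its comparison inverted. The function name and the numbers are made up for illustration.

```python
def evaluate_gate(baseline, candidate, gating_tiers=(1, 2, 3)):
    """Compare candidate metrics to baseline; keys are (tier, condition, metric_name)."""
    regressions = []
    for key, base_value in baseline.items():
        tier, condition, metric = key
        if tier not in gating_tiers:
            continue  # Tier 4 is tracked but does not gate
        cand_value = candidate.get(key)
        # A missing metric is treated as a regression so gaps cannot pass silently.
        if cand_value is None or cand_value < base_value:
            regressions.append({"tier": tier, "condition": condition, "metric": metric,
                                "baseline": base_value, "candidate": cand_value})
    return {"approved": not regressions, "regressions": regressions}


# Illustrative values only, not measured results.
baseline = {
    (1, None, "citation_grounding"): 1.0,
    (2, None, "ndcg_at_20"): 0.71,
    (3, "chf", "guideline_adherence"): 0.93,
    (3, "pe", "guideline_adherence"): 0.90,
    (4, None, "coherence"): 4.2,
}
candidate = dict(baseline)
candidate[(2, None, "ndcg_at_20")] = 0.74           # improved
candidate[(3, "pe", "guideline_adherence")] = 0.87  # regressed on one condition
candidate[(4, None, "coherence")] = 3.9             # lower, but Tier 4 does not gate

decision = evaluate_gate(baseline, candidate)
print(decision["approved"], decision["regressions"])
```

Keying on condition means an overall improvement cannot hide a regression on a single condition, which is the point of the per-condition breakdown.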
---
## Baseline Management
- **Initial baseline:** established after Month 3, when the full pipeline (retrieval + reasoning + safety) first runs end-to-end
- **Baseline snapshot:** all metric values on the held-out set BEFORE any config learning, stored as a named snapshot in `eval_results` with a `baseline_id`
- **Baseline updates:** a new baseline is set only when a config update passes AND the team explicitly promotes the new metrics as the baseline. This prevents baseline drift.
- **Per-tenant baselines:** each tenant has its own baseline (different patient populations → different metric distributions)