Evaluation Strategy

Evaluation is a first-class architectural concern, not an afterthought. The system uses a four-tier eval taxonomy, ordered by reliability. Deterministic checks handle clinical correctness and safety. LLM-as-judge is scoped strictly to subjective quality dimensions where programmatic checks are insufficient.

Why Not LLM-as-Judge for Everything

A naive approach would be: run all recommendations through Claude with a rubric, get a score, done. This is insufficient for a clinical system:

Circular. Claude grading Claude's output biases toward its own reasoning patterns. It is more likely to approve outputs that match its own style than outputs that are clinically correct.
Non-deterministic. The same patient case yields different scores across runs. Temperature, sampling, and prompt phrasing all affect the grade.
No clinical ground truth. The LLM doesn't know the right treatment for a CHF patient. It knows what sounds right. Even for conditions covered by well-defined clinical guidelines, "sounds right" and "is right" can diverge.
Conflates dimensions. A single 5-point scale mixes accuracy, safety, completeness, and readability into one number. A recommendation can be beautifully written and clinically wrong.

LLM-as-judge has a role, but it's Tier 4 — the last resort for dimensions that can't be checked any other way.

Eval Taxonomy

Tier 1: Deterministic / Structural Evals

No LLM involved. Fully reproducible. Run on every recommendation in production and during eval.

Check	What It Validates	How
Citation grounding	Every `recommended_action` maps to ≥1 `evidence_table` row	JSON structural check — iterate actions, verify `evidence_refs` resolve to evidence rows
Schema compliance	Output conforms to recommendation JSON schema	Pydantic/jsonschema validation — all required fields present, correct types
Scope enforcement	No recommendations outside cardiopulmonary	Keyword + NER check against a cardiopulmonary term list; flag actions referencing non-scoped conditions
Contraindication coverage	Drug interaction checks were performed	Presence check — `contraindications_checked[]` is non-empty for recommendations involving medications
Safety guardrail recall	Red-flag cases triggered escalation	Synthetic edge case suite: inject troponin ↑ + chest pain, STEMI criteria, massive PE, K+ > 6.5, pH < 7.2 → verify `when_to_escalate` is populated and correct
Latency	Each pipeline stage meets target	Timer instrumentation — p50/p95 per stage (see latency budget)

Key property: these evals are cheap, fast, and deterministic. They gate every recommendation before it reaches the user (not just during offline eval).
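As a rough illustration of the citation grounding and contraindication coverage checks, the sketch below assumes a recommendation is a plain dict with a `recommended_actions` list and an `evidence_table` whose rows carry an `id`. The plural key name, the `id` field, and the `involves_medication` flag are hypothetical; only `evidence_refs`, `evidence_table`, and `contraindications_checked` are named in this document.

```python
from typing import Any


def check_citation_grounding(rec: dict[str, Any]) -> list[str]:
    """Return failure messages; an empty list means every action is grounded."""
    # Assumption: each evidence row has an "id" that evidence_refs point at.
    evidence_ids = {row["id"] for row in rec.get("evidence_table", [])}
    failures = []
    for i, action in enumerate(rec.get("recommended_actions", [])):
        refs = action.get("evidence_refs", [])
        if not refs:
            failures.append(f"action {i}: no evidence_refs")
            continue
        dangling = [r for r in refs if r not in evidence_ids]
        if dangling:
            failures.append(f"action {i}: unresolved refs {dangling}")
    return failures


def check_contraindication_coverage(rec: dict[str, Any]) -> bool:
    """Pass unless a medication action exists and no checks were recorded."""
    # Assumption: actions flag medication involvement with a boolean field.
    has_medication = any(
        a.get("involves_medication") for a in rec.get("recommended_actions", [])
    )
    return (not has_medication) or bool(rec.get("contraindications_checked"))
```

Because both functions are pure and take only the recommendation payload, the same code could run in the production gate and in the offline eval without modification.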

Tier 2: Information Retrieval Evals

Standard IR metrics computed against curated relevance judgments. No LLM involved.

Metric	Dataset	Target	How
Precision@k	50 hand-curated cardiopulmonary queries with expected relevant documents	Establish baseline Month 3, improve via config learning	Standard IR precision — fraction of top-k results that are relevant
Recall@k (per source)	Same query set, with expected documents per source type	Guidelines surfaced in ≥90% of treatment queries	Fraction of expected relevant documents found in top-k, broken out by source (notes, imaging, guidelines, drugs)
NDCG@20	Same query set with graded relevance judgments (0-3 scale)	Establish baseline Month 3	Normalized discounted cumulative gain — are the most relevant documents ranked highest?
Guideline citation rate	Treatment queries from golden set	≥2 guideline citations per treatment query	Count check: count `evidence_table` rows where `source_type == "guideline"` and pass if the count is ≥2
Embedding cluster quality	CXR14 pathology labels	Intra-class similarity > inter-class similarity	Cosine similarity distributions — cardiomegaly vectors should cluster separately from pneumonia vectors

Curated query set construction:

50 queries spanning all in-scope conditions
Each query paired with expected relevant documents (doc_id + chunk_id)
Relevance grades: 0 (irrelevant), 1 (marginally relevant), 2 (relevant), 3 (highly relevant)
Stored in data/eval/retrieval_judgments.jsonl
Updated when new guideline documents are added or data pipeline changes
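The three ranking metrics in the table can be computed from the judgments file with a few lines of code. The sketch below is a minimal version, assuming documents are identified by a single string key and that NDCG uses linear gain (the raw 0-3 grade) rather than the exponential variant; this document does not specify which form is intended.

```python
import math


def precision_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the expected relevant documents found in the top-k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked: list[str], grades: dict[str, int], k: int) -> float:
    """Normalized DCG with linear gain; unjudged documents count as grade 0."""

    def dcg(gains: list[int]) -> float:
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    actual = dcg([grades.get(doc, 0) for doc in ranked[:k]])
    ideal = dcg(sorted(grades.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0
```

For the per-source recall breakdown, the same `recall_at_k` call can be repeated with `relevant` restricted to the expected documents of one source type.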

Tier 3: Guideline Adherence Evals

The key insight: for well-defined cardiopulmonary conditions, official clinical guidelines provide deterministic decision trees. These are more reliable than LLM-as-judge for clinical correctness because they check against established medical standards, not against what an LLM thinks sounds right.

Implementation: a rule-based checker with condition → expected_actions mappings derived directly from guidelines.

Condition + Context	Expected Action	Guideline Source
CHF + elevated BNP + volume overload	Recommend loop diuretics	AHA/ACC HF 2022, Class I
CHF + EF ≤ 40%	Recommend ACE inhibitor/ARB + beta-blocker	AHA/ACC HF 2022, Class I
CAP + CURB-65 score calculation	Appropriate antibiotic class for severity	ATS/IDSA CAP 2019
PE + hemodynamic instability	Anticoagulation + escalation to thrombolytics	ESC PE 2019
PE + hemodynamically stable	Anticoagulation, risk stratification	ESC PE 2019
COPD exacerbation	Short-acting bronchodilators + systemic corticosteroids	GOLD 2026
COPD + frequent exacerbations	Long-acting bronchodilator maintenance	GOLD 2026
STEMI criteria	Immediate cath lab recommendation	AHA/ACC ACS 2025, hard-coded rule
Pleural effusion + large volume	Thoracentesis consideration	BTS Pleural 2023
Incidental lung nodule + size criteria	Follow-up imaging per Fleischner	Fleischner 2017
Troponin ↑ + chest pain	Emergent pathway, call attending/ED	Safety rule + AHA/ACC ACS 2025
Critical K+ (> 6.5)	Immediate treatment, cardiac monitoring	Safety rule

This is not trying to encode all of medicine. It covers 10-15 high-confidence, Class I/strong recommendations within the cardiopulmonary scope — the cases where there is a clearly "right" answer per guidelines.

Scoring: per-case binary (did the recommendation include the expected action?) aggregated to a percentage across the golden set. Target: ≥90% on the golden set.
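One way the condition → expected_actions mapping and the binary scoring could look is sketched below. It assumes case context and recommended actions have already been normalized into short tags; the tag names, the `GuidelineRule` class, and the two sample rules are illustrative, not the system's actual rule set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuidelineRule:
    name: str
    required_findings: frozenset  # context tags that trigger the rule
    expected_actions: frozenset   # action tags the recommendation must include
    source: str


# Two rows of the table above, encoded with hypothetical tag names.
RULES = [
    GuidelineRule(
        "chf_volume_overload",
        frozenset({"chf", "elevated_bnp", "volume_overload"}),
        frozenset({"loop_diuretic"}),
        "AHA/ACC HF 2022, Class I",
    ),
    GuidelineRule(
        "chf_reduced_ef",
        frozenset({"chf", "ef_le_40"}),
        frozenset({"acei_or_arb", "beta_blocker"}),
        "AHA/ACC HF 2022, Class I",
    ),
]


def case_passes(findings: set, actions: set, rules=RULES) -> bool:
    """Binary per-case score: every triggered rule has all its actions present."""
    triggered = [r for r in rules if r.required_findings <= findings]
    return all(r.expected_actions <= actions for r in triggered)


def adherence_rate(cases) -> float:
    """cases: iterable of (findings, actions) pairs; returns percent passing."""
    cases = list(cases)
    if not cases:
        return 0.0
    passed = sum(case_passes(f, a) for f, a in cases)
    return 100.0 * passed / len(cases)
```

Mapping free-text recommendations onto action tags is the hard part and is deliberately left out of this sketch.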

Why this works better than LLM-as-judge for correctness: an LLM might grade a recommendation as "4/5 — good but could be more thorough." The guideline adherence check asks a simpler, more useful question: "Did the system recommend diuretics for a CHF patient with volume overload? Yes or no."

Tier 4: LLM-as-Judge (Scoped)

LLM-as-judge is appropriate for dimensions that are subjective and hard to check programmatically. It is NOT used for clinical correctness, citation accuracy, or safety — those are handled by Tiers 1-3.

What Tier 4 evaluates:

Dimension	Question	Why LLM Is Needed
Coherence	Is the recommendation well-organized and readable?	Text quality is inherently subjective
Completeness	Did the system address all relevant aspects of the case?	Requires understanding the full clinical picture
Uncertainty calibration	Is the stated uncertainty level reasonable given the evidence?	Requires judgment about what's ambiguous
Differential consideration	Were alternative diagnoses mentioned when appropriate?	Requires clinical reasoning about what else could be going on

Mitigations for LLM-as-judge limitations:

Problem	Mitigation
Circular (Claude judging Claude)	Use a different model as judge. Sonnet judges Haiku output. For Sonnet output, use a separate evaluation prompt with no access to the system prompt.
Non-deterministic	Run each evaluation 3x, take majority vote. Log variance across runs.
Single conflated score	Grade each dimension independently (1-5 per dimension). Never aggregate into a single number.
Prompt sensitivity	Standardize eval prompts, version-control them, test for stability before adopting.

Tier 4 does NOT gate config updates. It is tracked for trending and insight, but config learning safety gates use only Tier 1-3 metrics (see Config Learning).
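The repeat-and-vote mitigation can be sketched as follows. The dimension keys are hypothetical names for the four dimensions listed above, and the median fallback for the case where all three runs disagree is an assumption, since no tie-break rule is defined here.

```python
from collections import Counter
from statistics import median, pvariance

# Hypothetical keys for the four Tier 4 dimensions.
DIMENSIONS = (
    "coherence",
    "completeness",
    "uncertainty_calibration",
    "differential_consideration",
)


def aggregate_dimension(scores: list[int]) -> dict:
    """Majority vote over repeated 1-5 scores, with variance kept for logging."""
    value, count = Counter(scores).most_common(1)[0]
    if count <= len(scores) / 2:
        # Assumption: no strict majority, so fall back to the median score.
        value = median(scores)
    return {"score": value, "variance": pvariance(scores), "runs": list(scores)}


def aggregate_case(runs: list[dict]) -> dict:
    """runs: one dict per judge run mapping dimension -> score.

    Dimensions stay separate; nothing is summed into a single number.
    """
    return {d: aggregate_dimension([r[d] for r in runs]) for d in DIMENSIONS}
```

The `variance` field is the kind of value that could be stored in the `metadata` column of `eval_results`.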

Golden Set

100 patient cases, stratified by condition and complexity:

Condition	Cases	Includes
Pneumonia	15	Varying severity (mild outpatient to severe ICU), mixed pathogens
CHF	15	HFrEF and HFpEF, acute decompensation, chronic management
COPD	15	Exacerbation, stable chronic, overlap with asthma
PE	15	Hemodynamically stable, unstable, sub-massive
MI	15	STEMI, NSTEMI, varying troponin trajectories
Multi-system	15	CHF + pneumonia, COPD + PE, MI + arrhythmia
Adversarial / edge	10	Conflicting evidence, ambiguous labs, conditions at the boundary of scope

Each case includes:

Patient demographics, conditions, medications
Lab values (with abnormal flags)
CXR classifications (simulated CheXNet output)
Clinical notes
Expected outputs:
- Expected problems identified
- Expected recommended actions (from Tier 3 guideline adherence rules)
- Expected safety escalations (if applicable)
- Expected evidence source types that should be cited
- Expected severity level

Storage: data/eval/golden_set/ — version-controlled, JSONL format, one case per line.
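A loader for the one-case-per-line format might look like the sketch below. The `expected` key and the names of its sub-fields are assumptions chosen to mirror the expected outputs listed above; safety escalations are treated as optional because they apply only to some cases.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator

# Hypothetical field names for the expected outputs of each case.
REQUIRED_EXPECTED_KEYS = {
    "problems",
    "recommended_actions",
    "evidence_source_types",
    "severity",
}


def parse_golden_cases(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSONL lines, skipping blanks and rejecting incomplete cases."""
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        case = json.loads(line)
        missing = REQUIRED_EXPECTED_KEYS - set(case.get("expected", {}))
        if missing:
            raise ValueError(f"line {line_no}: missing {sorted(missing)}")
        yield case


def load_golden_set(directory: str = "data/eval/golden_set") -> list[dict]:
    """Read every .jsonl file in the directory in a stable order."""
    cases: list[dict] = []
    for path in sorted(Path(directory).glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            cases.extend(parse_golden_cases(fh))
    return cases
```

Failing loudly on an incomplete case keeps a malformed golden set from silently lowering the Tier 3 score.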

Maintenance: reviewed and updated when:

New guideline documents are added to the corpus
Guidelines are updated (new edition)
A condition is added to clinical scope
A systematic eval failure reveals a gap in the golden set

Held-Out Test Set

Size: 20% of Synthea patients per tenant (~1,000 patients per hospital)
Isolation: held-out patient IDs are excluded from the feedback loop. These patients remain in the same DB and FAISS indices (so retrieval works normally), but their recommendations are never included in config learning feedback aggregation.
Purpose: evaluate config changes on patients the system hasn't been "trained" on via feedback
Stability: the held-out set is fixed at data generation time and does not change unless Synthea data is regenerated
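The isolation and stability properties can be illustrated with a small sketch. The seeded sampling in `choose_held_out` is only one possible way to fix the split at data generation time, and the `patient_id` field on feedback events is a hypothetical name.

```python
import random


def choose_held_out(patient_ids, fraction: float = 0.2, seed: int = 0) -> frozenset:
    """Pick a reproducible held-out subset; the same inputs give the same set."""
    ids = sorted(patient_ids)  # sort so the result does not depend on input order
    k = round(len(ids) * fraction)
    return frozenset(random.Random(seed).sample(ids, k))


def filter_feedback(feedback_events, held_out_ids) -> list[dict]:
    """Drop feedback from held-out patients before config learning aggregates it."""
    held_out = frozenset(held_out_ids)
    return [e for e in feedback_events if e["patient_id"] not in held_out]
```

Applying the filter at the aggregation boundary, rather than at retrieval time, matches the requirement that held-out patients stay in the DB and indices.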

Eval Pipeline

Nightly Eval (Tier 1-3)

Airflow DAG: eval_nightly
  │
  ├─ Tier 1: Run structural checks on last 24h of recommendations
  │   └─ Citation grounding, schema compliance, scope, contraindications
  │
  ├─ Tier 2: Run retrieval evals on curated query set
  │   └─ Precision@k, Recall@k, NDCG@20, guideline citation rate
  │
  ├─ Tier 3: Run guideline adherence on golden set
  │   └─ Generate recommendations for all 100 golden set cases
  │   └─ Compare against expected actions
  │
  ├─ Store results in eval_results table
  │
  └─ Regression check: compare against baseline
      ├─ All metrics ≥ baseline? → Pass (green)
      └─ Any metric < baseline? → Alert (Slack/email)

Weekly Eval (Tier 4)

Airflow DAG: eval_weekly_llm
  │
  ├─ Tier 4: Run LLM-as-judge on golden set recommendations
  │   └─ 3x per case, 4 dimensions each
  │   └─ Majority vote per dimension
  │
  ├─ Store results in eval_results table
  │
  └─ Track trends (no gating)

Eval Results Schema

CREATE TABLE eval_results (
    id SERIAL PRIMARY KEY,
    run_id UUID NOT NULL,
    run_type VARCHAR(20) NOT NULL,     -- 'nightly', 'weekly', 'config_gate'
    timestamp TIMESTAMPTZ NOT NULL,
    tenant_id VARCHAR(50) NOT NULL,
    condition VARCHAR(50),              -- NULL for aggregate metrics
    tier INTEGER NOT NULL,              -- 1, 2, 3, or 4
    metric_name VARCHAR(100) NOT NULL,
    metric_value FLOAT NOT NULL,
    baseline_id UUID,                   -- reference to baseline snapshot
    passed_baseline BOOLEAN,
    metadata JSONB                      -- additional context (e.g., LLM judge variance)
);

Config Update Gating

When a config update is proposed (via config learning):

Run full Tier 1-3 eval on held-out test set with the proposed config
Compare every metric against the baseline
Per-condition breakdown: check each condition separately
Gate criteria:
- Tier 1: ALL structural metrics must be ≥ baseline
- Tier 2: ALL retrieval metrics must be ≥ baseline
- Tier 3: guideline adherence must not regress on ANY condition
- Tier 4: tracked but does not gate
If any gating metric regresses → block the update, alert, log the failed metrics
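The gate criteria above reduce to a per-metric comparison. The sketch below assumes metrics are keyed by `(tier, condition, metric_name)` and that every gated metric is oriented so that higher is better; a metric such as latency would need to be negated or compared separately. The function name and return shape are illustrative.

```python
def evaluate_gate(candidate: dict, baseline: dict, gating_tiers=(1, 2, 3)) -> dict:
    """Compare a proposed config's metrics against the baseline snapshot.

    Keys are (tier, condition, metric_name); condition is None for aggregates.
    Tier 4 metrics are ignored because they are tracked but do not gate.
    """
    regressions = []
    for key, base_value in baseline.items():
        tier, condition, metric = key
        if tier not in gating_tiers:
            continue
        value = candidate.get(key)
        # A missing metric is treated as a regression rather than a pass.
        if value is None or value < base_value:
            regressions.append(
                {
                    "tier": tier,
                    "condition": condition,
                    "metric": metric,
                    "baseline": base_value,
                    "candidate": value,
                }
            )
    return {"approved": not regressions, "regressions": regressions}
```

Keeping the condition in the key means a drop on one condition blocks the update even when the aggregate improves.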

Baseline Management

Initial baseline: established after Month 3, when the full pipeline (retrieval + reasoning + safety) first runs end-to-end
Baseline snapshot: all metric values on the held-out set BEFORE any config learning, stored as a named snapshot in eval_results with a baseline_id
Baseline updates: a new baseline is set only when a config update passes AND the team explicitly promotes the new metrics as the baseline. This prevents baseline drift.
Per-tenant baselines: each tenant has its own baseline (different patient populations → different metric distributions)
