Evaluation and Testing LLM Systems

> Testing LLMs is not like unit testing software. There's no single correct output, no simple assertion to make, and behavior drifts without any code change. Here's how to build a testing culture that actually catches real problems.

NirajKulkarnii

May 2, 2026

0 upvotes

0 downloads

0 views

ai llm eval

View source

# Evaluation and Testing LLM Systems > Testing LLMs is not like unit testing software. There's no single correct output, no simple assertion to make, and behavior drifts without any code change. Here's how to build a testing culture that actually catches real problems. --- ## Why LLM Testing is Hard In traditional software, a test is: ```python assert process("input") == "expected_output" ``` With LLMs, outputs are stochastic, subjective, and context-dependent. "Did the model respond helpfully?" is not a binary question. "Did the model refuse correctly?" depends on who you ask. This isn't a reason to abandon rigorous testing — it's a reason to build a more sophisticated testing framework. --- ## The Evaluation Stack Think of your eval system in layers: ``` Layer 4: Human evaluation (ground truth, expensive, slow) ↑ validates Layer 3: LLM-as-judge (scalable, fast, cheaper) ↑ calibrated against Layer 2: Reference-based metrics (BLEU, ROUGE, exact match) ↑ supplemented by Layer 1: Automated assertion-based tests (fastest, most brittle) ``` Each layer is less reliable and more scalable as you go down. Use all four together. --- ## Layer 1: Automated Assertions The fastest and cheapest tests. Not sufficient alone, but essential for catching regressions on concrete behaviors. **What works:** - Format checks: is the output valid JSON? Does it contain required fields? - Structural checks: does the response cite at least one source? - Negative assertions: does the response NOT contain a banned phrase? - Length checks: is the output within expected length bounds? ```python def test_response_format(response): # Must be valid JSON data = json.loads(response) # Must have required fields assert "answer" in data assert "sources" in data assert isinstance(data["sources"], list) # Answer must be within length bounds assert 10 < len(data["answer"]) < 2000 # Must not contain obvious failure signals assert "I cannot" not in data["answer"] assert "As an AI language model" not in data["answer"] def test_no_pii_leakage(response, input_data): # Any PII in the input must not appear in the output for entity in extract_pii(input_data): assert entity not in response ``` **Anti-pattern:** Over-specifying assertions. If your test checks for exact phrasing, you'll have hundreds of spurious failures every time the model is updated. --- ## Layer 2: Reference-Based Metrics **When to use:** Tasks with a well-defined "correct" answer. Good for extraction, classification, structured generation. Bad for open-ended generation. **BLEU / ROUGE:** N-gram overlap metrics. Useful for summarization tasks where you have human-written reference summaries. BLEU has known problems (doesn't handle paraphrasing, rewards exact matches). Use as a sanity check, not as your primary signal. **Exact match / F1:** Good for extraction tasks. "Does the extracted date match the reference date?" **BERTScore:** Semantic similarity between output and reference using contextual embeddings. Handles paraphrasing better than BLEU. Use for tasks where meaning matters more than exact wording. **The calibration requirement:** Any automated metric must be calibrated against human judgments on your specific task. A metric that correlates poorly with human judgment for your use case is worse than no metric. --- ## Layer 3: LLM-as-Judge Use a capable model to evaluate your production model's outputs. This scales evaluation to a level human labeling cannot match. ### The Basic Pattern ```python EVALUATION_PROMPT = """ You are an evaluator. Rate the following response on a scale of 1-5. Task: {task_description} User question: {question} Model response: {response} Evaluate on: - Accuracy: Is the information correct? (1=wrong, 5=fully correct) - Helpfulness: Does it address the user's need? (1=not helpful, 5=fully helpful) - Safety: Is the response appropriate? (1=harmful, 5=fully safe) Return JSON: {{"accuracy": N, "helpfulness": N, "safety": N, "reasoning": "..."}} """ def evaluate(question, response, task_description): prompt = EVALUATION_PROMPT.format( task_description=task_description, question=question, response=response ) return json.loads(judge_model.complete(prompt)) ``` ### LLM-as-Judge Pitfalls **Verbosity bias:** LLM judges tend to score longer responses higher, independent of quality. Mitigate by explicitly instructing the judge to penalize unnecessary verbosity. **Position bias:** When comparing two responses (A vs B), LLM judges tend to prefer the first one. Always run both orderings (A vs B and B vs A) and aggregate. **Self-serving bias:** The same model family tends to rate its own outputs higher. Don't use the same model as both the production model and the judge. **Lack of factual knowledge:** LLM judges cannot verify facts they weren't trained on. For factually intensive tasks, supplement LLM-as-judge with retrieval-augmented evaluation (give the judge access to the ground truth document). ### Calibrating Your Judge Before relying on LLM-as-judge at scale: 1. Sample 200 responses that span quality levels. 2. Have humans rate them. 3. Compare judge scores to human scores. 4. If correlation > 0.7, the judge is useful. If not, refine the evaluation prompt or use a different judge model. 5. Re-calibrate monthly or after any model update. --- ## Layer 4: Human Evaluation The ground truth layer. Expensive and slow, but required to calibrate everything else. ### Efficient Human Eval Design **Side-by-side comparison beats absolute rating.** "Which response is better: A or B?" is far more reliable than "Rate this response 1-5." Humans are much better at relative judgments than absolute ones. **Use domain experts for domain-specific evals.** A medical QA system should be evaluated by clinicians, not by general-purpose annotators. **Minimal task design.** Each evaluator should answer one question per sample, not fill out a 20-field rubric. Evaluator fatigue degrades quality fast. **Inter-annotator agreement.** Always have multiple annotators for the same samples. If your Cohen's kappa is below 0.6, your evaluation task is too ambiguous. Clarify the rubric. --- ## Regression Testing The highest-leverage testing practice. Every time you change your system (model update, prompt change, RAG update), run your eval suite. ### Building the Test Set **Golden set:** A curated set of 100–500 (question, ideal response) pairs, spanning your use case distribution. These are your smoke tests. **Behavioral tests:** Specific scenarios that test known edge cases and failure modes: - Ambiguous queries - Requests at the boundary of your system's scope - Adversarial inputs - Rare but important topics **Regression set:** Every time you find a real production failure, add it to this set. It grows over time and prevents re-introduction of fixed bugs. ### What to Track Over Time For each model or prompt version: | Metric | How to Measure | |--------|----------------| | Accuracy on golden set | LLM judge + human spot check | | Format validity rate | Automated assertions | | Refusal rate on benign queries | Human-labeled sample | | Refusal rate on policy-violating queries | Adversarial test set | | Latency p50 / p95 | Infrastructure metrics | | Cost per request | Token logging | A model update that improves accuracy by 3% but increases refusal rate on benign queries by 10% is a net negative. Track all dimensions. --- ## Evals for RAG Systems RAG evaluation is more complex because there are more components to measure: **Retrieval quality:** - Recall@k: Are the relevant documents being retrieved? - Precision@k: Are the retrieved documents actually relevant? **Augmentation quality:** - Context relevance: Is the retrieved context relevant to the question? - Context utilization: Does the model actually use the retrieved context? **Generation quality:** - Faithfulness: Are the model's claims supported by the retrieved context? - Answer relevance: Does the answer address the question? - Groundedness: Are citations accurate? The RAGAS framework provides automated metrics for these dimensions. It's worth implementing or adapting. --- ## A/B Testing LLM Changes Running A/B tests on LLM changes requires careful design because: - LLM quality is hard to measure automatically (see above) - Users may perceive quality changes that your metrics miss - Some effects (trust erosion) take time to manifest ### What to Measure in A/B Tests Beyond standard product metrics (engagement, retention), add: - **Regeneration rate:** Users who click "regenerate" are signaling dissatisfaction. - **Copy rate:** Users who copy the response likely found it useful. - **Follow-up question rate:** Highly correlated with response incompleteness. - **Explicit ratings:** Thumbs up/down if you have them. ### Traffic Allocation For risky changes (prompt rewrites, model upgrades), start at 1–5% traffic before scaling. LLM failures can be subtle and non-obvious from metrics alone — you want a human to review samples from both groups before concluding the test. --- ## The Eval-First Development Process The best teams build evaluations before they build the feature: 1. **Define what "good" looks like.** Write examples of ideal outputs for your use case. 2. **Build your eval set.** Create a test set that covers the distribution of real queries. 3. **Establish a baseline.** Measure your current system against the eval set. 4. **Iterate.** Make changes. Re-run evals. Ship if metrics improve, don't regress. This sounds obvious. Most teams skip steps 1–3 and iterate in the dark. --- ## Red-Teaming Red-teaming is adversarial testing: trying to make your system fail in ways that real attackers or edge-case users might try. ### Red-Team Test Categories - **Jailbreaks:** Attempts to get the model to violate its guidelines via clever framing. - **Prompt injection:** Attempting to override system prompt via user input or retrieved content. - **Information extraction:** Attempting to get the model to reveal system prompts, user data, or training data. - **Edge cases:** Queries at the boundary of the system's capabilities (multilingual, highly technical, ambiguous intent). - **Social engineering:** Gradual escalation, building false context, roleplay framing. ### Red-Teaming Process 1. Hire or designate red-teamers who are incentivized to find failures. 2. Give them access to the system but not the full system prompt. 3. Set a quota: find N failures per session. 4. For each failure found, log the attack, the result, and the fix. 5. Add every found failure to your regression test set. 6. Re-run red-teaming after any major change. Red-teaming before every significant model update is not optional for production systems.

Related Documents

Evaluation Harness (Offline + Online)

/godmode:eval

🔬 Open Deep Research

EEG-Datasets