# Evaluation and Testing LLM Systems
> Testing LLMs is not like unit testing software. There's no single correct output, no simple assertion to make, and behavior drifts without any code change. Here's how to build a testing culture that actually catches real problems.
---
## Why LLM Testing is Hard
In traditional software, a test is:
```python
assert process("input") == "expected_output"
```
With LLMs, outputs are stochastic, subjective, and context-dependent. "Did the model respond helpfully?" is not a binary question. "Did the model refuse correctly?" depends on who you ask.
This isn't a reason to abandon rigorous testing — it's a reason to build a more sophisticated testing framework.
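Because outputs are stochastic, one way to keep an assertion-style check meaningful is to sample the same prompt several times and assert on the pass rate instead of on a single output. The sketch below assumes a hypothetical `generate` callable that wraps your model and a `check` predicate that encodes the behavior you care about; `fake_generate` is a deterministic stand-in so the example runs offline.

```python
from itertools import cycle
from typing import Callable


def pass_rate(
    generate: Callable[[str], str],
    check: Callable[[str], bool],
    prompt: str,
    n_samples: int = 20,
) -> float:
    """Return the fraction of n_samples generations that satisfy check."""
    passed = sum(check(generate(prompt)) for _ in range(n_samples))
    return passed / n_samples


# Stand-in for a real model call (assumption): cycles through canned outputs.
_canned = cycle(["Paris", "Paris.", "The capital is Paris", "Lyon"])


def fake_generate(prompt: str) -> str:
    return next(_canned)


rate = pass_rate(
    fake_generate,
    check=lambda out: "Paris" in out,
    prompt="What is the capital of France?",
    n_samples=20,
)
print(f"pass rate: {rate:.2f}")
```

In a real suite you would assert that the rate stays above a threshold chosen for the task, and treat a drop between versions as the regression signal.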
---
## The Evaluation Stack
Think of your eval system in layers:
```
Layer 4: Human evaluation (ground truth, expensive, slow)
↑ validates
Layer 3: LLM-as-judge (scalable, fast, cheaper)
↑ calibrated against
Layer 2: Reference-based metrics (BLEU, ROUGE, exact match)
↑ supplemented by
Layer 1: Automated assertion-based tests (fastest, most brittle)
```
Moving down the stack, each layer gets cheaper and more scalable but becomes a weaker proxy for real quality. Use all four together.
---
## Layer 1: Automated Assertions
The fastest and cheapest tests. Not sufficient alone, but essential for catching regressions on concrete behaviors.
**What works:**
- Format checks: is the output valid JSON? Does it contain required fields?
- Structural checks: does the response cite at least one source?
- Negative assertions: does the response NOT contain a banned phrase?
- Length checks: is the output within expected length bounds?
```python
import json


def test_response_format(response):
    # Must be valid JSON (json.loads raises on malformed output)
    data = json.loads(response)
    # Must have required fields
    assert "answer" in data
    assert "sources" in data
    assert isinstance(data["sources"], list)
    # Answer must be within length bounds
    assert 10 < len(data["answer"]) < 2000
    # Must not contain obvious failure signals
    assert "I cannot" not in data["answer"]
    assert "As an AI language model" not in data["answer"]


def test_no_pii_leakage(response, input_data):
    # Any PII in the input must not appear in the output.
    # extract_pii is an application-specific helper, not defined here.
    for entity in extract_pii(input_data):
        assert entity not in response
```
**Anti-pattern:** Over-specifying assertions. If your test checks for exact phrasing, you'll have hundreds of spurious failures every time the model is updated.
---
## Layer 2: Reference-Based Metrics
**When to use:** Tasks with a well-defined "correct" answer. Good for extraction, classification, structured generation. Bad for open-ended generation.
**BLEU / ROUGE:** N-gram overlap metrics. Useful for summarization tasks where you have human-written reference summaries. Both reward surface overlap with the reference, so a correct paraphrase can score poorly while a response that copies the reference wording scores well. Use them as a sanity check, not as your primary signal.
**Exact match / F1:** Good for extraction tasks. "Does the extracted date match the reference date?"
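A minimal sketch of both metrics, assuming whitespace tokenization and lowercasing as the only normalization (real extraction tasks usually need task-specific normalization, such as parsing dates into a canonical form). The function names are illustrative.

```python
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace. Deliberately simple."""
    return text.lower().split()


def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


print(exact_match("March 3 2021", "march 3 2021"))
print(round(token_f1("March 2021", "March 3 2021"), 2))
```

Exact match is strict and easy to interpret; token F1 gives partial credit when the extraction is close but incomplete.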
**BERTScore:** Semantic similarity between output and reference using contextual embeddings. Handles paraphrasing better than BLEU. Use for tasks where meaning matters more than exact wording.
**The calibration requirement:** Any automated metric must be calibrated against human judgments on your specific task. A metric that correlates poorly with human judgment for your use case is worse than no metric.
---
## Layer 3: LLM-as-Judge
Use a capable model to evaluate your production model's outputs. This scales evaluation to a level human labeling cannot match.
### The Basic Pattern
```python
import json

EVALUATION_PROMPT = """
You are an evaluator. Rate the following response on a scale of 1-5.
Task: {task_description}
User question: {question}
Model response: {response}
Evaluate on:
- Accuracy: Is the information correct? (1=wrong, 5=fully correct)
- Helpfulness: Does it address the user's need? (1=not helpful, 5=fully helpful)
- Safety: Is the response appropriate? (1=harmful, 5=fully safe)
Return JSON: {{"accuracy": N, "helpfulness": N, "safety": N, "reasoning": "..."}}
"""


def evaluate(question, response, task_description):
    # judge_model is a placeholder for your judge client; it is not defined here.
    prompt = EVALUATION_PROMPT.format(
        task_description=task_description,
        question=question,
        response=response,
    )
    # Raises json.JSONDecodeError if the judge does not return valid JSON.
    return json.loads(judge_model.complete(prompt))
```
### LLM-as-Judge Pitfalls
**Verbosity bias:** LLM judges tend to score longer responses higher, independent of quality. Mitigate by explicitly instructing the judge to penalize unnecessary verbosity.
**Position bias:** When comparing two responses (A vs B), LLM judges tend to prefer the first one. Always run both orderings (A vs B and B vs A) and aggregate.
**Self-serving bias:** The same model family tends to rate its own outputs higher. Don't use the same model as both the production model and the judge.
**Lack of factual knowledge:** LLM judges cannot verify facts they weren't trained on. For factually intensive tasks, supplement LLM-as-judge with retrieval-augmented evaluation (give the judge access to the ground truth document).
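The position-bias mitigation described above can be sketched as follows. This assumes a hypothetical `judge` callable that takes a question and two responses in display order and returns `"first"`, `"second"`, or `"tie"`; the stand-in judges at the bottom exist only to make the example runnable.

```python
from typing import Callable

# judge(question, first_response, second_response) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def compare_both_orders(judge: Judge, question: str, a: str, b: str) -> str:
    """Run the pairwise comparison in both orders and aggregate the votes."""
    forward = judge(question, a, b)   # A shown first
    backward = judge(question, b, a)  # B shown first
    a_wins = (forward == "first") + (backward == "second")
    b_wins = (forward == "second") + (backward == "first")
    if a_wins > b_wins:
        return "A"
    if b_wins > a_wins:
        return "B"
    return "tie"  # includes the case where the judge just follows position


# Stand-in judges (assumptions, for illustration only).
def always_first(question: str, first: str, second: str) -> str:
    return "first"


def prefers_longer(question: str, first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(compare_both_orders(always_first, "q", "short", "a longer answer"))
print(compare_both_orders(prefers_longer, "q", "short", "a longer answer"))
```

A judge that only follows position produces one vote for each side, so its verdict collapses to a tie instead of leaking into your results.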
### Calibrating Your Judge
Before relying on LLM-as-judge at scale:
1. Sample 200 responses that span quality levels.
2. Have humans rate them.
3. Compare judge scores to human scores.
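4. If the correlation between judge and human scores (for ordinal 1-5 ratings, a rank correlation such as Spearman's is a reasonable choice) exceeds 0.7, the judge is useful. If not, refine the evaluation prompt or use a different judge model.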
5. Re-calibrate monthly or after any model update.
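Step 3 of the list above can be sketched as a rank correlation between the two sets of scores. The choice of Spearman correlation is an assumption here (the threshold of 0.7 comes from the text, the statistic does not), and the scores are made-up illustration data. The implementation is pure Python and handles ties with average ranks.

```python
from statistics import mean


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    # Undefined (division by zero) if either input is constant.
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    return pearson(average_ranks(x), average_ranks(y))


human_scores = [1, 2, 3, 4, 5]  # illustrative data
judge_scores = [1, 3, 2, 5, 4]  # illustrative data
rho = spearman(human_scores, judge_scores)
print(f"judge vs. human rank correlation: {rho:.2f}")
```

With only five samples the estimate is far too noisy to act on; the 200-sample recommendation above is what makes the comparison meaningful.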
---
## Layer 4: Human Evaluation
The ground truth layer. Expensive and slow, but required to calibrate everything else.
### Efficient Human Eval Design
**Side-by-side comparison beats absolute rating.** "Which response is better: A or B?" is far more reliable than "Rate this response 1-5." Humans are much better at relative judgments than absolute ones.
**Use domain experts for domain-specific evals.** A medical QA system should be evaluated by clinicians, not by general-purpose annotators.
**Minimal task design.** Each evaluator should answer one question per sample, not fill out a 20-field rubric. Evaluator fatigue degrades quality fast.
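**Inter-annotator agreement.** Always have multiple annotators label the same samples. If agreement is low (for example, Cohen's kappa below 0.6 for a pair of annotators, or a multi-rater statistic such as Fleiss' kappa when there are more than two), your evaluation task is too ambiguous. Clarify the rubric.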
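For reference, Cohen's kappa for two annotators can be computed in a few lines. This is a sketch with made-up labels; it corrects the raw agreement rate for the agreement expected by chance given each annotator's label frequencies.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Probability that both pick the same label by chance.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative side-by-side preferences from two annotators.
annotator_1 = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "B"]
annotator_2 = ["A", "A", "B", "A", "A", "B", "B", "A", "B", "B"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Here the annotators agree on 8 of 10 items, but half of that agreement is expected by chance, so kappa lands at about 0.6 instead of 0.8.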
---
## Regression Testing
The highest-leverage testing practice. Every time you change your system (model update, prompt change, RAG update), run your eval suite.
### Building the Test Set
**Golden set:** A curated set of 100–500 (question, ideal response) pairs, spanning your use case distribution. These are your smoke tests.
**Behavioral tests:** Specific scenarios that test known edge cases and failure modes:
- Ambiguous queries
- Requests at the boundary of your system's scope
- Adversarial inputs
- Rare but important topics
**Regression set:** Every time you find a real production failure, add it to this set. It grows over time and prevents re-introduction of fixed bugs.
### What to Track Over Time
For each model or prompt version:
| Metric | How to Measure |
|--------|----------------|
| Accuracy on golden set | LLM judge + human spot check |
| Format validity rate | Automated assertions |
| Refusal rate on benign queries | Human-labeled sample |
| Refusal rate on policy-violating queries | Adversarial test set |
| Latency p50 / p95 | Infrastructure metrics |
| Cost per request | Token logging |
A model update that improves accuracy by 3% but increases the refusal rate on benign queries by 10% is usually a net negative. Track every dimension, not just the headline metric.
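One way to enforce this is a gate that compares a candidate version against the baseline on every tracked metric and reports anything that got worse by more than a tolerated amount. The metric names, tolerances, and numbers below are illustrative assumptions, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricRule:
    higher_is_better: bool
    max_regression: float  # tolerated worsening, in the metric's own units


# Hypothetical metric names and tolerances.
RULES = {
    "golden_accuracy": MetricRule(higher_is_better=True, max_regression=0.01),
    "format_validity_rate": MetricRule(higher_is_better=True, max_regression=0.0),
    "benign_refusal_rate": MetricRule(higher_is_better=False, max_regression=0.01),
    "latency_p95_ms": MetricRule(higher_is_better=False, max_regression=100.0),
}


def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    rules: dict[str, MetricRule] = RULES,
) -> list[str]:
    """Return the names of metrics that worsened beyond their tolerance."""
    regressions = []
    for name, rule in rules.items():
        delta = candidate[name] - baseline[name]
        worsening = -delta if rule.higher_is_better else delta
        if worsening > rule.max_regression:
            regressions.append(name)
    return regressions


baseline = {"golden_accuracy": 0.80, "format_validity_rate": 0.99,
            "benign_refusal_rate": 0.02, "latency_p95_ms": 900.0}
candidate = {"golden_accuracy": 0.83, "format_validity_rate": 0.99,
             "benign_refusal_rate": 0.12, "latency_p95_ms": 950.0}

print(find_regressions(baseline, candidate))
```

The candidate wins on accuracy, yet the gate still flags it because the benign refusal rate moved well past its tolerance.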
---
## Evals for RAG Systems
RAG evaluation is more complex because there are more components to measure:
**Retrieval quality:**
- Recall@k: Are the relevant documents being retrieved?
- Precision@k: Are the retrieved documents actually relevant?
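Both retrieval metrics are simple to compute once you have labeled relevant documents per query. A minimal sketch, assuming documents are identified by string IDs and that precision is divided by `k` even when fewer than `k` documents come back (conventions differ on that point):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(doc_id in relevant for doc_id in retrieved[:k])
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # no labeled relevant documents for this query
    hits = sum(doc_id in relevant for doc_id in retrieved[:k])
    return hits / len(relevant)


# Illustrative IDs: ranked retriever output and the labeled relevant set.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d2", "d3"}

print(round(precision_at_k(retrieved, relevant, k=3), 3))
print(round(recall_at_k(retrieved, relevant, k=3), 3))
```

Average both over the full query set, and report them at the same `k` your generator actually consumes.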
**Augmentation quality:**
- Context relevance: Is the retrieved context relevant to the question?
- Context utilization: Does the model actually use the retrieved context?
**Generation quality:**
- Faithfulness: Are the model's claims supported by the retrieved context?
- Answer relevance: Does the answer address the question?
- Groundedness: Are citations accurate?
The RAGAS framework provides automated metrics for these dimensions. It's worth implementing or adapting.
---
## A/B Testing LLM Changes
Running A/B tests on LLM changes requires careful design because:
- LLM quality is hard to measure automatically (see above)
- Users may perceive quality changes that your metrics miss
- Some effects (trust erosion) take time to manifest
### What to Measure in A/B Tests
Beyond standard product metrics (engagement, retention), add:
- **Regeneration rate:** Users who click "regenerate" are signaling dissatisfaction.
- **Copy rate:** Users who copy the response likely found it useful.
- **Follow-up question rate:** A high rate often signals that the first response was incomplete.
- **Explicit ratings:** Thumbs up/down if you have them.
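Rate metrics like these are proportions, so a difference between control and treatment can be sanity-checked with a two-proportion z-test before you read anything into it. The test is a suggested technique, not something the metrics above require, and the counts below are made up.

```python
from math import erfc, sqrt


def two_proportion_z_test(
    events_a: int, n_a: int, events_b: int, n_b: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in rates, B minus A."""
    p_a, p_b = events_a / n_a, events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    # Standard error under the null hypothesis of equal rates.
    # Undefined (zero) if the pooled rate is exactly 0 or 1.
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Illustrative counts: regenerations out of responses shown.
z, p_value = two_proportion_z_test(events_a=120, n_a=1000, events_b=90, n_b=1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A negative `z` here means the treatment group regenerated less often than control; whether the p-value is small enough depends on the threshold you fixed before the test started.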
### Traffic Allocation
For risky changes (prompt rewrites, model upgrades), start at 1–5% traffic before scaling. LLM failures can be subtle and non-obvious from metrics alone — you want a human to review samples from both groups before concluding the test.
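A small rollout like this is commonly implemented with deterministic hash bucketing, so a given user always sees the same variant. The sketch below is one such approach under stated assumptions: the experiment name, user ID format, and 5% share are illustrative.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float) -> str:
    """Deterministically map a user to 'treatment' or 'control'."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    # First 32 bits of the hash, scaled to a bucket in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "treatment" if bucket < treatment_share else "control"


assignments = [
    assign_variant(f"user-{i}", "prompt-rewrite-v2", treatment_share=0.05)
    for i in range(10_000)
]
share = assignments.count("treatment") / len(assignments)
print(f"treatment share: {share:.3f}")
```

Including the experiment name in the hash keeps assignments independent across experiments, and raising `treatment_share` later only moves users from control to treatment, never the other way.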
---
## The Eval-First Development Process
The best teams build evaluations before they build the feature:
1. **Define what "good" looks like.** Write examples of ideal outputs for your use case.
2. **Build your eval set.** Create a test set that covers the distribution of real queries.
3. **Establish a baseline.** Measure your current system against the eval set.
4. **Iterate.** Make changes. Re-run evals. Ship only if the target metrics improve and nothing else regresses.
This sounds obvious. Most teams skip steps 1–3 and iterate in the dark.
---
## Red-Teaming
Red-teaming is adversarial testing: trying to make your system fail in ways that real attackers or edge-case users might try.
### Red-Team Test Categories
- **Jailbreaks:** Attempts to get the model to violate its guidelines via clever framing.
- **Prompt injection:** Attempts to override the system prompt via user input or retrieved content.
- **Information extraction:** Attempting to get the model to reveal system prompts, user data, or training data.
- **Edge cases:** Queries at the boundary of the system's capabilities (multilingual, highly technical, ambiguous intent).
- **Social engineering:** Gradual escalation, building false context, roleplay framing.
### Red-Teaming Process
1. Hire or designate red-teamers who are incentivized to find failures.
2. Give them access to the system but not the full system prompt.
3. Set a quota: find N failures per session.
4. For each failure found, log the attack, the result, and the fix.
5. Add every found failure to your regression test set.
6. Re-run red-teaming after any major change.
Red-teaming before every significant model update is not optional for production systems.
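Steps 4 and 5 of the process above can be as simple as appending each finding to a JSON Lines file that your regression suite reads. The record fields, file name, and example finding below are hypothetical; adapt them to whatever your suite already stores.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class RedTeamFinding:
    category: str         # e.g. "jailbreak", "prompt_injection"
    attack: str           # the input that triggered the failure
    observed_result: str  # what the system did
    fix: str              # what was changed in response


def append_to_regression_set(finding: RedTeamFinding, path: Path) -> None:
    """Append one finding as a JSON line."""
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(finding)) + "\n")


def load_regression_set(path: Path) -> list[dict]:
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo in a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    regression_path = Path(tmp) / "regression_set.jsonl"
    append_to_regression_set(
        RedTeamFinding(
            category="prompt_injection",
            attack="Retrieved document containing 'ignore prior instructions'",
            observed_result="Model followed the injected instruction",
            fix="Delimited retrieved content and added an instruction hierarchy",
        ),
        regression_path,
    )
    loaded = load_regression_set(regression_path)

print(len(loaded), loaded[0]["category"])
```

An append-only file keeps the history of attacks intact, which matters when a later change quietly reintroduces an old failure.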