Loading...
Loading...
Loading...
---
title: "LLM Evaluation Cheat Sheet"
description:
Eval types, metrics, LLM-as-judge patterns, regression testing, CI
integration, and tooling for testing non-deterministic AI systems.
---
Testing LLM applications differs from traditional software testing. Outputs are
non-deterministic, quality is subjective, and no single "correct" answer exists
for most tasks. Evaluation bridges this gap with grading strategies designed for
probabilistic systems.
## Quick Reference
| I need to... | Approach | Tool / Method |
| ------------------------------------ | ---------------------------- | ------------------------------------------------- |
| Check exact output format | Deterministic assertion | String match, regex, JSON schema |
| Grade subjective quality | LLM-as-judge | Rubric prompt with stronger model |
| Measure factual accuracy | Golden set + scoring | Exact match, F1, ROUGE-L |
| Detect regressions from prompt edits | Regression eval suite | promptfoo / Braintrust experiment diff |
| Compare two models or prompts | Side-by-side eval | promptfoo matrix view, Braintrust experiments |
| Measure consistency | Cosine similarity | Sentence embeddings across paraphrased inputs |
| Test safety and toxicity | Red-teaming + classification | promptfoo red-team, LLM binary classifier |
| Monitor production quality | Online eval | Braintrust online scoring, sampled human review |
| Run evals in CI | GitHub Actions integration | promptfoo `--fail-on-error`, threshold scripts |
| Control eval costs | Sampling + caching | Subset runs, prompt caching, smaller judge models |
## Why LLM Evaluation Differs
Traditional tests assert deterministic outcomes: `f(x) == y`. LLM evaluation
handles three problems traditional testing avoids:
| Problem | Traditional test | LLM eval |
| ------------------ | ------------------------- | ----------------------------------------------- |
| Non-determinism | Same input, same output | Same input, different outputs each run |
| Subjective quality | Binary pass/fail | Graded on rubrics, similarity, human judgement |
| No ground truth | Known correct answer | Multiple valid answers; "best" is contextual |
| Prompt sensitivity | Code is the specification | Small wording changes cause large output swings |
This means evaluation must combine deterministic checks (where possible),
statistical scoring (for quality), and human review (for edge cases).
## Eval Types
### Deterministic Assertions
Use when output format is constrained and verifiable by code. Fastest and most
reliable.
```python
# Exact match
def eval_exact(output: str, expected: str) -> bool:
return output.strip().lower() == expected.lower()
# Contains required content
def eval_contains(output: str, required: list[str]) -> bool:
return all(phrase in output for phrase in required)
# JSON structure validation
import json
def eval_json_schema(output: str) -> bool:
try:
data = json.loads(output)
return "answer" in data and "confidence" in data
except json.JSONDecodeError:
return False
```
Promptfoo YAML equivalents:
```yaml
assert:
- type: equals
value: "expected output"
- type: contains-all
value: ["required phrase", "another phrase"]
- type: is-json
- type: regex
value: '^\d{4}-\d{2}-\d{2}$'
- type: cost
threshold: 0.01
- type: latency
threshold: 500
```
### Scoring Evals (LLM-as-Judge)
Use when quality is subjective or multi-dimensional. A stronger model grades the
output of the model under test.
```python
import anthropic
client = anthropic.Anthropic()
def llm_judge(output: str, rubric: str) -> dict:
"""Grade output against a rubric using a stronger model."""
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=1024,
messages=[{
"role": "user",
"content": f"""Grade this output against the rubric.
<rubric>{rubric}</rubric>
<output>{output}</output>
Think step by step in <thinking> tags.
Then output a JSON object with "score" (1-5) and "reason" (one sentence).
"""
}]
)
return parse_json(response.content[0].text)
```
Promptfoo YAML:
```yaml
assert:
- type: llm-rubric
value: "Response mentions the return policy and provides a timeframe"
provider: openai:gpt-4o
- type: similar
value: "reference answer text"
threshold: 0.8
```
**Rubric design principles:**
- Be specific: "mentions Acme Inc. in the first sentence" beats "good response"
- Use scales: Likert 1-5, binary correct/incorrect, or ordinal categories
- Require reasoning before scoring to improve judge accuracy
- Use a different (ideally stronger) model than the one being evaluated
### Human Evals
Use for final validation, calibrating automated judges, and edge cases where
automated grading fails. Expensive and slow -- reserve for high-stakes
decisions.
| Human eval method | When to use | Scale |
| --------------------- | ---------------------------------------- | ------ |
| Side-by-side ranking | Comparing two prompt variants | Low |
| Likert rating | Measuring tone, helpfulness, coherence | Medium |
| Correctness labeling | Building golden datasets | Low |
| Error categorization | Understanding failure modes | Low |
| Spot-check production | Validating automated scoring calibration | Medium |
**Calibration loop:** Run human evals on a sample, then train LLM-as-judge
rubrics to match human scores. Periodically re-calibrate.
## Eval Datasets
### Golden Sets
Curated input-output pairs with verified correct answers. The foundation of
regression testing.
```python
# golden_set.jsonl — one case per line
# {"input": "What is the capital of France?", "expected": "Paris"}
# {"input": "Summarize photosynthesis in one sentence.", "expected": "..."}
import json
def load_golden_set(path: str) -> list[dict]:
with open(path) as f:
return [json.loads(line) for line in f]
```
**Building golden sets:**
1. Start with 20-50 hand-crafted cases covering core behaviors and edge cases
2. Add cases from production failures (every bug becomes a test case)
3. Include adversarial inputs: jailbreaks, prompt injections, ambiguous queries
4. Label expected output, acceptable variations, and unacceptable outputs
### Synthetic Generation
Use an LLM to expand a small seed set into a larger eval dataset.
```python
def generate_test_cases(seed_examples: list[dict], n: int = 100) -> list[dict]:
"""Generate synthetic eval cases from seed examples."""
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=4096,
messages=[{
"role": "user",
"content": f"""Generate {n} test cases similar to these examples.
Include edge cases, adversarial inputs, and boundary conditions.
Output as JSON array with "input" and "expected" fields.
Examples:
{json.dumps(seed_examples, indent=2)}"""
}]
)
return json.loads(response.content[0].text)
```
**Caution:** Verify synthetic cases. LLM-generated test data can encode the
model's own biases, creating a self-reinforcing loop.
### Production Sampling
Sample real user interactions to build eval sets that reflect actual usage
patterns, not imagined ones.
```python
# Sample strategy: log interactions, then filter
def sample_production_data(logs: list, n: int = 200) -> list:
"""Stratified sample: common cases + edge cases + failures."""
failures = [l for l in logs if l["user_rating"] < 3]
normal = [l for l in logs if l["user_rating"] >= 3]
# Over-sample failures to catch regressions
sample = (
random.sample(failures, min(n // 4, len(failures)))
+ random.sample(normal, min(3 * n // 4, len(normal)))
)
return sample
```
## Metrics
### Quality Metrics
| Metric | What it measures | Method |
| ----------------- | ------------------------------------------------------ | --------------------------------- |
| Accuracy | Correct answers / total answers | Exact match against golden set |
| F1 Score | Precision-recall balance for classification | Compare predicted vs. true labels |
| ROUGE-L | Summary quality (longest common subsequence) | Compare against reference summary |
| Cosine similarity | Semantic consistency across paraphrased inputs | Sentence embeddings comparison |
| Faithfulness | Output grounded in provided context (no hallucination) | LLM-as-judge with source docs |
| Relevance | Output addresses the question asked | LLM-as-judge with rubric |
| Toxicity | Harmful or inappropriate content | Classifier or LLM binary judge |
### Operational Metrics
| Metric | What it measures | Target example |
| ------------- | ------------------------ | -------------- |
| Latency (p50) | Median response time | < 500ms |
| Latency (p99) | Tail response time | < 2000ms |
| Cost per call | API spend per invocation | < $0.01 |
| Token usage | Input + output tokens | < 4096 total |
| Error rate | Failed API calls | < 0.1% |
Track both. A model that scores well on quality but costs 10x more or adds 5
seconds of latency may lose to a "worse" model in production.
## LLM-as-Judge Patterns
### Stronger Model Judges Weaker Model
The most common pattern. Use a frontier model to evaluate a smaller, cheaper
model's output.
```python
def judge_with_rubric(question: str, output: str, rubric: str) -> int:
"""Returns score 1-5."""
response = client.messages.create(
model="claude-sonnet-4-20250514", # Judge: stronger model
max_tokens=512,
messages=[{
"role": "user",
"content": f"""You are an evaluation judge. Score this output 1-5.
<question>{question}</question>
<output>{output}</output>
<rubric>{rubric}</rubric>
Think step by step, then output only the integer score on the last line."""
}]
)
# Parse the last line as the score
return int(response.content[0].text.strip().split('\n')[-1])
```
### Self-Consistency Check
Run the same prompt multiple times. If the model gives conflicting answers, the
output is unreliable.
```python
def self_consistency(prompt: str, n: int = 5, threshold: float = 0.7) -> dict:
"""Run n times, check agreement."""
responses = [get_completion(prompt) for _ in range(n)]
# For classification: majority vote
from collections import Counter
votes = Counter(r.strip().lower() for r in responses)
most_common, count = votes.most_common(1)[0]
return {
"answer": most_common,
"agreement": count / n,
"reliable": count / n >= threshold,
}
```
### Pairwise Comparison
Ask the judge to choose the better output from two candidates. More reliable
than absolute scoring for comparing variants.
```python
def pairwise_judge(question: str, output_a: str, output_b: str) -> str:
"""Returns 'A', 'B', or 'tie'."""
response = client.messages.create(
model="claude-sonnet-4-20250514",
max_tokens=512,
messages=[{
"role": "user",
"content": f"""Which response better answers the question?
<question>{question}</question>
<response_a>{output_a}</response_a>
<response_b>{output_b}</response_b>
Think step by step, then output exactly one of: A, B, tie"""
}]
)
result = response.content[0].text.strip().split('\n')[-1].upper()
if "A" in result: return "A"
if "B" in result: return "B"
return "tie"
```
**Position bias:** LLM judges favor the first response shown. Mitigate by
running each comparison twice with swapped order.
## Regression Testing for Prompts
Prompt changes break existing behavior more often than they improve it. Treat
prompt edits like code changes: test before deploying.
### Workflow
```text
1. Baseline — Run eval suite on current prompt, save scores
2. Edit — Change the prompt
3. Re-evaluate — Run same eval suite on new prompt
4. Compare — Diff scores against baseline
5. Ship/revert — Accept if no regressions, or fix and re-test
```
### Promptfoo Configuration for Regression Testing
```yaml
# promptfooconfig.yaml
description: "Summarization prompt regression test"
prompts:
- file://prompts/summarize_v1.txt
- file://prompts/summarize_v2.txt
providers:
- anthropic:messages:claude-sonnet-4-20250514
defaultTest:
assert:
- type: llm-rubric
value: "Summary captures the main point in 1-2 sentences"
- type: latency
threshold: 3000
tests:
- vars:
article: "In a groundbreaking study, researchers at MIT..."
assert:
- type: contains
value: "MIT"
- type: similar
value: "MIT researchers discovered a new antibiotic compound"
threshold: 0.7
- vars:
article: "The quarterly earnings report showed..."
assert:
- type: not-contains
value: "I don't know"
```
```bash
# Run eval and view matrix comparison
npx promptfoo eval
npx promptfoo view
```
### Braintrust Experiment Comparison
```python
from braintrust import Eval
from autoevals import Factuality, ClosedQA
Eval(
"Summarizer",
data=lambda: load_golden_set("golden_set.jsonl"),
task=lambda input: summarize(input, prompt_version="v2"),
scores=[Factuality, ClosedQA],
)
# Braintrust UI shows diff against previous experiment run
```
## A/B Testing and Canary Evaluation
### A/B Testing
Split production traffic between prompt variants and measure user-facing
metrics.
```python
import random
def route_prompt(user_id: str, variants: dict, split: float = 0.5) -> str:
"""Deterministic routing based on user ID."""
bucket = hash(user_id) % 100
variant = "B" if bucket < split * 100 else "A"
return variants[variant]
```
**Measure:** Task completion rate, user satisfaction, error escalations -- not
just automated scores.
### Canary Evaluation
Deploy the new prompt to a small slice of traffic (1-5%). Monitor automated
scores and error rates before full rollout.
```text
1. Deploy new prompt to 2% of traffic
2. Run automated evals on canary outputs for 24-48 hours
3. Compare canary scores against baseline (control group)
4. If canary scores match or exceed baseline → increase to 100%
5. If canary shows regressions → roll back immediately
```
## CI Integration
### GitHub Actions with Promptfoo
```yaml
# .github/workflows/llm-eval.yml
name: LLM Eval
on:
pull_request:
paths:
- "prompts/**"
- "promptfooconfig.yaml"
jobs:
eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 20
- name: Cache promptfoo
uses: actions/cache@v4
with:
path: ~/.cache/promptfoo
key: promptfoo-${{ hashFiles('promptfooconfig.yaml') }}
- name: Run evals
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
PROMPTFOO_CACHE_PATH: ~/.cache/promptfoo
run: |
npx promptfoo eval --output results.json
- name: Check pass rate
run: |
PASS_RATE=$(jq '.results.stats.successes / .results.stats.total' results.json)
echo "Pass rate: $PASS_RATE"
if (( $(echo "$PASS_RATE < 0.95" | bc -l) )); then
echo "::error::Eval pass rate $PASS_RATE below 95% threshold"
exit 1
fi
- name: Upload results
if: always()
uses: actions/upload-artifact@v4
with:
name: eval-results
path: results.json
```
### Setting Thresholds
| Strategy | When to use | Example |
| ----------------- | -------------------------- | -------------------------------- |
| Zero failures | Safety-critical assertions | `--fail-on-error` |
| Pass rate | Probabilistic quality | Fail if < 95% of cases pass |
| Score regression | Comparing against baseline | Fail if avg score drops > 5% |
| Per-metric floors | Multi-dimensional quality | Accuracy > 90% AND toxicity < 1% |
## Cost Management
LLM evals call APIs for every test case and (for LLM-as-judge) again for every
grading step. Costs compound fast.
| Strategy | Savings | Tradeoff |
| -------------------------- | -------- | -------------------------------------- |
| Cache responses | 50-90% | Stale on model updates |
| Sample eval set (20-30%) | 70-80% | Lower statistical confidence |
| Smaller judge model | 60-80% | Less nuanced grading |
| Deterministic checks first | Variable | Only works for structured output |
| Batch API | 50% | Higher latency (hours, not seconds) |
| Run full suite nightly | N/A | CI gets fast subset; nightly gets full |
### Tiered Eval Strategy
```text
PR-level (every push):
→ 50 critical cases, deterministic assertions only
→ Cost: ~$0.50 per run
Merge-level (on merge to main):
→ 200 cases, deterministic + LLM-as-judge
→ Cost: ~$5 per run
Nightly (scheduled):
→ Full golden set (1000+ cases), all metrics
→ Cost: ~$25 per run
```
## Eval-Driven Development
Write evals first, then iterate on prompts. Analogous to TDD for
non-deterministic systems.
```text
1. Red — Write eval cases that define desired behavior
2. Red — Run against initial prompt (expect many failures)
3. Green — Iterate on prompt until eval suite passes
4. Refine — Add edge cases to eval set, tighten thresholds
5. Ship — Baseline scores become the regression floor
```
**The key insight:** Prompts without evals drift silently. The eval suite is the
specification. Without it, you are guessing.
## Tools
| Tool | Strengths | Best for |
| ------------------- | ----------------------------------------------------- | ------------------------------------- |
| **promptfoo** | Open-source, YAML config, matrix comparison, red-team | Prompt comparison, CI integration |
| **Braintrust** | Experiment tracking, online scoring, dataset mgmt | Production monitoring, team workflows |
| **Anthropic Evals** | Native Claude integration, grading examples | Claude-specific applications |
| **Custom harness** | Full control, no vendor lock-in | Unique eval logic, internal tooling |
| **autoevals** (lib) | Pre-built scorers (Factuality, ClosedQA, Relevance) | Quick scoring without writing rubrics |
### Promptfoo Quickstart
```bash
# Install and initialize
npx promptfoo init
# Edit promptfooconfig.yaml, then run
npx promptfoo eval
# View results in browser
npx promptfoo view
# Compare two prompts side by side
npx promptfoo eval --output results.json
```
### Braintrust Quickstart
```bash
# Install
pip install braintrust autoevals
```
```python
from braintrust import Eval
from autoevals import Factuality
Eval(
"My App",
data=lambda: [
{"input": "What is 2+2?", "expected": "4"},
{"input": "Capital of France?", "expected": "Paris"},
],
task=lambda input: get_completion(input),
scores=[Factuality],
)
```
### Custom Eval Harness
```python
import json
from dataclasses import dataclass
@dataclass
class EvalResult:
case_id: str
passed: bool
score: float
reason: str
def run_eval_suite(
cases: list[dict],
task_fn,
graders: list,
threshold: float = 0.9,
) -> dict:
"""Run eval suite and return summary."""
results = []
for case in cases:
output = task_fn(case["input"])
scores = [g(output, case.get("expected", "")) for g in graders]
avg_score = sum(s for s in scores) / len(scores)
results.append(EvalResult(
case_id=case.get("id", ""),
passed=avg_score >= threshold,
score=avg_score,
reason=f"Avg score {avg_score:.2f}",
))
passed = sum(1 for r in results if r.passed)
return {
"total": len(results),
"passed": passed,
"pass_rate": passed / len(results),
"results": results,
}
```
## Anti-Patterns
| Anti-Pattern | Problem | Fix |
| ----------------------------- | ----------------------------------------------------------------------- | --------------------------------------------------------------- |
| Vibes-based evaluation | "It looks good" catches nothing at scale | Define rubrics, measure scores, track over time |
| Testing on training data | Model memorizes answers; eval scores inflate artificially | Use held-out test sets; rotate eval data |
| Single-metric optimization | Optimizing accuracy alone tanks safety, tone, or latency | Track multiple metrics; set floors for each |
| Eval gaming | Tuning prompts to pass specific test cases rather than general behavior | Use large, diverse eval sets; add new cases regularly |
| No baseline | Cannot tell if changes improved or regressed quality | Save baseline scores; diff every prompt change |
| Judge model = tested model | Model grades itself favorably; biases go undetected | Use a different (stronger) model as judge |
| Static eval set | Production inputs drift; eval set becomes unrepresentative | Refresh eval data from production samples quarterly |
| Asserting on exact LLM output | Non-deterministic outputs make exact-match flaky | Use semantic similarity, contains checks, or LLM-as-judge |
| Skipping operational metrics | Model quality is high but latency or cost makes it unusable | Always measure latency, cost, and token usage alongside quality |
## See Also
- [Testing](testing.md) — Test runner commands and patterns for deterministic
software
- [Testing Principles](../why/testing.md) — Strategy, pyramid, TDD — the
foundations eval-driven development builds on
- [Performance Profiling](performance.md) — Benchmarking and latency
measurement, analogous to operational eval metrics
- [Prompt Engineering](prompt-engineering.md) — The prompts these evals validate
- Without a harness, you **can't compare** prompts, models, retrieval configs, or costs.
Evaluate, benchmark, and regression-test AI/LLM systems. Covers evaluation framework design, benchmark creation, human evaluation protocols, automated evaluation (LLM-as-judge), regression testing, statistical significance, and continuous evaluation pipelines.
<img width="1388" height="298" alt="full_diagram" src="https://github.com/user-attachments/assets/12a2371b-8be2-4219-9b48-90503eb43c69" />
A list of all public EEG-datasets. This list of EEG-resources is not exhaustive. If you find something new, or have explored any unfiltered link in depth, please update the repository.