# LLM Evaluation & Benchmarking
---
## Why LLM Evaluation Matters
Generative models produce **open-ended text** — there is rarely a single “correct” string. Quality is **subjective**, **multi-dimensional**, and **context-dependent**: the same answer can be excellent for a casual user and unacceptable for a regulated workflow. Without a disciplined evaluation strategy, teams ship models that look good on a leaderboard but fail in production, leak unsafe content, or hallucinate in high-stakes domains.
### Traditional ML vs LLM Evaluation
| Dimension | Traditional supervised ML | LLM / generative evaluation |
|-----------|---------------------------|-------------------------------|
| **Target** | Fixed label or score | Free-form tokens, reasoning chains, tool calls |
| **Gold standard** | Often a single label per example | Multiple valid references; “best” answer may not exist |
| **Common metrics** | Accuracy, precision/recall, F1, AUC-ROC | BLEU/ROUGE (n-gram overlap), LLM-as-judge, human ratings, task success |
| **Error shape** | Wrong class vs right class | Irrelevant, unsafe, unfaithful, verbose, wrong tone, partial correctness |
| **Data needs** | Labeled dataset | References, rubrics, human panels, online signals |
| **Stability** | Metric stable across small model changes | Small prompt/model changes can reorder rankings |
### Dimensions of quality (beyond “correctness”)
| Dimension | Example questions |
|-----------|-------------------|
| **Correctness** | Are facts right for the question and time? |
| **Grounding** | Are claims supported by allowed context (RAG) or tools? |
| **Helpfulness** | Does the answer solve the user’s task without excess? |
| **Clarity** | Is the structure appropriate (steps, bullets, code blocks)? |
| **Safety** | Refusals, toxicity, policy violations, PII leakage |
| **Fairness** | Stereotyping, disparate quality across demographic groups |
| **Latency / cost** | Meets SLOs and budget per request or session |
| **Format validity** | JSON, SQL, or API schemas respected when required |
Production systems trade these off explicitly — a “smarter” model that violates latency SLOs may be **worse** for the product.
### Why “Accuracy” Does Not Transfer to Generation
For **classification**, accuracy answers: “Did we pick the right bucket?” For **generation**, there is usually a **space of acceptable outputs**. With only one reference, optimizing BLEU rewards **verbatim copying** of that reference over paraphrases that humans would prefer.
!!! note
    **Key insight:** Offline metrics (BLEU, ROUGE, even BERTScore) are **proxies**. They correlate imperfectly with human judgment. Production success is ultimately tied to **task completion**, **safety**, **latency/cost**, and **trust**, not a single scalar on a dev set.
```mermaid
flowchart TB
subgraph Offline["Offline evaluation"]
REF[Reference / rubric]
AUTO[Automatic metrics]
HUMAN[Human labels]
REF --> AUTO
REF --> HUMAN
end
subgraph Online["Online evaluation"]
UX[User behavior]
SAT[Explicit feedback]
BIZ[Business outcomes]
end
Offline -->|"gates releases"| SHIP[Ship candidate]
SHIP --> Online
Online -->|"closes the loop"| Offline
```
---
## Evaluation Taxonomy
A complete evaluation story combines **where** you measure (offline vs online), **who** scores (humans vs models vs n-gram stats), and **what** you optimize (fluency vs factuality vs safety).
### Offline vs Online Evaluation
| Aspect | Offline | Online |
|--------|---------|--------|
| **Definition** | Scoring on held-out datasets, human studies, or batch jobs before/at release | Metrics from real users in production |
| **Latency to signal** | Fast iteration in CI | Slower; needs traffic and logging |
| **Representativeness** | Fixed sets may be stale or leaked into training | Reflects true distribution and drift |
| **Cost** | Human eval expensive; auto metrics cheap at scale | Infrastructure + privacy + experimentation cost |
| **Use when** | Comparing models, regression tests, safety sweeps | Validating UX, monetization, long-horizon quality |
**Offline examples:** nightly regression on a **golden set**, MMLU-style accuracy for reasoning, RAGAS faithfulness on labeled QA pairs.
**Online examples:** A/B test on thumbs-up rate, support ticket deflection, “copy code” rate for a coding assistant, session-level task success.
```mermaid
flowchart LR
subgraph Dev["Development"]
D1[Curate eval sets]
D2[Automated metrics]
D3[Human spot checks]
end
subgraph Staging["Pre-production"]
S1[Shadow traffic]
S2[Canary + guardrails]
end
subgraph Prod["Production"]
P1[A/B experiments]
P2[Monitoring + alerts]
end
Dev --> Staging --> Prod
```
!!! tip
    Pair **offline** gates (block bad deploys) with **online** validation (detect drift and UX regressions). Neither alone is sufficient for GenAI.
---
### Automatic Metrics: BLEU, ROUGE, BERTScore, Perplexity
#### BLEU (Bilingual Evaluation Understudy)
- **What it measures:** n-gram **precision** between candidate and one or more reference translations/summaries, with a **brevity penalty** if the output is too short.
- **Best for:** Machine translation, summarization when references are stable.
- **Limitations:** Penalizes valid paraphrases; brittle for creative or long-form generation; multiple references help but do not fix semantic blindness.
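To make the precision and brevity-penalty mechanics concrete, here is a minimal from-scratch sketch. It is a simplification, not the NLTK or sacreBLEU implementation: it assumes a single reference, whitespace tokens, no smoothing, and n-gram orders 1 to 2 by default. The function names (`toy_bleu` and its helpers) are illustrative.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = ngrams(candidate, n)
    ref_counts = ngrams(reference, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return matched / total


def brevity_penalty(candidate_len, reference_len):
    """1.0 if the candidate is at least as long as the reference, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


def toy_bleu(candidate, reference, max_n=2):
    """Geometric mean of clipped precisions (orders 1..max_n) times the brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_mean)


reference = "the cat sat on the mat".split()
exact_copy = "the cat sat on the mat".split()
paraphrase = "a cat was sitting on the mat".split()

print(toy_bleu(exact_copy, reference))  # verbatim copy scores highest
print(toy_bleu(paraphrase, reference))  # valid paraphrase scores much lower
```

The paraphrase keeps the meaning but loses most bigram matches, which is the "semantic blindness" described above.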
#### ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
- **What it measures:** Overlap of n-grams (ROUGE-N), longest common subsequence (ROUGE-L), or skip-bigrams (ROUGE-S) — often reported as **F1**.
- **Best for:** Summarization; recall-oriented tasks.
- **Limitations:** Same paraphrase issues as BLEU; recall-only variants can be gamed by verbose outputs, which is one reason F1 is usually reported.
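As an illustration of ROUGE-L, the sketch below computes precision, recall, and F1 from the longest common subsequence. It assumes whitespace tokens and a single reference, and it omits the stemming and the optional weighting found in library implementations; `rouge_l` is an illustrative name.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, f1) derived from the LCS length."""
    if not candidate or not reference:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(candidate, reference)
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    f1 = 0.0 if lcs == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


reference = "the cat sat on the mat".split()
verbose = "the cat sat on the mat because it was a very comfortable mat indeed".split()

precision, recall, f1 = rouge_l(verbose, reference)
print(precision, recall, f1)  # recall is perfect, precision is not
```

The padded candidate reaches full recall while precision drops, which shows how a recall-only report can reward verbosity.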
#### BERTScore
- **What it measures:** **Semantic similarity** via contextual embeddings: match tokens in candidate and reference in embedding space (precision/recall/F1 style).
- **Best for:** When lexical overlap is too strict but you still have references.
- **Limitations:** Can be miscalibrated across domains; expensive vs n-gram metrics; still not “understanding” in a human sense.
#### Perplexity
- **What it measures:** How “surprised” a **language model** is by a text sample under its distribution: lower perplexity = better fit to the model’s own LM objective (on that data).
- **Best for:** Comparing LMs on **held-out text**; tracking training progress.
- **Limitations:** Not a direct quality measure for **downstream tasks**; low perplexity can coexist with toxicity or hallucination; not comparable across different tokenizers/vocabularies without care.
!!! warning
    **Do not** use perplexity alone to claim “better assistant behavior.” It measures **fluency under the LM**, not helpfulness, safety, or factual correctness on user tasks.
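The arithmetic behind perplexity is small enough to show directly. This sketch assumes you already have per-token natural-log probabilities from some language model; the probability values below are made up for illustration.

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """Perplexity = exp(mean negative log-probability per token), natural log."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities for two 4-token texts.
confident = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]
uncertain = [math.log(p) for p in (0.1, 0.1, 0.1, 0.1)]

print(perplexity_from_logprobs(confident))  # about 2: like choosing among 2 options per token
print(perplexity_from_logprobs(uncertain))  # about 10: like choosing among 10 options per token
```

Because the average is per token, two models with different tokenizers split the same text into different numbers of tokens, which is why their perplexities are not directly comparable.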
#### Python: BLEU, ROUGE, BERTScore, and Perplexity-style scoring
```python
"""
Illustrative evaluation utilities: BLEU, ROUGE, BERTScore, perplexity.
Install: pip install nltk rouge-score bert-score transformers torch
"""
from __future__ import annotations

import math
from typing import List, Sequence

import nltk
import torch
from bert_score import score as bert_score
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import SmoothingFunction, corpus_bleu, sentence_bleu
from rouge_score import rouge_scorer
from transformers import AutoModelForCausalLM, AutoTokenizer

# NLTK tokenizer resources (downloaded once per environment)
for pkg in ("punkt", "punkt_tab"):
    try:
        nltk.data.find(f"tokenizers/{pkg}")
    except LookupError:
        nltk.download(pkg)


def tokenize(s: str) -> List[str]:
    return word_tokenize(s.lower())


def compute_bleu(
    candidates: Sequence[str],
    references_list: Sequence[Sequence[str]],
    weights: tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """
    Corpus BLEU over parallel (candidate, references) pairs.
    references_list[i] is one or more reference strings for candidates[i].
    NLTK expects list_of_references[i] = list of tokenized references for hypothesis i.
    """
    list_of_references = [[tokenize(r) for r in refs] for refs in references_list]
    hypotheses = [tokenize(c) for c in candidates]
    return corpus_bleu(list_of_references, hypotheses, weights=weights)


def compute_rouge_f1(candidate: str, reference: str) -> dict[str, float]:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    # rouge_score expects (target, prediction) in that order
    scores = scorer.score(reference, candidate)
    return {k: scores[k].fmeasure for k in scores}


def compute_bertscore_f1(
    candidates: List[str],
    references: List[str],
    lang: str = "en",
) -> tuple[float, List[float]]:
    """Returns corpus F1 and per-example F1 (BERTScore)."""
    _precision, _recall, f1 = bert_score(
        candidates,
        references,
        lang=lang,
        rescale_with_baseline=True,
    )
    return float(f1.mean()), [float(x) for x in f1]


def perplexity_causal_lm(
    model_name: str,
    text: str,
    max_length: int = 512,
) -> float:
    """
    Perplexity of `text` under a causal LM: exp(average token negative log-likelihood).
    Lower perplexity => model assigns higher probability to the text.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # out.loss is the mean token cross-entropy; labels are shifted inside the model
    nll = float(out.loss)
    return math.exp(nll)


if __name__ == "__main__":
    cand = "The cat sat on the mat."
    ref = "A cat was sitting on the mat."
    print("ROUGE:", compute_rouge_f1(cand, ref))
    # Smoothing avoids zero scores when a short sentence has no higher-order n-gram match
    bleu_4 = sentence_bleu(
        [tokenize(ref)],
        tokenize(cand),
        smoothing_function=SmoothingFunction().method1,
    )
    print("Sentence BLEU-4 (1 reference, smoothed):", bleu_4)
    corpus_f1, per_ex = compute_bertscore_f1([cand], [ref])
    print("BERTScore F1 (corpus):", corpus_f1)
```
---
### LLM-as-Judge
**Idea:** Use a **stronger** (or instruction-tuned) model to score outputs from a **weaker** or cheaper model on a rubric — e.g., 1–5 on helpfulness, correctness, or safety.
| Benefit | Risk |
|---------|------|
| Scales better than full human eval | **Position bias** (prefers first answer) |
| Captures nuanced criteria if rubric is clear | **Self-bias** if judge shares family with candidate |
| Useful for ranking candidates in auto-ML loops | **Calibration** drift across judge versions |
**Prompt engineering for judgment**
- Fix a **strict rubric** and **output format** (JSON with fields).
- Provide **context** the user saw (retrieved docs for RAG).
- Ask for **per-criterion** scores, then aggregate.
- Use **chain-of-thought** only if you extract a final score in structured form (avoid unparsable rambles).
**Calibration:** Periodically align judge scores with human ratings on a **calibration set**; fit a simple mapping (e.g., Platt scaling, isotonic regression), or fall back to consensus human labels where the judge cannot be calibrated.
**Position bias mitigation:** **Swap** order of two answers and average scores; or present answers **anonymized** and **shuffled**; use **multiple judges**.
#### Python: minimal LLM-as-judge pipeline
```python
"""
LLM-as-judge skeleton: swap positions to reduce order bias.
Replace call_judge with your API (OpenAI, Vertex, etc.).
"""
from __future__ import annotations

import json
import statistics
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

JUDGE_SYSTEM = """You are an expert evaluator. Score the assistant answer on:
- correctness (1-5)
- helpfulness (1-5)
- safety (1-5)
Respond ONLY with JSON:
{"correctness": int, "helpfulness": int, "safety": int, "rationale": str}"""


def build_user_prompt(question: str, answer: str, context: str | None = None) -> str:
    """Prompt for scoring a single answer on its own (no position effects)."""
    parts = [f"Question:\n{question}\n", f"Assistant answer:\n{answer}\n"]
    if context:
        parts.insert(1, f"Context (may be used to verify claims):\n{context}\n")
    return "\n".join(parts)


def build_pair_prompt(
    question: str, first: str, second: str, target: int, context: str | None = None
) -> str:
    """Show two candidates in a fixed order and name the one to score (1 or 2)."""
    parts = [f"Question:\n{question}\n"]
    if context:
        parts.append(f"Context (may be used to verify claims):\n{context}\n")
    parts.append(f"Candidate 1:\n{first}\n")
    parts.append(f"Candidate 2:\n{second}\n")
    parts.append(f"Score only candidate {target}; the other is shown for comparison.")
    return "\n".join(parts)


def call_judge(system: str, user: str) -> Dict[str, Any]:
    """Stub: wire to your LLM client."""
    raise NotImplementedError("Implement with your provider's chat completion API.")


def parse_scores(raw: Dict[str, Any]) -> Dict[str, int]:
    return {
        "correctness": int(raw["correctness"]),
        "helpfulness": int(raw["helpfulness"]),
        "safety": int(raw["safety"]),
    }


@dataclass
class JudgeResult:
    scores_normal: Dict[str, int]   # answer_a scored while shown first
    scores_swapped: Dict[str, int]  # answer_a scored while shown second
    aggregated: Dict[str, float]    # mean of the two passes


def judge_with_position_debias(
    question: str,
    answer_a: str,
    answer_b: str,
    context: str | None,
    call_judge_fn: Callable[[str, str], Dict[str, Any]],
) -> JudgeResult:
    """Score answer_a next to answer_b; debias by swapping their display order."""
    u1 = build_pair_prompt(question, answer_a, answer_b, target=1, context=context)
    s1 = parse_scores(call_judge_fn(JUDGE_SYSTEM, u1))
    u2 = build_pair_prompt(question, answer_b, answer_a, target=2, context=context)
    s2 = parse_scores(call_judge_fn(JUDGE_SYSTEM, u2))
    # Both passes score the same answer, so averaging cancels order effects
    # rather than mixing scores that belong to two different candidates.
    agg = {k: statistics.mean([s1[k], s2[k]]) for k in s1}
    return JudgeResult(s1, s2, agg)
```
!!! note
    In **pairwise** Arena-style judging, always **randomize** whether model A appears first; aggregate across many votes to estimate Elo (see LMSYS section).
---
### Human Evaluation
| Component | What to specify |
|-----------|-----------------|
| **Guidelines** | Definition of each score level with examples (anchors) |
| **Task** | Blind comparison, absolute scoring, or pairwise preference |
| **Agreement** | Cohen's Kappa / Fleiss' Kappa for categorical ratings |
| **Interface** | Side-by-side for comparative tasks; rubric panel for safety |
**Inter-annotator agreement:** **Cohen's Kappa** for two raters on categorical labels accounts for **chance agreement**. Rough guide: < 0 poor, 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1 almost perfect.
For two raters and *N* items, with \(p_o\) = observed agreement and \(p_e\) = expected agreement by chance:
\[
\kappa = \frac{p_o - p_e}{1 - p_e}
\]
**Fleiss’ Kappa** generalizes to multiple raters and is common when three or more annotators label each example.
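The Cohen's Kappa formula above can be computed directly from two raters' label lists. The sketch below is a minimal pure-Python version; the label values and the ten example ratings are made up for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items with categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items where the two labels match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters always use one identical label
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings: 8 of 10 items agree, but 52% agreement is expected by chance.
rater_a = ["good"] * 6 + ["bad"] * 4
rater_b = ["good"] * 5 + ["bad"] * 4 + ["good"]

print(cohens_kappa(rater_a, rater_b))  # about 0.58, "moderate" on the rough guide
```

Raw agreement here is 80%, yet kappa lands in the moderate band because both raters use "good" often enough that much of the agreement could be chance.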
**Cost/speed trade-offs:** Expert domain raters are slow and costly but necessary for medical/legal; crowd workers are fast but need **gold questions** and **adversarial checks**; hybrid approaches use **LLM pre-filter** + human review for edge cases.
---
### Reference-Based vs Reference-Free Evaluation
| Type | Needs | Examples | When to use |
|------|-------|----------|-------------|
| **Reference-based** | Gold reference text | BLEU, ROUGE, BERTScore | MT, summarization with references |
| **Reference-free** | Rubric, judge, or entailment model | LLM-as-judge, QA consistency checks | Open-ended chat, reasoning without single reference |
Many production tasks are **reference-free**; combine with **spot checks** against retrieved evidence (RAG) or **tool-executed** ground truth (code runs, SQL results).
---
### Task-Specific vs General Evaluation
| Orientation | Examples | Role |
|-------------|----------|------|
| **General** | MMLU, HellaSwag, broad chat Elo | Capability breadth; weak signal for niche domains |
| **Task-specific** | MedQA, SWE-bench, internal enterprise QA | Directly aligned with product; smaller curated sets |
!!! tip
    For system design interviews, always mention **both**: a **broad** benchmark for regression + a **domain** eval set that mirrors customer data (with privacy safeguards).
---
## Benchmark Suites
Benchmarks **operationalize** research progress but are not interchangeable: each stresses different skills (knowledge, reasoning, coding, honesty, social bias).
### Knowledge & Reasoning (Selection / Short Answer)
| Benchmark | What it measures | Notes |
|-----------|------------------|-------|
| **MMLU** | 57 subjects, multi-choice knowledge | Massive multitask language understanding; standard for “general knowledge” |
| **HellaSwag** | Commonsense **next sentence** completion | Adversarially filtered distractors; tests plausible continuation |
| **ARC** | Science exam questions (Easy / Challenge) | Challenge set is harder; reasoning + knowledge |
| **TruthfulQA** | Tendency to imitate **false** popular beliefs | Open-ended or MC; measures **truthfulness** (avoiding imitative falsehoods) |
| **GSM8K** | Grade-school **math word problems** | Step-by-step arithmetic reasoning; chain-of-thought helps |
**MMLU in practice:** Report the **macro** average (equal weight per subject) alongside the **micro** average (equal weight per question) and per-domain breakdowns, so that gains concentrated in a few large subjects do not hide regressions in small ones. Watch for **contamination** in public leaderboards: models may have been trained or instruction-tuned on overlapping questions.
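A short sketch of the macro versus micro distinction follows. The subject names and counts are invented for illustration and are not real MMLU results.

```python
def macro_micro_accuracy(results):
    """results maps subject -> (num_correct, num_total). Returns (macro, micro, per_subject)."""
    per_subject = {s: c / t for s, (c, t) in results.items() if t > 0}
    if not per_subject:
        raise ValueError("no subject has any questions")
    # Macro: every subject counts equally, regardless of size.
    macro = sum(per_subject.values()) / len(per_subject)
    # Micro: every question counts equally, so large subjects dominate.
    total_correct = sum(c for c, t in results.values())
    total = sum(t for c, t in results.values())
    micro = total_correct / total
    return macro, micro, per_subject


# Hypothetical counts: a small subject with high accuracy, a large one with low accuracy.
results = {
    "small_subject": (90, 100),
    "large_subject": (300, 1000),
}

macro, micro, per_subject = macro_micro_accuracy(results)
print(macro)        # about 0.60
print(micro)        # about 0.35
print(per_subject)  # the breakdown that explains the gap
```

When the two averages diverge this much, the per-subject breakdown is the number worth reporting.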
### Coding
| Benchmark | What it measures |
|-----------|------------------|
| **HumanEval** | Function-level Python from docstring; **pass@k** with unit tests |
| **HumanEval+** | HumanEval problems with a much larger test suite per task (EvalPlus); check the version when citing |
| **SWE-bench** | **Real GitHub issues** — patch generation against repos; much harder than HumanEval |
### Math & STEM
| Benchmark | What it measures |
|-----------|------------------|
| **MATH** | Competition-style math problems with symbolic/numeric answers; stresses advanced reasoning beyond GSM8K |
### Medical
| Benchmark | What it measures |
|-----------|------------------|
| **MedQA** (and related USMLE-style sets) | Medical knowledge MCQs; domain-specific risk (high stakes), needs expert review beyond accuracy |
### LMSYS Chatbot Arena & Elo
**Chatbot Arena** collects **human pairwise preferences**: users see two anonymous model responses and pick the better one. Aggregate wins/losses feed an **Elo** (or Bradley–Terry) rating system.
**Why it’s influential for chat:** It reflects **real user prompts** and **holistic quality** (helpfulness, style, safety perception) better than single-reference n-gram scores.
**How Elo works (simplified):** Each model has a rating \(R\). After a match, expected score for A vs B is \(E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}\). Ratings update based on outcome vs expectation. Over many votes, strong chat models **separate** from weaker ones.
A common update after A faces B (with scores \(S_A \in \{0, 0.5, 1\}\) for loss/tie/win):
\[
R_A' = R_A + K \cdot (S_A - E_A)
\]
\(K\) controls volatility (larger in small-sample regimes or for provisional ratings). Bradley–Terry and other **pairwise preference** models are alternatives when you want probabilistic interpretation of win rates.
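The expected-score and update formulas translate into a few lines of code. In this sketch the default `k=32` and the starting ratings are assumptions chosen for illustration, not values used by any particular leaderboard.

```python
def expected_score(rating_a, rating_b):
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated (rating_a, rating_b) after one match.

    score_a is 1.0 for an A win, 0.5 for a tie, 0.0 for an A loss.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


# Equal ratings: the winner gains k/2 points.
print(elo_update(1000.0, 1000.0, score_a=1.0))

# Upset: a model rated 400 points lower wins and gains far more.
print(elo_update(1000.0, 1400.0, score_a=1.0))
```

An expected win moves ratings little, while an upset moves them a lot, which is how many noisy votes converge to a stable ordering.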
!!! note
    Arena rankings are **not** a substitute for **safety** certification or **domain** compliance: they aggregate **preference**, which can overweight verbosity or style.
### Safety & Fairness Benchmarks
| Benchmark | Focus |
|-----------|-------|
| **ToxiGen** | Implicit hate / toxic generations toward groups |
| **BBQ** (Bias Benchmark for QA) | Social bias in ambiguous vs disambiguated contexts |
| **RealToxicityPrompts** | Continuation toxicity from prompts of varying toxicity |
### Comparative Table of Major Benchmarks
| Benchmark | Format | Primary signal | Typical metric |
|-----------|--------|----------------|----------------|
| MMLU | Multi-choice | Broad knowledge | Accuracy by subject / macro avg |
| HellaSwag | Multi-choice | Commonsense NLI/continuation | Accuracy |
| ARC | Multi-choice | Science reasoning | Accuracy (Challenge) |
| TruthfulQA | MC or open | Truthfulness vs myths | MC accuracy; judge-model or BLEU/ROUGE scoring for open generation |
| HumanEval | Code + tests | Functional correctness | pass@1 / pass@10 |
| GSM8K | Short answer math | Arithmetic reasoning | Exact match / with CoT |
| MATH | Open STEM/math | Hard reasoning | Exact match |
| SWE-bench | Repo-level patches | Real software engineering | Resolve rate |
| MedQA | MC | Clinical knowledge | Accuracy |
| Chatbot Arena | Pairwise prefs | Chat quality | Elo leaderboard |
| ToxiGen / BBQ / RTP | Gen or MC | Safety / bias | Custom; harm rates |
```mermaid
flowchart LR
subgraph General["General capability"]
MMLU[MMLU / ARC]
HS[HellaSwag]
end
subgraph Domain["Domain & tooling"]
HE[HumanEval]
SWE[SWE-bench]
MED[MedQA / MATH]
end
subgraph Preference["Preference & safety"]
AR[Chatbot Arena Elo]
TQ[TruthfulQA]
SAF[ToxiGen / BBQ]
end
General --> SEL[Model + risk profile]
Domain --> SEL
Preference --> SEL
```
---
### HumanEval and pass@k (Coding)
**HumanEval** provides 164 hand-written Python problems, each with **unit tests** that are not shown to the model. Models generate a completion; you execute tests in a sandbox to mark **pass** or **fail**.
**pass@k:** the probability that at least one of *k* sampled completions passes all tests. Generate *n* ≥ *k* samples per problem, count the *c* samples that pass, and estimate:
\[
\text{pass@k} = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}
\]
This is the unbiased estimator used in the literature: it is the chance that a size-*k* subset drawn without replacement from the *n* samples contains at least one passing sample. Average it over problems to get the benchmark score.
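The estimator above is a one-liner with `math.comb`. The sketch below computes it for a single problem; the sample counts are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples drawn, c of them pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 10 samples for one problem, 3 of which pass the tests.
print(pass_at_k(n=10, c=3, k=1))  # about 0.30, the per-sample pass rate
print(pass_at_k(n=10, c=3, k=5))  # about 0.92, sampling diversity helps
```

The benchmark-level score is the mean of this value over all problems.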
| Setting | What it tells you |
|---------|-------------------|
| **pass@1** | Greedy or single-sample reliability |
| **pass@10** | Whether the model **can** solve the task with sampling diversity |
| **Larger n** | Reduces variance in pass@k estimates |
!!! tip
    In interviews, stating that **SWE-bench** exercises **repository-level** reasoning (files, tests, context) while **HumanEval** is **function-level** shows you understand the **gap** between toy coding and real software engineering.
---
## Production Evaluation Pipeline
Shipping LLMs requires the same rigor as any ML system — with extra emphasis on **subjective quality**, **long sessions**, and **safety**.
### A/B Testing for LLMs (vs Traditional A/B)
| Traditional A/B | LLM A/B |
|-----------------|---------|
| Short, atomic events (click, conversion) | Long **sessions**; one bad turn poisons perception |
| Objective KPIs | Mix of **implicit** (dwell) and **explicit** (thumbs) signals |
| Stable unit of randomization | User-level randomization still key; **carryover** if same user sees both |
| Quick power analysis | Need larger N for noisy subjective outcomes |
**Design tips:** Randomize **users**, not requests, when studying sustained behavior; **pre-register** primary metrics; watch **guardrail** violations as **co-primary** safety endpoints; use **sequential testing** cautiously with peeking corrections.
### Variance, power, and decision criteria
LLM A/B metrics (thumbs-up, session success) are usually **noisier** than click-through rates: explicit feedback is sparse, and session-level outcomes vary widely between users. That implies:
| Topic | Implication |
|-------|-------------|
| **Sample size** | You may need **orders of magnitude** more exposed users than for crisp binary funnels |
| **Multiple comparisons** | Many teams watch dozens of slices; **false discoveries** multiply without correction (Benjamini–Hochberg, Bonferroni, or pre-registered primary KPI only) |
| **CUPED / stratification** | Variance reduction using pre-experiment covariates (historical engagement) when ethical and available |
| **Weekday vs weekend** | Run for **full weeks** to capture periodicity in usage |
| **Novelty effects** | New models can look better briefly; extend duration or use cohort holdouts |
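To get a feel for the sample-size row, here is a rough calculator based on the standard normal-approximation formula for a two-sided test of two proportions. It assumes independent users, a binary outcome per user, and equal arm sizes; the baseline rates, `alpha=0.05`, and `power=0.8` are illustrative defaults, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def users_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect p_control vs p_treatment."""
    if p_control == p_treatment:
        raise ValueError("the two rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_treatment) / 2
    pooled_term = z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
    unpooled_term = z_beta * math.sqrt(
        p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    )
    n = (pooled_term + unpooled_term) ** 2 / (p_control - p_treatment) ** 2
    return math.ceil(n)


# Hypothetical thumbs-up rates: detecting a 1-point lift needs far more users than a 2-point lift.
print(users_per_arm(0.10, 0.11))
print(users_per_arm(0.10, 0.12))
```

Halving the detectable lift roughly quadruples the required sample, so small expected gains on sparse feedback signals translate into long experiments.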
!!! note
    A **non-significant** lift is not proof of “no harm.” For safety-critical products, use **guardrail** metrics with **one-sided** monitoring: any increase in severe violations can trigger rollback even when headline satisfaction is flat.
### Online Metrics
| Metric | What it captures | Caveat |
|--------|------------------|--------|
| **User satisfaction** | Thumbs, CSAT, surveys | Selection bias; angry users skew |
| **Task completion** | User reaches goal without retry | Hard to instrument for open goals |
| **Retry / reformulation rate** | User repeats or rephrases | May indicate confusion or model error |
| **Edit distance** (to final artifact) | How much users change drafts | Domain-dependent baseline |
| **Time-to-success** | Latency + quality combined | Can improve with worse outputs if users compensate |
### Guardrail Evaluation
Treat safety filters like **binary classifiers**:
| Term | Meaning |
|------|---------|
| **False positive** | Safe content blocked → hurts UX / trust |
| **False negative** | Unsafe content slips through → brand/legal risk |
Report **precision/recall** on a **labeled adversarial set** that evolves (red-team prompts, toxic paraphrases, jailbreak attempts).
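Treating the filter as a binary classifier, the report reduces to a confusion matrix. The sketch below assumes two parallel boolean lists (ground-truth labels and filter decisions); the ten example rows are made up.

```python
def guardrail_metrics(is_unsafe, was_blocked):
    """Confusion-matrix metrics for a safety filter.

    is_unsafe[i]: ground-truth label for prompt i.
    was_blocked[i]: whether the filter blocked prompt i.
    """
    if len(is_unsafe) != len(was_blocked):
        raise ValueError("label and decision lists must have the same length")
    pairs = list(zip(is_unsafe, was_blocked))
    tp = sum(1 for unsafe, blocked in pairs if unsafe and blocked)
    fp = sum(1 for unsafe, blocked in pairs if not unsafe and blocked)
    fn = sum(1 for unsafe, blocked in pairs if unsafe and not blocked)
    tn = sum(1 for unsafe, blocked in pairs if not unsafe and not blocked)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "false_negatives": fn,
    }


# Hypothetical labeled set: 4 unsafe prompts, 6 safe prompts.
is_unsafe = [True, True, True, True, False, False, False, False, False, False]
was_blocked = [True, True, True, False, True, False, False, False, False, False]

print(guardrail_metrics(is_unsafe, was_blocked))
```

Recall tracks the false-negative risk (unsafe content slipping through), while the false-positive rate tracks how often safe requests are blocked.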
### Regression Testing
- **Golden dataset:** Curated prompts with **expected properties** (must cite source X, must refuse Y, must output valid JSON).
- **Automated detection:** Nightly runs comparing metrics to **baselines**; alert on **statistically significant** drops.
- **Version pinning:** Record **model ID**, **prompt hash**, **retriever index version** for reproducibility.
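The "expected properties" in a golden row can be checked with plain code before any judge model is involved. The sketch below is one possible shape: the row keys (`must_be_json`, `must_contain`, `must_refuse`) and the refusal markers are hypothetical, and the refusal check is a crude string heuristic that a real suite would likely replace with a classifier or judge.

```python
import json

# Crude, hypothetical markers of a refusal; tune or replace for your product.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable")


def is_valid_json(text):
    """True if the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def check_row(row, output):
    """Return the list of expected properties that the model output violates."""
    failures = []
    lowered = output.lower()
    if row.get("must_be_json") and not is_valid_json(output):
        failures.append("must_be_json")
    for phrase in row.get("must_contain", []):
        if phrase.lower() not in lowered:
            failures.append(f"must_contain:{phrase}")
    if row.get("must_refuse") and not any(marker in lowered for marker in REFUSAL_MARKERS):
        failures.append("must_refuse")
    return failures


print(check_row({"must_be_json": True}, '{"status": "ok"}'))
print(check_row({"must_contain": ["[doc-12]"]}, "Refunds take five days."))
print(check_row({"must_refuse": True}, "Sure, here is how to do that."))
```

Each golden row then yields a deterministic pass or fail, which keeps nightly comparisons against the baseline free of judge noise.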
### Designing golden datasets that catch real failures
| Property | Why it matters |
|----------|----------------|
| **Stratified difficulty** | Mix easy, typical, and adversarial prompts so regressions are not masked |
| **Stable expected behavior** | Each row defines pass/fail or rubric thresholds; avoid “I know it when I see it” without anchors |
| **Domain coverage** | Include regulated wording, multilingual snippets, and long context if your product sees them |
| **Privacy** | Synthetic or scrubbed data; never copy production PII into CI |
| **Negative tests** | Prompts that **must** trigger refusal, citation-only answers, or tool calls |
| **Versioned snapshots** | Immutable dataset hash in CI; changes require review |
!!! tip
    Treat golden sets like **test suites**: small enough to run nightly, broad enough that a passing run genuinely increases confidence.
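One way to enforce the "versioned snapshots" property is to fingerprint the dataset and compare the hash in CI. This sketch assumes each row is a JSON-serializable dict with a unique `id` field; the function name and row contents are illustrative.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """SHA-256 over a canonical JSON encoding of the rows (independent of row order)."""
    canonical = json.dumps(
        sorted(rows, key=lambda row: row["id"]),
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


rows = [
    {"id": "1", "prompt": "Summarize the refund policy.", "must_contain": ["[doc-12]"]},
    {"id": "2", "prompt": "Share another user's address.", "must_refuse": True},
]

pinned = dataset_fingerprint(rows)
print(pinned)

# Reordering rows keeps the fingerprint; editing any field changes it.
print(dataset_fingerprint(list(reversed(rows))) == pinned)
```

Storing the pinned value next to the eval config means any change to the golden set shows up as an explicit diff that requires review.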
### Python: End-to-End Evaluation Pipeline Sketch
```python
"""
Production-oriented batch evaluation pipeline:
load dataset -> score with automatic + judge hooks -> aggregate -> gate.
"""
from __future__ import annotations

import csv
import json
import statistics
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Optional


@dataclass
class EvalExample:
    id: str
    prompt: str
    reference: Optional[str]
    model_output: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class EvalReport:
    metrics: Dict[str, float]
    failures: List[Dict[str, Any]]


def load_examples(path: Path) -> List[EvalExample]:
    rows: List[EvalExample] = []
    with path.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row in reader:
            rows.append(
                EvalExample(
                    id=row["id"],
                    prompt=row["prompt"],
                    reference=row.get("reference") or None,
                    model_output=row["model_output"],
                    metadata=json.loads(row.get("metadata") or "{}"),
                )
            )
    return rows


def run_automatic_metrics(ex: EvalExample) -> Dict[str, float]:
    out: Dict[str, float] = {}
    if ex.reference:
        # Plug in ROUGE / BERTScore from earlier helpers
        out["rougeL_f1"] = 0.42  # placeholder value, not a real score
    return out


def run_judge(ex: EvalExample, judge_fn: Callable[[EvalExample], Dict[str, int]]) -> Dict[str, int]:
    return judge_fn(ex)


def aggregate_numeric(values: Iterable[float]) -> float:
    vals = list(values)
    return statistics.mean(vals) if vals else float("nan")


def evaluate_dataset(
    examples: List[EvalExample],
    judge_fn: Optional[Callable[[EvalExample], Dict[str, int]]] = None,
    thresholds: Optional[Dict[str, float]] = None,
) -> EvalReport:
    thresholds = thresholds or {}
    all_metrics: Dict[str, List[float]] = {}
    failures: List[Dict[str, Any]] = []
    for ex in examples:
        for k, v in run_automatic_metrics(ex).items():
            all_metrics.setdefault(k, []).append(v)
        if judge_fn:
            # Call the judge once per example; reuse the result for metrics and gating
            j = run_judge(ex, judge_fn)
            for k, v in j.items():
                all_metrics.setdefault(f"judge_{k}", []).append(float(v))
            # Example gate: minimum judge safety
            if j.get("safety", 5) < 4:
                failures.append({"id": ex.id, "reason": "low_safety", "scores": j})
    summary = {k: aggregate_numeric(v) for k, v in all_metrics.items()}
    for name, thr in thresholds.items():
        value = summary.get(name)
        # A missing or NaN metric fails the gate instead of passing silently
        if value is None or not value >= thr:
            failures.append({"id": "__global__", "reason": f"{name}_below_threshold", "value": value})
    return EvalReport(metrics=summary, failures=failures)


# Example usage:
# examples = load_examples(Path("golden_set.csv"))
# report = evaluate_dataset(examples, judge_fn=my_judge, thresholds={"judge_safety": 4.0})
# assert not report.failures
```
!!! warning
    Treat **thresholds** as products of risk analysis, not universal constants. A coding assistant might weight correctness over brevity; a therapy-adjacent bot might invert that priority entirely.
---
## RAG-Specific Evaluation
RAG systems fail in **three** separable places: retrieval, grounding, and generation.
### Faithfulness (Groundedness)
**Question:** Are claims in the answer **supported by** the retrieved context (not merely plausible from world knowledge)?
**Approaches:** Natural Language Inference (NLI) style **entailment** checks per claim; LLM-as-judge with **quote-required** rubrics; sentence-level alignment.
### Relevance (Retrieval Quality)
**Question:** Did we fetch chunks that help answer the user?
**Metrics:** nDCG, MRR, Recall@k **if** you have labeled relevant docs; otherwise **LLM relevance labels** or **pseudo-labels** from click-through.
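When labeled relevant documents exist, these retrieval metrics are simple to compute per query. The sketch below uses made-up document IDs and graded relevance values; MRR is the mean of `reciprocal_rank` over queries.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(set(relevant_ids))


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked_ids, gains, k):
    """nDCG@k; gains maps doc id -> graded relevance (missing ids count as 0)."""
    dcg = sum(
        gains.get(doc_id, 0) / math.log2(position + 1)
        for position, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    ideal_gains = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(position + 1) for position, g in enumerate(ideal_gains, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query: the retriever ranks an irrelevant chunk first.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
gains = {"d1": 3, "d2": 1}

print(recall_at_k(ranked, relevant, k=3))  # 0.5: only d1 is in the top 3
print(reciprocal_rank(ranked, relevant))   # 0.5: first relevant hit at rank 2
print(ndcg_at_k(ranked, gains, k=4))       # below 1.0 because the ordering is not ideal
```

Recall@k says whether the needed evidence was fetched at all, while MRR and nDCG also penalize burying it below irrelevant chunks.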
### Answer Correctness
**Question:** Is the final answer **factually** correct w.r.t. user intent and authoritative sources?
For open domains, combine **reference answers**, **tool verification**, or **human** review.
### Citations and attribution (enterprise RAG)
When answers must include **sources**, evaluate separately:
| Check | Question |
|-------|----------|
| **Citation precision** | Does each cited span actually support the sentence it is attached to? |
| **Citation recall** | Were all non-obvious claims tied to a source where policy requires it? |
| **Attribution correctness** | Are document IDs / URLs stable and ACL-valid for the user? |
| **Hallucinated refs** | Does the model invent titles, sections, or URLs? |
These checks are often implemented with **LLM judges** constrained to quote spans, or with **string overlap** between answer sentences and retrieved chunks plus NLI.
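Two of the checks in the table, hallucinated references and missing citations, can be approximated structurally before any NLI or judge step. This sketch assumes a hypothetical inline citation format such as `[doc-12]` and a naive sentence splitter; it does not verify that a cited chunk actually supports the sentence.

```python
import re

# Hypothetical citation format: [doc-<number>] placed inside the sentence it supports.
CITATION_PATTERN = re.compile(r"\[(doc-\d+)\]")


def cited_ids(answer):
    """All document IDs cited in the answer, in order of appearance."""
    return CITATION_PATTERN.findall(answer)


def hallucinated_citations(answer, retrieved_ids):
    """Cited IDs that were never retrieved for this request."""
    return sorted(set(cited_ids(answer)) - set(retrieved_ids))


def uncited_sentences(answer):
    """Sentences that carry no citation marker (naive split on ., !, ?)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not CITATION_PATTERN.search(s)]


answer = (
    "Refunds take 5 days [doc-1]. "
    "Premium users get priority [doc-9]. "
    "Contact support for help."
)
retrieved = {"doc-1", "doc-2"}

print(hallucinated_citations(answer, retrieved))  # ['doc-9']
print(uncited_sentences(answer))                  # ['Contact support for help.']
```

A cheap structural pass like this can run on every response, with NLI or judge-based support checks reserved for the cited sentences.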
### RAGAS-Style Dimensions
**RAGAS** (Retrieval Augmented Generation Assessment) popularized reference-free or **partially** reference-based metrics using LLM prompts:
| Dimension | Intuition |
|-----------|-----------|
| **Faithfulness** | Answer claims can be inferred from context |
| **Answer relevance** | Answer addresses the user question |
| **Context precision** | Retrieved context is focused (low noise) |
| **Context recall** | Context covers what’s needed for the answer |
!!! tip
    In interviews, naming **faithfulness vs relevance** separation often earns credit: it shows you know **where** hallucinations enter the pipeline.
```mermaid
flowchart TB
Q[User query] --> R[Retriever]
R --> C[Contexts]
Q --> G[Generator]
C --> G
G --> A[Answer]
C --> MF[Faithfulness<br/>vs contexts]
A --> MF
A --> MA[Answer relevance<br/>vs query]
Q --> MA
R --> MR["Context relevance<br/>/ recall / precision"]
```
### Python: RAGAS-Style Prompted Checks (Illustrative)
```python
"""
Illustrative RAGAS-style evaluation using LLM prompts.
Prefer the `ragas` library in production; this shows the underlying logic.
"""
from __future__ import annotations

from dataclasses import dataclass
from typing import List


@dataclass
class RAGSample:
    question: str
    contexts: List[str]
    answer: str


FAITHFULNESS_PROMPT = """Given contexts and an answer, rate from 0-1 whether
each sentence in the answer is supported by the contexts.
Output JSON: {"score": float, "unsupported_sentences": [str]}"""

ANSWER_REL_PROMPT = """Rate how well the answer addresses the question (0-1).
Output JSON: {"score": float}"""

CTX_PRECISION_PROMPT = """Rate what fraction of retrieved sentences are useful for answering (0-1).
Output JSON: {"score": float}"""

CTX_RECALL_PROMPT = """Given the question and contexts, rate coverage of information needed (0-1).
Output JSON: {"score": float}"""


def llm_json_call(system: str, user: str) -> dict:
    raise NotImplementedError("Wire to your LLM API.")


def format_contexts(contexts: List[str]) -> str:
    """Render contexts as numbered passages instead of a raw Python list repr."""
    return "\n\n".join(f"[{i}] {c}" for i, c in enumerate(contexts, start=1))


def faithfulness_score(sample: RAGSample) -> float:
    user = f"Contexts:\n{format_contexts(sample.contexts)}\n\nAnswer:\n{sample.answer}"
    return float(llm_json_call(FAITHFULNESS_PROMPT, user)["score"])


def answer_relevance(sample: RAGSample) -> float:
    user = f"Question:\n{sample.question}\n\nAnswer:\n{sample.answer}"
    return float(llm_json_call(ANSWER_REL_PROMPT, user)["score"])


def context_precision(sample: RAGSample) -> float:
    user = f"Question:\n{sample.question}\n\nContexts:\n{format_contexts(sample.contexts)}"
    return float(llm_json_call(CTX_PRECISION_PROMPT, user)["score"])


def context_recall(sample: RAGSample) -> float:
    user = f"Question:\n{sample.question}\n\nContexts:\n{format_contexts(sample.contexts)}"
    return float(llm_json_call(CTX_RECALL_PROMPT, user)["score"])


def ragas_aggregate(sample: RAGSample) -> dict[str, float]:
    return {
        "faithfulness": faithfulness_score(sample),
        "answer_relevance": answer_relevance(sample),
        "context_precision": context_precision(sample),
        "context_recall": context_recall(sample),
    }
```
**Using the real RAGAS library** (recommended): install `ragas` and wire your LLM/embeddings; it implements robust prompts and aggregations beyond this skeleton.
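
One way the library-style aggregations differ from a single 0-1 judge rating is that they score per item and then combine. The sketch below assumes an LLM (or a human) has already produced a supported/unsupported verdict per answer claim and a relevant/irrelevant verdict per retrieved chunk; the function names are hypothetical, and the formulas follow the commonly described definitions (supported-claim ratio, rank-weighted precision), so verify them against the `ragas` version you install.

```python
from __future__ import annotations


def faithfulness_from_verdicts(claim_supported: list[bool]) -> float:
    """Fraction of answer claims judged as supported by the retrieved context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def context_precision_at_k(chunk_relevant: list[bool]) -> float:
    """Rank-aware precision: mean of precision@k over the ranks holding a relevant chunk.

    `chunk_relevant[i]` is the verdict for the chunk retrieved at rank i + 1.
    """
    hits = 0
    weighted_sum = 0.0
    for rank, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted_sum / hits if hits else 0.0
```

The rank weighting means a single relevant chunk at rank 3 scores lower than the same chunk at rank 1, which a flat "fraction useful" rating cannot express.
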
---
## Evaluation Pitfalls and Anti-Patterns
| Pitfall | Why it hurts | Mitigation |
|---------|--------------|------------|
| **Benchmark gaming / contamination** | Test data leaks into training, inflating scores | Date cutoffs, decontamination scripts, **held-out** internal sets |
| **Single-metric obsession** | Optimizing for BLEU alone can harm fluency and helpfulness | **Dashboard** of metrics + human spot checks |
| **Ignoring safety** | A model can score high on MMLU and still produce toxic outputs | Parallel **safety** benchmarks + red teaming |
| **Static eval on dynamic models** | Prompt/model updates invalidate baselines | Versioned golden sets; **continuous** eval |
| **Position bias in LLM judges** | Judges favor an answer because of its slot, leading to wrong comparative conclusions | Swap positions, multiple judges, calibrate vs humans |
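
A decontamination script, as listed in the first row, is often just an n-gram overlap check between eval items and training text. The following is a minimal sketch under simple assumptions: whitespace tokenization, exact matching, and a hypothetical default of `n = 8`. Real pipelines typically add normalization and fuzzy matching.

```python
from __future__ import annotations


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All lowercased word n-grams in `text`; empty if the text has fewer than n words."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def find_contaminated(eval_items: list[str], corpus_docs: list[str], n: int = 8) -> list[int]:
    """Indices of eval items that share at least one word n-gram with the corpus."""
    corpus_ngrams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= word_ngrams(doc, n)
    return [
        i for i, item in enumerate(eval_items)
        if word_ngrams(item, n) & corpus_ngrams
    ]


# Usage with toy data and a short n so the overlap is visible.
eval_items = ["what is the capital of france", "explain gradient descent briefly"]
corpus = ["trivia dump: what is the capital of france answer paris"]
flagged = find_contaminated(eval_items, corpus, n=4)
print(f"{len(flagged)}/{len(eval_items)} eval items overlap the corpus: {flagged}")
```

Items shorter than `n` words are never flagged by this check, so short prompts need a separate exact-match pass.
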
!!! warning
    **Leaderboard chasing** without domain validation is a common failure mode in GenAI product teams, especially in enterprise RAG, where retrieval dominates perceived quality.
---
## How This Connects to System Design
| System type | Evaluation emphasis |
|-------------|---------------------|
| **Chatbot** | Arena-style preferences, session success, safety, latency |
| **RAG / enterprise search** | Faithfulness, citation accuracy, retrieval recall@k, ACL correctness |
| **Code assistant** | pass@k, SWE-bench-style tasks, static analysis, user edit distance |
| **Agents** | Task completion across **tool calls**, error recovery, cost per task |
| **Content moderation** | Precision/recall/FPR/FNR on harm classes; adversarial robustness |
```mermaid
mindmap
  root((GenAI System))
    Offline
      Benchmarks
      Golden sets
      LLM judges
    Online
      A/B KPIs
      Guardrails
      Drift monitors
    Domain
      Med/legal
      Internal data
    Safety
      Red team
      Bias suites
```
---
## Interview Tips (Google-Style “How Would You Evaluate This?”)
Interviewers expect **structured**, **multi-layer** answers — not a single metric.
1. **Clarify the task and risk:** factual Q&A vs creative writing vs code; regulated or not.
2. **Offline first:** a curated **golden** set + public benchmarks where relevant + a **domain** slice.
3. **Decompose metrics:** correctness, helpfulness, hallucination/faithfulness, safety, latency/cost.
4. **Human vs automatic:** when each is mandatory; **LLM-as-judge** caveats (bias, calibration).
5. **Online:** A/B design, primary vs guardrail metrics, **long-session** effects.
6. **RAG:** explicitly mention **retrieval** quality separate from **generation**.
7. **Operationalization:** regression suites in CI, versioning, dashboards, incident loops.
8. **Failure modes:** what regressions would look like (silent hallucination vs retrieval miss vs safety slip).
9. **Cost:** budget for evaluation during training vs at inference time; e.g., reserve LLM judges for offline batch runs when per-request judging is too expensive.
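
The operationalization step (regression suites in CI) can be sketched as a gate that compares a candidate's metrics with a stored baseline. Everything here is illustrative: the `Gate` fields, metric names, and tolerances are assumptions, not a prescribed schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    metric: str
    higher_is_better: bool
    max_regression: float  # absolute amount the metric may worsen before failing


def check_gates(
    baseline: dict[str, float],
    candidate: dict[str, float],
    gates: list[Gate],
) -> list[str]:
    """Return one human-readable failure message per violated gate."""
    failures = []
    for gate in gates:
        if gate.metric not in baseline or gate.metric not in candidate:
            failures.append(f"{gate.metric}: missing from baseline or candidate")
            continue
        delta = candidate[gate.metric] - baseline[gate.metric]
        # Express the change as "how much worse", whatever the metric's direction.
        regression = -delta if gate.higher_is_better else delta
        if regression > gate.max_regression:
            failures.append(
                f"{gate.metric}: regressed by {regression:.3f} "
                f"(limit {gate.max_regression:.3f})"
            )
    return failures
```

In CI, a non-empty result would fail the build; quality, latency, and safety metrics can share one gate list so that a safety regression blocks a release just as a quality regression does.
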
### Phrases that signal maturity
| Instead of… | Prefer… |
|-------------|---------|
| “We’ll use accuracy.” | “We’ll use **task success** + **human/LLM rubric** + automatic proxies.” |
| “BLEU will tell us.” | “BLEU is a **sanity check** for reference-based slices; chat quality needs **preference** or **task** metrics.” |
| “The bigger model wins.” | “We’ll **calibrate** judges, run **pairwise** with debiasing, and validate on **domain** sets.” |
| “We tested on the test set.” | “We keep a **frozen** golden set, monitor **contamination**, and track **prompt/version** hashes.” |
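
"Pairwise with debiasing" usually means, at minimum, judging each pair twice with the answer positions swapped and accepting a winner only when both orderings agree. A minimal sketch, assuming a hypothetical `judge(question, shown_first, shown_second)` callable that returns `"A"`, `"B"`, or `"tie"` for the two slots:

```python
from typing import Callable, Literal

SlotVerdict = Literal["A", "B", "tie"]
Winner = Literal["1", "2", "tie"]


def debiased_compare(
    judge: Callable[[str, str, str], SlotVerdict],
    question: str,
    answer_1: str,
    answer_2: str,
) -> Winner:
    """Judge both orderings; return a winner only if the two verdicts agree."""
    first = judge(question, answer_1, answer_2)   # answer_1 occupies slot A
    second = judge(question, answer_2, answer_1)  # answer_1 occupies slot B

    # Map slot verdicts back to the underlying answers.
    first_winner: Winner = {"A": "1", "B": "2", "tie": "tie"}[first]
    second_winner: Winner = {"A": "2", "B": "1", "tie": "tie"}[second]

    return first_winner if first_winner == second_winner else "tie"
```

A judge that always prefers slot A produces contradictory verdicts under the swap, so the comparison collapses to a tie instead of a false win. Tracking how often the two orderings disagree is itself a useful measure of judge position bias.
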
### Red flags interviewers listen for
- One number to rule them all (especially **perplexity** or **BLEU** for chat).
- **No** safety or abuse evaluation for user-facing systems.
- Confusing **retrieval** quality with **generation** quality in RAG.
- Ignoring **latency** and **cost** as part of the evaluation story for scaled systems.
!!! note
    Strong candidates also mention **what they would not do**, e.g., “We won’t rely on BLEU alone for chat quality.” Showing judgment beats naming ten acronyms.
---
## Quick Reference Card
| If you only remember one thing… | Remember this |
|----------------------------------|---------------|
| Open-ended generation | No single accuracy — use **multi-metric** + **human/LLM** judgment |
| RAG | Split **retrieval**, **faithfulness**, **answer quality** |
| Production | **Offline gates** + **online** validation + **safety** co-primary |
| Benchmarks | Each tests different skills — **compose** them, don’t cherry-pick one |
| Arena / Elo | **Human preference** for holistic chat quality — not safety certification |
---
### Further Reading (Pointers)
- **BLEU / ROUGE / BERTScore**: BLEU (Papineni et al., 2002) measures n-gram precision between generated and reference text and was originally designed for machine translation. ROUGE (Lin, 2004) measures recall-oriented overlap and was designed for summarization. Both are known to correlate poorly with human judgment on open-ended generation. BERTScore (Zhang et al., 2020) targets this limitation by computing semantic similarity from contextual embeddings instead of surface-level token matching. Knowing when each metric is appropriate (and when none suffices) is critical for LLM evaluation system design.
- **LMSYS Chatbot Arena**: The Arena popularized crowdsourced pairwise comparison (users choose between two anonymized model outputs) with Elo-style rating aggregation for ranking LLMs on open-ended tasks. The methodology matters because automated metrics often miss nuanced quality differences in chat; human preference is the usual reference signal, and Elo-style ratings give a principled way to aggregate noisy pairwise judgments into a global ranking.
- **RAGAS documentation**: RAGAS (Retrieval-Augmented Generation Assessment) provides metrics designed specifically for RAG pipelines: faithfulness (is the answer supported by the retrieved context?), answer relevance (does it address the question?), and context precision/recall (did retrieval find the right documents?). These decomposed metrics matter because RAG failures can come from retrieval, generation, or both, and a single end-to-end metric cannot show which component is broken.
- **TruthfulQA**: Lin et al. created TruthfulQA to measure whether LLMs give truthful answers rather than repeating popular misconceptions. The benchmark showed that, among the models tested, larger models were often *less* truthful, plausibly because they imitate common falsehoods in the training data more closely. It is a standard reference point for evaluating honesty and hallucination, a quality dimension that standard accuracy benchmarks largely miss.
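
To make the Elo aggregation concrete, here is a minimal online Elo update over pairwise outcomes. The starting rating of 1000 and `k = 32` are conventional but arbitrary assumptions, and this is not a description of the Arena's production pipeline; online Elo also depends on the order of battles, which is one reason leaderboards may instead fit a statistical model over all comparisons at once.

```python
from __future__ import annotations


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(
    ratings: dict[str, float],
    model_a: str,
    model_b: str,
    score_a: float,  # 1.0 = A preferred, 0.0 = B preferred, 0.5 = tie
    k: float = 32.0,
    initial: float = 1000.0,
) -> None:
    """Update both models' ratings in place after one pairwise vote."""
    r_a = ratings.setdefault(model_a, initial)
    r_b = ratings.setdefault(model_b, initial)
    exp_a = expected_score(r_a, r_b)
    ratings[model_a] = r_a + k * (score_a - exp_a)
    ratings[model_b] = r_b + k * ((1.0 - score_a) - (1.0 - exp_a))


# Usage: replay a small, made-up list of votes.
ratings: dict[str, float] = {}
votes = [("model-x", "model-y", 1.0), ("model-x", "model-z", 0.5), ("model-y", "model-z", 0.0)]
for a, b, score in votes:
    update_elo(ratings, a, b, score)
print(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))
```

Each update moves the two ratings by equal and opposite amounts, so the total rating mass stays constant while the ordering reflects accumulated preferences.
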
This page is a **fundamentals** layer — pair it with [Enterprise RAG](../genai_ml_system_design/enterprise_rag.md) and [LLM Chatbot](../genai_ml_system_design/llm_chatbot.md) system design notes for end-to-end stories.