Evaluation and Benchmarking

Evaluation is how you know whether a model is **fit for deployment** and whether a **new checkpoint** actually improves the behaviors you care about. Unlike classic supervised learning with a single held-out label distribution, LLMs are judged on **open-ended generation**, **multi-turn dialogue**, **tool use**, and **subjective** qualities like helpfulness. Without disciplined benchmarks, teams ship models that ace **proxy metrics** while failing **real users**—or regress silently when data mixt

spawn08

May 2, 2026

0 upvotes

0 downloads

0 views

ai llm rag prompt eval

View source

# Evaluation and Benchmarking ## Why This Matters for LLMs Evaluation is how you know whether a model is **fit for deployment** and whether a **new checkpoint** actually improves the behaviors you care about. Unlike classic supervised learning with a single held-out label distribution, LLMs are judged on **open-ended generation**, **multi-turn dialogue**, **tool use**, and **subjective** qualities like helpfulness. Without disciplined benchmarks, teams ship models that ace **proxy metrics** while failing **real users**—or regress silently when data mixtures change. Standard suites such as **MMLU**, **HumanEval**, and **GSM8K** provide **comparable** numbers across labs, but each has **blind spots**: contamination (test snippets appearing in training corpora), **format sensitivity** (models that excel only with a specific prompt template), and **narrow** skill coverage. Understanding **automated** metrics (perplexity, pass rates), **human** preference protocols (Elo from pairwise battles), and **LLM-as-judge** biases (position bias, self-preference) is essential for any ML or applied research role—interviewers ask “**how would you evaluate your model?**” in almost every loop. Finally, **evaluation design** is a systems problem: you must balance **statistical power** (enough items per slice), **latency** (how quickly CI runs), **cost** (human labels, API calls to judges), and **governance** (PII in prompts, reproducibility). A mature answer connects **offline** regression gates to **online** A/B tests and **incident** playbooks when metrics and user satisfaction diverge. --- ## Core Concepts ### Why Is LLM Evaluation Hard? Unlike image classification with a fixed label set, many LLM outputs are **partially correct**, **stylistically** constrained, or **subjective**. Let \(y\) be a reference string and \(\hat{y}\) a model output on input \(x\). A naive exact-match score is: \[ s_{\text{exact}}(y, \hat{y}) = \mathbf{1}[y = \hat{y}] . \] !!! math-intuition "In Plain English" **Exact match** is a **hammer**: it ignores paraphrases, formatting, and multiple valid code styles. It is still used where tasks demand it (e.g. short numeric answers on GSM8K) because it is **unambiguous**—but it **underrates** partially correct reasoning. **Memorization** means the model may answer correctly for **wrong** reasons—having seen the benchmark in training. **Benchmark contamination** inflates scores; mitigations include **n-gram overlap** checks, **canary** strings, and **dynamic** benchmarks. A **calibration** view: even when average accuracy rises, **slice** metrics (languages, domains, difficulty bins) may fall—always report **variance** and **worst-group** performance. ### Standard Benchmarks (Overview) | Benchmark | What it measures | Format | Primary metric | |-----------|------------------|--------|------------------| | **MMLU** | Broad knowledge (57 subjects) | Multiple choice | Accuracy | | **HellaSwag** | Commonsense continuation | Sentence completion | Accuracy | | **ARC** | Science reasoning (Challenge/Easy) | Multiple choice | Accuracy | | **HumanEval** | Python function synthesis | Code completion | pass@\(k\) | | **GSM8K** | Grade-school math word problems | Free-form final number | Exact match | | **MATH** | Competition mathematics | Free-form | Exact match | | **TruthfulQA** | Resistance to false popular beliefs | MC / generation | MC accuracy, BLEURT, etc. | | **WinoGrande** | Coreference / commonsense | Fill-in | Accuracy | | **MT-Bench** | Multi-turn chat quality | Open-ended | LLM judge score | Rows are **illustrative**—always check the **official** task definition and prompt template for the version you run. ### Perplexity and Cross-Model Comparisons Perplexity is \(\mathrm{PPL} = \exp(L)\) where \(L\) is average **negative log-likelihood** (in nats or bits, depending on convention) on a test corpus. For token sequence \(x_{1:T}\): \[ L = -\frac{1}{T}\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) . \] !!! math-intuition "In Plain English" **Lower** perplexity means the model is **less surprised** by each next token. Comparing perplexity across models **only** makes sense with the **same tokenizer** and **same evaluation text**—different tokenizations change \(T\) and the probability mass split. ### pass@\(k\) for Code (HumanEval-style) Following the **unbiased** estimator from **Codex** evaluation (Chen et al.), for each problem generate \(n \ge k\) samples, count \(c\) correct, and estimate: \[ \text{pass@}k = \mathbb{E}\left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right]. \] !!! math-intuition "In Plain English" If you drew \(k\) samples **without** looking, the chance **at least one** passes is **not** \(1-(1-p)^k\) unless you know the per-sample pass rate **and** independence. The combinatorial formula accounts for **finite** \(n\) and **empirical** \(c\) per task—avoid the naive bootstrap that **double-counts** easy problems. !!! example "Worked Example: pass@\(k\) from \(n=5\) samples" Suppose for one HumanEval problem you generate **5** completions and **2** pass the unit tests (\(n=5\), \(c=2\)). For \(k=1\): \[ 1 - \frac{\binom{5-2}{1}}{\binom{5}{1}} = 1 - \frac{\binom{3}{1}}{5} = 1 - \frac{3}{5} = 0.4 . \] For \(k=2\): \[ 1 - \frac{\binom{3}{2}}{\binom{5}{2}} = 1 - \frac{3}{10} = 0.7 . \] Intuition: with **two** correct in five, drawing **two** without replacement has high odds to catch **at least one** correct. Reporting **pass@100** from only \(n=5\) samples is **not** valid—you need \(n \ge k\) and typically **much** larger \(n\) for high-\(k\) estimates. ### BLEU, ROUGE, and n-Gram Overlap **BLEU** compares **n-gram precision** between hypothesis and references with brevity penalty. A simplified unigram precision is: \[ p_1 = \frac{\sum_{w} \min(h_w, r_w)}{\sum_w h_w} \] where \(h_w\) is hypothesis unigram count and \(r_w\) reference count (with **clipping**). !!! math-intuition "In Plain English" BLEU rewards **overlapping words**; it correlates weakly with **human** judgment on creative tasks. It can still help **sanity-check** summarization or translation **regressions** when used **within** a fixed pipeline. **ROUGE-L** measures longest common subsequence—better for **fluency** overlap than BLEU in some settings. Neither should be the **sole** chat metric. ### BERTScore and Embedding Similarity **BERTScore** compares token embeddings from a pretrained encoder: \[ F_{\text{BERT}} = \frac{1}{|x|}\sum_{i}\max_j \cos(e_{x_i}, e_{\hat{y}_j}) . \] !!! math-intuition "In Plain English" Each token in the reference finds its **best semantic match** in the hypothesis (recall-oriented component); precision swaps roles. It captures **paraphrase** better than n-grams but depends on the **encoder** biases and is **slow** at scale. ### Automated Harness: lm-evaluation-harness **EleutherAI’s `lm-evaluation-harness`** standardizes dataset loading, prompt formatting, and metric aggregation. Conceptually, for task set \(\mathcal{T}\), a run produces scores \(\{s_t\}_{t\in\mathcal{T}}\) with optional **normalization** to a **higher-is-better** scale. \[ s_{\text{macro}} = \frac{1}{|\mathcal{T}|}\sum_{t\in\mathcal{T}} s_t . \] !!! math-intuition "In Plain English" **Macro** averaging treats each benchmark equally; **micro** averaging pools all items—choose based on whether you care about **per-task** balance or **per-example** balance. ### Human Evaluation and Elo from Pairwise Battles **Chatbot Arena** (LMSYS) collects **blind** pairwise comparisons. If model A beats B, update strengths \(R_A, R_B\) (Elo-style): \[ E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \quad R_A \leftarrow R_A + K (S_A - E_A) \] where \(S_A \in \{0, 0.5, 1\}\) is the outcome score and \(K\) is a step size. !!! math-intuition "In Plain English" **Elo** turns **head-to-head** wins into a **single** number line—useful for **relative** ranking, but **not** an absolute “intelligence” score. Non-transitive user preferences can distort rankings. ### Inter-Annotator Agreement **Krippendorff’s \(\alpha\)** generalizes agreement beyond chance for multiple raters and missing data. A simplified **Cohen’s \(\kappa\)** for two raters is: \[ \kappa = \frac{p_o - p_e}{1 - p_e} \] where \(p_o\) is observed agreement and \(p_e\) expected by chance. !!! math-intuition "In Plain English" High benchmark accuracy is **meaningless** if humans **disagree** on labels—report **agreement** to show your **human eval** is measuring something **stable**. ### LLM-as-Judge A strong model **\(J\)** scores a candidate answer \(\hat{y}\) for prompt \(x\) on rubric dimensions (helpfulness, correctness, verbosity). A linear scoring template: \[ \text{score}(x,\hat{y}) = w^\top \phi(J(x,\hat{y})) \] where \(\phi\) extracts judge logits or Likert outputs. !!! math-intuition "In Plain English" Judges are **fast** and **scalable** but biased: **position bias** (preferring the first answer), **verbosity bias**, **self-bias** (favoring own style). Mitigations: **swap** positions, **mask** model identities, use **reference** answers. ### Evaluation Pipeline Design A robust pipeline combines: (1) **task-specific** automatic metrics, (2) **slice** dashboards, (3) **human** spot checks, (4) **online** experiments. The following worked example ties steps to numbers. !!! example "Worked Example: Building an Eval Pipeline" 1. **Define metrics:** For a **customer-support** bot, track **resolution rate** (human), **citation** accuracy against KB (automatic), and **toxicity** score (classifier). Set **thresholds**: e.g. toxicity **< 0.05** prevalence at 95th percentile. 2. **Golden set:** Create **300** curated dialogs with **approved** answers and **disallowed** behaviors. Stratify by **intent** (refund, bug, account) with **100** each. 3. **Regression suite:** Weekly run **MMLU** subset (STEM only) + **HumanEval** + **in-house** 300-dialog set. Example: baseline **MMLU-STEM** **0.62**, candidate **0.63** (Δ **+0.01**)—within **sampling noise** if CI width ±**0.02**; do **not** ship on that alone. 4. **LLM judge:** Sample **50** dialogs; judge **pairwise** (A vs B) with **position swap**—if A wins **60%** on original order but **45%** after swap, investigate **position bias** before trusting **55%** net win. 5. **Online:** Run **5%** traffic A/B for **two** weeks; primary metric **user thumbs-up** rate (**+1.2%** lift) with **no** increase in **escalation** rate to humans. ??? deep-dive "Deep Dive: Contamination Audits" Contamination detection blends **n-gram** overlap, **embedding** similarity, and **manual** review of nearest neighbors in training corpora. There is no perfect test—**dynamic** benchmarks (fresh questions, live APIs) reduce **static** leakage concerns. ??? deep-dive "Deep Dive: LLM Judge Calibration" Align judge scores with human labels via **Platt scaling** or **isotonic** regression on a **pilot** set. Report **calibration** curves—judges can be **sharp** but **wrong** on out-of-domain styles. ### MT-Bench and Multi-Turn Scoring **MT-Bench** evaluates **multi-turn** instruction following with **turn-specific** questions. A simplified aggregate score averages turn scores \(s_1,\ldots,s_T\) for a dialog: \[ S = \frac{1}{T}\sum_{t=1}^{T} s_t . \] !!! math-intuition "In Plain English" **Multi-turn** tasks punish models that **drift** or **contradict** earlier turns—single-turn benchmarks miss this failure mode. When judges are LLMs, **pairwise** comparisons per turn can reduce **absolute** score drift. ### GSM8K and Exact-Match Extraction **GSM8K** often scores answers by **extracting** the final numeric result and comparing to ground truth \(a^\star\). Let \(\text{extract}(\hat{y})\) be a deterministic parser: \[ s_{\text{GSM}} = \mathbf{1}[\text{extract}(\hat{y}) = a^\star]. \] !!! math-intuition "In Plain English" A model can show **correct reasoning** yet lose the point if the **final line** is misformatted—teams often add **regex** normalization and **sympy** parsing for robustness. !!! example "Worked Example: Parsing Risk on GSM8K-Style Items" Ground truth: **42**. Model output: “The answer is **42.0** dollars.” A strict string match fails; a **numeric** parse succeeds. If three models output **41**, **42**, **43** under majority vote, the **vote** is wrong—showing **aggregation** must happen on **parsed** numbers, not raw strings. Always log **failure modes** (division error, wrong unit) separately from **parse** errors. ### Statistical Significance at Benchmark Scale Let \(\hat{p}\) be the empirical accuracy on \(n\) i.i.d. items with true rate \(p\). A **normal** approximation to the **95%** confidence interval is: \[ \hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} . \] !!! math-intuition "In Plain English" On **500** MMLU items, if accuracy moves from **0.70** to **0.73**, the interval width is roughly \(1.96\sqrt{0.7\cdot 0.3/500} \approx 0.04\). The lift may be **noise**—report **CIs** whenever you compare checkpoints. --- ## Code Install dependencies as needed: ```bash pip install numpy torch transformers bert-score tqdm ``` The script below demonstrates **pass@\(k\)** estimation, a **toy** LLM-as-judge prompt skeleton (local small models), and **BERTScore** when available. ```python """ evaluation_examples.py — self-contained demos for pass@k, judge template, BERTScore. Requires: pip install numpy torch transformers bert-score tqdm """ from __future__ import annotations import math import numpy as np # Optional: BERTScore pulls models on first use try: from bert_score import score as bert_score except ImportError: bert_score = None def comb(n: int, k: int) -> float: if k < 0 or k > n: return 0.0 return float(math.comb(n, k)) def pass_at_k(n: int, c: int, k: int) -> float: """Unbiased pass@k estimate for one problem: Chen et al. Codex paper.""" if n < k: raise ValueError("need n >= k") if comb(n, k) == 0: return 0.0 return 1.0 - comb(n - c, k) / comb(n, k) def estimate_pass_at_k_aggregate(results: list[tuple[int, int]], k: int) -> float: """ results: list of (n_samples, c_correct) per problem. Macro average of per-problem pass@k. """ vals = [] for n, c in results: vals.append(pass_at_k(n, c, k)) return float(np.mean(vals)) def judge_prompt(task: str, response_a: str, response_b: str) -> str: """Template for pairwise LLM-as-judge (swap A/B between calls).""" return f"""You are an impartial evaluator. Pick which response better satisfies the task. Task: {task} Response A: {response_a} Response B: {response_b} Answer with a single letter: A or B.""" def run_bertscore_demo(cands: list[str], refs: list[str]) -> None: if bert_score is None: print("bert-score not installed; skip BERTScore demo.") return p, r, f1 = bert_score(cands, refs, lang="en", verbose=False) print("BERTScore F1 (per candidate):", f1.numpy()) def main() -> None: rng = np.random.default_rng(7) # --- pass@k toy aggregation: 3 problems --- scenarios = [ (10, 3), # n, c (10, 1), (10, 6), ] for k in (1, 5): agg = estimate_pass_at_k_aggregate(scenarios, k) print(f"Macro pass@{k} on toy scenarios: {agg:.4f}") # Show single-problem numbers print("Single problem n=10,c=3:", pass_at_k(10, 3, 5)) # --- BERTScore --- cands = ["The capital of France is Paris.", "Paris is the capital city of France."] refs = ["Paris is the capital of France."] run_bertscore_demo(cands, refs * 2) # --- Judge template --- print(judge_prompt("Explain chain rule.", "dy/dx = dy/du * du/dx.", "Use nested functions.")) # --- Simulate Elo update (one step) --- r_a, r_b = 1000.0, 1000.0 k_elo = 32.0 s_a = 1.0 # A wins e_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0)) r_a += k_elo * (s_a - e_a) r_b += k_elo * ((1.0 - s_a) - (1.0 - e_a)) print(f"Elo after one match: A={r_a:.2f}, B={r_b:.2f}") if __name__ == "__main__": main() ``` --- ## Interview Guide !!! interview "FAANG-Level Questions" 1. Why is perplexity not comparable across different tokenizers? *Answer:* Perplexity is \(\exp\) of average **negative log-likelihood per token**—different tokenizers **segment text differently**, changing \(T\) and how probability mass is split (multi-token words vs single-token). A model with a larger vocabulary may assign higher per-token probabilities on different boundaries, making PPL **not portable** across tokenizers or even model families. Compare only with **identical** preprocessing and tokenizer. 2. Define pass@\(k\) and explain why the naive \(1-(1-p)^k\) formula can be wrong for code benchmarks. *Answer:* **pass@\(k\)** is the probability that **at least one** of \(k\) generated solutions passes tests; the **unbiased** HumanEval estimator uses samples \(n\ge k\) and counts correct \(c\) per task with combinatorics: \(1 - \binom{n-c}{k}/\binom{n}{k}\). The naive \(1-(1-p)^k\) assumes a known **i.i.d.** \(p\) per draw and ignores **finite** \(n\), **empirical** \(c\), and **per-problem** difficulty—underestimating variance and double-counting easy tasks if misapplied. 3. What is benchmark contamination and how would you detect it? *Answer:* **Contamination** is train/test overlap—benchmark strings or paraphrases appeared in pretraining, inflating scores via **memorization**. Detect with **n-gram** overlap scans, **embedding** nearest-neighbor checks against training corpora, **canary** strings, and **held-out dynamic** items. No detector is perfect; treat static benchmarks as **lower bounds** on leakage concern. 4. Compare macro versus micro averaging over tasks—when does each matter? *Answer:* **Macro** averages each task’s score equally—fair when every **task family** matters equally (avoid letting huge MMLU subjects dominate). **Micro** pools all items—reflects **user frequency** if tasks mirror traffic. Reporting both plus **worst-group** slices prevents hiding regressions on rare but critical tasks. 5. Name two failure modes of BLEU/ROUGE for chat evaluation. *Answer:* They reward **lexical overlap**, so correct paraphrases and varied helpful styles score **low** while verbose, repetitive outputs can score **high**. They ignore **factuality**, **safety**, and **multi-turn** coherence—fine for regression **sanity** in narrow settings, misleading as a sole chat KPI. 6. How does Chatbot Arena estimate model strength, and what is a limitation of Elo here? *Answer:* Arena aggregates **blind pairwise** user preferences and updates **Elo** ratings from wins/losses/ties—good for **relative** model ordering on open-ended chat. Limitations: **non-transitive** user tastes, **prompt** and **demographic** skew, **position** and **length** biases, and mismatch to **enterprise** tasks not sampled in the arena. 7. What is position bias in LLM-as-judge, and how do you mitigate it? *Answer:* Judges may favor the **first** or **second** answer independent of quality—ordering effects in pairwise comparisons. Mitigate by **swapping** A/B positions and averaging, using **single** consolidated prompts with clear rubrics, **calibrating** judges on gold items, and mixing **human** spot checks—never trust one-shot absolute scores without controls. 8. When would you trust human evaluation over automatic metrics for a release decision? *Answer:* For **subjective** qualities (helpfulness, tone), **high-stakes** harm (safety, legal), or tasks where automatic metrics **misfire** (tool use success), humans (experts or representative users) remain the ground truth. Automates are for **scale** and **CI**; ship decisions need **targeted** human review, especially on **slices** and new capabilities—automatic gains can be hollow if misaligned with user goals. 9. What is Krippendorff’s alpha used for in evaluation pipelines? *Answer:* **Krippendorff’s \(\alpha\)** generalizes inter-rater agreement beyond simple accuracy—handles **missing** ratings, **multiple** coders, and **nominal/ordinal** scales. Use it to ensure **label quality** before trusting human eval metrics; low \(\alpha\) means instructions or rubrics need refinement. It complements per-item disagreement analysis. 10. Sketch an end-to-end regression gate for a coding assistant (metrics + cadence). *Answer:* Nightly **HumanEval**/RepoBench-style **pass@k** with fixed \(n\), **static analysis** + unit tests on an internal golden repo, **latency** p95 on codegen paths, and **security** scans (secrets, unsafe code). Weekly add **LLM-judge** pairwise on a frozen dialog set with position swaps; **block** release on regressions beyond CI bands. Tie to **online** canary metrics (accept rate, bug reopen) when available. !!! interview "Follow-up Probes" - “Your MMLU went up but users complain—what do you check first?” - “How do you evaluate **tool-calling** agents differently from raw chat?” - “What’s wrong with evaluating on the **training** chat logs?” - “How would you detect **reward hacking** on a learned metric?” !!! key-phrases "Key Phrases to Use in Interviews" - “**Slice metrics** and **worst-group** performance, not just averages.” - “**Unbiased pass@\(k\)** with \(n \ge k\) samples per task.” - “**Contamination audits** plus **dynamic** benchmarks for trust.” - “**Pairwise** comparisons with **position swaps** for judge fairness.” - “**Offline** regression gates aligned with **online** KPIs.” --- ## References 1. Hendrycks, D., et al. (2021). *Measuring Massive Multitask Language Understanding (MMLU).* ICLR. 2. Chen, M., et al. (2021). *Evaluating Large Language Models Trained on Code.* arXiv:2107.03374. 3. Cobbe, K., et al. (2021). *Training Verifiers to Solve Math Word Problems.* arXiv:2110.14168. 4. Lin, C.-Y. (2004). *ROUGE: A Package for Automatic Evaluation of Summaries.* 5. Papineni, K., et al. (2002). *BLEU: a Method for Automatic Evaluation of Machine Translation.* ACL. 6. Zhang, T., et al. (2020). *BERTScore: Evaluating Text Generation with BERT.* ICLR. 7. EleutherAI. *lm-evaluation-harness* (GitHub). Standardized LM benchmarking framework. 8. Chiang, W.-L., et al. (2024). *Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference.* arXiv:2403.04132. 9. Zheng, L., et al. (2023). *Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.* NeurIPS Datasets & Benchmarks. 10. Krippendorff, K. (2018). *Content Analysis: An Introduction to Its Methodology.*

Related Documents

Evaluation Harness (Offline + Online)

/godmode:eval

🔬 Open Deep Research

EEG-Datasets