# 03 — Evaluation Techniques Deep Dive: LLM-as-Judge, Human Evaluation & Retrieval Metrics
> **Interview Reality:** _"What evaluation techniques do you use and why?"_
> Knowing WHAT to measure (faithfulness, relevance) is necessary but not sufficient.
> You must explain HOW you measure it — the specific techniques, their trade-offs, and when to use each one.
> This is where you prove you've actually built evaluation pipelines, not just read about them.
---
## Table of Contents
1. [The Evaluation Technique Landscape](#1-the-evaluation-technique-landscape)
2. [LLM-as-Judge — Complete Guide](#2-llm-as-judge--complete-guide)
3. [Human Evaluation — Complete Guide](#3-human-evaluation--complete-guide)
4. [Retrieval Metrics (Recall@K, Precision@K, MRR, NDCG)](#4-retrieval-metrics)
5. [Hybrid Evaluation Strategies](#5-hybrid-evaluation-strategies)
6. [Evaluation Pipeline Architecture](#6-evaluation-pipeline-architecture)
7. [Evaluation Anti-Patterns](#7-evaluation-anti-patterns)
8. [Technique Selection Decision Framework](#8-technique-selection-decision-framework)
9. [Interview Deep Dive: Conversation Flow](#9-interview-deep-dive-conversation-flow)
10. [Follow-Up Questions & Answers](#10-follow-up-questions--answers)
---
## 1. The Evaluation Technique Landscape
### Overview: Three Approaches, Different Strengths
```
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ EVALUATION TECHNIQUE COMPARISON │
│ │
│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────────────┐ │
│ │ LLM-AS-JUDGE │ │ HUMAN EVALUATION │ │ RETRIEVAL METRICS │ │
│ │ │ │ │ │ │ │
│ │ Use another LLM │ │ Domain experts │ │ Mathematical metrics │ │
│ │ to score outputs │ │ review + score │ │ for search quality │ │
│ │ │ │ │ │ │ │
│ │ ✅ Scalable │ │ ✅ Most accurate │ │ ✅ Deterministic │ │
│ │ ✅ Cheap │ │ ✅ Domain nuance │ │ ✅ No LLM needed │ │
│ │ ✅ Fast │ │ ✅ Ground truth │ │ ✅ Well-understood │ │
│ │ │ │ │ │ │ │
│ │ ❌ Can be wrong │ │ ❌ Expensive │ │ ❌ Only measures search │ │
│ │ ❌ Bias issues │ │ ❌ Slow │ │ ❌ Needs labeled data │ │
│ │ ❌ Needs calib. │ │ ❌ Not scalable │ │ ❌ Doesn't judge output │ │
│ └──────────────────┘ └──────────────────┘ └──────────────────────────┘ │
│ │
│ USE TOGETHER: LLM-as-judge for scale + Human for calibration + Retrieval for search │
└─────────────────────────────────────────────────────────────────────────────────────┘
```
### When to Use Each Technique
| Scenario | Primary Technique | Secondary |
|----------|------------------|-----------|
| Production monitoring (high volume) | LLM-as-Judge | Retrieval metrics |
| Pre-deployment validation | Golden test set + LLM-as-Judge | Human spot-check |
| Regulated industries (medical, legal) | Human evaluation | LLM-as-Judge for pre-screening |
| Search/retrieval optimization | Retrieval metrics | LLM-as-Judge on end-to-end |
| Model comparison / A/B testing | LLM-as-Judge (pairwise) | Human evaluation on disagreements |
| Debugging specific failures | Human evaluation | Detailed trace analysis |
---
## 2. LLM-as-Judge — Complete Guide
### 2.1 What Is LLM-as-Judge?
LLM-as-judge means using a (typically stronger) LLM to evaluate the output of another LLM. The judge reads the query, context, and response, then assigns a quality score.
```
┌────────────────────────────────────────────────────────────────────┐
│ LLM-AS-JUDGE ARCHITECTURE │
│ │
│ ┌──────────────┐ ┌──────────────────┐ ┌──────────────┐ │
│ │ PRIMARY LLM │ │ JUDGE LLM │ │ SCORE DB │ │
│ │ (generates │───▶│ (evaluates │───▶│ (store + │ │
│ │ response) │ │ response) │ │ aggregate) │ │
│ └──────────────┘ └──────────────────┘ └──────────────┘ │
│ │
│ Primary: GPT-4o-mini (cheap, fast) │
│ Judge: GPT-4o or Claude (high quality, more expensive) │
│ │
│ Key Rule: Judge should be STRONGER than the primary model │
│ (or at least a different model to avoid self-bias) │
└────────────────────────────────────────────────────────────────────┘
```
### 2.2 Types of LLM-as-Judge Evaluations
#### Type 1: Pointwise Scoring (Rate a Single Response)
```python
POINTWISE_JUDGE_PROMPT = """
You are an expert evaluator. Given a question, context, and response,
rate the response quality on a scale of 1-5.
QUESTION: {question}
CONTEXT: {context}
RESPONSE: {response}
Rate each dimension:
1. FAITHFULNESS (1-5): Is every claim in the response supported by the context?
1 = Completely hallucinated
3 = Mix of supported and unsupported claims
5 = Every claim is directly supported by context
2. RELEVANCE (1-5): Does the response answer the question?
1 = Completely off-topic
3 = Partially addresses the question
5 = Directly and completely answers the question
3. COMPLETENESS (1-5): Does the response cover all aspects of the question?
1 = Major information missing
3 = Covers the main point but misses details
5 = Comprehensive answer
Output as JSON:
{{
"faithfulness": <1-5>,
"relevance": <1-5>,
"completeness": <1-5>,
"explanation": "<brief justification>"
}}
"""
```
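The prompt above asks for JSON, but a judge's reply still has to be parsed and range-checked before the scores are stored. The following is a minimal sketch of that step; `parse_pointwise_scores` and `DIMENSIONS` are hypothetical names introduced here, and the assumption that the judge may wrap the JSON in extra prose is ours, not something the prompt guarantees.

```python
import json

# Dimensions requested by the pointwise prompt above.
DIMENSIONS = ("faithfulness", "relevance", "completeness")


def parse_pointwise_scores(raw_output: str) -> dict:
    """Parse a judge reply and validate that each dimension is an integer 1-5."""
    # Assumption: the reply may contain text around the JSON object,
    # so keep only the span between the first "{" and the last "}".
    start, end = raw_output.find("{"), raw_output.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in judge output")
    parsed = json.loads(raw_output[start:end + 1])

    scores = {}
    for dim in DIMENSIONS:
        value = parsed.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{dim} must be an integer from 1 to 5, got {value!r}")
        scores[dim] = value
    scores["explanation"] = str(parsed.get("explanation", ""))
    return scores


if __name__ == "__main__":
    reply = 'Sure: {"faithfulness": 5, "relevance": 4, "completeness": 3, "explanation": "ok"}'
    print(parse_pointwise_scores(reply))
```

Rejecting malformed replies outright, rather than guessing a score, keeps bad judge outputs from silently skewing aggregates; such items can be retried or routed to human review.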
#### Type 2: Pairwise Comparison (Compare Two Responses)
```python
PAIRWISE_JUDGE_PROMPT = """
You are an expert evaluator. Given a question, context, and TWO responses,
determine which response is better.
QUESTION: {question}
CONTEXT: {context}
RESPONSE A: {response_a}
RESPONSE B: {response_b}
Which response is better and why? Consider:
- Faithfulness to context
- Relevance to the question
- Completeness of the answer
- Clarity of explanation
Output as JSON:
{{
"winner": "A" or "B" or "TIE",
"confidence": <0.0-1.0>,
"explanation": "<brief justification>"
}}
"""
```
> **Interview Tip:** _"Pairwise comparison tends to be more reliable than pointwise scoring because both humans and LLMs are better at relative comparison than at absolute scoring. I use pairwise for A/B testing model versions and pointwise for production monitoring."_
#### Type 3: Reference-Based (Compare Against Ground Truth)
```python
REFERENCE_JUDGE_PROMPT = """
Given a question, a reference answer (ground truth), and a candidate response,
evaluate how well the candidate matches the reference.
QUESTION: {question}
REFERENCE ANSWER: {reference}
CANDIDATE RESPONSE: {candidate}
Rate on a scale of 1-5:
1 = Completely different from reference — wrong answer
2 = Some overlap but significant differences
3 = Captures the main idea but with notable omissions or additions
4 = Very close to reference with minor differences
5 = Semantically equivalent to reference
Output:
{{
"score": <1-5>,
"missing_from_candidate": "<what the candidate missed>",
"extra_in_candidate": "<what the candidate added>",
"explanation": "<brief justification>"
}}
"""
```
### 2.3 LLM-as-Judge Biases & Mitigations
```
┌────────────────────────────────────────────────────────────────────────────┐
│ LLM-AS-JUDGE KNOWN BIASES │
│ │
│ BIAS │ DESCRIPTION │ MITIGATION │
│ ────────────────────────┼─────────────────────────────┼────────────────── │
│ Position Bias │ Prefers Response A (first │ Randomize order, │
│ │ listed) in pairwise │ run both orders │
│ │ │ │
│ Verbosity Bias │ Prefers longer, more │ Add "judge │
│ │ detailed responses │ conciseness" inst. │
│ │ │ │
│ Self-Enhancement Bias │ LLM prefers its own output │ Use different │
│ │ │ model as judge │
│ │ │ │
│ Sycophancy Bias │ Avoids giving low scores │ Calibrate with │
│ │ (wants to be "nice") │ known-bad examples │
│ │ │ │
│ Format Bias │ Prefers well-formatted │ Normalize format │
│ │ over accurate responses │ before judging │
│ │ │ │
│ Anchoring Bias │ First examples in few-shot │ Vary few-shot │
│ │ influence all scores │ examples │
└────────────────────────────────────────────────────────────────────────────┘
```
### 2.4 Position Bias Mitigation (Critical for Pairwise)
```python
async def pairwise_judge(question, context, response_a, response_b, judge_llm):
    """
    Mitigate position bias by running both orderings and checking consistency.
    """
    # Run 1: A first, B second
    result_ab = await judge_llm.evaluate(
        question=question, context=context,
        response_a=response_a, response_b=response_b
    )
    # Run 2: B first, A second (SWAP ORDER)
    result_ba = await judge_llm.evaluate(
        question=question, context=context,
        response_a=response_b, response_b=response_a  # Swapped
    )
    # Map the swapped run's verdict back onto the original labels
    unswap = {"A": "B", "B": "A", "TIE": "TIE"}
    verdict_ab = result_ab["winner"]
    verdict_ba = unswap.get(result_ba["winner"])
    if verdict_ab == verdict_ba:
        # Consistent: both orderings agree (on A, on B, or on a genuine tie)
        return {"winner": verdict_ab, "confidence": "high", "consistent": True}
    # Inconsistent: the verdict changed with the ordering
    return {"winner": "TIE", "confidence": "low", "consistent": False,
            "note": "Position bias detected — send to human review"}
### 2.5 Multi-Judge Consensus
```python
from statistics import mean, pstdev

async def multi_judge_evaluation(question, context, response):
    """
    Use multiple judges and take consensus to improve reliability.
    """
    # pointwise_prompt_v1/v2 and evaluate_with_judge are assumed to be defined elsewhere
    judges = [
        {"model": "gpt-4o", "prompt": pointwise_prompt_v1},
        {"model": "claude-3.5-sonnet", "prompt": pointwise_prompt_v1},
        {"model": "gpt-4o", "prompt": pointwise_prompt_v2},  # Different prompt
    ]
    scores = []
    for judge in judges:
        score = await evaluate_with_judge(judge, question, context, response)
        scores.append(score)
    # Consensus: average scores, flag high-spread items for human review
    faithfulness_scores = [s["faithfulness"] for s in scores]
    avg_faithfulness = mean(faithfulness_scores)
    spread = pstdev(faithfulness_scores)  # Standard deviation across judges
    return {
        "faithfulness": avg_faithfulness,
        "confidence": "high" if spread < 0.5 else "low",
        "needs_human_review": spread >= 0.5,
        "individual_scores": scores
    }
### 2.6 LLM-as-Judge Cost Analysis
```
Cost for 1,000 evaluations:
Pointwise (GPT-4o judge):
Input: ~800 tokens/eval (question + context + response + prompt)
Output: ~100 tokens/eval (structured score + explanation)
Cost: 1,000 × (800 × $2.50/M + 100 × $10.00/M)
= 1,000 × ($0.002 + $0.001)
= $3.00 per 1,000 evaluations
Pairwise with position bias mitigation:
2 calls per evaluation → $6.00 per 1,000 evaluations
Multi-judge (3 judges):
3 calls per evaluation → $9.00 per 1,000 evaluations
Cost-optimized: Use GPT-4o-mini as judge:
= $0.30 per 1,000 evaluations (10x cheaper, ~85% accuracy vs GPT-4o)
```
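The arithmetic above generalizes to a small calculator, which makes it easier to rerun the numbers when token counts or prices change. This is a sketch only: `judge_cost_usd` is a hypothetical helper, and the prices passed in are the per-million-token figures quoted in the breakdown above, which may not match current provider pricing.

```python
def judge_cost_usd(n_evals, input_tokens, output_tokens,
                   input_price_per_m, output_price_per_m, calls_per_eval=1):
    """Estimate total judge cost in USD.

    Prices are USD per one million tokens; calls_per_eval is 2 for
    pairwise-with-swap and 3 for a three-judge panel.
    """
    per_call = (input_tokens * input_price_per_m
                + output_tokens * output_price_per_m) / 1_000_000
    return n_evals * calls_per_eval * per_call


if __name__ == "__main__":
    # Figures from the breakdown above: 800 input and 100 output tokens per call.
    for label, calls in [("pointwise", 1), ("pairwise + swap", 2), ("3 judges", 3)]:
        cost = judge_cost_usd(1_000, 800, 100, 2.50, 10.00, calls_per_eval=calls)
        print(f"{label}: ${cost:.2f} per 1,000 evaluations")
```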
---
## 3. Human Evaluation — Complete Guide
### 3.1 When Human Evaluation Is Essential
```
ALWAYS use human evaluation when:
✅ Building the initial evaluation pipeline (calibration)
✅ Regulated industries (medical, legal, financial)
✅ LLM-as-judge shows high variance (uncertainty)
✅ Evaluating subjective qualities (tone, empathy, clarity)
✅ Building/updating ground truth datasets
✅ Validating LLM-as-judge accuracy (meta-evaluation)
SKIP human evaluation when:
❌ High-volume production monitoring (too expensive)
❌ Binary/structured output validation (automated is better)
❌ A/B testing with clear metrics (automated is sufficient)
```
### 3.2 Human Evaluation Framework
```
┌─────────────────────────────────────────────────────────────────────────┐
│ HUMAN EVALUATION PIPELINE │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │
│ │ SAMPLE │──▶│ ANNOTATION │──▶│ QUALITY │──▶│ AGGREGATE │ │
│ │ SELECTION │ │ GUIDELINES │ │ CONTROL │ │ & REPORT │ │
│ └──────────┘ └──────────────┘ └──────────────┘ └────────────┘ │
│ │
│ What: What: What: What: │
│ • Random 1% • Rubric with • Inter-annotator • Cohen's κ │
│ • Edge cases clear examples agreement (IAA) • Score dist. │
│ • Low-conf for each score • Gold questions • Trend charts │
│ LLM-judge • 1-5 scale per • Annotator • Action items │
│ items dimension calibration │
└─────────────────────────────────────────────────────────────────────────┘
```
### 3.3 Annotation Rubric Design
```
FAITHFULNESS RUBRIC:
Score 5 — Perfect Faithfulness
Every claim in the response is directly supported by the context.
Example: Context says "Founded 2015" → Response says "Founded in 2015" ✅
Score 4 — Minor Gap
All major claims supported, but includes trivial unsupported details.
Example: "Founded in 2015" + "likely in California" (location not in context) ⚠️
Score 3 — Mixed
Some claims supported, some clearly unsupported.
Example: "Founded in 2015 with 10,000 employees" (employee count not in context) ⚠️
Score 2 — Mostly Unsupported
Major claims are unsupported or contradict context.
Example: "Founded in 2018" when context says 2015 ❌
Score 1 — Complete Hallucination
No claims are supported by the context.
Example: Entirely fabricated response with no connection to context ❌
EDGE CASES:
• "I don't know" → Score 5 for faithfulness (refusing to hallucinate is good)
• Paraphrasing → Score 5 if semantically equivalent
• Inference → Score 4 if reasonable inference, Score 3 if a stretch
```
### 3.4 Inter-Annotator Agreement (IAA)
```
Why IAA matters: If two humans can't agree on a score, your evaluation is unreliable.
Cohen's Kappa (κ) — Measures Agreement:
κ > 0.80 → Almost perfect agreement — your rubric is clear
κ = 0.60–0.80 → Substantial agreement — acceptable
κ = 0.40–0.60 → Moderate agreement — improve rubric
κ < 0.40 → Fair/poor agreement — rubric needs major revision
How to calculate:
1. Have 2+ annotators score the same 50–100 samples
2. Calculate % agreement
3. Correct for chance agreement using Cohen's Kappa
Example:
Annotator A scores: [5, 4, 3, 5, 2, 4, 5, 3, 4, 5]
Annotator B scores: [5, 4, 4, 5, 2, 3, 5, 3, 4, 5]
Raw agreement: 8/10 = 80%
Cohen's κ: 0.71 → Substantial agreement ✅
```
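The three calculation steps above can be sketched in a few lines of plain Python. `cohens_kappa` is a hypothetical helper written for this guide; it implements unweighted Cohen's kappa for two annotators, κ = (p_o − p_e) / (1 − p_e), where p_o is the raw agreement and p_e is the agreement expected by chance from each annotator's label frequencies.

```python
from collections import Counter


def cohens_kappa(scores_a, scores_b):
    """Unweighted Cohen's kappa for two annotators scoring the same items."""
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("both annotators must score the same non-empty item set")
    n = len(scores_a)

    # Step 2: observed agreement (fraction of items with identical scores).
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n

    # Step 3: chance agreement from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(scores_a), Counter(scores_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    annotator_a = [5, 4, 3, 5, 2, 4, 5, 3, 4, 5]
    annotator_b = [5, 4, 4, 5, 2, 3, 5, 3, 4, 5]
    print(round(cohens_kappa(annotator_a, annotator_b), 2))  # → 0.71
```

Unweighted kappa treats a 4-vs-5 disagreement the same as a 1-vs-5 disagreement. For ordinal 1-5 rubrics, a weighted variant (for example `sklearn.metrics.cohen_kappa_score` with `weights="quadratic"`) is often a better fit, at the cost of an extra dependency.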
### 3.5 Human Evaluation Cost & Scale
```
Typical human evaluation cost:
• Internal domain experts: ~$50–100/hour → ~$2–4 per evaluation
• Crowdworkers (e.g., Scale AI, Surge): ~$0.50–2.00 per evaluation
• Specialized annotators (medical, legal): ~$5–15 per evaluation
At scale:
500 evaluations/week × $2/eval = $1,000/week
500 evaluations/week × $10/eval (medical) = $5,000/week
Budget tip: Use LLM-as-judge for 95% of evaluations, reserve
human evaluation for low-confidence cases and calibration.
```
---
## 4. Retrieval Metrics
### 4.1 Why Measure Retrieval Separately?
```
A RAG system can fail at two points:
Point 1: RETRIEVAL fails → Wrong documents → Bad answer
Point 2: GENERATION fails → Right documents → Bad answer
If you only measure the final answer, you can't tell WHERE the failure is.
Retrieval metrics isolate Point 1.
```
### 4.2 Recall@K
**What:** Of all relevant documents, how many did we find in the top K results?
```
Formula: Recall@K = |Relevant ∩ Retrieved@K| / |Relevant|
Example:
Corpus has 5 relevant documents for query "refund policy"
We retrieve top 5 results, and 3 of them are relevant
Recall@5 = 3/5 = 0.6
Interpretation: We're missing 40% of relevant context.
This means the LLM is working with incomplete information.
Benchmarks:
Recall@3: Target > 0.70
Recall@5: Target > 0.80
Recall@10: Target > 0.90
```
### 4.3 Precision@K
**What:** Of the documents we retrieved, how many are actually relevant?
```
Formula: Precision@K = |Relevant ∩ Retrieved@K| / K
Example:
We retrieve top 5 results for "refund policy"
3 of them are about refund policy, 2 are about shipping
Precision@5 = 3/5 = 0.6
Interpretation: 40% of context is noise.
This dilutes the useful context and can confuse the LLM.
Benchmarks:
Precision@3: Target > 0.80
Precision@5: Target > 0.70
Precision@10: Target > 0.60
```
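Recall@K and Precision@K share the same numerator and differ only in the denominator, which is easy to see in code. The sketch below assumes document IDs are unique strings and that the full set of relevant IDs is known for each query; the function names and the sample IDs are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """|Relevant ∩ Retrieved@K| / |Relevant|"""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """|Relevant ∩ Retrieved@K| / K"""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(retrieved[:k]) & set(relevant)) / k


if __name__ == "__main__":
    # Mirrors the "refund policy" examples: 5 relevant docs exist,
    # and 3 of the top 5 retrieved results are relevant.
    relevant = {"refund-1", "refund-2", "refund-3", "refund-4", "refund-5"}
    retrieved = ["refund-1", "ship-1", "refund-2", "ship-2", "refund-3"]
    print(recall_at_k(retrieved, relevant, 5))     # → 0.6
    print(precision_at_k(retrieved, relevant, 5))  # → 0.6
```

Note that `precision_at_k` divides by K even when fewer than K documents come back, which matches the formula above and penalizes short result lists.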
### 4.4 The Recall-Precision Trade-off
```
┌────────────────────────────────────────────────────────────────────┐
│ RECALL vs PRECISION TRADE-OFF │
│ │
│ High K (retrieve more) → Higher Recall, Lower Precision │
│ Low K (retrieve fewer) → Lower Recall, Higher Precision │
│ │
│ Precision │
│ ▲ │
│ │ ● │
│ │ ● │
│ │ ● │
│ │ ● The "sweet spot" is typically K=3 to K=5 │
│ │ ● for RAG systems │
│ │ ● │
│ │ ● │
│ │ ● ● ● ● │
│ └──────────────────────────────▶ Recall │
│ K=1 K=3 K=5 K=10 K=20 K=50 │
│ │
│ RAG SWEET SPOT: K=3 to K=5 gives best balance │
│ More context = more cost + more noise + higher recall │
└────────────────────────────────────────────────────────────────────┘
```
### 4.5 Mean Reciprocal Rank (MRR)
**What:** How high is the first relevant document ranked?
```
Formula: MRR = (1/N) × Σ(1/rank_i)
where rank_i = position of first relevant doc for query i
Example (3 queries):
Query 1: First relevant doc at position 1 → 1/1 = 1.0
Query 2: First relevant doc at position 3 → 1/3 = 0.33
Query 3: First relevant doc at position 2 → 1/2 = 0.5
MRR = (1.0 + 0.33 + 0.5) / 3 = 0.61
Interpretation: 1/MRR ≈ 1.6 is the harmonic mean of the first-relevant rank,
so a typical query surfaces its first relevant document within the top 2 results.
Benchmarks:
MRR > 0.80 → Excellent (first relevant doc usually in top 1-2)
MRR 0.50–0.80 → Good (first relevant doc usually in top 2-3)
MRR < 0.50 → Poor (relevant docs buried too deep)
```
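The MRR formula above can be sketched as follows. The input format is an assumption made for illustration: one list of booleans per query, in rank order, marking whether each retrieved document is relevant. A query with no relevant document contributes 0, which is a common convention but not one the text above specifies.

```python
def mean_reciprocal_rank(ranked_relevance):
    """MRR = (1/N) × Σ 1/rank_i, where rank_i is the first relevant position."""
    if not ranked_relevance:
        raise ValueError("need at least one query")
    total = 0.0
    for flags in ranked_relevance:
        for position, is_relevant in enumerate(flags, start=1):
            if is_relevant:
                total += 1.0 / position
                break  # Only the FIRST relevant document counts.
    return total / len(ranked_relevance)


if __name__ == "__main__":
    # Same three queries as the example above: first hit at ranks 1, 3, and 2.
    queries = [
        [True],
        [False, False, True],
        [False, True],
    ]
    print(round(mean_reciprocal_rank(queries), 2))  # → 0.61
```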
### 4.6 NDCG (Normalized Discounted Cumulative Gain)
**What:** Are the most relevant documents ranked highest? NDCG rewards having highly relevant documents at the top.
```
NDCG considers:
1. Relevance has grades (not just binary relevant/not-relevant)
2. Position matters — relevant docs at position 1 are worth more than at position 5
Formula:
DCG@K = Σ (relevance_i / log2(i + 1)) for i = 1 to K
NDCG@K = DCG@K / Ideal DCG@K
Example:
Retrieved: [rel=3, rel=0, rel=2, rel=1, rel=0] (relevance on 0-3 scale)
DCG@5 = 3/log2(2) + 0/log2(3) + 2/log2(4) + 1/log2(5) + 0/log2(6)
= 3.0 + 0 + 1.0 + 0.43 + 0
= 4.43
Ideal: [rel=3, rel=2, rel=1, rel=0, rel=0]
Ideal DCG@5 = 3.0 + 1.26 + 0.5 + 0 + 0 = 4.76
NDCG@5 = 4.43 / 4.76 = 0.93
Benchmarks:
NDCG@5 > 0.85 → Excellent ranking quality
NDCG@5 0.70–0.85 → Good
NDCG@5 < 0.70 → Needs improvement (consider re-ranking)
```
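The DCG and NDCG formulas above translate directly into code. In this sketch the ideal ordering is computed by sorting the retrieved list's own relevance grades, which matches the worked example; a stricter version would build the ideal ranking from every judged document for the query, including relevant ones that were never retrieved.

```python
import math


def dcg_at_k(relevances, k):
    """DCG@K = Σ relevance_i / log2(i + 1) for i = 1..K"""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """NDCG@K = DCG@K / Ideal DCG@K (ideal = same grades sorted descending)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    retrieved_grades = [3, 0, 2, 1, 0]  # Relevance on a 0-3 scale, in rank order.
    print(round(dcg_at_k(retrieved_grades, 5), 2))   # → 4.43
    print(round(ndcg_at_k(retrieved_grades, 5), 2))  # → 0.93
```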
### 4.7 Complete Retrieval Metrics Comparison
```
┌────────────────────────────────────────────────────────────────────────┐
│ RETRIEVAL METRICS AT A GLANCE │
│ │
│ Metric │ Measures │ Needs Labels │ When to Use │
│ ─────────────┼─────────────────────────┼──────────────┼───────────── │
│ Recall@K │ Coverage: did we find │ Yes │ Always │
│ │ all relevant docs? │ │ │
│ ─────────────┼─────────────────────────┼──────────────┼───────────── │
│ Precision@K │ Purity: are all results │ Yes │ Always │
│ │ relevant? │ │ │
│ ─────────────┼─────────────────────────┼──────────────┼───────────── │
│ MRR │ Position: how quickly │ Yes │ User-facing │
│ │ do we find something? │ │ search │
│ ─────────────┼─────────────────────────┼──────────────┼───────────── │
│ NDCG@K │ Ranking quality: are │ Yes (graded) │ When ranking │
│ │ best docs on top? │ │ order matters │
│ ─────────────┼─────────────────────────┼──────────────┼───────────── │
│ Hit Rate@K │ Binary: is at least │ Yes │ Simple QA │
│ │ one relevant in top K? │ │ systems │
│ ─────────────┼─────────────────────────┼──────────────┼───────────── │
│ Context │ Relevance of retrieved │ No (LLM │ RAGAS-based │
│ Precision │ docs to the query │ evaluated) │ evaluation │
└────────────────────────────────────────────────────────────────────────┘
```
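Hit Rate@K appears in the table without a formula, so here is a minimal sketch: the fraction of queries for which at least one relevant document shows up in the top K. The function name and the list-of-lists input format are assumptions made for illustration.

```python
def hit_rate_at_k(retrieved_per_query, relevant_per_query, k):
    """Fraction of queries with at least one relevant doc in the top K."""
    if not retrieved_per_query or len(retrieved_per_query) != len(relevant_per_query):
        raise ValueError("need one retrieved list and one relevant list per query")
    hits = sum(
        1
        for retrieved, relevant in zip(retrieved_per_query, relevant_per_query)
        if set(retrieved[:k]) & set(relevant)
    )
    return hits / len(retrieved_per_query)


if __name__ == "__main__":
    retrieved = [["a", "b", "c"], ["d", "e", "f"]]
    relevant = [["c"], ["x"]]
    print(hit_rate_at_k(retrieved, relevant, 3))  # → 0.5
```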
---
## 5. Hybrid Evaluation Strategies
### 5.1 The Hybrid Evaluation Architecture
In production, you combine all three techniques in a layered system:
```
┌─────────────────────────────────────────────────────────────────────────────────┐
│ HYBRID EVALUATION ARCHITECTURE │
│ │
│ ALL TRAFFIC (100%) │
│ ├──▶ Layer 1: PROGRAMMATIC CHECKS (100% of traffic) │
│ │ • Response length, format, schema validation │
│ │ • Token count anomaly detection │
│ │ • Refusal keyword detection │
│ │ • Cost: $0.00 per check (string matching) │
│ │ │
│ ├──▶ Layer 2: RETRIEVAL METRICS (100% of traffic) │
│ │ • Context relevance (lightweight embedding similarity) │
│ │ • Retrieved doc count │
│ │ • Retrieval latency │
│ │ • Cost: ~$0.001 per check │
│ │ │
│ ├──▶ Layer 3: LLM-AS-JUDGE (15% of traffic, sampled) │
│ │ • Faithfulness scoring │
│ │ • Relevance scoring │
│ │ • Hallucination detection │
│ │ • Cost: ~$0.003 per evaluation │
│ │ │
│ └──▶ Layer 4: HUMAN REVIEW (0.5% of traffic + all flagged) │
│ • Low-confidence LLM-judge items │
│ • Random sample for calibration │
│ • All user-reported issues │
│ • Cost: ~$2.00 per evaluation │
│ │
│ TOTAL COST EXAMPLE (10,000 req/day): │
│ L1: 10,000 × $0.00 = $0.00 │
│ L2: 10,000 × $0.001 = $10.00 │
│ L3: 1,500 × $0.003 = $4.50 │
│ L4: 50 × $2.00 = $100.00 │
│ ───────────────────────────── │
│ TOTAL: $114.50/day for comprehensive evaluation │
└─────────────────────────────────────────────────────────────────────────────────┘
```
### 5.2 Escalation Flow
```
Request completed
│
▼
Layer 1: Programmatic check
├── FAIL (e.g., empty response) → Immediately flag + alert
│
└── PASS → Continue to Layer 2
│
▼
Layer 2: Retrieval metrics
├── FAIL (e.g., 0 relevant docs retrieved) → Alert retrieval team
│
└── PASS → Random sample 15%
│
▼
Layer 3: LLM-as-Judge (faithfulness on a 0–1 scale)
├── FAIL (faithfulness < 0.5) → Auto-flag + human queue
├── UNCERTAIN (0.5–0.7) → Human review queue
│
└── PASS (> 0.7) → Log score + continue
│
└── Random 0.5% → Layer 4: Human review
(for calibration)
```
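The Layer 3 branch of the flow above can be expressed as a small routing function. This is a sketch under stated assumptions: faithfulness is a score on a 0–1 scale, the thresholds and the 0.5% calibration rate are the ones shown in the diagram, and `route_after_judge` plus the queue names it returns are hypothetical. The random draw is passed in by the caller so the function stays deterministic and easy to test.

```python
import random


def route_after_judge(faithfulness, calibration_draw,
                      fail_below=0.5, pass_above=0.7, calibration_rate=0.005):
    """Decide what happens to a request after the LLM judge has scored it."""
    if faithfulness < fail_below:
        return "auto_flag_and_human_queue"   # FAIL
    if faithfulness <= pass_above:
        return "human_review_queue"          # UNCERTAIN (0.5 to 0.7)
    if calibration_draw < calibration_rate:
        return "human_calibration_sample"    # PASS, sampled for calibration
    return "log_only"                        # PASS


if __name__ == "__main__":
    for score in (0.3, 0.6, 0.9):
        print(score, route_after_judge(score, calibration_draw=random.random()))
```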
---
## 6. Evaluation Pipeline Architecture
### 6.1 End-to-End Evaluation System
```
┌─────────────────────────────────────────────────────────────────────────────────────────┐
│ PRODUCTION EVALUATION SYSTEM │
│ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ RAG │───▶│ EVALUATION │───▶│ METRICS │ │
│ │ SERVICE │ │ QUEUE │ │ STORE │ │
│ │ │ │ (Kafka/SQS) │ │ (BigQuery) │ │
│ └──────────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ │ Async │ Workers │ Dashboards │
│ │ emit │ │ │
│ ▼ ▼ ▼ │
│ ┌──────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ USER │ │ EVAL │ │ GRAFANA / │ │
│ │ RESPONSE │ │ WORKERS │ │ LOOKER │ │
│ │ (served) │ │ │ │ │ │
│ └──────────┘ │ • LLM Judge │ │ • Score │ │
│ │ • Retrieval │ │ trends │ │
│ │ metrics │ │ • Alerts │ │
│ │ • Format │ │ • Drill-down │ │
│ │ checks │ │ │ │
│ └──────────────┘ └──────────────┘ │
│ │
│ KEY DESIGN DECISIONS: │
│ • Evaluation is ASYNC — never blocks user response │
│ • All eval data stored in append-only BigQuery table │
│ • Workers scale independently from serving │
│ • Evaluation failures don't affect production serving │
└─────────────────────────────────────────────────────────────────────────────────────────┘
```
### 6.2 Evaluation Data Schema
```python
# BigQuery table schema for evaluation results
evaluation_schema = {
    "request_id": "STRING",             # Unique request identifier
    "timestamp": "TIMESTAMP",           # When the evaluation ran
    "query": "STRING",                  # User query
    "response": "STRING",               # LLM response
    "context_chunks": "ARRAY<STRING>",  # Retrieved context
    "model_version": "STRING",          # Which model version
    "prompt_version": "STRING",         # Which prompt template
    # Evaluation scores
    "faithfulness_score": "FLOAT64",
    "relevance_score": "FLOAT64",
    "hallucination_rate": "FLOAT64",
    "completeness_score": "FLOAT64",
    # Retrieval metrics
    "context_precision": "FLOAT64",
    "retrieval_latency_ms": "INT64",
    "docs_retrieved": "INT64",
    # Metadata
    "eval_method": "STRING",            # "llm_judge", "human", "automated"
    "eval_model": "STRING",             # Which judge model
    "eval_latency_ms": "INT64",
    "eval_cost_usd": "FLOAT64",
    # Flags
    "flagged_for_review": "BOOLEAN",
    "human_reviewed": "BOOLEAN",
    "human_score": "FLOAT64",           # NULL until human reviews
}
```
---
## 7. Evaluation Anti-Patterns
### Common Mistakes to Avoid
```
┌────────────────────────────────────────────────────────────────────────────┐
│ EVALUATION ANTI-PATTERNS │
│ │
│ ❌ ANTI-PATTERN │ ✅ BETTER APPROACH │
│ ─────────────────────────────┼─────────────────────────────────────────── │
│ Only measuring user │ Combine user feedback (lagging) with │
│ feedback (thumbs up/down) │ automated metrics (leading indicators) │
│ │ │
│ Using the same model as │ Use a different/stronger model as judge, │
│ both generator and judge │ or use different prompts │
│ │ │
│ Evaluating 100% of traffic │ Sample 10-20% for LLM-judge, │
│ with LLM-as-judge (too $$) │ 0.5-1% for human review │
│ │ │
│ No calibration of LLM judge │ Calibrate against human annotations at │
│ against humans │ least quarterly (200+ examples) │
│ │ │
│ Only aggregate metrics │ Segment by query type, user segment, │
│ (hiding failure pockets) │ topic, model version │
│ │ │
│ Ignoring evaluation cost │ Budget evaluation as 50-100% of primary │
│ │ LLM cost — it's not free │
│ │ │
│ Static golden test set │ Update golden set quarterly as product │
│ (stale tests) │ evolves. Add real failure cases. │
│ │ │
│ Not evaluating retrieval │ Always separate retrieval eval from │
│ separately from generation │ generation eval — different fixes needed │
└────────────────────────────────────────────────────────────────────────────┘
```
---
## 8. Technique Selection Decision Framework
```
"Which evaluation technique should I use?"
├── Measuring retrieval quality?
│ └── YES → Retrieval metrics (Recall@K, Precision@K, MRR)
│
├── Measuring output quality at scale?
│ └── YES → LLM-as-Judge (sampled 10-20% of traffic)
│
├── High-stakes domain (medical, legal, financial)?
│ └── YES → Human evaluation + LLM-as-Judge pre-screen
│
├── Comparing two model versions (A/B testing)?
│ └── YES → Pairwise LLM-as-Judge + human tiebreaker
│
├── Building initial evaluation pipeline?
│ └── YES → Start with human evaluation → calibrate LLM-as-Judge
│
├── Debugging a specific failure?
│ └── YES → Human review + detailed trace analysis
│
└── Budget-constrained?
└── YES → Programmatic checks (100%) + LLM-as-Judge on GPT-4o-mini (5%)
```
---
## 9. Interview Deep Dive: Conversation Flow
### Scenario: "Walk me through your evaluation strategy"
**Interviewer:** _"You built a RAG system that answers customer questions. How do you evaluate it?"_
**Your Answer:**
> *"I use a three-layer evaluation strategy combining automated metrics, LLM-as-judge, and human review.*
>
> *First, **retrieval metrics** — I measure Recall@5 and Precision@5 to ensure we're finding the right documents. If Recall@5 drops below 0.80, I know the problem is in search, not generation. I calculate these against a labeled test set of 300+ queries with annotated relevant documents.*
>
> *Second, **LLM-as-judge** — I run GPT-4o as a judge on 15% of production traffic using a pointwise scoring rubric that evaluates faithfulness, relevance, and hallucination rate. Each evaluation costs about $0.003, so for our 10K daily requests, that's $4.50/day. I calibrate the judge against human annotations quarterly — last calibration showed 87% agreement on faithfulness scoring.*
>
> *Third, **human review** — 0.5% of traffic goes to human reviewers, plus all cases where the LLM judge has low confidence (score between 0.5 and 0.7). Human reviewers use a detailed rubric with 1-5 scores. I track inter-annotator agreement with Cohen's Kappa — we maintain κ > 0.75.*
>
> *All scores feed into BigQuery, visualized in Grafana with alerts when any metric drops below threshold for more than 30 minutes."*
**Follow-Up:** _"How accurate is GPT-4o as a judge?"_
> *"On our domain, GPT-4o agrees with human annotators about 87% of the time for faithfulness and 82% for relevance. I mitigate known biases — position bias in pairwise comparisons by running both orderings, verbosity bias by instructing the judge to value accuracy over length. When the judge and humans disagree, I analyze the pattern and update the judge prompt."*
---
## 10. Follow-Up Questions & Answers
### Q1: "What if you can't afford human evaluation?"
**Answer:** _"Start with the cheapest viable approach: (1) Programmatic checks for 100% of traffic — free. (2) GPT-4o-mini as judge on 5% of traffic — about $0.30/1000 evals. (3) Personally review 20–50 random responses per week as a 'one-person human eval.' This costs ~$0.50/day and catches 80% of what a full human eval program would."_
### Q2: "How do you create a labeled test set for retrieval metrics?"
**Answer:** _"Three approaches: (1) Start with 50 common user queries and manually annotate which documents should be retrieved. This takes one afternoon. (2) Use query logs — real user queries are the best test set. Have someone mark the relevant documents. (3) Use LLM-assisted annotation — have GPT-4o label relevance, then spot-check 10% manually. I aim for 300+ labeled examples covering different topics and complexity levels."_
### Q3: "MRR vs NDCG — when to use which?"
**Answer:** _"MRR cares about the **first** relevant result — great for when users need one good answer fast (like a chatbot). NDCG cares about the **ordering of all results** — great for when we pass multiple chunks to the LLM. For RAG, I use NDCG because we retrieve K=5 chunks and ALL of them affect generation quality, not just the first one."_
### Q4: "How often should you recalibrate LLM-as-judge?"
**Answer:** _"Quarterly recalibration minimum, and immediately after any major change: new model deployment, new prompt template, new domain added, or new judge model. Each calibration requires 200+ human-annotated examples. I compare judge scores against human scores and retune the judge prompt until agreement exceeds 85%."_
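The comparison of judge scores against human scores can be sketched as a simple agreement rate. The answer above does not define how agreement is computed, so this example assumes exact-match agreement on the 1-5 scale, with an optional tolerance for "within one point"; `agreement_rate` and the sample scores are illustrative only.

```python
def agreement_rate(judge_scores, human_scores, tolerance=0):
    """Fraction of items where judge and human scores differ by <= tolerance."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need paired, non-empty score lists")
    matches = sum(
        1 for judge, human in zip(judge_scores, human_scores)
        if abs(judge - human) <= tolerance
    )
    return matches / len(judge_scores)


if __name__ == "__main__":
    judge = [5, 4, 3, 2, 5]
    human = [5, 4, 4, 2, 5]
    rate = agreement_rate(judge, human)
    print(f"exact agreement: {rate:.0%}")
    print("recalibrate judge prompt" if rate < 0.85 else "judge within target")
```

Raw agreement ignores chance agreement, so reporting Cohen's kappa alongside it gives a more conservative view of how well the judge tracks human annotators.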
---
## 🔑 Key Takeaways for Interview Day
1. **Name all three techniques** — LLM-as-Judge, Human Evaluation, Retrieval Metrics
2. **Know when to use each** — scale vs accuracy vs cost trade-off
3. **Describe the hybrid approach** — combine all three in production
4. **Mention position bias mitigation** — shows deep understanding of LLM-as-judge
5. **Know your retrieval metrics** — Recall@K, Precision@K, MRR, NDCG with formulas
6. **Talk about calibration** — LLM judge calibrated against human annotations
7. **Budget the evaluation** — evaluation costs money, plan for it