# 03 — Evaluation Techniques Deep Dive: LLM-as-Judge, Human Evaluation & Retrieval Metrics

> **Interview Reality:** _"What evaluation techniques do you use and why?"_
> Knowing WHAT to measure (faithfulness, relevance) is necessary but not sufficient.
> You must explain HOW you measure it: the specific techniques, their trade-offs, and when to use each one.
> This is where you prove you've actually built evaluation pipelines, not just read about them.

---

## Table of Contents

1. [The Evaluation Technique Landscape](#1-the-evaluation-technique-landscape)
2. [LLM-as-Judge — Complete Guide](#2-llm-as-judge--complete-guide)
3. [Human Evaluation — Complete Guide](#3-human-evaluation--complete-guide)
4. [Retrieval Metrics (Recall@K, Precision@K, MRR, NDCG)](#4-retrieval-metrics)
5. [Hybrid Evaluation Strategies](#5-hybrid-evaluation-strategies)
6. [Evaluation Pipeline Architecture](#6-evaluation-pipeline-architecture)
7. [Evaluation Anti-Patterns](#7-evaluation-anti-patterns)
8. [Technique Selection Decision Framework](#8-technique-selection-decision-framework)
9. [Interview Deep Dive: Conversation Flow](#9-interview-deep-dive-conversation-flow)
10. [Follow-Up Questions & Answers](#10-follow-up-questions--answers)

---

## 1. The Evaluation Technique Landscape

### Overview: Three Approaches, Different Strengths

```
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                           EVALUATION TECHNIQUE COMPARISON                           │
│                                                                                     │
│  ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────────────┐           │
│  │   LLM-AS-JUDGE   │  │ HUMAN EVALUATION │  │    RETRIEVAL METRICS     │           │
│  │                  │  │                  │  │                          │           │
│  │ Use another LLM  │  │ Domain experts   │  │ Mathematical metrics     │           │
│  │ to score outputs │  │ review + score   │  │ for search quality       │           │
│  │                  │  │                  │  │                          │           │
│  │ ✅ Scalable      │  │ ✅ Most accurate │  │ ✅ Deterministic         │           │
│  │ ✅ Cheap         │  │ ✅ Domain nuance │  │ ✅ No LLM needed         │           │
│  │ ✅ Fast          │  │ ✅ Ground truth  │  │ ✅ Well-understood       │           │
│  │                  │  │                  │  │                          │           │
│  │ ❌ Can be wrong  │  │ ❌ Expensive     │  │ ❌ Only measures search  │           │
│  │ ❌ Bias issues   │  │ ❌ Slow          │  │ ❌ Needs labeled data    │           │
│  │ ❌ Needs calib.  │  │ ❌ Not scalable  │  │ ❌ Doesn't judge output  │           │
│  └──────────────────┘  └──────────────────┘  └──────────────────────────┘           │
│                                                                                     │
│ USE TOGETHER: LLM-as-judge for scale + Human for calibration + Retrieval for search │
└─────────────────────────────────────────────────────────────────────────────────────┘
```

### When to Use Each Technique

| Scenario | Primary Technique | Secondary |
|----------|------------------|-----------|
| Production monitoring (high volume) | LLM-as-Judge | Retrieval metrics |
| Pre-deployment validation | Golden test set + LLM-as-Judge | Human spot-check |
| Regulated industries (medical, legal) | Human evaluation | LLM-as-Judge for pre-screening |
| Search/retrieval optimization | Retrieval metrics | LLM-as-Judge on end-to-end |
| Model comparison / A/B testing | LLM-as-Judge (pairwise) | Human evaluation on disagreements |
| Debugging specific failures | Human evaluation | Detailed trace analysis |

---

## 2. LLM-as-Judge — Complete Guide

### 2.1 What Is LLM-as-Judge?

Using a (typically stronger) LLM to evaluate the output of another LLM.
The judge LLM reads the query, context, and response, then assigns a quality score.

```
┌──────────────────────────────────────────────────────────────────────┐
│                      LLM-AS-JUDGE ARCHITECTURE                       │
│                                                                      │
│  ┌──────────────┐    ┌──────────────────┐    ┌──────────────┐        │
│  │ PRIMARY LLM  │    │    JUDGE LLM     │    │   SCORE DB   │        │
│  │ (generates   │───▶│ (evaluates       │───▶│ (store +     │        │
│  │  response)   │    │  response)       │    │  aggregate)  │        │
│  └──────────────┘    └──────────────────┘    └──────────────┘        │
│                                                                      │
│  Primary: GPT-4o-mini (cheap, fast)                                  │
│  Judge:   GPT-4o or Claude (high quality, more expensive)            │
│                                                                      │
│  Key Rule: Judge should be STRONGER than the primary model           │
│  (or at least a different model to avoid self-bias)                  │
└──────────────────────────────────────────────────────────────────────┘
```

### 2.2 Types of LLM-as-Judge Evaluations

#### Type 1: Pointwise Scoring (Rate a Single Response)

```python
POINTWISE_JUDGE_PROMPT = """
You are an expert evaluator. Given a question, context, and response,
rate the response quality on a scale of 1-5.

QUESTION: {question}
CONTEXT: {context}
RESPONSE: {response}

Rate each dimension:

1. FAITHFULNESS (1-5): Is every claim in the response supported by the context?
   1 = Completely hallucinated
   3 = Mix of supported and unsupported claims
   5 = Every claim is directly supported by context

2. RELEVANCE (1-5): Does the response answer the question?
   1 = Completely off-topic
   3 = Partially addresses the question
   5 = Directly and completely answers the question

3. COMPLETENESS (1-5): Does the response cover all aspects of the question?
   1 = Major information missing
   3 = Covers the main point but misses details
   5 = Comprehensive answer

Output as JSON:
{{
  "faithfulness": <1-5>,
  "relevance": <1-5>,
  "completeness": <1-5>,
  "explanation": "<brief justification>"
}}
"""
```

#### Type 2: Pairwise Comparison (Compare Two Responses)

```python
PAIRWISE_JUDGE_PROMPT = """
You are an expert evaluator. Given a question, context, and TWO responses,
determine which response is better.

QUESTION: {question}
CONTEXT: {context}
RESPONSE A: {response_a}
RESPONSE B: {response_b}

Which response is better and why? Consider:
- Faithfulness to context
- Relevance to the question
- Completeness of the answer
- Clarity of explanation

Output as JSON:
{{
  "winner": "A" or "B" or "TIE",
  "confidence": <0.0-1.0>,
  "explanation": "<brief justification>"
}}
"""
```

> **Interview Tip:** _"Pairwise comparison tends to be more reliable than pointwise scoring because both humans and LLMs are better at relative comparison than at absolute scoring. I use pairwise for A/B testing model versions and pointwise for production monitoring."_

#### Type 3: Reference-Based (Compare Against Ground Truth)

```python
REFERENCE_JUDGE_PROMPT = """
Given a question, a reference answer (ground truth), and a candidate response,
evaluate how well the candidate matches the reference.

QUESTION: {question}
REFERENCE ANSWER: {reference}
CANDIDATE RESPONSE: {candidate}

Rate on a scale of 1-5:
1 = Completely different from reference (wrong answer)
2 = Some overlap but significant differences
3 = Captures the main idea but with notable omissions or additions
4 = Very close to reference with minor differences
5 = Semantically equivalent to reference

Output as JSON:
{{
  "score": <1-5>,
  "missing_from_candidate": "<what the candidate missed>",
  "extra_in_candidate": "<what the candidate added>",
  "explanation": "<brief justification>"
}}
"""
```

### 2.3 LLM-as-Judge Biases & Mitigations

| Bias | Description | Mitigation |
|------|-------------|------------|
| Position Bias | Prefers Response A (first listed) in pairwise comparisons | Randomize order, or run both orders |
| Verbosity Bias | Prefers longer, more detailed responses | Instruct the judge to value accuracy and conciseness over length |
| Self-Enhancement Bias | LLM prefers its own output | Use a different model as judge |
| Sycophancy Bias | Avoids giving low scores (wants to be "nice") | Calibrate with known-bad examples |
| Format Bias | Prefers well-formatted over accurate responses | Normalize format before judging |
| Anchoring Bias | First examples in few-shot influence all scores | Vary few-shot examples |

### 2.4 Position Bias Mitigation (Critical for Pairwise)

```python
async def pairwise_judge(question, context, response_a, response_b, judge_llm):
    """
    Mitigate position bias by running both orderings and checking consistency.
    """
    # Run 1: A first, B second
    result_ab = await judge_llm.evaluate(
        question=question, context=context,
        response_a=response_a, response_b=response_b
    )

    # Run 2: B first, A second (SWAP ORDER)
    result_ba = await judge_llm.evaluate(
        question=question, context=context,
        response_a=response_b, response_b=response_a  # Swapped
    )

    # Map the swapped run's verdict back to the original labels
    swapped = {"A": "B", "B": "A", "TIE": "TIE"}
    winner_ab = result_ab["winner"]
    winner_ba = swapped[result_ba["winner"]]

    if winner_ab == winner_ba:
        # Consistent: both orderings agree (including a genuine tie)
        return {"winner": winner_ab, "confidence": "high", "consistent": True}

    # Inconsistent: the verdict changed with the order, so position bias is likely
    return {"winner": "TIE", "confidence": "low", "consistent": False,
            "note": "Position bias detected, send to human review"}
```

### 2.5 Multi-Judge Consensus

```python
from statistics import mean, pstdev


async def multi_judge_evaluation(question, context, response):
    """
    Use multiple judges and take consensus to improve reliability.
    """
    # pointwise_prompt_v1, pointwise_prompt_v2, and evaluate_with_judge are
    # assumed to be defined elsewhere in the evaluation pipeline.
    judges = [
        {"model": "gpt-4o", "prompt": pointwise_prompt_v1},
        {"model": "claude-3.5-sonnet", "prompt": pointwise_prompt_v1},
        {"model": "gpt-4o", "prompt": pointwise_prompt_v2},  # Different prompt
    ]

    scores = []
    for judge in judges:
        score = await evaluate_with_judge(judge, question, context, response)
        scores.append(score)

    # Consensus: average scores, flag high-disagreement items for human review
    faithfulness = [s["faithfulness"] for s in scores]
    avg_faithfulness = mean(faithfulness)
    spread = pstdev(faithfulness)  # Standard deviation across judges

    return {
        "faithfulness": avg_faithfulness,
        "confidence": "high" if spread < 0.5 else "low",
        "needs_human_review": spread >= 0.5,
        "individual_scores": scores
    }
```

### 2.6 LLM-as-Judge Cost Analysis

```
Cost for 1,000 evaluations (list prices at the time of writing; they change):

Pointwise (GPT-4o judge):
  Input:  ~800 tokens/eval (question + context + response + prompt)
  Output: ~100 tokens/eval (structured score + explanation)
  Cost:   1,000 × (800 × $2.50/M + 100 × $10.00/M)
        = 1,000 × ($0.002 + $0.001)
        = $3.00 per 1,000 evaluations

Pairwise with position bias mitigation:
  2 calls per evaluation → ~$6.00 per 1,000 evaluations

Multi-judge (3 judges):
  3 calls per evaluation → ~$9.00 per 1,000 evaluations

Cost-optimized:
  Use GPT-4o-mini as judge:
  ≈ $0.30 per 1,000 evaluations (roughly 10x cheaper, ~85% accuracy vs GPT-4o)
```

---

## 3. Human Evaluation — Complete Guide

### 3.1 When Human Evaluation Is Essential

```
ALWAYS use human evaluation when:
  ✅ Building the initial evaluation pipeline (calibration)
  ✅ Regulated industries (medical, legal, financial)
  ✅ LLM-as-judge shows high variance (uncertainty)
  ✅ Evaluating subjective qualities (tone, empathy, clarity)
  ✅ Building/updating ground truth datasets
  ✅ Validating LLM-as-judge accuracy (meta-evaluation)

SKIP human evaluation when:
  ❌ High-volume production monitoring (too expensive)
  ❌ Binary/structured output validation (automated is better)
  ❌ A/B testing with clear metrics (automated is sufficient)
```

### 3.2 Human Evaluation Framework

```
┌─────────────────────────────────────────────────────────────────────────┐
│                        HUMAN EVALUATION PIPELINE                        │
│                                                                         │
│  ┌──────────┐   ┌──────────────┐   ┌──────────────┐   ┌────────────┐    │
│  │  SAMPLE  │──▶│  ANNOTATION  │──▶│   QUALITY    │──▶│ AGGREGATE  │    │
│  │ SELECTION│   │  GUIDELINES  │   │   CONTROL    │   │ & REPORT   │    │
│  └──────────┘   └──────────────┘   └──────────────┘   └────────────┘    │
│                                                                         │
│  What:          What:              What:              What:             │
│  • Random 1%    • Rubric with      • Inter-annotator  • Cohen's κ       │
│  • Edge cases     clear examples     agreement (IAA)  • Score dist.     │
│  • Low-conf       for each score   • Gold questions   • Trend charts    │
│    LLM-judge    • 1-5 scale per    • Annotator        • Action items    │
│    items          dimension          calibration                        │
└─────────────────────────────────────────────────────────────────────────┘
```

### 3.3 Annotation Rubric Design

```
FAITHFULNESS RUBRIC:

Score 5: Perfect Faithfulness
  Every claim in the response is directly supported by the context.
  Example: Context says "Founded 2015" → Response says "Founded in 2015" ✅

Score 4: Minor Gap
  All major claims supported, but includes trivial unsupported details.
  Example: "Founded in 2015" + "likely in California" (location not in context) ⚠️

Score 3: Mixed
  Some claims supported, some clearly unsupported.
  Example: "Founded in 2015 with 10,000 employees" (employee count not in context) ⚠️

Score 2: Mostly Unsupported
  Major claims are unsupported or contradict context.
  Example: "Founded in 2018" when context says 2015 ❌

Score 1: Complete Hallucination
  No claims are supported by the context.
  Example: Entirely fabricated response with no connection to context ❌

EDGE CASES:
  • "I don't know" → Score 5 for faithfulness (refusing to hallucinate is good)
  • Paraphrasing → Score 5 if semantically equivalent
  • Inference → Score 4 if reasonable inference, Score 3 if a stretch
```

### 3.4 Inter-Annotator Agreement (IAA)

```
Why IAA matters:
  If two humans can't agree on a score, your evaluation is unreliable.

Cohen's Kappa (κ) measures agreement between two annotators, corrected for chance:
  κ > 0.80      → Almost perfect agreement: your rubric is clear
  κ = 0.60–0.80 → Substantial agreement: acceptable
  κ = 0.40–0.60 → Moderate agreement: improve rubric
  κ < 0.40      → Fair/poor agreement: rubric needs major revision

How to calculate:
  1. Have 2+ annotators score the same 50–100 samples
  2. Calculate % agreement (p_o)
  3. Correct for chance agreement (p_e): κ = (p_o − p_e) / (1 − p_e)

Example:
  Annotator A scores: [5, 4, 3, 5, 2, 4, 5, 3, 4, 5]
  Annotator B scores: [5, 4, 4, 5, 2, 3, 5, 3, 4, 5]
  Raw agreement: 8/10 = 80%
  Chance agreement: 0.4² + 0.3² + 0.2² + 0.1² = 0.30
  Cohen's κ: (0.80 − 0.30) / (1 − 0.30) ≈ 0.71 → Substantial agreement ✅
```

### 3.5 Human Evaluation Cost & Scale

```
Typical human evaluation cost:
  • Internal domain experts: ~$50–100/hour → ~$2–4 per evaluation
  • Crowdworkers (e.g., Scale AI, Surge): ~$0.50–2.00 per evaluation
  • Specialized annotators (medical, legal): ~$5–15 per evaluation

At scale:
  500 evaluations/week × $2/eval = $1,000/week
  500 evaluations/week × $10/eval (medical) = $5,000/week

Budget tip:
  Use LLM-as-judge for 95% of evaluations, reserve human evaluation
  for low-confidence cases and calibration.
```

---

## 4. Retrieval Metrics

### 4.1 Why Measure Retrieval Separately?
```
A RAG system can fail at two points:

  Point 1: RETRIEVAL fails  → Wrong documents → Bad answer
  Point 2: GENERATION fails → Right documents → Bad answer

If you only measure the final answer, you can't tell WHERE the failure is.
Retrieval metrics isolate Point 1.
```

### 4.2 Recall@K

**What:** Of all relevant documents, how many did we find in the top K results?

```
Formula:
  Recall@K = |Relevant ∩ Retrieved@K| / |Relevant|

Example:
  Corpus has 5 relevant documents for query "refund policy"
  We retrieve top 5 results, and 3 of them are relevant
  Recall@5 = 3/5 = 0.6

Interpretation:
  We're missing 40% of relevant context.
  This means the LLM is working with incomplete information.

Benchmarks (rule-of-thumb targets):
  Recall@3:  Target > 0.70
  Recall@5:  Target > 0.80
  Recall@10: Target > 0.90
```

### 4.3 Precision@K

**What:** Of the documents we retrieved, how many are actually relevant?

```
Formula:
  Precision@K = |Relevant ∩ Retrieved@K| / K

Example:
  We retrieve top 5 results for "refund policy"
  3 of them are about refund policy, 2 are about shipping
  Precision@5 = 3/5 = 0.6

Interpretation:
  40% of context is noise.
  This dilutes the useful context and can confuse the LLM.

Benchmarks (rule-of-thumb targets):
  Precision@3:  Target > 0.80
  Precision@5:  Target > 0.70
  Precision@10: Target > 0.60
```

### 4.4 The Recall-Precision Trade-off

```
┌──────────────────────────────────────────────────────────────────────┐
│                    RECALL vs PRECISION TRADE-OFF                     │
│                                                                      │
│  High K (retrieve more) → Higher Recall, Lower Precision             │
│  Low K (retrieve fewer) → Lower Recall, Higher Precision             │
│                                                                      │
│  Precision                                                           │
│    ▲                                                                 │
│    │ ●                                                               │
│    │   ●                                                             │
│    │     ●                                                           │
│    │       ●    The "sweet spot" is typically K=3 to K=5             │
│    │         ●  for RAG systems                                      │
│    │           ●                                                     │
│    │              ●                                                  │
│    │                 ●    ●    ●    ●                                │
│    └──────────────────────────────▶ Recall                           │
│     K=1  K=3  K=5  K=10  K=20  K=50                                  │
│                                                                      │
│  RAG SWEET SPOT: K=3 to K=5 gives best balance                       │
│  More context = more cost + more noise + higher recall               │
└──────────────────────────────────────────────────────────────────────┘
```

### 4.5 Mean Reciprocal Rank (MRR)

**What:** How high is the first relevant document ranked?

```
Formula:
  MRR = (1/N) × Σ(1/rank_i)
  where rank_i = position of first relevant doc for query i

Example (3 queries):
  Query 1: First relevant doc at position 1 → 1/1 = 1.0
  Query 2: First relevant doc at position 3 → 1/3 = 0.33
  Query 3: First relevant doc at position 2 → 1/2 = 0.5

  MRR = (1.0 + 0.33 + 0.5) / 3 = 0.61

Interpretation:
  1/MRR ≈ 1.6, so the first relevant document typically appears around
  position 1.6 (a harmonic mean of ranks; the arithmetic mean rank here is 2.0).

Benchmarks (rule-of-thumb targets):
  MRR > 0.80    → Excellent (first relevant doc usually in top 1-2)
  MRR 0.50–0.80 → Good (first relevant doc usually in top 2-3)
  MRR < 0.50    → Poor (relevant docs buried too deep)
```

### 4.6 NDCG (Normalized Discounted Cumulative Gain)

**What:** Are the most relevant documents ranked highest? NDCG rewards having highly relevant documents at the top.

```
NDCG considers:
  1. Relevance has grades (not just binary relevant/not-relevant)
  2. Position matters: relevant docs at position 1 are worth more than at position 5

Formula:
  DCG@K  = Σ (relevance_i / log2(i + 1))  for i = 1 to K
  NDCG@K = DCG@K / Ideal DCG@K

Example:
  Retrieved: [rel=3, rel=0, rel=2, rel=1, rel=0]  (relevance on 0-3 scale)

  DCG@5 = 3/log2(2) + 0/log2(3) + 2/log2(4) + 1/log2(5) + 0/log2(6)
        = 3.0 + 0 + 1.0 + 0.43 + 0
        = 4.43

  Ideal: [rel=3, rel=2, rel=1, rel=0, rel=0]
  Ideal DCG@5 = 3.0 + 1.26 + 0.5 + 0 + 0 = 4.76

  NDCG@5 = 4.43 / 4.76 = 0.93

Benchmarks (rule-of-thumb targets):
  NDCG@5 > 0.85    → Excellent ranking quality
  NDCG@5 0.70–0.85 → Good
  NDCG@5 < 0.70    → Needs improvement (consider re-ranking)
```

### 4.7 Complete Retrieval Metrics Comparison

| Metric | Measures | Needs Labels | When to Use |
|--------|----------|--------------|-------------|
| Recall@K | Coverage: did we find all relevant docs? | Yes | Always |
| Precision@K | Purity: are all results relevant? | Yes | Always |
| MRR | Position: how quickly do we find something? | Yes | User-facing search |
| NDCG@K | Ranking quality: are best docs on top? | Yes (graded) | When ranking order matters |
| Hit Rate@K | Binary: is at least one relevant in top K? | Yes | Simple QA systems |
| Context Precision | Relevance of retrieved docs to the query | No (LLM evaluated) | RAGAS-based evaluation |

---

## 5. Hybrid Evaluation Strategies

### 5.1 The Hybrid Evaluation Architecture

In production, you combine all three techniques in a layered system:

```
┌──────────────────────────────────────────────────────────────────────┐
│                    HYBRID EVALUATION ARCHITECTURE                    │
│                                                                      │
│  ALL TRAFFIC (100%)                                                  │
│  ├──▶ Layer 1: PROGRAMMATIC CHECKS (100% of traffic)                 │
│  │     • Response length, format, schema validation                  │
│  │     • Token count anomaly detection                               │
│  │     • Refusal keyword detection                                   │
│  │     • Cost: $0.00 per check (string matching)                     │
│  │                                                                   │
│  ├──▶ Layer 2: RETRIEVAL METRICS (100% of traffic)                   │
│  │     • Context relevance (lightweight embedding similarity)        │
│  │     • Retrieved doc count                                         │
│  │     • Retrieval latency                                           │
│  │     • Cost: ~$0.001 per check                                     │
│  │                                                                   │
│  ├──▶ Layer 3: LLM-AS-JUDGE (15% of traffic, sampled)                │
│  │     • Faithfulness scoring                                        │
│  │     • Relevance scoring                                           │
│  │     • Hallucination detection                                     │
│  │     • Cost: ~$0.003 per evaluation                                │
│  │                                                                   │
│  └──▶ Layer 4: HUMAN REVIEW (0.5% of traffic + all flagged)          │
│        • Low-confidence LLM-judge items                              │
│        • Random sample for calibration                               │
│        • All user-reported issues                                    │
│        • Cost: ~$2.00 per evaluation                                 │
│                                                                      │
│  TOTAL COST EXAMPLE (10,000 req/day):                                │
│  L1: 10,000 × $0.00  = $0.00                                         │
│  L2: 10,000 × $0.001 = $10.00                                        │
│  L3:  1,500 × $0.003 = $4.50                                         │
│  L4:     50 × $2.00  = $100.00                                       │
│  ─────────────────────────────                                       │
│  TOTAL: $114.50/day for comprehensive evaluation                     │
└──────────────────────────────────────────────────────────────────────┘
```

### 5.2 Escalation Flow

```
Request completed
    │
    ▼
Layer 1: Programmatic check
    ├── FAIL (e.g., empty response) → Immediately flag + alert
    │
    └── PASS → Continue to Layer 2
            │
            ▼
        Layer 2: Retrieval metrics
            ├── FAIL (e.g., 0 relevant docs retrieved) → Alert retrieval team
            │
            └── PASS → Random sample 15%
                    │
                    ▼
                Layer 3: LLM-as-Judge (scores normalized to 0–1)
                    ├── FAIL (faithfulness < 0.5) → Auto-flag + human queue
                    ├── UNCERTAIN (0.5–0.7) → Human review queue
                    │
                    └── PASS (> 0.7) → Log score + continue
                            │
                            └── Random 0.5% → Layer 4: Human review (for calibration)
```

---

## 6. Evaluation Pipeline Architecture

### 6.1 End-to-End Evaluation System

```
┌──────────────────────────────────────────────────────────────────────┐
│                     PRODUCTION EVALUATION SYSTEM                     │
│                                                                      │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────┐                │
│  │   RAG    │───▶│  EVALUATION  │───▶│   METRICS    │                │
│  │ SERVICE  │    │    QUEUE     │    │    STORE     │                │
│  │          │    │ (Kafka/SQS)  │    │  (BigQuery)  │                │
│  └────┬─────┘    └──────┬───────┘    └──────┬───────┘                │
│       │                 │                   │                        │
│       │ Async           │ Workers           │ Dashboards             │
│       │ emit            │                   │                        │
│       ▼                 ▼                   ▼                        │
│  ┌──────────┐    ┌──────────────┐    ┌──────────────┐                │
│  │   USER   │    │     EVAL     │    │  GRAFANA /   │                │
│  │ RESPONSE │    │   WORKERS    │    │   LOOKER     │                │
│  │ (served) │    │              │    │              │                │
│  └──────────┘    │ • LLM Judge  │    │ • Score      │                │
│                  │ • Retrieval  │    │   trends     │                │
│                  │   metrics    │    │ • Alerts     │                │
│                  │ • Format     │    │ • Drill-down │                │
│                  │   checks     │    │              │                │
│                  └──────────────┘    └──────────────┘                │
│                                                                      │
│  KEY DESIGN DECISIONS:                                               │
│  • Evaluation is ASYNC: never blocks user response                   │
│  • All eval data stored in append-only BigQuery table                │
│  • Workers scale independently from serving                          │
│  • Evaluation failures don't affect production serving               │
└──────────────────────────────────────────────────────────────────────┘
```

### 6.2 Evaluation Data Schema

```python
# BigQuery table schema for evaluation results
evaluation_schema = {
    "request_id": "STRING",             # Unique request identifier
    "timestamp": "TIMESTAMP",           # When the evaluation ran
    "query": "STRING",                  # User query
    "response": "STRING",               # LLM response
    "context_chunks": "ARRAY<STRING>",  # Retrieved context
    "model_version": "STRING",          # Which model version
    "prompt_version": "STRING",         # Which prompt template

    # Evaluation scores
    "faithfulness_score": "FLOAT64",
    "relevance_score": "FLOAT64",
    "hallucination_rate": "FLOAT64",
    "completeness_score": "FLOAT64",

    # Retrieval metrics
    "context_precision": "FLOAT64",
    "retrieval_latency_ms": "INT64",
    "docs_retrieved": "INT64",

    # Metadata
    "eval_method": "STRING",            # "llm_judge", "human", "automated"
    "eval_model": "STRING",             # Which judge model
    "eval_latency_ms": "INT64",
    "eval_cost_usd": "FLOAT64",

    # Flags
    "flagged_for_review": "BOOLEAN",
    "human_reviewed": "BOOLEAN",
    "human_score": "FLOAT64",           # NULL until human reviews
}
```

---

## 7. Evaluation Anti-Patterns

### Common Mistakes to Avoid

| ❌ Anti-Pattern | ✅ Better Approach |
|----------------|-------------------|
| Only measuring user feedback (thumbs up/down) | Combine user feedback (lagging) with automated metrics (leading indicators) |
| Using the same model as both generator and judge | Use a different/stronger model as judge, or use different prompts |
| Evaluating 100% of traffic with LLM-as-judge (too $$) | Sample 10-20% for LLM-judge, 0.5-1% for human review |
| No calibration of LLM judge against humans | Calibrate against human annotations at least quarterly (200+ examples) |
| Only aggregate metrics (hiding failure pockets) | Segment by query type, user segment, topic, model version |
| Ignoring evaluation cost | Budget evaluation as 50-100% of primary LLM cost; it's not free |
| Static golden test set (stale tests) | Update golden set quarterly as product evolves. Add real failure cases. |
| Not evaluating retrieval separately from generation | Always separate retrieval eval from generation eval, since they need different fixes |

---

## 8. Technique Selection Decision Framework

```
"Which evaluation technique should I use?"
├── Measuring retrieval quality?
│   └── YES → Retrieval metrics (Recall@K, Precision@K, MRR)
│
├── Measuring output quality at scale?
│   └── YES → LLM-as-Judge (sampled 10-20% of traffic)
│
├── High-stakes domain (medical, legal, financial)?
│   └── YES → Human evaluation + LLM-as-Judge pre-screen
│
├── Comparing two model versions (A/B testing)?
│   └── YES → Pairwise LLM-as-Judge + human tiebreaker
│
├── Building initial evaluation pipeline?
│   └── YES → Start with human evaluation → calibrate LLM-as-Judge
│
├── Debugging a specific failure?
│   └── YES → Human review + detailed trace analysis
│
└── Budget-constrained?
    └── YES → Programmatic checks (100%) + LLM-as-Judge on GPT-4o-mini (5%)
```

---

## 9. Interview Deep Dive: Conversation Flow

### Scenario: "Walk me through your evaluation strategy"

**Interviewer:** _"You built a RAG system that answers customer questions. How do you evaluate it?"_

**Your Answer:**

> *"I use a three-layer evaluation strategy combining automated metrics, LLM-as-judge, and human review.*
>
> *First, **retrieval metrics**: I measure Recall@5 and Precision@5 to ensure we're finding the right documents. If Recall@5 drops below 0.80, I know the problem is in search, not generation. I calculate these against a labeled test set of 300+ queries with annotated relevant documents.*
>
> *Second, **LLM-as-judge**: I run GPT-4o as a judge on 15% of production traffic using a pointwise scoring rubric that evaluates faithfulness, relevance, and hallucination rate. Each evaluation costs about $0.003, so for our 10K daily requests, that's $4.50/day. I calibrate the judge against human annotations quarterly; the last calibration showed 87% agreement on faithfulness scoring.*
>
> *Third, **human review**: 0.5% of traffic goes to human reviewers, plus all cases where the LLM judge has low confidence (normalized score between 0.5 and 0.7). Human reviewers use a detailed rubric with 1-5 scores. I track inter-annotator agreement with Cohen's Kappa, and we maintain κ > 0.75.*
>
> *All scores feed into BigQuery, visualized in Grafana with alerts when any metric drops below threshold for more than 30 minutes."*

**Follow-Up:** _"How accurate is GPT-4o as a judge?"_

> *"On our domain, GPT-4o agrees with human annotators about 87% of the time for faithfulness and 82% for relevance. I mitigate known biases: position bias in pairwise comparisons by running both orderings, and verbosity bias by instructing the judge to value accuracy over length. When the judge and humans disagree, I analyze the pattern and update the judge prompt."*

---

## 10. Follow-Up Questions & Answers

### Q1: "What if you can't afford human evaluation?"

**Answer:** _"Start with the cheapest viable approach: (1) Programmatic checks for 100% of traffic, which is free. (2) GPT-4o-mini as judge on 5% of traffic, at about $0.30/1000 evals. (3) Personally review 20–50 random responses per week as a 'one-person human eval.' This costs ~$0.50/day and, in my experience, catches most of what a full human eval program would (roughly 80%)."_

### Q2: "How do you create a labeled test set for retrieval metrics?"

**Answer:** _"Three approaches: (1) Start with 50 common user queries and manually annotate which documents should be retrieved. This takes one afternoon. (2) Use query logs, since real user queries are the best test set. Have someone mark the relevant documents. (3) Use LLM-assisted annotation: have GPT-4o label relevance, then spot-check 10% manually. I aim for 300+ labeled examples covering different topics and complexity levels."_

### Q3: "MRR vs NDCG — when to use which?"

**Answer:** _"MRR cares about the **first** relevant result, which is great when users need one good answer fast (like a chatbot). NDCG cares about the **ordering of all results**, which is great when we pass multiple chunks to the LLM. For RAG, I use NDCG because we retrieve K=5 chunks and ALL of them affect generation quality, not just the first one."_

### Q4: "How often should you recalibrate LLM-as-judge?"

**Answer:** _"Quarterly recalibration minimum, and immediately after any major change: new model deployment, new prompt template, new domain added, or new judge model. Each calibration requires 200+ human-annotated examples. I compare judge scores against human scores and retune the judge prompt until agreement exceeds 85%."_

---

## 🔑 Key Takeaways for Interview Day

1. **Name all three techniques**: LLM-as-Judge, Human Evaluation, Retrieval Metrics
2. **Know when to use each**: scale vs accuracy vs cost trade-off
3. **Describe the hybrid approach**: combine all three in production
4. **Mention position bias mitigation**: shows deep understanding of LLM-as-judge
5. **Know your retrieval metrics**: Recall@K, Precision@K, MRR, NDCG with formulas
6. **Talk about calibration**: LLM judge calibrated against human annotations
7. **Budget the evaluation**: evaluation costs money, plan for it
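
### Appendix: Retrieval Metrics in Code

As a companion to takeaway 5, the formulas from Section 4 (Recall@K, Precision@K, MRR, NDCG@K) can be sketched in plain Python. This is a minimal sketch, not a reference implementation: the function names and document IDs are hypothetical, the retrieved list is assumed to contain no duplicates, and the ideal DCG is built only from the graded labels of the retrieved list, as in the worked NDCG example.

```python
import math


def recall_at_k(retrieved, relevant, k):
    """Recall@K = |Relevant ∩ Retrieved@K| / |Relevant|."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Precision@K = |Relevant ∩ Retrieved@K| / K."""
    return len(set(retrieved[:k]) & set(relevant)) / k


def mean_reciprocal_rank(first_relevant_ranks):
    """MRR over queries; use None when a query has no relevant result."""
    reciprocals = [1.0 / rank if rank else 0.0 for rank in first_relevant_ranks]
    return sum(reciprocals) / len(reciprocals)


def dcg_at_k(relevances, k):
    """DCG@K = sum(rel_i / log2(i + 1)) for positions i = 1..K."""
    return sum(rel / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances, k):
    """NDCG@K, with the ideal ordering built from the same graded labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical document IDs mirroring the worked examples in Section 4
retrieved = ["refund-1", "shipping-1", "refund-2", "shipping-2", "refund-3"]
relevant = {"refund-1", "refund-2", "refund-3", "refund-4", "refund-5"}

print(round(recall_at_k(retrieved, relevant, 5), 2))     # 0.6
print(round(precision_at_k(retrieved, relevant, 5), 2))  # 0.6
print(round(mean_reciprocal_rank([1, 3, 2]), 2))         # 0.61
print(round(ndcg_at_k([3, 0, 2, 1, 0], 5), 2))           # 0.93
```

The printed values match the worked examples in Sections 4.2 to 4.6, which makes the sketch a quick sanity check before wiring the same functions into an evaluation worker. In a real pipeline the ideal DCG would usually be computed over all labeled documents for the query, not only the retrieved ones.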
