Lesson 01: Evaluation Frameworks Overview

Module 07: Evaluation and Testing

Overview

Evaluating LLM applications is harder than evaluating traditional software: outputs are non-deterministic, quality is subjective, and ground truth is expensive to collect. This lesson establishes a taxonomy of evaluation approaches and the frameworks that implement them.

Learning Objectives

Classify evaluation types (automated, LLM-as-judge, human) and select by use case
Compare evaluation frameworks (RAGAS, DeepEval, promptfoo, LangSmith, Braintrust)
Design a two-phase evaluation strategy (offline golden set + online monitoring)
Create a golden dataset structure with proper annotation guidelines
Define quality thresholds that serve as deployment gates

1. Why LLM Evaluation Is Hard

The ground truth problem: For a question like "Explain RAG", there are hundreds of correct answers. Metrics that assume a single correct label (accuracy, precision/recall) don't apply directly to free-form text.

The distribution problem: Your evaluation set may not represent real-world traffic. Users ask questions you didn't anticipate.

The non-determinism problem: At temperature > 0, the same prompt can produce a different output on every call, so a single run is a noisy estimate of quality.

The judge problem: Using LLMs to judge LLM outputs introduces its own biases (model self-preference, verbosity bias).
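One common way to handle the non-determinism problem is to score several samples per prompt and report the spread rather than a single number. The sketch below assumes two hypothetical callables, `generate(prompt)` and `score(output)`, that stand in for the application under test and whatever metric is in use; neither comes from a specific library.

```python
import statistics
from typing import Callable

def score_over_samples(
    generate: Callable[[str], str],   # hypothetical: calls the LLM app once
    score: Callable[[str], float],    # hypothetical: any metric returning a number
    prompt: str,
    n_samples: int = 5,
) -> dict:
    """Generate n_samples outputs for one prompt and summarize their scores."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    scores = [score(generate(prompt)) for _ in range(n_samples)]
    return {
        "mean": statistics.mean(scores),
        # Population stdev is 0.0 for a single sample; a large value flags an unstable prompt
        "stdev": statistics.pstdev(scores),
        "min": min(scores),
        "max": max(scores),
    }
```

A prompt whose scores vary widely across samples probably deserves more samples, or a lower temperature, before its mean is trusted.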

2. Evaluation Type Taxonomy

Automated Metrics (Fast, Cheap)

Metric	What It Measures	Good For
Exact Match	Output == expected output	JSON extraction, classification
F1 / ROUGE	Token or n-gram overlap between generated and reference text	Summarization (rough)
BERTScore	Semantic similarity to reference	Free-form text quality
BLEU	N-gram overlap	Translation (poorly suited to open-ended LLM output)

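# Exact match evaluation (classification)
def evaluate_classification(predictions: list[str], ground_truth: list[str]) -> dict:
    """Case- and whitespace-insensitive exact match accuracy."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    # Normalize both sides so "Positive " and "positive" count as a match
    correct = sum(p.lower().strip() == gt.lower().strip() for p, gt in zip(predictions, ground_truth))
    total = len(predictions)
    return {
        "accuracy": correct / total if total else 0.0,  # avoid ZeroDivisionError on empty input
        "correct": correct,
        "total": total,
    }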
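For the F1 row in the table, a token-level F1 can be sketched in a few lines. This is a minimal version that assumes lowercasing and whitespace tokenization; SQuAD-style scorers also strip punctuation and articles, which is omitted here. The function name `token_f1` is illustrative.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and a single reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty side counts as a miss
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: a shared token counts only as often as it appears on both sides
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Unlike exact match, this gives partial credit: a prediction that covers two of three reference tokens with no extra tokens scores 0.8 rather than 0.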

LLM-as-Judge (Flexible, Moderate Cost)

Use a capable LLM (GPT-4o, Claude Sonnet) to score outputs on a rubric:

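import re
from typing import Optional

def parse_score(response: str) -> Optional[int]:
    """Return the first standalone digit 1-5 in the judge response, or None (simple heuristic)."""
    match = re.search(r"\b([1-5])\b", response)
    return int(match.group(1)) if match else None

def llm_judge(question: str, answer: str, criteria: str, judge_llm) -> dict:
    """Use an LLM to evaluate an answer against criteria.

    judge_llm is a placeholder for any client exposing complete(prompt) -> str.
    """
    judge_prompt = f"""Score this answer on a scale of 1-5 for {criteria}.

Question: {question}
Answer: {answer}

Score (1-5) and brief reason:"""

    # Call the LLM judge, then parse the score; the full response is kept as the reason
    response = judge_llm.complete(judge_prompt)
    return {"score": parse_score(response), "reason": response}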

LLM-as-judge biases to be aware of:

Verbosity bias: longer answers rated higher regardless of quality
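Self-preference: a judge model tends to rate outputs from its own model family higher (Claude favors Claude-style answers, GPT favors GPT-style)
Position bias: in pairwise comparisons, the answer shown first tends to be rated higher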
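Position bias in pairwise comparisons is often mitigated by judging each pair twice with the order swapped and accepting only verdicts that survive the swap. The sketch below assumes a hypothetical `compare(question, first, second)` callable that returns the string "first" or "second"; it stands in for whatever judge call is actually used.

```python
from typing import Callable

def compare_both_orders(
    compare: Callable[[str, str, str], str],  # hypothetical judge: returns "first" or "second"
    question: str,
    answer_a: str,
    answer_b: str,
) -> str:
    """Judge a pair twice with the order swapped; return "a", "b", or "tie"."""
    first_pass = compare(question, answer_a, answer_b)   # A shown first
    second_pass = compare(question, answer_b, answer_a)  # B shown first
    for verdict in (first_pass, second_pass):
        if verdict not in ("first", "second"):
            raise ValueError(f"unexpected judge verdict: {verdict!r}")
    a_wins_first = first_pass == "first"
    a_wins_second = second_pass == "second"
    if a_wins_first and a_wins_second:
        return "a"
    if not a_wins_first and not a_wins_second:
        return "b"
    # The verdict flipped with the ordering, so treat it as position-driven
    return "tie"
```

This doubles the judge cost per pair, but a high tie rate is itself a useful signal that the judge is not discriminating on content.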

Human Evaluation (Ground Truth, Expensive)

Use for: complex reasoning, subjective quality, safety-critical assessments.

Annotation tools: Label Studio (open-source), Argilla (NLP-focused), Scale AI (managed).

Best practices:

At least two annotators per item; measure inter-annotator agreement and target Cohen's kappa ≥ 0.7
Clear annotation guidelines with examples
Regular calibration sessions to prevent annotation drift
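Cohen's kappa corrects raw agreement for the agreement two annotators would reach by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal pure-Python sketch for two annotators follows; the function name is illustrative, and scikit-learn's `cohen_kappa_score` offers a maintained implementation if that dependency is available.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # p_o: fraction of items where both annotators chose the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # p_e: chance agreement from each annotator's own label frequencies
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined, report 1.0
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, annotators who agree on 3 of 4 items can still land at kappa = 0.5 once chance agreement is removed, which is below the 0.7 target.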

3. Framework Comparison

Framework	Best For	Integration	License
RAGAS	RAG evaluation (retrieval + generation)	Python	Apache 2.0
DeepEval	General LLM testing with pytest	pytest	Apache 2.0
promptfoo	YAML test suites, red-teaming, CLI	CLI/CI	MIT
LangSmith	LangChain apps, dataset management	LangChain	Commercial
Braintrust	Score tracking over time, A/B testing	Python	Commercial

4. Evaluation Strategy Design

Offline Evaluation (Pre-Deployment Gate)

# Golden dataset entries (stored in evaluation/golden_dataset.json), shown here as a Python literal
golden_dataset = [
    {
        "id": "q001",
        "category": "retrieval_qa",
        "difficulty": "easy",
        "question": "What chunking strategies are recommended for RAG?",
        "reference_answer": "Recursive character splitting, semantic chunking, and parent-document retrieval are the main production chunking strategies.",
        "required_context_concepts": ["chunking", "recursive", "semantic"],
        "tags": ["rag", "chunking"]
    },
    {
        "id": "q002",
        "category": "reasoning",
        "difficulty": "hard",
        "question": "When should you use Qdrant vs ChromaDB in production?",
        "reference_answer": "ChromaDB for development and small scale (<500K vectors); Qdrant for high-performance self-hosted production with rich filtering.",
        "required_context_concepts": ["qdrant", "chromadb", "production"],
        "tags": ["vector_db", "selection"]
    },
]
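To turn a golden set into a deployment gate, per-item scores are aggregated and compared against minimum thresholds. The sketch below shows one possible shape: a substring-based check against the `required_context_concepts` field, plus a gate function. The threshold values and the metric names `concept_coverage` and `judge_score` are hypothetical and should be tuned against a baseline run of the actual system.

```python
def concept_coverage(required_concepts: list[str], contexts: list[str]) -> float:
    """Fraction of required concepts that appear in any retrieved context (substring match)."""
    if not required_concepts:
        return 1.0
    joined = " ".join(contexts).lower()
    hits = sum(concept.lower() in joined for concept in required_concepts)
    return hits / len(required_concepts)

def passes_gate(mean_scores: dict[str, float], thresholds: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, failures); a metric missing from mean_scores counts as a failure."""
    failures = [
        f"{metric}: {mean_scores.get(metric)} < {minimum}"
        for metric, minimum in thresholds.items()
        if mean_scores.get(metric) is None or mean_scores[metric] < minimum
    ]
    return (not failures, failures)

# Hypothetical thresholds, not recommendations
THRESHOLDS = {"concept_coverage": 0.8, "judge_score": 4.0}
```

In CI, the gate would typically exit non-zero when `passes_gate` returns failures, so a regression on the golden set blocks the release.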

Online Evaluation (Post-Deployment Monitoring)

import random

def sample_for_evaluation(request: dict, response: dict, sample_rate: float = 0.05) -> bool:
    """Sample ~5% of production traffic for evaluation.

    request and response are currently unused; sampling is uniform.
    """
    return random.random() < sample_rate

def log_for_evaluation(request: dict, response: dict, eval_queue) -> None:
    """Enqueue a sampled response for async LLM-as-judge evaluation.

    eval_queue is a placeholder for any queue client exposing enqueue(item).
    """
    if sample_for_evaluation(request, response):
        eval_queue.enqueue({
            "question": request["question"],
            "answer": response["answer"],
            "contexts": response["retrieved_chunks"],
            "timestamp": response["timestamp"]
        })
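Once sampled responses have been scored by the judge, the scores still need to be watched over time. One simple approach is a rolling mean with an alert level, sketched below. The class name, `window_size`, and `alert_below` are hypothetical choices for illustration, not values from any framework.

```python
from collections import deque

class RollingScoreMonitor:
    """Track the mean of the most recent judge scores and flag drops."""

    def __init__(self, window_size: int = 100, alert_below: float = 3.5):
        # window_size and alert_below are hypothetical defaults, not recommendations
        self.scores = deque(maxlen=window_size)  # oldest scores fall off automatically
        self.alert_below = alert_below

    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    def record(self, score: float) -> bool:
        """Add a score; return True if the window is full and its mean is below the alert level."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and self.mean() < self.alert_below
```

Waiting for a full window before alerting avoids paging on the first few low scores after a deploy, at the cost of a slower reaction.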

Key Takeaways

Combine automated metrics (fast, cheap) + LLM-as-judge (flexible) + human review (ground truth)
Golden datasets of 100-200 questions are the foundation of reliable offline evaluation
LLM-as-judge biases (verbosity, self-preference) must be mitigated with diverse judge models and rubric design
Offline evaluation gates deployment; online evaluation monitors production quality
RAGAS is specialized for RAG; DeepEval for general LLM testing with pytest integration

Next Steps

Continue to Lesson 02: DeepEval and Promptfoo.
