**Module 07: Evaluation and Testing**
# Lesson 01: Evaluation Frameworks Overview
---
## Overview
Evaluating LLM applications is harder than evaluating traditional software: outputs are non-deterministic, quality is subjective, and ground truth is expensive to collect. This lesson establishes a taxonomy of evaluation approaches and the frameworks that implement them.
---
## Learning Objectives
1. Classify evaluation types (automated, LLM-as-judge, human) and select by use case
2. Compare evaluation frameworks (RAGAS, DeepEval, promptfoo, LangSmith, Braintrust)
3. Design a two-phase evaluation strategy (offline golden set + online monitoring)
4. Create a golden dataset structure with proper annotation guidelines
5. Define quality thresholds that serve as deployment gates
---
## 1. Why LLM Evaluation Is Hard
**The ground truth problem**: For a question like "Explain RAG", there are hundreds of correct answers. Traditional accuracy metrics (precision/recall) don't apply directly.
**The distribution problem**: Your evaluation set may not represent real-world traffic. Users ask questions you didn't anticipate.
**The non-determinism problem**: The same prompt produces different outputs at temperature > 0.
**The judge problem**: Using LLMs to judge LLM outputs introduces its own biases (model self-preference, verbosity bias).
---
## 2. Evaluation Type Taxonomy
### Automated Metrics (Fast, Cheap)
| Metric | What It Measures | Good For |
|--------|-----------------|---------|
| Exact Match | Output == expected output | JSON extraction, classification |
| F1 / ROUGE | Overlap between generated and reference | Summarization (rough) |
| BERTScore | Semantic similarity to reference | Free-form text quality |
| BLEU | N-gram overlap | Translation (largely superseded for LLM evaluation) |
```python
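# Exact match evaluation (classification)
def evaluate_classification(predictions: list[str], ground_truth: list[str]) -> dict:
    """Case- and whitespace-insensitive exact-match accuracy."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    correct = sum(p.lower().strip() == gt.lower().strip() for p, gt in zip(predictions, ground_truth))
    total = len(predictions)
    return {
        # Report 0.0 for an empty input instead of dividing by zero
        "accuracy": correct / total if total else 0.0,
        "correct": correct,
        "total": total,
    }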
```
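The F1 row in the table can be made concrete with a token-level variant. The sketch below is a minimal illustration, assuming whitespace tokenization and lowercasing; the function name `token_f1` is hypothetical, and real benchmarks usually add their own normalization (punctuation stripping, stemming).

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a prediction and a reference (whitespace tokens, lowercased)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a match; one empty string counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Unlike exact match, this gives partial credit: a prediction that covers two of three reference tokens with no extra tokens scores about 0.8.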
### LLM-as-Judge (Flexible, Moderate Cost)
Use a capable LLM (GPT-4o, Claude Sonnet) to score outputs on a rubric:
```python
# judge_llm and parse_score are placeholders for your LLM client and score parser
def llm_judge(question: str, answer: str, criteria: str) -> dict:
    """Use an LLM to evaluate an answer against criteria."""
    # Prompt lines stay flush-left so no stray indentation reaches the judge
    judge_prompt = f"""Score this answer on a scale of 1-5 for {criteria}.
Question: {question}
Answer: {answer}
Score (1-5) and brief reason:"""
    # Call LLM judge
    response = judge_llm.complete(judge_prompt)
    # Parse score and reason from response
    return {"score": parse_score(response), "reason": response}
```
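The judge snippet above calls `parse_score` without defining it. One possible implementation is sketched below, assuming the judge replies in free text that contains an integer score; the `low` and `high` parameters and the handling of an echoed "1-5" range are illustrative choices, not part of any library.

```python
import re
from typing import Optional


def parse_score(response: str, low: int = 1, high: int = 5) -> Optional[int]:
    """Return the first integer in [low, high] found in the judge's reply, else None."""
    # Drop an echoed scale such as "(1-5)" so its endpoints are not mistaken for the score.
    cleaned = re.sub(rf"\b{low}\s*-\s*{high}\b", "", response)
    for match in re.finditer(r"\b(\d+)\b", cleaned):
        value = int(match.group(1))
        if low <= value <= high:
            return value
    # No usable score: let the caller decide whether to retry or discard.
    return None
```

Returning `None` rather than a default score keeps unparseable replies out of the averages. Asking the judge for structured output (for example, JSON) is usually more robust than regex parsing.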
**LLM-as-judge biases to be aware of**:
- Verbosity bias: longer answers rated higher regardless of quality
- Self-preference: judges tend to favor answers in their own model family's style (Claude prefers Claude-style answers, GPT prefers GPT-style)
- Position bias: first answer in a comparison rated higher
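
Position bias in pairwise comparisons is commonly reduced by judging each pair in both orders and accepting a winner only when the two verdicts agree. The sketch below assumes a hypothetical `judge` callable that returns `"first"`, `"second"`, or `"tie"`; it is not tied to any particular framework.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" | "second" | "tie"
PairwiseJudge = Callable[[str, str, str], str]


def compare_with_swap(question: str, answer_a: str, answer_b: str, judge: PairwiseJudge) -> str:
    """Judge the pair in both orders; report "a" or "b" only if both orders agree, else "tie"."""
    forward = judge(question, answer_a, answer_b)   # A is shown first
    backward = judge(question, answer_b, answer_a)  # B is shown first
    # Map positional verdicts back to the underlying answers.
    forward_winner = {"first": "a", "second": "b"}.get(forward, "tie")
    backward_winner = {"first": "b", "second": "a"}.get(backward, "tie")
    return forward_winner if forward_winner == backward_winner else "tie"
```

A judge that always picks the first answer yields `"tie"` here, which exposes the bias instead of letting it decide the comparison. The cost is two judge calls per pair.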
### Human Evaluation (Ground Truth, Expensive)
Use for: complex reasoning, subjective quality, safety-critical assessments.
**Annotation tools**: Label Studio (open-source), Argilla (NLP-focused), Scale AI (managed).
**Best practices**:
- 2+ annotators per item; measure inter-annotator agreement (e.g., Cohen's kappa ≥ 0.7 for two annotators)
- Clear annotation guidelines with examples
- Regular calibration sessions to prevent annotation drift
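
Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). A minimal pure-Python sketch for two annotators and categorical labels follows; the function name is illustrative, and a library routine such as scikit-learn's `cohen_kappa_score` would serve the same purpose.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # p_o: share of items on which the annotators gave the same label
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: chance agreement from each annotator's label frequencies
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label everywhere; kappa is undefined, treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, if two annotators agree on 3 of 4 items and chance agreement is 0.5, kappa is 0.5, which would fall short of a 0.7 target and suggest the guidelines need tightening.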
---
## 3. Framework Comparison
| Framework | Best For | Integration | License |
|-----------|---------|-------------|---------|
| RAGAS | RAG evaluation (retrieval + generation) | Python | MIT |
| DeepEval | General LLM testing with pytest | pytest | MIT |
| promptfoo | YAML test suites, red-teaming, CLI | CLI/CI | MIT |
| LangSmith | LangChain apps, dataset management | LangChain | Commercial |
| Braintrust | Score tracking over time, A/B testing | Python | Commercial |
---
## 4. Evaluation Strategy Design
### Offline Evaluation (Pre-Deployment Gate)
```python
# evaluation/golden_dataset.json
golden_dataset = [
    {
        "id": "q001",
        "category": "retrieval_qa",
        "difficulty": "easy",
        "question": "What chunking strategies are recommended for RAG?",
        "reference_answer": "Recursive character splitting, semantic chunking, and parent-document retrieval are the main production chunking strategies.",
        "required_context_concepts": ["chunking", "recursive", "semantic"],
        "tags": ["rag", "chunking"]
    },
    {
        "id": "q002",
        "category": "reasoning",
        "difficulty": "hard",
        "question": "When should you use Qdrant vs ChromaDB in production?",
        "reference_answer": "ChromaDB for development and small scale (<500K vectors); Qdrant for high-performance self-hosted production with rich filtering.",
        "required_context_concepts": ["qdrant", "chromadb", "production"],
        "tags": ["vector_db", "selection"]
    },
]
```
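To turn a golden-set run into a deployment gate, per-category scores can be compared against minimum thresholds. The sketch below assumes each result is a dict with a `category` (matching the golden dataset) and a `score` between 0 and 1 from whichever metric or judge you use; the threshold values and the function name are hypothetical and should be calibrated on your own data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical thresholds: minimum mean score (0-1) required per golden-set category.
THRESHOLDS = {"retrieval_qa": 0.80, "reasoning": 0.70}


def check_deployment_gate(results: list[dict], thresholds: dict[str, float]) -> dict:
    """Return whether every category meets its minimum mean score, plus any failures."""
    scores_by_category = defaultdict(list)
    for result in results:
        scores_by_category[result["category"]].append(result["score"])

    failures = {}
    for category, minimum in thresholds.items():
        scores = scores_by_category.get(category)
        # A category with no results fails the gate rather than passing silently.
        observed = mean(scores) if scores else None
        if observed is None or observed < minimum:
            failures[category] = {"observed": observed, "required": minimum}
    return {"passed": not failures, "failures": failures}
```

In CI, a failing gate would typically exit non-zero so the deployment step does not run. Gating per category rather than on one global average keeps a strong easy category from masking a regression in a hard one.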
### Online Evaluation (Post-Deployment Monitoring)
```python
import random


def sample_for_evaluation(request: dict, response: dict, sample_rate: float = 0.05) -> bool:
    """Sample ~5% of production traffic for evaluation."""
    return random.random() < sample_rate


# Inside the request handler: request, response, and eval_queue come from your serving code.
# Log sampled responses for async LLM-as-judge evaluation
if sample_for_evaluation(request, response):
    eval_queue.enqueue({
        "question": request["question"],
        "answer": response["answer"],
        "contexts": response["retrieved_chunks"],
        "timestamp": response["timestamp"]
    })
```
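Once sampled responses have been scored asynchronously, the scores can feed a simple drift check. The sketch below is one way to do that, assuming 1-5 judge scores; the class name, window size, and alert threshold are all hypothetical.

```python
from collections import deque


class RollingQualityMonitor:
    """Track the most recent judge scores and flag when their mean drops too low."""

    def __init__(self, window_size: int = 100, alert_below: float = 3.5):
        # deque with maxlen discards the oldest score automatically
        self.scores = deque(maxlen=window_size)
        self.alert_below = alert_below

    def record(self, score: float) -> bool:
        """Add a score; return True when the window is full and its mean is below the threshold."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        return window_full and sum(self.scores) / len(self.scores) < self.alert_below
```

Waiting for a full window before alerting avoids paging on the first few low scores after a restart.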
---
## Key Takeaways
- Combine automated metrics (fast, cheap) + LLM-as-judge (flexible) + human review (ground truth)
- Golden datasets of 100-200 questions are the foundation of reliable offline evaluation
- LLM-as-judge biases (verbosity, self-preference) must be mitigated with diverse judge models and rubric design
- Offline evaluation gates deployment; online evaluation monitors production quality
- RAGAS is specialized for RAG; DeepEval for general LLM testing with pytest integration
---
## Further Reading
- [RAGAS Documentation](https://docs.ragas.io/)
- [DeepEval Documentation](https://docs.confident-ai.com/)
- [Chatbot Arena (LMSYS)](https://chat.lmsys.org/): human preference evaluation
- [Patterns for Building LLM-based Systems (Eugene Yan)](https://eugeneyan.com/writing/llm-patterns/)
---
## Next Steps
Continue to **[Lesson 02: DeepEval and Promptfoo](02_deepeval_and_promptfoo.md)**.