# RAG Evaluation Guide
This guide explains how to evaluate the RAG (Retrieval-Augmented Generation) performance of the Clarity and Rigor agents using different retriever configurations.
## Overview
The evaluation system measures:
- **Retrieval Quality**: How well the system retrieves relevant context from the vector database
- **Answer Quality**: How well the agent generates responses using retrieved context
- **Agent Performance**: Comparison across different retriever configurations (Naive, BM25, Cohere Rerank)
## Evaluation Metrics
### RAGAS Metrics (Standard)
1. **Faithfulness**: Whether the agent's answer is grounded in the retrieved contexts
2. **Answer Relevancy**: Whether the answer addresses the input question appropriately
### Custom Retrieval Metrics
3. **Context Precision**: Percentage of retrieved contexts that are actually relevant (word overlap with reference contexts)
4. **Context Recall**: Percentage of reference contexts successfully retrieved
5. **Context F1**: Harmonic mean of precision and recall
**Note**: We use custom context metrics because RAGAS 0.3+ context metrics don't work well when agent outputs differ in format from reference answers (e.g., issue descriptions vs. suggested fixes).
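The word-overlap idea behind these custom metrics can be sketched roughly as follows. This is a minimal illustration and not the code in `custom_retrieval_metrics.py`: the tokenizer, the `0.5` overlap threshold, and the function names are all assumptions made for the example.

```python
import re


def _tokens(text: str) -> set[str]:
    """Lowercase word set used for overlap comparison (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _overlap(candidate: str, reference: str) -> float:
    """Fraction of the reference's words that also appear in the candidate."""
    ref = _tokens(reference)
    if not ref:
        return 0.0
    return len(ref & _tokens(candidate)) / len(ref)


def context_scores(
    retrieved: list[str], references: list[str], threshold: float = 0.5
) -> dict[str, float]:
    """Word-overlap precision, recall, and F1 (the threshold is an assumption)."""
    # A retrieved chunk counts as relevant if it overlaps enough with any reference.
    relevant = sum(
        any(_overlap(chunk, ref) >= threshold for ref in references)
        for chunk in retrieved
    )
    # A reference counts as recalled if at least one retrieved chunk covers it.
    recalled = sum(
        any(_overlap(chunk, ref) >= threshold for chunk in retrieved)
        for ref in references
    )
    precision = relevant / len(retrieved) if retrieved else 0.0
    recall = recalled / len(references) if references else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"context_precision": precision, "context_recall": recall, "context_f1": f1}


# Toy data (made up for illustration): 1 relevant chunk among k=8 retrieved.
references = ["Define every acronym on first use."]
retrieved = ["Define every acronym on first use in the abstract."] + [
    f"unrelated filler chunk {i}" for i in range(7)
]
scores = context_scores(retrieved, references)
# context_precision=0.125, context_recall=1.0, context_f1≈0.2222
```

With a single reference context and `k=8`, precision cannot exceed 0.125 under this definition even when recall is perfect, which is worth keeping in mind when comparing retrievers.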
---
## Quick Start
### 1. Install Dependencies
```bash
uv sync
```
This installs: `ragas`, `datasets`, `matplotlib`, `seaborn`, and other required packages.
### 2. Run Evaluation
**Evaluate both agents with default retriever (naive, k=8):**
```bash
python eval/evaluate_rag_performance.py --evaluator all
```
**Evaluate specific agent:**
```bash
python eval/evaluate_rag_performance.py --evaluator clarity
python eval/evaluate_rag_performance.py --evaluator rigor
```
**Test with limited samples (faster):**
```bash
python eval/evaluate_rag_performance.py --evaluator clarity --num-samples 2
```
**Override retriever parameters:**
```bash
# Change k value (number of chunks retrieved)
python eval/evaluate_rag_performance.py --evaluator clarity --retriever-k 10
# Specify retriever type
python eval/evaluate_rag_performance.py --evaluator clarity --retriever-type cohere_rerank --retriever-k 8
```
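For readers wondering how these flags might interact with `app/config.py`, here is a hedged sketch using `argparse`. The flag names come from the commands above; the defaults, the `resolve_retriever` helper, and the rule that CLI values take precedence over the config file are assumptions, not a description of the actual script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the flags used in this guide (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="RAG evaluation (illustrative sketch)")
    parser.add_argument("--evaluator", choices=["clarity", "rigor", "all"], default="all")
    parser.add_argument("--num-samples", type=int, default=None,
                        help="limit the number of golden samples")
    parser.add_argument("--retriever-type", default=None,
                        choices=["naive", "bm25", "cohere_rerank"])
    parser.add_argument("--retriever-k", type=int, default=None)
    return parser


def resolve_retriever(args: argparse.Namespace, config: dict) -> dict:
    """Return a copy of the config with any CLI overrides applied (hypothetical helper)."""
    resolved = dict(config)
    if args.retriever_type is not None:
        resolved["type"] = args.retriever_type
    if args.retriever_k is not None:
        resolved["k"] = args.retriever_k
    return resolved


# Example: override only k, keep the retriever type from the config.
args = build_parser().parse_args(["--evaluator", "clarity", "--retriever-k", "10"])
effective = resolve_retriever(args, {"type": "naive", "k": 8})
```

Here `effective` keeps `"type": "naive"` from the config while `k` is replaced by the CLI value.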
### 3. Generate Comparison Plots
After running evaluations with multiple configurations:
```bash
# Plot all agents
python eval/plot_retriever_comparison.py
# Plot specific agent only
python eval/plot_retriever_comparison.py --agent clarity
python eval/plot_retriever_comparison.py --agent rigor
# Customize plot size
python eval/plot_retriever_comparison.py --width 14 --height 7
```
---
## Testing Different Retriever Configurations
To compare retriever performance, set the retriever configuration in `app/config.py` before each evaluation run (or use the `--retriever-type` and `--retriever-k` overrides shown in Quick Start).
### Example: Testing Naive vs BM25 vs Cohere Rerank
**Step 1: Test Naive Retriever (k=8)**
```python
# In app/config.py, set:
RETRIEVER_CONFIG = {
    "type": "naive",
    "k": 8,
}
```
```bash
python eval/evaluate_rag_performance.py --evaluator all
```
**Step 2: Test BM25 Retriever (k=8)**
```python
# In app/config.py, set:
RETRIEVER_CONFIG = {
    "type": "bm25",
    "k": 8,
}
```
```bash
python eval/evaluate_rag_performance.py --evaluator all
```
**Step 3: Test Cohere Rerank (k=8, initial=20)**
```python
# In app/config.py, set:
RETRIEVER_CONFIG = {
    "type": "cohere_rerank",
    "k": 8,
    "initial_k": 20,
}
```
```bash
python eval/evaluate_rag_performance.py --evaluator all
```
**Step 4: Generate Comparison**
```bash
python eval/plot_retriever_comparison.py
```
This will create:
- `eval/results/clarity_retriever_comparison.png`
- `eval/results/rigor_retriever_comparison.png`
- Console output with detailed metric comparisons
---
## Understanding Results
### Output Files
All results are saved to `eval/results/{retriever_config}/`:
**Per-Agent Results:**
- `{agent}_results_TIMESTAMP.json` - Detailed per-sample results
- `{agent}_metrics.json` - Aggregated metrics summary
**Example `clarity_metrics.json`:**
```json
{
  "retriever_config": "clarity_naive_k8",
  "faithfulness": 0.1000,
  "answer_relevancy": 0.0753,
  "context_precision": 0.1250,
  "context_recall": 1.0000,
  "context_f1": 0.2222
}
```
**Comparison Outputs (generated by plot script):**
- `{agent}_retriever_comparison.png` - Bar chart comparing retrievers
- Console tables showing improvements vs baseline
---
## Golden Dataset Format
The evaluation uses golden datasets stored in `eval/data/golden/`:
- `golden_clarity_10.csv` - 10 samples for Clarity agent
- `golden_rigor_10.csv` - 10 samples for Rigor agent
### Dataset Fields
| Field | Description |
|-------|-------------|
| `reference_question` | The section content (input text) |
| `reference_context` | The relevant guideline context |
| `reference_answer` | The expected output/suggestion |
| `issue_type` | Type of issue (e.g., "clarity", "technical_rigor") |
| `severity` | Issue severity level |
| `domain` | Content domain |
### RAGAS Field Mapping
```python
{
    "user_input": reference_question,           # Input section text
    "response": agent_generated_output,         # Agent's actual output
    "retrieved_contexts": [contexts],           # Retrieved from vector DB
    "reference": reference_answer,              # Expected output
    "reference_contexts": [reference_context],  # Ground truth context
}
```
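The mapping above can be made concrete with a small sketch that reads golden rows with the standard `csv` module and builds one sample per row. The column names come from the Dataset Fields table; the function names and the toy row are made up for illustration, and the agent response and retrieved contexts are passed in rather than produced here.

```python
import csv
import io
from typing import TextIO


def load_golden_rows(handle: TextIO) -> list[dict]:
    """Read golden dataset rows from an open CSV file (or any text stream)."""
    return list(csv.DictReader(handle))


def to_ragas_sample(row: dict, response: str, retrieved_contexts: list[str]) -> dict:
    """Combine one golden row with the agent's output into a RAGAS-style sample."""
    return {
        "user_input": row["reference_question"],
        "response": response,
        "retrieved_contexts": list(retrieved_contexts),
        "reference": row["reference_answer"],
        "reference_contexts": [row["reference_context"]],
    }


# Toy row for illustration only; real rows live in eval/data/golden/.
toy_csv = io.StringIO(
    "reference_question,reference_context,reference_answer,issue_type,severity,domain\n"
    '"We use ML.","Define acronyms on first use.","Spell out ML.",clarity,low,general\n'
)
rows = load_golden_rows(toy_csv)
sample = to_ragas_sample(rows[0], response="Expand the acronym ML.",
                         retrieved_contexts=["Define acronyms on first use."])
```

Note that `reference_contexts` is a one-element list because each golden row carries a single `reference_context`.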
---
## Workflow Summary
```
┌─────────────────────────────────────────────────────────┐
│  1. Configure Retriever (app/config.py)                 │
│     - Set type (naive, bm25, cohere_rerank)             │
│     - Set k and other parameters                        │
└────────────────┬────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────┐
│  2. Run Evaluation                                      │
│     python eval/evaluate_rag_performance.py             │
│     - Loads golden dataset                              │
│     - Runs agent on each sample                         │
│     - Retrieves contexts from vector DB                 │
│     - Computes RAGAS + custom metrics                   │
│     - Saves to eval/results/{config}/                   │
└────────────────┬────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────┐
│  3. Repeat for Each Configuration                       │
│     - Modify app/config.py                              │
│     - Run evaluation again                              │
│     - Results saved in separate subdirectories          │
└────────────────┬────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────┐
│  4. Generate Comparison                                 │
│     python eval/plot_retriever_comparison.py            │
│     - Loads all metrics from subdirectories             │
│     - Creates comparison plots and tables               │
│     - Shows improvements vs baseline                    │
└─────────────────────────────────────────────────────────┘
```
---
## Available Retriever Configurations
### 1. Naive (Baseline)
```python
RETRIEVER_CONFIG = {
    "type": "naive",
    "k": 8,  # or 10
}
```
**Description**: Simple semantic vector search using embeddings only.
### 2. BM25 (Keyword-based)
```python
RETRIEVER_CONFIG = {
    "type": "bm25",
    "k": 8,  # or 10
}
```
**Description**: Traditional keyword-based retrieval using BM25 algorithm.
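
To give a feel for what keyword-based scoring does, here is a compact, self-contained sketch of Okapi BM25 ranking. It is illustrative only: the project's retriever in `app/retrievers/` may well rely on a library instead, and the tokenizer, the `k1` and `b` values, and the function names are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query: str, documents: list[str], k: int = 8,
              k1: float = 1.5, b: float = 0.75) -> list[tuple[int, float]]:
    """Return (document index, score) pairs for the k best-scoring documents."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs if n_docs else 0.0
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    query_terms = set(tokenize(query))  # simplification: repeated query terms count once

    scored = []
    for index, doc in enumerate(tokenized):
        counts = Counter(doc)
        score = 0.0
        for term in query_terms:
            freq = counts[term]
            if freq == 0:
                continue
            # Smoothed IDF: rarer terms contribute more.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term-frequency saturation with document-length normalisation.
            norm = freq + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * freq * (k1 + 1) / norm
        scored.append((index, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy guideline snippets, made up for illustration.
guidelines = [
    "passive voice weakens clarity",
    "cite sources for every claim",
    "define acronyms on first use",
]
top = bm25_rank("define acronyms", guidelines, k=1)
```

Because scoring depends on exact token matches, a query phrased with synonyms of the guideline text would score zero here, which is the main contrast with the embedding-based baseline.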
### 3. Cohere Rerank (Advanced)
```python
RETRIEVER_CONFIG = {
    "type": "cohere_rerank",
    "k": 8,           # Final number of chunks
    "initial_k": 20,  # Initial retrieval before reranking
}
```
**Description**: Two-stage retrieval with semantic search + Cohere cross-encoder reranking.
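
The relationship between `initial_k` and `k` can be sketched as a generic retrieve-then-rerank function. The scorers below are toy stand-ins: shared-word counting replaces embedding similarity, and Jaccard similarity replaces the Cohere reranker. All names are hypothetical and no Cohere API is called.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def two_stage_retrieve(query: str, documents: list[str], first_stage: Scorer,
                       reranker: Scorer, k: int = 8, initial_k: int = 20) -> list[str]:
    """Keep the initial_k best documents by first_stage, then the k best by reranker."""
    if k > initial_k:
        raise ValueError("k must not exceed initial_k")
    # Stage 1: broad candidate selection with the cheaper scorer.
    candidates = sorted(documents, key=lambda doc: first_stage(query, doc),
                        reverse=True)[:initial_k]
    # Stage 2: reorder only the candidates with the more precise scorer.
    return sorted(candidates, key=lambda doc: reranker(query, doc), reverse=True)[:k]


def shared_words(query: str, doc: str) -> float:
    """Toy first-stage scorer: number of words in common."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def jaccard(query: str, doc: str) -> float:
    """Toy reranker: shared words divided by all distinct words."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


docs = [
    "avoid vague claims and cite evidence for every quantitative statement",
    "avoid vague claims",
    "format tables consistently",
]
best = two_stage_retrieve("avoid vague claims", docs, shared_words, jaccard,
                          k=1, initial_k=2)
```

In this toy run the first two documents tie in stage 1, and the reranker then prefers the shorter, more focused one. A larger `initial_k` gives the reranker more candidates to choose from, at the cost of more reranking work per query.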
---
## Advanced Usage
### Custom Metrics Only
If you want to skip RAGAS metrics and only compute custom retrieval metrics, modify the evaluation script to comment out RAGAS evaluation.
### Adding New Metrics
Add additional RAGAS metrics in `evaluate_rag_performance.py`:
```python
from ragas.metrics import context_entity_recall
# In compute_ragas_metrics()
metrics_to_compute = [
    faithfulness,
    answer_relevancy,
    context_entity_recall,  # NEW
]
```
### Batch Evaluation
To evaluate multiple configurations automatically, create a bash script:
```bash
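#!/bin/bash
# evaluate_all_configs.sh
set -euo pipefail  # stop on the first failed run

configs=("naive" "bm25" "cohere_rerank")

for config in "${configs[@]}"; do
    echo "Evaluating $config..."
    # Select the retriever via the CLI override instead of editing app/config.py.
    # Settings with no flag shown in this guide (e.g. initial_k) still come from app/config.py.
    python eval/evaluate_rag_performance.py --evaluator all --retriever-type "$config"
done

# Build comparison plots from all result subdirectories
python eval/plot_retriever_comparison.py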
```
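The same loop can be written in Python, which makes it easier to vary `k` as well. This is a sketch under the assumption that the `--retriever-type` and `--retriever-k` overrides from Quick Start are sufficient to switch configurations; the helper names are hypothetical, and `dry_run=True` only prints the commands.

```python
import subprocess
import sys

EVAL_SCRIPT = "eval/evaluate_rag_performance.py"
PLOT_SCRIPT = "eval/plot_retriever_comparison.py"


def build_commands(retriever_types: list[str], k: int = 8) -> list[list[str]]:
    """One evaluation command per retriever type, followed by the plot command."""
    commands = [
        [sys.executable, EVAL_SCRIPT, "--evaluator", "all",
         "--retriever-type", retriever_type, "--retriever-k", str(k)]
        for retriever_type in retriever_types
    ]
    commands.append([sys.executable, PLOT_SCRIPT])
    return commands


def run_all(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # raises on the first failure


commands = build_commands(["naive", "bm25", "cohere_rerank"])
run_all(commands, dry_run=True)
```

Call `run_all(commands, dry_run=False)` from the repository root to actually execute the runs.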
---
## Files Overview
### Core Scripts
- `evaluate_rag_performance.py` - Main evaluation script
- `custom_retrieval_metrics.py` - Custom context precision/recall/F1 metrics
- `plot_retriever_comparison.py` - Generate comparison plots and tables
### Data
- `data/golden/golden_clarity_10.csv` - Clarity golden dataset
- `data/golden/golden_rigor_10.csv` - Rigor golden dataset
- `results/` - All evaluation results (organized by retriever config)
### Golden Dataset Generation (Optional)
- `golden_dataset/step1_generate_seeds.py` - Generate seed questions
- `golden_dataset/step2_evolve_candidates.py` - Evolve candidate samples
- `golden_dataset/step3_filter_golden.py` - Filter to final golden set
- `golden_dataset/config.py` - Configuration for dataset generation
---
## References
- [RAGAS Documentation](https://docs.ragas.io/)
- [RAGAS Metrics Guide](https://docs.ragas.io/en/latest/concepts/metrics/)
- Project config: `app/config.py`
- Retriever implementations: `app/retrievers/`