RAG Evaluation Guide

This guide explains how to evaluate the RAG (Retrieval-Augmented Generation) performance of the Clarity and Rigor agents using different retriever configurations.

Overview

The evaluation system measures:

Retrieval Quality: How well the system retrieves relevant context from the vector database
Answer Quality: How well the agent generates responses using retrieved context
Agent Performance: Comparison across different retriever configurations (Naive, BM25, Cohere Rerank)

Evaluation Metrics

RAGAS Metrics (Standard)

Faithfulness: Whether the agent's answer is grounded in the retrieved contexts
Answer Relevancy: Whether the answer addresses the input question appropriately

Custom Retrieval Metrics

Context Precision: Fraction of retrieved contexts that are relevant, where relevance is judged by word overlap with the reference contexts
Context Recall: Fraction of reference contexts that were successfully retrieved
Context F1: Harmonic mean of precision and recall

Note: We use custom context metrics because RAGAS 0.3+ context metrics don't work well when agent outputs differ in format from reference answers (e.g., issue descriptions vs. suggested fixes).
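
The three custom metrics can be sketched as follows. This is a minimal illustration, not the code in custom_retrieval_metrics.py: the helper names, the tokenization, and the 0.5 overlap threshold are all assumptions made for the example.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercased word set used for the overlap check."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def _overlap(chunk: str, reference: str) -> float:
    """Fraction of the reference's words that also appear in the chunk."""
    ref_words = _words(reference)
    if not ref_words:
        return 0.0
    return len(_words(chunk) & ref_words) / len(ref_words)


def context_scores(retrieved, reference, threshold=0.5):
    """Return (precision, recall, f1) for a single sample.

    `threshold` is a hypothetical cut-off: a retrieved chunk and a
    reference context count as a match when the overlap reaches it.
    """
    relevant_retrieved = sum(
        1 for chunk in retrieved
        if any(_overlap(chunk, ref) >= threshold for ref in reference)
    )
    recalled_references = sum(
        1 for ref in reference
        if any(_overlap(chunk, ref) >= threshold for chunk in retrieved)
    )
    precision = relevant_retrieved / len(retrieved) if retrieved else 0.0
    recall = recalled_references / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

With eight retrieved chunks of which exactly one matches a single reference context, this sketch gives precision 0.125, recall 1.0, and F1 of about 0.2222, the same combination of values shown in the example clarity_metrics.json under Output Files.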

Quick Start

1. Install Dependencies

uv sync

This installs: ragas, datasets, matplotlib, seaborn, and other required packages.

2. Run Evaluation

Evaluate both agents with default retriever (naive, k=8):

python eval/evaluate_rag_performance.py --evaluator all

Evaluate specific agent:

python eval/evaluate_rag_performance.py --evaluator clarity
python eval/evaluate_rag_performance.py --evaluator rigor

Test with limited samples (faster):

python eval/evaluate_rag_performance.py --evaluator clarity --num-samples 2

Override retriever parameters:

# Change k value (number of chunks retrieved)
python eval/evaluate_rag_performance.py --evaluator clarity --retriever-k 10

# Specify retriever type
python eval/evaluate_rag_performance.py --evaluator clarity --retriever-type cohere_rerank --retriever-k 8
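
One plausible way for these overrides to interact with the defaults in app/config.py is sketched below. The flag names come from the commands above; the function names and the merge logic are assumptions for illustration and may differ from what evaluate_rag_performance.py actually does.

```python
import argparse


def parse_overrides(argv):
    """Parse only the two override flags shown in this guide."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--retriever-type", default=None)
    parser.add_argument("--retriever-k", type=int, default=None)
    return parser.parse_args(argv)


def resolve_retriever_config(defaults, cli_type=None, cli_k=None):
    """Hypothetical merge: CLI values win, unset flags keep the defaults."""
    resolved = dict(defaults)  # copy so the config-file defaults stay untouched
    if cli_type is not None:
        resolved["type"] = cli_type
    if cli_k is not None:
        resolved["k"] = cli_k
    return resolved


if __name__ == "__main__":
    args = parse_overrides(["--retriever-k", "10"])
    print(resolve_retriever_config({"type": "naive", "k": 8},
                                   args.retriever_type, args.retriever_k))
```

Under this assumed precedence, passing only --retriever-k changes k and leaves the retriever type from the config file in place.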

3. Generate Comparison Plots

After running evaluations with multiple configurations:

# Plot all agents
python eval/plot_retriever_comparison.py

# Plot specific agent only
python eval/plot_retriever_comparison.py --agent clarity
python eval/plot_retriever_comparison.py --agent rigor

# Customize plot size
python eval/plot_retriever_comparison.py --width 14 --height 7

Testing Different Retriever Configurations

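To compare retrievers, run the evaluation once per configuration. Set the retriever configuration in app/config.py before each run, as in the steps below, or use the --retriever-type and --retriever-k overrides shown in the Quick Start.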

Example: Testing Naive vs BM25 vs Cohere Rerank

Step 1: Test Naive Retriever (k=8)

# In app/config.py, set:
RETRIEVER_CONFIG = {
    "type": "naive",
    "k": 8,
}

python eval/evaluate_rag_performance.py --evaluator all

Step 2: Test BM25 Retriever (k=8)

# In app/config.py, set:
RETRIEVER_CONFIG = {
    "type": "bm25",
    "k": 8,
}

python eval/evaluate_rag_performance.py --evaluator all

Step 3: Test Cohere Rerank (k=8, initial_k=20)

# In app/config.py, set:
RETRIEVER_CONFIG = {
    "type": "cohere_rerank",
    "k": 8,
    "initial_k": 20,
}

python eval/evaluate_rag_performance.py --evaluator all

Step 4: Generate Comparison

python eval/plot_retriever_comparison.py

This will create:

eval/results/clarity_retriever_comparison.png
eval/results/rigor_retriever_comparison.png
Console output with detailed metric comparisons

Understanding Results

Output Files

All results are saved to eval/results/{retriever_config}/:

Per-Agent Results:

{agent}_results_TIMESTAMP.json - Detailed per-sample results
{agent}_metrics.json - Aggregated metrics summary

Example clarity_metrics.json:

{
  "retriever_config": "clarity_naive_k8",
  "faithfulness": 0.1000,
  "answer_relevancy": 0.0753,
  "context_precision": 0.1250,
  "context_recall": 1.0000,
  "context_f1": 0.2222
}

Comparison Outputs (generated by plot script):

{agent}_retriever_comparison.png - Bar chart comparing retrievers
Console tables showing improvements vs baseline
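
The "improvements vs baseline" idea can be sketched as follows. This is an illustration rather than the logic of plot_retriever_comparison.py: it assumes each subdirectory of eval/results/ holds one {agent}_metrics.json file, and the function names are hypothetical.

```python
import json
from pathlib import Path

METRICS = [
    "faithfulness",
    "answer_relevancy",
    "context_precision",
    "context_recall",
    "context_f1",
]


def load_metrics(results_dir, agent):
    """Map each results subdirectory name to its {agent}_metrics.json content."""
    found = {}
    for path in sorted(Path(results_dir).glob(f"*/{agent}_metrics.json")):
        found[path.parent.name] = json.loads(path.read_text())
    return found


def deltas_vs_baseline(metrics_by_config, baseline):
    """Per-metric difference between every configuration and the baseline."""
    base = metrics_by_config[baseline]
    return {
        name: {
            m: round(values[m] - base[m], 4)
            for m in METRICS
            if m in values and m in base
        }
        for name, values in metrics_by_config.items()
        if name != baseline
    }
```

A positive delta means the configuration scored higher than the baseline on that metric. The baseline key must be the name of the subdirectory that holds the baseline run.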

Golden Dataset Format

The evaluation uses golden datasets stored in eval/data/golden/:

golden_clarity_10.csv - 10 samples for Clarity agent
golden_rigor_10.csv - 10 samples for Rigor agent

Dataset Fields

Field	Description
`reference_question`	The section content (input text)
`reference_context`	The relevant guideline context
`reference_answer`	The expected output/suggestion
`issue_type`	Type of issue (e.g., "clarity", "technical_rigor")
`severity`	Issue severity level
`domain`	Content domain

RAGAS Field Mapping

{
    "user_input": reference_question,      # Input section text
    "response": agent_generated_output,    # Agent's actual output
    "retrieved_contexts": [contexts],      # Retrieved from vector DB
    "reference": reference_answer,         # Expected output
    "reference_contexts": [reference_context]  # Ground truth context
}
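
As a sketch of how one golden dataset row could be turned into this structure, assuming the CSV columns listed under Dataset Fields. The function names are hypothetical, and the agent output and retrieved contexts are passed in because they come from the evaluation run, not from the CSV.

```python
import csv


def load_golden_rows(file_handle):
    """Read golden dataset rows as dictionaries keyed by column name."""
    return list(csv.DictReader(file_handle))


def to_ragas_sample(row, response, retrieved_contexts):
    """Build one RAGAS-style sample from a golden row plus run outputs."""
    return {
        "user_input": row["reference_question"],           # input section text
        "response": response,                               # agent's actual output
        "retrieved_contexts": list(retrieved_contexts),     # from the vector DB
        "reference": row["reference_answer"],               # expected output
        "reference_contexts": [row["reference_context"]],   # ground truth context
    }


if __name__ == "__main__":
    path = "eval/data/golden/golden_clarity_10.csv"
    with open(path, newline="", encoding="utf-8") as fh:
        rows = load_golden_rows(fh)
    print(to_ragas_sample(rows[0], response="", retrieved_contexts=[]))
```

Note that reference_contexts is a one-element list because each golden row carries a single reference_context value.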

Workflow Summary

┌─────────────────────────────────────────────────────────┐
│ 1. Configure Retriever (app/config.py)                  │
│    - Set type (naive, bm25, cohere_rerank)              │
│    - Set k and other parameters                         │
└────────────────┬────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────┐
│ 2. Run Evaluation                                       │
│    python eval/evaluate_rag_performance.py              │
│    - Loads golden dataset                               │
│    - Runs agent on each sample                          │
│    - Retrieves contexts from vector DB                  │
│    - Computes RAGAS + custom metrics                    │
│    - Saves to eval/results/{config}/                    │
└────────────────┬────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────┐
│ 3. Repeat for Each Configuration                        │
│    - Modify app/config.py                               │
│    - Run evaluation again                               │
│    - Results saved in separate subdirectories           │
└────────────────┬────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────┐
│ 4. Generate Comparison                                  │
│    python eval/plot_retriever_comparison.py             │
│    - Loads all metrics from subdirectories              │
│    - Creates comparison plots and tables                │
│    - Shows improvements vs baseline                     │
└─────────────────────────────────────────────────────────┘

Available Retriever Configurations

1. Naive (Baseline)

RETRIEVER_CONFIG = {
    "type": "naive",
    "k": 8,  # or 10
}

Description: Simple semantic vector search using embeddings only.

2. BM25 (Keyword-based)

RETRIEVER_CONFIG = {
    "type": "bm25",
    "k": 8,  # or 10
}

Description: Traditional keyword-based retrieval using BM25 algorithm.

3. Cohere Rerank (Advanced)

RETRIEVER_CONFIG = {
    "type": "cohere_rerank",
    "k": 8,           # Final number of chunks returned after reranking
    "initial_k": 20,  # Number of candidates retrieved before reranking
}

Description: Two-stage retrieval with semantic search + Cohere cross-encoder reranking.
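
How k and initial_k relate in a two-stage setup can be sketched like this. The code does not call Cohere or the project's retrievers: both scorers are stand-in callables supplied by the caller, and the function name is hypothetical.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]  # (query, document) -> relevance score


def two_stage_retrieve(
    query: str,
    docs: Sequence[str],
    first_stage: Scorer,
    reranker: Scorer,
    k: int = 8,
    initial_k: int = 20,
) -> List[str]:
    """Take the top `initial_k` by the first-stage score, then keep the
    top `k` of those candidates according to the reranker."""
    if initial_k < k:
        raise ValueError("initial_k must be at least k")
    candidates = sorted(
        docs, key=lambda d: first_stage(query, d), reverse=True
    )[:initial_k]
    return sorted(
        candidates, key=lambda d: reranker(query, d), reverse=True
    )[:k]
```

The reranker only ever sees the initial_k candidates, so a document that the first stage ranks below that cut-off cannot be recovered by reranking. That is the reason for setting initial_k larger than k.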

Advanced Usage

Custom Metrics Only

To compute only the custom retrieval metrics, comment out the RAGAS evaluation step (compute_ragas_metrics()) in eval/evaluate_rag_performance.py.

Adding New Metrics

Add additional RAGAS metrics in evaluate_rag_performance.py:

from ragas.metrics import context_entity_recall

# In compute_ragas_metrics()
metrics_to_compute = [
    faithfulness,
    answer_relevancy,
    context_entity_recall,  # NEW
]

Batch Evaluation

To evaluate multiple configurations automatically, create a bash script:

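#!/bin/bash
# evaluate_all_configs.sh
set -euo pipefail  # stop at the first failed run

configs=("naive" "bm25" "cohere_rerank")

for config in "${configs[@]}"; do
    echo "Evaluating $config..."
    # Select the retriever with the CLI override instead of editing app/config.py.
    # Parameters with no override shown in this guide (e.g. initial_k) are
    # assumed to come from app/config.py.
    python eval/evaluate_rag_performance.py --evaluator all --retriever-type "$config" --retriever-k 8
done

python eval/plot_retriever_comparison.py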
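
The same loop can be written in Python, which makes it easier to preview the commands before running them. This is a sketch under assumptions: it relies on the --retriever-type and --retriever-k overrides from the Quick Start and assumes they can be combined with --evaluator all. The function names are hypothetical.

```python
import subprocess
import sys

EVAL_SCRIPT = "eval/evaluate_rag_performance.py"
PLOT_SCRIPT = "eval/plot_retriever_comparison.py"


def build_commands(retriever_types, k=8, evaluator="all"):
    """One evaluation command per retriever type, then the plot command."""
    commands = [
        [sys.executable, EVAL_SCRIPT,
         "--evaluator", evaluator,
         "--retriever-type", retriever_type,
         "--retriever-k", str(k)]
        for retriever_type in retriever_types
    ]
    commands.append([sys.executable, PLOT_SCRIPT])
    return commands


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # raises if a run fails


if __name__ == "__main__":
    run_all(build_commands(["naive", "bm25", "cohere_rerank"]), dry_run=True)
```

Run it from the repository root so the relative script paths resolve, and set dry_run=False once the printed commands look right.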

Files Overview

Core Scripts

evaluate_rag_performance.py - Main evaluation script
custom_retrieval_metrics.py - Custom context precision/recall/F1 metrics
plot_retriever_comparison.py - Generate comparison plots and tables

Data

data/golden/golden_clarity_10.csv - Clarity golden dataset
data/golden/golden_rigor_10.csv - Rigor golden dataset
results/ - All evaluation results (organized by retriever config)

Golden Dataset Generation (Optional)

golden_dataset/step1_generate_seeds.py - Generate seed questions
golden_dataset/step2_evolve_candidates.py - Evolve candidate samples
golden_dataset/step3_filter_golden.py - Filter to final golden set
golden_dataset/config.py - Configuration for dataset generation

References

RAGAS Documentation
RAGAS Metrics Guide
Project config: app/config.py
Retriever implementations: app/retrievers/
