[📺 Watch: (RAG Deep Dive series) Evaluating RAG answer quality](https://www.youtube.com/watch?v=lyCLu53fb3g)
# Evaluating the RAG answer quality
Follow these steps to evaluate the quality of the answers generated by the RAG flow.
* [Deploy an evaluation model](#deploy-an-evaluation-model)
* [Setup the evaluation environment](#setup-the-evaluation-environment)
* [Generate ground truth data](#generate-ground-truth-data)
* [Run bulk evaluation](#run-bulk-evaluation)
* [Review the evaluation results](#review-the-evaluation-results)
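* [Run bulk evaluation on a PR](#run-bulk-evaluation-on-a-pr)
* [Evaluate multimodal RAG answers](#evaluate-multimodal-rag-answers)
* [Evaluate PBSG Golden Set triage behavior](#evaluate-pbsg-golden-set-triage-behavior)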
## Deploy an evaluation model
1. Run this command to tell `azd` to deploy a GPT-4-level model for evaluation:
```shell
azd env set USE_EVAL true
```
2. Set the capacity to the highest possible value to ensure that the evaluation runs relatively quickly. Even with a high capacity, it can take a long time to generate ground truth data and run bulk evaluations.
```shell
azd env set AZURE_OPENAI_EVAL_DEPLOYMENT_CAPACITY 100
```
By default, `azd` provisions a `gpt-4o` model, version `2024-08-06`. To change the model or version, set the azd environment variables `AZURE_OPENAI_EVAL_MODEL` and `AZURE_OPENAI_EVAL_MODEL_VERSION` to the desired values.
3. Then, run the following command to provision the model:
```shell
azd provision
```
## Setup the evaluation environment
Make a new Python virtual environment and activate it. This is currently required due to incompatibilities between the dependencies of the evaluation script and the main project.
```bash
python -m venv .evalenv
```
```bash
source .evalenv/bin/activate
```
Install all the dependencies for the evaluation script by running the following command:
```bash
pip install -r evals/requirements.txt
```
## Generate ground truth data
Generate ground truth data by running the following command:
```bash
python evals/generate_ground_truth.py --numquestions=200 --numsearchdocs=1000
```
The options are:
* `numquestions`: The number of questions to generate. We suggest at least 200.
* `numsearchdocs`: The number of documents (chunks) to retrieve from your search index. You can leave off the option to fetch all documents, but that will significantly increase the time it takes to generate ground truth data. You may want to at least start with a subset.
* `kgfile`: An existing RAGAS knowledge graph JSON file, which is usually `ground_truth_kg.json`. You may want to specify this if you already created a knowledge graph and just want to tweak the question generation steps.
* `groundtruthfile`: The file to write the generated ground truth answers to. By default, this is `evals/ground_truth.jsonl`.
🕰️ This may take a long time, possibly several hours, depending on the size of the search index.
Review the generated data in `evals/ground_truth.jsonl` after running that script, removing any question/answer pairs that don't seem like realistic user input.
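A first pass over the generated file can be scripted before the manual review. The following is a minimal sketch, not part of the repository: it assumes each line of the JSONL file is an object with `question` and `truth` keys, and the helper names `load_pairs` and `flag_suspicious` are hypothetical. It only flags very short questions, empty answers, and duplicate questions; judging whether a question is realistic still needs a human.

```python
import json
from pathlib import Path


def load_pairs(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    pairs = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                pairs.append(json.loads(line))
    return pairs


def flag_suspicious(pairs, min_question_chars=15):
    """Return the indexes of pairs that deserve a manual look.

    Assumption: every object has "question" and "truth" keys.
    """
    flagged = []
    seen = set()
    for i, pair in enumerate(pairs):
        question = pair.get("question", "").strip()
        truth = pair.get("truth", "").strip()
        key = question.lower()
        # Flag short questions, empty answers, and case-insensitive duplicates.
        if len(question) < min_question_chars or not truth or key in seen:
            flagged.append(i)
        seen.add(key)
    return flagged


if __name__ == "__main__":
    pairs = load_pairs("evals/ground_truth.jsonl")
    for i in flag_suspicious(pairs):
        print(i, pairs[i].get("question"))
```

The threshold of 15 characters is arbitrary; adjust it to match the kind of questions your users actually ask.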
## Run bulk evaluation
Review the configuration in `evals/evaluate_config.json` to ensure that everything is set up correctly. You may want to adjust the metrics used. See [the ai-rag-chat-evaluator README](https://github.com/Azure-Samples/ai-rag-chat-evaluator) for more information on the available metrics.
By default, the evaluation script will evaluate every question in the ground truth data.
Run the evaluation script by running the following command:
```bash
python evals/evaluate.py
```
The options are:
* `numquestions`: The number of questions to evaluate. By default, this is all questions in the ground truth data.
* `resultsdir`: The directory to write the evaluation results. By default, this is a timestamped folder in `evals/results`. This option can also be specified in `evaluate_config.json`.
* `targeturl`: The URL of the running application to evaluate. By default, this is `http://localhost:50505`. This option can also be specified in `evaluate_config.json`.
🕰️ This may take a long time, possibly several hours, depending on the number of ground truth questions, the TPM capacity of the evaluation model, and the number of LLM-based metrics requested.
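To get a feel for how those three factors interact, here is a rough back-of-the-envelope sketch. It is an illustration, not something the evaluation script provides: the function name `estimate_minutes` and the tokens-per-call figure are hypothetical, and it assumes one unit of deployment capacity corresponds to 1,000 tokens per minute. The result is only a lower bound, since it ignores request latency, retries, and rate-limit backoff.

```python
def estimate_minutes(num_questions, llm_metrics, tokens_per_call, capacity_units):
    """Lower-bound run time in minutes if the TPM quota is the only limit.

    Assumption: one capacity unit equals 1,000 tokens per minute.
    """
    tokens_per_minute = capacity_units * 1000
    # One evaluator call per question per LLM-based metric.
    total_tokens = num_questions * llm_metrics * tokens_per_call
    return total_tokens / tokens_per_minute


if __name__ == "__main__":
    # 200 questions, 3 LLM-based metrics, a guessed 5,000 tokens per call,
    # and the capacity of 100 set earlier in this guide.
    print(estimate_minutes(200, 3, 5000, 100))  # 30.0
```

Halving the capacity or doubling the number of LLM-based metrics doubles the estimate, which is why a high capacity is recommended.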
## Review the evaluation results
The evaluation script writes its results, including a summary, to a folder inside the `evals/results` directory (a timestamped folder by default).
You can see a summary of results across all evaluation runs by running the following command:
```bash
python -m evaltools summary evals/results
```
Compare answers to the ground truth by running the following command:
```bash
python -m evaltools diff evals/results/baseline/
```
Compare answers across two runs by running the following command:
```bash
python -m evaltools diff evals/results/baseline/ evals/results/SECONDRUNHERE
```
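If you want the difference between two runs as numbers rather than in the interactive tools, a small script can compare the summaries directly. This is a sketch under assumptions: it supposes each results folder contains a `summary.json` that maps metric names to objects of numeric statistics, and the helpers `load_summary` and `metric_deltas` are hypothetical, not part of `evaltools`. Check the actual files in your results folder before relying on it.

```python
import json
import sys
from pathlib import Path


def load_summary(run_dir):
    """Load summary.json from a results folder (file name is an assumption)."""
    with (Path(run_dir) / "summary.json").open(encoding="utf-8") as f:
        return json.load(f)


def _is_number(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def metric_deltas(baseline, candidate):
    """Return {metric: {stat: candidate - baseline}} for shared numeric stats."""
    deltas = {}
    for metric in sorted(baseline.keys() & candidate.keys()):
        base_stats, cand_stats = baseline[metric], candidate[metric]
        if not (isinstance(base_stats, dict) and isinstance(cand_stats, dict)):
            continue
        shared = {
            stat: round(cand_stats[stat] - base_stats[stat], 4)
            for stat in sorted(base_stats.keys() & cand_stats.keys())
            if _is_number(base_stats[stat]) and _is_number(cand_stats[stat])
        }
        if shared:
            deltas[metric] = shared
    return deltas


if __name__ == "__main__":
    # Usage: python compare_runs.py evals/results/baseline evals/results/SECONDRUNHERE
    base, cand = load_summary(sys.argv[1]), load_summary(sys.argv[2])
    print(json.dumps(metric_deltas(base, cand), indent=2))
```

Metrics or statistics that appear in only one of the two runs are skipped, so the output only covers what both runs measured.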
## Run bulk evaluation on a PR
This repository includes a GitHub Actions workflow, `evaluate.yaml`, that can be used to run the evaluation on the changes in a PR.
In order for the workflow to run successfully, you must first set up [continuous integration](./azd.md#github-actions) for the repository.
To run the evaluation on the changes in a PR, a repository member can post a `/evaluate` comment to the PR. This triggers the workflow, which evaluates the PR changes and posts the results to the PR.
## Evaluate multimodal RAG answers
The repository also includes an `evaluate_config_multimodal.json` file specifically for evaluating multimodal RAG answers. This configuration uses a different ground truth file, `ground_truth_multimodal.jsonl`, which includes questions based on the sample data that require both text and image sources to answer.
Note that the "groundedness" evaluator is not reliable for multimodal RAG, since it does not currently incorporate the image sources. We still include it in the metrics, but the more reliable metrics are "relevance" and "citations matched".
## Evaluate PBSG Golden Set triage behavior
For the Pro Bono SG workflow, the evaluator reads the per-id JSON files under `data/pbsg_golden_set_by_id/` by default. To use a different dataset, pass `--dataset` with the path to a directory or to a legacy single-array JSON file. Run the endpoint-based evaluator:
```bash
python evals/pbsg_golden_set_eval.py --targeturl http://localhost:50505
```
This evaluator checks:
* **Phase 1 entry selection**: whether the answer picks the expected `Selected Entry` ID
* **Part B completeness**: whether all expected triage questions for that entry are present
* **Phase 2 route validity**: whether returned route labels (`Route A`, `Route B`, etc.) are valid for that entry
Useful options:
* `--max-entries`: run only a subset of entries for quick iteration
* `--per-entry-variations`: limit the number of variations per entry
* `--output`: customize where the JSON report is written (default `evals/results/pbsg_golden_set_eval.json`)
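The three checks the evaluator performs can be pictured with a small sketch. This is an illustration only and not the logic of `pbsg_golden_set_eval.py`: the `Expected` dataclass, the `check_answer` function, the assumption that the answer contains a `Selected Entry` label followed by an ID, and the substring matching of triage questions are all hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Expected:
    """Hypothetical shape of one golden-set entry."""
    entry_id: str
    triage_questions: list = field(default_factory=list)
    valid_routes: set = field(default_factory=set)


def check_answer(answer, expected):
    """Run the three checks against one answer and return a dict of booleans."""
    # Phase 1: assume the answer contains "Selected Entry" followed by an ID.
    match = re.search(r"Selected Entry\W*([\w.-]+)", answer)
    selected = match.group(1) if match else None

    # Part B: every expected triage question must appear (case-insensitive).
    lowered = answer.lower()
    missing = [q for q in expected.triage_questions if q.lower() not in lowered]

    # Phase 2: every "Route X" label in the answer must be valid for the entry.
    # An answer with no route labels passes this check trivially.
    routes = set(re.findall(r"\bRoute [A-Z]\b", answer))

    return {
        "entry_selection": selected == expected.entry_id,
        "part_b_complete": not missing,
        "routes_valid": routes <= expected.valid_routes,
    }


if __name__ == "__main__":
    expected = Expected(
        entry_id="E12",
        triage_questions=["When did the incident occur?"],
        valid_routes={"Route A", "Route B"},
    )
    answer = "Selected Entry: E12\n1. When did the incident occur?\nRoute A"
    print(check_answer(answer, expected))
```

Keeping the three results separate, rather than collapsing them into one pass/fail flag, makes it easier to see which phase of the triage flow regressed.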