---
phase: 13-retrieval-evaluation-framework
plan: 02
type: execute
wave: 1
depends_on: []
files_modified:
- src/main/java/dev/alexandria/search/eval/EvaluationExporter.java
- src/main/java/dev/alexandria/search/eval/GoldenSetEntry.java
- src/main/java/dev/alexandria/search/eval/EvaluationResult.java
- src/main/resources/eval/golden-set.json
- src/main/resources/application.yml
- src/test/java/dev/alexandria/search/eval/EvaluationExporterTest.java
autonomous: true
requirements:
- EVAL-02
- EVAL-05
must_haves:
  truths:
  - "A golden-set.json file contains 100 annotated queries with graded relevance judgments"
  - "Each query has a type (factual, conceptual, code_lookup, troubleshooting) matching the required distribution"
  - "Each query has relevance judgments with chunk identifiers and grades 0-2"
  - "EvaluationExporter writes aggregate CSV with global and per-type metrics"
  - "EvaluationExporter writes detailed CSV with per-query chunk results"
  - "CSV filenames contain ISO timestamp and configurable label"
  - "Output directory is configurable via application properties"
  artifacts:
  - path: "src/main/resources/eval/golden-set.json"
    provides: "100 annotated queries for retrieval evaluation"
    min_lines: 200
  - path: "src/main/java/dev/alexandria/search/eval/GoldenSetEntry.java"
    provides: "Record for parsing golden set entries"
    contains: "record GoldenSetEntry"
  - path: "src/main/java/dev/alexandria/search/eval/EvaluationResult.java"
    provides: "Record for per-query evaluation results"
    contains: "record EvaluationResult"
  - path: "src/main/java/dev/alexandria/search/eval/EvaluationExporter.java"
    provides: "CSV export service for evaluation results"
    contains: "class EvaluationExporter"
  - path: "src/test/java/dev/alexandria/search/eval/EvaluationExporterTest.java"
    provides: "Unit tests for CSV export"
    min_lines: 50
  key_links:
  - from: "src/main/java/dev/alexandria/search/eval/EvaluationExporter.java"
    to: "src/main/resources/application.yml"
    via: "@Value for output directory and default label"
    pattern: "alexandria\\.eval"
  - from: "src/main/resources/eval/golden-set.json"
    to: "src/main/java/dev/alexandria/search/eval/GoldenSetEntry.java"
    via: "Jackson deserialization"
    pattern: "GoldenSetEntry"
---
<objective>
Create the golden set of 100 annotated queries and the CSV export service for evaluation results.
Purpose: The golden set provides the ground truth for measuring retrieval quality. The exporter enables tracking metric trends across pipeline changes (phases 14-18). Both are independent of the metrics computation (plan 01).
Output: golden-set.json with 100 queries, GoldenSetEntry/EvaluationResult records, EvaluationExporter service, application config.
</objective>
<execution_context>
@./.claude/get-shit-done/workflows/execute-plan.md
@./.claude/get-shit-done/templates/summary.md
</execution_context>
<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/13-retrieval-evaluation-framework/13-CONTEXT.md
@src/main/resources/application.yml
</context>
<tasks>
<task type="auto">
<name>Task 1: Create golden set data model and 100 annotated queries</name>
<files>
src/main/java/dev/alexandria/search/eval/GoldenSetEntry.java
src/main/resources/eval/golden-set.json
</files>
<action>
Create `GoldenSetEntry` record in `dev.alexandria.search.eval`:
```java
import java.util.List;

/** One golden-set query with its type and graded relevance judgments. */
public record GoldenSetEntry(
    String query,
    QueryType queryType,
    List<RelevanceJudgment> judgments
) {}
```
Note: `RelevanceJudgment` and `QueryType` are created in plan 01. If plan 02 executes first, create minimal placeholder versions that plan 01 will flesh out. Use the same package `dev.alexandria.search.eval`.
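If plan 02 does run first, the placeholders could look roughly like the sketch below. The field names `chunkId` and `grade` are taken from the JSON structure in this plan and the enum constants from the query types listed in the must-haves. The range check on `grade` is an assumption that plan 01 is free to replace.

```java
// Placeholder types, assuming plan 01 has not been executed yet.
// In the project these would live in package dev.alexandria.search.eval, one type per file.

/** Query categories used by the golden set. */
enum QueryType {
  FACTUAL,
  CONCEPTUAL,
  CODE_LOOKUP,
  TROUBLESHOOTING
}

/** One graded judgment: which chunk, and how relevant it is to the query. */
record RelevanceJudgment(String chunkId, int grade) {
  RelevanceJudgment {
    // Assumption: reject grades outside the 0-2 scale described in this plan.
    if (grade < 0 || grade > 2) {
      throw new IllegalArgumentException("grade must be 0, 1 or 2 but was " + grade);
    }
  }
}
```

Keeping the component names identical to the JSON keys means the records can be bound by name without extra mapping annotations, provided the project's Jackson version supports records.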
Create `src/main/resources/eval/golden-set.json` with 100 queries about Spring Boot documentation.
**Distribution (per user decision):**
- ~30 code lookup queries (e.g., "How to create a REST controller with Spring Boot", "Spring Boot @Transactional annotation example")
- ~30 factual queries (e.g., "What is the default embedded server in Spring Boot", "Spring Boot auto-configuration order")
- ~25 conceptual queries (e.g., "How does Spring Boot auto-configuration work", "Difference between @Component and @Service")
- ~15 troubleshooting queries (e.g., "Spring Boot application fails to start with port already in use", "How to fix circular dependency in Spring Boot")
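Because the targets are approximate, the distribution can be checked with a tolerance rather than exact equality. The sketch below is one possible check; the class name `DistributionCheck` and the `tolerance` parameter are hypothetical and not part of this plan, and the enum is redeclared locally only so the example compiles on its own.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

final class DistributionCheck {
  // Local copy of the enum so the sketch is self-contained.
  enum QueryType { FACTUAL, CONCEPTUAL, CODE_LOOKUP, TROUBLESHOOTING }

  // Target counts from this plan.
  static final Map<QueryType, Integer> TARGETS = Map.of(
      QueryType.CODE_LOOKUP, 30,
      QueryType.FACTUAL, 30,
      QueryType.CONCEPTUAL, 25,
      QueryType.TROUBLESHOOTING, 15);

  /** Counts entries per type; types with no entries are reported as 0. */
  static Map<QueryType, Integer> countByType(List<QueryType> types) {
    Map<QueryType, Integer> counts = new EnumMap<>(QueryType.class);
    for (QueryType type : QueryType.values()) {
      counts.put(type, 0);
    }
    for (QueryType type : types) {
      counts.merge(type, 1, Integer::sum);
    }
    return counts;
  }

  /** True when every type is within +/- tolerance of its target count. */
  static boolean matchesTargets(List<QueryType> types, int tolerance) {
    Map<QueryType, Integer> counts = countByType(types);
    return TARGETS.entrySet().stream()
        .allMatch(e -> Math.abs(counts.get(e.getKey()) - e.getValue()) <= tolerance);
  }
}
```

A call such as `DistributionCheck.matchesTargets(types, 3)` would accept 28 factual queries but reject 20.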
**JSON structure:**
```json
[
  {
    "query": "How to create a REST controller in Spring Boot",
    "queryType": "CODE_LOOKUP",
    "judgments": [
      { "chunkId": "spring-boot/web/rest-controllers#creating-rest-controller", "grade": 2 },
      { "chunkId": "spring-boot/web/rest-controllers#request-mapping", "grade": 1 },
      { "chunkId": "spring-boot/web/rest-controllers#response-body", "grade": 1 }
    ]
  }
]
```
**ChunkId format:** Use semantic path-style identifiers that map to Spring Boot documentation sections. These are logical identifiers — the actual matching against real chunks happens in the integration test (plan 03) which maps these to real chunk content via source_url + section_path metadata.
**Relevance grades (per user decision):**
- 0 = not relevant (do NOT include grade-0 entries in the JSON — absence means 0)
- 1 = partially relevant (related topic but not the primary answer)
- 2 = highly relevant (directly answers the query)
Each query should have 2-5 relevant judgments (grade 1 or 2). Aim for realistic relevance — not every query will have 5 highly relevant chunks.
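Since grade-0 judgments are omitted from the file, any consumer has to treat a missing chunk as grade 0. A minimal sketch of that lookup follows; `GradeLookup` is a hypothetical helper name, and the nested record is only a stand-in for the real `RelevanceJudgment`.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GradeLookup {
  // Minimal stand-in for the judgment type so the sketch compiles on its own.
  record RelevanceJudgment(String chunkId, int grade) {}

  /** Indexes the explicit judgments of one query by chunk identifier. */
  static Map<String, Integer> index(List<RelevanceJudgment> judgments) {
    Map<String, Integer> grades = new HashMap<>();
    for (RelevanceJudgment judgment : judgments) {
      grades.put(judgment.chunkId(), judgment.grade());
    }
    return grades;
  }

  /** Chunks missing from the golden set are treated as grade 0 (not relevant). */
  static int gradeOf(Map<String, Integer> grades, String chunkId) {
    return grades.getOrDefault(chunkId, 0);
  }
}
```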
**Quality criteria for queries:**
- Queries should be natural language, as a developer would type them
- Vary specificity: some broad ("Spring Boot configuration"), some narrow ("@ConditionalOnProperty annotation syntax")
- Include queries with technical terms (annotations, class names) and queries with natural language descriptions
</action>
<verify>
- File `src/main/resources/eval/golden-set.json` exists and is valid JSON
- JSON contains exactly 100 entries
- Distribution: approximately 30 CODE_LOOKUP, 30 FACTUAL, 25 CONCEPTUAL, 15 TROUBLESHOOTING
- Each entry has query, queryType, and at least 1 judgment with grade 1 or 2 (target: 2-5 per query)
- `./gradlew compileJava` passes
</verify>
<done>
golden-set.json contains 100 diverse, well-structured queries with graded relevance judgments covering all 4 query types in the specified distribution
</done>
</task>
<task type="auto">
<name>Task 2: Create CSV export service with application configuration</name>
<files>
src/main/java/dev/alexandria/search/eval/EvaluationResult.java
src/main/java/dev/alexandria/search/eval/EvaluationExporter.java
src/main/resources/application.yml
src/test/java/dev/alexandria/search/eval/EvaluationExporterTest.java
</files>
<action>
**EvaluationResult record:**
```java
import java.util.List;

/** Per-query metrics together with the ranked chunks that produced them. */
public record EvaluationResult(
    String query,
    QueryType queryType,
    List<ChunkResult> chunkResults,
    double recallAt5, double recallAt10, double recallAt20,
    double precisionAt5, double precisionAt10, double precisionAt20,
    double mrr,
    double ndcgAt5, double ndcgAt10, double ndcgAt20,
    double averagePrecision,
    double hitRateAt5, double hitRateAt10, double hitRateAt20
) {
  /** One retrieved chunk: its score, 1-based rank, and golden-set grade. */
  public record ChunkResult(String chunkId, double score, int rank, int relevanceGrade) {}
}
```
**EvaluationExporter** Spring @Service:
- Constructor-injected with `@Value("${alexandria.eval.output-dir:${user.home}/.alexandria/eval}")` for output directory
- `export(List<EvaluationResult> results, String label)` method that writes BOTH CSVs:
**Aggregate CSV** (`eval-aggregate-{timestamp}-{label}.csv`):
- Columns: query_type, count, recall_at_5, recall_at_10, recall_at_20, precision_at_5, precision_at_10, precision_at_20, mrr, ndcg_at_5, ndcg_at_10, ndcg_at_20, map, hit_rate_at_5, hit_rate_at_10, hit_rate_at_20
- One row per QueryType (FACTUAL, CONCEPTUAL, CODE_LOOKUP, TROUBLESHOOTING) with averages
- One GLOBAL row with overall averages across all queries
- Values formatted to 4 decimal places
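The aggregation described above could be sketched as follows. To stay short, the example averages only two of the metric columns through a reduced stand-in record named `Result`; the class name `AggregateRows` is hypothetical. Reporting `0.0000` for a query type with no results is an assumption, since the plan does not say how empty groups should appear.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class AggregateRows {
  enum QueryType { FACTUAL, CONCEPTUAL, CODE_LOOKUP, TROUBLESHOOTING }

  // Reduced stand-in for EvaluationResult: only two of the metric columns.
  record Result(QueryType queryType, double recallAt10, double mrr) {}

  /** Formats one CSV row: label, count, then averaged metrics to 4 decimals. */
  static String row(String label, List<Result> results) {
    // Assumption: an empty group averages to 0.0 instead of being skipped.
    double recall = results.stream().mapToDouble(Result::recallAt10).average().orElse(0.0);
    double mrr = results.stream().mapToDouble(Result::mrr).average().orElse(0.0);
    // Locale.ROOT keeps '.' as the decimal separator regardless of the JVM default locale.
    return String.format(Locale.ROOT, "%s,%d,%.4f,%.4f", label, results.size(), recall, mrr);
  }

  /** Header, one row per query type, then a GLOBAL row over all queries. */
  static List<String> rows(List<Result> results) {
    List<String> rows = new ArrayList<>();
    rows.add("query_type,count,recall_at_10,mrr");
    for (QueryType type : QueryType.values()) {
      rows.add(row(type.name(), results.stream().filter(r -> r.queryType() == type).toList()));
    }
    rows.add(row("GLOBAL", results));
    return rows;
  }
}
```

Formatting with an explicit locale matters here: under a default locale such as `fr_FR`, `%.4f` would emit a decimal comma and corrupt the CSV.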
**Detailed CSV** (`eval-detailed-{timestamp}-{label}.csv`):
- Columns: query, query_type, chunk_id, score, rank, relevance_grade, recall_at_10, mrr, ndcg_at_10
- One row per query+chunk combination (per user decision: "query + chunk_id + score + rank + relevance judgment")
- Include per-query metrics on the first chunk row for each query
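One way to lay out the detailed rows is sketched below. It assumes that the metric columns are left empty on every row after the first one for a query, that scores are written with 4 decimal places like the aggregate values, and that text fields are quoted in the usual CSV manner when they contain a comma or quote. None of these three points is fixed by the plan. `DetailedRows` and the reduced `Result` record are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class DetailedRows {
  record ChunkResult(String chunkId, double score, int rank, int relevanceGrade) {}

  // Reduced stand-in for EvaluationResult with only the detailed-CSV columns.
  record Result(String query, String queryType, List<ChunkResult> chunkResults,
      double recallAt10, double mrr, double ndcgAt10) {}

  /** Quotes a field if it contains a comma, double quote, or line break. */
  static String escape(String field) {
    if (field.contains(",") || field.contains("\"")
        || field.contains("\n") || field.contains("\r")) {
      return "\"" + field.replace("\"", "\"\"") + "\"";
    }
    return field;
  }

  /** One row per chunk; per-query metrics appear only on the first row. */
  static List<String> rows(Result result) {
    List<String> rows = new ArrayList<>();
    boolean first = true;
    for (ChunkResult chunk : result.chunkResults()) {
      // Assumption: metric columns stay empty on every row after the first.
      String metrics = first
          ? String.format(Locale.ROOT, "%.4f,%.4f,%.4f",
              result.recallAt10(), result.mrr(), result.ndcgAt10())
          : ",,";
      rows.add(String.format(Locale.ROOT, "%s,%s,%s,%.4f,%d,%d,%s",
          escape(result.query()), result.queryType(), escape(chunk.chunkId()),
          chunk.score(), chunk.rank(), chunk.relevanceGrade(), metrics));
      first = false;
    }
    return rows;
  }
}
```

Escaping is worth the few extra lines because natural-language queries in the golden set may contain commas.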
**Timestamp format:** ISO local datetime with hyphens replacing colons (e.g., `2026-02-21T14-30-00`), per user decision.
**Directory creation:** Create output directory if it does not exist.
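The filename and directory rules can be sketched with the JDK alone. `EvalFiles` is a hypothetical helper name, and passing the current time in as a parameter is a design assumption that makes the filename testable without a fixed clock.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

final class EvalFiles {
  // ISO local date-time with '-' instead of ':' (colons are not allowed in Windows filenames).
  private static final DateTimeFormatter STAMP =
      DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH-mm-ss");

  /** kind is "aggregate" or "detailed"; label is the caller-supplied run label. */
  static String fileName(String kind, LocalDateTime now, String label) {
    return "eval-" + kind + "-" + STAMP.format(now) + "-" + label + ".csv";
  }

  /** Creates the output directory (and missing parents), then resolves the file inside it. */
  static Path resolve(Path outputDir, String kind, LocalDateTime now, String label)
      throws IOException {
    Files.createDirectories(outputDir); // does nothing if the directory already exists
    return outputDir.resolve(fileName(kind, now, label));
  }
}
```

Using the same `now` value for both files keeps the aggregate and detailed CSVs of one run paired by timestamp.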
**Application config additions** to `application.yml`:
```yaml
alexandria:
  eval:
    output-dir: ${ALEXANDRIA_EVAL_DIR:${user.home}/.alexandria/eval}
    thresholds:
      recall-at-10: 0.70
      mrr: 0.60
```
**Unit tests** for EvaluationExporter:
- Test aggregate CSV output format with known EvaluationResult data (write to temp directory)
- Test detailed CSV output format
- Test filename contains timestamp and label
- Test directory creation when it does not exist
- Use `@TempDir` for isolated filesystem testing — no mocks needed for pure I/O verification
</action>
<verify>
- `./gradlew test --tests "dev.alexandria.search.eval.EvaluationExporterTest"` passes
- `./gradlew spotlessApply && ./gradlew compileJava` passes
- Application config has `alexandria.eval.output-dir` and `alexandria.eval.thresholds` properties
</verify>
<done>
EvaluationExporter writes correctly formatted aggregate and detailed CSVs to a configurable directory with timestamped filenames, and application.yml has configurable thresholds and output directory
</done>
</task>
</tasks>
<verification>
- golden-set.json contains exactly 100 entries with correct type distribution
- EvaluationExporter unit tests pass
- Application properties include eval configuration
- `./gradlew spotlessApply && ./gradlew compileJava` passes clean
</verification>
<success_criteria>
- 100 annotated queries exist in golden-set.json with graded relevance judgments
- Query type distribution matches: ~30 code, ~30 factual, ~25 conceptual, ~15 troubleshooting
- CSV export produces two files (aggregate + detailed) with correct format
- Output directory and thresholds are configurable via application.yml
</success_criteria>
<output>
After completion, create `.planning/phases/13-retrieval-evaluation-framework/13-02-SUMMARY.md`
</output>