13-02-PLAN

---
phase: 13-retrieval-evaluation-framework
plan: 02
type: execute
wave: 1
depends_on: []
files_modified:
  - src/main/java/dev/alexandria/search/eval/EvaluationExporter.java
  - src/main/java/dev/alexandria/search/eval/GoldenSetEntry.java
  - src/main/java/dev/alexandria/search/eval/EvaluationResult.java
  - src/main/resources/eval/golden-set.json
  - src/main/resources/application.yml
  - src/test/java/dev/alexandria/search/eval/EvaluationExporterTest.java
autonomous: true
requirements:
  - EVAL-02
  - EVAL-05
must_haves:
  truths:
    - "A golden-set.json file contains 100 annotated queries with graded relevance judgments"
    - "Each query has a type (factual, conceptual, code_lookup, troubleshooting) matching the required distribution"
    - "Each query has relevance judgments with chunk identifiers and grades 0-2"
    - "EvaluationExporter writes aggregate CSV with global and per-type metrics"
    - "EvaluationExporter writes detailed CSV with per-query chunk results"
    - "CSV filenames contain ISO timestamp and configurable label"
    - "Output directory is configurable via application properties"
  artifacts:
    - path: "src/main/resources/eval/golden-set.json"
      provides: "100 annotated queries for retrieval evaluation"
      min_lines: 200
    - path: "src/main/java/dev/alexandria/search/eval/GoldenSetEntry.java"
      provides: "Record for parsing golden set entries"
      contains: "record GoldenSetEntry"
    - path: "src/main/java/dev/alexandria/search/eval/EvaluationResult.java"
      provides: "Record for per-query evaluation results"
      contains: "record EvaluationResult"
    - path: "src/main/java/dev/alexandria/search/eval/EvaluationExporter.java"
      provides: "CSV export service for evaluation results"
      contains: "class EvaluationExporter"
    - path: "src/test/java/dev/alexandria/search/eval/EvaluationExporterTest.java"
      provides: "Unit tests for CSV export"
      min_lines: 50
  key_links:
    - from: "src/main/java/dev/alexandria/search/eval/EvaluationExporter.java"
      to: "src/main/resources/application.yml"
      via: "@Value for output directory and default label"
      pattern: "alexandria\\.eval"
    - from: "src/main/resources/eval/golden-set.json"
      to: "src/main/java/dev/alexandria/search/eval/GoldenSetEntry.java"
      via: "Jackson deserialization"
      pattern: "GoldenSetEntry"
---

<objective>
Create the golden set of 100 annotated queries and the CSV export service for evaluation results.

Purpose: The golden set provides the ground truth for measuring retrieval quality. The exporter enables tracking metric trends across pipeline changes (phases 14-18). Both are independent of the metrics computation (plan 01).

Output: golden-set.json with 100 queries, GoldenSetEntry/EvaluationResult records, EvaluationExporter service, application config.
</objective>

<execution_context>
@./.claude/get-shit-done/workflows/execute-plan.md
@./.claude/get-shit-done/templates/summary.md
</execution_context>

<context>
@.planning/PROJECT.md
@.planning/ROADMAP.md
@.planning/STATE.md
@.planning/phases/13-retrieval-evaluation-framework/13-CONTEXT.md
@src/main/resources/application.yml
</context>

<tasks>

<task type="auto">
<name>Task 1: Create golden set data model and 100 annotated queries</name>
<files>
src/main/java/dev/alexandria/search/eval/GoldenSetEntry.java
src/main/resources/eval/golden-set.json
</files>
<action>
Create `GoldenSetEntry` record in `dev.alexandria.search.eval`:

```java
import java.util.List;

public record GoldenSetEntry(
    String query,
    QueryType queryType,
    List<RelevanceJudgment> judgments
) {}
```

Note: `RelevanceJudgment` and `QueryType` are created in plan 01. If plan 02 executes first, create minimal placeholder versions that plan 01 will flesh out. Use the same package `dev.alexandria.search.eval`.

Create `src/main/resources/eval/golden-set.json` with 100 queries about Spring Boot documentation.

**Distribution (per user decision):**
- ~30 code lookup queries (e.g., "How to create a REST controller with Spring Boot", "Spring Boot @Transactional annotation example")
- ~30 factual queries (e.g., "What is the default embedded server in Spring Boot", "Spring Boot auto-configuration order")
- ~25 conceptual queries (e.g., "How does Spring Boot auto-configuration work", "Difference between @Component and @Service")
- ~15 troubleshooting queries (e.g., "Spring Boot application fails to start with port already in use", "How to fix circular dependency in Spring Boot")

**JSON structure:**

```json
[
  {
    "query": "How to create a REST controller in Spring Boot",
    "queryType": "CODE_LOOKUP",
    "judgments": [
      { "chunkId": "spring-boot/web/rest-controllers#creating-rest-controller", "grade": 2 },
      { "chunkId": "spring-boot/web/rest-controllers#request-mapping", "grade": 1 },
      { "chunkId": "spring-boot/web/rest-controllers#response-body", "grade": 1 }
    ]
  }
]
```

**ChunkId format:** Use semantic path-style identifiers that map to Spring Boot documentation sections. These are logical identifiers: the actual matching against real chunks happens in the integration test (plan 03), which maps them to real chunk content via source_url + section_path metadata.

**Relevance grades (per user decision):**
- 0 = not relevant (do NOT include grade-0 entries in the JSON; absence means 0)
- 1 = partially relevant (related topic but not the primary answer)
- 2 = highly relevant (directly answers the query)

Each query should have 2-5 relevant judgments (grade 1 or 2). Aim for realistic relevance: not every query will have 5 highly relevant chunks.

**Quality criteria for queries:**
- Queries should be natural language, as a developer would type them
- Vary specificity: some broad ("Spring Boot configuration"), some narrow ("@ConditionalOnProperty annotation syntax")
- Include queries with technical terms (annotations, class names) and queries with natural language descriptions
</action>
<verify>
- File `src/main/resources/eval/golden-set.json` exists and is valid JSON
- JSON contains exactly 100 entries
- Distribution: approximately 30 CODE_LOOKUP, 30 FACTUAL, 25 CONCEPTUAL, 15 TROUBLESHOOTING
- Each entry has query, queryType, and at least 1 judgment with grade 1 or 2
- `./gradlew compileJava` passes
</verify>
<done>
golden-set.json contains 100 diverse, well-structured queries with graded relevance judgments covering all 4 query types in the specified distribution
</done>
</task>

<task type="auto">
<name>Task 2: Create CSV export service with application configuration</name>
<files>
src/main/java/dev/alexandria/search/eval/EvaluationResult.java
src/main/java/dev/alexandria/search/eval/EvaluationExporter.java
src/main/resources/application.yml
src/test/java/dev/alexandria/search/eval/EvaluationExporterTest.java
</files>
<action>
**EvaluationResult record:**

```java
import java.util.List;

public record EvaluationResult(
    String query,
    QueryType queryType,
    List<ChunkResult> chunkResults,
    double recallAt5, double recallAt10, double recallAt20,
    double precisionAt5, double precisionAt10, double precisionAt20,
    double mrr,
    double ndcgAt5, double ndcgAt10, double ndcgAt20,
    double averagePrecision,
    double hitRateAt5, double hitRateAt10, double hitRateAt20
) {
    // One retrieved chunk for the query, with its rank and golden-set grade.
    public record ChunkResult(String chunkId, double score, int rank, int relevanceGrade) {}
}
```

**EvaluationExporter** Spring @Service:
- Constructor-injected with `@Value("${alexandria.eval.output-dir:${user.home}/.alexandria/eval}")` for output directory
- `export(List<EvaluationResult> results, String label)` method that writes BOTH CSVs:

**Aggregate CSV** (`eval-aggregate-{timestamp}-{label}.csv`):
- Columns: query_type, count, recall_at_5, recall_at_10, recall_at_20, precision_at_5, precision_at_10, precision_at_20, mrr, ndcg_at_5, ndcg_at_10, ndcg_at_20, map, hit_rate_at_5, hit_rate_at_10, hit_rate_at_20
- One row per QueryType (FACTUAL, CONCEPTUAL, CODE_LOOKUP, TROUBLESHOOTING) with averages
- One GLOBAL row with overall averages across all queries
- Values formatted to 4 decimal places

**Detailed CSV** (`eval-detailed-{timestamp}-{label}.csv`):
- Columns: query, query_type, chunk_id, score, rank, relevance_grade, recall_at_10, mrr, ndcg_at_10
- One row per query+chunk combination (per user decision: "query + chunk_id + score + rank + relevance judgment")
- Include per-query metrics on the first chunk row for each query

**Timestamp format:** ISO local datetime with hyphens replacing colons (e.g., `2026-02-21T14-30-00`), per user decision.

**Directory creation:** Create output directory if it does not exist.

**Application config additions** to `application.yml`:

```yaml
alexandria:
  eval:
    output-dir: ${ALEXANDRIA_EVAL_DIR:${user.home}/.alexandria/eval}
    thresholds:
      recall-at-10: 0.70
      mrr: 0.60
```

**Unit tests** for EvaluationExporter:
- Test aggregate CSV output format with known EvaluationResult data (write to temp directory)
- Test detailed CSV output format
- Test filename contains timestamp and label
- Test directory creation when it does not exist
- Use `@TempDir` for isolated filesystem testing; no mocks are needed for pure I/O verification
</action>
<verify>
- `./gradlew test --tests "dev.alexandria.search.eval.EvaluationExporterTest"` passes
- `./gradlew spotlessApply && ./gradlew compileJava` passes
- Application config has `alexandria.eval.output-dir` and `alexandria.eval.thresholds` properties
</verify>
<done>
EvaluationExporter writes correctly formatted aggregate and detailed CSVs to a configurable directory with timestamped filenames, and application.yml has configurable thresholds and output directory
</done>
</task>

</tasks>

<verification>
- golden-set.json contains exactly 100 entries with correct type distribution
- EvaluationExporter unit tests pass
- Application properties include eval configuration
- `./gradlew spotlessApply && ./gradlew compileJava` passes clean
</verification>

<success_criteria>
- 100 annotated queries exist in golden-set.json with graded relevance judgments
- Query type distribution matches: ~30 code, ~30 factual, ~25 conceptual, ~15 troubleshooting
- CSV export produces two files (aggregate + detailed) with correct format
- Output directory and thresholds are configurable via application.yml
</success_criteria>

<output>
After completion, create `.planning/phases/13-retrieval-evaluation-framework/13-02-SUMMARY.md`
</output>
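
The filename and number-formatting rules from Task 2 can be sketched as follows. This is a minimal illustration, not the planned `EvaluationExporter`: the class name `EvalCsvSketch` and its helper methods are hypothetical, and returning 0.0 for an empty group is an assumption the plan does not specify.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class; names are illustrative and not part of the plan.
class EvalCsvSketch {

    // ISO local date-time with hyphens in place of colons, e.g. 2026-02-21T14-30-00.
    private static final DateTimeFormatter FILE_TIMESTAMP =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH-mm-ss");

    // Builds eval-{kind}-{timestamp}-{label}.csv, where kind is "aggregate" or "detailed".
    static String fileName(String kind, LocalDateTime timestamp, String label) {
        return "eval-" + kind + "-" + timestamp.format(FILE_TIMESTAMP) + "-" + label + ".csv";
    }

    // Formats a metric to 4 decimal places; Locale.ROOT keeps '.' as the decimal
    // separator so the CSV does not depend on the machine's default locale.
    static String formatMetric(double value) {
        return String.format(Locale.ROOT, "%.4f", value);
    }

    // Averages one metric over a group of queries.
    // Assumption: an empty group yields 0.0 rather than NaN.
    static double mean(List<Double> values) {
        return values.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    public static void main(String[] args) {
        LocalDateTime ts = LocalDateTime.of(2026, 2, 21, 14, 30, 0);
        System.out.println(fileName("aggregate", ts, "baseline"));
        System.out.println(formatMetric(mean(List.of(0.5, 0.75, 1.0))));
    }
}
```

Passing the timestamp in as a parameter, rather than calling `LocalDateTime.now()` inside the helper, keeps the filename test deterministic under `@TempDir`.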
