# Instructions for Claude Code: n8n Meal Feedback LLM Evaluation Workflow
## Updated with Ground-Truth Generation Approach
Create a plan to build an n8n workflow that evaluates multiple LLM prompts for generating meal feedback using a **thinking model to generate ground truth** for comparison.
## Project Context
**Goal**: Build an n8n workflow that uses a high-quality thinking model (GPT-5.2 with reasoning enabled in this plan; Claude Opus or o1 are comparable alternatives) to generate reference "ground truth" answers, then evaluates faster, cheaper configurations against this ground truth.
**Input Source**: Google Sheet "Meals" with columns:
- ID, tracked_foods, weighted_average, meal_quality, nutritional_value, blood_sugar_impact, blood_fat_impact, meal_score, meal_time
**Output Target**: New Google Sheet with columns:
- ID, tracked_foods, hormone_balance_rating, hunger_and_fullness_explanation, fat_storage_explanation, metabolic_health_rating, metabolic_health_risks_explanation, metabolic_health_benefits_explanation
**Prompts to Evaluate**:
- trackMeal_hormone_balance_rating_bot.md → hormone_balance_rating
- trackMeal_hormone_balance_detail_bot.md → hunger_and_fullness_explanation, fat_storage_explanation
- trackMeal_metabolic_health_rating_bot.md → metabolic_health_rating
- trackMeal_metabolic_health_detail_bot.md → metabolic_health_risks_explanation, metabolic_health_benefits_explanation
**Prompt Structure & Naming Convention**:
You will provide three types of prompts for each evaluation category:
1. **Ground Truth Prompts** (for GPT-5.2 with reasoning):
- `trackMeal_hormone_balance_rating_evaluator.md`
- `trackMeal_hormone_balance_detail_evaluator.md`
- `trackMeal_metabolic_health_rating_evaluator.md`
- `trackMeal_metabolic_health_detail_evaluator.md`
2. **Comparison Prompts** (for evaluating candidate vs ground truth):
- `trackMeal_hormone_balance_rating_comparison.md`
- `trackMeal_hormone_balance_detail_comparison.md`
- `trackMeal_metabolic_health_rating_comparison.md`
- `trackMeal_metabolic_health_detail_comparison.md`
3. **Results Aggregation Prompts** (for generating evaluation reports):
- `trackMeal_hormone_balance_rating_results.md`
- `trackMeal_hormone_balance_detail_results.md`
- `trackMeal_metabolic_health_rating_results.md`
- `trackMeal_metabolic_health_detail_results.md`
**Naming Pattern**:
- Production/Candidate prompts: `*_bot.md`
- Ground truth prompts: `*_evaluator.md`
- Comparison prompts: `*_comparison.md`
- Results aggregation prompts: `*_results.md`
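The naming pattern can be captured in a small helper so that every node resolves prompt files the same way. This is a minimal sketch; `prompt_filename`, `ROLE_SUFFIX`, and `CATEGORIES` are hypothetical names, and it assumes every prompt follows the `trackMeal_<category><suffix>.md` pattern listed above.

```python
# Suffix per prompt role, taken from the naming pattern above.
ROLE_SUFFIX = {
    "candidate": "_bot",
    "ground_truth": "_evaluator",
    "comparison": "_comparison",
    "results": "_results",
}

# The four evaluation categories listed under "Prompts to Evaluate".
CATEGORIES = (
    "hormone_balance_rating",
    "hormone_balance_detail",
    "metabolic_health_rating",
    "metabolic_health_detail",
)


def prompt_filename(category: str, role: str) -> str:
    """Return the prompt file name for a category and role (hypothetical helper)."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    if role not in ROLE_SUFFIX:
        raise ValueError(f"unknown role: {role}")
    return f"trackMeal_{category}{ROLE_SUFFIX[role]}.md"


print(prompt_filename("metabolic_health_detail", "comparison"))
```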
**Prompt Flow Overview**:
```
Test Meal → [*_evaluator.md] → Ground Truth Output
↓
Test Meal → [*_bot.md] → Candidate Output
↓
[*_comparison.md] → Comparison Score + Reasoning
↓
[*_results.md] → Aggregated Report
```
## n8n Credentials Configuration
The following credentials must be configured in n8n before building the workflow:
### Google Services Credentials
**1. Google Sheets IntenseLife**
- **Purpose**: Access to read test meals and write evaluation results to Google Sheets
- **Required Permissions**: Read and write access to sheets
- **Used In**:
- Reading test meals data
- Reading ground truth outputs
- Writing evaluation results
- Writing evaluation summaries
**2. Google Drive account**
- **Purpose**: Access to Google Sheets files via Drive API
- **Required Permissions**: Drive file access
- **Used In**:
- Additional sheet access if needed
- File management operations
### OpenAI Credentials
**3. OpenAI account 2**
- **Purpose**: Access to GPT-5.2 model for ground truth generation and evaluation
- **Required Permissions**: API access with GPT-5.2 model availability
- **Used In**:
- Ground truth generation (GPT-5.2 with reasoning enabled)
- Candidate prompt evaluation (GPT-5.2 with reasoning "none")
- Comparison prompt execution
- Results aggregation prompt execution
**Configuration Notes**:
- Ensure all credentials are properly authenticated before starting workflow setup
- Verify GPT-5.2 model access is available in OpenAI account 2
- Test credential connections in n8n before full workflow deployment
- Set appropriate rate limits and error handling for API calls
## Updated Architecture: Ground-Truth Generation Strategy
### Evaluation Approach Overview
```
[Google Sheet with Test Meals - Provided by User]
↓
[Ground Truth Generation Path]
├─ Run GPT-5.2 with reasoning enabled
├─ Generate reference outputs for all 6 fields
└─ Store ground truth in evaluation sheet
↓
[Model Comparison Path]
├─ Run candidate model: GPT-5.2 with reasoning "none"
├─ Compare against ground truth using metrics
├─ Calculate similarity scores
└─ Generate evaluation report
```
### Key Changes from Original Plan
1. **User-Provided Test Data**: Golden dataset is provided by you via Google Sheet, not manually curated
2. **Automated Ground Truth**: GPT-5.2 with reasoning enabled generates reference answers automatically
3. **Single Model Evaluation**: Comparing prompt variations using GPT-5.2 with reasoning "none"
4. **Comparison-Based Metrics**: Use semantic similarity and AI-based correctness to compare outputs
5. **Two-Stage Evaluation**: First generate truth, then evaluate candidate prompts
## Architecture Requirements
### 1. Three-Path Workflow Structure
```
[Google Sheets Trigger - Read Meals]
↓
[Workflow Mode Switch]
├─ GROUND_TRUTH_GENERATION → [Ground Truth Path]
│ ├─ Load test meals (10-20 samples)
│ ├─ Run thinking model with all prompts
│ ├─ Store outputs as "ground truth"
│ └─ Save to evaluation reference sheet
│
├─ EVALUATION → [Evaluation Path]
│ ├─ Load test meals + ground truth
│ ├─ Run candidate models/prompts
│ ├─ Compare outputs against ground truth
│ ├─ Calculate metrics (similarity, correctness)
│ ├─ Aggregate results
│ └─ Store evaluation results
│
└─ PRODUCTION → [Production Path]
├─ Process all meals from sheet
├─ Run selected optimal prompts/model
└─ Write outputs to new Google Sheet
```
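The mode switch could be backed by a strict parser so that a mistyped mode fails early instead of silently falling through to the wrong path. The sketch below is illustrative only; `WorkflowMode` and `parse_mode` are hypothetical names, and how the mode value reaches the workflow (a Set node, a webhook parameter, and so on) is left open.

```python
from enum import Enum


class WorkflowMode(Enum):
    """The three paths from the diagram above."""
    GROUND_TRUTH_GENERATION = "GROUND_TRUTH_GENERATION"
    EVALUATION = "EVALUATION"
    PRODUCTION = "PRODUCTION"


def parse_mode(raw: str) -> WorkflowMode:
    """Map a raw mode string to a WorkflowMode, rejecting unknown values."""
    try:
        return WorkflowMode[raw.strip().upper()]
    except KeyError:
        valid = ", ".join(mode.name for mode in WorkflowMode)
        raise ValueError(f"unknown mode {raw!r}; expected one of: {valid}") from None


print(parse_mode(" evaluation "))
```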
### 2. Test Meals Dataset (User-Provided)
You will provide a **Google Sheet with test meals** that includes:
**Required Columns (Input Only)**:
- ID, tracked_foods, weighted_average, meal_quality, nutritional_value, blood_sugar_impact, blood_fat_impact, meal_score, meal_time
**No Expected Output Columns Needed** - these will be generated by the thinking model
**Recommended Dataset Composition** (for best evaluation coverage):
- Diverse meal types: balanced, high-sugar, high-fat, low-carb, vegan, etc.
- Various meal timings: breakfast, lunch, dinner, unusual times
- Edge cases: incomplete data, extreme portions, mixed quality meals
- Minimum 10-20 test meals recommended for comprehensive evaluation
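Before ground truth generation, the test sheet could be checked against these requirements. The following is a rough sketch, assuming the rows have already been read into dictionaries keyed by column name; `validate_test_meals` is a hypothetical helper. It checks only that the columns exist and that IDs are present and unique, since incomplete values are a deliberate edge case in the dataset.

```python
REQUIRED_COLUMNS = (
    "ID", "tracked_foods", "weighted_average", "meal_quality",
    "nutritional_value", "blood_sugar_impact", "blood_fat_impact",
    "meal_score", "meal_time",
)
MIN_TEST_MEALS = 10  # lower bound of the recommended 10-20 meals


def validate_test_meals(rows: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means the sheet looks usable."""
    problems = []
    if len(rows) < MIN_TEST_MEALS:
        problems.append(f"only {len(rows)} test meals; at least {MIN_TEST_MEALS} recommended")
    seen_ids = set()
    for index, row in enumerate(rows, start=1):
        missing = [column for column in REQUIRED_COLUMNS if column not in row]
        if missing:
            problems.append(f"row {index}: missing columns {missing}")
        meal_id = str(row.get("ID", "")).strip()
        if not meal_id:
            problems.append(f"row {index}: empty ID")
        elif meal_id in seen_ids:
            problems.append(f"row {index}: duplicate ID {meal_id}")
        seen_ids.add(meal_id)
    return problems
```

Unique IDs matter because the evaluation path later joins test meals to ground truth rows by ID.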
### 3. Ground Truth Generation Configuration
#### Thinking Model Selection
**Ground Truth Generator**: GPT-5.2 with reasoning enabled
**Selection Rationale**:
- Prioritizes accuracy over speed/cost (ground truth only runs once)
- Provides best domain knowledge in nutrition/metabolism
- Strongest reasoning capabilities for complex meal analysis
#### Ground Truth Prompt Configuration
Ground truth will be generated using the **evaluator prompts** you provide (e.g., `trackMeal_hormone_balance_rating_evaluator.md`).
These evaluator prompts are specifically designed for:
- High-quality reasoning with GPT-5.2
- Generating reference outputs that serve as ground truth
- Comprehensive analysis of meal data
- Scientific accuracy and detailed explanations
The evaluator prompts should prioritize accuracy and thoroughness over speed, as they only run once to establish the reference standard.
#### Ground Truth Storage Schema
Create **"GroundTruth_Outputs"** sheet with columns:
- ID (matches test meal)
- tracked_foods (for reference)
- **ground_truth_hormone_balance_rating**
- **ground_truth_hunger_and_fullness_explanation**
- **ground_truth_fat_storage_explanation**
- **ground_truth_metabolic_health_rating**
- **ground_truth_metabolic_health_risks_explanation**
- **ground_truth_metabolic_health_benefits_explanation**
- generation_timestamp
- thinking_model_used (GPT-5.2 with reasoning)
- prompt_version
### 4. Updated Evaluation Metrics
For each output field, compare candidate model output against ground truth:
#### For Rating Fields (hormone_balance_rating, metabolic_health_rating):
**Metric 1: Exact Match**
- Binary: Does candidate output exactly match ground truth?
- Score: 0 (no match) or 1 (match)
- Use for: Fixed categorical ratings
**Metric 2: Semantic Similarity**
- Calculate embedding similarity between candidate and ground truth
- Score: 0-1 (cosine similarity)
- Use for: When ratings might use different wording but same meaning
**Metric 3: AI-Based Correctness**
- Use your provided comparison prompts (e.g., `*_comparison.md`) to compare candidate vs ground truth
- The comparison prompt evaluates semantic alignment between outputs
- Score: 1-5 scale (as defined in your comparison prompts)
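The exact-match metric can be sketched as follows. Treating ratings that differ only in case or surrounding whitespace as equal is an assumption, not something the plan specifies; drop `normalize_rating` if a byte-for-byte match is required.

```python
def normalize_rating(value: str) -> str:
    """Lowercase and collapse whitespace so formatting noise does not count as a mismatch."""
    return " ".join(value.strip().lower().split())


def exact_match(candidate: str, ground_truth: str) -> int:
    """Return 1 if the normalized ratings are identical, otherwise 0."""
    return int(normalize_rating(candidate) == normalize_rating(ground_truth))


print(exact_match(" Good ", "good"))
print(exact_match("Good", "Moderate"))
```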
#### For Explanation Fields (all explanation outputs):
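**Metric 1: Embedding Similarity**
- Embed the candidate and ground truth texts, then take the cosine similarity of the two vectors (BERTScore, which matches token-level embeddings, is a related alternative)
- Score: 0-1 (cosine similarity)
- Primary metric for semantic comparison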
**Metric 2: AI-Based Correctness**
- Use your provided comparison prompts (e.g., `*_comparison.md`) to evaluate candidate vs ground truth
- The comparison prompt will analyze how well the candidate output matches ground truth
- Score: 1-5 scale (as defined in your comparison prompts)
- Each comparison prompt is tailored to the specific output type
**Metric 3: Key Information Extraction**
- Extract key facts from both candidate and ground truth
- Compare fact overlap
- Score: Percentage of ground truth facts present in candidate
**Metric 4: Length Ratio**
- Compare output lengths (too verbose or too terse)
- Score: Ratio of candidate/ground truth length (ideal: 0.8-1.2)
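The non-LLM metrics for explanation fields reduce to a few lines of arithmetic. The sketch below assumes embeddings are already available as plain lists of floats and that key facts have been extracted into sets beforehand (for example by an LLM step); counting length in words is also an assumption. All function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def length_ratio(candidate: str, ground_truth: str) -> float:
    """Candidate length divided by ground truth length, measured in words."""
    truth_words = len(ground_truth.split())
    if truth_words == 0:
        raise ValueError("ground truth is empty")
    return len(candidate.split()) / truth_words


def length_ratio_ok(ratio: float, low: float = 0.8, high: float = 1.2) -> bool:
    """True when the ratio falls inside the ideal 0.8-1.2 band."""
    return low <= ratio <= high


def fact_overlap(ground_truth_facts: set[str], candidate_facts: set[str]) -> float:
    """Fraction of ground truth facts that also appear in the candidate."""
    if not ground_truth_facts:
        return 1.0  # nothing to miss
    return len(ground_truth_facts & candidate_facts) / len(ground_truth_facts)
```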
### 5. Workflow Components
#### Component 1: Ground Truth Generation Path
```
[Manual Trigger: "Generate Ground Truth"]
↓
[Google Sheets: Read Test Meals from User-Provided Sheet]
↓
[Loop Through Each Test Meal]
↓
[For Each Meal - Run GPT-5.2 with Reasoning Enabled]:
├─ Call LLM: hormone_balance_rating (trackMeal_hormone_balance_rating_evaluator.md)
├─ Call LLM: hormone_balance_detail (trackMeal_hormone_balance_detail_evaluator.md)
├─ Call LLM: metabolic_health_rating (trackMeal_metabolic_health_rating_evaluator.md)
└─ Call LLM: metabolic_health_detail (trackMeal_metabolic_health_detail_evaluator.md)
↓
[Structure Ground Truth Output]
↓
[Google Sheets: Write to GroundTruth_Outputs sheet]
↓
[Completion Notice: "Ground truth generated for X meals"]
```
**Important**: Ground truth generation only runs when explicitly triggered, not automatically.
#### Component 2: Evaluation Path (Candidate Testing)
```
[Manual Trigger: "Run Evaluation"]
↓
[Set Evaluation Config]:
- Candidate Model: GPT-5.2 with reasoning "none"
- Prompt Version: [e.g., V1, V2, etc.]
↓
[Google Sheets: Read Test Meals]
↓
[Google Sheets: Read Ground Truth Outputs]
↓
[Loop Through Test Meals]
↓
[For Each Meal - Run Candidate Model (GPT-5.2 reasoning: none)]:
├─ Call LLM: hormone_balance_rating (*_bot.md)
├─ Call LLM: hormone_balance_detail (*_bot.md)
├─ Call LLM: metabolic_health_rating (*_bot.md)
└─ Call LLM: metabolic_health_detail (*_bot.md)
↓
[Comparison & Metrics Calculation]:
├─ Calculate embedding similarity for each field
├─ Run comparison prompts (*_comparison.md) for correctness scoring
├─ Calculate exact matches for ratings
└─ Aggregate scores per meal
↓
[Store Individual Evaluation Results]
↓
[Generate Aggregated Report]:
├─ Run results prompts (*_results.md)
└─ Create evaluation summary report
```
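The two sheet reads in this path have to be paired by ID before any comparison can run. A minimal sketch of that join, assuming both sheets are available as lists of dictionaries with an `ID` column; `join_meals_with_ground_truth` is a hypothetical helper.

```python
def join_meals_with_ground_truth(meals: list[dict], ground_truth_rows: list[dict]):
    """Pair each test meal with its ground truth row by ID.

    Returns (paired, missing_ids); missing_ids lists meals that have no ground truth yet.
    """
    # Compare IDs as strings, since sheet values may arrive as numbers or text.
    truth_by_id = {str(row["ID"]): row for row in ground_truth_rows}
    paired, missing_ids = [], []
    for meal in meals:
        truth = truth_by_id.get(str(meal["ID"]))
        if truth is None:
            missing_ids.append(meal["ID"])
        else:
            paired.append((meal, truth))
    return paired, missing_ids
```

A non-empty `missing_ids` list is a signal to rerun ground truth generation before evaluating.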
#### Component 3: n8n Evaluation Node Configuration
Configure n8n Evaluation node:
- **Input Column**: test_meals.tracked_foods
- **Expected Output Source**: GroundTruth_Outputs sheet
- **Candidate Output Source**: Current evaluation run
- **Metrics to Calculate**:
- Embedding Similarity (for all 6 fields)
- Comparison Scores (using *_comparison.md prompts for all 6 fields)
- Exact Match (for rating fields only)
- **Results Aggregation**: Use *_results.md prompts to generate summary reports
#### Component 4: Comparison Prompts Configuration
You will provide **comparison prompts** for each output field that evaluate candidate outputs against ground truth.
**Comparison Prompt Structure**:
Each comparison prompt (e.g., `trackMeal_hormone_balance_rating_comparison.md`) should:
- Accept ground truth output as input
- Accept candidate output as input
- Compare the two outputs based on domain-specific criteria
- Provide a score (1-5 scale recommended)
- Include reasoning for the score
**Example Comparison Flow**:
```
Input to Comparison Prompt:
- Ground Truth: [output from evaluator]
- Candidate: [output from bot]
- Meal Context: [original meal data]
Output from Comparison Prompt:
- Score: [1-5]
- Reasoning: [explanation of differences/similarities]
- Key Gaps: [what the candidate missed]
- Alignment: [where candidate matched ground truth]
```
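If the comparison prompts return the labelled lines shown above, the response could be parsed roughly as follows. This is a sketch under that assumption; if the prompts are written to return JSON instead, a JSON parser is the better choice. `parse_comparison_output` is a hypothetical helper.

```python
import re

LABELS = ("Score", "Reasoning", "Key Gaps", "Alignment")


def parse_comparison_output(text: str) -> dict:
    """Parse 'Label: value' lines into a dict; unlabelled lines extend the previous section."""
    sections: dict[str, str] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip().lstrip("-").strip()
        if not line:
            continue
        for label in LABELS:
            prefix = f"{label}:"
            if line.lower().startswith(prefix.lower()):
                current = label
                sections[label] = line[len(prefix):].strip()
                break
        else:
            if current is not None:
                sections[current] = f"{sections[current]} {line}".strip()
    # Accept "4" or "4/5"; anything else is treated as a malformed response.
    match = re.fullmatch(r"([1-5])(?:\s*/\s*5)?", sections.get("Score", ""))
    if match is None:
        raise ValueError("comparison output has no valid 1-5 score")
    return {
        "score": int(match.group(1)),
        "reasoning": sections.get("Reasoning", ""),
        "key_gaps": sections.get("Key Gaps", ""),
        "alignment": sections.get("Alignment", ""),
    }
```

Raising on a malformed score lets the workflow retry or flag the row rather than store a silent default.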
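**Comparison Model**: Use GPT-5.2 (covered by the OpenAI credential above), GPT-4, or Claude Sonnet for running comparison prompts; Claude Sonnet requires an Anthropic credential that is not listed in the credentials section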
#### Component 5: Results Aggregation Prompts
You will provide **results aggregation prompts** that synthesize comparison results into evaluation reports.
**Results Prompt Structure**:
Each results prompt (e.g., `trackMeal_hormone_balance_rating_results.md`) should:
- Accept all comparison results for that field across test meals
- Aggregate scores and identify patterns
- Generate insights about prompt performance
- Highlight specific failure cases
- Provide recommendations for improvement
**Example Results Flow**:
```
Input to Results Prompt:
- All comparison scores for this field
- All comparison reasoning
- Test meal IDs and contexts
- Statistical metrics (avg, min, max, median)
Output from Results Prompt:
- Overall Performance Summary
- Success Rate: [percentage of high-scoring outputs]
- Common Failure Patterns: [what types of meals failed]
- Best Performance: [which meal types worked well]
- Recommendations: [prompt improvements to consider]
```
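The statistical metrics passed into a results prompt can be computed with the standard library. In this sketch, counting a score of 4 or higher as "high-scoring" is an assumption, since the plan does not define the success cutoff; `summarize_scores` is a hypothetical helper.

```python
from statistics import mean, median


def summarize_scores(scores: list[float], success_threshold: float = 4) -> dict:
    """Compute avg, min, max, median, and the share of scores at or above the threshold."""
    if not scores:
        raise ValueError("no comparison scores to summarize")
    high_scoring = sum(1 for score in scores if score >= success_threshold)
    return {
        "avg": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "median": median(scores),
        "success_rate": high_scoring / len(scores),
    }


print(summarize_scores([5, 4, 3, 4]))
```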
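**Aggregation Model**: Use GPT-5.2 (covered by the OpenAI credential above), GPT-4, or Claude Sonnet for generating summary reports; Claude Sonnet requires an Anthropic credential that is not listed in the credentials section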
#### Component 6: Results Storage & Reporting
**Evaluation Results Sheet** with columns:
- evaluation_run_id
- timestamp
- candidate_model (GPT-5.2 reasoning: none)
- prompt_version
- test_meal_id
- **field_name** (which output being evaluated)
- **ground_truth_value**
- **candidate_value**
- **embedding_similarity_score**
- **comparison_score** (from *_comparison.md prompts)
- **comparison_reasoning** (explanation from comparison prompt)
- **exact_match** (boolean)
- **key_gaps_identified** (what candidate missed)
**Evaluation Summary Sheet** with aggregated metrics:
- evaluation_run_id
- timestamp
- candidate_model (GPT-5.2 reasoning: none)
- prompt_version
- **avg_embedding_similarity** (across all fields)
- **avg_comparison_score** (across all fields)
- **rating_exact_match_rate** (%)
- **cost_per_meal**
- **total_evaluation_cost**
- **quality_to_cost_ratio**
- **recommended_for_production** (boolean - best performing prompt)
- **aggregated_insights** (from *_results.md prompts)
- **common_failure_patterns** (from results prompts)
- **improvement_recommendations** (from results prompts)
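The numeric columns of the summary sheet can be derived from the per-field result rows. The plan does not define how `quality_to_cost_ratio` is calculated, so the sketch below assumes average comparison score divided by cost per meal; that formula, and the `build_summary` helper itself, are illustrative assumptions.

```python
RATING_FIELDS = {"hormone_balance_rating", "metabolic_health_rating"}


def build_summary(result_rows: list[dict], total_evaluation_cost: float) -> dict:
    """Aggregate Evaluation Results rows into the numeric summary columns."""
    if not result_rows:
        raise ValueError("no evaluation results to aggregate")
    meal_ids = {row["test_meal_id"] for row in result_rows}
    rating_rows = [row for row in result_rows if row["field_name"] in RATING_FIELDS]
    count = len(result_rows)
    avg_similarity = sum(row["embedding_similarity_score"] for row in result_rows) / count
    avg_comparison = sum(row["comparison_score"] for row in result_rows) / count
    # Exact match applies to rating fields only; report it as a percentage.
    match_rate = None
    if rating_rows:
        matches = sum(1 for row in rating_rows if row["exact_match"])
        match_rate = 100 * matches / len(rating_rows)
    cost_per_meal = total_evaluation_cost / len(meal_ids)
    # Assumed formula: average comparison score per dollar spent on one meal.
    ratio = avg_comparison / cost_per_meal if cost_per_meal > 0 else None
    return {
        "avg_embedding_similarity": avg_similarity,
        "avg_comparison_score": avg_comparison,
        "rating_exact_match_rate": match_rate,
        "cost_per_meal": cost_per_meal,
        "total_evaluation_cost": total_evaluation_cost,
        "quality_to_cost_ratio": ratio,
    }
```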
#### Component 7: Production Path (Unchanged)
```
[Production Trigger: Schedule/Webhook]
↓
[Google Sheets: Read All Meals]
↓
[Loop Through Meals]
↓
[Run Selected Model with Optimal Prompts]
↓
[Write Results to Output Sheet]
```
### Evaluation Best Practices
- **Isolate variables**: Test one prompt variation at a time (keep model constant at GPT-5.2 reasoning: none)
- **Multiple runs**: Run each evaluation 2-3 times to account for non-determinism
- **Cost tracking**: Calculate total cost (ground truth generation + evaluation + production)
- **Threshold setting**: Define minimum acceptable scores (e.g., >0.8 embedding similarity, >4.0 comparison score)
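Averaging repeated runs and applying the thresholds could look like the sketch below. It uses the example cutoffs from the list above (0.8 embedding similarity, 4.0 comparison score) and assumes each run has already been reduced to a summary dictionary; `average_runs` and `passes_thresholds` are hypothetical helpers.

```python
from statistics import mean

MIN_EMBEDDING_SIMILARITY = 0.8  # example threshold from the list above
MIN_COMPARISON_SCORE = 4.0      # example threshold from the list above


def average_runs(run_summaries: list[dict]) -> dict:
    """Average the headline metrics over repeated runs of the same prompt version."""
    if not run_summaries:
        raise ValueError("at least one run is required")
    return {
        "runs": len(run_summaries),
        "avg_embedding_similarity": mean(
            [run["avg_embedding_similarity"] for run in run_summaries]),
        "avg_comparison_score": mean(
            [run["avg_comparison_score"] for run in run_summaries]),
    }


def passes_thresholds(summary: dict) -> bool:
    """True when both averaged metrics are strictly above their minimums."""
    return (summary["avg_embedding_similarity"] > MIN_EMBEDDING_SIMILARITY
            and summary["avg_comparison_score"] > MIN_COMPARISON_SCORE)
```

Averaging before thresholding keeps a single lucky or unlucky run from deciding whether a prompt version qualifies.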
### Comparison Framework
```
Evaluation Decision Matrix:
Prompt Version A (Baseline):
- Avg Embedding Similarity: 0.85
- Avg Comparison Score: 4.2/5
- Cost per meal: $0.02
- Quality-Cost Ratio: 212.5
Prompt Version B (Enhanced):
- Avg Embedding Similarity: 0.88
- Avg Comparison Score: 4.5/5
- Cost per meal: $0.02
- Quality-Cost Ratio: 220.0
Prompt Version C (Few-Shot):
- Avg Embedding Similarity: 0.92
- Avg Comparison Score: 4.8/5
- Cost per meal: $0.02
- Quality-Cost Ratio: 240.0
DECISION: Choose Prompt Version C
- Highest quality scores (from comparison prompts)
- Same cost (same model)
- Best semantic alignment with ground truth
- Optimal for production use
```