Instructions for Claude Code: n8n Meal Feedback LLM Evaluation Workflow

Updated with Ground-Truth Generation Approach

Create a plan to build an n8n workflow that evaluates multiple LLM prompts for generating meal feedback, using a thinking model to generate the ground truth against which candidate outputs are compared.

Project Context

Goal: Build an n8n workflow that uses a high-quality thinking model (GPT-5.2 with reasoning enabled) to generate reference "ground truth" answers, then evaluates the faster, cheaper candidate configuration (GPT-5.2 with reasoning "none") against this ground truth.

Input Source: Google Sheet "Meals" with columns:

ID, tracked_foods, weighted_average, meal_quality, nutritional_value, blood_sugar_impact, blood_fat_impact, meal_score, meal_time

Output Target: New Google Sheet with columns:

ID, tracked_foods, hormone_balance_rating, hunger_and_fullness_explanation, fat_storage_explanation, metabolic_health_rating, metabolic_health_risks_explanation, metabolic_health_benefits_explanation

Prompts to Evaluate:

trackMeal_hormone_balance_rating_bot.md → hormone_balance_rating
trackMeal_hormone_balance_detail_bot.md → hunger_and_fullness_explanation, fat_storage_explanation
trackMeal_metabolic_health_rating_bot.md → metabolic_health_rating
trackMeal_metabolic_health_detail_bot.md → metabolic_health_risks_explanation, metabolic_health_benefits_explanation

Prompt Structure & Naming Convention:

You will provide three types of prompts for each evaluation category:

Ground Truth Prompts (for GPT-5.2 with reasoning):
- trackMeal_hormone_balance_rating_evaluator.md
- trackMeal_hormone_balance_detail_evaluator.md
- trackMeal_metabolic_health_rating_evaluator.md
- trackMeal_metabolic_health_detail_evaluator.md
Comparison Prompts (for evaluating candidate vs ground truth):
- trackMeal_hormone_balance_rating_comparison.md
- trackMeal_hormone_balance_detail_comparison.md
- trackMeal_metabolic_health_rating_comparison.md
- trackMeal_metabolic_health_detail_comparison.md
Results Aggregation Prompts (for generating evaluation reports):
- trackMeal_hormone_balance_rating_results.md
- trackMeal_hormone_balance_detail_results.md
- trackMeal_metabolic_health_rating_results.md
- trackMeal_metabolic_health_detail_results.md

Naming Pattern:

Production/Candidate prompts: *_bot.md
Ground truth prompts: *_evaluator.md
Comparison prompts: *_comparison.md
Results aggregation prompts: *_results.md
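
The naming pattern above can be captured in a small helper so that every node resolves prompt files the same way. This is a minimal sketch: the names CATEGORIES, ROLES, and prompt_filename are hypothetical, and it assumes every prompt follows the trackMeal_<category>_<role>.md pattern.

```python
# Hypothetical helper: derive a prompt filename from the naming pattern.
CATEGORIES = (
    "hormone_balance_rating",
    "hormone_balance_detail",
    "metabolic_health_rating",
    "metabolic_health_detail",
)

# Role suffixes taken from the naming pattern:
# bot = production/candidate, evaluator = ground truth,
# comparison = candidate vs ground truth, results = aggregation.
ROLES = ("bot", "evaluator", "comparison", "results")


def prompt_filename(category: str, role: str) -> str:
    """Return the prompt file name for a category and role."""
    if category not in CATEGORIES:
        raise ValueError(f"Unknown category: {category!r}")
    if role not in ROLES:
        raise ValueError(f"Unknown role: {role!r}")
    return f"trackMeal_{category}_{role}.md"
```

For example, prompt_filename("metabolic_health_detail", "comparison") yields the comparison prompt name listed above for that category.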

Prompt Flow Overview:

Test Meal → [*_evaluator.md] → Ground Truth Output
                                        ↓
Test Meal → [*_bot.md] → Candidate Output
                                        ↓
                        [*_comparison.md] → Comparison Score + Reasoning
                                        ↓
                        [*_results.md] → Aggregated Report

n8n Credentials Configuration

The following credentials must be configured in n8n before building the workflow:

Google Services Credentials

1. Google Sheets IntenseLife

Purpose: Access to read test meals and write evaluation results to Google Sheets
Required Permissions: Read and write access to sheets
Used In:
- Reading test meals data
- Reading ground truth outputs
- Writing evaluation results
- Writing evaluation summaries

2. Google Drive account

Purpose: Access to Google Sheets files via Drive API
Required Permissions: Drive file access
Used In:
- Additional sheet access if needed
- File management operations

OpenAI Credentials

3. OpenAI account 2

Purpose: Access to GPT-5.2 model for ground truth generation and evaluation
Required Permissions: API access with GPT-5.2 model availability
Used In:
- Ground truth generation (GPT-5.2 with reasoning enabled)
- Candidate prompt evaluation (GPT-5.2 with reasoning "none")
- Comparison prompt execution
- Results aggregation prompt execution

Configuration Notes:

Ensure all credentials are properly authenticated before starting workflow setup
Verify GPT-5.2 model access is available in OpenAI account 2
Test credential connections in n8n before full workflow deployment
Set appropriate rate limits and error handling for API calls
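
One possible shape for the error handling mentioned above is a retry wrapper with exponential backoff. This is only a sketch under stated assumptions: call_with_retry, max_attempts, and base_delay are hypothetical names and values, and inside n8n the node-level retry settings may already cover this.

```python
import time


def call_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**(attempt - 1) and retry.

    The attempt count and delays are illustrative defaults, not values
    from the specification. The last failure is re-raised unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing sleep as a parameter keeps the helper testable without real waiting.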

Updated Architecture: Ground-Truth Generation Strategy

Evaluation Approach Overview

[Google Sheet with Test Meals - Provided by User]
    ↓
[Ground Truth Generation Path]
    ├─ Run GPT-5.2 with reasoning enabled
    ├─ Generate reference outputs for all 6 fields
    └─ Store ground truth in evaluation sheet
    ↓
[Model Comparison Path]
    ├─ Run candidate model: GPT-5.2 with reasoning "none"
    ├─ Compare against ground truth using metrics
    ├─ Calculate similarity scores
    └─ Generate evaluation report

Key Changes from Original Plan

User-Provided Test Data: Golden dataset is provided by you via Google Sheet, not manually curated
Automated Ground Truth: GPT-5.2 with reasoning enabled generates reference answers automatically
Single Model Evaluation: Comparing prompt variations using GPT-5.2 with reasoning "none"
Comparison-Based Metrics: Use semantic similarity and AI-based correctness to compare outputs
Two-Stage Evaluation: First generate truth, then evaluate candidate prompts

Architecture Requirements

1. Three-Path Workflow Structure

[Google Sheets Trigger - Read Meals]
    ↓
[Workflow Mode Switch]
    ├─ GROUND_TRUTH_GENERATION → [Ground Truth Path]
    │         ├─ Load test meals (10-20 samples)
    │         ├─ Run thinking model with all prompts
    │         ├─ Store outputs as "ground truth"
    │         └─ Save to evaluation reference sheet
    │
    ├─ EVALUATION → [Evaluation Path]
    │         ├─ Load test meals + ground truth
    │         ├─ Run candidate models/prompts
    │         ├─ Compare outputs against ground truth
    │         ├─ Calculate metrics (similarity, correctness)
    │         ├─ Aggregate results
    │         └─ Store evaluation results
    │
    └─ PRODUCTION → [Production Path]
              ├─ Process all meals from sheet
              ├─ Run selected optimal prompts/model
              └─ Write outputs to new Google Sheet
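
The mode switch at the top of this structure amounts to validating one value and routing on it. A minimal sketch of that validation follows; WorkflowMode and resolve_mode are hypothetical names, and in n8n the routing itself would likely be done by a Switch node rather than code.

```python
from enum import Enum


class WorkflowMode(Enum):
    """The three paths named in the workflow structure."""
    GROUND_TRUTH_GENERATION = "GROUND_TRUTH_GENERATION"
    EVALUATION = "EVALUATION"
    PRODUCTION = "PRODUCTION"


def resolve_mode(raw: str) -> WorkflowMode:
    """Normalize a user-supplied mode string and reject unknown values."""
    try:
        return WorkflowMode(raw.strip().upper())
    except ValueError:
        raise ValueError(f"Unknown workflow mode: {raw!r}") from None
```

Rejecting unknown modes early avoids accidentally running the ground truth path, which is meant to run only when explicitly triggered.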

2. Test Meals Dataset (User-Provided)

You will provide a Google Sheet with test meals that includes:

Required Columns (Input Only):

ID, tracked_foods, weighted_average, meal_quality, nutritional_value, blood_sugar_impact, blood_fat_impact, meal_score, meal_time

No Expected Output Columns Needed - these will be generated by the thinking model

Recommended Dataset Composition (for best evaluation coverage):

Diverse meal types: balanced, high-sugar, high-fat, low-carb, vegan, etc.
Various meal timings: breakfast, lunch, dinner, unusual times
Edge cases: incomplete data, extreme portions, mixed quality meals
At least 10 test meals; 10-20 are recommended for comprehensive evaluation

3. Ground Truth Generation Configuration

Thinking Model Selection

Ground Truth Generator: GPT-5.2 with reasoning enabled

Selection Rationale:

Prioritizes accuracy over speed/cost (ground truth only runs once)
Provides best domain knowledge in nutrition/metabolism
Strongest reasoning capabilities for complex meal analysis

Ground Truth Prompt Configuration

Ground truth will be generated using the evaluator prompts you provide (e.g., trackMeal_hormone_balance_rating_evaluator.md).

These evaluator prompts are specifically designed for:

High-quality reasoning with GPT-5.2
Generating reference outputs that serve as ground truth
Comprehensive analysis of meal data
Scientific accuracy and detailed explanations

The evaluator prompts should prioritize accuracy and thoroughness over speed, as they only run once to establish the reference standard.

Ground Truth Storage Schema

Create "GroundTruth_Outputs" sheet with columns:

ID (matches test meal)
tracked_foods (for reference)
ground_truth_hormone_balance_rating
ground_truth_hunger_and_fullness_explanation
ground_truth_fat_storage_explanation
ground_truth_metabolic_health_rating
ground_truth_metabolic_health_risks_explanation
ground_truth_metabolic_health_benefits_explanation
generation_timestamp
thinking_model_used (GPT-5.2 with reasoning)
prompt_version
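
The storage schema above can be sketched as a row builder that a Code node might run before the Google Sheets write. The function name build_ground_truth_row and the shape of its arguments (a meal dict and a dict of the six generated outputs) are assumptions for illustration; the column names come from the schema.

```python
from datetime import datetime, timezone

# The six output fields, in the order used by the schema above.
OUTPUT_FIELDS = [
    "hormone_balance_rating",
    "hunger_and_fullness_explanation",
    "fat_storage_explanation",
    "metabolic_health_rating",
    "metabolic_health_risks_explanation",
    "metabolic_health_benefits_explanation",
]


def build_ground_truth_row(meal, outputs, prompt_version,
                           model="GPT-5.2 with reasoning"):
    """Assemble one GroundTruth_Outputs row (hypothetical helper)."""
    missing = [f for f in OUTPUT_FIELDS if f not in outputs]
    if missing:
        raise ValueError(f"Missing ground truth fields: {missing}")
    row = {"ID": meal["ID"], "tracked_foods": meal["tracked_foods"]}
    for field in OUTPUT_FIELDS:
        row[f"ground_truth_{field}"] = outputs[field]
    # UTC ISO-8601 timestamp; the timestamp format is an assumption.
    row["generation_timestamp"] = datetime.now(timezone.utc).isoformat()
    row["thinking_model_used"] = model
    row["prompt_version"] = prompt_version
    return row
```

Failing loudly on a missing field keeps incomplete rows out of the reference sheet, where they would silently skew later comparisons.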

4. Updated Evaluation Metrics

For each output field, compare candidate model output against ground truth:

For Rating Fields (hormone_balance_rating, metabolic_health_rating):

Metric 1: Exact Match

Binary: Does candidate output exactly match ground truth?
Score: 0 (no match) or 1 (match)
Use for: Fixed categorical ratings
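
A minimal sketch of the exact-match metric follows. It assumes that differences in case and surrounding whitespace should not count as a mismatch; that normalization is an assumption, not something the metric definition above states, and it can be removed for a strictly literal comparison.

```python
def exact_match(candidate: str, ground_truth: str) -> int:
    """Return 1 if the two ratings match after light normalization, else 0."""
    def normalize(text: str) -> str:
        # Lowercase and collapse runs of whitespace (assumed normalization).
        return " ".join(text.lower().split())

    return int(normalize(candidate) == normalize(ground_truth))
```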

Metric 2: Semantic Similarity

Calculate embedding similarity between candidate and ground truth
Score: 0-1 (cosine similarity)
Use for: When ratings might use different wording but same meaning
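
Cosine similarity between two embedding vectors can be sketched as below. How the embeddings are obtained is out of scope here and left to whichever embedding model is configured. Cosine similarity can in principle be negative, so this sketch clamps the result to the 0-1 range used above; the clamping is an assumption.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors, clamped to [0, 1]."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no direction to compare against
    return max(0.0, min(1.0, dot / (norm_a * norm_b)))
```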

Metric 3: AI-Based Correctness

Use your provided comparison prompts (e.g., *_comparison.md) to compare candidate vs ground truth
The comparison prompt evaluates semantic alignment between outputs
Score: 1-5 scale (as defined in your comparison prompts)

For Explanation Fields (all explanation outputs):

Metric 1: Embedding Similarity

Calculate semantic similarity using embeddings
Score: 0-1 (cosine similarity)
Primary metric for semantic comparison

Metric 2: AI-Based Correctness

Use your provided comparison prompts (e.g., *_comparison.md) to evaluate candidate vs ground truth
The comparison prompt will analyze how well the candidate output matches ground truth
Score: 1-5 scale (as defined in your comparison prompts)
Each comparison prompt is tailored to the specific output type

Metric 3: Key Information Extraction

Extract key facts from both candidate and ground truth
Compare fact overlap
Score: Percentage of ground truth facts present in candidate
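
Once key facts have been extracted (for instance by an LLM call, which is not shown), the overlap score is a recall percentage. The sketch below assumes facts arrive as short strings and matches them by normalized text; real fact matching would likely need something fuzzier, so treat this as a simplification.

```python
def fact_recall(ground_truth_facts, candidate_facts) -> float:
    """Percentage of ground truth facts that also appear in the candidate."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())

    truth = {normalize(f) for f in ground_truth_facts}
    candidate = {normalize(f) for f in candidate_facts}
    if not truth:
        return 100.0  # nothing to miss (assumed convention)
    return 100.0 * len(truth & candidate) / len(truth)
```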

Metric 4: Length Ratio

Compare output lengths (too verbose or too terse)
Score: Ratio of candidate/ground truth length (ideal: 0.8-1.2)
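
The length ratio can be sketched as follows. Measuring length in whitespace-separated words is an assumption, since the metric above does not say whether characters, words, or tokens are meant; the 0.8-1.2 band is the ideal range stated above.

```python
def length_ratio(candidate: str, ground_truth: str) -> float:
    """Ratio of candidate length to ground truth length, in words."""
    truth_words = len(ground_truth.split())
    if truth_words == 0:
        raise ValueError("Ground truth text is empty")
    return len(candidate.split()) / truth_words


def within_ideal_band(ratio: float, low: float = 0.8, high: float = 1.2) -> bool:
    """True when the ratio falls inside the ideal 0.8-1.2 range."""
    return low <= ratio <= high
```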

5. Workflow Components

Component 1: Ground Truth Generation Path

[Manual Trigger: "Generate Ground Truth"]
    ↓
[Google Sheets: Read Test Meals from User-Provided Sheet]
    ↓
[Loop Through Each Test Meal]
    ↓
[For Each Meal - Run GPT-5.2 with Reasoning Enabled]:
    ├─ Call LLM: hormone_balance_rating (trackMeal_hormone_balance_rating_evaluator.md)
    ├─ Call LLM: hormone_balance_detail (trackMeal_hormone_balance_detail_evaluator.md)
    ├─ Call LLM: metabolic_health_rating (trackMeal_metabolic_health_rating_evaluator.md)
    └─ Call LLM: metabolic_health_detail (trackMeal_metabolic_health_detail_evaluator.md)
    ↓
[Structure Ground Truth Output]
    ↓
[Google Sheets: Write to GroundTruth_Outputs sheet]
    ↓
[Completion Notice: "Ground truth generated for X meals"]

Important: Ground truth generation only runs when explicitly triggered, not automatically.

Component 2: Evaluation Path (Candidate Testing)

[Manual Trigger: "Run Evaluation"]
    ↓
[Set Evaluation Config]:
    - Candidate Model: GPT-5.2 with reasoning "none"
    - Prompt Version: [e.g., V1, V2, etc.]
    ↓
[Google Sheets: Read Test Meals]
    ↓
[Google Sheets: Read Ground Truth Outputs]
    ↓
[Loop Through Test Meals]
    ↓
[For Each Meal - Run Candidate Model (GPT-5.2 reasoning: none)]:
    ├─ Call LLM: hormone_balance_rating (*_bot.md)
    ├─ Call LLM: hormone_balance_detail (*_bot.md)
    ├─ Call LLM: metabolic_health_rating (*_bot.md)
    └─ Call LLM: metabolic_health_detail (*_bot.md)
    ↓
[Comparison & Metrics Calculation]:
    ├─ Calculate embedding similarity for each field
    ├─ Run comparison prompts (*_comparison.md) for correctness scoring
    ├─ Calculate exact matches for ratings
    └─ Aggregate scores per meal
    ↓
[Store Individual Evaluation Results]
    ↓
[Generate Aggregated Report]:
    ├─ Run results prompts (*_results.md)
    └─ Create evaluation summary report
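
Before candidates can be compared, each test meal has to be paired with its ground truth row by ID. A minimal sketch of that join follows; pair_meals_with_ground_truth is a hypothetical name, and rows are assumed to be dicts keyed by the sheet column names.

```python
def pair_meals_with_ground_truth(meals, ground_truth_rows):
    """Join test meals to ground truth rows on the ID column.

    Returns (pairs, missing_ids): pairs is a list of (meal, truth) tuples,
    missing_ids lists meals that have no ground truth yet.
    """
    truth_by_id = {row["ID"]: row for row in ground_truth_rows}
    pairs, missing_ids = [], []
    for meal in meals:
        truth = truth_by_id.get(meal["ID"])
        if truth is None:
            missing_ids.append(meal["ID"])
        else:
            pairs.append((meal, truth))
    return pairs, missing_ids
```

Reporting missing IDs rather than skipping them silently makes it obvious when ground truth generation needs to be rerun for new test meals.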

Component 3: n8n Evaluation Node Configuration

Configure n8n Evaluation node:

Input Column: test_meals.tracked_foods
Expected Output Source: GroundTruth_Outputs sheet
Candidate Output Source: Current evaluation run
Metrics to Calculate:
- Embedding Similarity (for all 6 fields)
- Comparison Scores (using *_comparison.md prompts for all 6 fields)
- Exact Match (for rating fields only)
Results Aggregation: Use *_results.md prompts to generate summary reports

Component 4: Comparison Prompts Configuration

You will provide comparison prompts for each output field that evaluate candidate outputs against ground truth.

Comparison Prompt Structure: Each comparison prompt (e.g., trackMeal_hormone_balance_rating_comparison.md) should:

Accept ground truth output as input
Accept candidate output as input
Compare the two outputs based on domain-specific criteria
Provide a score (1-5 scale recommended)
Include reasoning for the score

Example Comparison Flow:

Input to Comparison Prompt:
- Ground Truth: [output from evaluator]
- Candidate: [output from bot]
- Meal Context: [original meal data]

Output from Comparison Prompt:
- Score: [1-5]
- Reasoning: [explanation of differences/similarities]
- Key Gaps: [what the candidate missed]
- Alignment: [where candidate matched ground truth]
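
If the comparison prompts return their result as labeled lines like the ones above, the reply could be parsed roughly as follows. The exact reply format is defined by the comparison prompts themselves, so the "Label: value" layout, the integer score, and the name parse_comparison_output are all assumptions.

```python
import re

# Output labels from the example flow mapped to result keys.
LABELS = {
    "Reasoning": "reasoning",
    "Key Gaps": "key_gaps",
    "Alignment": "alignment",
}


def parse_comparison_output(text: str) -> dict:
    """Parse 'Score: n' plus the labeled text lines from a comparison reply."""
    score_match = re.search(r"^\s*-?\s*Score:\s*([1-5])\b", text, flags=re.MULTILINE)
    if score_match is None:
        raise ValueError("No score between 1 and 5 found in comparison output")
    result = {"score": int(score_match.group(1))}
    for label, key in LABELS.items():
        match = re.search(rf"^\s*-?\s*{label}:\s*(.+)$", text, flags=re.MULTILINE)
        result[key] = match.group(1).strip() if match else None
    return result
```

Asking the comparison model for structured JSON instead would make parsing more robust; this sketch simply mirrors the line-based layout shown above.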

Comparison Model: Use GPT-5.2 via the OpenAI account 2 credential configured above; GPT-4 or Claude Sonnet are alternatives only if matching credentials are added

Component 5: Results Aggregation Prompts

You will provide results aggregation prompts that synthesize comparison results into evaluation reports.

Results Prompt Structure: Each results prompt (e.g., trackMeal_hormone_balance_rating_results.md) should:

Accept all comparison results for that field across test meals
Aggregate scores and identify patterns
Generate insights about prompt performance
Highlight specific failure cases
Provide recommendations for improvement

Example Results Flow:

Input to Results Prompt:
- All comparison scores for this field
- All comparison reasoning
- Test meal IDs and contexts
- Statistical metrics (avg, min, max, median)

Output from Results Prompt:
- Overall Performance Summary
- Success Rate: [percentage of high-scoring outputs]
- Common Failure Patterns: [what types of meals failed]
- Best Performance: [which meal types worked well]
- Recommendations: [prompt improvements to consider]
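
The statistical metrics passed into a results prompt can be computed ahead of the LLM call. In this sketch the success rate counts scores at or above 4 as high-scoring; that cutoff and the name summarize_scores are assumptions.

```python
from statistics import mean, median


def summarize_scores(scores, high_score_threshold=4):
    """Compute avg, min, max, median, and success rate for one field."""
    if not scores:
        raise ValueError("No comparison scores to summarize")
    high = sum(1 for s in scores if s >= high_score_threshold)
    return {
        "avg": mean(scores),
        "min": min(scores),
        "max": max(scores),
        "median": median(scores),
        "success_rate": 100.0 * high / len(scores),  # percent
    }
```

Computing these numbers in code rather than asking the model to derive them keeps the arithmetic deterministic and leaves the prompt to focus on patterns and recommendations.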

Aggregation Model: Use GPT-5.2 via the OpenAI account 2 credential configured above; GPT-4 or Claude Sonnet are alternatives only if matching credentials are added

Component 6: Results Storage & Reporting

Evaluation Results Sheet with columns:

evaluation_run_id
timestamp
candidate_model (GPT-5.2 reasoning: none)
prompt_version
test_meal_id
field_name (which output being evaluated)
ground_truth_value
candidate_value
embedding_similarity_score
comparison_score (from *_comparison.md prompts)
comparison_reasoning (explanation from comparison prompt)
exact_match (boolean)
key_gaps_identified (what candidate missed)

Evaluation Summary Sheet with aggregated metrics:

evaluation_run_id
timestamp
candidate_model (GPT-5.2 reasoning: none)
prompt_version
avg_embedding_similarity (across all fields)
avg_comparison_score (across all fields)
rating_exact_match_rate (%)
cost_per_meal
total_evaluation_cost
quality_to_cost_ratio
recommended_for_production (boolean - best performing prompt)
aggregated_insights (from *_results.md prompts)
common_failure_patterns (from results prompts)
improvement_recommendations (from results prompts)
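
The numeric columns of the summary sheet can be derived from the per-field rows of the results sheet. The sketch below covers the three quality averages only; summarize_run is a hypothetical name, and rows are assumed to be dicts keyed by the results sheet column names.

```python
from statistics import mean

RATING_FIELDS = {"hormone_balance_rating", "metabolic_health_rating"}


def summarize_run(rows):
    """Aggregate per-field evaluation rows into summary metrics."""
    if not rows:
        raise ValueError("No evaluation rows to summarize")
    rating_rows = [r for r in rows if r["field_name"] in RATING_FIELDS]
    if rating_rows:
        matches = sum(1 for r in rating_rows if r["exact_match"])
        exact_rate = 100.0 * matches / len(rating_rows)
    else:
        exact_rate = None  # exact match only applies to rating fields
    return {
        "avg_embedding_similarity": mean(r["embedding_similarity_score"] for r in rows),
        "avg_comparison_score": mean(r["comparison_score"] for r in rows),
        "rating_exact_match_rate": exact_rate,
    }
```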

Component 7: Production Path (Unchanged)

[Production Trigger: Schedule/Webhook]
    ↓
[Google Sheets: Read All Meals]
    ↓
[Loop Through Meals]
    ↓
[Run Selected Model with Optimal Prompts]
    ↓
[Write Results to Output Sheet]

Evaluation Best Practices

Isolate variables: Test one prompt variation at a time (keep model constant at GPT-5.2 reasoning: none)
Multiple runs: Run each evaluation 2-3 times to account for non-determinism
Cost tracking: Calculate total cost (ground truth generation + evaluation + production)
Threshold setting: Define minimum acceptable scores (e.g., >0.8 embedding similarity, >4.0 comparison score)
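
The multiple-runs and threshold practices combine naturally: average the repeated runs first, then check the averages against the minimums. A minimal sketch, using the example thresholds above as defaults; passes_thresholds is a hypothetical name and each run is assumed to be a dict with the two summary metrics.

```python
from statistics import mean


def passes_thresholds(runs, min_similarity=0.8, min_comparison=4.0):
    """Average repeated runs and compare against the minimum scores."""
    if not runs:
        raise ValueError("At least one evaluation run is required")
    avg_similarity = mean(r["avg_embedding_similarity"] for r in runs)
    avg_comparison = mean(r["avg_comparison_score"] for r in runs)
    # Strict inequalities mirror the ">0.8" and ">4.0" examples above.
    return avg_similarity > min_similarity and avg_comparison > min_comparison
```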

Comparison Framework

Evaluation Decision Matrix (illustrative values; Quality-Cost Ratio = Avg Comparison Score / Cost per meal):

Prompt Version A (Baseline):
- Avg Embedding Similarity: 0.85
- Avg Comparison Score: 4.2/5
- Cost per meal: $0.02
- Quality-Cost Ratio: 210.0

Prompt Version B (Enhanced):
- Avg Embedding Similarity: 0.88
- Avg Comparison Score: 4.5/5
- Cost per meal: $0.02
- Quality-Cost Ratio: 225.0

Prompt Version C (Few-Shot):
- Avg Embedding Similarity: 0.92
- Avg Comparison Score: 4.8/5
- Cost per meal: $0.02
- Quality-Cost Ratio: 240.0

DECISION: Choose Prompt Version C
- Highest quality scores (from comparison prompts)
- Same cost (same model)
- Best semantic alignment with ground truth
- Optimal for production use
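
The selection step can be sketched in code using the comparison scores, similarities, and costs listed for the three versions. The ratio here is assumed to be average comparison score divided by cost per meal, with embedding similarity as a tie-breaker; both choices are assumptions, and select_best_version is a hypothetical name.

```python
# Example figures from the decision matrix (illustrative, not measured).
VERSIONS = {
    "A": {"avg_embedding_similarity": 0.85, "avg_comparison_score": 4.2, "cost_per_meal": 0.02},
    "B": {"avg_embedding_similarity": 0.88, "avg_comparison_score": 4.5, "cost_per_meal": 0.02},
    "C": {"avg_embedding_similarity": 0.92, "avg_comparison_score": 4.8, "cost_per_meal": 0.02},
}


def quality_cost_ratio(metrics) -> float:
    """Assumed definition: average comparison score per dollar per meal."""
    return metrics["avg_comparison_score"] / metrics["cost_per_meal"]


def select_best_version(versions) -> str:
    """Pick the version with the highest ratio; similarity breaks ties."""
    return max(
        versions,
        key=lambda name: (
            quality_cost_ratio(versions[name]),
            versions[name]["avg_embedding_similarity"],
        ),
    )
```

With the figures above, select_best_version(VERSIONS) picks version C, matching the decision stated in the matrix.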
