# Evalyn Roadmap
This document tracks planned features and completed work. Future roadmap items are listed first, followed by completed features.
---
## Roadmap (Planned Features)
### Tracing & Instrumentation
- [x] **Multi-modal Tracing** - Capture images, audio, video in traces
- [x] Image input/output capture with thumbnails
- [x] Audio transcription logging
- [x] Video frame sampling
- [x] Base64/URL reference storage options
- [x] **Streaming Support** - Capture streaming LLM responses
- [x] Streaming response capture (OpenAI, Anthropic, Gemini via StreamingSpanWrapper)
- [x] Token-by-token capture with timing
- [x] First-token latency (TTFT) metric
- [x] Streaming interruption detection
- [x] **More LLM Provider Instrumentors**
- [x] Cohere
- [x] Mistral
- [x] AWS Bedrock
- [x] Azure OpenAI
- [x] Groq
- [x] Together AI
- [x] Replicate
- [x] **Framework Instrumentors**
- [x] CrewAI
- [x] AutoGen
- [x] DSPy
- [x] Haystack
- [x] LlamaIndex
- [x] Semantic Kernel
- [x] **Memory/RAG Tracing** - Capture retrieval context and memory operations
- [x] Capture retrieved documents with relevance scores per query
- [x] Track vector store lookup latency and result count
- [x] Link retrieval spans to downstream LLM calls that consume them
- [x] Memory read/write operation logging for stateful agents
- [x] **Async/Parallel Call Tracking** - Better support for concurrent LLM calls
- [x] Detect concurrent spans and render as parallel branches in show-trace
- [x] Measure total wall-clock vs sum of individual span durations
- [x] asyncio-native context propagation (ContextVar across await boundaries)
- [x] Thread-pool executor span grouping
- [x] **Trace Export to OTel Backends** - Export traces to Jaeger, Zipkin, or any OpenTelemetry collector
- [x] OTLP gRPC exporter alongside existing SQLiteSpanExporter
- [x] OTLP HTTP/JSON exporter for firewall-friendly environments
- [x] Configurable export filters (only export errors, only export slow spans)
- [x] Dual-write mode: SQLite for evalyn + OTLP for observability platform
- [x] **Trace Replay** - Re-run a captured trace against a different model to compare outputs
- [x] Extract input messages from each LLM span for replay
- [x] Swap model name and re-execute captured prompts
- [x] Generate side-by-side diff of original vs replayed outputs
- [x] Cost comparison report between original and replayed model
- [x] **Cost Budget Alerts** - Warn or stop when cumulative LLM cost exceeds a configurable threshold
- [x] Per-session budget limit in evalyn.yaml
- [x] Per-run budget limit as --max-cost flag
- [x] Warning at 80% threshold, hard stop at 100%
- [x] Budget tracking across multiple eval runs in a session
- [x] **Trace Diff** - Side-by-side comparison of two traces showing divergent spans
- [x] Align spans by name/type and highlight added/removed/changed spans
- [x] Show output text diff for matching spans
- [x] Cost and latency delta per span
- [x] ASCII and HTML diff output formats
- [x] **Trace Search Query Language** - Filter traces by span attributes, duration, cost, or error status
- [x] SQL-like syntax: "spans where type=llm_call and duration_ms > 5000"
- [x] Attribute filtering: model name, token count, error status
- [x] Aggregate queries: "traces with total_cost > $0.10"
- [x] Integration with list-calls command via --query flag
- [x] **PII Redaction** - Scrub sensitive data from inputs/outputs before storage
- [x] Regex-based patterns for emails, phone numbers, SSNs, credit cards
- [x] Named entity recognition for names and addresses
- [x] Configurable redaction strategy: mask, hash, or remove
- [x] Pre-storage hook in SQLiteSpanExporter and SQLiteStorage
- [x] **Trace Sampling Rate** - Capture only N% of traces in production to reduce storage overhead
- [x] Configurable sample rate in evalyn.yaml (0.0 to 1.0)
- [x] Priority-based sampling: always capture errors and slow traces
- [x] Per-project sampling rate override
- [x] **Distributed Trace Propagation** - Pass trace context across service boundaries via HTTP headers
- [x] W3C Trace Context (traceparent/tracestate) header injection
- [x] HTTP client instrumentation to propagate headers on outbound calls
- [x] Incoming header extraction to attach child spans to external parent
- [x] **Trace Size Limits** - Cap span payload size with configurable truncation for large inputs/outputs
- [x] Max input/output size in bytes with tail truncation
- [x] Configurable per span type (larger limit for llm_call, smaller for tool_call)
- [x] Truncation marker in span metadata when content is clipped
- [x] **Custom Span Types** - Register user-defined span types beyond the built-in set (llm_call, tool_call, etc.)
- [x] Registration API: register_span_type(name, icon, color)
- [x] Custom span type validation in span creation
- [x] Custom types rendered in show-trace with user-defined icons
- [x] **Span Tagging at Trace Time** - Add custom key-value tags to spans during execution for later filtering
- [x] API: tag_current_span(key, value) callable inside traced functions
- [x] Tags stored in span metadata and queryable via list-calls
- [x] Standard tags: environment, user_id, experiment_id, variant
- [x] **Native Embedding and Reranker Span Types** - First-class span types for embedding and reranking operations
- [x] "embedding" span type capturing model name, input text, vector dimensions
- [x] "reranker" span type capturing query, documents, and re-ranked scores
- [x] "guardrail" span type capturing check name, pass/fail, and blocked content
- [x] Update SPAN_KIND_TO_TYPE mapping in conventions.py (currently mapped to "custom")
- [x] **Span Attribute Extraction Plugins** - Pluggable attribute extractors for SpanConverter
- [x] Plugin interface for extracting custom attributes from OTEL spans
- [x] Provider-specific extractors (e.g. extract function_call from OpenAI tool use spans)
- [x] Configurable truncation limits per attribute (currently hardcoded 1000 chars)
- [x] **Trace Compression** - Compress span payloads before SQLite storage to reduce database size
- [x] gzip or zstd compression for input/output fields exceeding size threshold
- [x] Transparent decompression on read in SQLiteStorage
- [x] Compression ratio reporting in storage-stats command
- [x] Configurable compression level and minimum payload size for compression
- [x] **Span Dependency Graph** - Auto-detect causal data flow between spans within a trace
- [x] Detect when output of span A appears as input to span B (content overlap heuristic)
- [x] Build directed dependency graph from data flow analysis
- [x] Visualize as Mermaid or ASCII DAG in show-trace
- [x] Identify bottleneck spans that block the most downstream work
- [x] **Hot Path Detection** - Identify the most frequently executed span sequences across traces
- [x] Extract sequential span-type patterns (e.g. llm_call->tool_call->llm_call)
- [x] Rank patterns by frequency and cumulative cost
- [x] Highlight optimization opportunities for repeated expensive patterns
- [x] **Trace Density Heatmap** - Time-based visualization showing trace volume across hours and days
- [x] Hour-of-day x day-of-week grid showing trace counts
- [x] Overlay cost or error rate on the heatmap
- [x] ASCII heatmap for terminal, HTML for reports
- [x] **Provider SDK Version Tracking** - Capture installed SDK versions of instrumented providers in span metadata
- [x] Record openai, anthropic, google-generativeai package versions at instrumentation time
- [x] Store as span attributes (evalyn.provider_sdk_version)
- [x] Surface version mismatches across traces in show-trace output
- [x] **Trace Anonymization Export** - Export traces with user content replaced by synthetic equivalents for sharing
- [x] Replace input/output text with length-preserving placeholder content
- [x] Preserve span structure, timing, token counts, and cost data
- [x] evalyn export-traces --anonymize for safe sharing and bug reports
- [x] **Trace Flame Graph** - Flame graph rendering for span durations within a trace
- [x] Stacked bar visualization where width represents wall-clock time per span
- [x] Color-code by span type (llm_call, tool_call, node, etc.)
- [x] ASCII flame graph for terminal, SVG for HTML reports
- [x] Identify time-dominant spans at a glance, compared with the nested show-trace tree
- [x] **Trace Summary Generation** - LLM-generated natural language summary of trace behavior
- [x] Summarize what the agent did: tools called, decisions made, output produced
- [x] evalyn summarize-trace --id <id> producing 2-3 sentence summary
- [x] Batch summaries for dataset items to understand coverage
- [x] **Trace Metadata Inheritance** - Child spans automatically inherit parent's custom tags
- [x] Inheritance rules configurable: inherit-all, inherit-listed, no-inherit
- [x] Override inherited tags at child level
- [x] Useful for propagating environment, user_id, experiment_id down the span tree
- [x] **Trace Cost Breakdown by Phase** - Attribute cost to trace phases (reasoning, tool use, output)
- [x] Classify spans into phases based on span type and position in tree
- [x] Per-phase cost aggregation in show-trace and analyze output
- [x] Identify which phase consumes the most tokens/cost
- [x] **Trace Correlation with External Events** - Link traces to deployments, incidents, or config changes
- [x] evalyn mark-event --type deploy --label "v2.1 rollout" recording event timestamp
- [x] Overlay events on trend charts to correlate metric changes with deploys
- [x] Query traces around an event: evalyn list-calls --around-event <event-id>
- [x] **Trace Complexity Score** - Single numeric score summarizing trace complexity for quick triage
- [x] Weighted combination of span depth, breadth, total span count, and tool call count
- [x] Score stored in FunctionCall metadata for filtering in list-calls --sort complexity
- [x] Threshold alerts: flag traces exceeding expected complexity for the project
- [x] **Trace Template Matching** - Detect if a trace matches known execution patterns
- [x] Built-in templates: "RAG pattern" (retrieve->generate), "retry loop", "fan-out/fan-in"
- [x] Custom pattern definitions in evalyn.yaml as span-type sequences
- [x] evalyn classify-traces showing which pattern each trace matches
- [x] Pattern coverage report: percentage of traces that match known patterns vs novel traces
- [x] **Span Type Distribution** - Per-project statistics on span type frequencies over time
- [x] Count and percentage of each span type (llm_call, tool_call, node, agent, etc.)
- [x] Trend: how span type distribution shifts across weeks
- [x] Useful for detecting architectural changes (e.g. suddenly more tool calls)
- [x] **Instrumentation Compatibility Report** - Track which provider SDK versions have been tested
- [x] Record provider package version on first instrumentation in session
- [x] evalyn check-compat showing tested vs current SDK versions
- [x] Warning when using an untested SDK version
- [x] **Trace Lineage Graph** - Visualize how one trace's output becomes another trace's input
- [x] Detect session-level chaining where output of call A is input to call B
- [x] Render as directed graph showing data flow across function calls
- [x] evalyn show-lineage --session <id> producing Mermaid or ASCII graph
- [x] **Orphan Span Recovery** - Detect and attach spans captured outside an active trace context
- [x] Orphan spans collected in _orphan_spans list (context.py) are currently lost
- [x] Match orphans to the nearest active FunctionCall by timestamp proximity
- [x] Report recovered vs truly lost orphan spans in show-trace
- [x] **Context Propagation Diagnostics** - Verify ContextVar propagation across async and thread boundaries
- [x] evalyn check-context that spawns test async tasks and threads to verify span hierarchy
- [x] Detect when ThreadPoolExecutor breaks ContextVar inheritance
- [x] Recommend workarounds when propagation failures are detected
- [x] **Instrumentation Toggle API** - Hot-toggle instrumentation on/off at runtime without restart
- [x] evalyn_sdk.toggle_instrumentation(enabled=False) to pause tracing
- [x] Useful for excluding specific code sections from tracing overhead
- [x] Toggle state visible in show-projects output
- [x] **Span Collector Statistics** - Report collected, orphaned, and lost spans per session
- [x] Track spans collected vs expected (from OTEL SpanProcessor callbacks)
- [x] Warning when span loss exceeds threshold (e.g. >5% lost)
- [x] Statistics available via evalyn show-call --stats flag
- [x] **Instrumentation Dry-Run** - Show what would be patched without actually applying instrumentation
- [x] evalyn check-instrumentation --dry-run listing SDK methods that would be wrapped
- [x] Report detected SDK versions and instrumentation strategy per provider
- [x] Useful for verifying compatibility before enabling auto-instrumentation
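
To illustrate the PII Redaction item in this section, the following is a minimal sketch of regex-based scrubbing with the three listed strategies (mask, hash, remove). The `redact` function, the `PII_PATTERNS` table, and the placeholder format are assumptions made for illustration and are not taken from Evalyn's code; the two patterns shown are deliberately simple.

```python
import hashlib
import re

# Illustrative patterns only (hypothetical); production patterns need to be stricter.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text: str, strategy: str = "mask") -> str:
    """Scrub every pattern match using one of: mask, hash, remove."""
    if strategy not in {"mask", "hash", "remove"}:
        raise ValueError(f"unknown strategy: {strategy}")

    def make_replacer(label: str):
        def replace(match: "re.Match") -> str:
            if strategy == "mask":
                return f"[{label.upper()}]"
            if strategy == "hash":
                # A short digest keeps equal values linkable without storing them.
                digest = hashlib.sha256(match.group(0).encode("utf-8")).hexdigest()
                return f"[{label}:{digest[:8]}]"
            return ""  # strategy == "remove"

        return replace

    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(make_replacer(label), text)
    return text
```

A pre-storage hook could call a function like this on span inputs and outputs before they are written.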
### Trace Lifecycle Management
- [x] **Trace Archival** - Move old traces to cold storage instead of deleting
- [x] evalyn archive-traces --older-than 90d moving traces to archive.sqlite
- [x] Archive is read-only and queryable via --db archive flag
- [x] Restore from archive: evalyn restore-traces --from archive --id <id>
- [x] **Post-Hoc Trace Annotation** - Add notes and tags to existing traces after capture
- [x] evalyn tag-trace --id <id> --tag "regression-candidate"
- [x] evalyn annotate-trace --id <id> --note "Root cause: stale prompt cache"
- [x] Tags and notes queryable in list-calls and build-dataset filters
- [x] **Trace Bookmarking** - Mark interesting traces for later review or inclusion in datasets
- [x] evalyn bookmark --id <id> --reason "edge case: empty input"
- [x] evalyn list-bookmarks showing all bookmarked traces
- [x] --bookmarked-only flag on build-dataset to create datasets from bookmarks
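
The Trace Archival item describes moving old traces into a separate archive database. One way this could work with the standard `sqlite3` module is sketched below. The `traces` table, its columns, and the `archive_traces` function are hypothetical; Evalyn's real schema is not shown in this roadmap.

```python
import sqlite3
import time
from typing import Optional


def archive_traces(
    conn: sqlite3.Connection,
    archive_path: str,
    older_than_days: int,
    now: Optional[float] = None,
) -> int:
    """Move rows older than the cutoff into an attached archive DB; return rows moved.

    Assumes a hypothetical table traces(id, started_at, payload) where
    started_at is a Unix timestamp in seconds. The caller should commit any
    open transaction before calling.
    """
    now = time.time() if now is None else now
    cutoff = now - older_than_days * 86400

    # The archive stays attached so the caller can query archive.traces afterwards.
    conn.execute("ATTACH DATABASE ? AS archive", (archive_path,))
    conn.execute(
        "CREATE TABLE IF NOT EXISTS archive.traces "
        "(id TEXT PRIMARY KEY, started_at REAL, payload TEXT)"
    )
    with conn:  # copy and delete in one transaction
        conn.execute(
            "INSERT INTO archive.traces "
            "SELECT id, started_at, payload FROM main.traces WHERE started_at < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM main.traces WHERE started_at < ?", (cutoff,)
        ).rowcount
    return moved
```

Copying and deleting inside a single transaction means a crash cannot leave a trace in both databases or in neither.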
### Provider-Specific Feature Capture
- [x] **Gemini Safety Rating Capture** - Capture safety ratings from Gemini responses
- [x] Extract safetyRatings array from GenerateContent responses
- [x] Store per-category ratings (harassment, hate, dangerous, sexual) in span attributes
- [x] Surface safety blocks in show-trace output
- [x] **Gemini Grounding Metadata Capture** - Capture search grounding results from Gemini
- [x] Extract groundingMetadata and searchEntryPoint from grounded responses
- [x] Store grounding sources and confidence in span attributes
- [x] Link grounding data to grounding metrics (source_attribution, claim_verification)
- [x] **@trace Decorator Span Upgrade** - Upgrade @trace from event-based to span-based tracing
- [x] Create proper Span objects instead of TraceEvent pairs (start/end)
- [x] Automatic parent-child hierarchy via span_context stack
- [x] Visible in show-trace as child spans alongside LLM and tool spans
- [x] **Anthropic Thinking Block Capture** - Capture extended thinking/reasoning from Claude responses
- [x] Extract thinking content blocks from Anthropic Messages API responses
- [x] Store thinking text in span attributes alongside output content
- [x] Display thinking blocks in show-trace with distinct styling
- [x] Enable reasoning quality evaluation on captured thinking content
- [x] **Metric-Specific Provider Routing** - Use different judge providers for different metric categories
- [x] Route safety metrics to Gemini, quality metrics to OpenAI, etc.
- [x] Provider routing config per metric in evalyn.yaml
- [x] Cost optimization: use cheap models for simple metrics, expensive for nuanced ones
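
The Metric-Specific Provider Routing item can be pictured as a small lookup with a fallback. The sketch below is an assumption about how such routing might be resolved, not Evalyn's implementation; `JudgeTarget`, `route_metric`, and all model names are placeholders.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class JudgeTarget:
    provider: str
    model: str


# Placeholder names (hypothetical); substitute real provider/model identifiers.
DEFAULT_TARGET = JudgeTarget("gemini", "example-cheap-model")
CATEGORY_ROUTES: Dict[str, JudgeTarget] = {
    "safety": JudgeTarget("gemini", "example-safety-model"),
    "quality": JudgeTarget("openai", "example-quality-model"),
}
METRIC_OVERRIDES: Dict[str, JudgeTarget] = {
    "helpfulness": JudgeTarget("openai", "example-nuanced-model"),
}


def route_metric(metric_id: str, category: str) -> JudgeTarget:
    """Resolve a judge target: per-metric override, then category route, then default."""
    if metric_id in METRIC_OVERRIDES:
        return METRIC_OVERRIDES[metric_id]
    return CATEGORY_ROUTES.get(category, DEFAULT_TARGET)
```

Falling back to a cheap default keeps simple metrics inexpensive while nuanced metrics opt in to a stronger model.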
### Instrumentation & Decorator Enhancements
- [x] **Selective Instrumentation** - Only instrument specific methods or classes, not entire SDK
- [x] Allowlist/blocklist of method names to instrument per provider
- [x] Config in evalyn.yaml: instrument.openai.methods: ["chat.completions.create"]
- [x] Reduce overhead by skipping low-value calls (e.g. embeddings, moderation)
- [x] **Instrumentation Health Check** - Verify instrumentation is capturing spans correctly
- [x] evalyn check-instrumentation that runs a test call and verifies span capture
- [x] Report which providers are instrumented, which failed, and why
- [x] Warning when instrumented SDK is imported before evalyn_sdk
- [x] **Instrumentation Overhead Measurement** - Measure performance impact of tracing
- [x] Benchmark: instrumented vs uninstrumented call latency
- [x] Report added overhead in ms and % per provider
- [x] Auto-disable instrumentation if overhead exceeds threshold
- [x] **Experiment Tracking** - Group traces by experiment ID for A/B comparisons
- [x] @eval(experiment="prompt-v2") decorator parameter
- [x] Filter traces by experiment in list-calls and build-dataset
- [x] Cross-experiment metric comparison in analyze command
- [x] **Conditional Tracing** - Only trace when runtime conditions are met
- [x] Sample-based: trace 10% of calls via @eval(sample_rate=0.1)
- [x] Predicate-based: @eval(trace_if=lambda args: args["user_id"] in sample_set)
- [x] Environment-based: only trace in production, skip in unit tests
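
The sample-based and predicate-based forms of Conditional Tracing can be sketched as a decorator. The `traced` decorator and the `TRACED_CALLS` sink below are stand-ins written for this example; they are not the `@eval` decorator itself, and the environment-based variant is not covered.

```python
import functools
import inspect
import random
from typing import Any, Callable, Dict, List, Optional

# Stand-in sink for this sketch; a real tracer would write to storage.
TRACED_CALLS: List[Dict[str, Any]] = []


def traced(
    sample_rate: float = 1.0,
    trace_if: Optional[Callable[[Dict[str, Any]], bool]] = None,
    rng: Optional[random.Random] = None,
):
    """Record a call only if it is sampled and the predicate (if any) accepts it."""
    rng = rng or random.Random()

    def decorator(func):
        signature = inspect.signature(func)

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Bind arguments to parameter names so the predicate sees a dict.
            bound = signature.bind(*args, **kwargs)
            bound.apply_defaults()
            call_args = dict(bound.arguments)

            should_trace = rng.random() < sample_rate
            if should_trace and trace_if is not None:
                should_trace = bool(trace_if(call_args))

            result = func(*args, **kwargs)
            if should_trace:
                TRACED_CALLS.append(
                    {"name": func.__name__, "args": call_args, "result": result}
                )
            return result

        return wrapper

    return decorator
```

Under these assumptions, `@traced(sample_rate=0.1)` records roughly one call in ten, and `@traced(trace_if=lambda args: args["user_id"] in sample_set)` records only calls from selected users.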
### Onboarding & Templates
- [x] **Quickstart Templates** - Framework-specific guided templates beyond generic quickstart
- [x] evalyn quickstart --template rag for RAG pipeline setup
- [x] evalyn quickstart --template chatbot for conversational agent setup
- [x] evalyn quickstart --template multi-agent for multi-agent orchestration
- [x] Each template pre-selects relevant metric bundles
- [x] **Interactive Tutorial Mode** - Step-by-step in-terminal tutorial for learning evalyn
- [x] evalyn tutorial that walks through trace/build/eval/analyze cycle
- [x] Bundled sample traces so tutorial works without API keys
- [x] Progressive disclosure: each step explains what happened and why
- [x] **Example Agent Gallery** - Bundled working example agents for each supported framework
- [x] example_agents/ directory with one example per framework
- [x] Each example includes: agent code, pre-built dataset, expected results
- [x] evalyn example --framework openai to scaffold from template
### Config & Project Management
- [x] **Config Inheritance** - Base config with per-project overrides
- [x] Global ~/.evalyn/config.yaml for shared settings (API keys, provider defaults)
- [x] Project-level evalyn.yaml inherits and overrides global config
- [x] Per-dataset config override via meta.json
- [x] **Project Scaffolding** - evalyn new-project to create standard project structure
- [x] Create data/ directory, evalyn.yaml, and .gitignore entries
- [x] Optional: create example agent file for chosen framework
- [x] Optional: create GitHub Actions workflow for CI evaluation
- [x] **Multi-Project Dashboard** - View and compare metrics across multiple projects
- [x] evalyn projects showing all projects with latest run status
- [x] Cross-project regression detection
- [x] Unified cost tracking across projects
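
The three config layers listed under Config Inheritance (global, project, per-dataset) suggest a recursive merge in which later layers win. The following is a minimal sketch of that idea on plain dictionaries; `merge_config` and `resolve_config` are hypothetical helper names, and the actual merge rules in Evalyn may differ (for example for lists).

```python
from typing import Any, Dict, Mapping


def merge_config(base: Mapping[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a new dict where override wins; nested mappings merge key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            merged[key] = merge_config(merged[key], value)
        else:
            # Scalars and lists are replaced wholesale (an assumption of this sketch).
            merged[key] = value
    return merged


def resolve_config(
    global_cfg: Mapping[str, Any],
    project_cfg: Mapping[str, Any],
    dataset_cfg: Mapping[str, Any],
) -> Dict[str, Any]:
    """Apply layers in order: global, then project, then per-dataset overrides."""
    return merge_config(merge_config(global_cfg, project_cfg), dataset_cfg)
```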
### Confidence & Judge Robustness
- [x] **Confidence Method Comparison** - Run all confidence methods on same data and compare calibration
- [x] Side-by-side comparison of logprobs, deepconf, consistency, verbalized methods
- [x] Calibration curve: confidence score vs actual correctness
- [x] Recommend best method per metric/provider combination
- [x] **Hybrid Confidence** - Combine multiple confidence methods into a single robust score
- [x] Weighted ensemble of available methods
- [x] Fall back gracefully when a method is unavailable (e.g. no logprobs)
- [x] Bayesian combination with learned weights
- [x] **Structured Output Enforcement** - Force JSON mode on judge LLM calls for reliable parsing
- [x] Use provider-native JSON mode (Gemini response_mime_type, OpenAI response_format)
- [x] Schema enforcement via provider-specific structured output features
- [x] Fallback to regex extraction when JSON mode unavailable
- [x] **Judge Output Retry** - Automatically retry judge calls when output fails to parse
- [x] Configurable max retries (default 2)
- [x] Append "respond with valid JSON" on retry attempts
- [x] Track parse failure rate per metric for diagnostics
- [x] **Judge Latency Optimization** - Reduce judge call overhead for large-scale evaluation
- [x] Prompt caching: reuse system prompt prefix across items
- [x] Batch multiple items into single judge call where possible
- [x] Model-specific prompt length optimization
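
The Judge Output Retry and regex-fallback items can be combined into one loop. The sketch below assumes the judge is any callable that takes a prompt string and returns raw text; `judge_with_retry`, the retry suffix wording, and the `score` field name are illustrative assumptions rather than Evalyn's actual interface.

```python
import json
import re
from typing import Any, Callable, Dict

RETRY_SUFFIX = "\n\nRespond with valid JSON."
# Last-resort pattern for outputs such as: score: 0.8  or  "score" = 4
_SCORE_RE = re.compile(r'"?score"?\s*[:=]\s*(-?\d+(?:\.\d+)?)')


def judge_with_retry(
    call_judge: Callable[[str], str], prompt: str, max_retries: int = 2
) -> Dict[str, Any]:
    """Call the judge until its output parses as a JSON object, then fall back to regex."""
    attempt_prompt = prompt
    raw = ""
    attempts = 0
    for _ in range(max_retries + 1):
        attempts += 1
        raw = call_judge(attempt_prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict):
            return {**parsed, "attempts": attempts, "parsed_by": "json"}
        # Retries append an explicit formatting reminder to the original prompt.
        attempt_prompt = prompt + RETRY_SUFFIX

    match = _SCORE_RE.search(raw)
    if match:
        return {"score": float(match.group(1)), "attempts": attempts, "parsed_by": "regex"}
    raise ValueError(f"judge output could not be parsed after {attempts} attempts")
```

Recording `attempts` and `parsed_by` on each result is one way to feed the per-metric parse failure rate mentioned above.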
### Evaluation Units & Views
- [x] **Custom Unit Builder Plugins** - User-defined evaluation boundaries via pluggable builders
- [x] Register custom EvalUnitBuilder subclasses via entry points
- [x] Builder configuration in evalyn.yaml per metric
- [x] Example builders: per-paragraph, per-code-block, per-citation
- [x] **Unit Type Auto-Detection** - Infer best EvalUnit type from trace structure
- [x] Detect multi-turn patterns from sequential LLM spans
- [x] Detect tool-use patterns from tool_call/tool_result span pairs
- [x] Default to outcome when trace structure is flat
- [x] **Unit-Level Reporting** - Per-unit-type metric breakdowns in analysis
- [x] Separate pass rates for outcome vs single_turn vs tool_use units
- [x] Unit type distribution chart in analysis output
- [x] Filter analysis by unit type: --unit-type single_turn
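
The Unit Type Auto-Detection rules can be expressed as a short heuristic over the ordered span types of a trace. This is a sketch under assumptions: the precedence (tool use before multi-turn) and the `multi_turn` label are guesses made for illustration, while `tool_use` and `outcome` are the unit names used in this section.

```python
from typing import Sequence


def detect_unit_type(span_types: Sequence[str]) -> str:
    """Guess an evaluation unit type from the ordered span types of one trace."""
    # Tool use: a tool_call that is later followed by a tool_result.
    for index, span_type in enumerate(span_types):
        if span_type == "tool_call" and "tool_result" in span_types[index + 1:]:
            return "tool_use"

    # Multi-turn: two or more LLM spans in sequence (hypothetical label).
    if sum(1 for span_type in span_types if span_type == "llm_call") >= 2:
        return "multi_turn"

    # Flat traces fall back to whole-output evaluation.
    return "outcome"
```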
### Batch Evaluation Enhancements
- [x] **Batch Job Persistence** - Save batch job state to disk for recovery after crash or restart
- [x] Write BatchJob to .evalyn/batch_jobs/ as JSON on submit
- [x] evalyn batch-status to list pending/completed batch jobs
- [x] evalyn batch-resume to collect results from a previously submitted batch
- [x] **Mixed-Mode Evaluation** - Use batch API for large runs, real-time for small runs
- [x] Auto-select mode based on item count threshold (e.g. batch if > 50 items)
- [x] --mode auto/batch/realtime flag on run-eval
- [x] Cost/speed comparison in dry-run output
- [x] **Batch Progress Polling** - Live progress updates while batch job is processing
- [x] Poll provider API for completion percentage
- [x] Display progress bar with ETA during batch wait
- [x] Configurable poll interval (default 30s)
- [x] **Multi-Provider Batch Splitting** - Split a single evaluation batch across multiple providers
- [x] Route N% of items to gemini, M% to openai for cost/latency comparison
- [x] Provider-aware retry: re-route failed items to alternate provider
- [x] Unified result merging regardless of which provider evaluated each item
- [x] **Streaming Partial Results** - Start analyzing results before the full batch completes
- [x] Process completed items as they arrive from batch polling
- [x] Live-updating analysis dashboard during batch wait
- [x] Early termination: stop batch if enough results show clear pass/fail
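
The mode selection described under Mixed-Mode Evaluation reduces to a threshold check when the mode is `auto`. A minimal sketch follows; `select_mode` is a hypothetical helper, and the default threshold of 50 comes from the example given in the list.

```python
VALID_MODES = {"auto", "batch", "realtime"}


def select_mode(item_count: int, mode: str = "auto", batch_threshold: int = 50) -> str:
    """Resolve the effective execution mode for a run of item_count items."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    if mode != "auto":
        return mode  # an explicit choice always wins
    # Batch only when the run is strictly larger than the threshold.
    return "batch" if item_count > batch_threshold else "realtime"
```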
### Session Management
- [x] **Session-Level Analysis** - Aggregate metrics across all calls within an eval_session
- [x] Group traces by session_id in analysis output
- [x] Per-session pass rate, cost, and latency summaries
- [x] Cross-session comparison for the same user journey
- [x] **Session Replay** - Re-execute a full session against a different model or prompt version
- [x] Extract all inputs from session traces in order
- [x] Replay with swapped model/provider
- [x] Session-level diff: compare original vs replayed outputs turn by turn
### Reproducibility
- [x] **Deterministic Evaluation Mode** - Ensure runs produce identical results given identical inputs
- [x] Fixed random seed for all sampling operations
- [x] Temperature 0 enforcement for judge LLM calls
- [x] --seed flag on run-eval for reproducible runs
- [x] **Run Manifest** - Record every parameter that could affect evaluation results
- [x] Store: evalyn version, Python version, provider versions, metric hashes, config hash
- [x] Manifest file alongside eval run results
- [x] evalyn verify-manifest to check reproducibility of a past run
- [x] **Custom Cost Models** - User-defined pricing for custom or self-hosted models
- [x] Per-model cost-per-token config in evalyn.yaml
- [x] Override default pricing for Ollama and other local models
- [x] Cost model versioning for tracking price changes over time
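
A Run Manifest needs a stable config hash so that two runs with the same settings compare equal. The sketch below hashes a canonical JSON form of the config; the manifest field names and the `build_manifest` and `verify_manifest` helpers are assumptions for illustration and cover only a subset of the fields listed above.

```python
import hashlib
import json
import platform
from typing import Any, Dict


def config_hash(config: Dict[str, Any]) -> str:
    """Hash a JSON-serializable config independently of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(config: Dict[str, Any], seed: int, tool_version: str) -> Dict[str, Any]:
    """Collect parameters that could affect results (a subset, for illustration)."""
    return {
        "tool_version": tool_version,
        "python_version": platform.python_version(),
        "seed": seed,
        "config_hash": config_hash(config),
    }


def verify_manifest(manifest: Dict[str, Any], config: Dict[str, Any]) -> bool:
    """Check whether the current config matches the one recorded in the manifest."""
    return manifest["config_hash"] == config_hash(config)
```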
### Cost Intelligence
- [x] **Auto-Update Pricing Tables** - Fetch latest model pricing from provider APIs
- [x] Scrape/fetch pricing from OpenAI, Anthropic, Google pricing pages
- [x] evalyn update-pricing command to refresh COST_PER_1M_TOKENS in _shared.py
- [x] Warn when using a model not in the pricing table
- [x] **Prompt Cache Savings Report** - Show how much prompt caching saved per run
- [x] Aggregate cache_creation_tokens and cache_read_tokens from spans
- [x] Calculate: actual cost vs hypothetical cost without caching
- [x] Recommend caching strategy based on prompt repetition patterns
- [x] **Context Window Utilization Alerts** - Warn when spans approach context limits
- [x] Alert when context_utilization_pct exceeds configurable threshold (default 80%)
- [x] Per-run summary: max utilization, mean utilization, models hitting limits
- [x] Suggest model upgrade when context is consistently near capacity
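
The per-run summary under Context Window Utilization Alerts (max utilization, mean utilization, models hitting limits) can be computed directly from span token counts. In this sketch the span dictionaries and their keys (`model`, `input_tokens`, `context_limit`) are hypothetical; only the 80% default threshold comes from the list above.

```python
from statistics import mean
from typing import Any, Dict, List


def utilization_summary(
    spans: List[Dict[str, Any]], threshold_pct: float = 80.0
) -> Dict[str, Any]:
    """Summarize context-window utilization across the spans of one run."""
    if not spans:
        raise ValueError("no spans to summarize")

    # Utilization per span as a percentage of that model's context limit.
    usage = [
        (span["model"], 100.0 * span["input_tokens"] / span["context_limit"])
        for span in spans
    ]
    percentages = [pct for _, pct in usage]
    return {
        "max_pct": max(percentages),
        "mean_pct": mean(percentages),
        "models_over_threshold": sorted(
            {model for model, pct in usage if pct > threshold_pct}
        ),
    }
```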
### Confidence Enhancements
- [x] **Adaptive Consistency Sampling** - Stop early when judge agreement is already clear
- [x] Sequential sampling: stop after 3 samples if all agree (skip remaining 2)
- [x] Configurable early-stop threshold (e.g. 100% agreement after 3 of 5 samples)
- [x] Cost savings report: samples skipped vs full sampling
- [x] **Confidence-Based Re-Evaluation** - Re-evaluate uncertain items with a stronger model
- [x] Identify items where confidence score < threshold after initial eval
- [x] Automatically re-run those items with a more capable model (e.g. flash -> pro)
- [x] Merge re-evaluated scores back into the run results
- [x] **Confidence Threshold Tuning** - Find optimal confidence cutoff per metric
- [x] Binary search for threshold that maximizes alignment with human annotations
- [x] Per-metric optimal threshold stored in calibration record
- [x] evalyn tune-confidence command
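
Adaptive Consistency Sampling stops drawing judge samples once agreement is already unanimous. The following sketch assumes the judge is a zero-argument callable returning a hashable verdict; `sample_with_early_stop` and the result field names are illustrative, and the defaults mirror the "3 of 5" example above.

```python
from collections import Counter
from typing import Any, Callable, Dict, Hashable, List


def sample_with_early_stop(
    draw_verdict: Callable[[], Hashable], max_samples: int = 5, min_samples: int = 3
) -> Dict[str, Any]:
    """Draw up to max_samples verdicts, stopping once min_samples all agree."""
    verdicts: List[Hashable] = []
    for _ in range(max_samples):
        verdicts.append(draw_verdict())
        # Unanimous agreement after the minimum number of samples ends sampling.
        if len(verdicts) >= min_samples and len(set(verdicts)) == 1:
            break

    majority = Counter(verdicts).most_common(1)[0][0]
    return {
        "verdict": majority,
        "samples_used": len(verdicts),
        "samples_skipped": max_samples - len(verdicts),
    }
```

Summing `samples_skipped` over a run gives the raw input for the cost savings report.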
### Config Enhancements
- [x] **Config Profiles** - Named environment profiles (dev/staging/prod) in evalyn.yaml
- [x] profiles: section with per-profile overrides
- [x] --profile flag on all commands to select active profile
- [x] Profiles inherit from base config, override specific keys
- [x] **Environment Variable Validation** - Check all required env vars at command startup
- [x] Required vars per command (e.g. run-eval needs GEMINI_API_KEY)
- [x] Validate key format and basic connectivity before starting long operations
- [x] Clear error messages: "GEMINI_API_KEY is set but invalid (HTTP 401)"
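
The presence check behind Environment Variable Validation might look like the sketch below. The `REQUIRED_ENV` table and `missing_env_vars` helper are hypothetical; only the `run-eval` to `GEMINI_API_KEY` pairing comes from the list above, and the key format and connectivity checks are not covered.

```python
import os
from typing import Dict, List, Mapping, Optional

# Hypothetical mapping of command name to required variables.
REQUIRED_ENV: Dict[str, List[str]] = {
    "run-eval": ["GEMINI_API_KEY"],
}


def missing_env_vars(command: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return required variables that are unset or blank for the given command."""
    env = os.environ if env is None else env
    return [
        name
        for name in REQUIRED_ENV.get(command, [])
        if not env.get(name, "").strip()
    ]
```

Running this before any long operation lets a command fail immediately with a list of the missing variable names.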
### Evaluation Enhancements
- [x] **Span-Level Evaluation** - Evaluate individual spans within a trace
- [x] Per-LLM-call quality metrics
- [x] Tool call success/failure analysis
- [x] Node-level evaluation for graph agents
- [x] Span-specific rubrics
- [x] **Multi-Turn Evaluation** - Specialized evaluation for conversations
- [x] Turn-by-turn quality assessment
- [x] Conversation flow metrics
- [x] Context carryover evaluation
- [x] Memory consistency across turns
- [x] Topic drift detection
- [x] Response latency patterns
- [x] **Pairwise Comparison** - A vs B evaluation mode
- [x] Side-by-side LLM judge comparison
- [x] Elo rating system for models
- [x] Win/loss/tie statistics
- [x] **Reference-Free Evaluation** - Metrics that don't need ground truth
- [x] Self-consistency checking (via --confidence consistency)
- [x] Uncertainty quantification (via confidence module)
- [x] **Evaluation Budget Control** - Stop early if token or cost budget is exceeded mid-run
- [x] --max-tokens and --max-cost flags on run-eval
- [x] Real-time budget tracking in ProgressCallback
- [x] Graceful stop: finish current item, checkpoint, report partial results
- [x] Budget summary in EvalRun metadata
- [x] **Differential Evaluation** - Only re-evaluate items that changed between dataset versions
- [x] Hash-based change detection using datasets.hash_inputs
- [x] Carry forward unchanged MetricResults from previous run
- [x] --diff-from flag to specify baseline run ID
- [x] Report showing only changed items and their score deltas
- [x] **Evaluation Caching** - Skip re-computing unchanged metric/item pairs across runs
- [x] Content-addressable cache keyed by (item_hash, metric_id, prompt_hash)
- [x] Cache stored in SQLite alongside eval runs
- [x] --no-cache flag to force re-evaluation
- [x] Cache hit/miss statistics in run summary
- [x] **Evaluation Dry-Run** - Estimate token cost and wall-clock time before executing
- [x] Count items x metrics, estimate tokens per metric type
- [x] Cost estimate by provider (Gemini, OpenAI pricing)
- [x] --dry-run flag that prints estimate and exits
- [x] Wall-clock estimate based on historical run data
- [x] **Cross-Validation Evaluation** - K-fold scoring for statistically robust metric estimates
- [x] --cv-folds N flag to split dataset into N folds
- [x] Stratified splitting by metadata or score
- [x] Per-fold and aggregate metric statistics with std deviation
- [x] Identify items with high variance across folds
- [x] **Evaluation Replay** - Re-run a past evaluation with different judge prompts or providers
- [x] --replay-run flag to reuse items/metrics from a previous run
- [x] Override provider, model, or calibrated prompts
- [x] Automatic comparison report between original and replayed run
- [x] **Conditional Metrics** - Run expensive subjective metrics only if cheap objective metrics pass first
- [x] Metric dependency declaration: "run helpfulness only if json_valid passes"
- [x] Gate conditions: pass/fail, score threshold, or custom predicate
- [x] Skip tracking: report which items had metrics skipped and why
- [x] **Evaluation Profiles** - Named configs (fast/thorough/cost-optimized) bundling workers, providers, and metric sets
- [x] Profile definitions in evalyn.yaml (fast: 8 workers, objective only; thorough: all metrics, 2 workers)
- [x] --profile flag on run-eval
- [x] Built-in profiles: smoke-test, standard, comprehensive
- [x] **Evaluation Tagging** - Tag runs with custom labels for filtering and organization
- [x] --tag flag on run-eval (multiple tags allowed)
- [x] Tags stored in EvalRun metadata and queryable via list-runs
- [x] Filter list-runs by tag: --filter-tag experiment-v2
- [x] **Async Evaluation Strategy** - Native asyncio execution strategy alongside sequential and parallel
- [x] AsyncStrategy using asyncio.gather for concurrent metric calls
- [x] Semaphore-based concurrency control (replaces ThreadPoolExecutor)
- [x] Compatible with async LLM client libraries (httpx, aiohttp)
- [x] --strategy flag: sequential, parallel, async
- [x] **Distributed Evaluation** - Fan out metric evaluation across multiple machines via task queue
- [x] Redis/RabbitMQ task queue for distributing metric evaluations
- [x] Worker process that pulls and evaluates metric tasks
- [x] Centralized result collection and checkpoint merging
- [x] --distributed flag with queue URL configuration
- [x] **Canary Evaluation** - Run eval on a small random subset first; abort full run if pass rate is below threshold
- [x] --canary N flag to evaluate N items before committing to full run
- [x] Configurable abort threshold (default: 20% pass rate on canary)
- [x] Cost savings report: how much was saved by aborting early
- [x] **Evaluation Warm-Up** - Discard first K results to reduce cold-start score variance from judge LLM
- [x] --warmup K flag discarding first K item scores
- [x] Re-evaluate warm-up items after LLM cache is primed
- [x] Measure score variance reduction from warm-up vs no warm-up
- [x] **Multi-Language Auto-Detection** - Detect output language and apply language-appropriate metric rubrics automatically
- [x] Language detection via character set and n-gram analysis (no external API)
- [x] Route to language-matched rubric variant when available
- [x] Report language distribution across dataset items
- [x] **Metric Score Normalization** - Normalize scores across metrics to a common scale for fair cross-metric comparison
- [x] Z-score normalization using historical score distributions per metric
- [x] Min-max normalization to [0, 1] range
- [x] --normalize flag on analyze and compare for normalized views
- [x] **Evaluation Resource Monitoring** - Track memory and CPU usage during evaluation to detect resource issues
- [x] Per-worker memory tracking via psutil (optional dependency)
- [x] Warning when memory exceeds configurable threshold
- [x] Resource usage summary in eval run metadata
- [x] **Evaluation Abort Conditions** - Compound abort rules beyond simple pass rate threshold
- [x] Rule syntax: "abort if any safety metric < 50% on any item"
- [x] Multiple abort conditions combinable with AND/OR logic
- [x] --abort-on flag on run-eval with condition expression
- [x] **Human-AI Hybrid Scoring** - Route uncertain items to human annotator during evaluation
- [x] Confidence threshold below which items are queued for human review
- [x] Interactive prompt during eval for human labels on flagged items
- [x] Merge human and judge scores in final EvalRun results
- [x] **Evaluation Result Changelog** - LLM-generated summary of differences between two runs
- [x] Natural language description: "3 items regressed on helpfulness, all related to multi-step queries"
- [x] evalyn changelog --run1 <id> --run2 <id> producing human-readable diff
- [x] Highlight most impactful changes
- [x] **Metric Execution Priority Queue** - Run most-likely-to-fail metrics first for faster feedback
- [x] Priority based on historical failure rate per metric
- [x] Surface failing metrics early in progress output
- [x] Combine with abort conditions for fast-fail workflows
- [x] **Evaluation Retry Budget** - Bound total retries across all items with a per-run budget
- [x] --max-retries-total flag (default: unlimited)
- [x] Track retries consumed vs budget in progress output
- [x] Prevent retry storms from consuming excessive tokens
- [x] **Evaluation Progress API** - Structured progress events for external monitoring tools
- [x] Emit JSON events to a file or socket: item_started, item_complete, metric_scored
- [x] Enable integration with CI dashboards, Slack bots, and custom UIs
- [x] --progress-file flag on run-eval writing JSONL progress events
- [x] **Evaluation Throttle Control** - Dynamically adjust concurrency based on API response times
- [x] Reduce workers when latency exceeds threshold (provider overloaded)
- [x] Increase workers when latency is low (headroom available)
- [x] Adaptive mode: --workers auto on run-eval
- [x] **Evaluation Split-Model Routing** - Route objective metrics to local compute, subjective to API
- [x] Automatic: objective metrics skip API entirely, subjective use configured provider
- [x] Cost savings report showing how much was saved by local objective evaluation
- [x] --local-objectives flag (default: true) on run-eval
- [x] **Evaluation Partial Result Access** - Query in-progress evaluation results before run completes
- [x] evalyn show-run --id <id> works on actively running evaluations via checkpoint data
- [x] Live pass rate estimate from completed items
- [x] Useful for monitoring long-running evaluations without waiting for completion
- [x] **Evaluation Comparison Auto-Trigger** - Automatically compare against pinned baseline after each run
- [x] When a baseline run is pinned, run-eval auto-runs compare at the end
- [x] Regression summary appended to run-eval output
- [x] --no-auto-compare flag to disable
- [x] **Evaluation Isolation Mode** - Run each metric in a subprocess to prevent crashes from affecting other metrics
- [x] --isolate flag spawning each metric evaluation in a child process
- [x] Crash in one metric produces error result without killing the run
- [x] Useful for untested custom metrics or unstable provider connections
- [x] **Evaluation Result Signing** - Cryptographic hash of results for tamper detection
- [x] SHA-256 hash of all MetricResults stored in EvalRun metadata
- [x] evalyn verify-run --id <id> checking result integrity against stored hash
- [x] Detect if results were manually edited after evaluation
- [x] **Evaluation Item-Level Cost Attribution** - Track exact LLM cost per dataset item
- [x] Sum input/output tokens across all metrics for each item
- [x] Per-item cost in show-run output and export formats
- [x] Identify most expensive items for cost optimization
- [x] **Evaluation Output Diff** - Show exact text differences between expected and actual output per item
- [x] evalyn diff-outputs --run <id> showing per-item expected vs actual text diff
- [x] Highlight added/removed/changed text with color coding
- [x] Filter to only items where expected reference is available
- [x] **Judge Debiasing** - Mitigate known LLM judge biases (position, length, verbosity)
- [x] Position-bias mitigation: swap answer order in pairwise comparisons and average
- [x] Length-controlled scoring: GLM correction for length preference (AlpacaEval approach)
- [x] Regression-based bias correction from small human-annotated calibration set
- [x] Report bias metrics per judge model in calibration output
- [x] **Agent Goal Completion Metrics** - Evaluate whether agents achieve stated objectives
- [x] ToolCallAccuracy: sequence + argument correctness (Ragas-inspired)
- [x] ToolCallF1: unordered tool call matching
- [x] AgentGoalAccuracy: end-state vs expected outcome assessment
- [x] TopicAdherence: domain boundary enforcement for conversational agents
- [x] **Automatic Test Case Generation from Behaviors** - Generate diverse scenarios from behavior descriptions
- [x] Bloom-style pipeline: understand behavior -> generate scenarios -> execute -> score
- [x] Mine production traces for challenging evaluation cases (Arena-Hard BenchBuilder pattern)
- [x] Synthesize adversarial variants of existing test cases
- [x] **DAG-Based Deterministic Evaluation** - Decision-tree scoring as middle ground between rules and LLM judge
- [x] DAGMetric: LLM-powered decision trees for structured scoring (DeepEval-inspired)
- [x] Deterministic evaluation paths based on input characteristics
- [x] Lower cost than full LLM judge, more flexible than regex rules
- [x] **Statistical Evaluation Reporting** - Confidence intervals and power analysis for all metrics
- [x] Bootstrap confidence intervals (1000 resamples) on metric scores
- [x] Power analysis: recommend minimum sample size for target precision
- [x] Significance testing for run-to-run comparisons (two-proportion z-test)
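
The statistical reporting items above can be illustrated with a minimal, standard-library-only sketch. It is not Evalyn's implementation: the function names `bootstrap_ci` and `two_proportion_z_test`, the percentile method, and the fixed seed are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist, mean


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean score."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def two_proportion_z_test(pass_a, n_a, pass_b, n_b):
    """Two-sided z-test for a difference in pass rates between two runs."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both runs passed (or failed) every item: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value
```

For example, `two_proportion_z_test(90, 100, 60, 100)` compares a 90% pass rate against a 60% pass rate on 100 items each and returns the z statistic together with a two-sided p-value.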
### Calibration & Optimization
- [x] **More Optimizers**
- [x] DSPy MIPROv2 - Multi-stage instruction optimization
- [x] TextGrad - Gradient-based prompt optimization
- [x] EvoPrompt - Evolutionary prompt optimization
- [x] PromptBreeder - Self-referential prompt evolution
- [x] **Rubric Optimization** - Auto-generate and refine evaluation rubrics
- [x] LLM-generated rubric from example pass/fail items
- [x] Iterative rubric refinement based on disagreement analysis
- [x] Rubric clarity scoring (can a different LLM interpret it consistently?)
- [x] A/B test rubric variants for inter-judge agreement
- [x] **Few-Shot Example Selection** - Optimize which examples to include in prompts
- [x] Select maximally informative examples from annotation pool
- [x] Diversity-based selection: cover different failure modes
- [x] Leave-one-out evaluation to measure example contribution
- [x] Dynamic example count optimization (find optimal k)
- [x] **Judge Ensemble** - Combine multiple judges for robust evaluation
- [x] Majority vote across N judges (same or different models)
- [x] Weighted ensemble based on per-judge calibration accuracy
- [x] Disagreement flagging: items where judges disagree go to human review
- [x] Cost-aware ensemble: use cheap judge first, expensive only on uncertain items
- [x] **Active Learning** - Smart sample selection for annotation
- [x] Uncertainty sampling: prioritize items where judge confidence is lowest
- [x] Disagreement sampling: prioritize items where judge and heuristics disagree
- [x] Diversity sampling: ensure coverage of input space
- [x] Batch-mode active learning with configurable batch size
- [x] **Transfer Calibration** - Apply calibration learned on one metric to similar metrics
- [x] Metric similarity detection based on rubric text embedding
- [x] Shared preamble transfer with metric-specific rubric
- [x] Transfer effectiveness validation on held-out samples
- [x] **Calibration Staleness Detection** - Warn when calibration age or dataset drift exceeds threshold
- [x] Track calibration date and dataset hash at calibration time
- [x] Alert when dataset changes exceed drift threshold (new items, distribution shift)
- [x] Re-calibration recommendation with estimated alignment degradation
- [x] **Cross-Provider Calibration** - Calibrate for consistency when switching judge providers
- [x] Run same calibration set across providers (Gemini, OpenAI, Ollama)
- [x] Provider-specific preamble adjustments
- [x] Cross-provider agreement metrics
- [x] **Calibration A/B Testing** - Compare calibrated vs uncalibrated prompts on the same dataset
- [x] Side-by-side evaluation run with original and calibrated prompts
- [x] Per-item comparison showing score changes
- [x] Statistical significance test for improvement
- [x] **Calibration Rollback** - Revert to a previous calibration if the new one degrades alignment
- [x] Calibration history stored in CalibrationRecord
- [x] --rollback flag on calibrate command
- [x] Automatic rollback suggestion when validation metrics drop
- [x] **Multi-Objective Calibration** - Optimize jointly for accuracy and cost (fewer tokens per judgment)
- [x] Pareto front of accuracy vs token count
- [x] Prompt compression as optimization objective
- [x] Configurable accuracy/cost trade-off weight
- [x] **Calibration Cost Tracking** - Report total LLM cost of the calibration process itself
- [x] Per-optimizer token usage tracking (extend TokenAccumulator)
- [x] Cost breakdown by calibration phase (alignment, optimization, validation)
- [x] Historical cost trends across calibration runs
- [x] **Calibration Curriculum** - Start optimization on easy examples, progressively add harder ones
- [x] Sort calibration examples by judge confidence (easy = high confidence)
- [x] Progressive expansion: start with top-50% easiest, add harder items
- [x] Early stopping if optimizer plateaus before reaching hard examples
- [x] **Calibration Convergence Visualization** - Plot alignment score vs optimization step to diagnose optimizer behavior
- [x] Record per-step alignment scores during optimization
- [x] Detect plateau, oscillation, and divergence patterns
- [x] ASCII convergence chart in terminal, SVG in HTML reports
- [x] Recommend optimizer parameter changes based on convergence shape
- [x] **Prompt Length Regularization** - Penalize prompt length during calibration to keep judge prompts concise
- [x] Add token count penalty term to optimizer objective function
- [x] Configurable weight: --length-penalty 0.1 (default 0, no penalty)
- [x] Report prompt token savings vs alignment trade-off
- [x] **Calibration Data Augmentation** - Augment calibration examples by paraphrasing to improve optimizer generalization
- [x] LLM-powered paraphrase of calibration inputs preserving semantics
- [x] Expand calibration set 2-5x without additional human annotation
- [x] Validate paraphrased items preserve original labels
- [x] **Calibration Difficulty Weighting** - Weight alignment errors by item difficulty so hard items count more
- [x] Difficulty estimate from cross-annotator disagreement or judge confidence
- [x] Weighted accuracy metric in optimizer objective
- [x] Prevent optimizer from gaming easy items while ignoring hard ones
- [x] **Per-Score-Level Calibration** - Calibrate separately for each score level to reduce systematic bias
- [x] Detect if judge systematically over/under-scores at specific levels
- [x] Score-level-specific preamble adjustments
- [x] Confusion matrix per score level showing calibration effectiveness
- [x] **Calibration Ensemble Fusion** - Run multiple optimizers and fuse outputs via tournament selection
- [x] Run 2-3 optimizers in parallel on same calibration data
- [x] Tournament: evaluate each optimizer's prompt on held-out set
- [x] Select best-performing prompt or blend top-K prompts
- [x] **Calibration Sensitivity Analysis** - Measure alignment sensitivity to small prompt perturbations
- [x] Perturb calibrated prompt (word swaps, paraphrase, reorder)
- [x] Measure alignment variance across perturbations
- [x] Flag calibrations that are fragile (small change causes large alignment drop)
- [x] **Few-Shot Example Ordering** - Optimize the order of examples in few-shot judge prompts
- [x] Test permutations of example order and measure alignment impact
- [x] Heuristics: put hardest examples last, group by failure type
- [x] Store optimal order in CalibrationRecord
- [x] **Calibration Diagnostic Report** - Detailed analysis of why calibration improved or degraded alignment
- [x] Per-item breakdown: which items flipped from wrong to right (and vice versa)
- [x] Prompt diff showing exactly what changed in the preamble
- [x] Categorize improvements by item type (false positive fixes vs false negative fixes)
- [x] **Calibration Freeze** - Lock a calibration record to prevent accidental overwriting
- [x] evalyn freeze-calibration --id <id> marking calibration as immutable
- [x] Prevent calibrate command from overwriting frozen records
- [x] evalyn unfreeze-calibration to unlock when intentional re-calibration is needed
- [x] **Calibration Comparison Dashboard** - Side-by-side view of multiple calibration attempts
- [x] evalyn compare-calibrations --ids <id1> <id2> showing alignment metrics
- [x] Prompt diff between calibration versions
- [x] Per-item score change matrix across calibrations
- [x] **Calibration Checkpoint** - Save optimizer state mid-run for resuming long calibrations
- [x] Atomic checkpoint writes at configurable intervals during optimization
- [x] evalyn calibrate --resume to continue from last checkpoint
- [x] Prevent wasted compute on interrupted calibration runs
- [x] **Calibration Human Validation** - Present calibrated prompt to human for approval before committing
- [x] Show before/after prompt diff and alignment metrics change
- [x] Interactive confirm/reject/edit before writing CalibrationRecord
- [x] --auto-accept flag to skip validation in CI
- [x] **Calibration Memory** - Remember what approaches failed in past calibration runs
- [x] Store failed prompt variants and their alignment scores
- [x] Optimizer avoids re-exploring previously failed regions of prompt space
- [x] Accumulated across calibration runs for the same metric
- [x] **Calibration Scope Control** - Calibrate only for specific item subsets
- [x] --scope flag: calibrate for long inputs only, or specific metadata values
- [x] Scope-specific preambles stored separately in CalibrationRecord
- [x] Apply scope-matched calibration at eval time based on item characteristics
- [x] **Calibration Time Budget** - Stop optimization after N minutes regardless of convergence
- [x] --max-time flag on calibrate command (e.g. --max-time 10m)
- [x] Return best prompt found within time budget
- [x] Report whether optimizer converged or was time-limited
- [x] **Calibration Alignment Curve** - Plot alignment vs annotation count to find diminishing returns
- [x] Re-calibrate with increasing annotation subsets (10%, 25%, 50%, 75%, 100%)
- [x] Plot alignment improvement vs annotation count
- [x] Recommend minimum annotation count for acceptable calibration quality
- [x] **Calibration Negative Example Mining** - Find the hardest examples where calibrated prompt still fails
- [x] After calibration, identify items where the calibrated judge still disagrees with humans
- [x] Cluster these remaining failures by pattern
- [x] Use as targeted additions to calibration set for next round
- [x] **Calibration Prompt Templates** - Reusable preamble templates for common calibration patterns
- [x] Built-in templates: "strict evaluator", "lenient evaluator", "domain expert"
- [x] --template flag on calibrate to start from a template instead of blank
- [x] Save successful calibration preambles as custom templates
- [x] **Calibration Batch Processing** - Calibrate multiple metrics in one command
- [x] evalyn calibrate --metrics all calibrating every metric with annotations
- [x] Parallel calibration of independent metrics for speed
- [x] Combined calibration report showing per-metric alignment improvements
- [x] **SAMMO-Style Structural Optimization** - Treat prompts as symbolic DAGs with structural mutations
- [x] Represent prompt as sections (instruction, context, examples, rubric) with structural operators
- [x] Mutations: paraphrase section, drop section, reformat, reorder examples
- [x] Multi-objective search: accuracy vs prompt length vs cost
- [x] **Annotation Queue Flywheel** - Closed loop where human labels improve judge, reducing future annotation needs
- [x] Track judge accuracy on human-labeled items over time
- [x] Identify metrics where judge is now reliable enough to skip human review
- [x] Gradually reduce annotation requirement as calibration improves
- [x] **CAPO Optimizer** - Current SOTA prompt optimization algorithm
- [x] Implement CAPO (Cost-Aware Prompt Optimization) as new optimizer
- [x] Add to OPTIMIZER_REGISTRY alongside existing 9 optimizers
- [x] Benchmark against existing optimizers on standard calibration tasks
- [x] **Specialized Judge Model Support** - Fine-tuned evaluation models outperform general LLM-as-judge
- [x] Support custom model endpoints as judge providers (Patronus Lynx pattern)
- [x] Configurable per-metric: use specialized model for safety, general model for quality
- [x] Track and compare judge model accuracy across calibration rounds
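
As a rough illustration of the "Pareto front of accuracy vs token count" item under Multi-Objective Calibration, the sketch below keeps only prompt variants that no other variant beats on both axes. The `Candidate` type and `pareto_front` function are hypothetical names, not part of Evalyn.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str        # hypothetical label for a calibrated prompt variant
    accuracy: float  # alignment with human labels, higher is better
    tokens: int      # prompt token count, lower is better


def pareto_front(candidates):
    """Return candidates not dominated on (accuracy up, tokens down)."""
    front = []
    best_accuracy = float("-inf")
    # Visit cheapest prompts first; on equal token count, most accurate first.
    for c in sorted(candidates, key=lambda c: (c.tokens, -c.accuracy)):
        # A candidate survives only if it beats every cheaper candidate.
        if c.accuracy > best_accuracy:
            front.append(c)
            best_accuracy = c.accuracy
    return front
```

A configurable accuracy/cost weight can then pick a single point from the returned front instead of searching all candidates.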
### Multi-Modal Evaluation
- [x] **Image Evaluation Metrics**
- [x] Image-text alignment (CLIP score)
- [x] Visual quality assessment
- [x] OCR accuracy for generated images
- [x] Style consistency
- [x] **Audio Evaluation Metrics**
- [x] Speech clarity
- [x] Transcription accuracy (WER)
- [x] Prosody and tone
- [x] **Video Evaluation Metrics**
- [x] Frame consistency
- [x] Temporal coherence
- [x] Action recognition accuracy
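
For reference, transcription accuracy (WER) is conventionally the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below shows that standard definition. The function name `word_error_rate` and the whitespace tokenization are assumptions, and no text normalization (casing, punctuation) is applied.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words.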
### Agent-Specific Evaluation
- [x] **Tool Use Evaluation**
- [x] Tool selection appropriateness
- [x] Parameter correctness
- [x] Error recovery patterns
- [x] Tool chain efficiency
- [x] **Planning Evaluation**
- [x] Plan completeness
- [x] Step ordering correctness
- [x] Resource efficiency
- [x] Replanning quality
- [x] **Reasoning Evaluation**
- [x] Chain-of-thought faithfulness
- [x] Logical consistency
- [x] Evidence usage
- [x] Conclusion validity
- [x] **Multi-Agent Communication Scoring** - Evaluate quality of inter-agent communication
- [x] Communication Score (1-5 per utterance): relevance, clarity, information density
- [x] Collaborative efficiency: ratio of useful exchanges to total messages
- [x] Milestone-based KPIs: track which coordination milestones are achieved (MARBLE approach)
- [x] **Agent Consistency Testing** - Measure reliability across repeated runs
- [x] Run agent N times on same input, measure consistency of tool calls and outputs
- [x] Research finding: 60% single-run success drops to 25% at 8-run consistency
- [x] Report consistency score alongside pass rate
- [x] **Agentic Benchmark Integration** - Run standard agent benchmarks within evalyn
- [x] SWE-bench integration for coding agent evaluation
- [x] WebArena integration for web agent evaluation
- [x] GAIA integration for general agent evaluation
- [x] Unified reporting across benchmarks
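
One possible reading of the Agent Consistency Testing items is sketched below: run the agent N times on the same input and report the share of runs that agree with the most common tool-call sequence. This definition of consistency and the name `consistency_score` are assumptions, not Evalyn's actual scoring.

```python
from collections import Counter


def consistency_score(runs):
    """Fraction of runs whose tool-call sequence matches the most common one.

    `runs` is a list of tool-call name sequences, one per repeated run.
    """
    if not runs:
        raise ValueError("need at least one run")
    sequences = [tuple(run) for run in runs]  # tuples are hashable
    _, count = Counter(sequences).most_common(1)[0]
    return count / len(sequences)
```

A score of 1.0 means every run used the same tools in the same order. Lower values flag inputs where the agent's behavior varies between runs, even if each individual run passes.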
### Graph & Multi-Agent Evaluation
- [x] **Graph Topology Extraction** - Extract and visualize LangGraph execution topology from traces
- [x] Build DAG from graph/node spans captured by LangGraphInstrumentor
- [x] Identify critical path (longest execution chain through nodes)
- [x] Detect cycles and redundant node executions
- [x] evalyn show-graph --call <id> rendering ASCII or Mermaid diagram
- [x] **Node-Level Metric Attribution** - Attribute eval failures to specific graph nodes
- [x] Map MetricResult failures back to the node span that produced the failing output
- [x] Per-node pass rate aggregation across dataset items
- [x] Identify "bottleneck nodes" that cause the most failures
- [x] **Subagent Cost Allocation** - Track cost per subagent in multi-agent traces
- [x] Aggregate token/cost from Claude Agent SDK's SubagentContext hierarchy
- [x] Per-subagent cost breakdown in show-trace and analyze output
- [x] Identify most expensive subagent paths for optimization
- [x] **Agent Decision Tree Visualization** - Render agent's tool selection choices as a tree
- [x] Build decision tree from tool_call/tool_result span sequences
- [x] Highlight decision points where agent chose between tools
- [x] Compare decision trees across different runs or models
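
The critical-path and cycle-detection items under Graph Topology Extraction can be sketched with the standard library's `graphlib` (Python 3.9+). This is an illustration only: `critical_path`, the `durations` mapping, and the `edges` list are hypothetical stand-ins for data derived from graph/node spans.

```python
from graphlib import CycleError, TopologicalSorter


def critical_path(durations, edges):
    """Longest-duration chain through a DAG of nodes.

    durations: {node: seconds}; edges: [(parent, child), ...].
    Raises graphlib.CycleError if the graph contains a cycle.
    """
    preds = {node: set() for node in durations}
    for parent, child in edges:
        preds[child].add(parent)

    finish = {}     # total duration of the longest chain ending at each node
    best_pred = {}  # predecessor of each node on that chain
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        start, via = 0.0, None
        for p in preds[node]:
            if via is None or finish[p] > start:
                start, via = finish[p], p
        finish[node] = start + durations[node]
        best_pred[node] = via

    # Walk back from the node with the largest finish time.
    end = max(finish, key=finish.get)
    path = [end]
    while best_pred[path[-1]] is not None:
        path.append(best_pred[path[-1]])
    return path[::-1], finish[end]
```

Catching `CycleError` separately gives a natural hook for reporting cycles instead of failing the whole analysis.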
### Pipeline Customization
- [x] **Custom Pipeline Definitions** - User-defined step sequences beyond the fixed 7-step pipeline
- [x] Pipeline definition in evalyn.yaml with ordered step list
- [x] Skip/include steps declaratively (instead of --skip-annotation flags)
- [x] Custom step plugins: user-defined Python functions as pipeline steps
- [x] **Pipeline Templates** - Preset pipelines for different evaluation goals
- [x] "quick-check" template: build-dataset -> objective metrics only -> analyze
- [x] "full-audit" template: all 7 steps + simulation + deep insights
- [x] "ci-gate" template: objective metrics + threshold check + exit code
- [x] evalyn one-click --template quick-check
- [x] **Pipeline Comparison** - Compare results of two one-click pipeline runs
- [x] evalyn compare-pipelines <dir1> <dir2>
- [x] Step-by-step comparison: dataset size, metric count, scores, cost
- [x] Identify which pipeline changes improved or degraded results
### Infrastructure & Platform
- [x] **Web Dashboard** - Browser-based UI for viewing traces, datasets, and results
- [x] Trace viewer with span tree navigation (like Phoenix/LangSmith)
- [x] Dataset browser with item search, sort, and filter
- [x] Eval run comparison view with metric charts
- [x] Real-time run progress monitoring
- [x] Lightweight server (Flask/FastAPI) bundled with evalyn
- [x] **CI/CD Integration** - GitHub Actions for automated testing and evaluation on PR
- [x] GitHub Action YAML template for evalyn run-eval
- [x] PR comment bot posting eval results as markdown table
- [x] Regression gate: fail CI if metrics drop below threshold
- [x] Artifact upload of HTML reports and datasets
- [x] GitLab CI and Jenkins pipeline examples
- [x] **GitHub Action for Evalyn** - Dedicated reusable GitHub Action for PR evaluation
- [x] Modeled on braintrustdata/eval-action: run eval, post diff as PR comment
- [x] Caching of previous run results for fast comparison
- [x] Quality gate: configurable pass/fail threshold as PR check status
- [x] **Regression Detection** - Automatic alerts when metrics drop below threshold
- [x] **Multi-model Comparison** - Compare same prompts across different LLM providers
- [x] --models flag to run same eval across multiple providers in one command
- [x] Cross-model comparison table (rows=items, columns=models)
- [x] Cost/latency/quality trade-off analysis per model
- [x] Best-model-per-item analysis
- [x] **Cost Tracking Dashboard** - Visualize LLM API costs over time
- [x] Per-run cost breakdown by metric and provider
- [x] Cumulative cost chart across all runs
- [x] Cost-per-item and cost-per-metric averages
- [x] Budget forecast based on historical usage
- [x] **API Server Mode** - REST API for programmatic access
- [x] REST endpoints: /runs, /traces, /datasets, /metrics
- [x] Trigger eval runs via POST /runs with JSON config
- [x] WebSocket endpoint for real-time run progress
- [x] API key authentication for multi-user access
- [x] **Team Collaboration** - Multi-user annotation with conflict resolution
- [x] User identity tracking on annotations
- [x] Assignment queue: distribute items across annotators
- [x] Conflict detection when multiple users annotate same item
- [x] Resolution strategies: majority vote, senior override, discussion
- [x] **Cloud Storage Backend** - Optional S3/GCS storage for large datasets
- [x] S3-compatible backend implementing StorageBackend protocol
- [x] GCS backend with service account authentication
- [x] Hybrid mode: SQLite for metadata, cloud for large payloads
- [x] Configurable via evalyn.yaml storage section
- [x] **Storage Compaction** - Vacuum and optimize SQLite database on demand
- [x] evalyn compact command to VACUUM and ANALYZE
- [x] Auto-compaction trigger when DB exceeds size threshold
- [x] Orphan cleanup: remove spans not linked to any function_call
- [x] **Data Retention Policies** - Auto-delete traces and runs older than a configurable threshold
- [x] retention_days setting in evalyn.yaml
- [x] evalyn purge --older-than 30d command
- [x] Exempt pinned/starred runs from auto-deletion
- [x] Dry-run mode showing what would be deleted
- [x] **Storage Migration** - Export/import data between different storage backends
- [x] evalyn export-db --format sqlite/json/parquet
- [x] evalyn import-db to load from another backend
- [x] Schema version validation on import
- [x] Incremental export: only new data since last export
- [x] **Encrypted Storage** - At-rest encryption for sensitive trace and evaluation data
- [x] SQLCipher integration for encrypted SQLite
- [x] Key management via environment variable or keyring
- [x] Selective encryption: encrypt input/output payloads, keep metadata queryable
- [x] **Storage Statistics** - Show database size, row counts, and growth rate over time
- [x] evalyn storage-stats command
- [x] Row counts per table (function_calls, eval_runs, annotations, otel_spans)
- [x] Size breakdown: data vs index vs free space
- [x] Growth rate: new rows per day/week
- [x] **Plugin System** - Third-party metric, instrumentor, and storage backend plugins via entry points
- [x] Python entry_points discovery for evalyn.metrics, evalyn.instrumentors, evalyn.storage
- [x] Plugin manifest with version compatibility declaration
- [x] evalyn list-plugins command
- [x] Plugin isolation: plugins cannot modify core behavior
- [x] **Webhook Notifications** - Trigger HTTP webhooks on eval completion, failure, or regression
- [x] Configurable webhook URLs in evalyn.yaml
- [x] Event types: run_complete, regression_detected, annotation_needed
- [x] Payload includes run summary, metric scores, and delta from previous
- [x] Retry with exponential backoff on delivery failure
- [x] **Rate Limit Awareness** - Respect LLM provider rate limits with automatic throttling during evaluation
- [x] Per-provider rate limit config (RPM, TPM) in evalyn.yaml
- [x] Adaptive backoff when 429 errors received
- [x] Token bucket rate limiter shared across parallel workers
- [x] Rate limit status in progress callback output
- [x] **Connection Pooling** - Reuse SQLite connections for high-throughput multi-threaded evaluation
- [x] Thread-local connection pool with configurable max size
- [x] Connection health checking and recycling
- [x] WAL mode auto-enable for concurrent readers
- [x] **Incremental Backup** - Periodic automatic backup of database to a secondary location
- [x] SQLite online backup API integration
- [x] Configurable backup schedule and destination path
- [x] Backup rotation: keep last N backups
- [x] **Auto Model Selection** - Choose judge model based on task complexity (fast model for easy items, smart model for hard ones)
- [x] Complexity heuristic based on input length, output length, and metric type
- [x] Model routing: flash-lite for simple items, flash for complex items
- [x] Cost savings report showing how much auto-selection saved vs always-smart
- [x] **Storage Partitioning** - Partition SQLite databases by time period for better performance at scale
- [x] Monthly or weekly database files (evalyn_2026_03.sqlite)
- [x] Transparent cross-partition queries via ATTACH DATABASE
- [x] Auto-archive old partitions to reduce active DB size
- [x] **Storage Integrity Checks** - Verify referential integrity between tables
- [x] Check function_calls referenced by eval_runs still exist
- [x] Check otel_spans have valid parent span references
- [x] evalyn storage-check producing integrity report with fixable/unfixable issues
- [x] **Storage Schema Introspection** - Show current database schema and statistics
- [x] evalyn storage-schema listing table schemas, column types, index definitions
- [x] Schema version and migration history
- [x] Useful for debugging and plugin development
- [x] **Storage Merge** - Merge two SQLite databases from different machines with conflict resolution
- [x] evalyn storage-merge --source <db2> --into <db1>
- [x] Deduplication by primary key (function call ID, span ID, run ID)
- [x] Conflict strategy: skip, overwrite, or rename
- [x] **Storage Index Tuning** - Auto-create indexes based on common query patterns
- [x] Profile slow queries in list-calls, list-runs, build-dataset
- [x] evalyn storage-tune creating recommended indexes
- [x] Report query speedup after index creation
- [x] **Storage Query Logging** - Log SQL queries for performance debugging and optimization
- [x] EVALYN_QUERY_LOG=1 env var enabling query logging to .evalyn/queries.log
- [x] Log query text, execution time, rows returned
- [x] Identify slowest queries for index tuning
- [x] **Storage Cross-Reference Report** - Show relationships between stored entities
- [x] evalyn storage-xref showing: traces -> datasets -> runs -> annotations linkage
- [x] Identify orphaned entities (runs referencing deleted datasets, etc.)
- [x] Entity count summary per relationship type
- [x] **Storage Connection Diagnostics** - Report SQLite configuration and health
- [x] evalyn storage-diag showing journal mode (WAL or rollback), page size, cache size
- [x] File lock status and concurrent access warnings
- [x] Recommend optimal SQLite pragmas for current workload
- [x] **Storage Snapshot/Restore** - Point-in-time snapshots for safe experimentation
- [x] evalyn storage-snapshot --name "before-cleanup" creating named copy
- [x] evalyn storage-restore --name "before-cleanup" reverting to snapshot
- [x] Snapshot list with timestamps and sizes
- [x] **Storage Usage Forecast** - Predict storage growth based on current usage rate
- [x] Compute growth rate from last 7/30/90 days
- [x] Estimate when storage will reach configurable size threshold
- [x] evalyn storage-forecast showing projected growth chart
- [x] **Storage Migration Versioning** - Formal migration version tracking with up/down support
- [x] Version table tracking which migrations have been applied
- [x] Down-migration support for rolling back schema changes
- [x] evalyn storage-migrate --status showing current schema version
- [x] **Storage Read-Only Mode** - Prevent accidental writes during analysis
- [x] EVALYN_DB_READONLY=1 env var opening database in read-only mode
- [x] Useful when sharing databases or running analysis on production data
- [x] Clear error message when write is attempted in read-only mode
- [x] **Storage Multi-DB Queries** - Query across prod and test databases simultaneously
- [x] evalyn list-calls --db all searching both prod.sqlite and test.sqlite
- [x] Cross-database comparison: production traces vs test traces
- [x] ATTACH DATABASE under the hood with transparent result merging
- [x] **Storage WAL Monitoring** - Monitor Write-Ahead Log size and checkpoint frequency
- [x] evalyn storage-wal showing WAL file size, checkpoint status
- [x] Warning when WAL exceeds configurable size threshold
- [x] Auto-checkpoint recommendation based on write patterns
- [x] **Storage Auto-Vacuum Scheduling** - Schedule automatic vacuum based on database growth
- [x] auto_vacuum_threshold setting in evalyn.yaml (e.g. 500MB)
- [x] Run VACUUM automatically when DB crosses threshold during write operations
- [x] Log vacuum events with space reclaimed
- [x] **Storage Data Checksums** - Verify data integrity with per-row checksums
- [x] Store SHA-256 hash of critical fields (input, output, spans) alongside rows
- [x] evalyn storage-verify checking all rows against stored checksums
- [x] Detect corruption from concurrent writes or filesystem errors
- [x] **Storage Anonymous Export** - Strip identifying information when sharing databases
- [x] evalyn storage-export --anonymous replacing PII-like content with placeholders
- [x] Preserve data structure, metadata, and statistics while removing content
- [x] Useful for sharing databases for debugging without exposing user data
- [x] **Denormalized Storage Optimization** - Flatten trace hierarchy for query performance
- [x] Langfuse found 10x dashboard speedup by denormalizing trace attributes onto span rows
- [x] Store trace-level metadata (project, session_id, user_id) on every span row
- [x] Eliminate JOIN overhead for common query patterns (list spans with trace context)
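
As an illustration of the Storage Read-Only Mode item above, here is a minimal sketch that assumes the storage layer is plain SQLite opened through Python's `sqlite3` module. The helper names `open_db` and `safe_execute` are hypothetical and not taken from the evalyn codebase; only the `EVALYN_DB_READONLY` variable comes from the roadmap.

```python
import os
import sqlite3
from pathlib import Path


def open_db(path: str) -> sqlite3.Connection:
    """Open a SQLite database, read-only when EVALYN_DB_READONLY=1.

    Hypothetical helper: the real evalyn storage layer may differ.
    """
    if os.environ.get("EVALYN_DB_READONLY") == "1":
        # A SQLite URI with mode=ro makes the engine itself reject writes.
        uri = Path(path).resolve().as_uri() + "?mode=ro"
        return sqlite3.connect(uri, uri=True)
    return sqlite3.connect(path)


def safe_execute(conn: sqlite3.Connection, sql: str, params=()):
    """Run a statement, turning SQLite's read-only error into a clearer one."""
    try:
        return conn.execute(sql, params)
    except sqlite3.OperationalError as exc:
        if "readonly" in str(exc).lower():
            raise PermissionError(
                "Database is open in read-only mode "
                "(EVALYN_DB_READONLY=1); write rejected."
            ) from exc
        raise
```

Enforcing read-only access at the connection level, rather than filtering statements in application code, means a forgotten code path cannot write by accident.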
### Data & Dataset
- [x] **Dataset Versioning** - Track dataset changes over time with diff view
- [x] Content-hash versioning on each build-dataset invocation
- [x] Diff view: items added, removed, and modified between versions
- [x] Version log stored alongside dataset.jsonl
- [x] Rollback to previous version via evalyn dataset-rollback
- [x] **Synthetic Data Generation**
- [x] Adversarial example generation
- [x] Edge case mining
- [x] Demographic variation
- [x] Domain-specific generators
- [x] **Data Augmentation** - Automatically expand datasets
- [x] Paraphrase generation: rephrase inputs preserving semantics
- [x] Input perturbation: typos, casing, formatting variations
- [x] Language translation: generate multilingual variants
- [x] Context expansion: add/remove context to test robustness
- [x] **Golden Set Management** - Curate and maintain evaluation benchmarks
- [x] evalyn golden-set create/add/remove commands
- [x] Lock golden set items from modification
- [x] Track golden set coverage: % of metrics with golden examples
- [x] Periodic validation: re-evaluate golden set to detect model drift
- [x] **Dataset Splitting** - Train/test/validation splits with stratification by metadata fields
- [x] evalyn split-dataset --ratio 0.7/0.15/0.15
- [x] Stratification by metadata keys (tag, source, difficulty)
- [x] Deterministic splitting with configurable random seed
- [x] Output as separate JSONL files in split/ subdirectory
- [x] **Dataset Statistics** - Auto-compute input/output length distributions, token counts, label balance
- [x] evalyn dataset-stats command
- [x] Input/output token count histograms
- [x] Metadata field value distributions
- [x] Expected reference coverage (% items with ground truth)
- [x] Duplicate detection report
- [x] **Dataset Merge and Diff** - Combine two datasets or show item-level differences between them
- [x] evalyn dataset-merge --deduplicate
- [x] evalyn dataset-diff showing added/removed/changed items
- [x] Conflict resolution for items with same ID but different content
- [x] **External Format Import** - Import from HuggingFace datasets, LMSYS Arena, or custom CSV schemas
- [x] evalyn import --format huggingface --dataset-name <name>
- [x] CSV import with column mapping config
- [x] LMSYS Arena format (conversation pairs with human preference)
- [x] Auto-detect format from file extension and content
- [x] **Schema Evolution** - Handle format changes across dataset versions with automatic migration
- [x] Version field in dataset header line
- [x] Automatic migration on load (old format to current)
- [x] Migration log showing which transformations were applied
- [x] **Dataset Sampling Preview** - Show sample items and summary stats before building full dataset
- [x] --preview flag on build-dataset showing 5 sample items
- [x] Summary: item count, avg input/output length, metadata distribution
- [x] Confirmation prompt before writing full dataset
- [x] **Dataset Pinning** - Lock a dataset version hash for reproducible evaluations across environments
- [x] SHA-256 hash stored in dataset metadata
- [x] --pinned flag on run-eval to verify hash before evaluation
- [x] Pin file (.evalyn-pin) for CI/CD reproducibility
- [x] **Dataset Lineage** - Track which traces and runs produced each dataset item
- [x] Source trace ID and function_call ID in item metadata
- [x] Lineage query: "which traces contributed to this dataset?"
- [x] Reverse lineage: "which datasets use this trace?"
- [x] **Dataset Filtering DSL** - Query-based item filtering (e.g. "items where output_length > 500 and tag=production")
- [x] --filter flag on build-dataset and run-eval
- [x] Operators: =, !=, >, <, contains, matches (regex)
- [x] Compound filters with AND/OR
- [x] Filter on metadata fields, input/output length, and item ID patterns
- [x] **Incremental Dataset Build** - Append new traces to an existing dataset without full rebuild
- [x] --append flag on build-dataset
- [x] Track last-build timestamp to only process new traces
- [x] Deduplication against existing items using hash_inputs
- [x] **Dataset Health Check** - Validate dataset quality before evaluation
- [x] Reference coverage: % of items with ground truth (uses _dataset_has_reference logic)
- [x] Empty/null field detection in input, output, and metadata
- [x] Duplicate input detection via hash_inputs
- [x] evalyn dataset-health command with pass/warn/fail summary
- [x] **Dataset Decontamination** - Detect items that overlap with known LLM benchmark/training data
- [x] N-gram overlap check against common benchmarks (MMLU, HumanEval, GSM8K)
- [x] Configurable contamination threshold (default: 13-gram exact match)
- [x] evalyn dataset-decontaminate --report showing contaminated items
- [x] Auto-exclude contaminated items from evaluation datasets
- [x] **Dataset Drift Detection** - Statistical tests comparing input distributions between dataset versions
- [x] Kolmogorov-Smirnov test on input length, token count distributions
- [x] Chi-square test on categorical metadata field distributions
- [x] Embedding centroid shift measurement between versions
- [x] evalyn dataset-drift --v1 <path1> --v2 <path2> with drift severity score
- [x] **Dataset Annotation Coverage Map** - Visualize which items have annotations and which need them
- [x] Per-metric coverage percentage across dataset items
- [x] ASCII heatmap: items on Y-axis, metrics on X-axis, filled/empty cells
- [x] Prioritize unannotated items with the lowest judge confidence
- [x] **Dataset from Production Logs** - Import HTTP request/response logs as trace-like dataset items
- [x] Parse common log formats (JSON, Apache, nginx) into DatasetItem input/output
- [x] evalyn import-logs --format json --input-field request --output-field response
- [x] Auto-deduplicate against existing traces in storage
- [x] **Dataset Snapshot Comparison** - Compare two dataset versions showing item-level content diffs
- [x] Side-by-side text diff for modified items (input or output changed)
- [x] Summary: items added, removed, modified, unchanged
- [x] evalyn dataset-snapshot-diff --before <v1> --after <v2>
- [x] **Dataset Complexity Scoring** - Auto-compute per-item difficulty from input features
- [x] Heuristics: input length, vocabulary diversity, question complexity indicators
- [x] Store complexity_score in item metadata for filtering and stratification
- [x] evalyn dataset-stats --complexity showing difficulty distribution
- [x] **Dataset Bias Auditing** - Detect systematic biases in input distribution
- [x] Topic distribution analysis via LLM classification
- [x] Length and vocabulary skew detection
- [x] evalyn dataset-audit producing bias report with recommendations
- [x] **Dataset Curation Suggestions** - LLM-powered gap analysis suggesting items to add
- [x] Analyze current dataset coverage against metric requirements
- [x] Suggest input types, edge cases, and scenarios not yet represented
- [x] evalyn dataset-suggest --dataset <path> producing curation plan
- [x] **Dataset A/B Split Generator** - Create matched pairs for controlled model comparison
- [x] Stratified pairing by complexity, topic, and metadata fields
- [x] Ensure balanced splits for statistical validity
- [x] evalyn dataset-ab-split --dataset <path> producing split_a.jsonl and split_b.jsonl
- [x] **Dataset Subset Extraction** - Extract semantically meaningful subsets via clustering
- [x] Cluster items by embedding similarity into N groups
- [x] evalyn dataset-subset --clusters N --dataset <path> extracting per-cluster subsets
- [x] Useful for focused evaluation on specific input categories
- [x] **Dataset Embedding Index** - Pre-compute and store embeddings for fast similarity queries
- [x] Build embedding index on build-dataset using SentenceTransformer
- [x] Store embeddings alongside dataset.jsonl as embeddings.npy
- [x] Enable fast nearest-neighbor queries for sampling, dedup, and clustering
- [x] **Dataset Interleaving** - Round-robin merge from multiple datasets for balanced evaluation
- [x] evalyn dataset-interleave --datasets d1/ d2/ d3/ producing merged dataset
- [x] Interleave by metadata field (e.g. alternate "production" and "synthetic" items)
- [x] Source tracking: tag each item with originating dataset
- [x] **Dataset Quality Gate** - Block evaluation start if dataset fails quality checks
- [x] Configurable rules in evalyn.yaml: min_items, max_duplicate_rate, required_metadata_fields
- [x] run-eval refuses to start unless gate passes (--skip-quality-gate to override)
- [x] Gate report showing which checks passed and failed
- [x] **Dataset Item Clustering Report** - Show natural clusters with LLM-generated descriptions
- [x] Auto-cluster items by embedding similarity into K groups
- [x] LLM-generated label per cluster describing what the items have in common
- [x] evalyn dataset-clusters --k 5 showing cluster summary with example items
- [x] **Dataset Changelog** - Automatic log of all build-dataset operations and parameters
- [x] Append entry to data/changelog.jsonl on each build-dataset invocation
- [x] Record: timestamp, filters used, item count, sampling mode, hash
- [x] evalyn dataset-changelog showing chronological build history
- [x] **Dataset Cross-Contamination Check** - Verify no item leakage between train/test/calibration splits
- [x] Hash-based check that no item appears in both train and test splits
- [x] Embedding-based check for near-duplicate items across splits
- [x] evalyn dataset-xcontam --train <path1> --test <path2> reporting contamination
- [x] **Dataset Item Semantic Search** - Find items by natural language query using embeddings
- [x] evalyn dataset-search --query "user asks about refund policy" finding nearest items
- [x] Uses pre-built embedding index (from Dataset Embedding Index feature)
- [x] Return top-K matches with similarity scores
- [x] **Dataset Format Autodetect** - Auto-detect and load from multiple formats without explicit --format flag
- [x] Detect JSONL, JSON array, CSV, and TSV from file content and extension
- [x] Auto-map columns to input/output/metadata fields using heuristics
- [x] Warn when auto-detection is ambiguous and suggest explicit format
- [x] **Dataset Metadata Schema Enforcement** - Validate item metadata against a defined schema
- [x] Schema definition in meta.json: required_fields, field_types, allowed_values
- [x] Validation on build-dataset and import, rejecting non-conforming items
- [x] evalyn dataset-validate --schema showing validation results
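
The Dataset Splitting item above (a 0.7/0.15/0.15 ratio, stratification by a metadata key, and a configurable seed) can be sketched as follows. This is an illustrative sketch only: the function name `stratified_split`, the `metadata` item layout, and the rounding policy are assumptions, not the evalyn implementation.

```python
import random
from collections import defaultdict


def stratified_split(items, ratios=(0.7, 0.15, 0.15), key="tag", seed=42):
    """Split items into train/test/validation, stratified by a metadata key.

    Hypothetical helper: each item is assumed to be a dict with an
    optional "metadata" dict. The same seed always yields the same split.
    """
    rng = random.Random(seed)  # local RNG, so global random state is untouched

    # Group items by the value of the stratification key.
    groups = defaultdict(list)
    for item in items:
        groups[item.get("metadata", {}).get(key)].append(item)

    train, test, val = [], [], []
    # Iterate groups in a stable order so the result does not depend on
    # the order in which groups were first seen.
    for _, group in sorted(groups.items(), key=lambda kv: str(kv[0])):
        group = list(group)
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_test = round(len(group) * ratios[1])
        train.extend(group[:n_train])
        test.extend(group[n_train:n_train + n_test])
        val.extend(group[n_train + n_test:])  # remainder goes to validation
    return train, test, val
```

Because each group is split separately, every split keeps roughly the same mix of key values as the full dataset. Very small groups can deviate from the requested ratio because of rounding.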
### Reporting & Analytics
- [x] **Custom Report Templates** - User-defined HTML report layouts
- [x] Jinja2 template engine for HTML report customization
- [x] Template variables: run data, analysis, insights, charts
- [x] Built-in templates: executive summary, technical deep-dive, compliance
- [x] evalyn export --template custom_template.html
- [x] **Slack/Discord Notifications** - Alert on evaluation completion or failures
- [x] Slack webhook integration with rich message formatting
- [x] Discord webhook with embedded metric summary
- [x] Configurable alert thresholds: only notify on regression or failure
- [x] Channel routing: different alerts to different channels
- [x] **Metric Correlation Analysis** - Understand relationships between metrics
- [x] **Failure Root Cause Analysis** - Automated diagnosis of failures
- [x] LLM-powered analysis of common patterns in failed items
- [x] Feature attribution: which input features correlate with failure
- [x] Failure clustering by root cause category (prompt, data, model, tool)
- [x] Actionable fix suggestions per failure cluster
- [x] **Trend Anomaly Detection** - Alert on unusual metric patterns
- [x] Z-score based anomaly detection on metric time series
- [x] Configurable sensitivity threshold
- [x] Automatic alert when anomaly detected during trend analysis
- [x] Visual anomaly markers in trend charts
- [x] **Cohort Analysis** - Compare metrics across user-defined item groups (by metadata, input length, etc.)
- [x] --cohort-by flag on analyze command (split by metadata field)
- [x] Per-cohort metric statistics and pass rates
- [x] Cross-cohort comparison table
- [x] Identify worst-performing cohort with improvement suggestions
- [x] **Statistical Significance Testing** - P-values and confidence intervals for run-to-run comparisons
- [x] Two-proportion z-test for pass rate differences
- [x] Bootstrap confidence intervals for score means
- [x] Effect size (Cohen's d) alongside p-values
- [x] Automatic significance flag in compare output
- [x] **Judge Confusion Matrix** - Visualize agreement/disagreement patterns between judge and human
- [x] 2x2 matrix: TP/FP/TN/FN per metric
- [x] ASCII table and HTML heatmap renderers
- [x] Per-metric confusion matrix in annotation-stats
- [x] Aggregate confusion matrix across all metrics
- [x] **Jupyter Notebook Export** - Generate .ipynb with pre-built charts and analysis from eval runs
- [x] evalyn export --format notebook
- [x] Pre-built cells: data loading, metric charts, distribution plots, correlations
- [x] Interactive widgets for filtering by metric, item, or cohort
- [x] nbformat-based generation (no Jupyter dependency required)
- [x] **Metric Budget Analysis** - Estimate cost savings from dropping low-signal metrics
- [x] Compute information gain of each metric (redundancy with others)
- [x] Cost attribution: how much each metric costs per run
- [x] Recommended metric subset that preserves N% of signal at minimum cost
- [x] **Regression Bisection** - Binary search across dataset items to pinpoint exact cause of a regression
- [x] evalyn bisect --baseline <run1> --current <run2>
- [x] Identify items that changed from pass to fail
- [x] Cluster newly-failing items by input features
- [x] Rank items by regression severity (score delta)
- [x] **Comparative Heatmap** - Visual heatmap of metric scores across items and runs
- [x] Items on Y-axis, metrics on X-axis, color = score
- [x] Multi-run heatmap: side-by-side comparison
- [x] ASCII heatmap for terminal, HTML/SVG for reports
- [x] Sort by worst-performing items or metrics
- [x] **Failure Taxonomy** - Auto-categorize failures into a structured taxonomy (prompt, model, data, tool)
- [x] LLM-powered categorization of failure reasons
- [x] Built-in taxonomy: prompt_ambiguity, model_limitation, data_quality, tool_error, hallucination
- [x] Custom taxonomy definition in evalyn.yaml
- [x] Taxonomy distribution chart in analysis output
- [x] **Analysis Snapshots** - Save analysis state at a point in time for later comparison
- [x] evalyn snapshot --name "pre-refactor" saves RunAnalysis + InsightsReport
- [x] evalyn compare-snapshots for before/after comparison
- [x] Snapshots stored in .evalyn/ directory as JSON
- [x] **Item Difficulty Estimation** - Compute per-item difficulty scores based on cross-run fail rates
- [x] Aggregate pass/fail across multiple eval runs per item
- [x] Difficulty score: inverse of average pass rate across runs
- [x] Rank items by difficulty in analysis output
- [x] Use difficulty scores to weight calibration and sampling
- [x] **Metric Interaction Effects** - Detect non-linear interactions between metrics beyond pairwise correlation
- [x] Chi-square test for co-failure: items failing both A and B more than expected by chance
- [x] Interaction strength score per metric pair
- [x] Surface metric pairs with strong interactions in insights report
- [x] **Improvement Priority Ranking** - Rank metrics by expected ROI: which improvement would raise overall pass rate most
- [x] Compute marginal gain: if metric M improved by 10%, how much does overall pass rate increase
- [x] Factor in metric weight from weighting profiles
- [x] Actionable ranking in insights output: "Fix metric X first for maximum impact"
- [x] **Score Distribution Normality Testing** - Verify if metric scores follow expected distributions
- [x] Shapiro-Wilk test per metric score distribution
- [x] Flag metrics with non-normal distributions (bimodal, heavy-tailed)
- [x] Recommend appropriate statistical tests based on distribution shape
- [x] **Cross-Run Stability Analysis** - Measure how stable metric scores are across repeated runs of same data
- [x] Run same eval N times and compute per-metric coefficient of variation
- [x] Flag metrics with high variance as unreliable
- [x] Recommend increasing samples or switching judge model for unstable metrics
- [x] **Metric Contribution Analysis** - SHAP-style attribution of each metric's contribution to overall pass/fail
- [x] Compute marginal contribution of each metric to overall item pass rate
- [x] Identify metrics that are decisive (flip overall pass/fail) vs redundant
- [x] Visualization: waterfall chart showing per-metric contribution
- [x] **Worst-Case Item Identification** - Surface items that fail across the most metrics simultaneously
- [x] Rank items by number of failed metrics (cross-metric failure count)
- [x] Highlight items that are "universally bad" vs "edge case failures"
- [x] Useful for prioritizing which agent behaviors to fix first
- [x] **Time-to-Fix Tracking** - Track how many runs it takes for failing items to start passing
- [x] Per-item pass/fail history across consecutive runs
- [x] Average time-to-fix per metric and per failure category
- [x] Identify persistently failing items that resist fixes
- [x] **Analysis Report Diff** - Diff two RunAnalysis outputs showing what changed
- [x] evalyn analysis-diff --run1 <id> --run2 <id>
- [x] Delta per metric: pass rate change, score mean change, new/resolved failures
- [x] ASCII table with color-coded improvements/regressions
- [x] **Run Quality Score** - Composite score summarizing overall run health
- [x] Weighted combination: pass rate, cost efficiency, coverage, judge confidence
- [x] Single 0-100 score for quick run quality assessment
- [x] Configurable weights in evalyn.yaml
- [x] **Trend Forecasting** - Predict future metric values using time series extrapolation
- [x] Linear regression and exponential smoothing on metric pass rates over runs
- [x] Forecast next N runs with confidence bands
- [x] Alert when forecast predicts metric dropping below threshold
- [x] **Analysis Natural Language Summary** - LLM-generated plain English analysis report
- [x] Summarize key findings, regressions, and recommendations in 3-5 paragraphs
- [x] evalyn analyze --summary producing human-readable narrative
- [x] Useful for sharing results with non-technical stakeholders
- [x] **Metric Volatility Index** - Measure historical stability of each metric across runs
- [x] Coefficient of variation across last N runs per metric
- [x] Classify metrics as stable, moderate, or volatile
- [x] Recommend increasing judge samples or switching models for volatile metrics
- [x] **Analysis Change Attribution** - Attribute metric changes to dataset, model, or prompt factors
- [x] Detect which factor changed between compared runs (dataset hash, source hash, prompt hash)
- [x] Attribute score deltas to the changed factor
- [x] Example attribution message: "Pass rate dropped 15%, likely due to dataset change (12 new items added)"
- [x] **Analysis Comparison Template** - Configurable comparison layouts for different audiences
- [x] Executive template: overall pass rate, top regressions, cost summary
- [x] Engineering template: per-metric details, failed item list, prompt diffs
- [x] --template flag on compare command
- [x] **Analysis What-If Simulator** - Interactively model "what if metric X improved by N%"
- [x] evalyn what-if --metric helpfulness --improve 20% showing projected overall pass rate
- [x] Model multiple simultaneous improvements
- [x] Identify the minimum improvement per metric needed to reach a target pass rate
- [x] **Analysis Dashboard Theming** - Configurable chart colors and styles for HTML reports
- [x] Theme definitions in evalyn.yaml: primary color, accent, chart palette
- [x] Built-in themes: corporate, academic, dark-mode, print-friendly
- [x] Custom CSS injection for branded reports
- [x] **Analysis Data Export API** - Export analysis data as structured Python objects for custom analysis
- [x] evalyn.analyze_to_dict(run) returning dict-of-lists for pandas DataFrame construction
- [x] evalyn export --format feather producing columnar format for direct notebook loading
- [x] Enable custom statistical analysis beyond built-in insights
- [x] **Analysis Time Series Decomposition** - Separate trend, seasonality, and noise in metric time series
- [x] Decompose metric pass rates across runs into systematic trend and random variation
- [x] Distinguish genuine improvement from normal score fluctuation
- [x] Visualize decomposed components in trend analysis output
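
For the Statistical Significance Testing item above, a two-proportion z-test on pass rates can be written with the standard library alone. The sketch below shows the textbook pooled-variance form. The function name and signature are hypothetical, and the actual compare output may compute significance differently.

```python
import math


def two_proportion_z_test(pass_a: int, n_a: int, pass_b: int, n_b: int):
    """Two-sided z-test for a difference in pass rates between two runs.

    Returns (z, p_value). Hypothetical helper, not part of the evalyn API.
    """
    p_a = pass_a / n_a
    p_b = pass_b / n_b
    # Pooled pass rate under the null hypothesis of equal proportions.
    pooled = (pass_a + pass_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both runs passed everything or failed everything: no evidence of change.
        return 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

The normal approximation is only reasonable when each run has enough passes and failures; for small datasets an exact test would be the safer choice.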
### Interoperability
- [x] **Phoenix/Langfuse Trace Export** - Native export to popular LLM observability platforms
- [x] evalyn export-traces --format phoenix to produce Phoenix-compatible JSONL
- [x] evalyn export-traces --format langfuse for Langfuse import format
- [x] Preserve span hierarchy and OpenInference attributes in export
- [x] **Trace Import from External Platforms** - Bring existing traces into evalyn for evaluation
- [x] evalyn import-traces --format phoenix/langfuse/otel
- [x] Map external span types to Evalyn span types via conventions.py
- [x] Deduplicate against existing traces by span ID
- [x] **OpenInference Full Compliance** - Complete implementation of OpenInference semantic conventions
- [x] Full document/retrieval attribute capture (DocumentAttributes, RetrievalAttributes)
- [x] Embedding attribute capture (EmbeddingAttributes.EMBEDDINGS, TEXT)
- [x] Session and user attribute propagation (SessionAttributes)
- [x] Reranker score capture and display in show-trace
- [x] **Eval Result Export to Observability Platforms** - Push evaluation scores back to trace viewers
- [x] Annotate Phoenix spans with evalyn metric scores
- [x] Push eval results as Langfuse scores
- [x] Bi-directional sync: traces in, scores out
### Resilience & Error Handling
- [x] **Circuit Breaker for Providers** - Stop calling a provider after N consecutive failures
- [x] Configurable failure threshold (default: 5 consecutive errors)
- [x] Cool-down period before retrying (exponential backoff)
- [x] Automatic fallback to alternative provider when circuit opens
- [x] Circuit state visible in progress output
- [x] **Graceful Item-Level Failure** - Continue evaluation when individual items fail
- [x] Catch and log per-item errors without stopping the run
- [x] Record failure reason in MetricResult.details
- [x] Summary of failed items at end of run with error categories
- [x] --fail-fast flag to override and stop on first error
- [x] **Provider Fallback Chain** - Automatically try alternative providers on failure
- [x] Ordered provider list: [gemini, openai, ollama]
- [x] Fall back to next provider on timeout, rate limit, or API error
- [x] Log which provider was actually used per item
- [x] **Evaluation Timeout Per Item** - Prevent single slow items from blocking the entire run
- [x] --item-timeout flag (default: 120s per item)
- [x] Timeout recorded as failure with reason "timeout"
- [x] Separate timeout for objective vs subjective metrics
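
The Circuit Breaker and Provider Fallback Chain items above fit together naturally. The following is a minimal sketch under stated assumptions: the class `CircuitBreaker`, the function `call_with_fallback`, and the doubling cool-down policy are illustrative choices rather than the evalyn implementation, and providers are modelled as plain callables.

```python
import time


class CircuitBreaker:
    """Opens after N consecutive failures; cool-down doubles on each re-trip."""

    def __init__(self, failure_threshold=5, base_cooldown=1.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.base_cooldown = base_cooldown
        self.clock = clock  # injectable for testing
        self.consecutive_failures = 0
        self.trips = 0
        self.opened_at = None

    def allow(self) -> bool:
        """Return True if a call may be attempted right now."""
        if self.opened_at is None:
            return True
        cooldown = self.base_cooldown * 2 ** (self.trips - 1)
        return self.clock() - self.opened_at >= cooldown

    def record_success(self):
        self.consecutive_failures = 0
        self.trips = 0
        self.opened_at = None

    def record_failure(self):
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.trips += 1
            self.opened_at = self.clock()


def call_with_fallback(providers, breakers, prompt):
    """Try providers in order, skipping any whose circuit is open.

    providers: ordered list of (name, callable); breakers: name -> CircuitBreaker.
    Returns (provider_name, result) so the caller can log which one was used.
    """
    for name, call in providers:
        breaker = breakers[name]
        if not breaker.allow():
            continue
        try:
            result = call(prompt)
        except Exception:
            breaker.record_failure()
            continue
        breaker.record_success()
        return name, result
    raise RuntimeError("all providers failed or have open circuits")
```

Returning the provider name alongside the result is what makes the "log which provider was actually used per item" requirement cheap to satisfy.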
### Output & Formatting
- [x] **Color-Coded Terminal Output** - ANSI colors for pass/fail/warning states
- [x] Green for pass, red for fail, yellow for warning across all commands
- [x] Respect NO_COLOR env var and --no-color flag for CI environments
- [x] Color-coded score ranges in analyze and compare output
- [x] **Compact Output Mode** - Minimal output for CI logs and scripting
- [x] --compact flag producing single-line summaries per command
- [x] Summary format: "RUN <id> PASS 85% (17/20) COST $0.12 TIME 45s"
- [x] Pair with exit codes for CI gate integration (exit 1 if pass rate < threshold)
- [x] **PDF Report Export** - Generate PDF reports from HTML dashboards
- [x] evalyn export --format pdf using headless browser or weasyprint
- [x] Page breaks between sections, print-friendly layout
- [x] Cover page with run metadata, date, project name
- [x] **HTML Report Dark Mode** - Dark theme option for HTML dashboards and insights
- [x] CSS dark mode support via prefers-color-scheme media query
- [x] Manual toggle button in report header
- [x] Dark-friendly Chart.js color palette
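
The Compact Output Mode summary line and the `NO_COLOR` handling described above could look roughly like this. The function names, the 0.8 default threshold, and the choice to return the exit code alongside the line are assumptions made for illustration.

```python
import os
import sys


def use_color(no_color_flag=False, stream=sys.stdout, env=os.environ) -> bool:
    """Decide whether ANSI colors should be emitted (hypothetical helper)."""
    # NO_COLOR convention: any non-empty value disables color.
    if no_color_flag or env.get("NO_COLOR"):
        return False
    # Only colorize when writing to an interactive terminal.
    return stream.isatty()


def compact_summary(run_id, passed, total, cost_usd, seconds, threshold=0.8):
    """Build the single-line summary and a CI-friendly exit code.

    Returns (line, exit_code); exit_code is 1 when the pass rate is below
    the threshold, matching the CI gate idea in the roadmap.
    """
    rate = passed / total if total else 0.0
    line = (
        f"RUN {run_id} PASS {rate:.0%} ({passed}/{total}) "
        f"COST ${cost_usd:.2f} TIME {seconds:.0f}s"
    )
    return line, 0 if rate >= threshold else 1
```

A caller would print the line and pass the second value to `sys.exit`, so a CI job can gate on the exit status without parsing any output.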
### Code Change Tracking
- [x] **Source Code Diff Correlation** - Track agent code changes alongside metric changes
- [x] Store source_hash from _extract_code_meta in each eval run
- [x] Detect when source code changed between consecutive runs
- [x] Correlate code diffs with metric deltas in compare output
- [x] evalyn code-diff --run1 <id> --run2 <id> showing code changes alongside score changes
- [x] **Prompt Version Tracking** - Track judge prompt changes across calibration rounds
- [x] Hash judge prompts and store in MetricResult metadata
- [x] Warn when comparing runs that used different prompt versions
- [x] Prompt changelog: show how each metric's prompt evolved over time
### Programmatic SDK
- [x] **Python API for Running Evaluations** - Run evaluations from Python code without CLI
- [x] evalyn.run(dataset, metrics, provider) returning EvalRun object
- [x] evalyn.analyze(run) returning RunAnalysis directly
- [x] evalyn.compare(run_a, run_b) returning comparison dict
- [x] Async variants: await evalyn.run_async(...)
- [x] **Event Callback Hooks** - Register functions that fire on evaluation events
- [x] on_item_complete(callback) for per-item processing
- [x] on_metric_complete(callback) for per-metric processing
- [x] on_run_complete(callback) for post-run triggers
- [x] Hook registration via evalyn.yaml or Python API
- [x] **Context Manager Tracing** - Manual span creation with `with` syntax
- [x] with evalyn.span("name", "type") as s: for explicit span boundaries
- [x] Automatic parent-child linking via context propagation
- [x] Span attribute setting: s.set_attribute("key", "value")
- [x] **Embedding as Library** - Use evalyn as imported library in test suites
- [x] pytest plugin: @pytest.mark.evalyn(metrics=["helpfulness"])
- [x] Assert on metric scores: assert result.metrics["helpfulness"].passed
- [x] Integration with pytest-xdist for parallel testing
- [x] **Declarative Evaluation API** - Single-call evaluation matching industry patterns
- [x] Braintrust-style: evalyn.Eval("project", data=fn, task=fn, scores=[...])
- [x] Weave-style: evalyn.Evaluation(dataset=..., scorers=[...]).run(model)
- [x] Both patterns return structured results with .to_pandas() support
- [x] **Semantic Caching for Judge Calls** - Cache identical LLM judge calls to reduce cost
- [x] Content-addressable cache keyed by hash(prompt + input + output + model)
- [x] Research finding: up to 68.8% API call reduction (GPTCache benchmark)
- [x] Optional embedding-based fuzzy matching for similar-but-not-identical inputs
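
The content-addressable cache from the Semantic Caching item above, keyed by a hash of prompt, input, output, and model, can be sketched in a few lines. `JudgeCache` and `cache_key` are hypothetical names, the in-memory dict stands in for whatever store evalyn actually uses, and the embedding-based fuzzy matching is not covered here.

```python
import hashlib
import json


def cache_key(prompt: str, input_text: str, output_text: str, model: str) -> str:
    """Derive a stable key from the four fields that determine a judge call."""
    # Serialize as a JSON list instead of concatenating strings, so that
    # ("ab", "c") and ("a", "bc") cannot collide on the same key.
    payload = json.dumps([prompt, input_text, output_text, model], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JudgeCache:
    """In-memory exact-match cache for judge calls (illustrative only)."""

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_call(self, judge, prompt, input_text, output_text, model):
        """Return the cached verdict, calling `judge` only on a cache miss."""
        key = cache_key(prompt, input_text, output_text, model)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = judge(prompt, input_text, output_text, model)
        self._store[key] = result
        return result
```

Including the model name in the key means switching judge models never serves verdicts produced by a different model.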
### Testing & Quality Enhancements
- [x] **Snapshot Testing for Metrics** - Detect unintended changes to metric scoring behavior
- [x] Record expected scores for a golden dataset
- [x] Flag when metric output changes (new code, model update)
- [x] evalyn test-metrics --update-snapshots to accept changes
- [x] **Performance Benchmark Suite** - Track and prevent performance regressions in evalyn itself
- [x] Benchmarks for: dataset loading, metric scoring, analysis, export
- [x] Baseline timings stored in repo
- [x] CI check: fail if any benchmark regresses > 20%
- [x] **Fuzz Testing for Parsers** - Stress-test JSON/judge output parsing with malformed inputs
- [x] Fuzz _extract_json_object and extract_json_list with random strings
- [x] Fuzz _parse_passed with edge case values
- [x] Ensure no unhandled exceptions on any input
- [x] **Sandboxed Agent Evaluation** - Safe execution environment for agent evals where models run code
- [x] Docker-based sandbox for executing agent tool calls safely (Inspect AI pattern)
- [x] Configurable timeout and resource limits per sandbox
- [x] Capture sandbox output as part of trace spans
- [x] **Composable Assertion Framework** - PromptFoo-style assertion primitives for evaluation
- [x] Assertion types: contains, not_contains, regex_match, llm_rubric, similar, cost_below
- [x] Composable with AND/OR logic for complex pass/fail criteria
- [x] YAML-configurable assertions in metrics definition
- [x] **Evaluation Result Schema Standard** - Define a JSON schema for evaluation results
- [x] Enable cross-platform evaluation result exchange
- [x] Schema covers: items, metrics, scores, metadata, provenance
- [x] No universal standard exists yet (industry gap evalyn could fill)
- [x] **Knowledge Graph Test Generation** - Generate evaluation questions from document knowledge graphs
- [x] Extract entities and relationships from source documents (Ragas pattern)
- [x] Generate questions that test understanding of specific relationships
- [x] Configurable question types: factual, inferential, multi-hop
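
To make the Composable Assertion Framework item above concrete, here is one way AND/OR composition could work for the purely local assertion types. The `Assertion` class and the operator overloading are an assumed design, not the evalyn API. `llm_rubric`, `similar`, and `cost_below` are left out because they need a model or run metadata.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Assertion:
    """A named predicate over an output string, composable with & and |."""

    name: str
    check: Callable[[str], bool]

    def __call__(self, output: str) -> bool:
        return self.check(output)

    def __and__(self, other: "Assertion") -> "Assertion":
        return Assertion(
            f"({self.name} AND {other.name})",
            lambda o: self(o) and other(o),
        )

    def __or__(self, other: "Assertion") -> "Assertion":
        return Assertion(
            f"({self.name} OR {other.name})",
            lambda o: self(o) or other(o),
        )


def contains(text: str) -> Assertion:
    return Assertion(f"contains({text!r})", lambda o: text in o)


def not_contains(text: str) -> Assertion:
    return Assertion(f"not_contains({text!r})", lambda o: text not in o)


def regex_match(pattern: str) -> Assertion:
    compiled = re.compile(pattern)  # compile once, reuse per output
    return Assertion(
        f"regex_match({pattern!r})",
        lambda o: compiled.search(o) is not None,
    )
```

A composed rule such as `(contains("refund") & not_contains("sorry")) | regex_match(r"^ERROR \d+")` is itself an `Assertion`, so it can be nested further, and its `name` doubles as a readable description for failure reports.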
### Packaging & Distribution
- [x] **Docker Image** - Official Docker image for CI/CD and isolated evaluation environments
- [x] Dockerfile with evalyn pre-installed and all optional dependencies
- [x] Configurable via environment variables (API keys, config path)
- [x] Docker Compose example with SQLite volume mount for data persistence
- [x] GitHub Actions example using the Docker image for eval-on-PR
- [x] **Standalone Binary** - Single-file executable without Python dependency
- [x] PyInstaller or Nuitka build for Linux, macOS, Windows
- [x] GitHub Releases automation for versioned binaries
- [x] Install script: curl -sSL https://evalyn.dev/install | sh
- [x] **evalyn version and Update Check** - Version management and update notifications
- [x] evalyn version showing installed version and latest available
- [x] Optional update check on startup (configurable, off by default)
- [x] evalyn self-update command to upgrade in place
### Documentation Generation
- [x] **CLI Reference Auto-Generation** - Generate CLI docs from argparse definitions
- [x] evalyn docs --format markdown producing per-command reference pages
- [x] Include all flags, defaults, examples, and cross-references
- [x] Auto-update on release via CI
- [x] **Metric Catalog** - Auto-generated browsable catalog of all 133 metrics
- [x] evalyn docs --metrics producing metric reference with rubrics, categories, scopes
- [x] HTML format with search and filter by category/type
- [x] Include metric bundle membership and recommended use cases
- [x] **Config Reference** - Auto-generated documentation for evalyn.yaml options
- [x] Generate from evalyn.yaml.example with type annotations and valid values
- [x] Show default values, environment variable overrides, and CLI flag mappings
### Deprecation & Migration
- [x] **Deprecation Warnings** - Warn when using deprecated config keys, flags, or APIs
- [x] Deprecation registry mapping old names to new names
- [x] Yellow warning on first use, error after N versions
- [x] evalyn migrate-config to auto-update deprecated config keys
- [x] **Breaking Change Detection** - Detect when upgrading evalyn would break existing runs
- [x] Compare metric version hashes between installed version and pinned run manifest
- [x] Warn before evaluation if metric behavior changed since last run
- [x] Migration guide output for each detected breaking change
### Rubric Engineering
- [x] **Multi-Language Rubrics** - Judge prompts and rubrics in languages other than English
- [x] Rubric translation support in JUDGE_TEMPLATES (locale field per template)
- [x] Language-matched judging: use rubric language matching the output language
- [x] Cross-language evaluation: judge non-English outputs with English rubrics vs native rubrics
- [x] **Community Rubric Library** - Import and export rubrics from a shared repository
- [x] evalyn rubric-export --metric <id> producing a portable YAML rubric file
- [x] evalyn rubric-import from URL or local file
- [x] Rubric metadata: author, version, tested-on, accuracy stats
- [x] **Rubric Testing** - Validate that a rubric produces consistent scores on test cases
- [x] evalyn test-rubric --metric <id> running rubric against a set of known pass/fail items
- [x] Consistency score: same rubric, same item, N runs, measure agreement
- [x] Edge case detection: find items where rubric is ambiguous (close to threshold)
- [x] **Domain-Specific Rubric Packs** - Downloadable rubric sets for specialized domains
- [x] Medical: HIPAA compliance, clinical accuracy, patient safety, drug interaction checks
- [x] Legal: jurisdictional accuracy, precedent citation, privilege preservation
- [x] Finance: SEC compliance, fiduciary duty, risk disclosure completeness
- [x] evalyn install-rubric-pack medical
### Dashboard Interactivity
- [x] **Embeddable Widget Mode** - Iframe-friendly dashboard for embedding in other tools
- [x] evalyn dashboard --embed producing minimal HTML without navigation chrome
- [x] Configurable widget size and chart selection
- [x] PostMessage API for parent page communication (filter events, score updates)
- [x] **In-Dashboard Data Export** - CSV/JSON export buttons on each chart in HTML reports
- [x] Download button per chart exporting underlying data as CSV
- [x] Full dataset export button in failed items section
- [x] Copy-to-clipboard for individual metric summaries
- [x] **Comparison Overlay Dashboard** - Overlay two runs on same charts for visual comparison
- [x] evalyn dashboard --compare <run1> <run2>
- [x] Dual bar charts, overlaid radar plots, side-by-side heatmaps
- [x] Toggle visibility of each run for clean comparison
### Audit & Governance
- [x] **Evaluation Audit Trail** - Immutable log of who ran what and when
- [x] Record: user, timestamp, command, args, config hash, result summary
- [x] Append-only audit log in .evalyn/audit.jsonl
- [x] evalyn audit-log showing evaluation history with filters
- [x] **Data Governance Metadata** - Track data provenance and compliance attributes
- [x] Dataset-level tags: PII-present, internal-only, customer-data, synthetic
- [x] Eval run compliance flag: was the evaluation run on approved infrastructure?
- [x] Exportable governance report for compliance audits
- [x] **Structured Logging** - JSON-formatted logs with configurable verbosity
- [x] --log-level flag (debug, info, warning, error) on all commands
- [x] JSON log format for machine parsing in production environments
- [x] Log file output: --log-file evalyn.log
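
A minimal sketch of the append-only audit record described under Evaluation Audit Trail, assuming one JSON object per line and a SHA-256 digest as the config hash. The function name and field layout are illustrative and only mirror the fields listed above.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def append_audit_record(log_path, user, command, args, config_text, result_summary):
    """Append one JSON line to the audit log; earlier lines are never rewritten."""
    record = {
        "user": user,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "command": command,
        "args": args,
        "config_hash": hashlib.sha256(config_text.encode("utf-8")).hexdigest(),
        "result_summary": result_summary,
    }
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:  # mode "a" only ever adds to the end
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record
```

Opening the file in append mode keeps writes additive, but real immutability would also need filesystem or storage-level protection, which this sketch does not attempt.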
### Security
- [x] **API Key Rotation Support** - Gracefully handle key rotation without interrupting evaluation runs
- [x] Accept multiple API keys per provider in evalyn.yaml (primary + fallback)
- [x] Automatic fallback to secondary key when primary returns 401/403
- [x] evalyn rotate-key --provider gemini to update key and verify connectivity
- [x] **Secrets Backend Integration** - Load API keys from external secret managers instead of plaintext config
- [x] Support AWS Secrets Manager, GCP Secret Manager, HashiCorp Vault
- [x] evalyn.yaml secrets_backend: "aws" with ARN references
- [x] Environment variable passthrough as default (no config change needed)
- [x] **Trace Content Redaction Policies** - Configurable rules for what gets stored in trace payloads
- [x] Policy definitions in evalyn.yaml: never store full messages, only store first/last N chars
- [x] Per-project redaction rules (strict for production, relaxed for test)
- [x] Redaction audit: report showing how much content was redacted per trace
- [x] **Prompt Injection Detection Metric** - Objective metric detecting prompt injection attempts in inputs/outputs
- [x] Tier 1: 4-category regex patterns (instruction override, role injection, prompt extraction, encoding signals)
- [x] Tier 2: optional LLM-based classification for higher accuracy
- [x] Tier 3: optional vector similarity against known attack embeddings (self-hardening via Rebuff pattern)
- [x] Scoring: 0.0 (injection detected) to 1.0 (clean), configurable sensitivity
- [x] **Embedding PII Safety Check** - Detect whether stored embeddings could leak PII via inversion attacks
- [x] Warn when embedding vectors are stored alongside PII-containing text
- [x] Research finding: 93-98% text recovery from ada-002 embeddings via inversion
- [x] Recommend PII stripping before embedding or Eguard-style defense
- [x] **EU AI Act Compliance Report** - Auto-generate evaluation documentation for regulatory compliance
- [x] Document evaluation methodology, benchmarks used, and results
- [x] Export as PDF/HTML for regulatory submission
- [x] Cover NIST AI RMF and ISO 42001 reporting requirements
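
To make Tier 1 of the Prompt Injection Detection Metric above more concrete, here is a small regex-based scorer using the four categories named in the roadmap. The patterns themselves and the way `sensitivity` scales the penalty are assumptions for illustration and are far from an exhaustive rule set.

```python
import re

# Hypothetical patterns, one list per Tier 1 category.
PATTERNS = {
    "instruction_override": [r"ignore (all |any )?(previous|prior|above) instructions"],
    "role_injection": [r"you are now\b", r"^\s*system\s*:"],
    "prompt_extraction": [r"(reveal|print|show|repeat) (your|the) (system )?prompt"],
    "encoding_signals": [r"\bbase64\b", r"(\\x[0-9a-f]{2}){4,}"],
}
COMPILED = {
    name: [re.compile(p, re.IGNORECASE | re.MULTILINE) for p in pats]
    for name, pats in PATTERNS.items()
}


def injection_score(text: str, sensitivity: float = 1.0) -> tuple[float, list[str]]:
    """Return (score, matched categories); 1.0 means clean, 0.0 means injection detected."""
    hits = [name for name, regexes in COMPILED.items()
            if any(rx.search(text) for rx in regexes)]
    # Each matched category subtracts `sensitivity` from a clean score of 1.0.
    score = max(0.0, 1.0 - sensitivity * len(hits))
    return score, hits
```

With the default sensitivity any single match drives the score to 0.0; a lower value such as 0.25 turns the result into a graded score across the four categories.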
### Offline & Air-Gapped Mode
- [x] **Fully Offline Evaluation** - Run complete evaluation pipeline without internet access
- [x] Objective-only mode: all 73 objective metrics work offline with no API calls
- [x] Ollama provider for subjective metrics using local models
- [x] Pre-download and cache model artifacts for sentence-transformers embeddings
- [x] evalyn run-eval --offline flag that errors if any metric would require internet
- [x] **Local Model Performance Baselines** - Benchmark local models against API models for judge quality
- [x] evalyn benchmark-judges --local ollama:llama3 --api gemini comparing alignment
- [x] Per-metric local vs API agreement scores
- [x] Recommend which metrics are safe to evaluate locally
### Scale & Performance
- [x] **Large Dataset Optimization** - Handle 10k+ item datasets without memory issues
- [x] Streaming evaluation: process items without loading full dataset into memory
- [x] Chunked metric result storage: write results in batches to avoid OOM
- [x] Progress checkpointing every N items (previously only on interrupt)
- [x] Memory usage monitoring and warning when approaching system limits
- [x] **SQLite Full-Text Search** - FTS5 index for searching trace content and outputs
- [x] FTS index on function_call inputs and outputs
- [x] evalyn search "user asked about refund policy" finding matching traces
- [x] Search integration with build-dataset for content-based dataset curation
- [x] **Aggregation Queries** - Efficient database queries for cost and usage analytics
- [x] Cost by project, by date range, by model
- [x] Trace count and token usage per provider
- [x] evalyn stats --project <name> --since 2026-03-01 for project-level analytics
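
The streaming evaluation and chunked result storage items under Large Dataset Optimization can be sketched as follows, assuming a JSONL dataset and a metric that is a plain Python callable. The function names, the `id` field, and the chunk size are hypothetical.

```python
import json
from typing import Callable, Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one item at a time so the full dataset never sits in memory."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)


def evaluate_streaming(dataset_path: str, results_path: str,
                       metric: Callable[[dict], float], chunk_size: int = 500) -> int:
    """Score items lazily and write results to disk every chunk_size items."""
    buffer, written = [], 0
    with open(results_path, "a", encoding="utf-8") as out:
        for item in iter_jsonl(dataset_path):
            buffer.append({"id": item.get("id"), "score": metric(item)})
            if len(buffer) >= chunk_size:
                out.writelines(json.dumps(r) + "\n" for r in buffer)
                out.flush()  # a flushed chunk doubles as a progress checkpoint
                written += len(buffer)
                buffer.clear()
        # Write whatever is left in the final, partially filled chunk.
        out.writelines(json.dumps(r) + "\n" for r in buffer)
        written += len(buffer)
    return written
```

Peak memory is bounded by `chunk_size` results rather than by dataset size, which is the property the 10k+ item goal depends on.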
---
## Completed Features
### Setup & Configuration
- [x] **evalyn init** - Initialize evalyn.yaml config file
- [x] **evalyn one-click** - Run complete pipeline in one command
- [x] **evalyn help** - Show available commands with examples
- [x] **Environment Variables** - GEMINI_API_KEY, OPENAI_API_KEY, EVALYN_NO_HINTS, EVALYN_AUTO_INSTRUMENT
### Tracing & Instrumentation
- [x] **@eval decorator** - Automatic function call tracing
- [x] **Auto-instrumentation** - Automatic LLM SDK patching (OpenAI, Anthropic, Gemini, LangChain, LangGraph)
- [x] **Span tree capture** - Hierarchical trace of LLM calls, tool calls, graph nodes
- [x] **Token & cost tracking** - Automatic token counting and cost estimation
- [x] **evalyn list-calls** - List captured traces with filtering and sorting
- [x] **evalyn show-call** - View detailed call information
- [x] **evalyn show-trace** - Phoenix-style span tree visualization
- [x] **evalyn show-projects** - Project summary with trace counts
- [x] **Streaming response capture** - StreamingSpanWrapper for OpenAI, Anthropic, Gemini
- [x] **GenAI semantic convention attributes** - OpenTelemetry gen_ai.* attributes on spans
- [x] **Span-metric attribution** - Link metric results to specific spans with relevance scoring
- [x] **Context window utilization tracking** - Track context usage in spans
- [x] **--db flag** - Switch between prod/test databases
- [x] **Short ID support** - 8-character ID prefixes for convenience
### Dataset Management
- [x] **evalyn build-dataset** - Build dataset.jsonl from traces
- [x] **evalyn validate** - Validate dataset format
- [x] **evalyn status** - Show comprehensive dataset status
- [x] **--latest flag** - Auto-resolve most recent dataset
- [x] **Production/simulation filtering** - Separate real vs synthetic traces
- [x] **Date range filtering** - --since and --until options
### Metrics System
- [x] **73 Objective Metrics** - Deterministic code-based evaluation
- [x] Efficiency: latency_ms, cost, token_length, compression_ratio
- [x] Structure: json_valid, json_schema_keys, regex_match, xml_valid, syntax_valid
- [x] Correctness: bleu, rouge_l, rouge_1, rouge_2, exact_match, levenshtein_similarity
- [x] Robustness: tool_call_count, llm_call_count, tool_success_ratio, retry_count
- [x] Grounding: url_count, citation_count, source_diversity
- [x] Style: word_count, sentence_count, avg_sentence_length, vocabulary_diversity
- [x] Diversity: unique_ngrams, type_token_ratio
- [x] **60 Subjective Metrics** - LLM judge evaluation
- [x] Safety: toxicity_safety, pii_safety, manipulation_resistance, bias_detection
- [x] Correctness: helpfulness_accuracy, factual_accuracy, technical_accuracy
- [x] Style: tone_alignment, formality_match, brand_voice_consistency
- [x] Instruction: instruction_following, constraint_adherence, format_compliance
- [x] Grounding: hallucination_risk, source_attribution, claim_verification
- [x] Agent: reasoning_quality, tool_use_appropriateness, planning_quality
- [x] Domain: medical_accuracy, legal_compliance, financial_prudence
- [x] Conversation: context_retention, memory_consistency, empathy, patience
- [x] **evalyn list-metrics** - List all available metrics
- [x] **evalyn suggest-metrics** - Suggest metrics for a function
- [x] basic mode - Fast heuristic-based
- [x] bundle mode - Pre-configured metric sets
- [x] llm-registry mode - LLM picks from registry
- [x] llm-brainstorm mode - LLM generates custom metrics
- [x] auto mode - Uses function hints or defaults
- [x] **evalyn select-metrics** - Interactive LLM-guided selection
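
For a sense of what a deterministic, code-based objective metric involves, here is one plausible reading of `levenshtein_similarity` and `type_token_ratio` from the list above. The normalization choices (dividing by the longer string, lowercasing and whitespace tokenization) are assumptions and not necessarily what Evalyn ships.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def type_token_ratio(text: str) -> float:
    """Unique tokens divided by total tokens (0.0 for empty text)."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0
```

Both functions are pure and need no API calls.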
### Metric Bundles (17 Curated Sets)
- [x] **Conversational AI**
- [x] chatbot - Safety, helpfulness, multi-turn memory
- [x] customer-support - Empathy, patience, escalation handling
- [x] **Content Generation**
- [x] content-writer - Style, engagement, readability
- [x] summarization - Compression, reference overlap, grounding
- [x] creative-writer - Originality, engagement, vocabulary diversity
- [x] **Knowledge & Research**
- [x] rag-qa - Grounding, citations, factual accuracy
- [x] research-agent - Citations, grounding, tool use
- [x] tutor - Pedagogical clarity, examples, patience
- [x] **Code & Technical**
- [x] code-assistant - Syntax validity, complexity, technical accuracy
- [x] data-extraction - JSON validity, schema compliance
- [x] **Agents & Orchestration**
- [x] orchestrator - Tool success, planning, error handling
- [x] multi-step-agent - Planning, context retention, memory
- [x] **High-Stakes Domains**
- [x] medical-advisor - Medical accuracy, safety, ethics
- [x] legal-assistant - Legal compliance, citations, accuracy
- [x] financial-advisor - Financial prudence, safety, ethics
- [x] **Safety & Translation**
- [x] moderator - Toxicity, bias, PII, manipulation
- [x] translator - BLEU, Levenshtein, cultural sensitivity
### Evaluation Engine
- [x] **evalyn run-eval** - Run evaluation on dataset
- [x] **Parallel execution** - Multi-threaded metric evaluation (--workers)
- [x] **Batch API mode** - 50% cost savings for large-scale evaluation (--batch)
- [x] Gemini batch provider
- [x] OpenAI batch provider
- [x] Anthropic batch provider
- [x] **Confidence estimation** - Confidence scores for LLM judgments (--confidence)
- [x] Logprobs-based confidence (OpenAI/Ollama)
- [x] DeepConf confidence (Meta AI's bottom-10% strategy)
- [x] Self-consistency confidence (multi-sample agreement)
- [x] Perplexity and entropy methods
- [x] **Multi-provider support** - Choose judge provider (--provider)
- [x] Gemini (default)
- [x] OpenAI
- [x] Ollama (local)
- [x] **Token usage tracking** - Track LLM API token consumption per eval run
- [x] Per-metric input/output token counts
- [x] Aggregated usage summary in EvalRun
- [x] Display in run-eval output and show-run command
- [x] **Checkpoint & resume** - Save progress on interrupt, resume later
- [x] **HTML reports** - Interactive visualization with Chart.js
- [x] **evalyn list-runs** - List past evaluation runs
- [x] **evalyn show-run** - View run details
- [x] **--use-calibrated** - Apply calibrated prompts
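
The self-consistency confidence method listed above (multi-sample agreement) reduces to a majority vote over repeated judge samples. A minimal sketch, assuming each sample is a hashable verdict such as a string label; the function name is hypothetical.

```python
from collections import Counter


def self_consistency(verdicts: list[str]) -> tuple[str, float]:
    """Return (majority verdict, fraction of samples that agree with it)."""
    if not verdicts:
        raise ValueError("need at least one sampled verdict")
    label, count = Counter(verdicts).most_common(1)[0]
    return label, count / len(verdicts)
```

For example, four samples of `pass, pass, fail, pass` yield the verdict `pass` with confidence 0.75.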
### Analysis & Insights
- [x] **evalyn analyze** - Analyze evaluation results
- [x] **evalyn compare** - Compare two runs side-by-side
- [x] **evalyn trend** - View metric trends over time
- [x] **evalyn cluster-failures** - Cluster failed items by failure reason
- [x] **evalyn cluster-misalignments** - Cluster judge vs human disagreements
- [x] **Pass rate charts** - ASCII bar charts in terminal
- [x] **Score distributions** - Mini histograms
- [x] **Failed item breakdown** - List items with failure reasons
- [x] **evalyn insights** - Comprehensive diagnostic, prescriptive, and proactive analysis
- [x] Metric correlations, regressions, distributions, feature analysis
- [x] Prioritized recommendations
- [x] LLM expert panel (--deep) with 4 expert roles + moderator synthesis
- [x] Interactive HTML dashboard (--format html) with Chart.js charts
### Annotation Enhancements
- [x] **Inter-Annotator Agreement** - Track and visualize consistency between multiple annotators
- [x] Cohen's Kappa and Krippendorff's Alpha per metric
- [x] Pairwise agreement matrix across annotators
- [x] Identify items with highest disagreement for re-annotation
- [x] Agreement trend over time as annotators calibrate
- [x] **Annotation Delegation** - Assign specific items to specific annotators by expertise
- [x] Annotator profiles with domain expertise tags
- [x] Auto-assignment based on item metadata and annotator expertise match
- [x] Workload balancing across annotators
- [x] Progress dashboard per annotator
- [x] **Bulk Pre-Annotation via LLM** - Use LLM to pre-fill annotations for human review and correction
- [x] evalyn pre-annotate --provider gemini to generate draft annotations
- [x] Confidence-based triage: auto-accept high-confidence, human-review low-confidence
- [x] Track pre-annotation accuracy vs human corrections
- [x] Use corrections to improve pre-annotation prompts
- [x] **Annotation Guidelines Generator** - Auto-generate annotation guidelines from metric definitions
- [x] Convert metric rubrics to annotator-friendly instructions
- [x] Include concrete pass/fail examples from existing annotations
- [x] Export as markdown document or HTML with examples
- [x] **Annotation Conflict Resolution UI** - Side-by-side view when annotators disagree, with tiebreaker workflow
- [x] Display both annotators' labels with their confidence and reasoning
- [x] Third-party tiebreaker annotation with full context
- [x] Resolution policies: majority vote, senior override, discussion required
- [x] **Annotation UX Improvements** - Faster, more forgiving annotation workflow
- [x] Undo/edit previous annotation without re-annotating from scratch
- [x] Skip items with "s" key (mark as skipped, return to them later)
- [x] Keyboard shortcuts: y=pass, n=fail, 1-5=confidence, s=skip, u=undo
- [x] Batch mode: present N items at once for rapid annotation
- [x] **Annotation Session Persistence** - Save and resume annotation progress
- [x] Track annotated item IDs in session file per annotator
- [x] evalyn annotate --resume to continue where last session ended
- [x] Session statistics: items/hour, agreement rate over time
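
Cohen's Kappa, listed under Inter-Annotator Agreement, corrects raw agreement between two annotators for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below is a plain two-annotator implementation with a hypothetical function name; the multi-annotator Krippendorff's Alpha is not covered here.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # p_o: fraction of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected if each annotator labeled at random with their own label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```

With labels `pass, pass, fail, fail` against `pass, fail, fail, fail`, observed agreement is 0.75, expected agreement is 0.5, and kappa is 0.5.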
### Human Annotation
- [x] **evalyn annotate** - Interactive annotation interface
- [x] Simple mode - Overall pass/fail
- [x] Per-metric mode - Agree/disagree with each metric
- [x] Span mode - Annotate individual LLM/tool calls
- [x] **evalyn annotation-stats** - Show annotation coverage
- [x] **evalyn import-annotations** - Import from JSONL
- [x] **evalyn export-for-annotation** - Export for external tools
- [x] **Confidence scores** - 1-5 scale for annotation certainty
- [x] **Immediate save** - Each annotation saved instantly
### Calibration (LLM Judge Optimization)
- [x] **evalyn calibrate** - Optimize judge prompts
- [x] Basic method - Single-shot LLM analysis of disagreements
- [x] APE method - Search-based optimization with UCB selection
- [x] OPRO method - Trajectory-based optimization
- [x] GEPA method - Evolutionary prompt optimization (external library)
- [x] GEPA-Native method - Evolutionary optimization with token tracking
- [x] EvoPrompt method - Population-based mutation/crossover
- [x] TextGrad method - Iterative critique-revise refinement
- [x] MIPROv2 method - Joint instruction + few-shot demo optimization
- [x] PromptBreeder method - Self-referential prompt evolution
- [x] BaseOptimizer base class + factory dispatch
- [x] **evalyn list-calibrations** - List calibration records
- [x] **Alignment metrics** - Accuracy, precision, recall, F1, Cohen's Kappa
- [x] **Validation split** - Test calibration on held-out samples
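
The alignment metrics above compare judge verdicts against human labels. A minimal sketch, assuming boolean labels where `True` means pass and treating pass as the positive class; the function name and the choice of positive class are assumptions.

```python
def alignment_metrics(judge: list[bool], human: list[bool]) -> dict[str, float]:
    """Accuracy, precision, recall, and F1 of judge verdicts against human labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two equal-length, non-empty label lists")
    pairs = list(zip(judge, human))
    tp = sum(j and h for j, h in pairs)      # judge pass, human pass
    fp = sum(j and not h for j, h in pairs)  # judge pass, human fail
    fn = sum(h and not j for j, h in pairs)  # judge fail, human pass
    accuracy = sum(j == h for j, h in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Computing these on the held-out validation split rather than the items used for calibration gives a less optimistic view of how well a calibrated prompt generalizes.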
### Simulation (Synthetic Data)
- [x] **evalyn simulate** - Generate synthetic test data
- [x] similar mode - Variations of existing queries
- [x] outlier mode - Edge cases and unusual inputs
- [x] **Temperature control** - Separate temperatures for similar/outlier modes
- [x] **Seed sampling** - Control number of seed examples
- [x] **Persona-Based Simulation** - Generate inputs as specific user personas (novice, expert, adversarial)
- [x] Built-in personas: novice user, power user, adversarial attacker, non-native speaker
- [x] Custom persona definitions in evalyn.yaml
- [x] Persona tag in generated item metadata for cohort analysis
- [x] **Multi-Turn Simulation** - Generate full multi-turn conversations, not just single queries
- [x] Configurable conversation length (2-10 turns)
- [x] Follow-up generation based on agent response
- [x] Conversation flow patterns: clarification, topic shift, error recovery
- [x] **Adversarial Simulation** - Deliberately craft inputs targeting known failure modes
- [x] Prompt injection attempts
- [x] Boundary inputs: empty, max length, special characters, unicode edge cases
- [x] Contradiction inputs that conflict with system prompt
- [x] Jailbreak pattern variations
- [x] **Domain Transfer Simulation** - Adapt seed inputs from one domain to another (e.g. medical to legal)
- [x] LLM-powered domain rewriting preserving query structure
- [x] Domain vocabulary substitution
- [x] Complexity preservation across domain transfer
- [x] **Regression Simulation** - Re-generate past failure inputs to verify they no longer fail
- [x] Extract failure patterns from cluster-failures output
- [x] Generate new inputs matching each failure pattern
- [x] Track fix rate: % of previously-failing patterns now passing
- [x] **Conditional Simulation** - Generate inputs that specifically test edge conditions (empty input, max length, unicode)
- [x] Edge condition library: empty, null, max_length, unicode, mixed_language
- [x] Combinatorial generation across edge conditions
- [x] Configurable via --conditions flag
- [x] **Simulation Validation** - Auto-verify that generated items match expected statistical distributions
- [x] Input length distribution comparison (generated vs seed)
- [x] Vocabulary overlap check between generated and seed
- [x] Deduplication against both seed and existing dataset
- [x] **Parallel Simulation** - Generate synthetic data with configurable concurrency for large-scale runs
- [x] --workers flag on simulate command
- [x] Batch LLM calls for generation efficiency
- [x] Progress bar with items generated / total target
- [x] **Structured Input Simulation** - Generate dict/JSON inputs, not just text prompts
- [x] Infer input schema from seed dataset items (detect keys, types, value ranges)
- [x] Generate valid structured inputs conforming to detected schema
- [x] Configurable field-level variation (mutate one field at a time for targeted testing)
- [x] **Seed Selection Optimization** - Choose which seed items produce the most diverse simulations
- [x] Score seeds by diversity of generated outputs
- [x] Greedy selection: pick seeds that maximize coverage of unexplored input space
- [x] Drop seeds that produce near-duplicate simulations
- [x] **Simulation with Reference Answers** - Generate both inputs and expected outputs for automatic golden set creation
- [x] LLM generates input-output pairs where the output serves as ground truth
- [x] Configurable quality threshold: only keep pairs where LLM confidence is high
- [x] Useful for bootstrapping evaluation datasets with expected references
- [x] **Simulation Coverage Report** - Compare embedding space coverage of simulated vs production traces
- [x] Compute coverage overlap between simulated and real item embeddings
- [x] Identify production input regions not represented in simulated data
- [x] Recommend additional simulation targets to fill coverage gaps
- [x] **Simulation Budget Optimizer** - Given a token budget, optimize the mix of similar/outlier/adversarial items
- [x] Estimate token cost per simulation mode based on prompt complexity
- [x] Maximize diversity under budget constraint via greedy allocation
- [x] Report actual vs budgeted cost after generation
- [x] **Constraint-Guided Simulation** - Generate inputs satisfying specific constraints
- [x] --constraint "topic=refunds AND length>200" flag on simulate command
- [x] LLM-guided generation with constraint verification loop
- [x] Reject and regenerate items that fail constraint checks
- [x] **Simulation Diversity Metrics** - Quantify how diverse the generated set is vs seed set
- [x] Embedding spread: average pairwise distance in generated set
- [x] Vocabulary uniqueness ratio vs seed set
- [x] Novelty score: fraction of generated items far from all seed items
- [x] **Simulation Evaluation Loop** - Generate, evaluate, and iterate on simulated data in one command
- [x] evalyn simulate-and-eval --rounds 3 running simulate + run-eval in a loop
- [x] Each round generates items targeting previous round's failure patterns
- [x] Convergence tracking: stop when pass rate stabilizes
- [x] **Simulation with Tool Schemas** - Generate inputs that exercise specific tool call patterns
- [x] Provide tool definitions in evalyn.yaml; simulator generates queries requiring those tools
- [x] Coverage tracking: % of tools exercised by generated inputs
- [x] Useful for testing tool selection and parameter correctness
- [x] **Simulation Seed Clustering** - Cluster seeds before simulation to ensure diverse coverage
- [x] Auto-cluster seed items into groups by embedding similarity
- [x] Sample proportionally from each cluster for simulation seeds
- [x] Prevent simulation from over-representing one cluster of similar inputs
- [x] **Simulation Template Library** - Pre-built simulation configs for common use cases
- [x] Templates: customer-support, rag-qa, code-review, multi-step-agent
- [x] Each template defines persona mix, edge case types, output format constraints
- [x] evalyn simulate --template customer-support
- [x] **Simulation Difficulty Grading** - Auto-tag generated items with estimated difficulty level
- [x] Difficulty heuristics: input complexity, number of constraints, ambiguity level
- [x] Tag in metadata as difficulty: easy/medium/hard
- [x] Ensure generated set has balanced difficulty distribution
- [x] **Simulation Quality Score** - Evaluate generated items for naturalness compared to seed set
- [x] LLM-based naturalness rating: does this look like a real user query?
- [x] Statistical comparison: generated vs seed item length/vocabulary distributions
- [x] Auto-reject generated items scoring below quality threshold
- [x] **Simulation Provider Diversity** - Use multiple LLM providers to increase variety in generated items
- [x] Round-robin across configured providers (Gemini, OpenAI, Ollama)
- [x] Merge results with provider tag in metadata
- [x] Compare generation quality per provider
- [x] **Simulation Cost Estimation** - Estimate token cost before running simulation
- [x] --dry-run flag on simulate showing estimated tokens and cost
- [x] Cost breakdown: similar mode vs outlier mode estimates
- [x] Useful for budgeting large-scale simulation runs
- [x] **Simulation Reproducibility Seed** - Deterministic seed for exact reproduction of generated items
- [x] --seed flag on simulate command for reproducible LLM outputs (temperature + seed)
- [x] Record seed in simulation metadata for audit trail
- [x] Verify reproducibility: re-run with same seed produces identical items
- [x] **Simulation Feedback Injection** - Inject specific failure patterns into simulation prompts
- [x] Accept failure cluster labels from cluster-failures as simulation targets
- [x] Generate items specifically designed to trigger each failure mode
- [x] Coverage tracking: % of known failure patterns with generated test cases
- [x] **Evol-Instruct Data Evolution** - Evolve evaluation items through iterative complexity increases
- [x] In-depth evolution: add constraints, reasoning steps, edge cases to existing items
- [x] In-breadth evolution: generate topic variations and domain transfers
- [x] Quality scoring: rate evolved items on clarity, depth, structure, relevance
- [x] Auto-filtering: reject evolved items that degrade below quality threshold
- [x] **Persona Hub Integration** - Generate diverse user personas for simulation
- [x] Large-scale persona generation from behavior descriptions
- [x] Persona-to-Persona expansion for combinatorial diversity
- [x] Structured diversity controls: ensure coverage across demographics, expertise, intent
- [x] **Cascade Model Routing for Evaluation** - Use cheap models for easy items, expensive for hard
- [x] Difficulty estimation from input complexity heuristics
- [x] Route easy items to flash-lite, hard items to flash/pro
- [x] 87% cost reduction benchmark (ETH Zurich finding)
- [x] Quality estimator to determine when to escalate
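
One way to read the "combinatorial generation across edge conditions" bullet under Conditional Simulation is a per-field cross product: every field of a structured seed item either keeps its seed value or takes one edge value. The sketch below follows that interpretation; the concrete edge values, the 4096-character limit, and the function name are assumptions.

```python
from itertools import product

# Hypothetical edge values for the condition names listed in the roadmap.
EDGE_VALUES = {
    "empty": "",
    "null": None,
    "max_length": "x" * 4096,  # assumed length limit
    "unicode": "caf\u00e9 \u6771\u4eac",
    "mixed_language": "Where is my pedido, por favor?",
}


def edge_case_inputs(base: dict, conditions: list[str]):
    """Yield (tags, item) for every per-field mix of seed value and edge value."""
    fields = list(base)
    options = ["seed", *conditions]
    for combo in product(options, repeat=len(fields)):
        if all(choice == "seed" for choice in combo):
            continue  # skip the unmodified seed item
        item = {field: base[field] if choice == "seed" else EDGE_VALUES[choice]
                for field, choice in zip(fields, combo)}
        yield dict(zip(fields, combo)), item
```

The number of generated items grows as (conditions + 1) raised to the number of fields, minus one, so a two-field seed with two conditions already produces eight variants.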
### Sampling
- [x] **Importance Sampling** - Weight sample selection by item difficulty or model uncertainty
- [x] Weight by inverse pass rate from previous eval run
- [x] Weight by judge confidence (low confidence = high importance)
- [x] Configurable weight function via Python callable
- [x] **Curriculum Sampling** - Order samples from easy to hard for progressive evaluation
- [x] Difficulty estimation from input length, complexity heuristics, or past scores
- [x] Progressive disclosure: evaluate easy items first, add harder ones
- [x] Early stopping if easy items already fail
- [x] **Time-Weighted Sampling** - Prefer recent traces over older ones during dataset construction
- [x] Exponential decay weighting by trace timestamp
- [x] Configurable half-life parameter (e.g. 7 days, 30 days)
- [x] Minimum representation guarantee for older traces
- [x] **Coverage-Aware Sampling** - Maximize coverage of the input feature space
- [x] Embedding-based coverage using existing SentenceTransformer infrastructure
- [x] Greedy maximal-diversity selection
- [x] Coverage report: % of embedding space represented
- [x] **Balanced Sampling** - Ensure equal representation across metadata categories or labels
- [x] Balance by any metadata field (tag, source, difficulty)
- [x] Undersample majority or oversample minority categories
- [x] Report sampling ratio adjustments applied
- [x] **Adversarial Sampling** - Select items most likely to trigger model failures based on past results
- [x] Prioritize items that failed in previous runs
- [x] Select items near decision boundaries (scores close to threshold)
- [x] Include items from underperforming cohorts
- [x] **Score-Stratified Sampling** - Ensure representation across the full metric score range
- [x] Bin items by score range (0-0.2, 0.2-0.4, ..., 0.8-1.0)
- [x] Equal sampling from each bin
- [x] Useful for calibration datasets needing score diversity
- [x] **Embedding Drift Sampling** - Prioritize items whose embeddings shifted most between dataset versions
- [x] Compute per-item embedding delta between old and new dataset
- [x] Sample items with largest cosine distance change
- [x] Useful for targeting evaluation on items most affected by data updates
- [x] **Cost-Aware Sampling** - Prefer shorter/cheaper items when evaluation budget is constrained
- [x] Estimate per-item evaluation cost from input/output token counts
- [x] Greedy selection maximizing item count within token/cost budget
- [x] --max-eval-cost flag on build-dataset to cap total evaluation expense
- [x] **Human Disagreement Sampling** - Prioritize items where annotators previously disagreed
- [x] Query annotation store for items with divergent human labels
- [x] Weight by disagreement severity (binary flip vs minor score difference)
- [x] Useful for building targeted calibration datasets
- [x] **Cluster Boundary Sampling** - Sample items near cluster decision boundaries for maximum information gain
- [x] Identify items closest to cluster centroids vs farthest from all centroids
- [x] Preferentially sample boundary items that are hardest to classify
- [x] Combine with existing clustered sampling mode
- [x] **Bootstrap Resampling** - Generate bootstrap samples for confidence interval estimation on metrics
- [x] --bootstrap N flag on run-eval to create N resampled evaluation runs
- [x] Report 95% confidence intervals for each metric from bootstrap distribution
- [x] Useful for small datasets where point estimates are unreliable
- [x] **Similarity-Based Sampling** - Sample items most or least similar to a given reference item
- [x] --similar-to <item-id> flag selecting nearest neighbors by embedding distance
- [x] --dissimilar-to <item-id> for maximum diversity from a reference
- [x] Useful for focused investigation around a specific failure or success case
- [x] **Error-Pattern Sampling** - Preferentially sample items matching known failure patterns
- [x] Extract failure patterns from cluster-failures output
- [x] Match new items against known patterns via embedding similarity
- [x] Ensures calibration and evaluation sets include known-hard cases
- [x] **Progressive Sampling** - Start with small sample, expand if metrics are statistically inconclusive
- [x] Initial sample of N items, evaluate, check confidence intervals
- [x] Expand sample size if CI width exceeds threshold
- [x] Stop when statistical power is sufficient or budget exhausted
- [x] **Metadata-Conditional Sampling** - Variable sample rates by metadata field values
- [x] Config: sample 100% of "production" items, 20% of "test" items
- [x] Per-field rate definitions in evalyn.yaml or --sample-by flag
- [x] Report actual sampling ratios applied per metadata value
- [x] **Novelty Sampling** - Prioritize items most unlike the existing labeled/annotated set
- [x] Compute embedding distance from each unlabeled item to nearest labeled item
- [x] Sample items with maximum novelty for annotation or calibration
- [x] Expand labeled set coverage efficiently
- [x] **Sampling Reproducibility Report** - Log exactly which items were selected and why
- [x] Record sampling mode, seed, parameters, and selected item IDs in meta.json
- [x] Verify reproducibility: re-run with same params produces identical selection
- [x] Audit trail for dataset construction decisions
- [x] **Multi-Stage Sampling Pipeline** - Chain arbitrary sampling strategies in sequence
- [x] Config: sampling_pipeline: [deduplicate, stratified, diverse] in evalyn.yaml
- [x] Each stage feeds its output as input to the next
- [x] Per-stage statistics showing how many items survived each filter
- [x] **Sampling Impact Analysis** - Estimate how sample size affects metric confidence intervals
- [x] Given historical run data, compute expected CI width for different sample sizes
- [x] evalyn sample-impact --dataset <path> --sizes 50,100,200 showing precision vs cost
- [x] Recommend minimum sample size for target precision level
- [x] **Locale-Aware Sampling** - Sample proportionally by language or region for i18n testing
- [x] Detect language/locale from input text or metadata field
- [x] Ensure minimum representation per locale in sample
- [x] --sample-by locale flag on build-dataset
- [x] **Embedding Model Selection** - Configurable embedding model for diversity and clustered sampling
- [x] embedding_model setting in evalyn.yaml (default: all-MiniLM-L6-v2)
- [x] Support custom models from HuggingFace or local paths
- [x] Cache embeddings keyed by model name to avoid recomputation
- [x] **Reservoir Sampling** - Online sampling for streaming dataset construction
- [x] Build dataset from continuous trace stream without knowing total count upfront
- [x] Maintain fixed-size sample with uniform probability guarantees
- [x] Useful for production monitoring: always keep a representative sample of recent traces
- [x] **Coreset Sampling** - Find minimal representative subset preserving distribution properties
- [x] Greedy coreset construction minimizing maximum approximation error
- [x] Guarantee that statistics computed on coreset approximate full dataset within bounds
- [x] --coreset N flag on build-dataset for maximum compression with minimal information loss
- [x] **IRT-Based Tiny Benchmarks** - Use Item Response Theory to find minimal representative subset
- [x] Psychometrics-inspired: a 100-item subset can stand in for a 14K-item benchmark (140x reduction) with about 2% estimation error
- [x] Estimate item difficulty and discrimination from historical eval data
- [x] Select items maximizing information at target ability level
- [x] evalyn dataset-optimize --method irt --target-size 100
- [x] **BenchBuilder Auto-Curation** - Automatically curate evaluation prompts from production traces
- [x] Cluster production traces by topic (Arena-Hard pattern)
- [x] Score each trace for quality and difficulty
- [x] Select diverse, high-quality traces as evaluation dataset
- [x] Reference point: 98.6% correlation with human preference rankings at $20 cost, as reported for the Arena-Hard benchmark
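
The Reservoir Sampling item above describes keeping a fixed-size, uniform sample from a stream whose total length is not known upfront. A minimal sketch of one standard way to do this (Algorithm R) is shown below. The function name `reservoir_sample` and the `trace-N` IDs are hypothetical and are not taken from the Evalyn codebase.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream of unknown length."""
    rng = random.Random(seed)  # seeded so the same stream and seed give the same sample
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Hypothetical stream of trace IDs; only k items are held in memory at any time.
sample = reservoir_sample((f"trace-{n}" for n in range(10_000)), k=5, seed=42)
```

Passing an explicit seed keeps the selection repeatable, which fits the reproducibility goals listed earlier in this section.
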
### Export & Reporting
- [x] **evalyn export** - Export results in multiple formats
- [x] JSON - Full structured data
- [x] CSV - Spreadsheet-compatible
- [x] Markdown - Human-readable report
- [x] HTML - Standalone interactive report
- [x] **evalyn export-for-annotation** - Export for external annotation tools
### Additional Export Formats
- [x] **Parquet Export** - Columnar format for big data tooling and ML pipelines
- [x] evalyn export --format parquet using pyarrow (optional dependency)
- [x] Schema: one row per (item, metric) pair with score, passed, details columns
- [x] Efficient for loading into pandas, DuckDB, or Spark
- [x] **OpenAI Evals Format Export** - Compatibility with OpenAI's evaluation framework
- [x] evalyn export --format openai-evals producing JSONL in OpenAI evals schema
- [x] Map evalyn MetricResult to OpenAI eval sample format
- [x] Include system prompt and messages for replay in OpenAI's eval harness
- [x] **Experiment Tracker Integration** - Push eval results to W&B, MLflow, or Neptune
- [x] evalyn export --format wandb logging metrics as W&B runs
- [x] evalyn export --format mlflow logging as MLflow experiments
- [x] Configurable tracker URL and credentials in evalyn.yaml
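
The Parquet schema above is described as one row per (item, metric) pair. As a rough illustration of that long-format layout, the sketch below flattens a nested result mapping into such rows using only the standard library. The input shape, the `item_id` and `metric_id` column names, and the function name `to_long_rows` are assumptions made for this example; only the `score`, `passed`, and `details` columns come from the roadmap text. Writing the rows to Parquet with pyarrow is left out.

```python
from typing import Any, Dict, List


def to_long_rows(results: Dict[str, Dict[str, Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Flatten {item_id: {metric_id: result}} into one row per (item, metric) pair."""
    rows: List[Dict[str, Any]] = []
    for item_id, metrics in results.items():
        for metric_id, result in metrics.items():
            rows.append(
                {
                    "item_id": item_id,      # assumed column name
                    "metric_id": metric_id,  # assumed column name
                    "score": result["score"],
                    "passed": result["passed"],
                    "details": result.get("details", ""),  # empty when no details given
                }
            )
    return rows


# Hypothetical results for two items.
results = {
    "item-1": {
        "latency": {"score": 0.9, "passed": True},
        "json_valid": {"score": 0.0, "passed": False, "details": "parse error"},
    },
    "item-2": {"latency": {"score": 0.4, "passed": False}},
}
rows = to_long_rows(results)
```

A list of flat dictionaries like this loads directly into tabular tools, which is the main reason to prefer the long format over nested JSON for analysis.
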
### Developer Experience
- [x] **Context-aware hints** - Suggests next steps after each command
- [x] **--quiet flag** - Suppress hints
- [x] **--format flag** - table/json output for all commands
- [x] **--last flag** - Quick access to most recent item
- [x] **Short IDs** - 8-character prefixes for easier use
- [x] **Error messages with hints** - Helpful troubleshooting suggestions
### CLI Enhancements
- [x] **Interactive TUI Mode** - Rich terminal UI with navigation, filtering, and drill-down
- [x] Textual or Rich-based TUI framework
- [x] Views: trace list, run list, metric dashboard, item detail
- [x] Keyboard navigation: j/k scroll, enter drill-down, q quit
- [x] Real-time eval progress view with per-metric status
- [x] **Shell Completion** - Bash/zsh/fish tab completion for all commands and flags
- [x] argcomplete integration for automatic completion generation
- [x] Complete command names, flag names, and flag values (run IDs, dataset paths)
- [x] Installation helper: evalyn --install-completion
- [x] **Watch Mode** - Auto-rerun evaluation when dataset or config file changes
- [x] File watcher on dataset.jsonl and evalyn.yaml
- [x] Debounce: wait 2s after last change before re-running
- [x] Diff output: only show changed metrics since last run
- [x] --watch flag on run-eval command
- [x] **Profile Command** - Show storage size, run counts, disk usage, and system health
- [x] Database file size and table row counts
- [x] Total eval runs, traces, and annotations
- [x] Disk usage by data directory
- [x] Python environment info: version, installed providers, API key status
- [x] **Config Validation Command** - Check evalyn.yaml for errors, missing fields, and deprecations
- [x] Schema validation against expected evalyn.yaml structure
- [x] Warn on unknown keys, deprecated fields, and type mismatches
- [x] Suggest fixes for common misconfigurations
- [x] evalyn config-check command
- [x] **evalyn doctor** - Diagnose common setup issues (missing API keys, stale data, broken config)
- [x] Check API key validity for each configured provider
- [x] Verify database accessibility and schema version
- [x] Check disk space and write permissions
- [x] Verify Python dependencies are installed (sentence-transformers, etc.)
- [x] Generate diagnostic report for bug reports
- [x] **evalyn playground** - Interactive prompt testing with live metric scoring in the terminal
- [x] Enter input, see agent output, instantly score with selected metrics
- [x] Side-by-side: original prompt vs modified prompt
- [x] Score history across playground iterations
- [x] Save good examples to dataset
- [x] **evalyn diff** - Diff two evaluation runs showing changed scores per item
- [x] Per-item score delta table sorted by largest regression
- [x] Metric-level summary: improved/regressed/unchanged counts
- [x] --threshold flag to only show items with delta > N
- [x] ANSI color coding: green for improvement, red for regression
- [x] **evalyn gc** - Garbage collect orphaned data (stale checkpoints, runs without datasets)
- [x] Identify orphaned checkpoint files without matching runs
- [x] Find runs referencing deleted datasets
- [x] Remove temporary files in .evalyn/ directory
- [x] --dry-run mode showing what would be cleaned
- [x] **Piped JSON Mode** - Machine-readable JSON output for scripting and CI pipeline integration
- [x] --output json on all commands producing structured JSON to stdout
- [x] JSONL streaming for long-running operations (progress events)
- [x] Exit codes: 0=pass, 1=fail, 2=error for CI gate integration
- [x] jq-friendly output structure
- [x] **CLI Plugin System** - Register custom commands via Python entry points
- [x] evalyn.commands entry point group for third-party command modules
- [x] Auto-discovery and registration at startup
- [x] evalyn list-plugins showing installed command plugins
- [x] **CLI Alias Support** - User-defined command aliases in evalyn.yaml
- [x] aliases: section mapping short names to full commands (e.g. "q" -> "quickstart")
- [x] Aliases can include default flags (e.g. "fast-eval" -> "run-eval --workers 8 --provider ollama")
- [x] evalyn alias list showing configured aliases
- [x] **CLI Command History** - Record and replay command sequences for reproducible workflows
- [x] Auto-log commands to .evalyn/history.jsonl with timestamps and exit codes
- [x] evalyn history showing recent commands
- [x] evalyn replay --from <timestamp> to re-run a sequence of commands
- [x] **CLI Batch Script** - Run multiple commands from a script file
- [x] evalyn batch commands.txt executing one command per line
- [x] Stop-on-error vs continue-on-error modes
- [x] Variable substitution: $DATE, $LATEST_RUN, $LATEST_DATASET
- [x] **CLI Output Pagination** - Built-in pager for long terminal outputs
- [x] Auto-page when output exceeds terminal height
- [x] Respect PAGER env var, default to less
- [x] --no-pager flag to disable for piping
- [x] **CLI Notification on Completion** - System notification when long-running commands finish
- [x] Desktop notification via notify-send (Linux), osascript (macOS), or toast (Windows)
- [x] --notify flag on run-eval, calibrate, and one-click commands
- [x] Include pass/fail summary in notification body
- [x] **CLI Config Show** - Display effective merged configuration from all sources
- [x] evalyn config-show displaying global + project + env var + flag overrides
- [x] Highlight which source each setting comes from
- [x] Useful for debugging "why is this provider being used?"
- [x] **CLI Compare Shorthand** - Quick comparison shortcuts for common comparison patterns
- [x] evalyn compare --last-2 comparing two most recent runs
- [x] evalyn compare --latest-vs-pinned comparing latest against pinned baseline
- [x] evalyn compare --latest-vs-previous for sequential regression checking
- [x] **CLI Checkpoint Inspection** - View and manage evaluation checkpoints
- [x] evalyn checkpoints listing all saved checkpoints with item counts and timestamps
- [x] evalyn checkpoint-info <id> showing checkpoint details
- [x] evalyn checkpoint-delete <id> cleaning up stale checkpoints
- [x] **CLI Pipeline Visualization** - Show pipeline steps as ASCII flowchart before execution
- [x] evalyn one-click --show-plan displaying step sequence with estimated times
- [x] Indicate which steps will be skipped based on flags
- [x] Confirm before executing the visualized plan
- [x] **CLI Side-by-Side View** - Display two outputs side by side in terminal
- [x] evalyn compare --side-by-side rendering left/right columns for two runs
- [x] Per-item comparison with visual diff markers
- [x] Automatic column width adjustment based on terminal size
- [x] **CLI Progress Dashboard** - Unified progress view for all concurrent operations
- [x] Multi-bar display: per-metric progress within a run
- [x] ETA estimation based on completed items and average per-item time
- [x] Rich-based dashboard with live updates (optional dependency)
- [x] **CLI Command Chaining** - Pipe output of one command as input to another
- [x] evalyn build-dataset | evalyn run-eval passing dataset path automatically
- [x] --stdin flag reading dataset path or run ID from standard input
- [x] Useful for scripting multi-step workflows without temp variables
- [x] **CLI Time Tracking** - Track total time spent per command type for operational analytics
- [x] Auto-log command name and duration to .evalyn/timing.jsonl
- [x] evalyn timing-stats showing per-command average/total time
- [x] Identify slowest commands for optimization opportunities
- [x] **CLI Quick Rerun** - Rerun last command with modified flags
- [x] evalyn !! repeating last command exactly
- [x] evalyn !! --workers 8 repeating with flag override
- [x] Command history stored in .evalyn/history.jsonl
- [x] **CLI Color Theme Configuration** - User-configurable terminal color scheme
- [x] theme setting in evalyn.yaml: default, solarized, monokai, high-contrast
- [x] EVALYN_THEME env var for quick switching
- [x] Separate from NO_COLOR which disables all colors entirely
- [x] **CLI Output Width Control** - Respect terminal width for table and chart formatting
- [x] Auto-detect terminal width and adjust table column widths accordingly
- [x] --width N flag to override detected width (useful for piping to files)
- [x] Truncate long cell values to fit within available space
- [x] **CLI Execution Audit Log** - Log every CLI command with full arguments for reproducibility
- [x] Auto-append to .evalyn/command_log.jsonl: timestamp, command, args, exit code, duration
- [x] evalyn audit-log showing chronological command history
- [x] Distinct from evaluation audit trail (covers all commands, not just eval runs)
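
The CLI Alias Support item above maps short names to full commands, optionally with default flags. One way such expansion could work is sketched below, assuming aliases are stored as plain strings and only the first argument is eligible for expansion. The function name `expand_alias` is hypothetical, and this is not necessarily how Evalyn resolves aliases.

```python
import shlex
from typing import Dict, List


def expand_alias(argv: List[str], aliases: Dict[str, str]) -> List[str]:
    """Replace the first argument with its alias expansion, if one is configured."""
    if not argv or argv[0] not in aliases:
        return list(argv)
    # shlex.split honours shell-style quoting inside the alias definition.
    return shlex.split(aliases[argv[0]]) + argv[1:]


# Alias definitions taken from the examples in the roadmap text.
aliases = {
    "q": "quickstart",
    "fast-eval": "run-eval --workers 8 --provider ollama",
}
expanded = expand_alias(["fast-eval", "--name", "nightly"], aliases)
```

Appending the user's remaining arguments after the alias body lets flags given on the command line follow the alias defaults, so a parser that honours the last occurrence of a flag would treat them as overrides.
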
### Run Management
- [x] **Run Naming** - Give eval runs human-readable names instead of only UUIDs
- [x] --name flag on run-eval: evalyn run-eval --name "prompt-v3-experiment"
- [x] Name stored in EvalRun metadata, displayed in list-runs
- [x] Resolve runs by name: evalyn show-run --name "prompt-v3-experiment"
- [x] **Run Pinning** - Mark a run as baseline for automatic comparison
- [x] evalyn pin-run --id <id> marking a run as the project baseline
- [x] Subsequent analyze and compare commands auto-compare against pinned run
- [x] evalyn list-runs showing pinned run with a marker
- [x] **Run Cleanup** - Bulk delete runs matching criteria
- [x] evalyn cleanup-runs --older-than 30d --keep-pinned
- [x] evalyn cleanup-runs --below-pass-rate 0.3 for removing low-quality runs
- [x] --dry-run mode showing what would be deleted with total storage savings
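
The Run Cleanup item above combines an age filter (`--older-than 30d`) with pinned-run protection (`--keep-pinned`). The selection logic could look roughly like the sketch below. The run dictionaries, the `parse_age` and `select_for_cleanup` names, and support for `h` and `w` units are assumptions; the roadmap text only shows the `30d` form.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List

# Assumed unit suffixes; only "d" appears in the roadmap text.
_UNITS = {"h": "hours", "d": "days", "w": "weeks"}


def parse_age(spec: str) -> timedelta:
    """Turn a spec such as '30d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([hdw])", spec)
    if match is None:
        raise ValueError(f"unsupported age spec: {spec!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def select_for_cleanup(
    runs: List[Dict[str, Any]],
    older_than: str,
    now: datetime,
    keep_pinned: bool = True,
) -> List[str]:
    """Return IDs of runs older than the cutoff, skipping pinned runs if requested."""
    cutoff = now - parse_age(older_than)
    return [
        run["id"]
        for run in runs
        if run["created_at"] < cutoff and not (keep_pinned and run.get("pinned", False))
    ]


# Hypothetical run records.
now = datetime(2026, 3, 25, tzinfo=timezone.utc)
runs = [
    {"id": "a1b2c3d4", "created_at": now - timedelta(days=45), "pinned": False},
    {"id": "e5f6a7b8", "created_at": now - timedelta(days=60), "pinned": True},
    {"id": "c9d0e1f2", "created_at": now - timedelta(days=5), "pinned": False},
]
to_delete = select_for_cleanup(runs, "30d", now)
```

Returning the candidate IDs without deleting anything mirrors a dry run: the caller can print the list first and delete only after confirmation.
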
### Metrics Enhancements
- [x] **Custom Metric DSL** - Define metrics via YAML config without writing Python code
- [x] YAML metric definition: name, type, prompt template, threshold, scoring rubric
- [x] Variable interpolation: {{input}}, {{output}}, {{expected}} in prompt templates
- [x] Custom objective metrics via Python expressions (e.g. "len(output) < 500")
- [x] Hot-reload: modify YAML, re-run eval without code changes
- [x] **Metric Composition** - Combine multiple metrics into weighted composite scores
- [x] Composite metric definition: weighted average of child metrics
- [x] Min/max/mean aggregation strategies
- [x] Pass threshold on composite score
- [x] Drill-down: see child metric contributions to composite
- [x] **Metric Weighting Profiles** - Named weight sets for different evaluation use cases
- [x] Profile definitions in evalyn.yaml (e.g. "safety-first": safety=3x, quality=1x)
- [x] --weight-profile flag on analyze and compare commands
- [x] Weighted pass rate and weighted overall score
- [x] **Metric Versioning** - Track when metric implementations change and flag affected runs
- [x] Hash metric prompt + scoring logic as version identifier
- [x] Store metric version in MetricResult metadata
- [x] Warn when comparing runs with different metric versions
- [x] evalyn metric-history showing version changes over time
- [x] **Metric Benchmarking** - Measure computation cost and latency per metric
- [x] Per-metric timing in evaluation runner
- [x] Token usage and cost per metric type
- [x] Benchmark report: slowest metrics, most expensive metrics
- [x] Optimization suggestions for costly metrics
- [x] **Inter-Rater Reliability** - Compute agreement stats when multiple judges score the same items
- [x] Run same metric with N different judges (models or prompts)
- [x] Fleiss' Kappa for multi-rater agreement
- [x] Identify items with lowest agreement for human review
- [x] Recommend judge selection based on reliability
- [x] **Metric Sensitivity Analysis** - Measure score stability across small input perturbations
- [x] Perturb inputs (typos, rephrasing) and measure score variance
- [x] Flag metrics with high sensitivity to minor input changes
- [x] Robustness score per metric
- [x] **Metric Correlation Pruning** - Auto-suggest removing redundant metrics that track the same signal
- [x] Pearson/Spearman correlation matrix across all metrics
- [x] Flag pairs with |r| > 0.95 as candidates for pruning
- [x] Recommend minimal metric set preserving signal coverage
- [x] **Metric Dependencies** - Declare that metric B requires metric A to run first (dependency graph)
- [x] Dependency declaration in MetricSpec
- [x] Topological sort of metrics before evaluation
- [x] Pass metric A results as context to metric B prompt
- [x] **Conditional Metric Chains** - If metric A fails, automatically run a diagnostic follow-up metric B
- [x] Chain definition: "if toxicity_safety fails, run toxicity_type_classifier"
- [x] Diagnostic metrics produce detailed failure categorization
- [x] Chain results stored alongside primary metric results
- [x] **Metric Namespacing** - Organize metrics by project/team namespace to avoid collisions
- [x] Namespace prefix: "team-safety/toxicity" vs "team-quality/toxicity"
- [x] Namespace-scoped metric search in list-metrics
- [x] Cross-namespace metric comparison
- [x] **Metric Score Explanations** - Return human-readable explanations for objective metric scores
- [x] Per-metric explain() function describing why the score is what it is
- [x] Example: "json_valid: FAIL - parse error at line 3, column 12: unexpected token"
- [x] Include explanations in show-run and failed item breakdown output
- [x] **Metric Warmup Averaging** - Run each subjective metric N times and average to reduce LLM variance
- [x] --metric-samples N flag on run-eval (default 1)
- [x] Report per-metric score variance across samples
- [x] Flag items where samples disagree (high variance) for review
- [x] **Metric Runtime Estimation** - Predict eval duration per metric based on historical timing data
- [x] Store per-metric median execution time from past runs
- [x] Estimate total run time before execution starts
- [x] Surface slow metrics in dry-run output with time contribution
- [x] **Metric Compatibility Matrix** - Show which metrics work with which evaluation unit types
- [x] Matrix: metrics on Y-axis, unit types (outcome, single_turn, tool_use, multi_turn) on X-axis
- [x] evalyn list-metrics --compatibility showing supported unit types per metric
- [x] Warn when user selects metrics incompatible with their trace structure
- [x] **Metric Score Binning** - Configurable score-to-grade mapping for human-friendly reporting
- [x] Grade definitions in evalyn.yaml (e.g. A=0.8-1.0, B=0.6-0.8, C=0.4-0.6, F=0-0.4)
- [x] Grade distribution chart in analyze output
- [x] Custom grade labels and thresholds per project
- [x] **Reference-Adaptive Metrics** - Auto-switch metric rubric based on whether expected reference is present
- [x] Detect reference availability per item via _dataset_has_reference
- [x] Use reference-based rubric when available, reference-free rubric otherwise
- [x] Report which rubric variant was used per item in MetricResult details
- [x] **Metric Debug Mode** - Verbose logging of the complete judge interaction per item
- [x] --debug-metrics flag showing: prompt sent, raw response, parsed result per item
- [x] Log to .evalyn/metric_debug.jsonl for post-hoc analysis
- [x] Useful for diagnosing why a metric scores differently than expected
- [x] **Metric Template Variables** - Custom variables in judge prompt templates beyond standard input/output/expected
- [x] User-defined variables in evalyn.yaml: template_vars: {domain: "healthcare", persona: "clinician"}
- [x] Variable interpolation in judge prompts: "Evaluate from the perspective of a {{persona}}"
- [x] Per-dataset variable overrides in meta.json
- [x] **Metric Registry Freeze** - Lock the metric set for a project to prevent accidental changes
- [x] evalyn freeze-metrics --project <name> locking current metrics.json
- [x] Warn when attempting to modify frozen metric set
- [x] evalyn unfreeze-metrics to unlock for intentional changes
- [x] **Metric Output Post-Processing** - Pluggable post-processors on raw judge output before scoring
- [x] Post-processor chain in evalyn.yaml per metric (e.g. normalize, clamp, round)
- [x] Built-in processors: score_clamp(0,1), binary_threshold(0.5), invert_score
- [x] Custom Python post-processor functions via entry points
- [x] **Metric Deprecation Lifecycle** - Formal deprecation with migration path and sunset date
- [x] Deprecation metadata on MetricSpec: deprecated_since, replacement, sunset_date
- [x] Warning when using deprecated metrics in run-eval
- [x] evalyn list-metrics --deprecated showing deprecated metrics with migration hints
- [x] **Metric Category Pass Rates** - Aggregate reporting by subjective category (safety, correctness, style, etc.)
- [x] Group metrics by CATEGORIES mapping in analyze output
- [x] Per-category pass rate bar charts
- [x] Identify weakest category for targeted improvement
- [x] **Metric Rubric Preview** - Show exact judge prompt before evaluation starts
- [x] evalyn preview-metric --id helpfulness_accuracy showing full prompt with rubric
- [x] Include template variable substitution with sample input/output
- [x] Verify rubric looks correct before committing to expensive evaluation
- [x] **Metric Cross-Reference View** - Show which bundles include each metric
- [x] evalyn list-metrics --show-bundles displaying bundle membership per metric
- [x] Inverse view: evalyn list-bundles --show-metrics for bundle contents
- [x] Useful for understanding metric coverage across different evaluation profiles
- [x] **Metric Score Curve Fitting** - Fit parametric distributions to historical metric scores
- [x] Fit beta/normal/bimodal distributions to score history per metric
- [x] Detect distribution changes between runs (shift, spread, shape)
- [x] Use fitted distribution for anomaly detection on new scores
- [x] **Metric Prompt Token Count** - Show estimated prompt token count per metric before evaluation
- [x] Estimate tokens from metric prompt template + average input/output sizes
- [x] evalyn list-metrics --show-tokens displaying per-metric token cost
- [x] Factor into cost estimation in dry-run mode
- [x] **Metric A/B Variant Testing** - Evaluate same items with two rubric variants of the same metric
- [x] Define variant rubrics in evalyn.yaml: helpfulness_v1 vs helpfulness_v2
- [x] Run both variants in a single eval, compare scores and agreement
- [x] Select the variant with better alignment to human annotations
- [x] **Metric Cold Start Detection** - Detect when a metric's first N items score differently than the rest
- [x] Compare score distribution of first K items vs remaining items per metric
- [x] Statistical test (KS or Mann-Whitney) for distribution shift
- [x] Recommend warm-up if cold start effect is significant
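
The Metric Dependencies item above calls for a topological sort so that a metric runs only after the metrics it depends on. A minimal sketch using the standard-library `graphlib` module (Python 3.9+) follows. The mapping format, the function name `order_metrics`, and the metric IDs in the example are hypothetical and do not reflect the actual `MetricSpec` fields.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Set


def order_metrics(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return metric IDs so that every metric comes after the metrics it depends on.

    `dependencies` maps a metric ID to the set of metric IDs it requires.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"circular metric dependency: {exc.args[1]}") from exc


# Hypothetical metric IDs: the second metric consumes the first metric's result.
dependencies = {
    "retrieval_relevance": set(),
    "answer_groundedness": {"retrieval_relevance"},
    "response_length": set(),
}
order = order_metrics(dependencies)
```

Surfacing cycles as an error before evaluation starts avoids spending judge calls on a run that could never complete.
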
### Metric Bundle Customization
- [x] **User-Defined Bundles** - Define custom metric bundles in evalyn.yaml
- [x] bundles: section in evalyn.yaml with named metric lists
- [x] evalyn suggest-metrics --mode bundle --bundle my-custom-bundle
- [x] Inherit from built-in bundles and override (e.g. extend "chatbot" with custom metrics)
- [x] **Bundle Composition** - Combine multiple bundles into one with deduplication
- [x] evalyn suggest-metrics --bundle chatbot+safety merging two bundles
- [x] Automatic deduplication when combining overlapping bundles
- [x] Conflict resolution when same metric appears with different configs
- [x] **Bundle Recommendation** - Auto-suggest bundle based on captured trace patterns
- [x] Analyze trace spans to detect agent type (RAG, orchestrator, chatbot, etc.)
- [x] Match detected patterns to best-fit built-in bundle
- [x] evalyn suggest-metrics --mode auto-bundle choosing bundle without user input
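
The Bundle Composition item above merges bundles written as `chatbot+safety` and deduplicates overlapping metrics. A minimal sketch of that merge is shown below, assuming a bundle is simply an ordered list of metric IDs. The bundle contents and the function name `compose_bundles` are hypothetical, and conflict resolution for differing per-metric configs is not covered.

```python
from typing import Dict, List


def compose_bundles(spec: str, bundles: Dict[str, List[str]]) -> List[str]:
    """Merge the bundles named in an 'a+b' spec, keeping first-seen order without duplicates."""
    merged: Dict[str, None] = {}  # dict preserves insertion order
    for name in spec.split("+"):
        if name not in bundles:
            raise KeyError(f"unknown bundle: {name}")
        for metric_id in bundles[name]:
            merged.setdefault(metric_id, None)
    return list(merged)


# Hypothetical bundle definitions with one overlapping metric.
bundles = {
    "chatbot": ["helpfulness_accuracy", "toxicity_safety", "latency"],
    "safety": ["toxicity_safety", "pii_leak"],
}
metrics = compose_bundles("chatbot+safety", bundles)
```

Keeping first-seen order means the leftmost bundle in the spec determines where a shared metric appears in the merged list.
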
### LLM Provider Support
- [x] **Gemini** - Full support with auto-instrumentation
- [x] **OpenAI** - Full support with auto-instrumentation
- [x] **Anthropic** - Full support with auto-instrumentation
- [x] **xAI (Grok)** - Full support with auto-instrumentation
- [x] **Ollama** - Local model support (--provider ollama)
### Framework Support
- [x] **LangChain** - Automatic instrumentation
- [x] **LangGraph** - Automatic instrumentation with node tracking
- [x] **Google ADK** - Automatic instrumentation
- [x] **Claude Agent SDK** - Automatic instrumentation
### Storage & Data
- [x] **SQLite storage** - Local-first, no cloud dependencies
- [x] **Prod/test separation** - Separate databases for environments
- [x] **JSONL datasets** - Human-readable, git-friendly format
- [x] **Checkpoint system** - Resume interrupted evaluations
### Testing & Quality
- [x] **Test coverage improvement** - 1,063 tests across 30 test files
- [x] Analysis engine: trends, reports, core properties, insights
- [x] Model roundtrips: Span, FunctionCall, DatasetItem, Annotation, SpanMetricLink
- [x] SQLiteStorage: CRUD, ID resolution, annotations
- [x] CLI utilities: formatters, validation, config
- [x] CLI commands: analyze, compare, trend, list-runs, show-run, insights
- [x] Export formats: markdown, HTML, CSV builders
- [x] Metrics: HeuristicSuggester, subjective template validation, objective metrics
- [x] Tracing: instrumentation, streaming, provider instrumentors
- [x] **Realistic test fixtures** - 10+ items, 3 metrics, mixed scores, failure reasons
- [x] **pytest-cov integration** - Coverage reporting via `--cov=evalyn_sdk`
- [x] **Integration test unskip** - Fixed 2 skipped integration tests
*Last updated: 2026-03-25*