# PR Check Evaluations
This document describes the lightweight evaluation system designed for pull request checks.
## Overview
The `pr-check` scenario is a multi-turn evaluation designed to quickly verify core agent capabilities during CI/CD without the cost and time of running the full evaluation suite.
## Why PR Check Evals?
### The Problem
Full evaluation suites are comprehensive but have significant drawbacks for PR workflows:
1. **Cost**: Running 15+ scenarios across 3 providers uses substantial API tokens
2. **Time**: Full evals can take 10-15 minutes, slowing down PR iteration
3. **Feedback latency**: Developers wait too long to know if their changes work
### The Solution
The `pr-check` scenario addresses these issues by:
1. **Single multi-turn scenario**: Tests multiple capabilities in one conversation
2. **~60 second runtime**: Fast enough for responsive CI feedback
3. **Core capability coverage**: Validates the most critical agent functions
4. **Provider-agnostic**: Works with `vertex-claude`, `openai`, and `zai`
## What It Tests
The `pr-check` scenario validates 9 core capabilities in a single multi-turn conversation, one per turn:
| Turn | Capability | What It Tests |
|------|------------|---------------|
| 1 | Tool Awareness | Agent correctly lists available tools (`read_file`, `edit_file`, `grep`, etc.) |
| 2 | Sub-Agent Awareness | Agent knows about the `coder`, `analyzer`, `explorer`, `researcher`, and `executor` sub-agents |
| 3 | File Operations | List directory contents, create new files |
| 4 | Edit & Search | Edit existing files, use grep to search |
| 5 | AST-grep | Structural code search using ast_grep patterns |
| 6 | Coder Create File | Delegate to coder sub-agent to create a new file using udiff |
| 7 | Coder Edit File | Delegate to coder sub-agent to edit the created file |
| 8 | Executor Delete File | Delegate to executor sub-agent to delete a file |
| 9 | Creative Response | Generate a poem about AI evals |
### Metrics Evaluated
- `tool_awareness`: LLM judge verifies agent lists core tools
- `sub_agent_awareness`: LLM judge verifies agent describes sub-agents
- `file_created`: Verifies `src/lib.rs` was created
- `file_edited`: Verifies `src/lib.rs` contains "modified"
- `coder_file_created`: Verifies `src/greeting.rs` was created by the coder sub-agent
- `coder_file_edited`: Verifies `src/greeting.rs` was edited by the coder sub-agent (contains "edited")
- `file_deleted`: Verifies `src/temp.rs` was deleted by the executor sub-agent
- `poem_quality`: LLM judge evaluates the creative response
- `turns_completed`: Number of turns completed out of 9 (for example, 9/9)
- `sufficient_tool_usage`: At least 5 tool calls made
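
The file-based and count-based metrics above reduce to simple checks against the evaluation workspace. The following is a minimal sketch of that logic in shell, useful for spot-checking a workspace by hand. The function names are hypothetical and this is not the actual `qbit-evals` implementation; the case-sensitive substring match is also an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical helpers that mirror the file-based metrics.
# They are illustrative only, not the real qbit-evals code.

# file_created / coder_file_created: the path exists as a regular file.
file_created() { [ -f "$1" ]; }

# file_edited / coder_file_edited: the file exists and contains a marker.
# Assumption: a case-sensitive, fixed-string match.
file_contains() { [ -f "$1" ] && grep -qF -- "$2" "$1"; }

# file_deleted: nothing exists at the path any more.
file_deleted() { [ ! -e "$1" ]; }

# sufficient_tool_usage: the call count meets the minimum (default 5).
sufficient_tool_usage() { [ "$1" -ge "${2:-5}" ]; }
```

For example, `file_contains src/lib.rs modified` corresponds to the `file_edited` check, and `file_deleted src/temp.rs` to `file_deleted`.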
## CLI Usage
### Running PR Check
```bash
# Basic usage
qbit-cli --eval --scenario pr-check
# With specific provider
qbit-cli --eval --scenario pr-check --eval-provider vertex-claude
# With transcript output (shows full agent conversation with actual prompts)
qbit-cli --eval --scenario pr-check --transcript
# With transcript and pretty results
qbit-cli --eval --scenario pr-check --transcript --pretty
# Save results to file
qbit-cli --eval --scenario pr-check --output results.json --pretty
```
### The --transcript Flag
The `--transcript` flag provides visibility into what the agent actually did during evaluation:
```bash
qbit-cli --eval --scenario pr-check --transcript
```
This outputs:
1. **First**: The full agent transcript with each turn's response
2. **Then**: The evaluation results summary
The transcript shows:
- Each user turn and agent response
- All tool calls made
- Clear visual separators between turns
Example output:
```
═══════════════════════════════════════════════════════════════
AGENT TRANSCRIPT
═══════════════════════════════════════════════════════════════
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┃ Scenario: pr-check
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌─ User Turn 1 ──────────────────────────────────────────────────
│ [prompt sent to agent]
├─ Agent Response ─────────────────────────────────────────────
│ # Main Tools Available
│ - read_file, edit_file, create_file...
└───────────────────────────────────────────────────────────────
┌─ User Turn 2 ──────────────────────────────────────────────────
│ [prompt sent to agent]
├─ Agent Response ─────────────────────────────────────────────
│ ## Sub-Agents
│ - coder: Handles multiple related edits...
└───────────────────────────────────────────────────────────────
```
## GitHub Actions Integration
The evaluation workflow automatically runs `pr-check` for pull requests:
```yaml
# For PRs: lightweight pr-check scenario with transcript
if [ "${{ github.event_name }}" = "pull_request" ]; then
  ./target/debug/qbit-cli --eval --scenario pr-check --transcript --pretty
fi
```
This ensures:
- Fast feedback on PRs (~60 seconds per provider)
- Full transcript available in CI logs for debugging
- Reduced API costs compared to full eval suite
- Full eval suite still runs on main branch and scheduled runs
### Exit Codes and PASS/FAIL Output
The CLI outputs a clear PASS or FAIL summary at the end for easy CI integration:
```
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
PASS: All 1 scenarios passed
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
```
Exit codes:
- **0**: All scenarios passed
- **1**: One or more scenarios failed
This allows GitHub Actions to detect and report failures correctly.
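
Outside GitHub Actions (in a local pre-push hook, say), the same exit codes can be handled explicitly. The wrapper below is a small sketch: `run_pr_check` is a hypothetical helper name, and it only relies on the exit-code convention described above.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run the given command and report based on its exit code.
run_pr_check() {
  local status
  if "$@"; then
    echo "pr-check passed"
  else
    status=$?   # exit code of the command that just failed
    echo "pr-check failed (exit code $status)" >&2
    return "$status"
  fi
}

# Usage:
#   run_pr_check qbit-cli --eval --scenario pr-check --transcript --pretty
```

The wrapper returns the original exit code unchanged, so a calling script or hook still sees 0 for a pass and non-zero for a failure.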
## When to Use What
| Scenario | When to Use |
|----------|-------------|
| `--scenario pr-check` | PR checks, quick validation |
| Full suite (no `--scenario`) | Main branch, release validation, weekly runs |
| `--scenario <specific>` | Debugging a specific capability |
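
The table above can be expressed as a small selection helper for CI scripts. This is a sketch under assumptions: `eval_args_for_event` is a hypothetical function, and running the full suite with plain `--eval` is inferred from the "no `--scenario`" row rather than confirmed elsewhere in this document.

```bash
#!/usr/bin/env bash
# Hypothetical helper: choose qbit-cli arguments from the GitHub event name.
eval_args_for_event() {
  case "$1" in
    pull_request)
      # PRs: lightweight scenario with transcript
      echo "--eval --scenario pr-check --transcript --pretty"
      ;;
    *)
      # Anything else (main branch, scheduled runs): full suite.
      # Assumption: omitting --scenario runs every scenario.
      echo "--eval"
      ;;
  esac
}

# Usage (word splitting of the output is intentional):
#   qbit-cli $(eval_args_for_event "$GITHUB_EVENT_NAME")
```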
## Adding New Capabilities to PR Check
If you need to test additional capabilities in PR checks, edit:
`backend/crates/qbit-evals/src/scenarios/pr_check.rs`
Guidelines:
1. Keep the total scenario under 90 seconds
2. Each turn should test one distinct capability
3. Add corresponding metrics to verify success
4. Test with all three providers before merging
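
Guidelines 1 and 4 can be checked together with a short loop. The sketch below runs `pr-check` once per provider and flags any run that fails or exceeds the time budget. The function name and the `QBIT_CLI` and `LIMIT_SECONDS` variables are hypothetical conveniences; only the CLI flags come from this document.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run pr-check against each provider and enforce a time budget.
check_all_providers() {
  local cli="${QBIT_CLI:-qbit-cli}"    # assumed override for the binary path
  local limit="${LIMIT_SECONDS:-90}"   # budget from guideline 1
  local failed=0 provider start elapsed

  for provider in vertex-claude openai zai; do
    start=$SECONDS
    if ! "$cli" --eval --scenario pr-check --eval-provider "$provider"; then
      echo "FAIL: $provider" >&2
      failed=1
    fi
    elapsed=$(( SECONDS - start ))
    if [ "$elapsed" -gt "$limit" ]; then
      echo "SLOW: $provider took ${elapsed}s (limit ${limit}s)" >&2
      failed=1
    fi
  done

  return "$failed"
}

# Usage:
#   QBIT_CLI=./target/debug/qbit-cli check_all_providers
```

The loop keeps going after a failure so that one run reports the status of all three providers.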