Extensible benchmarking suite for evaluating AI coding agents on web search tasks. Compare native search against MCP servers (You.com, with more planned) across multiple agents (Claude Code, Gemini, Droid, Codex, with more planned) using automated Docker workflows and statistical analysis.
# Web Search Agent Evaluations
Evaluate multiple agents (Claude Code, Gemini, Droid, Codex) with different web search tools (builtin, You.com MCP) in isolated Docker containers.
## Overview
This evaluation system runs a matrix comparison: 4 agents × 2 tools = 8 pairings, capturing full trajectories for analysis.
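
As a rough illustration of the matrix, the sketch below enumerates every agent/tool pairing as a Cartesian product. The agent identifiers (`claude-code`, `gemini`, `droid`, `codex`) and the helper name `buildMatrix` are assumptions made for this example; the tool values `builtin` and `you` follow the labels used in the pipeline diagram.

```typescript
// Illustrative identifiers; the real service names may differ.
const AGENTS = ["claude-code", "gemini", "droid", "codex"] as const;
const TOOLS = ["builtin", "you"] as const;

type Agent = (typeof AGENTS)[number];
type Tool = (typeof TOOLS)[number];
type Pairing = { agent: Agent; tool: Tool };

// Cartesian product: each agent is paired with each tool exactly once.
const buildMatrix = (): Pairing[] =>
  AGENTS.flatMap((agent) => TOOLS.map((tool) => ({ agent, tool })));

const matrix = buildMatrix();
console.log(matrix.length); // 8
```

Adding an agent or a tool to either array grows the matrix without touching the pairing logic.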
**Key Features:**
- **Headless adapters** - No custom code, just JSON schemas ([@plaited/agent-eval-harness](https://www.npmjs.com/package/@plaited/agent-eval-harness))
- **Flag-based architecture** - Single service per agent, MCP mode selected via environment variable
- **Type-safe constants** - MCP server definitions in `mcp-servers.ts`
- **TypeScript entrypoint** - Bun shell script for runtime MCP configuration
- **Isolated execution** - Each pairing runs in its own Docker container
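
To make the flag-based idea concrete, here is a minimal sketch of how a single entrypoint could pick the MCP mode from the `SEARCH_PROVIDER` variable. The helper names (`resolveProvider`, `needsMcpSetup`) and the fallback to `builtin` when the variable is unset are assumptions for illustration, not the repository's actual entrypoint code.

```typescript
// Provider values taken from the pipeline diagram's edge labels.
const SEARCH_PROVIDERS = ["builtin", "you"] as const;
type SearchProvider = (typeof SEARCH_PROVIDERS)[number];

// Type guard that narrows an arbitrary string to a known provider.
const isSearchProvider = (value: string): value is SearchProvider =>
  (SEARCH_PROVIDERS as readonly string[]).includes(value);

// Hypothetical helper. Assumption: an unset variable means "builtin".
const resolveProvider = (
  env: Record<string, string | undefined>,
): SearchProvider => {
  const raw = env.SEARCH_PROVIDER ?? "builtin";
  if (!isSearchProvider(raw)) {
    throw new Error(`Unsupported SEARCH_PROVIDER: ${raw}`);
  }
  return raw;
};

// Only non-builtin providers require MCP configuration before the run.
const needsMcpSetup = (provider: SearchProvider): boolean =>
  provider !== "builtin";

const provider = resolveProvider({ SEARCH_PROVIDER: "you" });
console.log(provider, needsMcpSetup(provider)); // you true
```

Passing the environment in as a plain object keeps the helper easy to test; a real entrypoint would likely pass `process.env`.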
### Evaluation Pipeline
```mermaid
flowchart TD
Env[SEARCH_PROVIDER env var] --> Entrypoint[docker/entrypoint]
Entrypoint -->|builtin| SkipMCP[Skip MCP setup]
Entrypoint -->|you| ConfigMCP[Configure MCP via CLI]
SkipMCP --> Harness[agent-eval-harness trials]
ConfigMCP --> Harness
Prompts[prompts.jsonl] --> Harness
Schemas[agent-schemas/*.json] --> Harness
Harness --> Results[data/results/YYYY-MM-DD/agent/tool.jsonl]
```
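
The output layout in the diagram can be sketched as a small path builder. The function names and the use of a UTC date stamp are assumptions for this example; only the `data/results/YYYY-MM-DD/agent/tool.jsonl` shape comes from the diagram.

```typescript
// Format a Date as YYYY-MM-DD. Assumption: the date stamp is taken in UTC.
const toDateStamp = (date: Date): string => date.toISOString().slice(0, 10);

// Hypothetical helper that mirrors the results layout from the diagram.
const resultPath = (date: Date, agent: string, tool: string): string =>
  `data/results/${toDateStamp(date)}/${agent}/${tool}.jsonl`;

// Months are zero-based in Date.UTC, so 1 means February.
const example = resultPath(new Date(Date.UTC(2026, 1, 18)), "codex", "you");
console.log(example); // data/results/2026-02-18/codex/you.jsonl
```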
### Analysis Pipeline
```mermaid
flowchart LR
Results[Trial Results<br/>YYYY-MM-DD/agent/tool.jsonl] --> Compare[compare.ts]
Grader[inline-grader.ts] --> Compare
Compare -->|weighted| Weighted[*-weighted.json<br/>Capability + Reliability]
Compare -->|statistical| Statistical[*-statistical.json<br/>Bootstrap CIs]
Weighted --> Report[report.ts]
Statistical --> Report
Report --> REPORT[REPORT.md<br/>Rankings + Analysis]
style Compare fill:#e1f5ff
style Report fill:#e1f5ff
style REPORT fill:#d4edda
```
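
For readers unfamiliar with bootstrap confidence intervals, the following is a generic percentile-bootstrap sketch over per-trial scores. It is not the code in `compare.ts`, whose exact method is not shown here; the function names, the seeded generator, and the defaults (2000 resamples, 95% interval) are all assumptions chosen for illustration.

```typescript
// Small seeded PRNG (mulberry32) so the sketch gives reproducible results.
const mulberry32 = (seed: number) => {
  let state = seed >>> 0;
  return (): number => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
};

type Interval = { mean: number; lo: number; hi: number };

// Percentile bootstrap: resample with replacement, take the mean of each
// resample, then read the interval bounds from the sorted means.
const bootstrapCI = (
  scores: number[],
  iterations = 2000,
  alpha = 0.05,
  seed = 42,
): Interval => {
  if (scores.length === 0) throw new Error("need at least one score");
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let i = 0; i < iterations; i++) {
    let sum = 0;
    for (let j = 0; j < scores.length; j++) {
      sum += scores[Math.floor(rand() * scores.length)];
    }
    means.push(sum / scores.length);
  }
  means.sort((a, b) => a - b);
  const lo = means[Math.floor((alpha / 2) * iterations)];
  const hi = means[Math.min(iterations - 1, Math.floor((1 - alpha / 2) * iterations))];
  const mean = scores.reduce((s, x) => s + x, 0) / scores.length;
  return { mean, lo, hi };
};

// Example: ten pass/fail trial outcomes for one agent/tool pairing.
const ci = bootstrapCI([1, 1, 0, 1, 0, 1, 1, 1, 0, 1]);
console.log(ci.mean, ci.lo, ci.hi);
```

A wide interval signals that the pairing needs more trials before its ranking can be trusted.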
## Latest Results
**[View Latest Evaluation Report](data/comparisons/2026-02-18/REPORT.md)** - Comprehensive analysis with quality rankings, performance metrics, and tool call statistics.