web-search-agent-evals

Name: web-search-agent-evals
Author: youdotcom-oss

youdotcom-oss January 20, 2026


Extensible benchmarking suite for evaluating AI coding agents on web search tasks. It compares native (builtin) search against MCP servers (currently You.com, with more planned) across multiple agents (currently Claude Code, Gemini, Droid, and Codex, with more planned), using automated Docker workflows and statistical analysis.

Web Search Agent Evaluations

Evaluate multiple agents (Claude Code, Gemini, Droid, Codex) with different web search tools (builtin, You.com MCP) in isolated Docker containers.

Overview

This evaluation system runs a matrix comparison: 4 agents × 2 tools = 8 pairings, capturing full trajectories for analysis.
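As a rough illustration of the matrix, the pairings can be enumerated as a cross product of agents and tools. This is a minimal sketch, not the project's code: the identifiers `AGENTS`, `TOOLS`, and `buildMatrix`, and the lowercase labels, are assumptions chosen for the example.

```typescript
// Hypothetical labels: derived from the agent and tool names in this README,
// not from the project's source.
const AGENTS = ["claude-code", "gemini", "droid", "codex"] as const;
const TOOLS = ["builtin", "you"] as const;

type Agent = (typeof AGENTS)[number];
type Tool = (typeof TOOLS)[number];

interface Pairing {
  agent: Agent;
  tool: Tool;
}

// Build every agent/tool combination (the cross product of the two lists).
function buildMatrix(): Pairing[] {
  const pairings: Pairing[] = [];
  for (const agent of AGENTS) {
    for (const tool of TOOLS) {
      pairings.push({ agent, tool });
    }
  }
  return pairings;
}

console.log(buildMatrix().length); // 8
```

Adding an agent or a tool to either list grows the matrix without touching the enumeration logic.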

Key Features:

Headless adapters - No custom code, just JSON schemas (@plaited/agent-eval-harness)
Flag-based architecture - Single service per agent, MCP mode selected via environment variable
Type-safe constants - MCP server definitions in mcp-servers.ts
TypeScript entrypoint - Bun shell script for runtime MCP configuration
Isolated execution - Each pairing runs in its own Docker container
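The flag-based approach could look roughly like the sketch below, where one environment variable (named `SEARCH_PROVIDER` in the pipeline diagram) decides whether MCP setup runs. The function names, the default of `builtin` when the variable is unset, and the error handling are assumptions for illustration, not the project's actual entrypoint.

```typescript
// The two modes shown in the evaluation pipeline diagram.
type SearchProvider = "builtin" | "you";

// Hypothetical helper: reads the mode from an environment-like record.
// Assumption: an unset variable falls back to "builtin".
function resolveSearchProvider(
  env: Record<string, string | undefined>,
): SearchProvider {
  const raw = (env["SEARCH_PROVIDER"] ?? "builtin").trim().toLowerCase();
  if (raw === "builtin" || raw === "you") {
    return raw;
  }
  throw new Error(`Unsupported SEARCH_PROVIDER: ${raw}`);
}

// Hypothetical helper: MCP configuration is skipped only for builtin search.
function needsMcpSetup(provider: SearchProvider): boolean {
  return provider !== "builtin";
}

const provider = resolveSearchProvider({ SEARCH_PROVIDER: "you" });
console.log(provider, needsMcpSetup(provider));
```

In a real entrypoint the record would typically be `process.env`. Keeping the mode in a single variable means the same container image serves both tools.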

Evaluation Pipeline

flowchart TD
    Env[SEARCH_PROVIDER env var] --> Entrypoint[docker/entrypoint]
    Entrypoint -->|builtin| SkipMCP[Skip MCP setup]
    Entrypoint -->|you| ConfigMCP[Configure MCP via CLI]

    SkipMCP --> Harness[agent-eval-harness trials]
    ConfigMCP --> Harness

    Prompts[prompts.jsonl] --> Harness
    Schemas[agent-schemas/*.json] --> Harness
    Harness --> Results[data/results/YYYY-MM-DD/agent/tool.jsonl]
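The results layout in the diagram above (`data/results/YYYY-MM-DD/agent/tool.jsonl`) can be sketched as a small path builder. The function name `resultPath` and the use of a UTC date are assumptions for this example; the project may derive the date differently.

```typescript
// Hypothetical helper: builds the output path for one agent/tool pairing.
// Assumption: the date segment is the UTC calendar date of the run.
function resultPath(runDate: Date, agent: string, tool: string): string {
  // toISOString() yields e.g. "2026-01-20T00:00:00.000Z"; keep the date part.
  const day = runDate.toISOString().slice(0, 10);
  return `data/results/${day}/${agent}/${tool}.jsonl`;
}

console.log(resultPath(new Date(Date.UTC(2026, 0, 20)), "codex", "you"));
// data/results/2026-01-20/codex/you.jsonl
```

Grouping by date first keeps each run's eight result files together, which makes run-to-run comparisons a matter of pointing at two directories.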

Analysis Pipeline

flowchart LR
    Results[Trial Results<br/>YYYY-MM-DD/agent/tool.jsonl] --> Compare[compare.ts]
    Grader[inline-grader.ts] --> Compare

    Compare -->|weighted| Weighted[*-weighted.json<br/>Capability + Reliability]
    Compare -->|statistical| Statistical[*-statistical.json<br/>Bootstrap CIs]

    Weighted --> Report[report.ts]
    Statistical --> Report

    Report --> REPORT[REPORT.md<br/>Rankings + Analysis]

    style Compare fill:#e1f5ff
    style Report fill:#e1f5ff
    style REPORT fill:#d4edda

Latest Results

📊 View Latest Evaluation Report - Comprehensive analysis with quality rankings, performance metrics, tool call stat
