web-search-agent-evals — Gemini Agents | Neura Market
    web-search-agent-evals


    youdotcom-oss January 20, 2026

    Extensible benchmarking suite for evaluating AI coding agents on web search tasks. It compares native search against MCP servers (currently You.com, with more planned) across multiple agents (currently Claude Code, Gemini, Droid, and Codex, with more planned), using automated Docker workflows and statistical analysis.

    Agent Definition
    # Web Search Agent Evaluations
    
    Evaluate multiple agents (Claude Code, Gemini, Droid, Codex) with different web search tools (builtin, You.com MCP) in isolated Docker containers.
    
    ## Overview
    
    This evaluation system runs a matrix comparison: 4 agents × 2 tools = 8 pairings, capturing full trajectories for analysis.
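
As a rough illustration of the matrix, the sketch below enumerates every agent and tool pairing. The agent identifiers are hypothetical spellings of the agents named above, and the tool identifiers `builtin` and `you` are borrowed from the pipeline diagram; the repository's own constants may differ.

```typescript
// Hypothetical identifiers for the four agents and two search tools.
const AGENTS = ["claude-code", "gemini", "droid", "codex"] as const;
const TOOLS = ["builtin", "you"] as const;

type Agent = (typeof AGENTS)[number];
type Tool = (typeof TOOLS)[number];

interface Pairing {
  agent: Agent;
  tool: Tool;
}

// Cartesian product of agents and tools: one entry per evaluation run.
function buildMatrix(): Pairing[] {
  const out: Pairing[] = [];
  for (const agent of AGENTS) {
    for (const tool of TOOLS) {
      out.push({ agent, tool });
    }
  }
  return out;
}

const pairings = buildMatrix();
console.log(pairings.length); // 8
```

Adding a fifth agent or a second MCP server to the arrays grows the matrix without any other change, which is one way to read the "extensible" claim.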
    
    **Key Features:**
    - **Headless adapters** - No custom code, just JSON schemas ([@plaited/agent-eval-harness](https://www.npmjs.com/package/@plaited/agent-eval-harness))
    - **Flag-based architecture** - Single service per agent, MCP mode selected via environment variable
    - **Type-safe constants** - MCP server definitions in `mcp-servers.ts`
    - **TypeScript entrypoint** - Bun shell script for runtime MCP configuration
    - **Isolated execution** - Each pairing runs in its own Docker container
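
The flag-based selection and type-safe constants might look roughly like the sketch below. It is not the contents of `mcp-servers.ts`: the `MCP_SERVERS` shape, the server name, the placeholder URL, the helper names, and the choice to default to `builtin` when the variable is unset are all assumptions made for illustration. Only the `SEARCH_PROVIDER` variable and the `builtin` / `you` values come from the pipeline diagram.

```typescript
// Hypothetical registry of MCP servers; name and URL are placeholders.
const MCP_SERVERS = {
  you: { name: "you-mcp", url: "https://example.invalid/mcp" },
} as const;

type McpProvider = keyof typeof MCP_SERVERS;
type SearchProvider = "builtin" | McpProvider;

// Read SEARCH_PROVIDER from an env-like object and validate it.
// Assumption: an unset variable falls back to "builtin".
function resolveSearchProvider(
  env: Record<string, string | undefined>,
): SearchProvider {
  const raw = env["SEARCH_PROVIDER"] ?? "builtin";
  if (raw === "builtin") return raw;
  if (Object.prototype.hasOwnProperty.call(MCP_SERVERS, raw)) {
    return raw as McpProvider;
  }
  throw new Error(`Unknown SEARCH_PROVIDER: ${raw}`);
}

// "builtin" means no MCP setup; any other provider maps to its server entry.
function mcpConfigFor(provider: SearchProvider) {
  return provider === "builtin" ? null : MCP_SERVERS[provider];
}

console.log(mcpConfigFor(resolveSearchProvider({ SEARCH_PROVIDER: "you" })));
```

Because `SearchProvider` is derived from the keys of the constant, adding a server to the registry extends the accepted values at the type level as well as at runtime.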
    
    ### Evaluation Pipeline
    
    ```mermaid
    flowchart TD
        Env[SEARCH_PROVIDER env var] --> Entrypoint[docker/entrypoint]
        Entrypoint -->|builtin| SkipMCP[Skip MCP setup]
        Entrypoint -->|you| ConfigMCP[Configure MCP via CLI]
    
        SkipMCP --> Harness[agent-eval-harness trials]
        ConfigMCP --> Harness
    
        Prompts[prompts.jsonl] --> Harness
        Schemas[agent-schemas/*.json] --> Harness
        Harness --> Results[data/results/YYYY-MM-DD/agent/tool.jsonl]
    ```
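
The output layout in the diagram (`data/results/YYYY-MM-DD/agent/tool.jsonl`) can be sketched as a small path builder plus a JSONL reader. The function names are hypothetical, and the use of the UTC date is an assumption; the harness may derive the date differently.

```typescript
// Build the results path for one agent/tool pairing on a given day.
// Assumption: the date segment is the UTC calendar date.
function resultsPath(date: Date, agent: string, tool: string): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD
  return `data/results/${day}/${agent}/${tool}.jsonl`;
}

// Parse JSONL text: one JSON value per non-empty line.
function parseJsonl(text: string): unknown[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as unknown);
}

const examplePath = resultsPath(new Date("2026-02-18T12:00:00Z"), "gemini", "you");
console.log(examplePath); // data/results/2026-02-18/gemini/you.jsonl
```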
    
    ### Analysis Pipeline
    
    ```mermaid
    flowchart LR
        Results[Trial Results<br/>YYYY-MM-DD/agent/tool.jsonl] --> Compare[compare.ts]
        Grader[inline-grader.ts] --> Compare
    
        Compare -->|weighted| Weighted[*-weighted.json<br/>Capability + Reliability]
        Compare -->|statistical| Statistical[*-statistical.json<br/>Bootstrap CIs]
    
        Weighted --> Report[report.ts]
        Statistical --> Report
    
        Report --> REPORT[REPORT.md<br/>Rankings + Analysis]
    
        style Compare fill:#e1f5ff
        style Report fill:#e1f5ff
        style REPORT fill:#d4edda
    ```
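
For readers unfamiliar with bootstrap confidence intervals, the sketch below shows a percentile bootstrap over per-trial scores. It is an illustration of the general technique, not the logic in `compare.ts`: the function names, the resample count, the 95% level, and the seeded generator (a common small PRNG known as mulberry32, used here only to make runs repeatable) are all assumptions.

```typescript
// Small seeded PRNG (mulberry32) so resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface MeanCI {
  mean: number;
  lower: number;
  upper: number;
}

// Percentile bootstrap CI for the mean of `scores`.
function bootstrapMeanCI(
  scores: number[],
  resamples = 1000,
  alpha = 0.05,
  seed = 42,
): MeanCI {
  if (scores.length === 0) throw new Error("need at least one score");
  const rand = mulberry32(seed);
  const n = scores.length;
  const means: number[] = [];
  for (let i = 0; i < resamples; i++) {
    // Resample n scores with replacement and record the mean.
    let sum = 0;
    for (let j = 0; j < n; j++) {
      sum += scores[Math.floor(rand() * n)];
    }
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const loIdx = Math.floor((alpha / 2) * resamples);
  const hiIdx = Math.min(resamples - 1, Math.floor((1 - alpha / 2) * resamples));
  const mean = scores.reduce((s, x) => s + x, 0) / n;
  return { mean, lower: means[loIdx], upper: means[hiIdx] };
}

const ci = bootstrapMeanCI([0, 1, 1, 0, 1, 1, 1, 0]);
console.log(ci);
```

Two pairings whose intervals overlap heavily are hard to rank with confidence, which is the usual reason for reporting intervals alongside point scores.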
    
    ## Latest Results
    
    📊 **[View Latest Evaluation Report](data/comparisons/2026-02-18/REPORT.md)** - Comprehensive analysis with quality rankings, performance metrics, tool call stat

    Tags

    agent-evaluation, ai-agents, benchmark, claude-code, codex, coding-agents, droid, evaluation-suite, gemini, headless-testing

