The open benchmark for AI agent task execution. Claude Code vs Gemini CLI — who wins? Live leaderboard inside.
<div align="center">

# AgentBench-Live

### The benchmark that tests what agents *do*, not what they *say*.

Real tasks. Real sandboxes. Real scores. No vibes.

[](https://github.com/jackjin1997/AgentBench-Live/actions/workflows/ci.yml) [](tests/) [](tests/) [](https://opensource.org/licenses/MIT) [](https://www.python.org/downloads/) [](https://github.com/jackjin1997/AgentBench-Live/releases/tag/v0.1.0)

[Live Leaderboard](https://jackjin1997.github.io/AgentBench-Live/) | [Methodology](docs/methodology.md) | [Contributing](CONTRIBUTING.md)

</div>

---

Most agent benchmarks test toy problems or let agents self-report. AgentBench-Live drops agents into **Docker-sandboxed workspaces** with real codebases, real data files, and real multi-step workflows, then scores them automatically with test suites and LLM judges.

**Why not SWE-bench / OpenHarness?** Those benchmark a single axis (GitHub issue resolution). We test **5 capability domains** (code, data analysis, multi-step orchestration, research, and tool use) because real work isn't just fixing bugs.

---

## Leaderboard

> **10 tasks across 5 domains** | Docker sandbox | Auto-eval + LLM Judge scoring

| Agent | Code | Data | Multi-Step | Research | Tool Use | **Overall** |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| **Claude Code** | 1.00 | 0.07 | 0.74 | 0.60 | 0.25 | **0.53** |
| **Gemini CLI** | 1.00 | 0.32 | 0.77 | 0.45 | 0.05 | **0.52** |
| Codex CLI | - | - | - | - | - | *pending* |
| Aider | - | - | - | - | - | *pending* |

<details>
<summary><b>Full task-level breakdown</b></summ
Google's AI-powered research notebook that ingests your documents and becomes an expert on your content. Generates audio overviews, study guides, FAQs, and interactive discussions from uploaded sources.
Google DeepMind's experimental AI agent that can navigate websites, fill forms, and complete multi-step browser tasks autonomously. Uses Gemini's multimodal understanding to interact with web interfaces.
Google DeepMind's universal AI assistant prototype that can see, hear, and respond in real time through your device's camera and microphone. Demonstrates the future of multimodal AI interaction.
Google Cloud's enterprise platform for building, deploying, and managing AI agents powered by Gemini. Supports multi-agent orchestration, tool integration, and enterprise governance.
Gemini's agentic research capability that autonomously browses the web, synthesizes information from dozens of sources, and produces comprehensive research reports on any topic.
Interactive coding and content creation agent that generates, previews, and iterates on code, documents, and interactive applications in a side panel. Supports HTML/CSS/JS, Python, and more.