AgentBench-Live

The benchmark that tests what agents do, not what they say.

Real tasks. Real sandboxes. Real scores. No vibes.

Live Leaderboard | Methodology | Contributing

</div>

Most agent benchmarks test toy problems or let agents self-report. AgentBench-Live drops agents into Docker-sandboxed workspaces with real codebases, real data files, and real multi-step workflows — then scores them automatically with test suites and LLM judges.

Why not SWE-bench / OpenHarness? Those benchmarks measure a single axis (GitHub issue resolution). We test 5 capability domains (code, data analysis, multi-step orchestration, research, and tool use) because real work isn't just fixing bugs.

Leaderboard

10 tasks across 5 domains | Docker sandbox | Auto-eval + LLM Judge scoring

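| Agent | Code | Data | Multi-Step | Research | Tool Use | Overall |
|---|---|---|---|---|---|---|
| Claude Code | 1.00 | 0.07 | 0.74 | 0.60 | 0.25 | 0.53 |
| Gemini CLI | 1.00 | 0.32 | 0.77 | 0.45 | 0.05 | 0.52 |
| Codex CLI | - | - | - | - | - | pending |
| Aider | - | - | - | - | - | pending |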
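
The Overall column is consistent with an unweighted mean of the five domain scores, rounded to two decimals. That is an inference from the published numbers, not a documented formula, so treat the sketch below as a sanity check under that assumption. The `overall` helper and the `scores` dictionary are illustrative names, not part of the benchmark's code.

```python
from statistics import mean

# Domain order as it appears in the leaderboard columns.
DOMAINS = ["Code", "Data", "Multi-Step", "Research", "Tool Use"]

# Per-domain scores copied from the leaderboard rows that have results.
scores = {
    "Claude Code": [1.00, 0.07, 0.74, 0.60, 0.25],
    "Gemini CLI": [1.00, 0.32, 0.77, 0.45, 0.05],
}


def overall(domain_scores):
    """Assumed aggregation: unweighted mean of domain scores, 2 decimals."""
    return round(mean(domain_scores), 2)


for agent, row in scores.items():
    print(f"{agent}: {overall(row):.2f}")
```

Under this assumption every domain carries equal weight, so a near-zero score in one domain (Data for Claude Code, Tool Use for Gemini CLI) pulls the overall figure down as much as a perfect Code score lifts it.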

<details> <summary><b>Full task-level breakdown</b></summary>
