5 agents available in the Gemini directory
The open benchmark for AI agent task execution. Claude Code vs Gemini CLI — who wins? Live leaderboard inside.
AI agent simulation framework
A unified evaluation toolkit and leaderboard for rigorously assessing the scientific intelligence of large language and vision–language models across the full research workflow.
End-to-end TeamCity framework to run AI agents on SWE-Bench Lite. Spin up isolated Docker images per task, extract patches, score with the official harness, and aggregate success rates. As an example, we'll look at Junie and Google Gemini CLI
Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.