End-to-end TeamCity framework to run AI agents on SWE-Bench Lite. Spin up isolated Docker images per task, extract patches, score with the official harness, and aggregate success rates. As an example, we'll look at Junie and Google Gemini CLI.
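The last step named above, aggregating success rates, can be illustrated with a minimal Python sketch. It is not the project's actual implementation: `TaskResult`, `success_rate`, and `rate_by_repo` are hypothetical names, the instance IDs are made up, and the sketch assumes each per-task run has already been reduced to an instance ID plus a resolved/unresolved flag. It also assumes the SWE-Bench convention that an instance ID ends in `-<number>`, so everything before the last hyphen identifies the repository.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; not the harness's own report format."""
    instance_id: str  # task identifier, assumed to end in "-<number>"
    resolved: bool    # True if the evaluation judged the patch as resolving the task


def success_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved; returns 0.0 for an empty run."""
    if not results:
        return 0.0
    return sum(r.resolved for r in results) / len(results)


def rate_by_repo(results: list[TaskResult]) -> dict[str, float]:
    """Group results by the part of the ID before the last hyphen."""
    grouped: dict[str, list[TaskResult]] = defaultdict(list)
    for r in results:
        repo = r.instance_id.rsplit("-", 1)[0]
        grouped[repo].append(r)
    return {repo: success_rate(items) for repo, items in grouped.items()}


# Made-up results for illustration only.
results = [
    TaskResult("example__repo-1", True),
    TaskResult("example__repo-2", False),
    TaskResult("other__lib-7", True),
    TaskResult("other__lib-8", True),
]
overall = success_rate(results)
per_repo = rate_by_repo(results)
```

Keeping the per-task record this small makes it easy to compare agents: the same two functions can be applied to the results of each agent's run.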
<div align="center">
<p>
<a href="#overview">Overview</a> •
<a href="#project-structure">Project structure</a> •
<a href="#setup-instructions">Setup instructions</a> •
<a href="#monitoring-and-results">Monitoring and results</a> •
<a href="https://www.jetbrains.com/teamcity/use-cases/ai/" target="_blank">TeamCity for AI agent evaluation ↗️</a>
</p>
</div>

# SWE-Bench AI Agent Testing with TeamCity

[TeamCity](https://www.jetbrains.com/teamcity/use-cases/ai/) gives you a reproducible CI/CD backbone for AI agent evaluation: orchestrate parallel, isolated (Docker) runs across benchmark tasks (e.g., SWE-Bench), validate patches automatically, and capture metrics, logs, and artifacts to track performance and costs at scale via Kotlin-DSL pipelines.

This TeamCity configuration provides a complete framework for testing AI agents against the SWE-Bench Lite dataset, which contains 300+ software engineering tasks from popular Python repositories.

## Overview

The system evaluates AI agents by:

1. Preparing isolated Docker environments for each SWE-Bench task
2. Running AI agents against specific coding problems
3. Evaluating solutions using the official SWE-Bench evaluation harness
4. Collecting performance metrics and success rates

## Architecture

### Projects Structure

- **JetBrains Junie AI Agent** (`JetBrain_Junie_AI_Agent.kt`)
  - Downloads Junie CLI from GitHub releases and IntelliJ IDEA
  - Creates task subsets for progressive testing
  - Individual task execution builds for all 300+ SWE-Bench tasks
- **Google Gemini CLI AI Agent** (`Google_Gemini_CLI_AI_Agent.kt`)
  - Builds from the official Google Gemini CLI repository (`https://github.com/google-gemini/gemini-cli.git`)
  - Uses Node.js execution environment with npm build process
  - Creates task subsets for progressive testing
- **SWE-Bench Lite** (`SWE_Bench_Lite.kt`): Core dataset and e
Google's AI-powered research notebook that ingests your documents and becomes an expert on your content. Generates audio overviews, study guides, FAQs, and interactive discussions from uploaded sources.
Google DeepMind's experimental AI agent that can navigate websites, fill forms, and complete multi-step browser tasks autonomously. Uses Gemini's multimodal understanding to interact with web interfaces.
Google DeepMind's universal AI assistant prototype that can see, hear, and respond in real-time through your device camera and microphone. Demonstrates the future of multimodal AI interaction.
Google Cloud's enterprise platform for building, deploying, and managing AI agents powered by Gemini. Supports multi-agent orchestration, tool integration, and enterprise governance.
Gemini's agentic research capability that autonomously browses the web, synthesizes information from dozens of sources, and produces comprehensive research reports on any topic.
Interactive coding and content creation agent that generates, previews, and iterates on code, documents, and interactive applications in a side panel. Supports HTML/CSS/JS, Python, and more.