teamcity-ai-agent-testing-demo

Author: JetBrains

JetBrains July 1, 2025


End-to-end TeamCity framework to run AI agents on SWE-Bench Lite. Spin up isolated Docker images per task, extract patches, score them with the official harness, and aggregate success rates. As an example, we'll look at Junie and Google Gemini CLI.

<div align="center"> <p> <a href="#overview">Overview</a> • <a href="#project-structure">Project structure</a> • <a href="#setup-instructions">Setup instructions</a> • <a href="#monitoring-and-results">Monitoring and results</a> • <a href="https://www.jetbrains.com/teamcity/use-cases/ai/" target="_blank">TeamCity for AI agent evaluation ↗️</a> </p> </div>

SWE-Bench AI Agent Testing with TeamCity

TeamCity gives you a reproducible CI/CD backbone for AI agent evaluation: orchestrate parallel, isolated (Docker) runs across benchmark tasks (e.g., SWE-Bench), validate patches automatically, and capture metrics, logs, and artifacts to track performance and costs at scale via Kotlin-DSL pipelines.

This TeamCity configuration provides a complete framework for testing AI agents against the SWE-Bench Lite dataset, which contains 300+ software engineering tasks from popular Python repositories.
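To make the "one isolated container per task" idea concrete, here is a minimal Python sketch. It assumes the usual SWE-Bench instance ID convention of `<owner>__<repo>-<number>`. The image prefix `swe-task` and the container name pattern are hypothetical placeholders, not names taken from this project, and the command is only built, never executed.

```python
from collections import defaultdict


def parse_instance_id(instance_id: str) -> tuple[str, int]:
    """Split an ID such as 'django__django-11099' into ('django/django', 11099)."""
    owner, rest = instance_id.split("__", 1)
    # The repository name may itself contain hyphens, so split on the last one.
    name, number = rest.rsplit("-", 1)
    return f"{owner}/{name}", int(number)


def group_by_repo(instance_ids: list[str]) -> dict[str, list[str]]:
    """Group task IDs by the repository they belong to, preserving input order."""
    groups: dict[str, list[str]] = defaultdict(list)
    for instance_id in instance_ids:
        repo, _ = parse_instance_id(instance_id)
        groups[repo].append(instance_id)
    return dict(groups)


def docker_run_command(instance_id: str, image_prefix: str = "swe-task") -> list[str]:
    """Build (but do not run) a `docker run` argv for a single task.

    `image_prefix` and the image/container naming are assumptions for
    illustration; the real project may name its images differently.
    """
    return [
        "docker", "run", "--rm",
        "--name", f"run-{instance_id}",
        f"{image_prefix}:{instance_id}",
    ]
```

Returning the command as a list rather than a shell string keeps each argument separate, so it can be passed to `subprocess.run` without quoting concerns.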

Overview

The system evaluates AI agents by:

- Preparing isolated Docker environments for each SWE-Bench task
- Running AI agents against specific coding problems
- Evaluating solutions using the official SWE-Bench evaluation harness
- Collecting performance metrics and success rates
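The last two steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's actual code: it assumes the harness consumes one JSON record per task with the keys `instance_id`, `model_name_or_path`, and `model_patch`, and that the harness reports which instance IDs were resolved. The function names and the agent label `demo-agent` are hypothetical.

```python
import json


def build_predictions(patches: dict[str, str], agent_name: str) -> list[dict]:
    """Turn {instance_id: unified diff} into prediction records for scoring."""
    return [
        {
            "instance_id": instance_id,
            "model_name_or_path": agent_name,
            "model_patch": patch,
        }
        for instance_id, patch in sorted(patches.items())
    ]


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def success_rate(attempted_ids: list[str], resolved_ids: set[str]) -> float:
    """Fraction of attempted tasks that the harness marked as resolved.

    Tasks with no patch still count as attempted, so they lower the rate.
    """
    if not attempted_ids:
        return 0.0
    resolved = sum(1 for instance_id in attempted_ids if instance_id in resolved_ids)
    return resolved / len(attempted_ids)
```

For example, `success_rate(["a", "b", "c", "d"], {"a", "c"})` returns `0.5`. Dividing by the number of attempted tasks rather than the number of submitted patches keeps the rate comparable between agents that fail in different ways.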

Architecture

Project Structure

JetBrains Junie AI Agent (JetBrain_Junie_AI_Agent.kt)
- Downloads Junie CLI from GitHub releases and IntelliJ IDEA
- Creates task subsets for progressive testing
- Defines individual task execution builds for all 300+ SWE-Bench tasks

Google Gemini CLI AI Agent (Google_Gemini_CLI_AI_Agent.kt)
- Builds from the official Google Gemini CLI repository (https://github.com/google-gemini/gemini-cli.git)
- Uses a Node.js execution environment with an npm build process
- Creates task subsets for progressive testing

SWE-Bench Lite (SWE_Bench_Lite.kt): Core dataset and e
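Both agent projects mention task subsets for progressive testing, but the text does not say how those subsets are chosen. One plausible approach, shown here purely as an assumption, is to take nested prefixes of a seeded shuffle, so that a small smoke-test subset is always contained in the larger ones. The function name, subset sizes, and seed are hypothetical.

```python
import random


def progressive_subsets(task_ids: list[str], sizes: list[int], seed: int = 0) -> dict[int, list[str]]:
    """Return nested task subsets keyed by requested size.

    IDs are sorted before shuffling so the result depends only on the set
    of IDs and the seed, not on the order in which they were supplied.
    """
    ordered = sorted(task_ids)
    random.Random(seed).shuffle(ordered)
    # Each subset is a prefix of the same ordering, so smaller subsets are
    # always contained in larger ones. Oversized requests return every task.
    return {size: ordered[:size] for size in sorted(sizes)}
```

Running the smallest subset first gives quick feedback on a broken agent setup before compute is spent on the full task list.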
