Case Study: How Junie Uses TeamCity to Evaluate Coding Agents

Introduction

Junie is an intelligent coding agent developed by JetBrains. It automates the full development loop: reading project files, editing code, running tests, and applying fixes, going far beyond simple code generation.

Where developers might use a tool like ChatGPT to solve individual coding problems, Junie goes a step further and automates the entire process.

As the agent’s architecture evolved, the team needed a secure, robust way to measure progress. They wanted to build a scalable, reproducible evaluation pipeline that would be able to track changes across hundreds of tasks.

That’s where TeamCity came in. Junie’s development team uses TeamCity to orchestrate large-scale evaluations, coordinate Dockerized environments, and track important metrics that guide Junie’s improvements.

The challenge

Validating agent improvements at scale

As Junie’s agents became more capable, with new commands and smarter decision-making, every change needed to be tested for real impact. Evaluation had to be systematic, repeatable, and grounded in data.

“‘Did it get better or not?’ is a very poor way to evaluate. If I just try three examples from memory and see if it got better, that leads nowhere. That’s not how you achieve stable, consistent improvements. You need a benchmark with a large and diverse enough set of tasks to actually measure anything.”

Danila Savenkov, Team Lead, JetBrains Junie

The team identified five core requirements for this process:

  • Scale: Evaluations had to cover at least 100 tasks per run to minimize statistical noise. Running fewer tasks made it hard to draw meaningful conclusions.
  • Parallel execution: Tasks needed to be evaluated in parallel, as running them sequentially would take over 24 hours and delay feedback loops.
  • Reproducibility: It had to be possible to trace every evaluation back to the exact version of the agent, datasets, and environment used. Local experiments or inconsistent setups were not acceptable.
  • Cost control: Each evaluation involved significant LLM API usage, typically costing USD 100+ per run. Tracking and managing these costs was essential.
  • Data preservation: Results, logs, and artifacts needed to be stored reliably for analysis, debugging, and long-term tracking.

Benchmarking with SWE-bench

For a reliable signal, Junie adopted SWE-bench, a benchmark built from real GitHub issues and PRs. They also used SWE-bench Verified, a curated 500-task subset validated by OpenAI for clarity and feasibility.

In parallel, Junie created in-house benchmarks for their internal monorepo (Java/Kotlin), web stack, and Go codebases, continuously extending benchmark coverage to more languages and technologies.

The operational challenge

Running these large-scale evaluations posed operational challenges:

  • Spinning up consistent, isolated environments for each task.
  • Managing dependencies and project setups.
  • Applying patches generated by agents and running validations automatically.
  • Collecting structured logs and metrics for deep analysis.

Manual workflows wouldn’t scale. Junie needed automation that was fast, repeatable, and deeply integrated into their engineering stack.

TeamCity enabled that orchestration. With it, the Junie team built an evaluation pipeline that is scalable, traceable, and deeply integrated into their development loop.

The solution

To support reliable, large-scale evaluation of its coding agents, Junie implemented an evaluation pipeline powered by TeamCity, a CI/CD solution developed by JetBrains.

TeamCity orchestrates the execution of hundreds of tasks in parallel, manages isolated environments for each benchmark case, and coordinates patch validation and result collection.

“If we tried running this locally, it just wouldn’t be realistic. A single evaluation would take a full day. That’s why we use TeamCity: to do everything in parallel, isolated environments, and to ensure the results are reproducible.”

Danila Savenkov, Team Lead, JetBrains Junie

The setup enables the team to trace outcomes to specific agent versions, gather detailed logs for analysis, and run evaluations efficiently, while keeping infrastructure complexity and LLM usage costs under control.

Execution pipeline design

At the heart of the system is a composite build configuration defined using Kotlin DSL, which gives Junie full control over task orchestration. Each top-level evaluation run includes multiple build steps.

Example of a build chain in TeamCity
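
For illustration, a stripped-down Kotlin DSL sketch of such a chain might look like the following. This is not Junie’s actual configuration: the build names, scripts, parameters, and artifact names are all assumptions.

```kotlin
// Illustrative sketch only: not Junie's real settings.kts. Build names,
// scripts, parameters, and artifact names are assumptions.
import jetbrains.buildServer.configs.kotlin.*
import jetbrains.buildServer.configs.kotlin.buildSteps.script

version = "2024.03"

project {
    buildType(EnvironmentSetup)
    buildType(AgentExecution)
    buildType(PatchEvaluation)
    buildType(EvaluationRun)
}

// Prepares the task environment, e.g. by pulling a pre-built Docker image.
object EnvironmentSetup : BuildType({
    name = "Environment setup"
    steps {
        script { scriptContent = "docker pull registry.example.com/eval/task-env:latest" }
    }
})

// Runs the agent against the task and publishes the resulting patch.
object AgentExecution : BuildType({
    name = "Agent execution"
    artifactRules = "patch.diff" // the agent's output, assumed file name
    params { param("eval.task.id", "") } // per-run task identifier (assumption)
    steps {
        script { scriptContent = "./run_agent.sh --task %eval.task.id%" }
    }
    dependencies { snapshot(EnvironmentSetup) {} }
})

// Applies the generated patch and runs the validation suite.
object PatchEvaluation : BuildType({
    name = "Patch evaluation"
    steps {
        script { scriptContent = "./validate_patch.sh patch.diff" }
    }
    dependencies {
        snapshot(AgentExecution) {}
        artifacts(AgentExecution) { artifactRules = "patch.diff" }
    }
})

// Composite build that groups the whole chain into a single evaluation run.
object EvaluationRun : BuildType({
    name = "Evaluation run"
    type = BuildTypeSettings.Type.COMPOSITE
    dependencies { snapshot(PatchEvaluation) {} }
})
```

Defining the chain in code is what lets the team version, review, and scale the pipeline alongside the agent itself.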

Environment setup

Each coding task is paired with a dedicated environment, typically a pre-built Docker container with the necessary dependencies already installed. This guarantees consistency across runs and eliminates local setup variability​.
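
As a sketch of how a step can be pinned to such a pre-built image in the Kotlin DSL (the image name and command below are assumptions), TeamCity can run the step inside a specified container:

```kotlin
// Sketch: running a step inside a pre-built Docker image so every run of the
// task sees identical dependencies. Image name and command are assumptions.
import jetbrains.buildServer.configs.kotlin.*
import jetbrains.buildServer.configs.kotlin.buildSteps.script

object PinnedEnvironmentStep : BuildType({
    name = "Run task inside pre-built image"
    steps {
        script {
            scriptContent = "python -m pytest"                          // assumed command
            dockerImage = "registry.example.com/eval/django-12345:1.0"  // assumed pre-built task image
            dockerPull = true                                           // pull the exact image before running
        }
    }
})
```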

Agent execution

Junie’s agent is launched against the task. It receives a full prompt, including the issue description, code structure, system commands, and guidelines. It then autonomously works through the problem, issuing actions such as file edits, replacements, and test runs.

The final output is a code patch meant to resolve the issue.
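
Conceptually, the loop looks something like the sketch below. This is a drastic simplification, not Junie’s implementation: the action types and the decideNext callback are invented for illustration.

```kotlin
// Drastically simplified sketch of an agent loop of this kind. Not Junie's
// code: the action types and the decideNext callback are invented.
sealed interface AgentAction
data class EditFile(val path: String, val newContent: String) : AgentAction
data class RunCommand(val command: String) : AgentAction
data class SubmitPatch(val unifiedDiff: String) : AgentAction

fun solveTask(
    issuePrompt: String,
    decideNext: (prompt: String, observations: List<String>) -> AgentAction
): String {
    val observations = mutableListOf<String>()
    while (true) {
        when (val action = decideNext(issuePrompt, observations)) {
            // A real agent would write the file / execute the command and feed
            // the result back; here we only record what happened.
            is EditFile -> observations += "edited ${action.path}"
            is RunCommand -> observations += "ran: ${action.command}"
            is SubmitPatch -> return action.unifiedDiff // final output: the code patch
        }
    }
}
```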

Patch evaluation

The generated patch is passed to the next build step, where TeamCity applies it to the project and runs the validation suite. This mimics the GitHub pull request flow – if the original tests were failing and now pass, the task is marked as successfully completed.
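
A minimal sketch of such a validation build is shown below; the script, file names, and test command are assumptions (the real validation suite is task-specific).

```kotlin
// Sketch of a patch-validation build. Script contents, file names, and the
// test command are assumptions.
import jetbrains.buildServer.configs.kotlin.*
import jetbrains.buildServer.configs.kotlin.buildSteps.script

object ApplyAndValidate : BuildType({
    name = "Apply patch and re-run tests"
    steps {
        script {
            scriptContent = """
                set -e
                git apply patch.diff     # patch produced by the agent build
                python -m pytest         # originally failing tests must now pass
            """.trimIndent()
        }
    }
})
```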

Metric logging

Execution metadata, including logs, command traces, and success/failure flags, is exported to an open-source distributed storage and processing system. Junie uses it to store evaluation artifacts and perform large-scale analysis. 

With the solution’s support for SQL-like querying and scalable data processing, the team can efficiently aggregate insights across hundreds of tasks and track agent performance over time.

Developers rely on this data to:

  • Track the percentage of solved tasks (their “North Star” metric).
  • Analyze the average cost per task for LLM API usage.
  • Break down agent behavior (like the most frequent commands or typical failure points).
  • Compare performance between agent versions.
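
For illustration, the aggregation behind the first two metrics above could be as simple as the following sketch; the TaskResult fields and sample values are assumptions rather than Junie’s actual result schema.

```kotlin
// Minimal aggregation sketch. TaskResult fields and sample values are
// assumptions, not Junie's actual result schema.
data class TaskResult(val taskId: String, val solved: Boolean, val llmCostUsd: Double)

fun summarize(results: List<TaskResult>) {
    val solvedRate = 100.0 * results.count { it.solved } / results.size
    val avgCost = results.map { it.llmCostUsd }.average()
    println("Solved %.1f%% of %d tasks".format(solvedRate, results.size))
    println("Average LLM cost per task: USD %.2f".format(avgCost))
}

fun main() {
    summarize(
        listOf(
            TaskResult("django-12345", solved = true, llmCostUsd = 0.80),
            TaskResult("sympy-67890", solved = false, llmCostUsd = 1.40),
        )
    )
}
```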

Scalability through automation

By using Kotlin DSL and TeamCity’s composable build model, Junie scales evaluations to hundreds of tasks per session – far beyond what could be managed manually. For larger datasets (typically 300–2,000 tasks), task executions are spun up in parallel, minimizing runtime and allowing the team to test changes frequently.

“We use Kotlin DSL to configure everything. When you have 13 builds, you can still manage them manually, but when it’s 399, or 500, or 280, it starts getting tricky.”

Danila Savenkov, Team Lead, JetBrains Junie
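
This is where generating configurations programmatically pays off. The sketch below shows what that can look like in the Kotlin DSL; the task IDs, naming scheme, and scripts are assumptions, not Junie’s setup.

```kotlin
// Sketch of generating one build configuration per benchmark task. The task
// IDs, naming scheme, and script are assumptions.
import jetbrains.buildServer.configs.kotlin.*
import jetbrains.buildServer.configs.kotlin.buildSteps.script

version = "2024.03"

project {
    // One build per task; in practice the list would come from the dataset.
    val taskIds = (1..500).map { "task-%04d".format(it) }

    val taskBuilds = taskIds.map { taskId ->
        BuildType {
            id("Eval_${taskId.replace('-', '_')}")
            name = "Evaluate $taskId"
            steps {
                script { scriptContent = "./run_eval.sh $taskId" }
            }
        }
    }
    taskBuilds.forEach { buildType(it) }

    // Composite build that runs all per-task builds in parallel and
    // aggregates them into a single evaluation session.
    buildType(BuildType {
        id("EvaluationSession")
        name = "Evaluation session"
        type = BuildTypeSettings.Type.COMPOSITE
        dependencies { taskBuilds.forEach { snapshot(it) {} } }
    })
}
```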

Results: reproducible, scalable, insight-driven agent development

TeamCity has enabled Junie to measure agent performance efficiently and at scale, making their development process faster, more reliable, and data-driven.

Key outcomes

Challenge                        | Result with TeamCity
Validate agent changes at scale  | 100+ tasks per run, reducing statistical noise
Long evaluation cycles (24+ hrs) | Tasks run in parallel – now completed in a manageable window
Inconsistent local testing       | Every run is reproducible and traceable to the exact agent and dataset
Expensive LLM usage              | Per-task usage is tracked, helping optimize development and costs
Fragile logging and data loss    | Logs and outcomes are automatically stored for later debugging and review

Need to scale your AI workflows?

TeamCity gives you the infrastructure to evaluate and iterate with confidence. Start your free trial or request a demo.