Supporting the launch of JetBrains’ Developer Productivity AI Arena

on February 12, 2026

When JetBrains launched the Developer Productivity AI Arena (DPAI Arena) in October, they set a new standard for benchmarking AI agents. We were proud to work alongside their team to provide the production-grade data that helped turn this vision into a reality at launch.

The goal was to create production-grade tasks grounded in real workflows, where success is measured not by whether code simply compiles, but by how a change integrates into an existing codebase and test environment.

When benchmarks stop reflecting real development

AI coding agents are moving into real development workflows, but many existing benchmarks are built around narrow, fixed setups, often tied to a single language, framework, or task format. That makes them hard to evolve as real-world engineering practices change.

In the DPAI Arena, tasks are derived from real GitHub issues rather than synthetic prompts. While the Arena supports multiple languages, the initial focus was on enterprise-style Java workloads built on the Spring framework. Large production systems are still overwhelmingly built in Java, often on Spring, a reality most benchmarks don’t reflect.

Turning engineering problems into benchmark-ready tasks

JetBrains needed tasks for the Developer Productivity AI Arena before launch, and we worked with their team to build them. The goal was to translate real engineering problems into tasks that could withstand rigorous evaluation without falling apart under scrutiny.

Our work began with real GitHub issues from open-source projects. One issue, for example, asked developers to “implement API endpoints for roadmap visualization and reporting.” On its own, that left too much open to interpretation. It wasn’t clear what data should be returned, how calculations should be handled, or what would count as a correct result.

We rewrote issues like this into technical tasks that defined the expected behavior end to end. The goal wasn’t to simplify the problem so much as to make it testable. Turning vague requests into explicit specifications with clear acceptance criteria meant each task could be validated deterministically and used for reliable agent evaluation.
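To make “validated deterministically” concrete: a spec like this can be pinned to an acceptance test in the Spring/JUnit style used across the Arena’s Java tracks. The sketch below is purely illustrative, with a hypothetical endpoint path, test class, and expected values that assume a fixed data fixture; it is not the actual benchmark content.

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.web.servlet.AutoConfigureMockMvc;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.web.servlet.MockMvc;

import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.get;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.jsonPath;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

// Hypothetical acceptance test for a roadmap reporting endpoint.
// The path, query parameter, and JSON fields are illustrative only.
@SpringBootTest
@AutoConfigureMockMvc
class RoadmapReportApiTest {

    @Autowired
    MockMvc mockMvc;

    @Test
    void reportAggregatesIssueCountsPerMilestone() throws Exception {
        // A correct change must return verifiable data for a known fixture,
        // not just a 200 status with plausible-looking JSON.
        mockMvc.perform(get("/api/roadmap/report").param("milestone", "1.2"))
                .andExpect(status().isOk())
                .andExpect(jsonPath("$.milestone").value("1.2"))
                .andExpect(jsonPath("$.openIssues").value(4))
                .andExpect(jsonPath("$.closedIssues").value(11));
    }
}

Because the expected counts come from a fixed fixture, a test like this passes or fails identically on every run, which is what makes comparisons between agents meaningful.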

We worked closely with JetBrains’ engineers to deliver tasks into the Arena and adapt our task-building process to its Multi-Track architecture. Integration improved over time through direct collaboration, as both teams adjusted how tasks were produced and brought into the platform.

Some issues required more than a direct fix. In one case, we started from a broad epic to build release planning and roadmapping tools. The scope covered too much at once, which made it unsuitable as a single benchmark task.

We narrowed the problem by separating it into independent tasks that could be evaluated on their own. One task focused only on extending the Release data model and updating the relevant APIs so planning information could be stored and retrieved. The requirements were explicit and the outcome could be tested directly, which made the task suitable for deterministic evaluation.
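As a minimal sketch, assuming a standard Spring Boot and Spring Data JPA test setup, the narrowed task boils down to extending a Release entity with planning fields and verifying they survive a store-and-retrieve cycle. The class, field, and repository names below are hypothetical, not the project’s actual model.

import java.time.LocalDate;

import jakarta.persistence.Entity;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.Id;

import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.autoconfigure.orm.jpa.DataJpaTest;
import org.springframework.data.jpa.repository.JpaRepository;

import static org.junit.jupiter.api.Assertions.assertEquals;

// Hypothetical extension of a Release entity with planning fields.
// Names and types are illustrative, not the repository's actual model.
@Entity
class Release {

    @Id
    @GeneratedValue
    Long id;

    String version;

    // Fields added by the narrowed task: planned start and ship dates,
    // stored alongside the existing release data.
    LocalDate plannedStart;
    LocalDate plannedShip;

    protected Release() { } // required by JPA

    Release(String version, LocalDate plannedStart, LocalDate plannedShip) {
        this.version = version;
        this.plannedStart = plannedStart;
        this.plannedShip = plannedShip;
    }
}

// Hypothetical Spring Data repository exposing the extended model.
interface ReleaseRepository extends JpaRepository<Release, Long> { }

// Deterministic check: planning data must survive a full store/retrieve cycle.
@DataJpaTest
class ReleasePlanningTest {

    @Autowired
    ReleaseRepository releases;

    @Test
    void planningDatesAreStoredAndRetrieved() {
        Release saved = releases.save(
                new Release("2.0", LocalDate.of(2026, 3, 1), LocalDate.of(2026, 6, 15)));

        Release loaded = releases.findById(saved.id).orElseThrow();
        assertEquals(LocalDate.of(2026, 6, 15), loaded.plannedShip);
    }
}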

Building benchmark-ready tasks took serious effort, with a single task requiring 20 to 25 hours between a coding expert and a QA reviewer. Some even went through multiple completion and review cycles before they were ready. In total, we delivered 27 tasks that represented 551 hours of expert engineering work.

From conception to delivery

The main delivery phase happened as JetBrains prepared the Arena for external contributors ahead of the public announcement. We provided benchmark tasks built directly on real Spring-based open-source repositories used in production-style environments, including projects such as spring-petclinic and train-ticket. This meant agents had to operate inside existing codebases rather than solve isolated or synthetic problems.

Regular check-ins helped surface edge cases at an early stage, which meant feedback could be incorporated without slowing delivery.

Making a new benchmark credible at launch

Success meant JetBrains could launch the Arena on schedule with tasks that tested the challenges AI coding agents face in production.

Each task required agents to work within existing codebases and produce changes that integrated into live repositories. Where GitHub issues were ambiguous or under-specified (common in open-source development), we turned them into precise technical requirements that could be validated deterministically. Agents couldn't pass by generating plausible-looking code. They had to solve the engineering problem as defined.

The task design and review process also clarified what "quality" meant for agent evaluation. Expectations around task definition and validation became explicit, creating a foundation external contributors could understand and build on. Clarity mattered as JetBrains opened the Arena to vendors, framework maintainers, and users.

The launch prioritized trust over scale. Instead of maximizing task volume, the Arena established a standard for realism and reproducibility that supports meaningful comparison and long-term growth.

Supporting the launch

The Developer Productivity AI Arena launched with tasks that reflected real development work. We helped JetBrains meet their release timeline and deliver benchmark content alongside their internal efforts.

Realistic agent evaluation requires precisely defined tasks with validated reference solutions, backed by deterministic tests that differentiate agent behavior. As the Arena grows and its approach to representativeness evolves, this standard remains part of the platform's foundation.

Toloka delivers expert-level data for AI evaluation platforms. If you're building benchmarks that need to launch credibly, let's talk.
