From word docs to data analysis: Evaluating AI agent performance across everyday apps

October 1, 2025

Customer cases

Challenge:

A leading LLM developer needed to prove its AI agents could handle everyday computer tasks but lacked a robust way to measure real-world performance.

Solution:

Toloka deployed hundreds of trained annotators within a week to create realistic prompts spanning multiple common applications.

Impact:

Revealed clear strengths and gaps, guiding targeted improvements and establishing a repeatable benchmark for future testing.

When an AI company says their agent can "use computer applications," what does that actually mean? Can it open a calculator and add two numbers? Sure. But can it handle the reality of software, like the way Excel sometimes takes forever to load, or how Wikipedia's search suggestions can send you down a rabbit hole? Does it know that drawing applications have their tools scattered across different menus?

A major LLM developer found itself asking the same questions. They'd built an agent that could perform basic computer actions, such as mouse movements, clicks, and typing. But there's a gap between having those capabilities and being able to complete actual tasks that people care about.

The problem wasn't technical specs or benchmark scores. It was more fundamental: nobody really knew whether the agent could handle the day-to-day, essential work that happens on computers. Opening files. Following multi-step instructions. Dealing with the thousand small interface decisions that human users make without thinking.

So they needed to find out. Not through more demos or cherry-picked examples, but through systematic testing across the applications people actually use.

The challenge of proving performance beyond demos

The AI agent had all the right technical elements. It could move a cursor with precision, click on interface elements, type text, press keyboard shortcuts, and capture screenshots. On a technical level, it worked.
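Primitives like these are typically exposed to the agent as a small, discrete action space, and an evaluated run is just a sequence of such actions. The sketch below illustrates the general idea; the type names, fields, and example trajectory are illustrative assumptions, not the client's actual interface:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ActionType(Enum):
    MOVE = "move"              # move the cursor to (x, y)
    CLICK = "click"            # click at a given screen position
    TYPE = "type"              # type a string of text
    HOTKEY = "hotkey"          # press a keyboard shortcut, e.g. ("ctrl", "s")
    SCREENSHOT = "screenshot"  # capture the screen for the model to inspect


@dataclass
class AgentAction:
    """One low-level step emitted by the agent; a trajectory is a list of these."""
    action: ActionType
    position: Optional[Tuple[int, int]] = None   # pixel coordinates, for MOVE/CLICK
    text: Optional[str] = None                   # payload for TYPE
    keys: Optional[Tuple[str, ...]] = None       # payload for HOTKEY


# Hypothetical trajectory fragment: open a file dialog, type a file name, confirm.
trajectory = [
    AgentAction(ActionType.HOTKEY, keys=("ctrl", "o")),
    AgentAction(ActionType.TYPE, text="quarterly_report.xlsx"),
    AgentAction(ActionType.HOTKEY, keys=("enter",)),
    AgentAction(ActionType.SCREENSHOT),
]
```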

But working in isolation is different from working in practice. The real test was whether the agent could navigate the interface quirks of different applications without getting lost when an app displayed a formatting dialog or when a file took longer than expected to load.

The client needed answers, but not from a handful of curated demos. Understanding their agent's capabilities meant testing across the full spectrum of everyday computer tasks, from simple calculator operations to multi-step data analysis workflows, and seeing how it performed across different software categories, each with its own interface conventions and unexpected behaviors.

Most importantly, they needed scale. A few dozen test cases wouldn't reveal the systematic patterns of where the agent succeeded and where it consistently failed. It was important to create thousands of realistic scenarios that would expose both the obvious failures and the subtle edge cases that only emerge when you test comprehensively.

That meant scaling a complex data operation to generate thousands of annotations of AI agent task trajectories across specific software categories. Each annotation required human evaluators to assess not just whether the agent completed a task, but how well it handled the inevitable complications that crop up in real software environments.

A solution built on speed and scale with Toloka

Speed mattered. Within 24 hours, Toloka had over one hundred annotators creating tasks; within five days, we reached our target throughput. All of the annotators were familiar with common desktop applications, and each had been tested on their ability to evaluate computer task completion before joining the project.

The annotators created queries across specific categories of everyday computer use. The tasks focused on reliability in everyday work, such as following multi-step instructions and handling formatting requests. Annotators then evaluated the accuracy of mouse movements, the precision of clicks, and how well each response handled the small complications that inevitably arise when software doesn't behave exactly as expected.
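Concretely, each evaluation can be thought of as a structured record attached to one agent trajectory. The sketch below shows one way such a record might look, based on the criteria described above; the schema, field names, and rating scales are assumptions for illustration, not the project's actual annotation format:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrajectoryAnnotation:
    """Hypothetical annotator verdict for a single agent trajectory."""
    task_id: str                # the prompt the agent was asked to complete
    app_category: str           # e.g. "spreadsheets", "word processing", "drawing"
    task_completed: bool        # did the agent reach the goal state?
    click_precision: int        # 1-5: were clicks on the intended interface elements?
    followed_all_steps: bool    # were multi-step instructions fully followed?
    handled_complications: int  # 1-5: slow loads, dialogs, unexpected pop-ups
    notes: List[str] = field(default_factory=list)  # free-text failure descriptions


example = TrajectoryAnnotation(
    task_id="excel-format-table-017",
    app_category="spreadsheets",
    task_completed=False,
    click_precision=4,
    followed_all_steps=False,
    handled_complications=2,
    notes=["Got stuck when the number-format dialog opened over the table."],
)
```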

How we made sure annotators stayed consistent

Quality control was built into every step of data collection. We established a direct feedback loop with the client, providing weekly progress updates and immediate flagging of recurring agent errors.

The quality control operated on multiple levels:

  • Annotator consistency: Constant monitoring of how consistently different annotators were evaluating the same tasks (a minimal agreement check is sketched after this list). When agreement rates started dropping, it signaled that evaluation criteria needed clarification.

  • Automated and manual checks: Regular spot checks caught issues early, while every interaction flagged as 'perfect' or an unusual edge case underwent manual review to verify accuracy.

  • Performance metrics: Task completion times revealed bottlenecks where either the agent or the instructions were causing confusion, helping identify problem areas before they became systematic issues.
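For the annotator-consistency check, one simple approach is to route an overlap batch of trajectories to several annotators and track how often their labels match (or compute a chance-corrected statistic such as Cohen's kappa). The sketch below is a minimal version of that idea, assuming binary task-completion labels and made-up data; the threshold and names are illustrative, not the project's actual tooling:

```python
from itertools import combinations


def pairwise_agreement(labels_by_annotator: dict[str, list[int]]) -> float:
    """Fraction of items on which each pair of annotators gave the same label."""
    pairs = list(combinations(labels_by_annotator.values(), 2))
    matches = sum(sum(a == b for a, b in zip(x, y)) for x, y in pairs)
    total = sum(len(x) for x, _ in pairs)
    return matches / total


# Three annotators labelling the same five trajectories (1 = completed, 0 = failed).
overlap_batch = {
    "annotator_a": [1, 0, 1, 1, 0],
    "annotator_b": [1, 0, 1, 0, 0],
    "annotator_c": [1, 1, 1, 1, 0],
}

agreement = pairwise_agreement(overlap_batch)
if agreement < 0.8:  # threshold is an illustrative choice
    print(f"Agreement dropped to {agreement:.0%}; review the evaluation guidelines.")
else:
    print(f"Agreement at {agreement:.0%}; criteria look consistent.")
```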

Rather than just catching errors after the fact, this approach provided real-time insights into the agent's behavioral patterns and highlighted which areas needed the most development attention.

The agent passed most tests, but the failures told the real story

The real value wasn't in the dataset size as much as it was in what the data revealed about where AI agents succeed and where they consistently break down.

The client now had something no amount of internal testing could provide – a systematic view of their agent's performance patterns across real software environments. Not just whether it could complete tasks, but how it handled the unexpected moments that define actual computer use: the interface quirks, the loading delays, the pop-up dialogs that derail perfectly planned task sequences.

More importantly, as they improved their agent, they could run the same evaluation framework to measure progress. The dataset became both a diagnostic tool and a development roadmap, showing exactly which capabilities needed work and which were ready for real-world deployment.

Most AI companies are still doing evaluation with handpicked examples that make their agents look good. But handpicked examples don't show what happens when applications freeze mid-task or when interface elements appear in unexpected places. Real software is messy and unpredictable, and agents that can't handle that messiness aren't ready for actual work.

Want to know how your AI agent performs on real computer tasks? Let's find out.

