Read our papers on AI training, evaluation, and safety

Read our papers on AI training, evaluation, and safety

Read our papers on AI training, evaluation, and safety

TAU-bench extension: benchmarking policy-aware agents in realistic settings

September 24, 2025

September 24, 2025

Insights

Insights

Can your AI say no? In production, it’s a question that matters as much as raw capability. AI is gaining more responsibility in the real world and is capable of handling tasks like processing an invoice or taking a food order. Sometimes the right move is to complete the request, but other times the request breaches a policy, be it a blocked account or a payment flagged as suspicious. In this scenario, the correct response is to refuse. 

There can be real-world consequences if AI gets these decisions wrong, which is why benchmarks that test both performance and policy adherence matter so much.

Building benchmarks that test more than accuracy

Many AI benchmarks focus on how well a model completes a task, but that alone isn’t enough when real-world constraints are in play. In policy-bound environments, the correct decision isn’t always to proceed. The test needs to capture both capability and discipline, with results that clearly show whether the AI followed the rules it was given.

This is the principle behind extending benchmarks like TAU, which was first developed by Sierra Research to evaluate how well an AI agent manages dynamic conversations while operating within strict policies. Pushing this concept further allows for the creation of highly accurate, granular evaluations of an agent's real-world reliability.

The challenge: building reliable agentic benchmarks

Designing a benchmark to evaluate AI agents in realistic environments can prove tricky. Creating one that could be auto-evaluated across hundreds of tasks, with misalignment measured solely on the final database state, makes the job even more demanding. This is where Toloka comes in. 

Successfully building a benchmark of high caliber requires a combination of technical expertise in agentic systems and a commitment to producing high-quality data. The goal is to generate fully auto-evaluable test cases across any domain and any corporate policy. Each task simulates a conversation between a user and a customer service representative agent where an agent can use specific tools to interact with databases, ultimately deciding to complete or refuse the request based on policy.

Achieving deterministic evaluation

Since every evaluation needs to be exact, a deterministic reward system is essential. User prompts should describe realistic requests, each designed to test one specific aspect of a policy. A single reference solution must guide each task, making results auto-evaluable on a pass-or-fail basis and keeping scoring consistent.

Using a binary scoring model leaves no room for ambiguity. This demands two things: every task must have only one valid solution path, and the reward system must accurately represent the agent's behavior. Achieving this level of precision can be quite challenging. For instance, a zero score might not be caused by the agent’s failure, but by the user LLM hallucinating, or by a task allowing more than one possible solution, or by an error in the golden set itself. At the same time, an agent could score one even if it broke a policy that did not affect the database state, letting violations slip through undetected. Because such evaluation flaws make the data unreliable for reinforcement learning, rigorous quality control becomes absolutely essential. 

Every element of a task has to be correct and logically sound in order to achieve accurate evaluations. Each case is treated as a single data point, built from a user prompt and a defined solution path, all within a domain environment of policies, tools, and data. A flaw in any part of that environment could affect hundreds of test cases.

Finding the right balance 

The policies themselves must be carefully balanced. Policies that are too loose allow agents to pass without meeting the intended requirements. Policies that are too strict add unnecessary complexity and increase the quality assurance workload. The balance has to be just right: not so easy that the benchmark fails to surface meaningful insights, and not so difficult that it becomes unstable or impossible to evaluate automatically.

Toloka’s approach: a multi-stage pipeline for high-quality data

To achieve the highest quality, a structured, multi-stage approach is non-negotiable.

  1. Human-in-the-loop review: A human-in-the-loop system, where trained specialists and automated checks work together, is critical to validate every task. Using this combination helps to keep the benchmark stable and reliable. Evaluators are trained to understand the domain in detail, along with its policies and evaluation standards.

  2. Multi-layered validation process: Each task passes through multiple validation layers in an iterative process. If a task fails a check, it is returned with feedback for refinement, ensuring it meets all required standards before delivery. This keeps the benchmark stable and reliable on a task-by-task basis.

  3. Collaborative feedback loops: The project itself is designed to evolve. We use an open feedback loop and real-time observations to constantly refine project definitions and improve tooling. This strategic oversight keeps the entire benchmark aligned with technical requirements and allows us to adapt to new needs as they emerge.

The outcome is a dataset capable of testing AI agents in high-stakes, policy-driven environments without sacrificing either scale or precision. Each task can be evaluated automatically but is also designed to reveal weaknesses that a simpler benchmark might overlook.

Impact: TAU-Bench extensions in action

TAU-bench extensions are designed to be deliberately challenging, focusing on problems that even advanced agents struggle to solve. These benchmarks highlight where current systems fall short and create valuable training material for improvement.

To achieve this, Toloka’s tailored benchmarks provide broad coverage by spanning a wide array of databases, tools, and policies. This approach tests agents across diverse domains and a multitude of real-world scenarios. The framework is also fully customizable, allowing for the creation of benchmarks tailored to any specific field or use case.

Tasks are crafted to expose weaknesses in policy reasoning, using a range of techniques from benign requests to manipulative and adversarial prompts.

This kind of rigorous testing reveals failures in areas where accuracy matters most, such as security, account management, orders, and financial transfers.

Broader implications: A domain-agnostic framework

While this methodology is often applied to domains with a heavy customer service focus, it is fundamentally domain-agnostic. The same framework applies to any setting where an AI agent is required to handle complex interactions without breaking policy.

The outcome is a foundation for semi-automated systems that can make decisions under policy constraints at scale, from approving a purchase order to rejecting a service request that breaches a rule. In each case, the benchmark measures whether the agent can follow the right course of action without introducing errors.

This results in a framework that not only evaluates policy-aware decision-making at scale but also provides training material based on tasks that agents routinely fail on. It is the key to determining whether your AI really can say no—and do it for all the right reasons.

Bring policy-aware AI to your domain

See how our framework can work in your domain. Get in touch to find out how we can help you build AI agents that stay reliable under pressure and deliver results you can trust.

Subscribe to Toloka News

Case studies, product news, and other articles straight to your inbox.

Subscribe to Toloka News

Case studies, product news, and other articles straight to your inbox.

Subscribe to Toloka News

Case studies, product news, and other articles straight to your inbox.

More about Toloka

What is Toloka’s mission?

Where is Toloka located?

What is Toloka’s key area of expertise?

How long has Toloka been in the AI market?

How does Toloka ensure the quality and accuracy of the data collected?

How does Toloka source and manage its experts and AI tutors?

What types of projects or tasks does Toloka typically handle?

What industries and use cases does Toloka focus on?

What is Toloka’s mission?

Where is Toloka located?

What is Toloka’s key area of expertise?

How long has Toloka been in the AI market?

How does Toloka ensure the quality and accuracy of the data collected?

How does Toloka source and manage its experts and AI tutors?

What types of projects or tasks does Toloka typically handle?

What industries and use cases does Toloka focus on?

What is Toloka’s mission?

Where is Toloka located?

What is Toloka’s key area of expertise?

How long has Toloka been in the AI market?

How does Toloka ensure the quality and accuracy of the data collected?

How does Toloka source and manage its experts and AI tutors?

What types of projects or tasks does Toloka typically handle?

What industries and use cases does Toloka focus on?