Toloka Arena.
Independent evaluation of agentic intelligence

Compare leading LLMs on our suite of private benchmarks.
See how frontier models actually perform on tasks they've never been trained on.

Trusted by Leading AI Teams

Agentic intelligence index

Composite pass^5 score across Tool use evaluations (higher is better).
Error bars show 95% confidence intervals. Browser use coming soon.
We test models on private, non-contaminated tasks. Here's what we found.

Last updated: April 6

RL / evaluation datasets available

The benchmarks powering this leaderboard are available for purchase.
License our RL Gyms and evaluation data across many domains to train and test your own models.

Performance vs. cost

Score (pass^5) vs. average inference cost per task ($)

Bubble size represents model parameter class. Ideal: top-left (high score, low cost).


Models over time

Score (pass^5) vs. model release date

Axes: composite pass^5 score (%) vs. release date (Nov 2025 – Apr 2026).
Models plotted: Gemini 3 Flash, Opus 4.6, Sonnet 4.6, Gemini 3.1 Pro, GPT-5.4.
Providers: Anthropic, OpenAI, Google.

Tokens used to run Toloka benchmarks

Tokens in, tokens out, and total tokens
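To connect these token counts to the cost axis above: a minimal sketch of how average inference cost per task can be derived from per-task token usage. The per-million-token prices below are placeholders for illustration, not actual provider rates.

```python
def avg_cost_per_task(tokens_in: list[int], tokens_out: list[int],
                      price_in_per_m: float, price_out_per_m: float) -> float:
    """Average inference cost per task from per-task token counts and
    per-million-token prices (prices here are illustrative placeholders)."""
    total = sum(
        ti / 1e6 * price_in_per_m + to / 1e6 * price_out_per_m
        for ti, to in zip(tokens_in, tokens_out)
    )
    return total / len(tokens_in)

# Hypothetical token counts for three tasks, with made-up prices
print(avg_cost_per_task([12_000, 30_000, 8_500], [2_000, 5_400, 1_100],
                        price_in_per_m=3.00, price_out_per_m=15.00))
```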

Evaluation areas

Tool use

7 Domains

Multi-turn task completion with MCP-like tools, policy adherence, and database operations across industry verticals. A minimal sketch of this interaction pattern follows the domain list.

Manufacturing

Airbnb

Telecom

Airlines

+5 more
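For a sense of what these evaluations exercise, here is a minimal, hypothetical sketch of a tool-use episode: an agent's tool calls execute against a live in-memory database guarded by a business rule. All names (lookup_booking, cancel_booking) and data are invented for illustration and are not the actual benchmark tools.

```python
# Toy "live database" for a single hypothetical domain
DB = {"BK123": {"guest": "A. Rivera", "status": "confirmed"}}

def lookup_booking(booking_id: str) -> dict:
    return DB.get(booking_id, {"error": "not found"})

def cancel_booking(booking_id: str) -> dict:
    booking = DB.get(booking_id)
    if booking is None:
        return {"error": "not found"}
    if booking["status"] != "confirmed":  # policy: only confirmed bookings may cancel
        return {"error": "policy violation"}
    booking["status"] = "cancelled"
    return booking

TOOLS = {"lookup_booking": lookup_booking, "cancel_booking": cancel_booking}

def run_episode(tool_calls: list[tuple[str, dict]]) -> list[dict]:
    """Execute an agent's tool calls against the database, in order."""
    return [TOOLS[name](**args) for name, args in tool_calls]

# An agent that verifies the booking before cancelling satisfies the policy
print(run_episode([("lookup_booking", {"booking_id": "BK123"}),
                   ("cancel_booking", {"booking_id": "BK123"})]))
```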

Browser & mobile use

Coming soon

Real-world web navigation, form filling, and multi-step browser-based workflows.

WebArena

VisualWebArena

Coding

Coming soon

Software development tasks and long-horizon workflows.

SWE-Bench

Terminal Bench


Methodology

What we use

We evaluate models as agents in simulated, real-world customer service scenarios, inspired by the τ-bench methodology (Sierra, 2024), with live databases, API tools, and strict business rules. Our primary metric is pass^5: the probability that all five independent trials of the same task succeed. This captures not just accuracy but an agent's consistency and reliability under natural conversational variation.
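Concretely, pass^k can be estimated without bias from n trials with c successes as C(c, k) / C(n, k), analogous to the familiar pass@k estimator; with n = k = 5 this reduces to requiring all five trials to succeed. A small sketch:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass^k: the probability that k i.i.d. trials of a
    task all succeed, given c successes observed in n trials (n >= k)."""
    return comb(c, k) / comb(n, k)

# With n = k = 5 this is an all-or-nothing check per task:
print(pass_hat_k(5, 5, 5))  # 1.0 -- all five trials succeeded
print(pass_hat_k(5, 4, 5))  # 0.0 -- a single failure zeroes the task
# The composite score is the mean of pass_hat_k over all tasks.
```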

Configuration

Unless stated otherwise, our standard model configuration is:
Reasoning: medium
Max tokens: 16,384
Temperature: 0.6
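As a sketch, these defaults could be captured in a run configuration object like the following; the field names are illustrative, not a real API.

```python
from dataclasses import dataclass

@dataclass
class EvalConfig:
    """Standard run settings; field names are illustrative, not a real API."""
    reasoning: str = "medium"
    max_tokens: int = 16_384
    temperature: float = 0.6

DEFAULT_CONFIG = EvalConfig()  # overridden only where a run states otherwise
```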


Evaluate your model on our private benchmarks
