Compare leading LLMs on our suite of private benchmarks.
See how frontier models actually perform on tasks they've never trained on.
Trusted by Leading AI Teams
Agentic intelligence index
Composite pass^5 score across Tool use evaluations (higher is better).
Error bars show 95% confidence intervals. Browser use coming soon.
We test models on private, non-contaminated tasks. Here's what we found.
RL / evaluation datasets available
The benchmarks powering this leaderboard are available for purchase.
License our RL Gyms and evaluation data across many domains to train and test your own models.
Performance vs. cost
Score (pass^5) vs. average inference cost per task
Bubble represents model parameter class. Ideal: top-left (high score, low cost).
Composite pass^5 score (%)
Average inference cost per task ($)
Models over time
Score (pass^5) vs. model release date
Composite pass^5 score (%)
Tokens used to run Toloka benchmarks
Tokens in, tokens out, and total tokens
Evaluation areas
Tool use
7 Domains
Multi-turn task completion with MCP-like tools, policy adherence, and database operations across industry verticals.
Manufacturing
Airbnb
Telecom
Airlines
+5 more
Browser & mobile use
Coming soon
Real-world web navigation, form filling, and multi-step browser-based workflows.
WebArena
VisualWebArena
Coding
Coming soon
Software development tasks and long-horizon workflows.
SWE-Bench
Terminal Bench
Domain standings
TAU manufacturing
Bank - internal HR
Short-term rental platform
Airlines
Restaurant operations
Hotel management
Logistics
Methodology
What we use
We evaluate models as agents in simulated real-world customer service scenarios, inspired by the τ-bench methodology (Sierra, 2024), with live databases, API tools, and strict business rules. Our primary metric is pass^5: the probability that all five independent trials of the same task succeed. This captures not just accuracy, but also the consistency and reliability of the agent across natural conversational variation.
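The pass^5 metric described above can be sketched as follows. This is a minimal illustration, not Toloka's evaluation harness: the task names and trial outcomes are hypothetical, and it uses the standard unbiased pass^k estimator C(c, k) / C(n, k), which with exactly five trials per task reduces to "1 if all five trials succeed, else 0".

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Probability that k trials drawn without replacement from n recorded
    trials (c of them successful) ALL succeed: C(c, k) / C(n, k)."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Hypothetical per-task trial outcomes (True = trial succeeded).
trials = {
    "task_a": [True, True, True, True, True],   # 5/5 -> pass^5 = 1.0
    "task_b": [True, True, False, True, True],  # 4/5 -> pass^5 = 0.0
}

per_task = [pass_hat_k(len(v), sum(v), 5) for v in trials.values()]
composite = 100 * sum(per_task) / len(per_task)
print(f"composite pass^5: {composite:.1f}%")  # -> 50.0%
```

Note how one failure in five trials zeroes out a task's pass^5 score, which is why the metric rewards consistency rather than best-of-N accuracy.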
Configuration
Unless stated otherwise, our standard model configuration is:
Reasoning effort: medium
Max tokens: 16,384
Temperature: 0.6
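As a rough illustration, the configuration above maps onto a request payload like the following. The field names are assumptions for illustration only; exact parameter names vary by provider API.

```python
# Hypothetical payload mirroring the standard configuration above.
# Field names (e.g. "reasoning_effort") are illustrative, not a specific API.
standard_config = {
    "reasoning_effort": "medium",  # unless stated otherwise
    "max_tokens": 16_384,
    "temperature": 0.6,
}

print(standard_config)
```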