[Figure: Performance vs. cost. Composite pass^5 score (%) vs. average inference cost per task ($). Ideal: top-left (high score, low cost).]
[Figure: Models over time. Composite pass^5 score (%) vs. model release date.]
[Figure: Token usage. Tokens in, tokens out, and tokens total per model.]
[Figure: Total cost to run each model on the whole benchmark.]
Evaluation areas
Tool use (7 domains)
Multi-turn task completion with MCP-like tools, policy adherence, and database operations across industry verticals; a sketch of such a tool follows this card.
Domains: Manufacturing, Airbnb, Telecom, Airlines, and 5 more.
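To make "MCP-like tools" concrete, here is a hypothetical sketch of one such tool: a JSON-schema declaration the agent sees, plus a handler that operates on the task's live database. The tool name, schema, and table are invented for illustration and are not the benchmark's actual tools.

```python
import sqlite3

# Invented example tool: the declaration the agent would receive.
TOOL_SPEC = {
    "name": "update_order_status",
    "description": "Set the status of a manufacturing order.",
    "input_schema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string"},
            "status": {"type": "string", "enum": ["queued", "in_progress", "done"]},
        },
        "required": ["order_id", "status"],
    },
}

def update_order_status(db: sqlite3.Connection, order_id: str, status: str) -> dict:
    """Handler run when the agent calls the tool; business-rule checks
    (the 'policy adherence' part of the tasks) would live here."""
    db.execute("UPDATE orders SET status = ? WHERE id = ?", (status, order_id))
    db.commit()
    return {"order_id": order_id, "status": status}

# Tiny in-memory demo of the handler against a live database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT)")
db.execute("INSERT INTO orders VALUES ('A-17', 'queued')")
print(update_order_status(db, "A-17", "in_progress"))
```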
Browser & mobile use (coming soon)
Real-world web navigation, form filling, and multi-step browser-based workflows.
Benchmarks: WebArena, VisualWebArena.
Coding (coming soon)
Software development tasks and long-horizon workflows.
Benchmarks: SWE-Bench, Terminal Bench.
RL/evaluation datasets available
The benchmarks powering this leaderboard are available for purchase. License our RL/evaluation data across many domains to train and test your own models.
Domain standings
TAU manufacturing
Bank - internal HR
Short-term rental platform
Airlines
Restaurant operations
Hotel management
Logistics
Methodology
What we use
We evaluate agents in simulated real-world customer service scenarios, inspired by the τ-bench methodology (Sierra, 2024), with live databases, API tools, and strict business rules. Our primary metric is pass^5: the probability that all five independent trials of the same task succeed. This captures not just accuracy but the agent's consistency and reliability across natural conversational variation.
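To make the metric concrete: with exactly five trials per task, a task's pass^5 is 1 only when all five trials succeed, and the composite score is the average over tasks. The sketch below computes this using the combinatorial pass^k estimator, C(c, k)/C(n, k), analogous to the standard pass@k estimator; the trial results are invented toy data, not benchmark output.

```python
from math import comb

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Unbiased pass^k estimate: the chance that k runs drawn without
    replacement from `trials` runs of one task all succeed."""
    return comb(successes, k) / comb(trials, k)  # comb() is 0 when successes < k

# Toy results: 1 = the trial solved the task, 0 = it failed.
task_results = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
]
scores = [pass_hat_k(sum(runs), len(runs), k=5) for runs in task_results]
composite = 100 * sum(scores) / len(scores)
print(f"composite pass^5: {composite:.1f}%")  # -> 66.7%
```

With n = k = 5 this reduces to an all-or-nothing check per task; running more than five trials per task would make the estimate less noisy while targeting the same quantity.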
Configuration
Our standard model configuration, unless stated otherwise:
Reasoning: medium
Max tokens: 16,384
Temperature: 0.6
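These defaults can be pinned down as a single reusable object. A minimal sketch, assuming a plain Python dataclass; the class and field names are illustrative, not the leaderboard's actual harness code.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ModelConfig:
    """Standard evaluation settings, applied unless stated otherwise."""
    reasoning: str = "medium"   # reasoning effort level
    max_tokens: int = 16_384    # generation cap per response
    temperature: float = 0.6    # sampling temperature

DEFAULTS = ModelConfig()
print(asdict(DEFAULTS))  # {'reasoning': 'medium', 'max_tokens': 16384, 'temperature': 0.6}
```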