Compare leading LLMs on our hidden suite of private benchmarks. Powered by Toloka Forge, our open-source evaluation harness.
Leaderboard
Off-the-shelf evaluation datasets available
The benchmarks powering this leaderboard are available for purchase.
License our RL Gyms and evaluation data across many domains to train and test your own models.
Agentic intelligence index
Composite pass^1 score across Tool use evaluations (higher is better).
Error bars show 95% confidence intervals. Browser use coming soon.
Sonnet 4.5: 44.0
GPT-5: 40.8
Gemini 2.5: 32.0
Kimi K2 Thinking: 23.6
Minimax: 13.2
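The composite score above averages per-task pass^1 (the single-run success rate) across all tool-use evaluations. A minimal sketch of that computation, assuming a normal approximation for the 95% confidence intervals (the leaderboard's exact interval method isn't stated here):

```python
import math

def composite_pass1(task_results):
    # task_results: list of (successes, trials) per task.
    # Composite score = mean per-task pass^1, as a percentage.
    return 100 * sum(s / t for s, t in task_results) / len(task_results)

def normal_ci95(scores):
    # 95% CI for the mean via the normal approximation
    # (an illustrative stand-in, not the harness's actual method).
    n = len(scores)
    mean = sum(scores) / n
    var = sum((x - mean) ** 2 for x in scores) / (n - 1)
    half = 1.96 * math.sqrt(var / n)
    return mean - half, mean + half
```

For example, a model that solves 2 of 4 tasks on a single run each scores a composite pass^1 of 50.0.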
Performance vs. cost
Score (pass^1) vs. average inference cost per task
Bubble size represents model parameter class. Ideal position: top-left (high score, low cost).
Evaluation areas
Tool use
9 Domains
Multi-turn task completion with MCP-like tools, policy adherence, and database operations across industry verticals.
Manufacturing
Airbnb
Telecom
Airlines
+5 more
Browser use
Coming soon
Real-world web navigation, form filling, and multi-step browser-based workflows.
WebArena
VisualWebArena
Key takeaways
Sub-50% across the board
No model exceeds ~44% single-run success on TAU Manufacturing. Complex, multi-step tool chains remain largely unsolved.
Confidence intervals pending; statistical significance TBA
Sonnet 4.5 leads
Highest pass^1 (44.0%) and pass@5 (60.0%) in tool use. Near-flawless execution but conceptual gaps remain.
Confidence intervals needed to confirm statistical significance vs. GPT-5
GPT-5 variants show range
Performance varies significantly by reasoning effort setting (minimal → xhigh). Technical tool-handling errors are repetitive and systematic.
Detailed per-variant breakdown TBA
Trusted by Leading AI Teams