Toloka Arena 2026
Compare leading LLMs on our hidden suite of private benchmarks. Powered by Toloka Forge, our open-source evaluation harness.
Agentic intelligence index
Composite score across Tool use and Browser use evaluations (higher is better).
Chart scores, highest to lowest: 44, 40.8, 32, 23.6, 13.2
Leaderboard
Evaluation areas
Tool use
9 Domains
Multi-turn task completion with API tools, policy adherence, and database operations across industry verticals.
Manufacturing
Airbnb
Telecom
+5 more
Browser use
Coming soon
Real-world web navigation, form filling, and multi-step browser-based workflows.
WebArena
VisualWebArena
Key Takeaways
Sub-50% across the board
No model exceeds 44% single-run success on Tau Manufacturing. Complex tool chains remain unsolved.
Sonnet 4.5 leads
Highest pass^1 (44%) and pass@5 (60%) in tool use. Execution is excellent, but conceptual gaps remain.
GPT-5 reasons well
Near-flawless reasoning, but it fails on tool mechanics. Its errors are technical and repetitive.
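The takeaways above quote both pass^1 and pass@5, which measure different things: pass@k asks whether at least one of k attempts succeeds, while pass^k asks whether all k attempts succeed. The page does not say how these are computed here, so the following is only a minimal sketch that assumes the commonly used combinatorial estimators over n recorded trials per task. The function names and the sample numbers are hypothetical and are not taken from Toloka Forge.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n = trials run for a task, c = successful trials, k = attempts considered
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated chance that at least one of k sampled trials succeeds."""
    _check(n, c, k)
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimated chance that all k sampled trials succeed (pass^k)."""
    _check(n, c, k)
    # comb(c, k) is 0 when fewer than k successes exist, giving 0.0
    return comb(c, k) / comb(n, k)


def mean_over_tasks(metric, results, k: int) -> float:
    """Average a per-task metric over (n, c) pairs, one pair per task."""
    return sum(metric(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical task: 5 trials, 3 of them successful.
    print(pass_at_k(5, 3, 1), pass_hat_k(5, 3, 1))
    print(pass_at_k(5, 3, 5), pass_hat_k(5, 3, 5))
```

For k = 1 the two metrics coincide with the plain success rate. They diverge as k grows: pass@k rises toward 1 while pass^k falls toward 0, which is why a model can show a much higher pass@5 than pass^1.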