Toloka Arena.
Independent evaluation of agentic intelligence

Compare leading LLMs on our hidden suite of private benchmarks. Powered by Toloka Forge, our open-source evaluation harness.

Leaderboard

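| Rank | Model | Provider | Score (pass^1) | Tool use | Price/1M |
|------|-------|----------|----------------|----------|----------|
| #1 | Sonnet 4.5 | Anthropic | 44.0 [TBA, TBA] | 44.0 | TBA |
| #2 | GPT-5 (reasoning: default) | OpenAI | 40.8 [TBA, TBA] | 40.8 | TBA |
| #3 | Gemini 2.5 | Google | 32.0 [TBA, TBA] | 32.0 | TBA |
| #4 | Kimi K2 Thinking | Moonshot | 23.6 [TBA, TBA] | 23.6 | TBA |
| #5 | Minimax | Minimax | 13.2 [TBA, TBA] | 13.2 | TBA |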

Off-the-shelf evaluation datasets available

The benchmarks powering this leaderboard are available for purchase.
License our RL Gyms and evaluation data across many domains to train and test your own models.

Agentic intelligence index

Composite pass^1 score across Tool use evaluations (higher is better).
Error bars show 95% confidence intervals. Browser use coming soon.

[Bar chart: composite pass^1 score by model. Sonnet 4.5: 44.0; GPT-5: 40.8; Gemini 2.5: 32.0; Kimi K2 Thinking: 23.6; Minimax: 13.2.]
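The page does not say how its 95% confidence intervals are computed, and the interval values are still listed as TBA. As a minimal sketch, assuming pass^1 is a plain success proportion over independent task runs, a Wilson score interval is one common choice. The function name `wilson_interval` and the counts below (110 successes out of 250 runs) are hypothetical and chosen only so the point estimate matches 44.0%; they are not figures from the leaderboard.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Hypothetical counts, NOT taken from the leaderboard.
successes, trials = 110, 250
low, high = wilson_interval(successes, trials)
print(f"pass^1 = {successes / trials:.1%}, 95% CI = [{low:.1%}, {high:.1%}]")
```

If runs are clustered by task or domain, the independence assumption fails and a bootstrap over tasks would likely be more appropriate than this closed-form interval.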

Performance vs. cost

Score (pass^1) vs. average inference cost per task

Bubble represents model parameter class. Ideal: top-left (high score, low cost).

[Scatter plot: score (pass^1), 0% to 60%, on the vertical axis; average inference cost per task, $0 to $5, on the horizontal axis. Models shown: Sonnet 4.5, GPT-5, Gemini 2.5, Kimi K2, Minimax.]

Evaluation areas

Tool use

9 Domains

Multi-turn task completion with MCP-like tools, policy adherence, and database operations across industry verticals.

Manufacturing

Airbnb

Telecom

Airlines

+5 more

Browser use

Coming soon

Real-world web navigation, form filling, and multi-step browser-based workflows.

WebArena

VisualWebArena

Key takeaways

Sub-50% across the board

No model exceeds ~44% single-run success on TAU Manufacturing. Complex, multi-step tool chains remain largely unsolved.

Confidence intervals pending; statistical significance TBA

Sonnet 4.5 leads

Highest pass^1 (44.0%) and pass@5 (60.0%) in tool use. Near-flawless execution but conceptual gaps remain.

Confidence intervals needed to confirm statistical significance vs. GPT-5
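The page reports both pass^1 and pass@5 without defining them. The sketch below assumes the commonly used definitions: pass^k is the chance that all of k sampled runs of a task succeed, and pass@k is the chance that at least one of k sampled runs succeeds, each estimated from n runs with c successes and averaged over tasks. The function names and the per-task counts are hypothetical illustrations, not leaderboard data.

```python
from math import comb


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate pass^k: probability that k runs drawn from n all succeed."""
    if not 1 <= k <= n or not 0 <= c <= n:
        raise ValueError("require 1 <= k <= n and 0 <= c <= n")
    return comb(c, k) / comb(n, k)  # comb(c, k) is 0 when c < k


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k: probability that at least one of k drawn runs succeeds."""
    if not 1 <= k <= n or not 0 <= c <= n:
        raise ValueError("require 1 <= k <= n and 0 <= c <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_over_tasks(metric, success_counts: list[int], n: int, k: int) -> float:
    """Average a per-task metric over all tasks."""
    return sum(metric(n, c, k) for c in success_counts) / len(success_counts)


# Hypothetical: 4 tasks, 5 runs each, with these success counts per task.
counts = [5, 3, 0, 1]
print("pass^1:", mean_over_tasks(pass_hat_k, counts, n=5, k=1))
print("pass^5:", mean_over_tasks(pass_hat_k, counts, n=5, k=5))
print("pass@5:", mean_over_tasks(pass_at_k, counts, n=5, k=5))
```

Under these definitions pass^k can only fall as k grows while pass@k can only rise, which is why a model can post a pass@5 well above its pass^1: the gap reflects run-to-run inconsistency rather than a contradiction.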

GPT-5 variants show range

Performance varies significantly with the reasoning effort setting (minimal → xhigh). Technical tool-handling errors are repetitive and systematic.

Detailed per-variant breakdown TBA

Test your model with Toloka Forge