Toloka Arena 2026

Independent evaluation of Agentic Intelligence

Compare leading LLMs on our hidden suite of private benchmarks. Powered by Toloka Forge, our open-source evaluation harness.

Agentic intelligence index

Composite score across Tool use and Browser use evaluations (higher is better).

Sonnet 4.5 (Anthropic): 44
GPT-5 (OpenAI): 40.8
Gemini 2.5 (Google): 32
Kimi K2 (Moonshot): 23.6
Minimax (Minimax): 13.2

Leaderboard

Composite score

Rank  Model                        Score  Tool use  Browser use  Price/1M
#1    Sonnet 4.5 (Anthropic)       44     44        -            $3.00
#2    GPT-5 (OpenAI)               40.8   40.8      -            $10.00
#3    Gemini 2.5 (Google)          32     32        -            $8.50
#4    Kimi K2 Thinking (Moonshot)  23.6   23.6      -            $2.00
#5    Minimax (Minimax)            13.2   13.2      -            $1.00

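In the leaderboard, every model's Score equals its Tool use value, which is what one would expect if the composite is averaged only over evaluation areas that have results. The page does not state the aggregation rule, so the sketch below is an assumption: an unweighted mean over available areas, with the hypothetical function name `composite_score` and hypothetical area keys.

```python
from statistics import mean
from typing import Mapping, Optional


def composite_score(area_scores: Mapping[str, Optional[float]]) -> Optional[float]:
    """Unweighted mean over areas that have a score (assumed rule, not confirmed).

    Areas with no results yet are passed as None and skipped.
    Returns None when no area has a score.
    """
    available = [score for score in area_scores.values() if score is not None]
    if not available:
        return None
    return round(mean(available), 1)


# Values taken from the leaderboard row for Sonnet 4.5; browser use has no score yet.
sonnet = composite_score({"tool_use": 44.0, "browser_use": None})
print(sonnet)
```

Under this assumed rule, composites would shift once Browser use results are published, so current rankings reflect tool use only.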

Evaluation areas

Tool use

9 Domains

Multi-turn task completion with API tools, policy adherence, and database operations across industry verticals.

Manufacturing

Airbnb

Telecom

+5 more

Browser use

Coming soon

Real-world web navigation, form filling, and multi-step browser-based workflows.

WebArena

VisualWebArena

Key Takeaways

Sub-50% across the board

No model exceeds 44% single-run success on Tau Manufacturing. Complex tool chains remain unsolved.

Sonnet 4.5 leads

Highest pass^1 (44%) and pass@5 (60%) in tool use. Excellent execution but conceptual gaps remain.
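The two metrics quoted above measure different things: pass@k asks whether at least one of k attempts succeeds, while pass^k asks whether all k attempts succeed. The page does not say how these numbers are computed, so the following is a minimal sketch assuming the commonly used combinatorial estimators over n recorded runs per task. The function names and the sample data are hypothetical.

```python
from math import comb
from statistics import mean
from typing import Callable, Sequence


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k runs, drawn from n runs with c successes, passes."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k runs, drawn from n runs with c successes, pass (pass^k)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


def benchmark_score(
    successes_per_task: Sequence[int],
    n: int,
    k: int,
    metric: Callable[[int, int, int], float],
) -> float:
    """Average a per-task metric over all tasks."""
    return mean(metric(n, c, k) for c in successes_per_task)


# Hypothetical data: number of successful runs out of n = 5 for four tasks.
successes = [5, 2, 0, 1]
print("pass^1:", benchmark_score(successes, n=5, k=1, metric=pass_hat_k))
print("pass@5:", benchmark_score(successes, n=5, k=5, metric=pass_at_k))
```

For k = 1 the two estimators coincide with the plain success rate. A gap between pass^1 and pass@5, like the 44% versus 60% reported here, indicates tasks that a model solves on some runs but not reliably.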

GPT-5 reasons well

Near-flawless reasoning, but failures in tool mechanics. Its errors are technical and repetitive.

Test your model with Toloka Forge