Compare leading LLMs on our suite of private benchmarks.
See how frontier models actually perform on tasks they've never trained on.
Trusted by Leading AI Teams
Agentic intelligence index
Composite pass^5 score across Tool use evaluations (higher is better).
Error bars show 95% confidence intervals. Browser use coming soon.
We test models on private, non-contaminated tasks. Here's what we found.
RL / evaluation datasets available
The benchmarks powering this leaderboard are available for purchase.
License our RL Gyms and evaluation data across many domains to train and test your own models.
Performance vs. cost
Score (pass^5) vs. average inference cost per task
Bubble represents model parameter class. Ideal: top-left (high score, low cost).
Composite pass^5 score (%)
Average inference cost per task ($)
Models over time
Score (pass^5) vs. model release date
Composite pass^5 score (%)
Tokens used to run Toloka benchmarks
Tokens in, tokens out, and total tokens
Evaluation areas
Tool use
7 Domains
Multi-turn task completion with MCP-like tools, policy adherence, and database operations across industry verticals.
Manufacturing
Airbnb
Telecom
Airlines
+5 more
Browser & mobile use
Coming soon
Real-world web navigation, form filling, and multi-step browser-based workflows.
WebArena
VisualWebArena
Coding
Coming soon
Software development tasks and long-horizon workflows.
SWE-Bench
Terminal Bench
Domain standings
TAU manufacturing
Bank - internal HR
Short-term rental platform
Airlines
Restaurant operations
Hotel management
Logistics
Methodology
What we use
We evaluate models as agents in simulated real-world customer service scenarios, inspired by the τ-bench methodology (Sierra, 2024), with live databases, API tools, and strict business rules. Our primary metric is pass^5: the probability that all five independent trials of the same task succeed. This captures not just accuracy, but also the consistency and reliability of the agent across natural conversational variation.
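The pass^5 metric described above can be sketched as follows. This is a minimal illustration, not Toloka's evaluation harness: the task names and trial outcomes are hypothetical, and it uses the standard unbiased pass^k estimator C(c, k) / C(n, k), which with exactly five trials per task reduces to "1 if all five trials succeed, else 0".

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Probability that k trials drawn without replacement from n recorded
    trials (c of them successful) ALL succeed: C(c, k) / C(n, k)."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Hypothetical per-task trial outcomes (True = trial succeeded).
trials = {
    "task_a": [True, True, True, True, True],   # 5/5 -> pass^5 = 1.0
    "task_b": [True, True, False, True, True],  # 4/5 -> pass^5 = 0.0
}

per_task = [pass_hat_k(len(v), sum(v), 5) for v in trials.values()]
composite = 100 * sum(per_task) / len(per_task)
print(f"composite pass^5: {composite:.1f}%")  # -> 50.0%
```

Note how one failure in five trials zeroes out a task's pass^5 score, which is why the metric rewards consistency rather than best-of-N accuracy.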
Configuration
Unless stated otherwise, our standard model configuration is:
Reasoning effort: medium
Max tokens: 16,384
Temperature: 0.6
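As a rough illustration, the configuration above maps onto a request payload like the following. The field names are assumptions for illustration only; exact parameter names vary by provider API.

```python
# Hypothetical payload mirroring the standard configuration above.
# Field names (e.g. "reasoning_effort") are illustrative, not a specific API.
standard_config = {
    "reasoning_effort": "medium",  # unless stated otherwise
    "max_tokens": 16_384,
    "temperature": 0.6,
}

print(standard_config)
```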