Domain catalog

Tool Use evaluations across industries. Browser Use environments coming soon.

All benchmark data is available for purchase as off-the-shelf datasets.

Benchmarks are available as off-the-shelf (OTS) datasets

Each domain corresponds to an RL Gym you can license.
Use them for fine-tuning, RLHF, or internal evaluation of your own models.

Tool use

Choose a live domain to explore detailed results.

Manufacturing

Live

Production order lifecycle

Lot management

Material allocation with buffer rules

CAPA tracking

Inventory control

Bank - internal HR

Live

Internal IT & ops support

Leave requests

Benefits enrollment

Payroll inquiries

Remote work arrangements

Resignations

Short-term rental platform

Live

Stay & experience bookings

Payments

User verification & trust

Damage claims

Listing management

Corporate travel

Credits & escalations

Airlines

Live

Booking changes & cancellations

Ancillaries

Payments & refunds

Travel credits

Check-in issues

Complaints

Loyalty support

Logistics

Live

Shipment tracking

Equipment & maintenance requests

Safety incidents

HR inquiries

System access issues

Restaurant operations

Live

Equipment incidents

Food safety

Delivery discrepancies & credits

Time corrections

Hotel management

Live

Employee profiles

Certification compliance

Facilities work orders

Badge lifecycle

Inventory

Shift scheduling

Payroll & benefits

Telecom

Soon

Activations

Account verification

Billing & payments

Outage handling

Credits & refunds

Notifications

Air cargo

Soon

Booking lifecycle

Shipment tracking

AWB creation/validation/amendment

Cargo claims

Reporting

Invoice discrepancies and credit notes

Document submission and compliance

Browser & mobile use

Coming soon

WebArena/Toloka

Autonomous web navigation and task completion across realistic web applications.

Visual WebArena/Toloka

Visually-grounded web agent tasks requiring screenshot understanding and interaction.

Coding

Coming soon

SWE-Bench/Toloka

A benchmark that tests AI models on resolving real GitHub issues from popular Python repos. Models must generate code patches that pass the project's test suite.

Terminal Bench/Toloka

A benchmark evaluating AI agents on complex terminal/command-line tasks. It tests the ability to navigate filesystems, chain commands, and solve system administration challenges programmatically.