Domain catalog

Tool Use evaluations across industries. Browser Use environments coming soon.

All benchmark data is available for purchase as off-the-shelf (OTS) datasets.


Benchmarks are available as OTS datasets

Each domain corresponds to an RL Gym you can license.
Use these gyms for fine-tuning, RLHF, or internal evaluation of your own models.

Browser & mobile use

Coming soon

WebArena/Toloka

Autonomous web navigation and task completion across realistic web applications.

Visual WebArena/Toloka

Visually grounded web agent tasks requiring screenshot understanding and interaction.

Coding

Coming soon

SWE-Bench/Toloka

A benchmark that tests AI models on resolving real GitHub issues from popular Python repos. Models must generate code patches that pass the project's test suite.

Terminal Bench/Toloka

A benchmark evaluating AI agents on complex terminal/command-line tasks. It tests the ability to navigate filesystems, chain commands, and solve system administration challenges programmatically.
