Domain catalog
Tool-use evaluations across industries. Browser-use environments are coming soon.
All benchmark data is available for purchase as off-the-shelf datasets.
Benchmarks are available as off-the-shelf (OTS) datasets
Each domain corresponds to an RL Gym you can license.
Use them for fine-tuning, RLHF, or internal evaluation of your own models.
Tool use
Choose a live domain to explore detailed results.
Manufacturing
Live
Production order lifecycle
Lot management
Material allocation with buffer rules
CAPA tracking
Inventory control
Bank - internal HR
Live
Internal IT & ops support
Leave requests
Benefits enrollment
Payroll inquiries
Remote work arrangements
Resignations
Short-term rental platform
Live
Stay & experience bookings
Payments
User verification & trust
Damage claims
Listing management
Corporate travel
Credits & escalations
Airlines
Live
Booking changes & cancellations
Ancillaries
Payments & refunds
Travel credits
Check-in issues
Complaints
Loyalty support
Logistics
Live
Shipment tracking
Equipment & maintenance requests
Safety incidents
HR inquiries
System access issues
Restaurant operations
Live
Equipment incidents
Food safety
Delivery discrepancies & credits
Time corrections
Hotel management
Live
Employee profiles
Certification compliance
Facilities work orders
Badge lifecycle
Inventory
Shift scheduling
Payroll & benefits
Telecom
Soon
Activations
Account verification
Billing & payments
Outage handling
Credits & refunds
Notifications
Air cargo
Soon
Booking lifecycle
Shipment tracking
AWB creation, validation, and amendment
Cargo claims
Reporting
Invoice discrepancies and credit notes
Document submission and compliance
Browser & mobile use
Coming soon
WebArena/Toloka
Autonomous web navigation and task completion across realistic web applications.
Visual WebArena/Toloka
Visually-grounded web agent tasks requiring screenshot understanding and interaction.
Coding
Coming soon
SWE-Bench/Toloka
A benchmark that tests AI models on resolving real GitHub issues from popular Python repositories. Models must generate code patches that pass the project's test suite.
Terminal Bench/Toloka
A benchmark evaluating AI agents on complex terminal/command-line tasks. It tests the ability to navigate filesystems, chain commands, and solve system administration challenges programmatically.