Domain catalog
Our evaluation covers 9 Tool Use domains across industries and 2 Browser Use environments. Click a live domain to explore detailed results. All benchmark data is available for purchase as off-the-shelf datasets.
Benchmarks are available as OTS Datasets
Each domain below corresponds to an RL Gym you can license. Use them for fine-tuning, RLHF, or internal evaluation of your own models.
Tool use
Manufacturing
Live
50 tasks
Manufacturing ops: workforce (badge access, scheduling/leave/overtime, training/compliance, HR/time) + after-sales (parts, warranty/recalls, service scheduling, dealer escalation) (D365).
Airbnb
Soon
Hospitality marketplace support: guest/host booking changes, cancellations/refunds (Zendesk+Genesys).
Internal HR / Neobank
Soon
Consulting-firm internal IT/ops support: access/licenses, devices, travel/expenses, onboarding.
Telecom
Soon
Telecom support and outage credits: network issues, billing, plans/devices, and case management, plus outage diagnosis and credit eligibility and application (ServiceNow CSM + ITSM Incidents).
Airlines
Soon
Airline disruption servicing: rebooking/standby + vouchers/claims/baggage (Zendesk+Genesys).
Food Services
Soon
Internal store ops: equipment/POS/network + workforce/time + vendor dispatch.
Logistics
Soon
Shipping/logistics support: tracking + delivery exceptions/claims (ServiceNow CSM).
Travel
Soon
Travel company internal ITSM: access requests + incidents/requests (ServiceNow ITSM).
Pharma / Healthcare
Soon
Pharma supply chain support: deliveries/cold-chain, returns/credits/recalls + internal workflows (D365).
Browser use
Coming soon
WebArena
Autonomous web navigation and task completion across realistic web applications.
VisualWebArena
Visually-grounded web agent tasks requiring screenshot understanding and interaction.