Domain catalog

Our evaluation covers 9 Tool Use domains across industries and 2 Browser Use environments. Click a live domain to explore detailed results. All benchmark data is available for purchase as off-the-shelf datasets.

Benchmarks are available as off-the-shelf (OTS) datasets

Each domain below corresponds to an RL Gym you can license. Use them for fine-tuning, RLHF, or internal evaluation of your own models.

Tool use

Manufacturing

Live

50 tasks

Manufacturing ops: workforce (badge access, scheduling/leave/overtime, training/compliance, HR/time) + after-sales (parts, warranty/recalls, service scheduling, dealer escalation) (D365).

Airbnb

Soon

Hospitality marketplace support: guest/host booking changes, cancellations/refunds (Zendesk+Genesys).

Internal HR / Neobank

Soon

Consulting-firm internal IT/ops support: access/licenses, devices, travel/expenses, onboarding.

Telecom

Soon

Telecom support + outage credits: network issues, billing, plans/devices, case management + outage diagnosis, credit eligibility & application (ServiceNow CSM + ITSM Incidents).

Airlines

Soon

Airline disruption servicing: rebooking/standby + vouchers/claims/baggage (Zendesk+Genesys).

Food Services

Soon

Internal store ops: equipment/POS/network + workforce/time + vendor dispatch.

Logistics

Soon

Shipping/logistics support: tracking + delivery exceptions/claims (ServiceNow CSM).

Travel

Soon

Travel company internal ITSM: access requests + incidents/requests (ServiceNow ITSM).

Pharma / Healthcare

Soon

Pharma supply chain support: deliveries/cold-chain, returns/credits/recalls + internal workflows (D365).

Browser use

Coming soon

WebArena

Autonomous web navigation and task completion across realistic web applications.

VisualWebArena

Visually-grounded web agent tasks requiring screenshot understanding and interaction.