Airlines
Airline customer care agent supporting frontline teams in managing passenger booking servicing, changes and cancellations, ancillaries, payments, refunds, travel credits, check-in issues, complaints, and loyalty support.
100 test cases available OTS
36 agent tools
Domain agentic intelligence index
We test models on private, non-contaminated tasks.
Here's what we found.
Composite pass^5 score across Tool use evaluations (higher is better).
Error bars show 95% confidence intervals.
Scaling curves
K = 1…5 runs
pass^k — Consistency
Percentage of tasks passed in every one of k runs.
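As a rough illustration of the metric described above, the following sketch computes pass^k from per-task run outcomes. It assumes the combinatorial estimator commonly used for pass^k (the chance that k runs drawn from a task's n recorded runs all pass, averaged over tasks); whether this benchmark computes it exactly this way is an assumption. The names `pass_hat_k` and `results`, and the sample data, are hypothetical.

```python
from math import comb


def pass_hat_k(results, k):
    """Estimate pass^k from recorded runs.

    results: dict mapping a task id to a list of booleans, one per run.
    k: number of runs that must all pass.

    For each task with n runs and c passes, comb(c, k) / comb(n, k) is the
    probability that k runs sampled without replacement all passed.
    When k == n this is 1 if every run passed and 0 otherwise.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    total = 0.0
    for outcomes in results.values():
        n = len(outcomes)
        c = sum(outcomes)
        if k < 1 or k > n:
            raise ValueError("k must be between 1 and the number of runs")
        total += comb(c, k) / comb(n, k)
    return total / len(results)


# Hypothetical outcomes for three tasks, five runs each.
results = {
    "task_a": [True, True, True, True, True],       # passes every run
    "task_b": [True, True, False, True, True],      # fails once
    "task_c": [False, False, False, False, False],  # never passes
}

# Scaling curve over k = 1..5: the score can only stay flat or drop as k grows.
curve = {k: pass_hat_k(results, k) for k in range(1, 6)}
```

In this sample, `task_b` contributes to pass^1 but not to pass^5, which is why the curve falls as k increases: a single failed run is enough to exclude a task at k = 5.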
Task difficulty distribution
Tasks bucketed by aggregate success rate
Buckets show difficulty tiers based on the aggregate of model results on the benchmarking subset.
100%: 0 of 110 tasks (0%)
75%+: 16 of 110 tasks (15%)
50%+: 47 of 110 tasks (43%)
25%+: 46 of 110 tasks (42%)
0%: 1 of 110 tasks (1%)
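The tiers above can be sketched as a simple threshold lookup on each task's aggregate success rate. This is a minimal sketch under stated assumptions: the page does not define the bucket boundaries, so the code assumes each tier is a lower bound (for example, "75%+" means at least 75% but below 100%) and that the "0%" tier catches everything below 25%. The names `difficulty_bucket` and `bucket_counts`, and the sample rates, are hypothetical.

```python
from collections import Counter


def difficulty_bucket(rate):
    """Map an aggregate success rate in [0, 1] to a difficulty tier label.

    Assumed boundaries (not confirmed by the source): each label is a
    lower bound, and "0%" covers every rate below 0.25.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if rate >= 1.0:
        return "100%"
    if rate >= 0.75:
        return "75%+"
    if rate >= 0.50:
        return "50%+"
    if rate >= 0.25:
        return "25%+"
    return "0%"


def bucket_counts(rates):
    """Count how many tasks fall into each tier."""
    return Counter(difficulty_bucket(r) for r in rates)


# Hypothetical aggregate success rates for five tasks.
sample_rates = [1.0, 0.8, 0.55, 0.3, 0.1]
counts = bucket_counts(sample_rates)
```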
Example task
User Request
Correct Agent Solution
What Is Tested