Logistics
Internal Operations Support for a mid-to-large regional logistics company: shipment tracking, equipment/maintenance requests, safety incidents, and HR inquiries.
Test cases: 154
Agent tools: 26
Domain agentic intelligence index
We test models on private, non-contaminated tasks.
Here's what we found.
Composite pass^5 score across tool-use evaluations (higher is better).
Error bars show 95% confidence intervals.
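The page does not say how the 95% confidence intervals are computed. As one illustrative possibility only, the sketch below uses a percentile bootstrap over per-task scores. The function name `bootstrap_ci`, the resample count, and the choice of bootstrap itself are assumptions, not the benchmark's documented method.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-task scores.

    Assumption: `scores` holds one number in [0, 1] per task. The real
    benchmark may use a different interval method.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample tasks with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo_idx = int(alpha * (n_resamples - 1))
    hi_idx = int((1.0 - alpha) * (n_resamples - 1))
    return means[lo_idx], means[hi_idx]
```

Resampling at the task level treats tasks as the unit of variation, which matches a score that is averaged over tasks.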
Scaling curves
K = 1…5 runs
pass^k — Consistency
Percentage of tasks passed in every one of k runs.
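To make the pass^k definition concrete, here is a minimal sketch that computes it from per-task run outcomes. It assumes each task was run n times with k ≤ n, and it uses the combinatorial estimator C(c, k) / C(n, k), where c is the number of passing runs for a task. That estimator is an assumption, since the page gives only the verbal definition. For k = n it reduces to the fraction of tasks passed in all n runs. The name `pass_hat_k` and the input layout are hypothetical.

```python
from math import comb


def pass_hat_k(results, k):
    """Estimate pass^k: the chance that k runs of a task all pass, averaged over tasks.

    `results` maps a task id to a list of booleans, one per run (assumed layout).
    """
    if not results:
        raise ValueError("results must not be empty")
    total = 0.0
    for task_id, runs in results.items():
        n = len(runs)
        if not 1 <= k <= n:
            raise ValueError(f"k={k} is out of range for task {task_id!r} with {n} runs")
        c = sum(runs)  # number of passing runs for this task
        # Fraction of size-k subsets of the n runs in which every run passed.
        total += comb(c, k) / comb(n, k)
    return total / len(results)
```

For example, with three tasks run five times each, where one always passes, one passes four times out of five, and one never passes, `pass_hat_k(results, 1)` is 0.6 and `pass_hat_k(results, 5)` is 1/3. The gap between the two is what a consistency curve over k = 1 to 5 visualizes.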
Task difficulty distribution
Tasks bucketed by aggregate success rate
Buckets show difficulty tiers based on the aggregate results of all models.
100%: 1 of 143 tasks (1%)
75%+: 4 of 143 tasks (3%)
50%+: 94 of 143 tasks (66%)
25%+: 43 of 143 tasks (30%)
0%: 1 of 143 tasks (1%)
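The bucketing above can be sketched as a simple threshold function over each task's aggregate success rate. The exact boundaries are not stated on the page, so this sketch assumes that "100%" means every attempt passed, that each "N%+" tier has an inclusive lower bound, and that the "0%" tier catches everything below 25%. The names `difficulty_bucket` and `tally_buckets` are hypothetical.

```python
from collections import Counter


def difficulty_bucket(rate):
    """Map an aggregate success rate in [0, 1] to a difficulty tier label."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if rate == 1.0:
        return "100%"
    if rate >= 0.75:
        return "75%+"
    if rate >= 0.50:
        return "50%+"
    if rate >= 0.25:
        return "25%+"
    # Assumption: the lowest tier holds every task below 25%, not only exact zeros.
    return "0%"


def tally_buckets(rates):
    """Count how many tasks fall into each tier."""
    return Counter(difficulty_bucket(r) for r in rates)
```

Calling `tally_buckets` on the per-task rates and dividing each count by the number of tasks gives the percentages shown next to each tier.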
Example task
User Request
Correct Agent Solution
What Is Tested