Logistics

Internal Operations Support for a mid-to-large regional logistics company: shipment tracking, equipment/maintenance requests, safety incidents, and HR inquiries.

Test cases: 154

Agent tools: 26

Domain agentic intelligence index

We test models on private, non-contaminated tasks.
Here's what we found.

Composite pass^5 score (%)
Last updated: May 19

Composite pass^5 score across Tool use evaluations (higher is better).
Error bars show 95% confidence intervals.

Scaling curves

K = 1…5 runs

pass^k — Consistency

Percentage of tasks passed in every one of k runs.
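As a rough illustration of the pass^k definition above, the following Python sketch computes the score from per-task run outcomes. The page does not say how the score is estimated when k is smaller than the number of runs, so the sketch assumes the combinatorial estimator C(c, k) / C(n, k), where c is the number of passing runs out of n. When k equals n, this reduces to "passed every run". The function name `pass_hat_k` and the input layout are hypothetical, not taken from the benchmark.

```python
from math import comb


def pass_hat_k(results, k):
    """Estimate pass^k as a percentage.

    results: dict mapping a task id to a list of booleans, one per run
             (True means the run passed). Hypothetical layout.
    k:       number of runs that must all pass.
    """
    if not results:
        raise ValueError("results must contain at least one task")

    total = 0.0
    for task_id, runs in results.items():
        n = len(runs)
        if k < 1 or k > n:
            raise ValueError(f"k={k} is out of range for task {task_id!r} with {n} runs")
        c = sum(runs)  # number of passing runs for this task
        # Probability that k runs drawn without replacement all passed.
        # math.comb returns 0 when k > c, so a task with too few passes adds 0.
        total += comb(c, k) / comb(n, k)

    return 100.0 * total / len(results)


if __name__ == "__main__":
    # Made-up outcomes for four tasks, five runs each.
    demo = {
        "t1": [True, True, True, True, True],
        "t2": [True, True, False, True, True],
        "t3": [False, False, False, False, False],
        "t4": [True, True, True, True, True],
    }
    for k in range(1, 6):
        print(k, round(pass_hat_k(demo, k), 1))
```

Under this assumption the curve can only stay flat or fall as k grows, which is why pass^k reads as a consistency measure rather than a best-of-k measure.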

[Figure: pass^k by model. x-axis: k = 1 to 5; y-axis: 0% to 90%. Models: GPT-5.4, Sonnet 4.6, Gemini 3 Flash, Minimax, Kimi K2.5.]

Task difficulty distribution

Tasks bucketed by aggregate success rate

Buckets show difficulty tiers based on the aggregate of model results on the benchmarking subset.

100%: 1 of 143 tasks (1%)

75%+: 6 of 143 tasks (4%)

50%+: 96 of 143 tasks (67%)

25%+: 40 of 143 tasks (28%)

0%: 0 of 143 tasks (0%)
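The bucketing above could be sketched in Python as follows. The tier labels come from the page. Everything else is an assumption: that each task's aggregate success rate is a fraction between 0 and 1, that each "N%+" tier runs from N% up to the next tier, and that "0%" means no successful runs at all. The page lists no tier between 0% and 25%, so the sketch gives such tasks a hypothetical "<25%" label instead of guessing where they belong. The names `difficulty_tier` and `tier_counts` are illustrative only.

```python
from collections import Counter


def difficulty_tier(success_rate):
    """Map an aggregate success rate in [0, 1] to a tier label."""
    if not 0.0 <= success_rate <= 1.0:
        raise ValueError("success_rate must be between 0 and 1")
    if success_rate == 1.0:
        return "100%"
    if success_rate >= 0.75:
        return "75%+"
    if success_rate >= 0.50:
        return "50%+"
    if success_rate >= 0.25:
        return "25%+"
    if success_rate == 0.0:
        return "0%"
    # Not a tier shown on the page; hypothetical catch-all for (0, 0.25).
    return "<25%"


def tier_counts(rates):
    """Count tasks per tier.

    rates: dict mapping a task id to its aggregate success rate.
    """
    return dict(Counter(difficulty_tier(r) for r in rates.values()))
```

For example, `tier_counts({"a": 1.0, "b": 0.6, "c": 0.55})` counts one task in "100%" and two in "50%+".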

Example task

User Request

Correct Agent Solution

What Is Tested
