Bank - internal HR

Internal IT & ops support: Leave requests, benefits enrollment, payroll inquiries, remote work arrangements, resignations.

Test cases: 130

Agent tools: 22

Domain agentic intelligence index

Composite pass^5 score across Tool use evaluations (higher is better).
Error bars show 95% confidence intervals. Browser use coming soon.
We test models on private, non-contaminated tasks. Here's what we found.
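The page does not say how the 95% confidence intervals are computed. As a rough sketch only, assuming each task contributes a single pass/fail outcome (passed all 5 runs or not) and assuming a normal-approximation interval, the error bar for one model could be estimated as below. The function name `wald_interval` and the example counts are hypothetical and are not taken from the benchmark.

```python
from math import sqrt

def wald_interval(passed, total, z=1.96):
    """Normal-approximation confidence interval for a pass rate.

    Assumption: tasks are independent pass/fail trials. z=1.96 gives
    roughly a 95% interval. This is an illustrative method, not
    necessarily the one used for the published error bars.
    """
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("need 0 <= passed <= total and total > 0")
    p = passed / total
    margin = z * sqrt(p * (1 - p) / total)
    # Clamp to [0, 1] because a rate cannot leave that range.
    return max(0.0, p - margin), min(1.0, p + margin)

# Hypothetical counts, for illustration only.
low, high = wald_interval(passed=50, total=200)
print(f"pass rate 25.0%, interval {low:.1%} to {high:.1%}")
```

With few tasks or pass rates near 0% or 100%, a Wilson or bootstrap interval would usually be a better choice than this approximation.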

Composite pass^5 score (%)
Last updated: April 6

Scaling curves

K = 1…5 runs

pass^k — Consistency

Percentage of tasks passed in every one of k runs.
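A minimal sketch of how pass^k could be computed from per-task run outcomes. It assumes each task was run n times (5 here) and uses the combinatorial estimator C(c, k) / C(n, k), where c is the number of successful runs, averaged over tasks. The page does not state which estimator is used, so the function name `pass_hat_k` and the sample data are hypothetical.

```python
from math import comb

def pass_hat_k(results, k):
    """Estimate pass^k from per-task run outcomes.

    results: list of lists of booleans, one inner list per task,
             one boolean per run (True means the run passed).
    For each task with n runs and c passes, comb(c, k) / comb(n, k)
    is the chance that k runs drawn without replacement all passed.
    """
    if not results:
        raise ValueError("results must not be empty")
    total = 0.0
    for runs in results:
        n = len(runs)
        if not 1 <= k <= n:
            raise ValueError("k must be between 1 and the number of runs")
        c = sum(runs)
        total += comb(c, k) / comb(n, k)
    return total / len(results)

# Hypothetical outcomes for three tasks, five runs each.
sample = [
    [True, True, True, True, True],      # always passes
    [True, True, True, False, False],    # passes 3 of 5 runs
    [False, False, False, False, False], # never passes
]
for k in range(1, 6):
    print(f"pass^{k} = {pass_hat_k(sample, k):.3f}")
```

When k equals the number of runs, the estimator reduces to the fraction of tasks that passed every run, which matches the definition above. The score can only stay flat or fall as k grows, which is why the curve measures consistency rather than peak ability.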

[Chart: pass^k (%) on the y-axis, from 0% to 50%, against k = 1 to 5 on the x-axis. Models shown: GPT-5.4, Sonnet 4.6, Gemini 3 Flash, Minimax, Kimi K2.5.]

Task difficulty distribution

Tasks bucketed by aggregate success rate

Buckets show difficulty tiers based on the aggregate results of all models.

100%: 0 of 130 tasks (0%)

75%+: 3 of 130 tasks (2%)

50%+: 21 of 130 tasks (16%)

25%+: 81 of 130 tasks (62%)

0%: 29 of 130 tasks (22%)
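The bucketing described above can be sketched as follows. The tier labels come from the page, but the exact cut-offs are an assumption: here a task falls into the highest tier whose floor its aggregate success rate reaches, and the "0%" tier is assumed to catch everything below 25%. The names `TIERS`, `tier_for`, and `bucket_tasks`, and the sample rates, are hypothetical.

```python
from collections import Counter

# Tier labels as shown on the page; the numeric floors are assumptions.
TIERS = [
    ("100%", 1.00),
    ("75%+", 0.75),
    ("50%+", 0.50),
    ("25%+", 0.25),
    ("0%", 0.00),  # assumed to cover every rate below 25%
]

def tier_for(rate):
    """Return the label of the highest tier whose floor the rate reaches."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    for label, floor in TIERS:
        if rate >= floor:
            return label
    raise AssertionError("unreachable: the last tier has floor 0")

def bucket_tasks(rates):
    """Count tasks per tier.

    rates: dict mapping task id to aggregate success rate in [0, 1],
           for example the mean pass rate across all models and runs.
    """
    counts = Counter(tier_for(rate) for rate in rates.values())
    return {label: counts.get(label, 0) for label, _ in TIERS}

# Hypothetical aggregate success rates for six tasks.
example = {"t1": 1.0, "t2": 0.8, "t3": 0.5, "t4": 0.3, "t5": 0.1, "t6": 0.0}
print(bucket_tasks(example))
```

Under these assumed cut-offs each task lands in exactly one tier, so the tier counts always sum to the number of tasks.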

Example task

User Request

Correct Agent Solution

What Is Tested


Bank - internal HR dataset available for purchase