Bank - internal HR
Internal IT & ops support: leave requests, benefits enrollment, payroll inquiries, remote work arrangements, resignations.
130 test cases
22 agent tools
Domain agentic intelligence index
Composite pass^5 score across tool-use evaluations (higher is better).
Error bars show 95% confidence intervals. Browser use coming soon.
We test models on private, non-contaminated tasks. Here's what we found.
Scaling curves
K = 1…5 runs
pass^k (consistency)
Percentage of tasks passed in every one of k runs.
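As a rough illustration of the pass^k metric described above, here is a minimal Python sketch. It assumes each task was run n times (n = 5 here) and uses the combinatorial estimator C(c, k) / C(n, k), where c is the number of passing runs for a task. That estimator is an assumption on our part and is not confirmed as the exact scoring formula used for this leaderboard. The function name `pass_hat_k` and the sample data are hypothetical.

```python
from math import comb

def pass_hat_k(results, k):
    """Estimate pass^k from per-task run outcomes.

    results: list of per-task lists of booleans, one entry per run.
    k: number of runs that must all pass.
    Assumption: the estimator C(c, k) / C(n, k) averaged over tasks;
    the source does not state the formula it uses.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    total = 0.0
    for runs in results:
        n = len(runs)
        if k < 1 or k > n:
            raise ValueError("k must be between 1 and the number of runs")
        c = sum(runs)  # number of passing runs for this task
        # Probability that k runs drawn without replacement all passed.
        total += comb(c, k) / comb(n, k)
    return total / len(results)

# Hypothetical outcomes for three tasks, five runs each.
results = [
    [True, True, True, True, True],     # passes every run
    [True, True, True, True, False],    # fails once
    [False, False, False, False, False] # never passes
]

for k in range(1, 6):
    print(f"pass^{k} = {pass_hat_k(results, k):.3f}")
```

When k equals the number of runs, this reduces to the literal definition: the share of tasks that passed in every run. A single failure drops a task's contribution at k = 5 to zero, which is why pass^k falls as k grows.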
Task difficulty distribution
Tasks bucketed by aggregate success rate
Buckets show difficulty tiers based on the aggregate results of all models.
100%: 0 of 130 tasks (0%)
75%+: 3 of 130 tasks (2%)
50%+: 21 of 130 tasks (16%)
25%+: 81 of 130 tasks (62%)
0%: 29 of 130 tasks (22%)
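The bucketing above could be sketched roughly as follows. The tier boundaries are an assumption: the page does not define them precisely, so this sketch treats each "N%+" tier as "at least N% and below the next tier" and places everything under 25% in the "0%" tier. The names `bucket` and `bucket_counts` and the sample rates are hypothetical.

```python
from collections import Counter

# Tier labels, ordered from easiest to hardest.
TIERS = ["100%", "75%+", "50%+", "25%+", "0%"]

def bucket(rate):
    """Map an aggregate success rate in [0, 1] to a difficulty tier.

    Assumption: thresholds are inclusive lower bounds, and the "0%"
    tier collects every task below 25%.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    if rate >= 1.0:
        return "100%"
    if rate >= 0.75:
        return "75%+"
    if rate >= 0.50:
        return "50%+"
    if rate >= 0.25:
        return "25%+"
    return "0%"

def bucket_counts(rates):
    """Count tasks per tier, including tiers with no tasks."""
    counts = Counter(bucket(r) for r in rates)
    return {tier: counts.get(tier, 0) for tier in TIERS}

# Hypothetical aggregate success rates for six tasks.
sample_rates = [1.0, 0.8, 0.5, 0.3, 0.1, 0.0]
print(bucket_counts(sample_rates))
```

A different boundary convention (for example, a "0%" tier holding only tasks that no model ever solved) would shift counts between the two lowest tiers, so the thresholds should be checked against the benchmark's own definition before reuse.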