Bank - internal HR

Internal IT & ops support: leave requests, benefits enrollment, payroll inquiries, remote work arrangements, and resignations.

100 test cases available OTS

22 agent tools

We test models on private, non-contaminated tasks.
Here's what we found.

Last updated: July 29

[Chart: composite pass^5 score (%)]

Composite pass^5 score across Tool use evaluations (higher is better).
Error bars show 95% confidence intervals.

pass^k (consistency), for k = 1 to 5 runs

The percentage of tasks passed in every one of k runs.
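The pass^k definition above can be made concrete with a short sketch. The page does not say how its scores are computed, so everything here is an assumption for illustration: the function name `pass_hat_k`, the task names, and the toy results are hypothetical. The sketch uses the combinatorial estimator C(c, k) / C(n, k), where a task passed c of its n runs; when n equals k this reduces to "passed in every one of k runs".

```python
from math import comb


def pass_hat_k(results, k):
    """Return pass^k as a percentage.

    results maps a task id to a list of booleans, one per run.
    Assumption: each task contributes C(c, k) / C(n, k), where c is the
    number of passing runs out of n. With n == k this is 1 only if every
    run passed, matching the definition in the text.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    total = 0.0
    for task_id, runs in results.items():
        n = len(runs)
        if k < 1 or k > n:
            raise ValueError(f"k must be between 1 and {n} for task {task_id!r}")
        c = sum(runs)  # number of passing runs
        total += comb(c, k) / comb(n, k)
    return 100.0 * total / len(results)


# Hypothetical toy data: 4 tasks, 5 runs each (not real benchmark results).
toy_results = {
    "leave_request": [True, True, True, True, True],
    "payroll_inquiry": [True, True, True, True, False],
    "benefits_enrollment": [True, False, True, False, True],
    "resignation": [False, False, False, False, False],
}

for k in range(1, 6):
    print(f"pass^{k} = {pass_hat_k(toy_results, k):.1f}%")
```

On this toy data the score falls from 60% at k = 1 to 25% at k = 5, because only one of the four tasks passes in all five runs. That drop is what the metric is meant to expose: a model can pass most tasks once yet pass few of them consistently.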

Legend

Buckets show difficulty tiers, based on the aggregate of model results on the benchmarking subset.
