Tau Manufacturing
Manufacturing ops: workforce (badge access, scheduling/leave/OT, training/compliance, HR/time) + after-sales (parts, warranty/recalls, service scheduling, dealer escalation). D365-based. B2E + B2B.
50 test cases
17 agent tools
9 databases
5 models evaluated
Domain Leaderboard
Scaling curves
pass^k — Consistency
Percentage of tasks passed in all k runs.

pass@k — Ceiling
Percentage of tasks passed in at least one of k runs.
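The two metrics can be made concrete with a short sketch. It assumes each task has n recorded runs with c successes and uses the usual combinatorial estimators (the probability that k runs drawn without replacement from the n all pass, or that at least one passes). The page does not state which estimator it uses, so treat this as an illustration; the function names are hypothetical.

```python
from math import comb


def pass_hat_k(successes, n, k):
    """Estimate pass^k: mean over tasks of C(c, k) / C(n, k).

    successes: list of per-task success counts c, each out of n runs.
    """
    return sum(comb(c, k) / comb(n, k) for c in successes) / len(successes)


def pass_at_k(successes, n, k):
    """Estimate pass@k: mean over tasks of 1 - C(n - c, k) / C(n, k)."""
    return sum(1 - comb(n - c, k) / comb(n, k) for c in successes) / len(successes)


if __name__ == "__main__":
    # Made-up counts for three tasks, 15 runs each (not data from this page).
    counts = [15, 5, 0]
    print(pass_hat_k(counts, n=15, k=5))
    print(pass_at_k(counts, n=15, k=5))
```

For k = 1 both estimators reduce to the mean per-task success rate. As k grows, pass^k can only fall and pass@k can only rise, which is why the two are read as consistency and ceiling.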

Task Difficulty Distribution
Tasks Bucketed by Aggregate Success Rate
Each of the 50 tasks was attempted 15 times. Buckets show difficulty tiers.
Task counts per bucket, in the order shown: 2, 4, 24, 15, 5.
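A bucketing of this kind can be sketched as follows. The tier names and the 25% / 75% cut-offs are assumptions chosen for illustration; the page does not state the boundaries behind its five buckets.

```python
from collections import Counter


def tier(passes, attempts):
    """Map a task's pass count to a difficulty tier (hypothetical boundaries)."""
    if passes == 0:
        return "never solved"
    if passes == attempts:
        return "always solved"
    rate = passes / attempts
    if rate <= 0.25:   # assumed cut-off
        return "rarely solved"
    if rate <= 0.75:   # assumed cut-off
        return "sometimes solved"
    return "usually solved"


def tier_counts(passes_by_task, attempts):
    """Count how many tasks fall into each tier."""
    return Counter(tier(p, attempts) for p in passes_by_task.values())


if __name__ == "__main__":
    # Made-up pass counts out of 15 attempts (not data from this page).
    example = {"task_a": 0, "task_b": 3, "task_c": 8, "task_d": 14, "task_e": 15}
    print(tier_counts(example, attempts=15))
```

Treating zero and full pass counts as their own tiers keeps the never-solved and always-solved tasks visible instead of folding them into the neighbouring ranges.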
Key Findings
Sonnet 4.5
Highest single-run reliability (44%). Near-flawless tool execution, but conceptual errors on policy edge cases.
GPT-5
Strong reasoning ("almost flawless"), but poor tool handling leads to repeated technical failures.
Gemini 2.5 Pro
Lowest consistency (pass^5 = 6%) but a high ceiling (pass@5 = 60%), so retries can unlock much of its potential.
Methodology
Built on Sierra's Tau-Bench. 17 tools and 9 JSON DBs, verified by manufacturing SMEs. Each run is scored by comparing a hash of its final DB state against that of the golden trajectory. pass^k = all k runs succeed. pass@k = at least 1 of k runs succeeds.
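A minimal sketch of DB state hash comparison follows, assuming the databases are plain JSON-serialisable dictionaries. The function names and the choice of SHA-256 over a canonical JSON dump are illustrative assumptions, not the benchmark's actual code.

```python
import hashlib
import json


def state_hash(databases):
    """Hash a dict of JSON databases in a key-order-independent way."""
    # sort_keys gives a canonical serialisation, so the insertion order of
    # dict keys does not change the hash.
    canonical = json.dumps(databases, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_passes(final_state, golden_state):
    """A run passes if its final DB state hashes to the golden state's hash."""
    return state_hash(final_state) == state_hash(golden_state)


if __name__ == "__main__":
    # Hypothetical database contents, for illustration only.
    golden = {"parts": {"P-100": {"stock": 4}}, "leave": {"E-7": "approved"}}
    agent = {"leave": {"E-7": "approved"}, "parts": {"P-100": {"stock": 4}}}
    print(run_passes(agent, golden))
```

Comparing end states rather than tool-call sequences means any path that leaves the databases in the expected state counts as a pass. List order still matters under this scheme, because JSON arrays are ordered.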