TAU manufacturing
Production order lifecycle, lot management, material allocation with buffer rules, CAPA tracking, and inventory control. D365-based. B2E + B2B.
90
Test cases
19
Agent tools
Domain leaderboard
Scaling curves
K = 1…5 runs
pass^k — Consistency
Percentage of tasks passed in all k runs.
pass@k — Ceiling
Percentage of tasks passed in at least one of k runs.
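As a rough illustration of the two metrics, the sketch below computes both from per-task run outcomes. It assumes the standard combinatorial estimators (C(c, k) / C(n, k) for pass^k and 1 - C(n - c, k) / C(n, k) for pass@k, with n runs and c successes per task); the page does not state which estimator is used, and at k = n both reduce to the literal definitions above. The function names and the `results` data are hypothetical, not benchmark results.

```python
from math import comb

def pass_hat_k(results, k):
    """Consistency: chance that k runs drawn from a task's n runs all pass,
    averaged over tasks. Assumed estimator: C(c, k) / C(n, k)."""
    total = 0.0
    for runs in results.values():
        n, c = len(runs), sum(runs)
        total += comb(c, k) / comb(n, k)  # comb(c, k) is 0 when k > c
    return total / len(results)

def pass_at_k(results, k):
    """Ceiling: chance that at least one of k drawn runs passes,
    averaged over tasks. Assumed estimator: 1 - C(n - c, k) / C(n, k)."""
    total = 0.0
    for runs in results.values():
        n, c = len(runs), sum(runs)
        total += 1.0 - comb(n - c, k) / comb(n, k)
    return total / len(results)

# Hypothetical outcomes: task id -> pass/fail for each of 5 runs.
results = {
    "task_a": [True, True, True, True, True],
    "task_b": [True, False, True, False, False],
    "task_c": [False, False, False, False, False],
}

for k in range(1, 6):
    print(k, round(pass_hat_k(results, k), 3), round(pass_at_k(results, k), 3))
```

In this toy data only `task_a` passes every run, so pass^5 is 1/3, while `task_a` and `task_b` each pass at least once, so pass@5 is 2/3. At k = 1 the two metrics coincide with the mean single-run pass rate.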
Example tasks & failure patterns
Domain-specific findings
Sonnet 4.5
Highest single-run reliability (44.0%, 95% CI: [TBA, TBA]). Near-flawless tool execution, but conceptual errors on policy edge cases.
GPT-5
Strong reasoning, but poor tool handling leads to repeated technical failures. Performance varies significantly with the reasoning effort setting.
Gemini 2.5 Pro
Lowest consistency (pass^5 = 6%) but high ceiling (pass@5 = 60%). Retries unlock much of its potential.
Methodology
Built on Sierra's TAU-Bench. 17 tools, 9 JSON databases, verified by manufacturing subject matter experts. Each run is scored against a golden trajectory by comparing database state hashes. pass^k = all k runs succeed. pass@k = ≥1 of k runs succeeds. Confidence intervals computed via bootstrap resampling (1000 iterations).
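The scoring and confidence-interval steps could look roughly like the sketch below. Everything specific here is an assumption for illustration: the page does not name the hash algorithm or how state is canonicalized (SHA-256 over sorted-key JSON is used here), nor which bootstrap variant is applied (a percentile bootstrap over per-task pass/fail outcomes is shown). All function names and sample data are hypothetical.

```python
import hashlib
import json
import random

def state_hash(databases):
    """Hash a dict of JSON-serializable databases.
    Assumption: canonical form is compact JSON with sorted keys."""
    canonical = json.dumps(databases, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def task_passed(final_state, golden_state):
    """A run passes if its final database state hashes to the golden hash."""
    return state_hash(final_state) == state_hash(golden_state)

def bootstrap_ci(outcomes, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean pass rate (assumed variant)."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample tasks with replacement and record each resample's pass rate.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(iterations)
    )
    lower = means[int((alpha / 2) * iterations)]
    upper = means[int((1 - alpha / 2) * iterations) - 1]
    return lower, upper

# Hypothetical states: same content, different key order, so hashes match.
golden = {"orders": {"PO-1": {"status": "released", "qty": 10}}}
final_ok = {"orders": {"PO-1": {"qty": 10, "status": "released"}}}
final_bad = {"orders": {"PO-1": {"status": "released", "qty": 12}}}
print(task_passed(final_ok, golden), task_passed(final_bad, golden))

# Hypothetical single-run outcomes for 20 tasks (1 = pass, 0 = fail).
outcomes = [1] * 8 + [0] * 12
print(bootstrap_ci(outcomes))
```

Sorting keys before hashing keeps the comparison insensitive to key order, so only actual differences in stored values cause a mismatch. The fixed seed makes the interval reproducible across runs of the script.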
TAU manufacturing dataset available for purchase
License the TAU manufacturing RL Gym — 90 tasks, 19 tools, 9 databases with golden trajectories. Use for training, RLHF, or internal benchmarking.