TAU manufacturing
Production order lifecycle, lot management, material allocation with buffer rules, CAPA tracking, and inventory control.
90 test cases
19 agent tools
Domain agentic intelligence index
Composite pass^5 score across tool-use evaluations (higher is better).
Error bars show 95% confidence intervals. Browser use coming soon.
We test models on private, non-contaminated tasks. Here's what we found.
Scaling curves
k = 1 to 5 runs
pass^k — Consistency
Percentage of tasks passed in every one of k runs.
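As a rough illustration of the consistency metric, the sketch below estimates pass^k from per-task success counts. It assumes the combinatorial estimator used in the original TAU-Bench work (the chance that k runs drawn from the n recorded runs all succeeded, averaged over tasks); this page does not confirm which estimator it uses. The function name and the sample counts are hypothetical.

```python
from math import comb


def pass_hat_k(successes_per_task, n_runs, k):
    """Estimate pass^k from the number of successful runs per task.

    successes_per_task: one integer per task, each between 0 and n_runs.
    """
    if not 1 <= k <= n_runs:
        raise ValueError("k must be between 1 and n_runs")
    # comb(c, k) / comb(n_runs, k) is the probability that k runs drawn
    # without replacement from the n_runs recorded runs all succeeded.
    per_task = [comb(c, k) / comb(n_runs, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)


# Hypothetical data: successes out of 5 runs for four tasks.
counts = [5, 4, 0, 2]
curve = {k: pass_hat_k(counts, 5, k) for k in range(1, 6)}
```

When k equals the number of recorded runs, the estimate reduces to the share of tasks that passed every run, which matches the definition above.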
Task difficulty distribution
Tasks bucketed by aggregate success rate
Buckets show difficulty tiers based on aggregate results across models.
100%: 2 of 90 tasks (2%)
75%+: 6 of 90 tasks (7%)
50%+: 12 of 90 tasks (14%)
25%+: 25 of 90 tasks (29%)
0%: 41 of 90 tasks (48%)
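One way such tiers could be derived is sketched below: pool each task's pass/fail outcomes across models and runs, compute the success rate, and assign the highest tier whose threshold it meets. The thresholds mirror the labels above, but the exact rule is an assumption. In particular, the page does not say how rates strictly between 0% and 25% are handled, so the sketch places them in a hypothetical "under 25%" tier. All names and sample data are illustrative.

```python
from collections import Counter

# Assumed tier thresholds, checked from highest to lowest.
TIERS = [(1.0, "100%"), (0.75, "75%+"), (0.50, "50%+"), (0.25, "25%+")]


def difficulty_tier(success_rate):
    """Map an aggregate success rate in [0, 1] to a tier label."""
    for threshold, label in TIERS:
        if success_rate >= threshold:
            return label
    # Hypothetical handling of the remainder: exact zero vs. low but nonzero.
    return "0%" if success_rate == 0 else "under 25%"


def bucket_tasks(results):
    """Count tasks per tier.

    results maps a task id to a list of booleans (True = passed),
    pooled over every model and run.
    """
    tiers = Counter()
    for outcomes in results.values():
        tiers[difficulty_tier(sum(outcomes) / len(outcomes))] += 1
    return tiers


# Hypothetical outcomes for five tasks.
sample = {
    "t1": [True, True, True, True],
    "t2": [True, True, True, False],
    "t3": [True, False, True, False],
    "t4": [False, False, False, False],
    "t5": [True] + [False] * 7,
}
tier_counts = bucket_tasks(sample)
```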
Example task
User Request
Correct Agent Solution
What Is Tested
Methodology
Built on Sierra's TAU-Bench. 19 tools, 9 JSON databases, verified by manufacturing subject-matter experts. Golden trajectories are scored by comparing database state hashes. pass^k = all k runs succeed. pass@k = at least 1 of k runs succeeds. Confidence intervals are computed via bootstrap resampling (1,000 iterations).
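The two scoring steps described here can be sketched as follows, under stated assumptions. The page does not specify the hash function, the serialization, or the bootstrap variant, so the sketch uses SHA-256 over canonical JSON and a percentile bootstrap over per-task scores. Function names and parameters are hypothetical.

```python
import hashlib
import json
import random


def state_hash(databases):
    """Hash a dict of JSON-serializable databases, independent of key order."""
    # Assumption: canonical JSON (sorted keys, fixed separators) plus SHA-256.
    canonical = json.dumps(databases, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_passes(final_state, golden_state):
    """A run passes if its final database state matches the golden one."""
    return state_hash(final_state) == state_hash(golden_state)


def bootstrap_ci(task_scores, iterations=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-task scores."""
    rng = random.Random(seed)
    n = len(task_scores)
    # Resample tasks with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(task_scores, k=n)) / n for _ in range(iterations)
    )
    lo_idx = round((1 - level) / 2 * iterations)
    hi_idx = round((1 + level) / 2 * iterations) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical usage: the same state with a different key order still matches.
golden = {"orders": {"PO-1": {"status": "released", "qty": 10}}}
final = {"orders": {"PO-1": {"qty": 10, "status": "released"}}}
wrong = {"orders": {"PO-1": {"qty": 12, "status": "released"}}}
scores = [1] * 20 + [0] * 70  # hypothetical per-task pass/fail scores
low, high = bootstrap_ci(scores)
```

Comparing hashes of the final state rather than the sequence of tool calls means any trajectory that reaches the golden state counts as correct, whatever path it took.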
Trusted by Leading AI Teams
TAU manufacturing dataset available for purchase
License the TAU manufacturing RL Gym: 90 tasks, 19 tools, and 9 databases with golden trajectories. Use it for training, RLHF, or internal benchmarking.