TAU manufacturing

Production order lifecycle, lot management, material allocation with buffer rules, CAPA tracking, and inventory control.

90 test cases

19 agent tools

Domain agentic intelligence index

Composite pass^5 score across tool-use evaluations (higher is better).
Error bars show 95% confidence intervals. Browser use coming soon.
We test models on private, non-contaminated tasks. Here's what we found.

Composite pass^5 score (%)
Last updated: April 6

Scaling curves

K = 1…5 runs

pass^k — Consistency

% tasks passed in every one of k runs.
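As a rough illustration of the metric, the sketch below computes pass^k from per-task run outcomes. It assumes the combinatorial estimator C(c, k) / C(n, k), where c is the number of successful runs out of n; whether this leaderboard uses that estimator or a simpler check is not stated here. The function name `pass_hat_k` and the sample data are hypothetical.

```python
from math import comb

def pass_hat_k(results, k):
    """Estimate pass^k: the chance that k runs of a task all succeed,
    averaged over tasks.

    results maps a task id to a list of booleans, one per run.
    Assumption: each task's value is estimated as C(c, k) / C(n, k),
    with c successes out of n runs.
    """
    total = 0.0
    for runs in results.values():
        n = len(runs)
        if k > n:
            raise ValueError("k cannot exceed the number of runs per task")
        c = sum(runs)
        # comb(c, k) is 0 when c < k, so tasks with too few successes score 0.
        total += comb(c, k) / comb(n, k)
    return total / len(results)

# Hypothetical outcomes for four tasks, five runs each.
sample = {
    "task-1": [True, True, True, True, True],
    "task-2": [True, True, False, True, True],
    "task-3": [False, False, False, False, False],
    "task-4": [True, True, True, True, True],
}

scores = {k: pass_hat_k(sample, k) for k in range(1, 6)}
```

With n = 5 and k = 5 the estimator reduces to the plain definition: a task counts only if all five runs passed. In the sample data, a single failed run is enough to drop task-2 out of the k = 5 score.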

[Chart: pass^k (0% to 50%) versus k = 1 to 5 for GPT-5.4, Sonnet 4.6, Gemini 3 Flash, Minimax, and Kimi K2.5]

Task difficulty distribution

Tasks bucketed by aggregate success rate

Buckets show difficulty tiers based on the aggregate results across models.

100%: 2 of 90 tasks (2%)
75%+: 6 of 90 tasks (7%)
50%+: 12 of 90 tasks (14%)
25%+: 25 of 90 tasks (29%)
0%: 41 of 90 tasks (48%)
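The bucketing can be sketched as a simple threshold function over each task's aggregate success rate. The tier labels come from the page. The exact boundaries are an assumption, in particular the choice to place every task below 25% in the lowest tier. The names `difficulty_tier` and `bucket_tasks` are hypothetical.

```python
from collections import Counter

def difficulty_tier(success_rate):
    """Map an aggregate success rate in [0, 1] to a tier label.

    Assumption: tiers are half-open ranges, and anything below 25%
    falls into the lowest tier, labelled "0%" on the page.
    """
    if success_rate >= 1.0:
        return "100%"
    if success_rate >= 0.75:
        return "75%+"
    if success_rate >= 0.50:
        return "50%+"
    if success_rate >= 0.25:
        return "25%+"
    return "0%"

def bucket_tasks(rates):
    """Count tasks per tier, given a dict of task id -> success rate."""
    return Counter(difficulty_tier(rate) for rate in rates.values())

# Hypothetical aggregate success rates for five tasks.
example_rates = {"a": 1.0, "b": 0.8, "c": 0.5, "d": 0.3, "e": 0.0}
counts = bucket_tasks(example_rates)
```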

Example task

User Request

Correct Agent Solution

What Is Tested

Methodology

Built on Sierra's TAU-Bench. 19 tools and 9 JSON databases, verified by manufacturing subject-matter experts. Runs are scored against golden trajectories by comparing database state hashes. pass^k = all k runs succeed. pass@k = at least 1 of k runs succeeds. Confidence intervals are computed via bootstrap resampling (1,000 iterations).
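A minimal sketch of the two scoring steps described above, under several assumptions: the database state is JSON-serializable, the hash is SHA-256 over a canonical (key-sorted) JSON dump, and the confidence interval is a percentile bootstrap that resamples tasks with replacement. None of these details are confirmed by the page, and all function names are hypothetical.

```python
import hashlib
import json
import random

def state_hash(db_state):
    """Hash a JSON-serializable database state.

    Keys are sorted so that two states with the same content but a
    different key order produce the same digest.
    """
    canonical = json.dumps(db_state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def run_succeeds(final_state, golden_state):
    """A run passes if its final state hashes to the golden state's hash."""
    return state_hash(final_state) == state_hash(golden_state)

def bootstrap_ci(task_scores, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-task scores."""
    rng = random.Random(seed)
    n = len(task_scores)
    means = []
    for _ in range(iterations):
        # Resample n tasks with replacement and record the mean score.
        resample = [task_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * iterations)]
    upper = means[min(iterations - 1, int((1 - alpha / 2) * iterations))]
    return lower, upper

# Hypothetical example: same content, different key order.
golden = {"lots": {"L-1": {"qty": 40}}, "orders": {"PO-7": "released"}}
final = {"orders": {"PO-7": "released"}, "lots": {"L-1": {"qty": 40}}}
passed = run_succeeds(final, golden)

# Hypothetical per-task scores (1 = passed in all k runs, 0 = otherwise).
low, high = bootstrap_ci([1, 0, 1, 1, 0, 0, 1, 0])
```

Comparing hashes instead of full states keeps the check cheap, but it is all-or-nothing: any difference in the final database, however small, counts as a failure.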

TAU manufacturing dataset available for purchase

License the TAU manufacturing RL Gym: 90 tasks, 19 tools, and 9 databases with golden trajectories. Use it for training, RLHF, or internal benchmarking.