TAU manufacturing

Production order lifecycle, lot management, material allocation with buffer rules, CAPA tracking, and inventory controller. D365-based. B2E + B2B.

90 test cases

19 agent tools

Domain leaderboard

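| Rank | Model | Pass^1 | Pass@5 | 95% CI | Price / task |
|------|-------|--------|--------|--------|--------------|
| #1 | Sonnet 4.5 (Anthropic) | 44.0% | 60.0% | [TBA, TBA] | TBA |
| #2 | GPT-5 (OpenAI) | 40.8% | 52.0% | [TBA, TBA] | TBA |
| #3 | Gemini 2.5 (Google) | 32.0% | 60.0% | [TBA, TBA] | TBA |
| #4 | Kimi K2 Thinking (Moonshot) | 23.6% | 46.0% | [TBA, TBA] | TBA |
| #5 | Minimax (Minimax) | 13.2% | 30.0% | [TBA, TBA] | TBA |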

Scaling curves

K = 1…5 runs

pass^k (consistency): % of tasks passed in every one of k runs.

pass@k (ceiling): % of tasks passed in at least one of k runs.

[Figure: two line charts over k = 1 to 5, one for pass^k (y-axis 0% to 50%) and one for pass@k (y-axis 0% to 70%), each with series for Sonnet 4.5, GPT-5, Gemini 2.5, Kimi K2, and Minimax.]
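The two curve definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: it assumes each task was run n = 5 times and uses the standard combinatorial estimators for k < n (the page does not say which estimator was used). The function names and the `success_counts` input are hypothetical.

```python
from math import comb


def pass_hat_k(successes: int, n: int, k: int) -> float:
    """pass^k for one task: chance that k runs drawn from the n recorded runs all pass."""
    return comb(successes, k) / comb(n, k)


def pass_at_k(successes: int, n: int, k: int) -> float:
    """pass@k for one task: chance that at least one of k drawn runs passes."""
    return 1.0 - comb(n - successes, k) / comb(n, k)


def scaling_curves(success_counts, n=5):
    """Average both metrics over tasks for k = 1..n.

    success_counts[i] is the number of passing runs (0..n) for task i.
    Returns {k: (pass^k, pass@k)} as fractions in [0, 1].
    """
    num_tasks = len(success_counts)
    curves = {}
    for k in range(1, n + 1):
        consistency = sum(pass_hat_k(c, n, k) for c in success_counts) / num_tasks
        ceiling = sum(pass_at_k(c, n, k) for c in success_counts) / num_tasks
        curves[k] = (consistency, ceiling)
    return curves


if __name__ == "__main__":
    # Hypothetical per-task pass counts out of 5 runs, for illustration only.
    for k, (consistency, ceiling) in scaling_curves([5, 3, 0, 1]).items():
        print(f"k={k}: pass^k={consistency:.3f}  pass@k={ceiling:.3f}")
```

At k = 1 the two metrics coincide (both equal the mean single-run pass rate); as k grows, pass^k can only fall and pass@k can only rise, which is why the two charts diverge.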

Domain-specific findings

Sonnet 4.5

Highest single-run reliability (pass^1 = 44.0%, 95% CI: [TBA, TBA]). Near-flawless tool execution, but makes conceptual errors on policy edge cases.

GPT-5

Strong reasoning, but poor tool handling leads to repeated technical failures. Performance varies significantly with the reasoning-effort setting.

Gemini 2.5 Pro

Lowest consistency (pass^5 = 6%) but a high ceiling (pass@5 = 60%). Much of its potential is unlocked only with retries.

Methodology

Built on Sierra's TAU-Bench. 17 tools, 9 JSON databases, verified by manufacturing subject matter experts. Golden trajectories scored via database state-hash comparison. pass^k: all k runs succeed. pass@k: at least one of k runs succeeds. Confidence intervals computed via bootstrap resampling (1000 iterations).
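As a rough sketch of the two scoring steps named above, the Python below hashes a canonicalized JSON database state to compare a run's final state with the golden one, and computes a percentile bootstrap interval over per-task scores. Everything here is an assumption for illustration: the page does not specify the hash function, the canonicalization, the resampling unit, or the interval type, and the names `state_hash`, `run_passes`, and `bootstrap_ci` are hypothetical.

```python
import hashlib
import json
import random


def state_hash(db_state: dict) -> str:
    """Hash a JSON-serializable state; sorting keys makes the hash key-order independent.

    SHA-256 over canonical JSON is an assumption, not the benchmark's documented choice.
    """
    canonical = json.dumps(db_state, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_passes(final_state: dict, golden_state: dict) -> bool:
    """A run passes when its final state hashes to the same value as the golden state."""
    return state_hash(final_state) == state_hash(golden_state)


def bootstrap_ci(task_scores, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean score, resampling tasks with replacement."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(task_scores)
    means = []
    for _ in range(iterations):
        sample = [task_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * iterations)]
    upper = means[min(iterations - 1, int((1 - alpha / 2) * iterations))]
    return lower, upper
```

A caller would pass `bootstrap_ci` one score per task, for example that task's fraction of passing runs, and report the returned pair as the 95% interval. Comparing hashes rather than full states keeps the check cheap, at the cost of giving no hint about which record differs when a run fails.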

TAU manufacturing dataset available for purchase

License the TAU manufacturing RL Gym: 90 tasks, 19 tools, and 9 databases with golden trajectories. Use it for training, RLHF, or internal benchmarking.