TAU manufacturing

Production order lifecycle, lot management, material allocation with buffer rules, CAPA tracking, and inventory controller. D365-based. B2E + B2B.

Test cases: 90

Agent tools: 19

Purchase now

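Domain leaderboard

| Rank | Model (provider) | Pass^1 | Pass@5 | 95% CI | Price / task |
|------|------------------|--------|--------|--------|--------------|
| 1 | Sonnet 4.5 (Anthropic) | 44.0% | 60.0% | [TBA, TBA] | TBA |
| 2 | GPT-5 (OpenAI) | 40.8% | 52.0% | [TBA, TBA] | TBA |
| 3 | Gemini 2.5 (Google) | 32.0% | 60.0% | [TBA, TBA] | TBA |
| 4 | Kimi K2 Thinking (Moonshot) | 23.6% | 46.0% | [TBA, TBA] | TBA |
| 5 | Minimax | 13.2% | 30.0% | [TBA, TBA] | TBA |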

Scaling curves

K = 1…5 runs

pass^k (consistency): % of tasks passed in every one of k runs.

pass@k (ceiling): % of tasks passed in at least one of k runs.

Models plotted: Sonnet 4.5, GPT-5, Gemini 2.5, Kimi K2, Minimax.
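The two curve definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own scoring code: the function names, the `results` layout (task id mapped to a list of pass/fail outcomes), and the combinatorial estimator used when more than k runs are recorded are all assumptions. When exactly k runs are recorded per task, the estimator reduces to the plain definitions ("every one of k runs" and "at least one of k runs").

```python
from math import comb


def pass_hat_k(results, k):
    """pass^k: mean over tasks of C(c, k) / C(n, k).

    For a task with n recorded runs and c passes, this is the chance that
    k runs drawn without replacement from the n all passed.
    """
    total = 0.0
    for runs in results.values():
        n, c = len(runs), sum(runs)
        if k > n:
            raise ValueError("k cannot exceed the number of recorded runs")
        total += comb(c, k) / comb(n, k)
    return total / len(results)


def pass_at_k(results, k):
    """pass@k: mean over tasks of 1 - C(n - c, k) / C(n, k).

    This is the chance that at least one of k runs drawn without
    replacement from the n recorded runs passed.
    """
    total = 0.0
    for runs in results.values():
        n, c = len(runs), sum(runs)
        if k > n:
            raise ValueError("k cannot exceed the number of recorded runs")
        total += 1.0 - comb(n - c, k) / comb(n, k)
    return total / len(results)


# Hypothetical outcomes for three tasks, five runs each (illustrative only).
results = {
    "task_a": [True, True, True, True, True],
    "task_b": [True, False, True, True, False],
    "task_c": [False, False, False, False, False],
}

for k in range(1, 6):
    print(k, round(pass_hat_k(results, k), 3), round(pass_at_k(results, k), 3))
```

At k = 1 the two metrics coincide (both equal the mean single-run pass rate); as k grows, pass^k can only fall and pass@k can only rise, which is why the two curves diverge.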

Task difficulty distribution

Tasks bucketed by aggregate success rate. Each of the 50 tasks was attempted 15 times; buckets show difficulty tiers by the number of passing attempts.

Always solved: 15/15 attempts
Mostly solved: 11-14/15 attempts
Medium: 6-10/15 attempts
Rarely solved: 1-5/15 attempts
Never solved: 0/15 attempts
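As a rough sketch of how the tiers above could be assigned, the snippet below maps a task's pass count out of 15 attempts to its bucket and tallies the distribution. The function name, the `pass_counts` mapping, and the sample counts are hypothetical; only the tier boundaries come from the page.

```python
from collections import Counter

ATTEMPTS = 15  # attempts per task, as stated in the caption


def difficulty_bucket(passes):
    """Map a pass count out of 15 attempts to its difficulty tier."""
    if not 0 <= passes <= ATTEMPTS:
        raise ValueError("passes must be between 0 and 15")
    if passes == ATTEMPTS:
        return "Always solved"
    if passes >= 11:
        return "Mostly solved"
    if passes >= 6:
        return "Medium"
    if passes >= 1:
        return "Rarely solved"
    return "Never solved"


# Hypothetical pass counts per task (illustrative only, not benchmark data).
pass_counts = {"task_a": 15, "task_b": 12, "task_c": 7, "task_d": 3, "task_e": 0}

distribution = Counter(difficulty_bucket(p) for p in pass_counts.values())
for tier, count in distribution.items():
    print(f"{tier}: {count} task(s)")
```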

Example tasks & failure patterns

Sample task trajectories

Sample task

Sample task description with multi-step workflow involving warranty recall lookup and dealer escalation.

Trajectory & failure analysis: to be added

Example task

Example task involving badge access scheduling conflict resolution across multiple databases.

Trajectory & failure analysis: to be added

Common failure patterns

Analysis of systematic failure modes: tool call syntax errors, policy misinterpretation, multi-step state tracking failures, premature task termination.

Domain-specific findings

Sonnet 4.5

Highest single-run reliability (44.0%, 95% CI: [TBA, TBA]). Near-flawless tool execution, but conceptual errors on policy edge cases.

GPT-5

Strong reasoning but poor tool handling leads to repetitive technical failures. Performance varies significantly by reasoning effort setting.

Gemini 2.5 Pro

Lowest consistency (pass^5 = 6%) but a high ceiling (pass@5 = 60%), so much of its potential is unlocked with retries.

Methodology

Built on Sierra's TAU-Bench. 17 tools, 9 JSON databases, verified by manufacturing subject matter experts. Golden trajectories scored via database state hash comparison. pass^k = all k runs succeed. pass@k = ≥1 of k runs succeeds. Confidence intervals computed via bootstrap resampling (1000 iterations).
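For readers unfamiliar with bootstrap intervals, here is a minimal sketch of one common variant, the percentile bootstrap, using 1000 resamples as stated above. The page does not say which unit is resampled or which interval method is used, so resampling per-task scores with replacement and taking the 2.5th and 97.5th percentiles are assumptions, as are the function name and the sample data.

```python
import random


def bootstrap_ci(task_scores, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-task scores.

    Assumption: tasks are the resampling unit. Each iteration draws
    len(task_scores) scores with replacement and records their mean.
    """
    rng = random.Random(seed)
    n = len(task_scores)
    means = []
    for _ in range(iterations):
        sample = rng.choices(task_scores, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lower_idx = int((alpha / 2) * iterations)
    upper_idx = min(iterations - 1, int((1 - alpha / 2) * iterations))
    return means[lower_idx], means[upper_idx]


# Hypothetical per-task pass/fail scores (illustrative only, not benchmark data).
scores = [1.0] * 22 + [0.0] * 28
low, high = bootstrap_ci(scores)
print(f"mean = {sum(scores) / len(scores):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Fixing the seed keeps the interval reproducible between runs; a different seed will shift the endpoints slightly.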

TAU manufacturing dataset available for purchase

License the TAU manufacturing RL Gym: 90 tasks, 19 tools, and 9 databases with golden trajectories. Use it for training, RLHF, or internal benchmarking.