Tau Manufacturing

Manufacturing operations: workforce (badge access; scheduling, leave, and overtime; training and compliance; HR and timekeeping) plus after-sales (parts, warranty and recalls, service scheduling, dealer escalation). Based on Dynamics 365 (D365). Covers B2E and B2B scenarios.

Test cases: 50
Agent tools: 17
Databases: 9
Models evaluated: 5

Domain Leaderboard

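| Model | Provider | Pass^1 | Pass^5 | Pass@1 | Pass@5 | Tasks solved (5/5) |
| --- | --- | --- | --- | --- | --- | --- |
| Sonnet 4.5 | Anthropic | 44.0% | 24.0% | 44.0% | 60.0% | 12 / 50 |
| GPT-5 | OpenAI | 40.8% | 22.0% | 40.8% | 52.0% | 14 / 50 |
| Gemini 2.5 | Google | 32.0% | 6.0% | 32.0% | 60.0% | 7 / 50 |
| Kimi K2 Thinking | Moonshot | 23.6% | 10.0% | 23.6% | 46.0% | |
| Minimax | Minimax | 13.2% | 4.0% | 13.2% | 30.0% | |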


Scaling curves

pass^k (Consistency): % of tasks passed in every one of k runs.

pass@k (Ceiling): % of tasks passed in at least one of k runs.

[Charts: pass^k and pass@k curves for Sonnet 4.5, GPT-5, Gemini 2.5, Kimi K2, and Minimax.]
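
Both curves can be estimated from per-task success counts. The following is a minimal sketch, assuming each task was run n = 5 times and using the usual combinatorial estimators; the page does not state how values for k < 5 are computed. The function names and the sample counts are hypothetical.

```python
from math import comb

def pass_hat_k(successes, n, k):
    """pass^k: average chance that k runs drawn from a task's n runs ALL pass."""
    return sum(comb(c, k) / comb(n, k) for c in successes) / len(successes)

def pass_at_k(successes, n, k):
    """pass@k: average chance that k runs drawn from a task's n runs include AT LEAST ONE pass."""
    return sum(1 - comb(n - c, k) / comb(n, k) for c in successes) / len(successes)

# Hypothetical data: number of passing runs (out of n = 5) for four tasks.
successes = [5, 3, 0, 1]
n = 5

for k in range(1, n + 1):
    print(k, round(pass_hat_k(successes, n, k), 3), round(pass_at_k(successes, n, k), 3))
```

At k = 1 the two estimators coincide, which matches the identical Pass^1 and Pass@1 columns in the leaderboard. As k grows, pass^k can only fall and pass@k can only rise.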

Key Findings

Sonnet 4.5

Highest single-run reliability (44%). Near-flawless tool execution, but conceptual errors on policy edge cases.

GPT-5

Strong reasoning ("almost flawless"), but poor tool handling leads to repetitive technical failures.

Gemini 2.5 Pro

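Low consistency (pass^5 = 6.0%, above only Minimax at 4.0%) but a high ceiling (pass@5 = 60.0%, tied with Sonnet 4.5). This is the widest gap between ceiling and consistency of any model evaluated, so retries unlock much of its potential.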
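
One way to quantify how much a model stands to gain from retries is the difference between its pass@5 and pass^5 scores. The sketch below computes that difference from the leaderboard figures. The name `retry_gap` is a label introduced here for illustration, not a metric defined by the benchmark.

```python
# (pass^5, pass@5) in percent, copied from the leaderboard.
leaderboard = {
    "Sonnet 4.5": (24.0, 60.0),
    "GPT-5": (22.0, 52.0),
    "Gemini 2.5": (6.0, 60.0),
    "Kimi K2 Thinking": (10.0, 46.0),
    "Minimax": (4.0, 30.0),
}

def retry_gap(pass_hat_5, pass_at_5):
    """Percentage points of tasks solved in some run but not in every run."""
    return pass_at_5 - pass_hat_5

gaps = {model: retry_gap(*scores) for model, scores in leaderboard.items()}
widest = max(gaps, key=gaps.get)

for model, gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(f"{model}: {gap:.1f} points")
print("Widest gap:", widest)
```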

Methodology

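Built on Sierra's Tau-Bench. The domain comprises 17 tools and 9 JSON databases, verified by manufacturing subject-matter experts (SMEs). Each run is scored against a golden trajectory by comparing hashes of the resulting database state. pass^k: all k runs succeed. pass@k: at least one of k runs succeeds.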
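
Scoring by database state hash can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's actual implementation: the source names neither the hash algorithm nor the serialization, so SHA-256 over canonical JSON is assumed here, and the function names and sample records are hypothetical.

```python
import hashlib
import json

def state_hash(databases):
    """Hash a dict of JSON databases; sorted keys make the result order-independent."""
    canonical = json.dumps(databases, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def run_passes(final_state, golden_state):
    """A run passes when its final DB state hashes to the same value as the golden state."""
    return state_hash(final_state) == state_hash(golden_state)

# Hypothetical states: same content with a different key order should still match.
golden = {"leave_requests": {"LR-1": {"status": "approved", "days": 2}}}
agent_ok = {"leave_requests": {"LR-1": {"days": 2, "status": "approved"}}}
agent_bad = {"leave_requests": {"LR-1": {"status": "pending", "days": 2}}}

print(run_passes(agent_ok, golden))
print(run_passes(agent_bad, golden))
```

Comparing end states rather than tool-call sequences means a run that reaches the correct database state by a different route still counts as a pass.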