Tau-bench Dataset Extension
Benchmark agent reasoning, planning, and tool-use in complex B2E workflows.
Trusted by Leading AI Teams
About the dataset
What it is
Test cases with task description, user instructions and a single correct golden trajectory.
Domain covers 9 structured databases, 17 tools and 1 policy.
Who it's for
Frontier labs looking to benchmark agent reasoning, planning and tool-use in complex business-to-employee manufacturing workflows.
On request
Expert-annotated trajectories that highlight and classify agent behaviour and failures, so that model gaps are clearly identified.
Complexity assessment
Model performance will drift with updates; benchmark calibrated to <50% pass^1
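| # Runs (k) | pass^k score¹: GPT-5 | pass^k score¹: Sonnet 4.5 | pass^k score¹: Gemini 2.5 Pro | pass@k score²: GPT-5 | pass@k score²: Sonnet 4.5 | pass@k score²: Gemini 2.5 Pro |
|---|---|---|---|---|---|---|
| 1 | 40.8% | 44.0% | 32.0% | 40.8% | 44.0% | 32.0% |
| 3 | 28.6% | 29.8% | 12.0% | 50.2% | 56.8% | 52.8% |
| 5 | 22.0% | 24.0% | 6.0% | 52.0% | 60.0% | 60.0% |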
1. Pass^k estimates the probability that an agent would succeed on all k independent attempts
2. Pass@k measures the probability that at least one of k independent solution attempts will succeed
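The two metrics can be sketched per task as follows. This page does not say which estimator produced the scores, so the combinatorial estimators below are an assumption: given `n` recorded runs of one task with `c` successes, they estimate the chance that all `k` sampled runs succeed (pass^k) or that at least one does (pass@k). The function names are illustrative only.

```python
from math import comb


def _check(n: int, c: int, k: int) -> None:
    # n runs recorded, c of them successful, k runs sampled without replacement
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate of P(all k attempts succeed) for one task (assumed estimator)."""
    _check(n, c, k)
    return comb(c, k) / comb(n, k)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate of P(at least one of k attempts succeeds) for one task (assumed estimator)."""
    _check(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score would then be the mean over tasks. At k = 1 both functions reduce to `c / n`, which is consistent with the identical pass^1 and pass@1 values in the first table row.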
Example
Task description
Execute a standard material allocation cascade: verify the user’s access level, select the correct lot, create the allocation with correct buffers, modify the order to update quantities and status, then modify the lot status, and unreserve the corresponding quantities in the inventory while reducing the stock amounts accordingly.
Instructions
You are Alex Chen. Your user ID is USR-PROD-01. You are working on production order ORD-MATRIX-7F4, and you need to urgently fully allocate material for it. After checking the available lots, you determined that LOT-AURORA-3VQ is a good fit, as it can fully satisfy the order. Ask the assistant to use this lot to allocate material for SKU-ALLOY-7XK. Wait until the assistant completes all planned actions, then thank them and end the conversation.
Solution
Golden Trajectory
<...>
create_allocation(order_id="ORD-MATRIX-7F4", sku_id="SKU-ALLOY-7XK", requested_quantity=1350, final_allocation_quantity=1350, lot_buffer=0, ddmrp_buffer=0, lots={"LOT-AURORA-3VQ": {"allocated_quantity":1350}})
modify_lot(lot_id="LOT-AURORA-3VQ", status="consumed", lot_remaining_quantity=0)
modify_order(order_id="ORD-MATRIX-7F4", status="in_progress", upsert_lines={"SKU-ALLOY-7XK": {"requested_quantity":1350,"allocated_quantity":1350}})
modify_inventory(sku_id="SKU-ALLOY-7XK", available_to_promise_quantity=1350, reserved_quantity=0, in_stock_quantity=1350)
<...>
Executed Trajectory
<...>
create_allocation(order_id="ORD-MATRIX-7F4", sku_id="SKU-ALLOY-7XK", requested_quantity=1350, final_allocation_quantity=1350, lot_buffer=0, ddmrp_buffer=0, lots={"LOT-AURORA-3VQ": {"allocated_quantity":1350}})
modify_lot(lot_id="LOT-AURORA-3VQ", status="consumed", lot_remaining_quantity=0)
modify_order(order_id="ORD-MATRIX-7F4", status="in_progress", upsert_lines={"SKU-ALLOY-7XK": {"requested_quantity":1350,"allocated_quantity":1350}})
modify_inventory(sku_id="SKU-ALLOY-7XK", available_to_promise_quantity=1350, reserved_quantity=0)
[missing argument: in_stock_quantity=1350]
<...>
Agent failure classification:
Argument & data errors
Description:
After the allocation, the agent did not reduce in_stock_quantity to 1350: its modify_inventory call omitted that argument.
Policy
Manufacturing Operations Policy Manual
Critical Operating Constraints
The current time is 2024-05-15 15:00:00 EST.
As a manufacturing operations assistant, you help users manage production orders, lots, material allocations, and CAPAs, and provide relevant information from the database. You must not give subjective judgments or provide any other information that was not explicitly requested by the user. You should make at most one tool call at a time. If you make a tool call, you should not respond to the user at the same time. If you respond to the user, you should not make a tool call. You should transfer the user to a human agent only if the request cannot be handled within the scope of your actions and the user explicitly agrees to be transferred after you propose it. You should begin by requesting the user_id and verifying the user’s access level. You are forbidden from helping the user with any tasks until this step is complete. You can only serve one user per conversation.
Key Definitions
<sub-section removed>
Inventory
Inventory is a consolidated per-SKU view that tracks in_stock_quantity, hold_quantity, reserved_quantity, and available_to_promise_quantity.
in_stock_quantity – The sum of all lot_remaining_quantity values for lots related to this specific SKU.
hold_quantity – The sum of all lot_remaining_quantity values for lots that are pending release checks, expired or under a CAPA.
reserved_quantity – The quantity requested by all production orders but not yet allocated.
available_to_promise_quantity – Calculated as: Max(in_stock_quantity – hold_quantity – reserved_quantity, 0). In other words, ATP can't be less than zero.
Inventory can be updated only when an order is created or modified, an allocation is created, or a lot is created or updated, and only if the change affects inventory metrics.
It is the AI's responsibility to keep inventory up to date and to account for every change immediately after it is made.
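A minimal sketch of the inventory arithmetic above, under stated assumptions. The ATP formula is taken directly from the definition; the `after_allocation` helper is a hypothetical reading of how an allocation changes the metrics (stock falls by the final allocated quantity, the reservation falls by the requested quantity), and the starting numbers in the usage lines are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Inventory:
    in_stock_quantity: int
    hold_quantity: int
    reserved_quantity: int

    @property
    def available_to_promise_quantity(self) -> int:
        # ATP = Max(in_stock - hold - reserved, 0), so it never goes below zero
        return max(
            self.in_stock_quantity - self.hold_quantity - self.reserved_quantity, 0
        )


def after_allocation(
    inv: Inventory, requested_quantity: int, final_allocation_quantity: int
) -> Inventory:
    """Assumed update rule: allocated material leaves stock and is no longer reserved."""
    return Inventory(
        in_stock_quantity=inv.in_stock_quantity - final_allocation_quantity,
        hold_quantity=inv.hold_quantity,
        reserved_quantity=max(inv.reserved_quantity - requested_quantity, 0),
    )


# Hypothetical starting state, not taken from the dataset
before = Inventory(in_stock_quantity=2700, hold_quantity=0, reserved_quantity=1350)
after = after_allocation(before, requested_quantity=1350, final_allocation_quantity=1350)
```

The resulting values are what a `modify_inventory` call would need to carry, since the tools themselves do not recompute any metric.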
Material Allocation
Material allocation is the step where physical quantities of lots are committed to a production order. Buffers (lot-based and DDMRP) must be applied during this stage according to the Buffer Management section of the policy. Once material is successfully allocated, the resulting changes to the corresponding entities must be reflected immediately after the allocation is completed. Material allocation can only be performed for production orders in the pending stage.
When selecting lots to fulfill allocation for a given SKU of a given order:
Select lots in order of earliest expiry date (FEFO – First Expired, First Out).
If two or more lots share the same expiry date, select the lot with the smallest remaining quantity first.
Continue selecting lots until the total quantity (requested + buffers) is reached.
Lots belonging to one SKU cannot be used to allocate material for a different SKU.
When closing an order, a new lot must be created immediately after closure.
Partial allocations are possible: the requested quantity of an allocation does not have to match the requested quantity of the order, because it is the amount the user wants to allocate this time.
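The lot-selection rules above can be sketched as a sort followed by a greedy fill. This is an illustration under assumptions, not the benchmark's implementation: the `Lot` record and `select_lots` are hypothetical, the field names mirror the `modify_lot` inputs, the released-status filter comes from the Lot Management rules, and `target_quantity` is taken as given (requested quantity plus buffers).

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple


@dataclass
class Lot:
    lot_id: str
    sku_id: str
    status: str
    expiry_date: date
    lot_remaining_quantity: int


def select_lots(
    lots: List[Lot], sku_id: str, target_quantity: int
) -> Tuple[Dict[str, int], int]:
    """Return ({lot_id: allocated_quantity}, quantity still uncovered)."""
    # Only released lots of the same SKU with material left are eligible
    eligible = [
        lot
        for lot in lots
        if lot.sku_id == sku_id
        and lot.status == "released"
        and lot.lot_remaining_quantity > 0
    ]
    # FEFO: earliest expiry first; ties broken by smallest remaining quantity
    eligible.sort(key=lambda lot: (lot.expiry_date, lot.lot_remaining_quantity))

    picked: Dict[str, int] = {}
    left = target_quantity
    for lot in eligible:
        if left <= 0:
            break
        take = min(lot.lot_remaining_quantity, left)
        picked[lot.lot_id] = take
        left -= take
    return picked, left
```

A non-zero second return value means the eligible lots cannot cover the target. Because the lot buffer depends on how many lots end up selected, a caller may need to recompute the target and select again.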
Lot Management
A lot is a distinct batch of an SKU created during production or received from a supplier, tracked separately for traceability, expiry, and quality status.
A new lot must be created whenever a production order enters the closed stage.
When lot_remaining_quantity = 0, the lot status must be changed to consumed.
Material allocation is only permitted from lots in released status.
Lots in expired status cannot be used for any new material allocations, even if quantity remains.
When the assistant creates a new lot, it must leave the expiry date empty. It will be filled in manually later by the QA agent.
When releasing a lot, the expiry date must be set. If the expiry date is not explicitly provided by the user, the default date is 2026-01-01.
After the expiry date is set, it cannot be changed anymore.
Lot Stages
"pending_release" – the production order was completed, but checks are still pending before the lot can be released, and there is no open CAPA.
"released" – the lot was released and is available for allocation (all checks have passed, there is still material left, and there is no open CAPA).
"on_hold" – there is an open CAPA associated with this lot. This status may only override the "released" status.
"consumed" – all quantity from this lot has already been allocated.
"expired" – the quantity of this lot was not allocated before it expired.
Buffer Management
During material allocation, in certain cases we allocate more to the order than it actually requests, to compensate for potential losses and to satisfy buffer rules. The final allocation quantity is calculated as: final_allocation_quantity = requested_quantity + lot_buffer + ddmrp_buffer, where requested_quantity is the amount we initially plan to allocate for a particular order.
Executives can bypass all buffer allocations and blockers if they want to, but unless specified otherwise in the conversation these apply by default.
Allocation must be blocked if the available-to-promise quantity after allocation would be negative.
When allocating material to a specific line of an order through multiple allocations, the lot buffer and DDMRP buffer must be calculated based on the total allocated quantity for that line, not per individual allocation. This ensures that splitting one large allocation into several smaller ones does not reduce the total buffer applied.
The buffer amounts should be consolidated as if it were a single allocation, and any resulting buffer difference must be applied to the latest allocation.
For example, allocating 100 units in a single allocation or splitting it into two allocations of 50 units each from separate lots must result in the same total buffer amount being added.
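One way to read the consolidation rule is sketched below: compute the buffer once for the cumulative requested quantity of the order line, subtract whatever buffer earlier allocations already carried, and put the difference on the latest allocation. The helper names are hypothetical, and the sketch assumes the caller has already determined the buffer rate that applies to the consolidated allocation.

```python
def ceil_percent(quantity: int, percent: int) -> int:
    """Integer ceiling of quantity * percent / 100, avoiding float rounding."""
    return -(-quantity * percent // 100)


def buffer_for_latest_allocation(
    total_requested_quantity: int,
    rate_percent: int,
    buffer_already_applied: int,
) -> int:
    """Buffer to attach to the latest allocation of an order line.

    total_requested_quantity: requested quantity summed over all allocations
        of the line, including the latest one.
    rate_percent: rate for the consolidated allocation (assumed given).
    buffer_already_applied: buffer carried by the earlier allocations.
    """
    consolidated = ceil_percent(total_requested_quantity, rate_percent)
    return consolidated - buffer_already_applied


# Illustrative numbers: two allocations of 50 units at a 5% rate
first = ceil_percent(50, 5)                        # 2.5 rounds up to 3
second = buffer_for_latest_allocation(100, 5, first)
```

Here the two allocations carry 3 and 2 units of buffer, for a total of 5, the same as a single 100-unit allocation at 5%. Rounding each 50-unit allocation separately would have given 6.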
Lot Buffer (Handling Loss Compensation)
If allocating from 1 lot: No buffer added
If allocating from 2 lots: Add 5% buffer (round up)
If allocating from 3 or more lots: Add 10% buffer (round up)
lot_buffer = requested_quantity × buffer_rate where buffer_rate ∈ {0%, 5%, 10%} depending on lot count.
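The lot buffer rule can be written as a small function. This is a sketch with a hypothetical function name; it uses integer arithmetic for the round-up so that percentages never pass through floating point.

```python
def lot_buffer(requested_quantity: int, lot_count: int) -> int:
    """Handling-loss buffer: 0% for 1 lot, 5% for 2 lots, 10% for 3 or more."""
    if lot_count < 1:
        raise ValueError("an allocation needs at least one lot")
    if lot_count == 1:
        rate_percent = 0
    elif lot_count == 2:
        rate_percent = 5
    else:
        rate_percent = 10
    # Ceiling of requested_quantity * rate_percent / 100 ("round up")
    return -(-requested_quantity * rate_percent // 100)
```

For instance, 101 units drawn from two lots give a raw buffer of 5.05, which rounds up to 6.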
DDMRP Buffer (Dynamic Zone Adjustment)
DDMRP buffers apply only to SKUs that have apply_ddmrp = true.
Each SKU defines three zone shares in the database: red_share, yellow_share, and green_share — each representing a fraction of the current in-stock quantity. These shares are dynamically converted into absolute thresholds before each allocation:
red_threshold = in_stock_quantity × red_share
yellow_threshold = in_stock_quantity × yellow_share
green_threshold = in_stock_quantity × green_share (Rounded up)
The system then evaluates the post-allocation stock position: stock_after_allocation = in_stock_quantity − hold_quantity − requested_quantity
Depending on where this value falls relative to the thresholds, buffer adjustments are applied as follows:
If stock_after_allocation > green_threshold → no buffer
If yellow_threshold < stock_after_allocation ≤ green_threshold → add 5% buffer (round up)
If red_threshold < stock_after_allocation ≤ yellow_threshold → add 15% buffer (round up)
If stock_after_allocation ≤ red_threshold → allocation is blocked
The DDMRP buffer is therefore: ddmrp_buffer = requested_quantity × ddmrp_rate, where ddmrp_rate ∈ {0%, 5%, 15%} based on the determined zone.
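The zone evaluation above can be sketched as follows. The function name and the `None` return for a blocked allocation are assumptions made for illustration. The policy text marks only the green threshold as rounded up, and the sketch follows that literally. Shares are passed as `Fraction` values to keep the threshold comparisons exact.

```python
from fractions import Fraction
from math import ceil
from typing import Optional


def ddmrp_buffer(
    in_stock_quantity: int,
    hold_quantity: int,
    requested_quantity: int,
    red_share: Fraction,
    yellow_share: Fraction,
    green_share: Fraction,
    apply_ddmrp: bool = True,
) -> Optional[int]:
    """Return the DDMRP buffer, or None when the allocation is blocked."""
    if not apply_ddmrp:
        return 0  # buffers apply only to SKUs with apply_ddmrp = true

    red_threshold = in_stock_quantity * red_share
    yellow_threshold = in_stock_quantity * yellow_share
    green_threshold = ceil(in_stock_quantity * green_share)  # rounded up

    stock_after_allocation = in_stock_quantity - hold_quantity - requested_quantity

    if stock_after_allocation > green_threshold:
        rate_percent = 0
    elif stock_after_allocation > yellow_threshold:
        rate_percent = 5
    elif stock_after_allocation > red_threshold:
        rate_percent = 15
    else:
        return None  # at or below the red threshold: allocation is blocked

    # Ceiling of requested_quantity * rate_percent / 100 ("round up")
    return -(-requested_quantity * rate_percent // 100)
```

With 1000 units in stock, nothing on hold, and hypothetical shares of 1/10, 3/10, and 1/2, requesting 750 units leaves 250, which falls in the yellow-to-red band and yields a 15% buffer of 113 units.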
Final ATP Check
<sub-section removed>
Tools schema
# Manufacturing Tools Reference
All tools operate over the in-memory manufacturing tables: `sku`, `user`, `lot`, `order`, `equipment`, `supplier`, `capa`, `allocation`, `inventory`. Return values are JSON-encoded strings unless stated otherwise. Errors are literal strings beginning with "Error:". Update tools mutate the in-memory snapshot used for grading; they do not enforce wiki policies and callers must apply them.
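Since results are JSON-encoded strings and errors are plain strings starting with "Error:", a caller might separate the two roughly as follows. `ToolError` and `parse_tool_result` are hypothetical names, and the sketch covers only tools that follow the default return convention.

```python
import json
from typing import Any


class ToolError(RuntimeError):
    """Raised when a tool returns an 'Error:' string instead of JSON."""


def parse_tool_result(raw: str) -> Any:
    # Error results are literal strings beginning with "Error:"
    if raw.startswith("Error:"):
        raise ToolError(raw)
    # Everything else is assumed to be a JSON-encoded payload
    return json.loads(raw)
```

Raising on the error prefix keeps a failed update from being mistaken for a record, which matters because the tools do not enforce policy themselves.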
## General read helpers
## Order tools
### modify_order
- Data sources: `order`
- Description: Patch-update order fields and lines (canonical JSON output).
- Inputs: `order_id` (req); optional `status`, `produced_sku_id`, `produced_quantity`, `supplier_id`;
`upsert_lines` { `sku_id`: { `requested_quantity`, `allocated_quantity` } }, `remove_lines` [`sku_id`]
- Returns: Canonical JSON of updated record
- Writes: Mutates existing order
## Lot tools
### modify_lot
- Data sources: `lot`
- Description: Patch-update lot fields.
- Inputs: `lot_id` (req); optional `sku_id`, `lot_release_quantity`, `lot_remaining_quantity`, `status`, `expiry_date`, `order_id`, `pre_release_checks_passed`
- Returns: JSON record
- Writes: Mutates existing lot
## Allocation tools
### create_allocation
- Data sources: `allocation`
- Description: Create an allocation with auto ID `ALLOC-00`. No business validation.
- Inputs (all required, integers): `order_id`, `sku_id`, `requested_quantity`, `lot_buffer`, `ddmrp_buffer`, `final_allocation_quantity`, `lots` { `lot_id`: { `allocated_quantity` (int ≥ 0) } } with at least one lot
- Returns: JSON object `{ allocation_id: record }`
- Writes: Inserts new allocation
## Inventory and SKU tools
### modify_inventory
- Data sources: `inventory`
- Description: Patch per-SKU inventory metrics.
- Inputs: `sku_id` (req); optional `in_stock_quantity`, `hold_quantity`, `reserved_quantity`, `available_to_promise_quantity` (numbers; NaN ignored)
- Returns: JSON record
- Writes: Mutates existing inventory entry
## Quality and CAPA tools
## Equipment and Supplier tools