Production data, ready now.
High-quality, expert-validated datasets across agentic AI, reasoning, robotics, and STEM — available immediately or spun up within 48 hours. Built to challenge frontier models and move fast.
Trusted by Leading AI Teams
Three ways to get started
01 — Ready to ship
Order and receive within 2 days.
Sample-ready datasets are fully produced and validated. Place an order, align on format and delivery, and receive your data within 48 hours. No ramp time, no pipeline setup — just data.
Samples ready
02 — Pre-order (pipeline-ready)
Pipeline live within 48 hours.
Pipeline-ready datasets can be spun up within 2 days of your order. At kickoff we align on volume, format, and timeline — then our production pipeline runs to your specifications at scale.
Pipeline ready
03 — Expand or customize
Tailor any off-the-shelf dataset to your needs.
Any off-the-shelf dataset can be expanded in volume, adapted to your domain, or restructured to match your evaluation harness. New dataset requests are prioritized in our pipeline — tell us what you need.
Custom order
Browse by category
RL & agentic
Tool-use agent RL gyms
High-fidelity RL environments for training and evaluating agentic reasoning in complex enterprise workflows across 11 industry domains.
1,100+ test cases
JSON
17–36 tools/domain
11 domains
Samples ready
View samples
Coding
Terminal bench extension
Terminal-based agentic task environments for real-world command-line and system-level workflows, fully compliant with TerminalBench 3.
4 domains
Deterministic
Harbor harness
<50% resolution target
Pipeline ready
View samples
Coding
SWE-bench extension: large codebases
SWE-bench-style tasks for evaluating coding agents on realistic software engineering work in large, real-world repositories — human-reviewed, no synthetic PRs.
Multi-file changes
Dockerized envs
F2P/P2P validation
Expert-curated
Pipeline ready
View samples
Robotics
Egocentric household video
First-person household manipulation videos with action annotations, designed for embodied AI and imitation learning in real home environments.
6,000+ hours
1080p / 30fps
MP4 + JSON metadata
HITL QA
Samples ready
View samples
STEM
University-level math reasoning
Text-only and multimodal university math problems with step-by-step solutions, spanning calculus, algebra, and multivariable analysis.
13,500+ text problems
600+ multimodal
JSON with reasoning steps
Expert-authored
Samples ready
View samples
STEM
SciCode hill-climbing dataset
Novel multi-step scientific computing tasks that extend the SciCode benchmark — purpose-built to lift model performance on this evaluation.
SciCode-compatible
Step-level scoring
Structured JSON
Contamination-free
Pipeline ready
View samples
STEM
Frontier STEM hill-climbing
PhD-authored tasks targeting frontier benchmarks — HLE, GPQA, AIME, and AMO-Bench — across math, physics, chemistry, biology, and CS.
7 domains
PhD-level authoring
Verification scripts
LaTeX + JSON
Pipeline ready
View samples
Reasoning
CharXiv CoT visual reasoning
Chart-based visual reasoning tasks with gold chain-of-thought trajectories, sourced from arXiv papers. Each answer is deterministic and unambiguous.
1,000 CoT samples
644 chart images
≥3 reasoning steps
Model-challenging
Pipeline ready
View samples
Reasoning
Multimodal reasoning dialogues
Multi-turn image-grounded reasoning conversations designed to develop contextual inference and visual analysis in multimodal models.
3,500+ dialogues
4 turns per sample
6 image categories
Expert-validated
Samples ready
View samples
See the data before you commit.
Every dataset comes with sample data available on request. Tell us which categories are relevant and we'll share samples, specs, and delivery timelines.