Off-the-shelf datasets

Production data,
ready now.

High-quality, expert-validated datasets across agentic AI, coding, reasoning, robotics, and STEM, available immediately or spun up within 48 hours. Built to challenge frontier models and delivered fast.

5 dataset categories

48h max time to delivery

100% expert-validated

Custom: expand any dataset

Trusted by Leading AI Teams

Three ways to get started

01 — Ready to ship

Order and receive within 2 days.

Sample-ready datasets are fully produced and validated. Place an order, align on format and delivery, and receive your data within 48 hours. No ramp time, no pipeline setup — just data.

Samples ready

02 — Pre-order (pipeline-ready)

Pipeline live within 48 hours.

Pipeline-ready datasets can be spun up within 2 days of your order. At kickoff we align on volume, format, and timeline — then our production pipeline runs to your specifications at scale.

Pipeline ready

03 — Expand or customize

Tailor any OTS dataset to your needs.

Any off-the-shelf dataset can be expanded in volume, adapted to your domain, or restructured to match your evaluation harness. New dataset requests are prioritized in our pipeline — tell us what you need.

Custom order

Browse by category

RL & agentic

Coding

Robotics

STEM

Reasoning

RL & agentic

Tool agent use RL gyms

High-fidelity RL environments for training and evaluating agentic reasoning in complex enterprise workflows across 11 industry domains.

1,100+ test cases

JSON

17–36 tools/domain

10+ domains

Samples ready

Reasoning

GDPval

Expert-authored evaluation tasks that benchmark frontier models on real professional deliverables — slide decks, financial models, legal memos — scored against human-created baselines.

Finance · Consulting · Legal

~39 criteria/sample

0.95 human baseline

+0.33 human–model gap

Samples ready
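
The figures above (about 39 criteria per sample, a 0.95 human baseline, a +0.33 human–model gap) can be read as rubric scores. The following is a minimal sketch of that arithmetic, assuming an unweighted rubric where a score is the fraction of criteria met; the actual scoring method is not specified here, and the `Criterion` type and the counts of 37 and 24 met criteria are illustrative assumptions chosen only to reproduce the published figures.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric line item (hypothetical structure, not the dataset schema)."""
    description: str
    met: bool


def rubric_score(criteria: list[Criterion]) -> float:
    """Fraction of criteria satisfied, assuming every criterion has equal weight."""
    if not criteria:
        raise ValueError("rubric must contain at least one criterion")
    return sum(c.met for c in criteria) / len(criteria)


def human_model_gap(human_score: float, model_score: float) -> float:
    """Positive values mean the human deliverable outscored the model."""
    return human_score - model_score


# Illustrative 39-criterion rubric: the human deliverable meets 37 criteria,
# the model deliverable meets 24. These counts are assumptions.
human = [Criterion(f"criterion {i}", met=i < 37) for i in range(39)]
model = [Criterion(f"criterion {i}", met=i < 24) for i in range(39)]

human_score = rubric_score(human)   # 37 / 39
model_score = rubric_score(model)   # 24 / 39
gap = human_model_gap(human_score, model_score)

print(f"human={human_score:.2f} model={model_score:.2f} gap={gap:+.2f}")
```

Under these assumptions a gap of +0.33 corresponds to the model missing roughly 13 more criteria per sample than the human baseline.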

Coding

Terminal bench extension

Terminal-based agentic task environments for real-world command-line and system-level workflows, fully compliant with TerminalBench 3.

4 domains

Deterministic

Harbor harness

<50% resolution target

Pipeline ready

Coding

SWE-bench extension: large codebases

SWE-bench-style tasks for evaluating coding agents on realistic software engineering work in large, real-world repositories — human-reviewed, no synthetic PRs.

Multi-file changes

Dockerized envs

F2P/P2P validation

Expert-curated

Pipeline ready
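
F2P/P2P validation refers to the SWE-bench convention of fail-to-pass and pass-to-pass tests. As a rough sketch of how such a check can work, assuming test outcomes are available as a mapping from test name to `"passed"` or `"failed"` (the function names, the result format, and the test names below are hypothetical and not part of this dataset's harness):

```python
def is_valid_task(fail_to_pass: list[str], pass_to_pass: list[str],
                  before: dict[str, str], after: dict[str, str]) -> bool:
    """A task is well-formed if the reference patch flips every F2P test
    from failing to passing and keeps every P2P test passing."""
    f2p_flips = all(
        before.get(t) == "failed" and after.get(t) == "passed"
        for t in fail_to_pass
    )
    p2p_stable = all(
        before.get(t) == "passed" and after.get(t) == "passed"
        for t in pass_to_pass
    )
    # Require at least one F2P test so the task actually measures a fix.
    return bool(fail_to_pass) and f2p_flips and p2p_stable


def is_resolved(fail_to_pass: list[str], pass_to_pass: list[str],
                results: dict[str, str]) -> bool:
    """An agent's patch resolves the task if all F2P and P2P tests pass.
    A test missing from the results counts as not passed."""
    return all(results.get(t) == "passed" for t in fail_to_pass + pass_to_pass)


# Hypothetical test names and outcomes for illustration only.
f2p = ["test_parser_handles_empty_input"]
p2p = ["test_parser_basic", "test_cli_help"]

before_patch = {"test_parser_handles_empty_input": "failed",
                "test_parser_basic": "passed", "test_cli_help": "passed"}
after_reference = {t: "passed" for t in f2p + p2p}
# The agent fixed the bug but broke an unrelated test.
after_agent = {"test_parser_handles_empty_input": "passed",
               "test_parser_basic": "failed", "test_cli_help": "passed"}

print(is_valid_task(f2p, p2p, before_patch, after_reference))  # True
print(is_resolved(f2p, p2p, after_agent))                      # False
```

The pass-to-pass set is what catches the second case: a patch that fixes the target bug but breaks existing behavior does not count as resolved.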

Robotics

Egocentric household video

First-person household manipulation videos with action annotation, designed for embodied AI and imitation learning in real home environments.

20k hours

1080p / 30fps

MP4 + JSON metadata

HITL QA

Samples ready

STEM

University-level math reasoning

Text-only and multimodal university math problems with step-by-step solutions, spanning calculus, algebra, and multivariable analysis.

13,500+ text problems

600+ multimodal

JSON w/ reasoning steps

Expert-authored

Samples ready

STEM

SciCode hill-climbing dataset

Novel multi-step scientific computing tasks that extend the SciCode benchmark — purpose-built to lift model performance on this evaluation.

SciCode-compatible

Step-level scoring

Structured JSON

Contamination-free

Pipeline ready

STEM

Frontier STEM hill-climbing

PhD-authored tasks targeting frontier benchmarks — HLE, GPQA, AIME, and AMO-Bench — across math, physics, chemistry, biology, and CS.

7 domains

PhD-level authoring

Verification scripts

LaTeX + JSON

Pipeline ready

Reasoning

CharXiv CoT visual reasoning

Chart-based visual reasoning tasks with gold chain-of-thought trajectories, sourced from arXiv papers. Each answer is deterministic and unambiguous.

1,000 CoT samples

644 chart images

≥3 reasoning steps

Model-challenging

Pipeline ready

Reasoning

Multimodal reasoning dialogues

Multi-turn image-grounded reasoning conversations designed to develop contextual inference and visual analysis in multimodal models.

3,500+ dialogues

4 turns per sample

6 image categories

Expert-validated

Samples ready

See the data before you commit.

Every dataset comes with sample data available on request. Tell us which categories are relevant and we'll share samples, specs, and delivery timelines.

Connect with our team
