
Measuring real-world performance in physical AI: Toloka's role in the PhAIL leaderboard
Deploying physical AI in production is not just an engineering challenge; it's a measurement problem. Most benchmarks can tell you whether a system completed a task. Very few can tell you whether it will still be completing that task reliably six hours into a shift. Without that signal, deployment decisions are educated guesses dressed up as engineering rigor.
The PhAIL (Physical AI Leaderboard), launched by Positronic Robotics, is built to close that gap. Starting with commercial picking tasks on a single hardware configuration, it evaluates models under real operational conditions – not in simulation or curated demos – and is designed to expand across tasks and embodiments as the consortium grows. Toloka participates as the data partner, providing the human verification layer that ensures every result reflects what a system genuinely did, not just what its logs reported.
The gap between demos and deployment
Physical AI has shown strong results in controlled settings. The harder question, and one the industry has largely deferred, is what happens when these systems run continuously, at commercial scale, on real hardware.
Most existing benchmarks don't answer this. They measure whether a system can complete a task once, under favorable conditions. In production environments, two things matter: throughput (how many units per hour a system can sustain) and reliability (how long it operates before requiring human intervention or failing outright). These are the metrics that determine whether a physical AI system is actually deployable, and they are precisely the metrics that lab benchmarks leave out.
While current models can complete individual tasks, they struggle significantly to maintain throughput and reliability over continuous operation, falling well short of human performance levels. PhAIL provides the first controlled quantification of a gap the industry has long recognized but never rigorously measured.
A benchmark built for how production actually works
PhAIL evaluates models the way operators actually run them: on real hardware, performing commercial tasks. The inaugural evaluation focuses on bin-to-bin order picking — a repetitive, high-volume task representative of early physical AI deployment in logistics and fulfillment.
Rather than reporting abstract success rates, PhAIL measures the metrics that determine whether automation makes economic sense on a real shop floor: Units Per Hour (UPH) and Mean Time Between Failures or Assists (MTBF/A). Every evaluation is run on a Franka Research 3 arm with a Robotiq 2F-85 gripper — a widely available, open-source hardware configuration that participating teams can reproduce.
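As a rough illustration of how the two headline metrics relate to a run's raw counts, here is a minimal sketch; the field names are hypothetical and are not PhAIL's actual logging schema:

```python
from dataclasses import dataclass

@dataclass
class RunSummary:
    """Hypothetical summary of one evaluation run (not PhAIL's real schema)."""
    units_completed: int       # successful picks placed to standard
    hours_operated: float      # wall-clock hours of continuous operation
    failures_or_assists: int   # combined count of failures and human assists

def units_per_hour(run: RunSummary) -> float:
    """UPH: sustained throughput over the full run, not a one-shot success rate."""
    return run.units_completed / run.hours_operated

def mtbf_a(run: RunSummary) -> float:
    """MTBF/A: mean hours of operation between failures or human assists."""
    return run.hours_operated / run.failures_or_assists

run = RunSummary(units_completed=180, hours_operated=6.0, failures_or_assists=4)
print(units_per_hour(run))  # 30.0 units per hour
print(mtbf_a(run))          # 1.5 hours between interventions
```

Note that both metrics are averages over the whole run, which is what makes them sensitive to degradation over continuous operation in a way that single-task success rates are not.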
The process is open end to end. Positronic publishes a fine-tuning dataset collected through teleoperated demonstrations, freely available for non-commercial use, alongside open-source scripts any team can use to prepare their model for evaluation. During evaluation, model checkpoints are rotated randomly so the operator doesn't know which model is running — blinding that prevents bias and makes scores trustworthy.
Every run is recorded and published: synchronized video, robot telemetry, scoring logs, and configuration metadata. Anyone can audit any result. In a field that moves as fast as physical AI, that transparency is the foundation of trust.
Why leaderboard credibility depends on human validation
Physical systems produce outcomes that telemetry alone cannot fully capture. Whether a pick was correct, a placement met the required standard, or an assist occurred — these determinations require human judgment applied consistently across every run.
Automated signals catch a great deal, but edge cases remain. A log entry can indicate an action was completed; it cannot always tell you whether it was completed correctly. Human verification closes that gap, ensuring that PhAIL's results are not just measurable, but auditable.
This is where Toloka's role becomes critical. Independent, scalable verification of outcomes — consistent classification of failures, edge cases, and borderline results across every evaluation run — is what turns a leaderboard from a snapshot into a trustworthy record of progress. But detection is only half the job. Every failure is categorized by type (perception error, grasp planning failure, recovery failure, object-specific edge case), creating a diagnostic layer that tells model developers not just that their system failed, but where in the pipeline it broke down. The result is a set of numbers that model developers, operators, and deployment partners can actually act on, not just as a ranking, but as a map of what to fix next.
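The diagnostic layer described above amounts to tallying human-verified failures by category. The four category names come from the post itself; everything else below is a hypothetical sketch:

```python
from collections import Counter
from enum import Enum

class FailureType(Enum):
    """Failure categories named in the PhAIL verification process."""
    PERCEPTION = "perception error"
    GRASP_PLANNING = "grasp planning failure"
    RECOVERY = "recovery failure"
    OBJECT_EDGE_CASE = "object-specific edge case"

def failure_breakdown(verified_failures: list[FailureType]) -> dict[str, int]:
    """Aggregate human-verified failures into per-category counts,
    so developers see where in the pipeline a model breaks down."""
    return dict(Counter(f.value for f in verified_failures))

failures = [FailureType.PERCEPTION, FailureType.GRASP_PLANNING, FailureType.PERCEPTION]
print(failure_breakdown(failures))
# {'perception error': 2, 'grasp planning failure': 1}
```

A breakdown like this is what turns a raw failure count into an actionable signal: a model dominated by perception errors needs different work than one dominated by recovery failures.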
What this means for physical AI builders
Evaluation tells you where your model stands. Getting it to stand somewhere better requires data — and data for physical AI requires a completely different approach than data for language models.
Toloka's physical AI capabilities cover the full pipeline:
Crowdsourced and onsite data collection: structured collection workflows that can scale beyond what any single lab can capture in-house
High-precision video and image annotation: multi-modal annotation with the synchronization standards that robotics training data actually requires
RoboBILT: Toloka's evaluation framework for physical AI. Where standard benchmarks assess whether a task completed, RoboBILT assesses whether the data behind it is trustworthy enough to train on — catching issues that outcome metrics miss entirely, like a grasp that succeeded but applied unsafe force levels, or calibration drift accumulating silently across a full collection rig.
A benchmark's credibility depends on who validates the results.
PhAIL is structured as a governed consortium to reflect exactly that principle. Positronic develops the methodology, operates the rigs, and publishes the fine-tuning data, the technical foundation the benchmark is built on. Nebius provides the compute infrastructure. Toloka provides independent verification of outcomes across every run, a validation layer that sits outside the evaluation pipeline itself. No single member controls the full picture.
The goal is a standard that model developers, operators, and partners can rely on because the results hold up to outside scrutiny — not just because the leaderboard says so.
The PhAIL leaderboard, along with its protocol, dataset, and submission process, is publicly available at phail.ai. If you're building or deploying physical AI systems and want evaluation results you can act on, the consortium is open.