Frontier models can win at IMO, but they still can't check their own assumptions.

Benchmark scores on STEM evaluations keep climbing. But on the evaluations that matter most (multi-step reasoning, original research tasks, and benchmarks like HLE, AMO-Bench, and SciCode that remain genuinely unsaturated), reliability isn't keeping pace. The cause is hard to fix because it is hard to detect: it's not direct contamination, it's soft contamination.

A model that has seen a boundary value problem in electrostatics during pretraining encounters a new one at evaluation time. It hasn't memorized the answer. It interpolates — draws on the structural pattern of the original derivation, adapts it plausibly, and produces an output that looks correct. It may even score correct on an automated grader. But it may be adapting a known solution template rather than deriving the result from scratch — and on problems that require genuine novelty, that distinction is what matters.

Direct contamination is detectable: there are established methods to test for it, whether or not you own the training data. Soft contamination isn't. There's no clean signal. And in STEM, where problem families are structurally similar by design (boundary value problems, perturbation expansions, partition functions), the surface area for it is enormous.
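To make the asymmetry concrete, here is a minimal sketch of one common style of direct-contamination check: word-level n-gram overlap between a benchmark item and a corpus document. It is an illustration, not the method used in our pipeline; the function names, the choice of 5-grams, and both example problems are assumptions made for the sketch.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(candidate: str, corpus_doc: str, n: int = 5) -> float:
    """Fraction of the candidate's n-grams that also occur in the corpus document."""
    cand = word_ngrams(candidate, n)
    if not cand:
        return 0.0
    return len(cand & word_ngrams(corpus_doc, n)) / len(cand)


# Hypothetical corpus document and two hypothetical benchmark items.
corpus = ("find the potential inside a grounded conducting sphere of radius a "
          "with a point charge q placed at distance d from the center")
verbatim = corpus  # direct contamination: the item itself is in the corpus
structural_twin = ("compute the potential outside a grounded conducting cylinder "
                   "of radius b with a line charge at distance s from the axis")

print("verbatim copy:  ", overlap_fraction(verbatim, corpus))
print("structural twin:", overlap_fraction(structural_twin, corpus))
```

The verbatim copy overlaps completely, while the structural twin shares no 5-gram with the corpus document even though it is also a grounded-conductor boundary value problem from the same family. A surface-overlap check has nothing to flag in the second case.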

This is the lens through which we think about evaluation integrity at Toloka, and it shapes how we build STEM datasets. We've spent the past year building evaluation and training data targeting frontier STEM benchmarks (HLE, GPQA, AIME, AMO-Bench, SciCode), working with PhD-level domain experts across mathematics, physics, chemistry, and engineering. What follows is what we found: the failure modes that kept recurring, what they imply about training data, and why most existing STEM data doesn't address them.

SciCode as a concrete case

Before the failure modes, it's worth grounding this in a specific benchmark where the contamination problem is measurable.

SciCode tests a different side of STEM than most frontier benchmarks: not just answering a science question, but applying knowledge to implement a working solution. The benchmark asks whether a model can take a multi-step scientific computing problem, decompose it correctly, implement each step in Python, and produce numerically correct outputs. It requires chaining algorithmic steps: knowing the right formulas isn't enough if you can't implement them correctly across a multi-step procedure and produce verifiable numerical outputs.

We extended SciCode+ with a curated set of novel, expert-authored tasks precisely because the existing benchmark was approaching the limits of its evaluation integrity. Tasks authored from published papers carry contamination risk. Even when the underlying code is internal and never open-sourced, the paper itself describes the methodology — the algorithm, the steps, the expected outputs. That description is almost certainly in the pretraining corpus. A model doesn't need to have seen the code to pattern-match to the solution structure. Our extension includes a synthetic subset where tasks are freshly designed rather than extracted from literature, structurally disjoint from the public benchmark, and validated against ground-truth outputs through automated test cases with assertions.

The practical finding from building it: even when subproblems are explicitly provided, models tend to collapse them, implementing a single monolithic solution rather than respecting the step structure. The intermediate function headers get ignored, the verifiable outputs at each stage disappear, and the result becomes impossible to audit. Ground-truth implementations need to be step-decomposed, with individual function headers and verifiable intermediate outputs, to train models out of this pattern. Numerical precision is a compounding problem: wrong dtypes, incorrect array shapes, and unstable implementations are endemic and require explicit ground-truth output specifications to address.
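As a rough illustration of what step-decomposed means here, the sketch below splits a small task (the lowest eigenvalue of -d²/dx² on [0, 1] with Dirichlet boundaries) into three functions, each with its own header and its own shape, dtype, and value checks. The task, the function names, and the tolerances are assumptions chosen for the example; this is not a SciCode item and not our ground-truth format.

```python
import numpy as np


def make_grid(n_interior: int, length: float) -> np.ndarray:
    """Step 1: interior grid points of [0, length], excluding both boundaries."""
    return np.linspace(0.0, length, n_interior + 2, dtype=np.float64)[1:-1]


def build_neg_laplacian(n_interior: int, h: float) -> np.ndarray:
    """Step 2: second-order finite-difference matrix for -d^2/dx^2 (Dirichlet)."""
    main = 2.0 * np.eye(n_interior, dtype=np.float64)
    off = (np.eye(n_interior, k=1, dtype=np.float64)
           + np.eye(n_interior, k=-1, dtype=np.float64))
    return (main - off) / h**2


def lowest_eigenvalue(matrix: np.ndarray) -> float:
    """Step 3: smallest eigenvalue of a real symmetric matrix."""
    return float(np.linalg.eigvalsh(matrix)[0])


n, length = 99, 1.0
h = length / (n + 1)

# Each stage is checked on its own, so a failure points at one step.
x = make_grid(n, length)
assert x.shape == (n,) and x.dtype == np.float64

A = build_neg_laplacian(n, h)
assert A.shape == (n, n) and A.dtype == np.float64
assert np.allclose(A, A.T)  # symmetry is a cheap intermediate sanity check

lam = lowest_eigenvalue(A)
assert abs(lam - np.pi**2) < 1e-2  # continuum value for the unit interval is pi^2
print(lam)
```

A monolithic solution could return the same final number, but it would leave no grid or matrix to inspect when the number is wrong.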

This is the baseline. Now the five failure modes that apply across both scientific computing and frontier reasoning benchmarks.

Five failure modes — and what they share

What links all five is a common root: training data that rewards surface pattern-matching over first-principles reasoning. A model trained on compressed solutions, underspecified problems, and single-modality outputs learns to produce outputs that look like correct reasoning. Getting it to actually reason correctly requires different data.

1. Models can't write a well-posed problem

The most common rejection reason in our pipeline — and the one that surprised us most — was not a wrong derivation. It was an underspecified problem.

A well-posed graduate-level physics problem is often a page long. It specifies boundary conditions explicitly. It defines the coordinate system. It states which theory applies and what isolation assumptions are in force. Without all of this, the problem admits multiple valid answers — it's not that there's a wrong solution, it's that there's no uniquely correct one.

What we consistently saw from model-generated content: thin problem statements, missing constraints, undefined terms, implicit assumptions the solver is expected to infer. Expert reviewers flagged this as the root cause of the majority of rejections in our pipeline. The fix is training data composed of complete, fully-constrained problem formulations from graduate qualifying exams and textbooks, each paired with explicit justification for why every specification element is necessary for a unique solution.
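One way to picture the completeness requirement is as a checklist over the specification elements listed above. The sketch below is purely illustrative: the `ProblemSpec` class, its field names, and the `missing_elements` helper are hypothetical, and an empty-field check is no substitute for expert review of whether the constraints actually pin down a unique answer.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ProblemSpec:
    """Hypothetical container for the elements a well-posed problem states explicitly."""
    statement: str
    boundary_conditions: str = ""
    coordinate_system: str = ""
    governing_theory: str = ""
    isolation_assumptions: str = ""


def missing_elements(spec: ProblemSpec) -> List[str]:
    """Names of specification elements that were left empty."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


thin = ProblemSpec(statement="Find the field of a charged sphere.")
print(missing_elements(thin))
```

For the thin statement above, every element other than the statement itself is reported as missing, which is the pattern reviewers kept flagging.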

2. Solution chains are too compressed

A complete solution to a hard graduate problem should be 4 to 5 pages with every step explicitly motivated. What models currently produce is closer to 5 to 8 short paragraphs.

The specific failure: skipping intermediate steps, compressing multiple logical moves into a single line, and omitting justification for approximations. The approximation problem matters most. A model that writes "using the small-angle approximation" without stating when that approximation is valid or what error it introduces has learned to reproduce a solution shape, not to reason. The compounding effect is particularly acute with synthetic data. Models trained on auto-generated solutions, which tend to be compressed by default, learn to skip steps. And research on model collapse suggests that repeatedly training on synthetic outputs accelerates this degradation over generations: the reasoning gets shallower each time. Verbose, expert-written derivations are partly a corrective to what synthetic pipelines systematically remove.

The training data you want: verbose, expert-written derivations where every approximation is justified, every intermediate step is shown, and alternative solution paths are included where relevant. 
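To show what justifying an approximation can look like, here is a small sketch for the small-angle example above. It compares the relative error of sin θ ≈ θ with the leading-order estimate θ²/6, which follows from the Taylor series sin θ = θ - θ³/6 + ... The function name and the sample angles are choices made for the illustration.

```python
import math


def small_angle_relative_error(theta: float) -> float:
    """Relative error of approximating sin(theta) by theta (theta in radians, nonzero)."""
    return abs(theta - math.sin(theta)) / abs(math.sin(theta))


for degrees in (1, 5, 10, 20, 30):
    theta = math.radians(degrees)
    exact = small_angle_relative_error(theta)
    leading = theta**2 / 6  # leading-order error estimate from the Taylor series
    print(f"{degrees:>2} deg  relative error {exact:.5f}  theta^2/6 {leading:.5f}")
```

The error is roughly 0.5% at 10 degrees and roughly 4.7% at 30 degrees, so a derivation that invokes the approximation should say which regime it is in and whether that error is acceptable for the quantity being computed.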

3. Failure rates are not uniform — and the pattern is actionable

Failure rates across domains are not evenly distributed. Our expert annotation pipeline surfaces this directly: some subdomains generate significantly higher correction rates and rejection frequencies than others. Uniform data collection across STEM is therefore inefficient. The subdomains worth concentrating on:

Physics: Electromagnetism is the highest-failure area — boundary value problems, eddy currents, radiation, multipole expansions. Thermodynamics and statistical mechanics follow, clustering around phase transitions, partition functions, and critical phenomena. Quantum mechanics failures concentrate in scattering theory, perturbation methods, and many-body systems.

Mathematics: Functional analysis (operator theory, spectral methods, infinite-dimensional systems) and algebraic topology and geometry (homology computations, fiber bundles, characteristic classes) show the highest correction rates.

The directional finding is clear enough to act on: a disproportionate investment in these subdomains will move frontier benchmark performance more efficiently than evenly-distributed coverage.

4. Self-correction is nearly absent — and this is the deepest gap

Modern reasoning models can self-correct — the visible backtracking in o3 and R1-class models is real. But self-correction tends to happen at the exploration stage, before a reasoning path is committed. Once a model has established a trajectory, it becomes increasingly unlikely to revise an intermediate result even when it's wrong. The error propagates forward, each subsequent step building on the flawed assumption, until the final answer is wrong in a way that looks internally consistent. 

To illustrate: a model solving a thermodynamics problem applies the ideal gas law without checking whether the conditions warrant it. The derivation is clean, the answer has the right units, the format is correct. It's wrong because the assumption was never verified. Nothing in the output signals a problem — a format-checking grader wouldn't flag it, and the reasoning chain looks coherent. The error is invisible without a model that audits its own assumptions before proceeding.
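A minimal sketch of what checking the assumption could look like: compare the ideal-gas pressure with a van der Waals estimate and flag the ideal gas law when the two disagree by more than a tolerance. The van der Waals equation is itself only a model, and the CO2 constants, the 1% tolerance, and the function names are assumptions for the illustration.

```python
R = 8.314462618  # molar gas constant, J/(mol K)

# Commonly tabulated van der Waals constants for CO2 (assumed values, SI units).
A_CO2 = 0.3640    # Pa m^6 / mol^2
B_CO2 = 4.267e-5  # m^3 / mol


def ideal_pressure(n: float, volume: float, temperature: float) -> float:
    """Ideal gas law: p = nRT / V."""
    return n * R * temperature / volume


def vdw_pressure(n: float, volume: float, temperature: float,
                 a: float, b: float) -> float:
    """Van der Waals equation: p = nRT / (V - nb) - a n^2 / V^2."""
    return n * R * temperature / (volume - n * b) - a * n**2 / volume**2


def ideal_gas_ok(n: float, volume: float, temperature: float,
                 a: float, b: float, tol: float = 0.01) -> bool:
    """True if the ideal and van der Waals pressures agree within tol (relative)."""
    p_ideal = ideal_pressure(n, volume, temperature)
    p_vdw = vdw_pressure(n, volume, temperature, a, b)
    return abs(p_ideal - p_vdw) / abs(p_vdw) <= tol


# 1 mol of CO2 at 350 K: dilute (24.8 L) versus compressed (0.5 L).
print(ideal_gas_ok(1.0, 0.0248, 350.0, A_CO2, B_CO2))
print(ideal_gas_ok(1.0, 0.0005, 350.0, A_CO2, B_CO2))
```

In the dilute case the two pressures agree to well under 1%; in the compressed case they differ by well over 10%. A derivation that used the ideal gas law in the second case would still have the right units and a clean format.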

What fixes this is training data that explicitly models the error-detection process. Our annotation pipeline produces exactly this structure as a natural byproduct: when an expert reviewer identifies an error in generated content, we capture the original generated content, the per-step correct/incorrect evaluation, the expert's written explanation of what went wrong and why, and the corrected version. This find-and-fix pairing is rare in existing training data and directly teaches models to audit their own intermediate results — checking whether key assumptions hold, verifying limiting cases, and recognising when an output is structurally plausible but mathematically wrong.
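The four captured elements map naturally onto a small record type. The sketch below is one hypothetical way to lay it out; the class names, field names, and the example content are assumptions for illustration and do not describe our actual schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class StepEvaluation:
    """Expert verdict on a single step of the generated solution."""
    step_index: int
    is_correct: bool


@dataclass
class ErrorCorrectionRecord:
    """One find-and-fix pair: what was generated, where it broke, why, and the fix."""
    original_content: str
    step_evaluations: List[StepEvaluation]
    expert_explanation: str
    corrected_content: str

    def first_incorrect_step(self) -> Optional[int]:
        """Index of the earliest step marked incorrect, or None if all steps pass."""
        for step in self.step_evaluations:
            if not step.is_correct:
                return step.step_index
        return None


# Hypothetical record, echoing the ideal gas example above.
record = ErrorCorrectionRecord(
    original_content="Step 0: state givens. Step 1: apply pV = nRT. Step 2: solve for p.",
    step_evaluations=[
        StepEvaluation(0, True),
        StepEvaluation(1, False),
        StepEvaluation(2, False),
    ],
    expert_explanation="The ideal gas law was applied without checking the density regime.",
    corrected_content="Step 1 (revised): check the regime, then use a real-gas equation of state.",
)
print(record.first_incorrect_step())
```

Keeping the per-step verdicts separate from the explanation makes it possible to locate where a chain first went wrong, not just that the final answer was wrong.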

5. Mathematics and code are treated as separate modalities

Most models can write mathematics. Most can write code. Very few treat them as complementary verification tools. Every task we deliver in the Frontier STEM datasets ships with a self-contained Python verification script that independently confirms the mathematical answer. The verification code uses a different method than the analytic solution — symbolic computation, numerical simulation, dimensional checks — to provide genuine cross-verification rather than circular confirmation. A symbolic computation that re-derives the same result via a different algebraic path, or a numerical simulation that checks the analytic answer against a Monte Carlo estimate, provides a verification signal that's structurally independent from the original derivation.

Training on these code-math pairs develops a model that treats computational verification as a natural part of mathematical reasoning. The capability you're developing is bidirectional: from mathematics to code (can the model implement a verification?), and from code back to mathematics (can it interpret a numerical result against an analytic expectation?). For reliable STEM problem-solving, this bidirectional capability is essential.
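As a small sketch of the idea, the script below checks a closed-form result against a numerical method that shares none of its derivation. The example integral, the midpoint rule, the truncation of the upper limit at 10, and the tolerance are all choices made for the illustration; it is not taken from a delivered verification script.

```python
import math
from typing import Callable


def midpoint_integral(f: Callable[[float], float], a: float, b: float,
                      n: int = 100_000) -> float:
    """Composite midpoint rule for the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def cross_check(analytic_value: float, numeric_value: float,
                rel_tol: float = 1e-6) -> bool:
    """True if the analytic and numerical values agree within rel_tol."""
    return math.isclose(analytic_value, numeric_value, rel_tol=rel_tol)


def integrand(x: float) -> float:
    return x**2 * math.exp(-x**2)


# Analytic claim: the integral of x^2 exp(-x^2) over [0, inf) equals sqrt(pi) / 4.
analytic = math.sqrt(math.pi) / 4

# Independent check: quadrature on [0, 10]; the tail beyond 10 is negligible.
numeric = midpoint_integral(integrand, 0.0, 10.0)

print(cross_check(analytic, numeric))                 # correct closed form
print(cross_check(math.sqrt(math.pi) / 2, numeric))   # a plausible-looking wrong one
```

The second call uses a value that is off by a factor of two, the kind of slip that leaves a derivation looking coherent. Because the quadrature never touches the Gaussian-integral identities used analytically, it fails the wrong value instead of confirming it.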

What the data needs to look like

Pulling it together as a specification:

Novelty and contamination control: Tasks freshly authored and structurally disjoint from existing benchmarks. For scientific computing tasks, a synthetic subset designed independently from published literature — the paper describing an algorithm is likely in the pretraining corpus even if the code never was.

Problem formulations: Complete, fully-constrained problems from graduate qualifying exams and textbooks. Not paraphrased. Not simplified. Every constraint is explicit, with justification for why it's necessary.

Reasoning chains: Verbose, step-by-step expert derivations. Target length is pages, not paragraphs. Every approximation is justified. Every intermediate step shown. Alternative solution paths included where they exist.

Domain coverage: Disproportionate investment in high-failure subdomains — EM, thermodynamics and statistical mechanics, quantum mechanics, functional analysis, algebraic topology — rather than uniform STEM distribution.

Error-correction pairs: Original generated content, per-step evaluation, expert error explanation, corrected version. Captured as a natural byproduct of the expert annotation workflow.

Cross-verification pairs: Mathematical derivation accompanied by Python verification code using an independent method. The code and the mathematics should not share derivation logic.

Scientific computing tasks: Step-decomposed implementations with individual function headers, explicit intermediate outputs, ground-truth type and shape information, and a synthetic subset structurally disjoint from the public SciCode benchmark.

The expert requirement

The common thread across all of this: the failure modes are happening at the level of scientific correctness, not surface presentation. Automated pipelines can scaffold the workflow — boilerplate generation, test scaffolding, formatting — but the content itself requires PhD-level domain expertise. Problem specification, solution correctness, approximation validity, error identification: these judgments cannot be delegated to annotators without domain depth.

This is what makes high-quality frontier STEM data expensive to produce and why it's worth being specific about what the pipeline needs to look like. Every task in our Frontier STEM dataset is authored end-to-end by a PhD-level expert who writes the problem, constructs the full derivation, derives the ground-truth answer, and writes the verification code. The expert-led approach is not a quality assurance layer on top of automated generation — it's the method.

If you're building toward frontier STEM capabilities and want to see what this data looks like in practice — the task structure, the verification format, the domain and difficulty distribution — we're sharing sample packages for both the Frontier STEM and SciCode datasets. Connect with our team to access sample data.

