Feb 12, 2025

Insights
R1 is not on par with o1, and the difference is qualitative, not quantitative

What sets a reasoning system apart, how to test for it, and how to achieve it

New benchmarking results compare the latest models

A few weeks ago, DeepSeek made waves with their latest R1 model, pushing ChatGPT out of the #1 spot in the Apple App Store and prompting tech stock selloffs that sent Nvidia share prices plummeting 17%.

The claim is that DeepSeek R1 — an open-weights [1] model produced by a small Chinese tech lab — rivals OpenAI’s o1 model in performance, but at a fraction of the cost for training and inference. The o1 model represents a shift towards adaptive reasoning systems that rely on test-time scaling, allocating more computational resources as needed during inference. Reasoning models have more powerful problem-solving capabilities than traditional LLMs, which have less capacity for adaptive allocation of compute. To access these capabilities in their latest o1 model, OpenAI charges six times more than for GPT-4o. DeepSeek offers the same type of reasoning system, but with openly available model weights.

Setting aside the geopolitical aspect, open vs closed source questions, and implications of an imminent decline in training and inference costs, let's focus on DeepSeek's biggest splash — the performance parity.

The Toloka Research team has been testing R1 and o1 and analyzing existing evaluations, and our findings bring R1’s dominance into question.

Read on to learn:

  1. How R1 really compares to o1

  2. Where we’ve exposed the most significant gaps

  3. What makes o1 stand apart from all the current reasoner models

  4. How to approach closing the gap


Public comparisons show R1 on par with o1

AI enthusiasts and the media think that R1 is on par with o1, based on public benchmarks that show remarkably similar performance. Let's look at available comparisons of R1 and o1.

DeepSeek report

This comparison table from DeepSeek’s own report shows R1 and o1 neck and neck on major benchmarks. The majority of articles comparing R1 to o1 are based on this table, or tiny-scale manual vibe checks using a handful of prompts.

Popular online leaderboards

Additional “go-to” evaluation sources are the full LiveBench leaderboard and Chatbot Arena [2].

Style comparisons

There are also numerous anecdotal reports on R1’s superior writing style, generally described as “more creative and fun, if a bit unhinged,” which is in line with some independent evaluations.

General sentiment

The widespread narrative based on those evaluations is that the models are more or less equal. R1 is slightly better at math and coding and has a more free and creative writing style, while o1 is somewhat better at factuality, question answering, and instruction following, with a writing style that focuses on careful structure, grammar, and logic.

Most AI experts vaguely attribute these differences to training focus: R1's training was "less restricted", with an emphasis on math and coding domains that lend themselves well to reinforcement learning, while o1 was presumably "more guided", with more attention to world knowledge and alignment. OpenAI invested heavily in o1's alignment as a public-facing product via ChatGPT, ensuring the model's safety and adaptation to consumer preferences across general domains. It all seems to add up.

But we think there’s more to the story.


Long-tail benchmarks reveal gaps

As soon as we leave the beaten path, alternative benchmarks paint a different picture. The Toloka Research team investigated evaluations in niche subdomains and uncommon domains and noted quantitative and qualitative gaps in model performance.

Niche subdomains

A few months ago, Toloka released U-MATH — a benchmark to test LLMs on university-level mathematics, sourced from unpublished problems used in the curriculum at top US universities. We developed the benchmark precisely to represent an uncommon, under-explored niche within the math domain — real-world university math. According to our U-MATH evaluation, R1 is not any better than o1 at mathematical reasoning — in fact, it’s on par with o1-mini.
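For readers who want to reproduce this kind of comparison, the U-MATH problems are publicly available on Hugging Face. The sketch below shows one way to pull the dataset locally; the repository ID and split layout are assumptions, so check Toloka's Hugging Face organization page for the exact names before running.

```python
# Minimal sketch: inspect U-MATH locally with the `datasets` library.
# NOTE: the repository ID below is an assumption -- verify the exact
# dataset name and split names on Toloka's Hugging Face page.
from datasets import load_dataset

umath = load_dataset("toloka/u-math")       # hypothetical repo ID
print(umath)                                # available splits and their sizes

first_split = next(iter(umath.values()))
print(first_split[0])                       # one university-level problem record
```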

We continued to study niche subdomains by examining coding, the other domain where R1 is considered to have an edge over o1. In terms of coding benchmarks, Aider’s Polyglot is a good niche example, focusing on code editing tasks in a large number of different programming languages. Aider's evaluations contradict the story of R1 beating o1 in coding, similar to our conclusions on U-MATH.

Unusual domains

There are some interesting benchmarks assessing altogether less conventionally tested capabilities, and their findings on o1 vs R1 are in line with ours as well [3]:

  • μ-MATH is our other benchmark on university mathematics, based on U-MATH but intended for meta-evaluations — i.e., assessing the performance of LLMs as automatic solution evaluators instead of problem solvers. Judging solutions by testing them against a golden label proves to be a challenging skill in its own right [4], and it is a somewhat atypical skill to evaluate and train for. Here, similarly, we observe R1’s score closer to that of o1-mini, not o1.

  • ARC tests AI systems on their ability to adapt to novel situations, showing R1 lagging well behind o1.

  • LLM Chess Leaderboard again places R1 behind o1 and closer to o1-mini. The evaluation also reports that R1 makes an average of 52.93 illegal moves per 1,000 attempts, compared to o1 and o1-mini with 9.29 and 4.29 illegal moves, respectively.

  • A Northeastern University Programming Research Lab study tested the models on the NPR Sunday Puzzle Challenge, a set of difficult wordplay problems based on common knowledge. In addition to the large gap between R1 and o1, the researchers also reported several "failure modes" that occur with R1 and never with o1 or o1-mini, such as R1 explicitly stating "I give up" 23.9% of the time or failing to stop generation 8.5% of the time.

Superior generalization and reliability place o1 in a league of its own

The long-tail benchmarks above are pertinent because they are unconventional, testing for novelty and robustness. So here’s our claim: o1 has greater generalization and reliability than R1. In terms of reliability, we can see that both o1 and o1-mini are superior to R1.

As we continued investigating, we found that other publicly available evaluations support our conjecture.

From the standpoint of lesser generalization, we expect R1 not only to perform worse on niche subdomains or novel domains, but also to display degraded performance on new, unseen tasks from familiar suites. Our expectation is confirmed by MathArena’s recent report on AIME 2025 results, highlighting performance on problems published after the release of o1 and R1. On the unseen tasks, performance drops significantly for R1 but not for o1.

Reliability, in turn, plays a vital role in adversarial robustness and consistency, both lacking in R1 compared to o1 and o1-mini.

  • Cisco reports a 100% success rate in its attempts to jailbreak R1, compared to 26% for o1-preview, and there are other similar reports on R1’s safety limitations [5].

Both proper generalization and reliability are extremely important for AI systems, often cited as the primary bottlenecks of agentic applications. An extension of our claim is that superior generalization and reliability put o1 in a league of its own among all the currently available reasoning models. While we agree with the popular narrative that R1 has a different focus from o1, the o1 emphasis on alignment and broad domain coverage results in a qualitative difference that goes far beyond a mere difference in emphasis.

Later, we'll broaden the discussion of generalization and reliability in models and ways to enhance these key properties in a reasoning system, so stay tuned.


Beyond o1 vs R1: studying trends and biases

Bringing in more models and metrics can help us get a bigger-picture perspective on the performance patterns, instead of only focusing on o1-to-R1 comparisons from standalone benchmarks. Let's turn again to our own suite of U-MATH and μ-MATH datasets.

Recognizing patterns

You can study the U-MATH and μ-MATH leaderboard on Hugging Face. Here are the aggregate scores, ranked by μ-MATH F1.

The graph reveals three major performance trends:

  1. Judgment vs. problem-solving: In non-reasoning models, judgment and problem-solving performance improve together — up to a point. Beyond that, they diverge, with better problem-solving correlating with weaker judgment. This puts some numbers behind our statement that judgment is a distinct skill and suggests an inherent tradeoff between the two.

  2. Reasoners extend the Pareto frontier by breaking out of this tradeoff and advancing beyond the previous generation of top models [6].

  3. o1 stands apart by pushing a step beyond other reasoning models into its own category.

Investigating the tradeoffs

We won’t go into much detail here [7], but what we find is that the problem-solving vs judgment tradeoff mentioned above translates into behavioral differences among the models, yielding two distinctive judgment styles:

  • Lenient judges: More verbose, inclined toward lengthy derivations and better at comparing complex expressions, but prone to losing track — leading to fewer false negatives but more false positives.

  • Conservative judges: More structured and precise, but often anchored on the exact form of the reference answer or golden label — leading to fewer false positives but more false negatives.

Good judgment depends on balancing mathematical problem-solving with general structured logic and instruction-following, and an ideal model would have strong domain-specific reasoning skills while maintaining high reliability and coherence in order to successfully apply them.

To understand this balance better, we can decompose the F1 score into True Positive Rate (TPR), which measures correct positive classifications, and True Negative Rate (TNR), which tracks correct rejections of incorrect solutions. Higher TPR means fewer false negatives, while higher TNR means fewer false positives.
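To make the decomposition concrete, here is a minimal sketch of how TPR, TNR, and a macro-averaged F1 can be computed from a judge's verdicts against golden correctness labels. The helper is illustrative only; the exact aggregation used on the μ-MATH leaderboard may differ.

```python
import numpy as np

def judge_metrics(golden: np.ndarray, predicted: np.ndarray) -> dict:
    """Sketch of judge-quality metrics.

    golden    -- 1 if the evaluated solution is actually correct, else 0
    predicted -- 1 if the judge accepted the solution, else 0
    """
    tp = np.sum((golden == 1) & (predicted == 1))
    fn = np.sum((golden == 1) & (predicted == 0))
    tn = np.sum((golden == 0) & (predicted == 0))
    fp = np.sum((golden == 0) & (predicted == 1))

    tpr = tp / (tp + fn)   # recall on correct solutions: fewer false negatives
    tnr = tn / (tn + fp)   # recall on incorrect solutions: fewer false positives

    # Per-class F1 scores, then a macro average over the two classes.
    precision_pos = tp / (tp + fp)
    f1_pos = 2 * precision_pos * tpr / (precision_pos + tpr)
    precision_neg = tn / (tn + fn)
    f1_neg = 2 * precision_neg * tnr / (precision_neg + tnr)

    return {"TPR": tpr, "TNR": tnr, "macro_F1": (f1_pos + f1_neg) / 2}

# Lenient judges trade TNR for TPR; conservative judges do the opposite.
golden    = np.array([1, 1, 1, 0, 0, 0])
predicted = np.array([1, 1, 0, 0, 0, 1])
print(judge_metrics(golden, predicted))
```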

We can observe all the same trends with this chart: non-reasoning systems exhibiting the performance tradeoff [8], reasoning systems pushing the frontier away [9], and o1 advancing a step further still. What we can also see is that on top of being the best-performing model, o1 is among the most balanced ones [10].

Studying the previous generation of models informs us on how to approach this balancing with the next:

  1. Balanced training makes for a more balanced model. We can see in action how transitioning from math specialists to generalist models leads to better, more well-rounded judgment performance. A balanced training setup integrates the depth of formal reasoning with breadth of domain coverage and coherence.

  2. Reducing capability amplifies training-induced biases. Lenient models tend to become even more lenient when scaled down to smaller sizes, and conservative models become more conservative. A model needs to have appropriate capabilities that allow it to generalize over the things that you’re balancing.

The key point is: Powerful reasoning systems have the appropriate capabilities and already excel at formal reasoning. Training focused on diversifying domains and improving coherence will help guide them toward generalization and reliability.


Building more generalizable and reliable models: It’s all in the data

We’ve discussed at length why a reasoning system requires generalization and reliability, and we’ve shown o1’s excellence on both dimensions along with the superior performance that results.

We have three recommendations to help improve these qualities in reasoning systems, all requiring high-quality data. A strong data partner like Toloka can help navigate the details of data production.

1) Diverse domain coverage

Increasing versatility of autoverified data: Math and coding tasks permitting simple autoverification are the bread and butter of current reasoning systems [11], but there are ways to diversify.

  • Even math and coding have untapped niches. The U-MATH benchmark is derived from our larger training dataset, offering a good example of a math subdomain that’s underrepresented compared to the more readily-available textbook, high-school or Olympiad-style data.

  • Plenty of other fields allow for autoverification but suffer from a lack of appropriate data — examples include chemistry, finance, biology, and more. Closing the gap on these would require careful and efficient data curation by highly skilled experts from a diverse domain set — Toloka’s specialty.
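To illustrate what simple autoverification looks like in practice for math, here is a minimal sketch that checks a model's final answer against a golden label up to symbolic equivalence. It is illustrative only; production pipelines also need answer extraction, numeric tolerances, and domain-specific normalization, which is exactly where the more capable verifiers discussed next come in.

```python
# Minimal sketch of rule-based answer autoverification for math tasks,
# using sympy to compare final answers up to symbolic equivalence.
from sympy import simplify, sympify

def answers_match(model_answer: str, golden_answer: str) -> bool:
    """Return True if the two final answers are symbolically equivalent."""
    try:
        diff = simplify(sympify(model_answer) - sympify(golden_answer))
        return diff == 0
    except (ValueError, TypeError, SyntaxError):
        # Unparseable answers fall back to exact string comparison.
        return model_answer.strip() == golden_answer.strip()

print(answers_match("2*x + 2", "2*(x + 1)"))  # True: same expression
print(answers_match("x**2", "2*x"))           # False: different expressions
```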

Expanding the boundaries of autoverification: Another direction for improvement would involve making better [12] or more general verifiers.

  • One possible approach is to create datasets for training more capable verifiers [13], such as scaling up meta-evaluation datasets like Toloka’s μ-MATH in size and complexity.

  • Alternatively, domains such as law or medicine don’t typically have singular golden labels but lend themselves well to verification on expert-formulated rubrics, which Toloka already produces for a diverse set of complex domains.

Explicit demonstration where autoverification fails: For more open-ended expert-reasoning domains, such as consulting, web research, or forecasting, proper autoverification is challenging to do at scale [14]. For these, a possible approach is to have expert-crafted demonstrations for the entire reasoning process — planning, researching the sources, backtracking, etc. One way Toloka sets similar projects up is by having experts operate an LLM-based system, adding their inputs and steering it towards complete solutions.

Going Multi-*: multi-lingual, multi-cultural, multi-modal, multi-turn, etc.

2) Reasoning refinement

Process supervision: Training process reward models is a large part of the current reasoning systems research [15]. Despite the relevance, there’s very little public data available for PRM training, and all the limitations of poor domain coverage apply here as well. This type of data production overlaps with Toloka’s experience, such as trajectory annotation for coding agents [16].

Explicit demonstrations of excellent reasoning: Demonstration-style data could also be used to improve coherence, efficiency, and clarity of reasoning traces [17].

There are several ways to obtain this type of data:

  • Manually crafting examples of concise and effective reasoning from scratch

  • Editing LLM-generated trajectories

  • Issuing preferences amidst a number of alternative trajectory options

All of these align with our experience of building scalable SFT demonstration pipelines for complex domains. The preference option could either be done over entire trajectories or provide denser signals in a step-wise manner, similar to Toloka’s projects in multi-turn dialogue preferences.

3) Appropriate evals

Benchmarks: As we’ve already emphasized, benchmarks need to be novel and provide clear signals — incorporating purposeful design, quality data and meaningful analyses. We have relevant experience in creating clean, informative, and practical evaluation datasets, such as our publicly available U-MATH & μ-MATH, and in building custom evaluation pipelines with insight into quality criteria and metrics tailored to specific use-cases.

Red-teaming: Traditional safety labeling will need to be adapted to reasoning models to take long reasoning traces into account. Besides that, as models become more prominent, they have broader applications and thus an expanded attack surface as well. Toloka’s versatile team of domain and security experts tests a wide variety of applications, including agentic systems and computer operators.


The “DeepSeek moment” is still real

DeepSeek’s public release of R1 deserves recognition for the seminal AI moment that it is.

Although it is not a fully-open model in terms of reproduction transparency and, as we’ve shown here, it is not quite at the frontier level, this is the closest the AI community has ever been to an open frontier system. Besides, this is an incredible opportunity for us to study and better understand what differentiates these models, what’s potentially missing, and how to move forward.

Let’s build the future together.

[1] Not fully open, since the data and code to reproduce the results are not published. That said, the comprehensive technical reports accompanying DeepSeek’s releases are extremely valuable and admirable.
[2] Note about the Arena benchmark: GPT-4o currently scores higher than both o1-preview and o1-mini, which is indicative of the benchmark’s nature: it depends heavily on subjective human preferences rather than on formal model capabilities alone.
[3] The LLM Chess Leaderboard reports numbers only for o1-preview, which we present in the o1 column for easier overall comparison.
[4] For more complex math subjects, this is a truly demanding task.
[5] In principle, test-time compute should help with safety — there are even some reports on this already, e.g. by OpenAI with their study of o-series models — but that doesn’t seem to happen for DeepSeek at all.
[6] It's fascinating that Gemini 1.5 Pro aligns more closely with reasoning systems than with other standard inference models — a remarkable achievement at the time. Did the team already have traces from an experimental in-development reasoning system to distill from? We’re not sure, especially considering the uniquely word-efficient, succinct responses Gemini demonstrated on U-MATH. In light of this, our bet is on high-quality guidance, similar to what we’ve discussed for o1.
[7] We do that in our research paper.
[8] And also forming distinctive clusters of model families, with Qwen models being more lenient and proprietary models being more conservative. This is a reminder that when referring to a model, one is referring in large part to its training data.
[9] Mostly pushing to the right side of the chart, consistent with our observations that increased mathematical problem-solving and verbosity — hallmarks of current reasoner systems — correlate with an increase in true positive rate.
[10] Notably, R1 is the only reasoning system whose scores place it closer to the conservative models. Upon inspection, we found that its reasoning traces do indeed often drive it toward conservative judgments, with the model displaying “hyper-fixation” on minute details of the golden labels. This is the first case we encountered where an increase in coherence would probably help more with true positives than with true negatives. But the sentiment remains the same: coherence and reliability are required to appropriately and successfully apply problem-solving skills to the task at hand.
[11] That is because objectively verifiable tasks are perfect for applying the classical reinforcement learning framework to LLM training in a robust manner.
[12] This quickly becomes a necessity even with formal domains such as math when considering the more challenging tasks, as illustrated by the μ-MATH benchmark. Transitioning into domains such as compliance further exacerbates the problem.
[13] The current consensus is to avoid neural verifiers due to the potential for reward hacking, but there are already proof-of-concept results that avoid this risk, such as OpenAI’s safety training for the o-series models.
[14] Current models are also inept at generating quality post-hoc reasoning traces for such tasks.
[15] Pioneering work in the LLM space here is OpenAI’s PRM800K, a dataset of automatically generated math solutions with each step labeled by a human annotator. Qwen reports using a similar approach for their QwQ reasoner model, specifically advocating for human labeling over supervision-free, autoverification-based approaches. The Gemini team, in contrast, reports relying on exactly such supervision-free data generation methods, while acknowledging the usual limitation of being narrowed to autoverifiable tasks only.
[16] When viewing reasoning system training from the standpoint of classical reinforcement learning, such process supervising data could be considered a form of reward shaping.
[17] An example is given in the DeepSeek-R1 technical report, where they refer to this as “human priors”.
