Feb 12, 2025

Insights
R1 is not on par with o1, and the difference is qualitative, not quantitative

What sets a reasoning system apart, how to test for it, and how to achieve it

New benchmarking results compare the latest models

A few weeks ago, DeepSeek made waves with their latest R1 model, pushing ChatGPT out of the #1 spot in the Apple App Store and prompting tech stock selloffs that sent Nvidia share prices plummeting 17%.

The claim is that DeepSeek R1 — an open-weights [1] model produced by a small Chinese tech lab — rivals OpenAI’s o1 model in performance, but at a fraction of the cost for training and inference. The o1 model represents a shift towards adaptive reasoning systems that rely on test-time scaling, allocating more computational resources as needed during inference. Reasoning models have more powerful problem-solving capabilities than traditional LLMs, which have less capacity for adaptive allocation of compute. To access these capabilities in their latest o1 model, OpenAI charges six times more than for GPT-4o. DeepSeek offers the same type of reasoning system, but with openly available model weights.

Setting aside the geopolitical aspect, open vs closed source questions, and implications of an imminent decline in training and inference costs, let's focus on DeepSeek's biggest splash — the performance parity.

The Toloka Research team has been testing R1 and o1 and analyzing existing evaluations, and our findings bring R1’s dominance into question.

Read on to learn:

  1. How R1 really compares to o1

  2. Where we’ve exposed the most significant gaps

  3. What makes o1 stand apart from all the current reasoner models

  4. How to approach closing the gap


Public comparisons show R1 on par with o1

AI enthusiasts and the media think that R1 is on par with o1, based on public benchmarks that show remarkably similar performance. Let's look at available comparisons of R1 and o1.

DeepSeek report

This comparison table from DeepSeek’s own report shows R1 and o1 neck and neck on major benchmarks. The majority of articles comparing R1 to o1 are based on this table, or tiny-scale manual vibe checks using a handful of prompts.

Popular online leaderboards

Additional “go-to” evaluation sources are the full LiveBench leaderboard and Chatbot Arena [2].

Style comparisons

There are also numerous anecdotal reports on R1’s superior writing style, generally described as “more creative and fun, if a bit unhinged,” which is in line with some independent evaluations.

General sentiment

The widespread narrative based on those evaluations is that the models are more or less equal. R1 is slightly better at math and coding and has a more free and creative writing style, while o1 is somewhat better at factuality, question answering, and instruction following, with a writing style that focuses on careful structure, grammar, and logic.

Most AI experts vaguely attribute these differences to training focus: R1's training was "less restricted", with an emphasis on math and coding domains that lend themselves well to reinforcement learning, while o1 was presumably "more guided", with more attention to world knowledge and alignment. OpenAI invested heavily in o1's alignment as a public-facing product via ChatGPT, ensuring the model's safety and adaptation to consumer preferences across general domains. It all seems to add up.

But we think there’s more to the story.


Long-tail benchmarks reveal gaps

As soon as we leave the beaten path, alternative benchmarks paint a different picture. The Toloka Research team investigated evaluations in niche subdomains and uncommon domains and noted quantitative and qualitative gaps in model performance.

Niche subdomains

A few months ago, Toloka released U-MATH — a benchmark to test LLMs on university-level mathematics, sourced from unpublished problems used in the curriculum at top US universities. We developed the benchmark precisely to represent an uncommon, under-explored niche within the math domain — real-world university math. According to our U-MATH evaluation, R1 is not any better than o1 at mathematical reasoning — in fact, it’s on par with o1-mini.
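For readers who want to reproduce this kind of comparison, the U-MATH problems are publicly available on Hugging Face. The sketch below shows one way to pull the dataset locally; the repository ID and split layout are assumptions, so check Toloka's Hugging Face organization page for the exact names before running.

```python
# Minimal sketch: inspect U-MATH locally with the `datasets` library.
# NOTE: the repository ID below is an assumption -- verify the exact
# dataset name and split names on Toloka's Hugging Face page.
from datasets import load_dataset

umath = load_dataset("toloka/u-math")       # hypothetical repo ID
print(umath)                                # available splits and their sizes

first_split = next(iter(umath.values()))
print(first_split[0])                       # one university-level problem record
```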

We continued to study niche subdomains by examining coding, the other domain where R1 is considered to have an edge over o1. In terms of coding benchmarks, Aider’s Polyglot is a good niche example, focusing on code editing tasks in a large number of different programming languages. Aider's evaluations contradict the story of R1 beating o1 in coding, similar to our conclusions on U-MATH.

Unusual domains

There are some interesting benchmarks assessing altogether less conventionally tested capabilities, and their findings on o1 vs R1 are in line with ours as well [3]:

  • μ-MATH is our other benchmark on university mathematics, based on U-MATH but intended for meta-evaluations — i.e., assessing the performance of LLMs as automatic solution evaluators instead of problem solvers. Judging solutions by testing them against a golden label proves to be a challenging skill in its own right [4], and it is a somewhat atypical skill to evaluate and train for. Here, similarly, we observe R1’s score closer to that of o1-mini, not o1.

  • ARC tests AI systems on their ability to adapt to novel situations, showing R1 lagging well behind o1.

  • LLM Chess Leaderboard again places R1 behind o1 and closer to o1-mini. The evaluation also reports that R1 makes an average of 52.93 illegal moves per 1,000 attempts, compared to o1 and o1-mini with 9.29 and 4.29 illegal moves, respectively.

  • A Northeastern University Programming Research Lab study tested the models on the NPR Sunday Puzzle Challenge, a set of difficult wordplay problems based on common knowledge. In addition to the large gap between R1 and o1, the researchers also reported several "failure modes" that occur with R1 and never with o1 or o1-mini, such as R1 explicitly stating "I give up" 23.9% of the time or failing to stop generation 8.5% of the time.

Superior generalization and reliability place o1 in a league of its own

The long-tail benchmarks above are pertinent because they are unconventional, testing for novelty and robustness. So here’s our claim: o1 has greater generalization and reliability than R1. In terms of reliability, we can see that both o1 and o1-mini are superior to R1.

As we continued investigating, we found that other publicly available evaluations support our conjecture.

From the standpoint of lesser generalization, we expect R1 not only to perform worse on niche subdomains or novel domains, but also to display degraded performance on new, unseen tasks from familiar suites. Our expectation is confirmed by MathArena’s recent report on AIME 2025 results, highlighting performance on problems published after the release of o1 and R1. On the unseen tasks, performance drops significantly for R1 but not for o1.

Reliability, in turn, plays a vital role in adversarial robustness and consistency, both lacking in R1 compared to o1 and o1-mini.

  • Cisco reports a 100% success rate in its attempts to jailbreak R1, compared to 26% for o1-preview, and there are other similar reports on R1’s safety limitations [5].

Both proper generalization and reliability are extremely important for AI systems, often cited as the primary bottlenecks of agentic applications. An extension of our claim is that superior generalization and reliability put o1 in a league of its own among all the currently available reasoning models. While we agree with the popular narrative that R1 has a different focus from o1, the o1 emphasis on alignment and broad domain coverage results in a qualitative difference that goes far beyond a mere difference in emphasis.

Later, we'll broaden the discussion of generalization and reliability in models and ways to enhance these key properties in a reasoning system, so stay tuned.


Beyond o1 vs R1: studying trends and biases

Bringing in more models and metrics can help us get a bigger-picture perspective on the performance patterns, instead of only focusing on o1-to-R1 comparisons from standalone benchmarks. Let's turn again to our own suite of U-MATH and μ-MATH datasets.

Recognizing patterns

You can study the U-MATH and μ-MATH leaderboard on Hugging Face. Here are the aggregate scores, ranked by μ-MATH F1.

The graph reveals three major performance trends:

  1. Judgment vs. problem-solving: In non-reasoning models, judgment and problem-solving performance improve together — up to a point. Beyond that, they diverge, with better problem-solving correlating with weaker judgment. This puts some numbers behind our statement that judgment is a distinct skill and suggests an inherent tradeoff between the two.

  2. Reasoners extend the Pareto frontier by breaking out of this tradeoff and advancing beyond the previous generation of top models [6].

  3. o1 stands apart by pushing a step beyond other reasoning models into its own category.

Investigating the tradeoffs

We won’t go into much detail here [7], but what we find is that the problem-solving vs judgment tradeoff mentioned above translates into behavioral differences among the models, yielding two distinctive judgment styles:

  • Lenient judges: More verbose, inclined toward lengthy derivations and better at comparing complex expressions, but prone to losing track — leading to fewer false negatives but more false positives.

  • Conservative judges: More structured and precise, but often anchored on the exact form of the reference answer or golden label — leading to fewer false positives but more false negatives.

Good judgment depends on balancing mathematical problem-solving with general structured logic and instruction-following, and an ideal model would have strong domain-specific reasoning skills while maintaining high reliability and coherence in order to successfully apply them.

To understand this balance better, we can decompose the F1 score into True Positive Rate (TPR), which measures correct positive classifications, and True Negative Rate (TNR), which tracks correct rejections of incorrect solutions. Higher TPR means fewer false negatives, while higher TNR means fewer false positives.
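To make the decomposition concrete, here is a minimal sketch of how TPR, TNR, and a macro-averaged F1 can be computed from a judge's verdicts against golden correctness labels. The helper is illustrative only; the exact aggregation used on the μ-MATH leaderboard may differ.

```python
import numpy as np

def judge_metrics(golden: np.ndarray, predicted: np.ndarray) -> dict:
    """Sketch of judge-quality metrics.

    golden    -- 1 if the evaluated solution is actually correct, else 0
    predicted -- 1 if the judge accepted the solution, else 0
    """
    tp = np.sum((golden == 1) & (predicted == 1))
    fn = np.sum((golden == 1) & (predicted == 0))
    tn = np.sum((golden == 0) & (predicted == 0))
    fp = np.sum((golden == 0) & (predicted == 1))

    tpr = tp / (tp + fn)   # recall on correct solutions: fewer false negatives
    tnr = tn / (tn + fp)   # recall on incorrect solutions: fewer false positives

    # Per-class F1 scores, then a macro average over the two classes.
    precision_pos = tp / (tp + fp)
    f1_pos = 2 * precision_pos * tpr / (precision_pos + tpr)
    precision_neg = tn / (tn + fn)
    f1_neg = 2 * precision_neg * tnr / (precision_neg + tnr)

    return {"TPR": tpr, "TNR": tnr, "macro_F1": (f1_pos + f1_neg) / 2}

# Lenient judges trade TNR for TPR; conservative judges do the opposite.
golden    = np.array([1, 1, 1, 0, 0, 0])
predicted = np.array([1, 1, 0, 0, 0, 1])
print(judge_metrics(golden, predicted))
```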

We can observe all the same trends with this chart: non-reasoning systems exhibiting the performance tradeoff [8], reasoning systems pushing the frontier away [9], and o1 advancing a step further still. What we can also see is that on top of being the best-performing model, o1 is among the most balanced ones [10].

Studying the previous generation of models informs us on how to approach this balancing with the next:

  1. Balanced training makes for a more balanced model. We can see in action how transitioning from math specialists to generalist models leads to better, more well-rounded judgment performance. A balanced training setup integrates the depth of formal reasoning with breadth of domain coverage and coherence.

  2. Reducing capability amplifies training-induced biases. Lenient models tend to become even more lenient when scaled down to smaller sizes, and conservative models become more conservative. A model needs to have appropriate capabilities that allow it to generalize over the things that you’re balancing.

The key point is: Powerful reasoning systems have the appropriate capabilities and already excel at formal reasoning. Training focused on diversifying domains and improving coherence will help guide them toward generalization and reliability.


Building more generalizable and reliable models: It’s all in the data

We’ve discussed at length why a reasoning system requires generalization and reliability, and we’ve shown o1’s excellence on both dimensions along with the superior performance that results.

We have three recommendations to help improve these qualities in reasoning systems, all requiring high-quality data. A strong data partner like Toloka can help navigate the details of data production.

1) Diverse domain coverage

Increasing versatility of autoverified data: Math and coding tasks permitting simple autoverification are the bread and butter of current reasoning systems [11], but there are ways to diversify.

  • Even math and coding have untapped niches. The U-MATH benchmark is derived from our larger training dataset, offering a good example of a math subdomain that’s underrepresented compared to the more readily-available textbook, high-school or Olympiad-style data.

  • Plenty of other fields allow for autoverification but suffer from a lack of appropriate data — examples include chemistry, finance, biology, and more. Closing the gap on these would require careful and efficient data curation by highly skilled experts from a diverse domain set — Toloka’s specialty.
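To illustrate what simple autoverification looks like in practice for math, here is a minimal sketch that checks a model's final answer against a golden label up to symbolic equivalence. It is illustrative only; production pipelines also need answer extraction, numeric tolerances, and domain-specific normalization, which is exactly where the more capable verifiers discussed next come in.

```python
# Minimal sketch of rule-based answer autoverification for math tasks,
# using sympy to compare final answers up to symbolic equivalence.
from sympy import simplify, sympify

def answers_match(model_answer: str, golden_answer: str) -> bool:
    """Return True if the two final answers are symbolically equivalent."""
    try:
        diff = simplify(sympify(model_answer) - sympify(golden_answer))
        return diff == 0
    except (ValueError, TypeError, SyntaxError):
        # Unparseable answers fall back to exact string comparison.
        return model_answer.strip() == golden_answer.strip()

print(answers_match("2*x + 2", "2*(x + 1)"))  # True: same expression
print(answers_match("x**2", "2*x"))           # False: different expressions
```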

Expanding the boundaries of autoverification: Another direction for improvement would involve making better [12] or more general verifiers.

  • One possible approach is to create datasets for training more capable verifiers [13], such as scaling up meta-evaluation datasets like Toloka’s μ-MATH in size and complexity.

  • Alternatively, domains such as law or medicine don’t typically have singular golden labels but lend themselves well to verification on expert-formulated rubrics, which Toloka already produces for a diverse set of complex domains.

Explicit demonstration where autoverification fails: For more open-ended expert-reasoning domains, such as consulting, web research, or forecasting, proper autoverification is challenging to do at scale [14]. For these, a possible approach is to have expert-crafted demonstrations for the entire reasoning process — planning, researching the sources, backtracking, etc. One way Toloka sets similar projects up is by having experts operate an LLM-based system, adding their inputs and steering it towards complete solutions.

Going Multi-*: multi-lingual, multi-cultural, multi-modal, multi-turn, etc.

2) Reasoning refinement

Process supervision: Training process reward models is a large part of the current reasoning systems research [15]. Despite the relevance, there’s very little public data available for PRM training, and all the limitations of poor domain coverage apply here as well. This type of data production overlaps with Toloka’s experience, such as trajectory annotation for coding agents [16].

Explicit demonstrations of excellent reasoning: Demonstration-style data could also be used to improve coherence, efficiency, and clarity of reasoning traces [17].

There are several ways to obtain this type of data:

  • Manually crafting examples of concise and effective reasoning from scratch

  • Editing LLM-generated trajectories

  • Issuing preferences amidst a number of alternative trajectory options

All of these align with our experience of building scalable SFT demonstration pipelines for complex domains. The preference option could either be done over entire trajectories or provide denser signals in a step-wise manner, similar to Toloka’s projects in multi-turn dialogue preferences.

3) Appropriate evals

Benchmarks: As we’ve already emphasized, benchmarks need to be novel and provide clear signals — incorporating purposeful design, quality data and meaningful analyses. We have relevant experience in creating clean, informative, and practical evaluation datasets, such as our publicly available U-MATH & μ-MATH, and in building custom evaluation pipelines with insight into quality criteria and metrics tailored to specific use-cases.

Red-teaming: Traditional safety labeling will need to be adapted to reasoning models to take long reasoning traces into account. Besides that, as models become more prominent, they have broader applications and thus an expanded attack surface as well. Toloka’s versatile team of domain and security experts tests a wide variety of applications, including agentic systems and computer operators.


The “DeepSeek moment” is still real

DeepSeek’s public release of R1 deserves recognition for the seminal AI moment that it is.

Although it is not a fully-open model in terms of reproduction transparency and, as we’ve shown here, it is not quite at the frontier level, this is the closest the AI community has ever been to an open frontier system. Besides, this is an incredible opportunity for us to study and better understand what differentiates these models, what’s potentially missing, and how to move forward.

Let’s build the future together.

[1] Not fully open, since the data and code to reproduce the results are not published. That said, the comprehensive technical reports accompanying DeepSeek’s releases are extremely valuable and admirable.
[2] Note about the Arena benchmark: GPT-4o currently scores higher than both o1-preview and o1-mini, which is indicative of the benchmark’s nature: it depends heavily on subjective human preferences rather than on formal model capabilities alone.
[3] The LLM Chess Leaderboard reports numbers only for o1-preview, which we present in the o1 column for easier overall comparison.
[4] For more complex math subjects, this is a truly demanding task.
[5] In principle, test-time compute should help with safety — there are even some reports on this already, e.g. by OpenAI with their study of o-series models — but that doesn’t seem to happen for DeepSeek at all.
[6] It's fascinating that Gemini 1.5 Pro aligns more closely with reasoning systems than with other standard inference models — a remarkable achievement at the time. Did the team already have traces from an experimental in-development reasoning system to distill from? We’re not sure, especially considering the uniquely word-efficient, succinct responses Gemini demonstrated on U-MATH. In light of this, our bet is on high-quality guidance, similar to what we’ve discussed for o1.
[7] We do that in our research paper.
[8] And also forming distinctive clusters of model families, with Qwen models being more lenient and proprietary models being more conservative. This is a reminder that when referring to a model, one is referring in large part to its training data.
[9] Mostly pushing to the right side of the chart, consistent with our observations that increased mathematical problem-solving and verbosity — hallmarks of current reasoner systems — correlate with an increase in true positive rate.
[10] Notably, R1 is the only reasoning system whose scores place it closer to the conservative models. Upon inspection, we found that its reasoning traces do indeed often drive it toward conservative judgments, with the model displaying “hyper-fixation” on minute details of the golden labels. This is the first case we encountered where an increase in coherence would probably help more with true positives than with true negatives. But the sentiment remains the same: coherence and reliability are required to appropriately and successfully apply problem-solving skills to the task at hand.
[11] That is because objectively verifiable tasks are perfect for applying the classical reinforcement learning framework to LLM training in a robust manner.
[12] This quickly becomes a necessity even with formal domains such as math when considering the more challenging tasks, as illustrated by the μ-MATH benchmark. Transitioning into domains such as compliance further exacerbates the problem.
[13] The current consensus is to avoid neural verifiers due to the potential for reward hacking, but there are already proof-of-concept results that avoid this risk, such as OpenAI’s safety training for the o-series models.
[14] Current models are also inept at generating quality post-hoc reasoning traces for such tasks.
[15] Pioneering work in the LLM space here is OpenAI’s PRM800K, a dataset of automatically generated math solutions with each step labeled by a human annotator. Qwen reports using a similar approach for their QwQ reasoner model, specifically advocating for human labeling over supervision-free, autoverification-based approaches. The Gemini team, in contrast, reports relying on exactly such supervision-free data generation methods, while acknowledging the usual limitation of being narrowed to autoverifiable tasks only.
[16] When viewing reasoning system training from the standpoint of classical reinforcement learning, such process supervising data could be considered a form of reward shaping.
[17] An example is given in the DeepSeek-R1 technical report, where they refer to this as “human priors”.
