The Data Behind DeepSeek’s Success
In recent weeks, DeepSeek has shaken the AI world, with discussions spreading across mainstream media, researchers, AI developers, tech enthusiasts, and industry leaders. Once a relatively unknown player in the LLM space, the company has released DeepSeek R1, a model that matches the best existing LLMs on several popular leaderboards. Additionally, DeepSeek R1 is published under the MIT license, and a technical report accompanied its release. The achievement is all the more remarkable because DeepSeek claims the model was trained on a budget of just $5.6 million, a fraction of what competitors have spent on similar models.
So, what’s the secret behind DeepSeek’s success? The technical report, however, leaves out key details, particularly regarding data collection and training methodology.
Several open-source initiatives, such as the Open-R1 project on Hugging Face, are now working to reproduce DeepSeek R1. In this article, Toloka’s researchers analyze the key factors that set DeepSeek R1 apart and explore the data requirements for building your own R1 model, or an even better version.
What did DeepSeek do differently?
DeepSeek’s success with R1 comes from rethinking the standard training process. Traditionally, large models undergo supervised fine-tuning (SFT) first, followed by reinforcement learning (RL) for alignment and tuning on complex tasks. DeepSeek revised this approach.
Instead of fine-tuning first, they applied RL on math and coding tasks early in training to enhance reasoning abilities. This allowed the model to generate answers independently with minimal supervision, since only the final answer was validated, and to draw maximum reasoning benefit from pre-training. The model’s skills were then refined and expanded beyond the math and coding domains through fine-tuning on non-reasoning tasks.
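For intuition, here is a minimal sketch (not DeepSeek’s actual code) of what “validating only the final answer” can look like in practice: the reasoning text is never graded, and the reward depends solely on whether the extracted final answer matches the reference. The \boxed{} convention is an assumption for illustration.

```python
import re

def final_answer_reward(model_output: str, reference_answer: str) -> float:
    """Reward the model only for a correct final answer; the reasoning
    itself is never graded. Assumes answers are wrapped in \\boxed{...},
    a common convention for math outputs (hypothetical format choice)."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0  # no parsable final answer, no reward
    predicted = match.group(1).strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0
```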
While this provides a high-level understanding of DeepSeek’s approach, it’s important to examine the data used at each stage of training. The following diagram breaks down the key training steps in more detail.

DeepSeek-R1 training pipeline
Stage 1: Cold Start (SFT)
The model was first trained on a small, LLM-generated and human-curated dataset of high-quality reasoning demonstrations (math and code). These examples focused on improving the consistency and readability of reasoning trajectories rather than enhancing reasoning ability itself. This phase helped accelerate convergence in the following reinforcement learning (RL) stage.
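The exact cold-start data format is not public. As a rough illustration, assuming a simple <think>/<answer> tagging scheme (the tag names and record fields are hypothetical), a single demonstration might be stored and serialized like this:

```python
# Hypothetical cold-start demonstration: a curated, readable reasoning
# trace paired with a final answer (tag names and fields are assumptions).
demo = {
    "prompt": "What is the sum of the first 10 positive integers?",
    "reasoning": "The sum of 1..n is n*(n+1)/2. For n=10 that is 10*11/2 = 55.",
    "answer": "55",
}

def to_training_text(example: dict) -> str:
    """Serialize one demonstration into the string the model is fine-tuned on."""
    return (
        f"{example['prompt']}\n"
        f"<think>{example['reasoning']}</think>\n"
        f"<answer>{example['answer']}</answer>"
    )

print(to_training_text(demo))
```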
Stage 2: Reasoning-Oriented RL
This stage provided the biggest performance boost. The model was trained on tasks with auto-verifiable answers (math, code, logic) using predefined rule-based checks as the primary reward signal. No human demonstrations were included, only deterministic correctness checks (e.g., exact-match on math answers) and rule-based evaluations of reasoning format and language consistency. While the format checks slightly constrained performance, they ensured more human-friendly reasoning outputs.
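Conceptually, the rule-based reward can be pictured as a correctness check combined with lightweight format and language-consistency checks. The sketch below reuses the hypothetical <think>/<answer> tags from the cold-start example; the weights and the language heuristic are illustrative assumptions, not DeepSeek’s published values.

```python
import re

def reasoning_reward(output: str, reference: str) -> float:
    """Rule-based reward: exact-match correctness plus small bonuses for
    following the expected <think>/<answer> format and for keeping the
    output in a single language (weights are illustrative)."""
    answer = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    correct = 1.0 if answer and answer.group(1).strip() == reference.strip() else 0.0

    well_formatted = 0.1 if re.search(r"<think>.*?</think>", output, re.DOTALL) else 0.0

    # Crude language-consistency proxy: penalize outputs that mix scripts,
    # e.g. Latin and CJK characters in the same reasoning trace.
    has_latin = bool(re.search(r"[A-Za-z]", output))
    has_cjk = bool(re.search(r"[\u4e00-\u9fff]", output))
    consistent = 0.1 if not (has_latin and has_cjk) else 0.0

    return correct + well_formatted + consistent
```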
Stage 3: Synthetic SFT
DeepSeek used synthetic data to fine-tune the model. Specifically, 600,000 reasoning data samples were generated through rejection sampling and refinement from the RL-trained model described above, and 200,000 non-reasoning data samples were derived from DeepSeek-V3, covering writing, QA, and translation tasks. In total, 800,000 samples were used to fine-tune the base model.
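A rough sketch of the rejection-sampling step is shown below. The `generate` function and the `is_correct` verifier are placeholders standing in for the RL-trained model and the automatic checks, not DeepSeek’s actual tooling.

```python
from typing import Callable, List

def rejection_sample(
    prompts: List[str],
    references: List[str],
    generate: Callable[[str, int], List[str]],   # returns k candidate completions
    is_correct: Callable[[str, str], bool],      # auto-verifies one completion
    k: int = 16,
) -> List[dict]:
    """Keep only completions that pass automatic verification, then reuse
    them as supervised fine-tuning data."""
    sft_data = []
    for prompt, ref in zip(prompts, references):
        candidates = generate(prompt, k)
        accepted = [c for c in candidates if is_correct(c, ref)]
        if accepted:
            # One accepted sample per prompt; how to pick among accepted
            # candidates (e.g. shortest or most readable) is a design choice.
            sft_data.append({"prompt": prompt, "completion": accepted[0]})
    return sft_data
```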
Stage 4: Mixed RL (Reasoning + RLHF)
At this final stage, auto-verifiable rule-based rewards continued to refine the reasoning tasks, while preference-based RLHF (similar to DeepSeek-V3) was applied to general tasks. Final answers were optimized for helpfulness, while both the reasoning chains and the answers were tuned for safety.
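One way to picture the mixed stage is as routing each training prompt to the appropriate reward source. The sketch below assumes a rule-based verifier for reasoning tasks and a learned preference reward model for everything else; the function names and task-type labels are placeholders.

```python
from typing import Callable, Optional

def mixed_reward(
    task_type: str,
    output: str,
    reference: Optional[str],
    rule_based_reward: Callable[[str, str], float],  # deterministic checks
    preference_model: Callable[[str], float],        # learned RLHF reward model
) -> float:
    """Route auto-verifiable reasoning tasks to rule-based verification and
    general tasks to a preference-based reward model."""
    if task_type in {"math", "code", "logic"} and reference is not None:
        return rule_based_reward(output, reference)
    return preference_model(output)
```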
How to replicate DeepSeek’s process by focusing on the right data
The most significant performance boost in DeepSeek R1 came from reasoning-oriented RL. To replicate or exceed their success, prioritize high-quality data for this stage. They used auto-verifiable tasks such as math and coding, where answers are clearly defined and can be automatically checked (e.g., through unit tests or predetermined answers). While DeepSeek concentrated on math and coding, this approach can be extended to other domains, such as physics or chemistry, where automatic verification is possible.
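For coding tasks, “auto-verifiable” usually means running the candidate solution against unit tests. A minimal sketch could look like the following, where the test cases are plain assert statements appended to the candidate code; a production setup would run this inside a proper sandbox rather than a bare subprocess.

```python
import subprocess
import sys
import tempfile

def passes_unit_tests(candidate_code: str, test_code: str, timeout: float = 10.0) -> bool:
    """Return True if the candidate solution passes every assert in test_code.
    Subprocess isolation here is only a sketch, not a secure sandbox."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```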
However, other types of data are also essential. Invest in high-quality chain-of-thought demonstrations designed for cold-start reasoning training for further improvement. Rather than relying on generic chain-of-thought data, target specific domains or languages to achieve the best performance boost. Additionally, include classic SFT data for non-auto-verifiable tasks and human preferences for final model alignment.
Beyond DeepSeek performance
At first glance, based on common benchmarks, DeepSeek R1 appears to perform similarly to OpenAI’s reasoning model o1. It slightly outperforms o1 on reasoning tasks (e.g., MATH-500, SWE-bench Verified) and falls just behind on general knowledge benchmarks (MMLU, SimpleQA). However, the performance gap becomes more noticeable in niche and out-of-domain areas. For example, on the LLM Chess leaderboard, o1 achieves a 46.67% win rate, while R1 reaches only 22.58%.
Toloka’s researchers have conducted additional tests on U-MATH, a dataset of complex university-level mathematics, where R1 performed significantly worse than o1. (We will publish and link the detailed analysis soon.)
Why does o1 perform better in these specialized areas? Is DeepSeek R1 truly strong in mathematics? While R1 outperforms o1 on MATH-500, it struggles with more advanced university-level problems. A likely explanation is that MATH-500 includes data within R1’s training distribution, whereas U-MATH contains out-of-domain challenges. DeepSeek R1 was trained on widely available datasets that do not include advanced, proprietary mathematical problems.
To surpass DeepSeek R1, we recommend incorporating complex, domain-specific data. Training on widely available datasets limits a model’s ability to handle novel, specialized tasks. By integrating high-quality data from niche fields, you can develop a model that excels where R1 currently falls short.
Are you ready to take your model to the next level?
Partner with Toloka to take your model performance to the next level. We offer top-tier Auto-Verifiable Tasks, similar to those used in DeepSeek RL training, designed to enhance objective reasoning through automated feedback. Our experts create complex prompts, test cases, answers, and rubrics to ensure precision and reliability.
Additionally, we provide specialized domain-specific SFT data, chain-of-thought SFT data, and human feedback for final model alignment.
Connect with our team to explore your data needs.
Updated: Feb 10, 2025