Foundation model training data: How frontier labs build pre-training datasets at scale

The data bottleneck era

Three years ago, the conversation about foundation model training was dominated by compute. Whoever had the bigger cluster won. Today, that conversation has shifted. The 2024 to 2026 generation of frontier models has been gated not by FLOPs, but by data: how much high-quality data exists in the world, how to source it without exhausting public reserves, and how to verify its quality at scale.

Llama 3.1 was trained on roughly 15 trillion tokens. GPT-5 and the Claude Opus 4 series are reported to use comparable or larger pre-training corpora. Forecasts from Villalobos and colleagues at Epoch AI suggest that the stock of public human-generated text will be fully used at some point between 2026 and 2032, with the earlier dates applying if models are heavily overtrained. The implication is straightforward. The frontier labs that win the 2027 to 2030 model generation will be the ones that build defensible data pipelines today.

For the heads of pre-training, research VPs, and chief AI officers reading this, the strategic question is no longer "where do we get more data?" It is "where do we get higher-quality, more diverse, more compliant data, and how do we verify it at the scale our pre-training stack demands?" This article addresses that question end to end, from the sources used by modern frontier labs to the role of human-generated data in pre-training, with practitioner-level technical detail.

Pre-training and why training data is the foundation

Pre-training is the initial training phase in which a base model learns from a large unlabeled corpus through self-supervised objectives, typically next-token prediction for text-only models, or interleaved next-token-and-image objectives for multimodal models. It sets the model's factual knowledge cutoff, its language coverage ceiling, its baseline reasoning patterns, and the latent capabilities that downstream fine-tuning can later surface. We covered this lifecycle in detail in our guide to pre-training in LLM development. The short version: pre-training choices propagate through every subsequent training stage.

Modern LLM development has settled into three stages. Pre-training itself runs for weeks to months and consumes the majority of compute spend, often hundreds of millions of dollars at the frontier. Mid-training, sometimes called continued pre-training or annealing, is the targeted addition of higher-quality data such as code, mathematics, and instruction-like sequences in the final stage of pre-training. Post-training covers everything that turns a base model into a deployed assistant: supervised fine-tuning, reinforcement learning from human feedback, direct preference optimisation, and the newer reinforcement learning from verifiable rewards (RLVR).

What makes pre-training data the foundation is the compounding nature of its choices. A capability that is absent from pre-training is extraordinarily expensive to add later. You can fine-tune for what is latent in the base model. You cannot, in general, fine-tune in what was missing. If your pre-training corpus contains very little Arabic, no amount of Arabic SFT data will produce a strong Arabic model. If your pre-training corpus is thin on rigorous mathematical reasoning, post-training cannot conjure it. This is why the pre-training data decision has become the most consequential decision in foundation model development.

The foundation model data pipeline

Data sources at the frontier

Modern pre-training corpora are blends. Six categories dominate. First, public web text, derived from Common Crawl and its refined variants. The most influential 2024 release here was FineWeb (Penedo et al., 2024), a 15 trillion token English corpus that demonstrated reproducible quality-filtering pipelines could match or exceed proprietary web corpora. Second, books and long-form literature, sourced through licensing partnerships with publishers and from public-domain collections. Third, code, drawn from GitHub repositories filtered by license, plus StackExchange and licensed code corpora. Fourth, scientific and reference content from arXiv, PubMed, S2ORC, and patent corpora. Fifth, synthetic data, generated by larger models and then filtered or verified. Sixth, and the category that has grown fastest in the past 18 months, human-generated specialised data covering instructions, reasoning chains, expert demonstrations, and curated rare-event data.

The mixture ratios are now a closely guarded competitive parameter. Open papers from 2024 to 2026 suggest that code constitutes 15 to 25 percent of high-performing pre-training mixes, far higher than its proportion of natural language data on the internet. This is not because frontier models are primarily code assistants. It is because exposure to code during pre-training improves general reasoning, structured output, and instruction-following. The mid-training mixture, in particular, often skews heavily toward code, math, and high-quality instruction data.
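
To make the idea of a mixture ratio concrete, here is a minimal sketch of how a total token budget might be split across the six source categories. Only the code share (20 percent, inside the 15 to 25 percent range quoted above) comes from the text; every other share, the category names, and the `token_budget` helper are hypothetical and chosen purely for illustration.

```python
def token_budget(total_tokens: int, parts: dict[str, int]) -> dict[str, int]:
    """Split a token budget across sources in proportion to integer weights.

    Integer weights (parts per hundred here) keep the arithmetic exact.
    """
    total_parts = sum(parts.values())
    if total_parts <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: total_tokens * p // total_parts for name, p in parts.items()}


# Hypothetical mixture in parts per hundred. Only the code share reflects
# the 15 to 25 percent range mentioned in the article.
MIX = {
    "web": 55,
    "code": 20,
    "books": 8,
    "science": 7,
    "synthetic": 6,
    "human": 4,
}

budget = token_budget(15_000_000_000_000, MIX)
for source, tokens in budget.items():
    print(f"{source:>10}: {tokens / 1e12:.2f}T tokens")
```

In a real pipeline the mid-training phase would typically use a second, differently weighted table rather than reusing the main pre-training one.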

Filtering, deduplication, and decontamination

Raw web data is not training data. Between the petabyte-scale crawl and the trillion-token training corpus sits an enormous filtering pipeline. Quality classifiers, often fastText-based or learned with small encoder models, score documents for educational value, factuality, and reasoning density. Heuristic filters remove documents based on language identification confidence, URL block lists, perplexity ranges, document length, and character ratios. Deduplication operates at multiple granularities: exact match through hashing, near-duplicate through MinHash or SimHash, and substring-level through suffix arrays on petabyte-scale data.
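
As an illustration of the near-duplicate step, the following is a minimal MinHash sketch using only the Python standard library. It is a toy under stated assumptions, not a production pipeline: the shingle width, the number of hash functions, and all function names are hypothetical, and real systems add locality-sensitive hashing so that documents are not compared pairwise.

```python
import hashlib


def shingles(text: str, n: int = 5) -> set[str]:
    """Return the set of overlapping n-word shingles in a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_hashes: int = 64, n: int = 5) -> list[int]:
    """Compute a MinHash signature: the minimum hash per seeded hash function."""
    doc_shingles = shingles(text, n)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in doc_shingles
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


original = "the quick brown fox jumps over the lazy dog near the river bank today"
near_dup = "the quick brown fox jumps over the lazy dog near the river bank tonight"
unrelated = "quarterly revenue grew as the company expanded into several new regional markets"

sig = minhash_signature(original)
print(estimated_jaccard(sig, minhash_signature(near_dup)))
print(estimated_jaccard(sig, minhash_signature(unrelated)))
```

A pipeline would then drop or merge documents whose estimated similarity exceeds a chosen threshold; the threshold itself is a tuning decision.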

Decontamination is the often-underdiscussed filtering stage. Before pre-training, the data pipeline must remove sequences that appear in evaluation benchmarks. This is harder than it sounds. Modern decontamination uses n-gram overlap detection at varying widths and, increasingly, embedding-similarity search to catch paraphrased benchmark contamination. The 2024 to 2026 era saw several public embarrassments where models scored well on benchmarks they had been inadvertently trained on. Frontier labs now treat decontamination as a release-blocking compliance step, not a quality enhancement.
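
The n-gram overlap check described above can be sketched in a few lines. This is a simplified illustration: the 8-token window, the tokenisation, and the function names are assumptions, and it covers only verbatim overlap, not the embedding-similarity search needed to catch paraphrases.

```python
import re


def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return all n-token windows of a lowercased, punctuation-stripped text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(benchmark_items: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """Collect every n-gram that appears in any benchmark item."""
    index: set[tuple[str, ...]] = set()
    for item in benchmark_items:
        index |= ngrams(item, n)
    return index


def is_contaminated(document: str, index: set[tuple[str, ...]], n: int = 8) -> bool:
    """Flag a document that shares at least one n-gram with the benchmark index."""
    return not ngrams(document, n).isdisjoint(index)


# Hypothetical benchmark item and candidate documents.
benchmark = ["If a train travels 60 miles in 1.5 hours, what is its average speed in miles per hour?"]
index = build_benchmark_index(benchmark)

leaked = "Homework help: if a train travels 60 miles in 1.5 hours, what is its average speed in miles per hour? Answer: 40."
clean = "Trains in the region were delayed for several hours on Tuesday because of heavy snowfall on the line."

print(is_contaminated(leaked, index))
print(is_contaminated(clean, index))
```

Running the check at several window widths, as the text notes, trades false positives on common phrases against missed partial overlaps.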

The licensing turn

The most consequential shift in pre-training data sourcing has been the move from "scraped if accessible" to "licensed and provenance-tracked." Three drivers converged. The EU AI Act, with its Article 53 obligations on providers of general-purpose AI models, requires detailed summaries of training content and copyright compliance. US executive orders and voluntary commitments require frontier labs to document training data provenance. And copyright litigation, including the New York Times case against OpenAI and several class actions, has made unlicensed scraped data a measurable financial risk.

The practical result is that frontier labs increasingly source data through licensing partnerships with publishers, news organisations, academic institutions, and specialised data providers. The vendors who can deliver model-ready data with clear licensing, provenance documentation, and compliance attestations have become preferred partners. The vendors who cannot will struggle to win frontier-lab contracts in 2026 and beyond.

Quality versus quantity: the new frontier

The Chinchilla scaling laws (Hoffmann et al., 2022) established a compute-optimal ratio of approximately 20 training tokens per model parameter. For years, this drove the industry to chase larger datasets in proportion to larger models. But Chinchilla assumed uniform data quality. The frontier of 2025 to 2026 has been the realisation that quality is not uniform, and that targeted quality improvements outperform proportional scale increases.
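
The arithmetic behind the 20-tokens-per-parameter rule is simple enough to sketch. The 70-billion-parameter size below is an illustrative input, and the compute estimate uses the common approximation of 6 FLOPs per parameter per token, which is an assumption brought in for illustration rather than a figure from this article.

```python
TOKENS_PER_PARAM = 20  # approximate Chinchilla compute-optimal ratio


def compute_optimal_tokens(n_params: int) -> int:
    """Training tokens suggested by the ~20 tokens-per-parameter rule of thumb."""
    return n_params * TOKENS_PER_PARAM


def approx_training_flops(n_params: int, n_tokens: int) -> int:
    """Rough training compute using the common 6 * N * D approximation."""
    return 6 * n_params * n_tokens


n_params = 70_000_000_000  # illustrative model size
n_tokens = compute_optimal_tokens(n_params)
flops = approx_training_flops(n_params, n_tokens)

print(f"tokens: {n_tokens:.3e}")
print(f"approx. training FLOPs: {flops:.3e}")
```

The rule treats every token as equally valuable, which is exactly the assumption the rest of this section questions.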

Microsoft's Phi model family demonstrated this most publicly. Small models trained on carefully curated, textbook-quality synthetic and refined data matched or exceeded much larger models on certain reasoning tasks. The mechanism is not magic. It is that high-quality data carries more learnable signal per token. The implications for pre-training strategy are significant: a 100-billion-token high-quality mid-training mix can shift model behaviour more than a 500-billion-token average-quality addition.

Several quality signals matter for pre-training data. Factuality can be partially scored with model-based classifiers but ultimately requires expert verification on samples. Reasoning density is the extent to which a document contains multi-step inferences worth learning. Instruction-following potential is the implicit conditional structure that helps a base model later respond to instructions. Diversity spans several axes: topical, syntactic, demographic, and dialectal. And what frontier labs increasingly call "epistemic quality" refers to the calibration and intellectual honesty of the source.

The Web Rephrasing approach pioneered in late 2024 has become standard in 2026. Lower-quality web text is rephrased by capable LLMs into higher-quality forms before being used in training. Verification still requires human spot-checking, particularly to detect cases where the rephrasing introduced hallucinated content. This is one of several stages where human-in-the-loop verification has become essential to frontier pre-training pipelines.

Build foundation models with defensible data

Toloka partners with frontier AI labs to source, filter, and validate pre-training data at scale, with full provenance and compliance documentation.

Talk to our team →


Multimodal and domain-specific pre-training data

Frontier models are now multimodal from the first pre-training step, rather than having other modalities bolted on afterwards. GPT-4V was an early signal. Claude 3 and 4, Gemini 1.5 and beyond, and the GPT-5 series are natively multimodal architectures trained on interleaved image-text and increasingly video-text data. This changes the pre-training data problem qualitatively. It is no longer enough to assemble a large text corpus. The data team must assemble a large, balanced, interleaved multimodal corpus with verified alignment between modalities. We discussed this evolution in our coverage of multimodal SFT data, though the principles increasingly apply to pre-training as well.

Interleaved image-text data, in the CLIP and LiT lineage, has matured into pre-training inputs where extended documents contain images at natural positions, with captions and surrounding text providing weak alignment supervision. Video pre-training is the current frontier, with both natural video and procedurally generated synthetic video appearing in 2025 to 2026 pre-training mixes. Audio pre-training for natively speech-capable models is following a similar trajectory.

Domain coverage at pre-training affects the fine-tuning ceiling. Teams that want a model to perform strongly in medicine, law, finance, or scientific research must ensure the relevant content is well represented during pre-training. The 90-plus domain specialisations available through expert networks like Toloka's allow targeted pre-training data assembly for these high-value verticals, where the public-internet representation is either insufficient or unreliable.

Linguistic diversity is the other dimension where pre-training data strategy decides downstream capability. The 100 most-spoken languages are increasingly well-represented in modern frontier model pre-training. The next 1,000, which make up a large share of the world's languages even though each has far fewer speakers, are not. Frontier labs targeting global deployment now invest in specialised multilingual data collection, often through expert networks of native speakers who can produce high-quality content where the internet cannot.

The human data layer in pre-training

Human-generated data was once thought to belong only in post-training. RLHF, SFT, and preference data were the human stages. Pre-training was "just the web." That picture is no longer accurate. In 2025 to 2026, frontier labs have increasingly woven human-generated data into pre-training and mid-training stages.

Several categories of human data now enter pre-training. Instruction-like sequences prime base models for later instruction-following without requiring full post-training. Reasoning chains, particularly mathematical and scientific reasoning traces written or verified by domain experts, establish reasoning patterns that pure web data does not reliably provide. Expert demonstrations of high-quality task completion are used both for training and to define the upper bound of what the model should produce. In edge case curation, experts construct deliberately rare or adversarial examples that anchor model behaviour at distribution edges. And in synthetic data verification, human reviewers spot-check model-generated content to catch hallucinations or distributional drift.

Quality assurance at frontier-model scale is non-trivial. Training corpora of 10 to 20 trillion tokens cannot be human-reviewed exhaustively. The approach instead is a tiered system: classifier-based scoring for the full corpus, statistical sampling with human verification at the document level, expert review for subsets identified as high-leverage, and continuous quality monitoring through proxy metrics during the training run. Toloka's quality loop methodology, which combines AI-assisted annotation with human expert verification, has become a common pattern for this kind of large-scale corpus quality assurance.
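
The statistical sampling tier can be sized with a standard proportion-estimate formula. The sketch below is a simplification under stated assumptions: it treats documents as independent draws, uses the normal approximation, and the function names and document IDs are hypothetical.

```python
import math
import random
from statistics import NormalDist


def sample_size(margin: float, confidence: float = 0.95, expected_rate: float = 0.5) -> int:
    """Documents to review so a defect-rate estimate lands within +/- margin.

    Uses n = z^2 * p * (1 - p) / margin^2; p = 0.5 is the worst case.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z * z * expected_rate * (1 - expected_rate) / margin ** 2)


def draw_review_sample(doc_ids: list[str], n: int, seed: int = 0) -> list[str]:
    """Draw a reproducible simple random sample of document IDs for human review."""
    rng = random.Random(seed)
    return rng.sample(doc_ids, min(n, len(doc_ids)))


n_review = sample_size(margin=0.02)            # +/- 2 points at 95% confidence
doc_ids = [f"doc-{i}" for i in range(100_000)]  # hypothetical corpus shard
review_batch = draw_review_sample(doc_ids, n_review)
print(n_review, len(review_batch))
```

The required sample does not grow with corpus size, which is what makes human verification tractable on multi-trillion-token corpora; it grows only as the tolerated margin shrinks or as estimates are needed per source or per domain.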

The new "mid-training" stage is where human data delivers the most leverage per dollar. Adding 50 to 500 billion tokens of expert-curated, high-quality content in the final pre-training stage produces measurable downstream improvements on reasoning, instruction-following, and domain-specific tasks. The economics are striking: a few million dollars of expert-curated mid-training data can shift downstream benchmarks more than ten times that spend on additional web data.

Compliance, licensing, and security in modern pre-training data

The enterprise procurement bar for pre-training data partners has risen sharply. The minimum baseline now includes SOC 2 Type II attestation, GDPR compliance with documented data subject rights handling, regional data residency options including EU-only and US-only data handling, and detailed provenance documentation for every data source. Frontier labs preparing for enterprise deployment downstream now require their data partners to meet these standards as a precondition to engagement. Toloka's full security and privacy posture is published publicly precisely because procurement teams need to verify these claims at speed.

Provenance documentation has become a deliverable in its own right. The "model-ready output" expectation now includes not just the data, but a complete content-tag-and-licence ledger describing the source, licensing terms, copyright status, jurisdictional considerations, and any restrictions on derived model use. This documentation flows downstream to enterprise customers who increasingly require it as a condition of deploying foundation models in regulated environments.
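
One way to picture such a ledger is as one structured record per data source. The sketch below is hypothetical throughout: the field names, the example values, and the JSON Lines serialisation are illustrative assumptions that mirror the items listed above, not a published schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LedgerEntry:
    """One provenance record per data source (hypothetical schema)."""
    source: str
    licence_terms: str
    copyright_status: str
    jurisdictions: tuple[str, ...]
    derived_use_restrictions: tuple[str, ...] = ()


def missing_fields(entry: LedgerEntry) -> list[str]:
    """Names of required fields that were left empty."""
    required = ("source", "licence_terms", "copyright_status", "jurisdictions")
    return [name for name in required if not getattr(entry, name)]


def to_jsonl(entries: list[LedgerEntry]) -> str:
    """Serialise entries as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in entries)


entry = LedgerEntry(
    source="example-publisher-archive",  # hypothetical source name
    licence_terms="commercial licence, model training permitted",
    copyright_status="licensed",
    jurisdictions=("EU", "US"),
    derived_use_restrictions=("no verbatim redistribution",),
)
print(missing_fields(entry))
print(to_jsonl([entry]))
```

Validating every record before a dataset ships is what lets the ledger travel downstream with the model as a reliable artefact.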

The IP and licensing question deserves particular attention. Foundation models trained on unlicensed content face two distinct risks: the legal risk to the model developer, and the contamination risk to enterprise customers who use the model commercially. Both are now standard items in enterprise AI procurement diligence. Frontier labs that can document training-data licensing comprehensively have a meaningful commercial advantage in 2026 enterprise sales, separate from any model quality advantage.

Continuity from pre-training to post-training

There is a strategic argument for using the same data partner across the pre-training and post-training stages. Quality metrics, taxonomies, expert qualifications, and toolchain integrations are non-trivial to establish. Re-establishing them with a new partner at each stage adds cost, time, and risk of subtle drift. Teams that consolidate vendors across the lifecycle (pre-training and mid-training data, SFT data, preference data for RLHF or DPO, RL environments for agent training, and evaluation data) often report faster iteration cycles and more consistent model behaviour.

The flip side of this argument is that single-vendor concentration creates supplier risk. The practical compromise that has emerged at large frontier labs is a primary partner across the lifecycle, with secondary partners for specific specialised needs. Toloka's offerings span this lifecycle, from the underlying pre-training data services through to RL environments and evaluation, which is increasingly the configuration enterprise data leaders are choosing.

Where this leaves us

The frontier of foundation model training has shifted from compute to data, and within data, from scale to quality, diversity, provenance, and verifiability. The labs that build defensible pre-training data pipelines today, with the human expert layer woven in at the right stages, with documented compliance and licensing, with the architectural flexibility to extend cleanly into post-training, will dominate the 2027 to 2030 model generation.

This is not a hopeful prediction. It is already happening. Public model release notes from the last twelve months reveal increasing transparency about pre-training data composition, source licensing, and quality assurance methodology. The labs treating this as a competitive moat are the same labs leading capability benchmarks. The relationship is not coincidental.

For organisations evaluating pre-training data partners, the criteria that matter most are demonstrated scale across all modalities, depth of domain expert coverage, full lifecycle support from pre-training through evaluation, compliance and licensing documentation, and a track record with comparable frontier customers. The next data partner you choose is the partner whose decisions will be in your model's weights for years.


Frequently asked questions

What is pre-training in LLM development?

Pre-training is the initial training phase in which a large language model learns from a large unlabeled corpus using self-supervised objectives, most commonly next-token prediction. The model develops factual knowledge, language coverage, reasoning patterns, and the latent capabilities that downstream fine-tuning later surfaces. Pre-training consumes most of the compute budget in modern LLM development, often weeks to months of training on tens of thousands of GPUs, and the data decisions made at this stage propagate through every subsequent stage of training.

How much data do you need to pre-train a large language model?

Modern frontier models use 10 to 20 trillion tokens of pre-training data, with the Chinchilla scaling laws suggesting roughly 20 tokens per parameter as compute-optimal. However, the 2025 to 2026 research consensus is that data quality matters more than absolute token count at the frontier. Smaller models trained on carefully curated, expert-verified data have matched larger models trained on greater volumes of average-quality data. The right question is not how much data, but how much high-quality, diverse, well-licensed data the team can assemble and verify.

What is the difference between pre-training and fine-tuning?

Pre-training is the foundational stage where a model learns general patterns from a large unlabeled corpus, using self-supervised objectives. Fine-tuning is any subsequent training stage on a smaller, more targeted dataset, often labeled or curated, that shapes the model toward a specific task or behaviour. Pre-training establishes what the model can know and do in principle. Fine-tuning, including supervised fine-tuning, RLHF, DPO, and the newer RLVR, surfaces and aligns these latent capabilities for deployment. A capability that is not latent in the base model is extraordinarily hard to add through fine-tuning.

Where do frontier AI labs get their training data?

Frontier lab pre-training corpora blend six main source categories: public web text (Common Crawl and refined variants), licensed books and long-form literature, code from GitHub and StackExchange, scientific and reference content from arXiv and PubMed, synthetic data generated and verified by capable models, and human-generated specialised data from expert networks. The mixture ratios are competitive parameters, but code typically constitutes 15 to 25 percent of high-performing mixes, and mid-training increasingly weights toward expert-curated reasoning and instruction data.

What role does human-generated data play in pre-training?

Human-generated data has expanded from a post-training role into the pre-training and mid-training stages of modern foundation model development. Human contributions include instruction-like sequences for priming, reasoning chains and demonstrations from domain experts, edge case curation for distributional anchoring, and synthetic data verification to catch hallucinations. The mid-training stage in particular delivers high leverage from expert-curated data, where small additions of high-quality content can shift downstream benchmarks more than much larger additions of average-quality web text.

