Pre-training in LLM Development

Toloka Team

February 22, 2024

Essential ML Guide

Natural Language Processing (NLP) has been revolutionized by the advent of pre-trained models. Most of today's best-known LLMs are pre-trained. Why is it so critical to pre-train AI models, and LLMs in particular? And what other indispensable steps should be taken to obtain effective language models that comprehend user preferences and can fulfill your business-specific tasks? This article answers both questions.

What is pre-training?

Pre-training involves training a neural network model on a large corpus of unlabeled text in a self-supervised manner (often loosely described as unsupervised), where the training signal comes from the text itself. It is the initial phase of training and a crucial step that equips an LLM with general language understanding capabilities. After pre-training, the model can be fine-tuned to accomplish the desired results.

By leveraging past experience rather than starting from scratch, language models can effectively address new tasks during fine-tuning, so the resulting model benefits from its previous training. Humans possess a similar ability to leverage prior knowledge, which allows us to avoid starting from scratch when faced with new challenges.

Pre-trained models have core knowledge and can undertake a wide range of tasks, yet they do not possess any kind of specialization. To become proficient in conversation, text generation, or creating other content on request, an LLM requires several more stages of learning.

Pre-training in LLMs

Pre-trained LLMs are not yet suitable for use in highly specialized areas since they do not have in-depth contextual knowledge in certain areas. Supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) are often called for to make a pre-trained model suitable for such a niche use.

However, if the model were not pre-trained on large-scale unlabeled text corpora collected from various sources such as books, articles, websites, social media, and other textual resources, training it for specific needs would be even more time-consuming and resource-intensive.

Fine-tuning is a supervised learning process that requires a relatively small set of labeled data. Using a model that is already pre-trained is precisely what makes such a small labeled set sufficient for additional training. If you were to take a non-pre-trained deep learning model, you would have to collect a huge dataset to train it to do what you want. Since the pre-trained model already has initial knowledge, it is easier to fine-tune it on a smaller amount of data.

Training a non-pre-trained deep learning model from scratch for a special use case would require more data, training time, and resources. Consequently, starting from a pre-trained LLM is the quickest and most cost-effective way to improve model performance.

Steps for LLM pre-training

During the pre-training stage, the model learns to predict the next word (more precisely, the next token) in a text, given the words that precede it. This is called the model's pre-training objective. A pre-trained LLM cannot yet reliably follow the instructions or answer the questions given to it. The SFT and RLHF steps are necessary to adapt it to a real-world AI application, for example, a chatbot. As already mentioned, pre-training helps to complete these steps faster and at a lower cost. Here is a step-by-step breakdown of how it can be implemented.
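As a brief aside before the individual steps, the next-word objective can be illustrated with a minimal Python sketch. It scores a toy "model" by the average negative log-likelihood it assigns to each next word. The probability table `toy_model` and the function `next_token_loss` are hypothetical names invented for this illustration. The toy model also looks only at the single previous word, whereas a real LLM conditions on all preceding tokens and computes the probabilities with a neural network.

```python
import math

# Hypothetical toy "model": for each previous word, a probability
# distribution over the next word. A real LLM computes this with a network.
toy_model = {
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
}

def next_token_loss(tokens, model):
    """Average negative log-likelihood of each word given the previous one."""
    losses = []
    for prev, nxt in zip(tokens, tokens[1:]):
        prob = model.get(prev, {}).get(nxt, 1e-9)  # tiny floor avoids log(0)
        losses.append(-math.log(prob))
    return sum(losses) / len(losses)

loss = next_token_loss(["the", "cat", "sat"], toy_model)
print(f"average next-word loss: {loss:.4f}")
```

The lower this loss, the more probability the model assigns to the words that actually follow, which is what pre-training optimizes for.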

Data Collection

Data scientists gather a large and diverse corpus of text data from various sources such as books, articles, websites, and social media. The diversity of the data helps the model learn a wide range of language patterns and concepts.

Cleaning

Text data often contains noise, such as special characters or non-textual elements. Cleaning involves removing or replacing any non-textual elements and duplicated text samples from the raw data to ensure that it is consistent and meaningful. Simple rule-based scripts, AI models trained to automatically identify and clean data, and a final human review can all be used for general text cleaning.
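A script-based cleaning pass might look like the following sketch. It assumes that the noise consists of HTML-like tags, control characters, and irregular whitespace, and that duplicates are exact matches after cleaning. The helper names `clean_text` and `deduplicate` are hypothetical, and production pipelines typically add fuzzier duplicate detection and quality filters.

```python
import re

def clean_text(raw):
    """Strip HTML-like tags and control characters, then normalize whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)                    # drop HTML-like tags
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", " ", text)  # drop control characters
    return re.sub(r"\s+", " ", text).strip()               # collapse whitespace

def deduplicate(samples):
    """Clean each sample and keep the first occurrence, preserving order."""
    seen = set()
    unique = []
    for sample in samples:
        cleaned = clean_text(sample)
        if cleaned and cleaned not in seen:  # skip empty and repeated samples
            seen.add(cleaned)
            unique.append(cleaned)
    return unique

docs = ["<p>Hello,   world!</p>", "Hello, world!", "\x00", "Another sample."]
cleaned_docs = deduplicate(docs)
print(cleaned_docs)
```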

Tokenization

Data scientists and NLP engineers tokenize the text data into smaller units such as words, subwords, or characters to form the input sequences for the machine learning model. Word tokenization is the most intuitive form, where the text is split into individual words based on whitespace or punctuation boundaries. Modern LLMs, however, typically rely on subword tokenization, such as byte-pair encoding (BPE), which keeps the vocabulary manageable while still covering rare words.
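The word-level variant is simple enough to sketch with a regular expression. The example below splits on whitespace and punctuation boundaries and then maps each token to an integer id. The function names `word_tokenize` and `build_vocab` are hypothetical, and subword schemes are considerably more involved than this.

```python
import re

def word_tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def build_vocab(tokens):
    """Map each distinct token to an integer id in order of first appearance."""
    vocab = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

tokens = word_tokenize("The cat sat. The cat slept!")
vocab = build_vocab(tokens)
token_ids = [vocab[tok] for tok in tokens]
print(tokens)
print(token_ids)
```

The integer ids, not the raw strings, are what the model receives as input.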

Architecture Selection

A transformer-based architecture is often chosen for the model, as it works well with sequential data such as text. Transformers have proven to be highly effective for natural language processing tasks due to their attention mechanism, which allows the model to weigh the importance of each word/token in the input sequence when computing representations. This enables the model to capture dependencies between distant words more effectively than earlier architectures such as recurrent networks.
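The weighting idea behind attention can be sketched in plain Python as scaled dot-product attention. This is a simplified illustration under stated assumptions: the input vectors in `x` are made up, the same vectors serve as queries, keys, and values, and the learned projections, multiple heads, and causal masking used in real transformers are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)  # importance of each token for this query
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Three hypothetical 2-dimensional token vectors, used as Q, K, and V alike.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(x, x, x)
for row in out:
    print([round(c, 3) for c in row])
```

Each output row is a mixture of all token vectors, with larger weights going to tokens that are more similar to the query, regardless of how far apart they are in the sequence.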

Pre-training process

Machine learning engineers train the LLM according to its pre-training objective using the tokenized and preprocessed data. This step involves feeding a large dataset through the model so that it learns to comprehend and generate human-like sequences of text.
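One way to picture how tokenized data is fed to the model is as a stream of (context, next token) pairs cut from the token sequence. The sketch below assumes a fixed-size sliding window. The function name `make_training_pairs` and the sample ids are hypothetical, and real training pipelines batch such examples and process many positions in parallel.

```python
def make_training_pairs(token_ids, context_size):
    """Slide a window over token ids, pairing each context with the next id."""
    pairs = []
    for i in range(len(token_ids) - context_size):
        context = token_ids[i:i + context_size]
        target = token_ids[i + context_size]
        pairs.append((context, target))
    return pairs

# Hypothetical token ids produced by an earlier tokenization step.
pairs = make_training_pairs([0, 1, 2, 3, 0, 1], context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

Because the targets come from the text itself, no manual labeling is needed at this stage.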

After pre-training

Once all the stages of pre-training are complete, the pre-trained model is ready to be fine-tuned through the SFT or RLHF phase. There are certain differences between these approaches, but both require broad datasets of preferences that can be used to teach the model which answers are preferable in a given context. Unlike pre-training data, these preference datasets are smaller yet harder to obtain. Read our article on SFT to learn more, or simply get in touch if you need a high-quality dataset for RLHF or SFT.

The performance of the pre-trained and fine-tuned model is evaluated on various benchmarks and tasks to assess its generalization ability and effectiveness in understanding and generating human-like text. This so-called continuous model evaluation also incorporates human assessment of model quality.

The Importance of Pre-Training

Pre-training matters in machine learning for several reasons. Here are the key ones:

Transfer learning

The availability of pre-trained models enables a technique known as transfer learning, where a model trained on one task or dataset is employed to improve performance on a related task or dataset. Fine-tuning is a type of transfer learning that involves updating the parameters of the entire pre-trained model. Instead of training a model from scratch, transfer learning allows knowledge from pre-trained models to be reused, resulting in faster training and better performance on target tasks.

Data Efficiency and Lower Training Cost

Pre-training enables models to leverage large amounts of unlabeled data, which is often more abundant and accessible than labeled data. This reduces the need for extensive labeled data when training models on target tasks, making it feasible to train effective models even with limited labeled data. Such an approach lowers the overall annotation costs associated with training machine learning models.

Easy Customization for Specific Tasks

Pre-training provides a flexible starting point for adapting models to address specific tasks or domains. By fine-tuning pre-trained models on domain-relevant datasets or objectives, ML practitioners can tailor the output to the nuances and requirements of the target application.

This process can happen more seamlessly than if one had to train a model for a narrow specialization from scratch. Fine-tuning and subsequent RLHF, which helps improve the model's output quality through human feedback, lead to improved performance and relevance.

Numerous Use Cases

Pre-trained models serve as the basis for many AI applications, particularly in the field of NLP. Thanks to pre-training, LLMs can be adjusted with fine-tuning and/or RLHF to:

determine the sentiment expressed in texts;
translate text from one language into several others;
identify and classify named entities;
produce realistic and diverse texts in various styles and genres;
generate content for chatbots and conversational agents.

Pre-Training: An Important Step of ML Training That Cannot Exist On Its Own

However effective and valuable a pre-trained model may be, it cannot serve as an accomplished LLM because it lacks the necessary properties that fine-tuning and RLHF provide. In addition, another vital step for assessing LLM output, called continuous model evaluation, is introduced once all of the training stages are complete.

So, if you need to develop a truly customized and high-performing model, pre-training alone is not enough. You need to fine-tune your LLM as well and introduce other steps such as RLHF and model output evaluation. By incorporating these stages into the training pipeline, developers can create customized language models that meet the specific requirements and objectives of their applications.

These additional steps complement pre-training and ensure that the model not only possesses the necessary knowledge but also adapts to changing conditions, learns from human feedback, and maintains high performance over time.
