Supervised Fine-Tuning: How to Customize Your LLM?

by Toloka Team

Supervised fine-tuning (SFT) is probably the best-known technique for adapting large language models (LLMs) to well-defined natural language processing (NLP) tasks. Fine-tuning starts from a pre-trained model, that is, one that has already learned a great deal from a wide range of texts.

But can a model handle diverse types of assignments after pre-training alone? To a degree, yes. However, without refinement through SFT it is usually not yet truly helpful, capable of executing the required actions, or proficient in a particular area of knowledge.

What is supervised fine-tuning?

Supervised fine-tuning (SFT) is a standard practice in machine learning, particularly in the context of transfer learning with pre-trained models. At its core, fine-tuning takes the broad language understanding gained during pre-training and molds it to the application at hand. The approach is particularly valuable for large language models with billions of parameters, where retraining the whole model from scratch on the entire dataset is computationally prohibitive.

Fine-tuning acts as a supplemental training stage for a model. It is part of the transfer learning paradigm in NLP: knowledge gained on the original task during pre-training is transferred to the target task through fine-tuning. This paradigm has led to significant advances in NLP, allowing models to generalize better across tasks and domains.

Fine-tuning has yielded strong results across a variety of NLP tasks, including text classification, opinion mining for sentiment analysis, machine translation, and language generation, as well as speech processing. These results have demonstrated how effective it is to adapt pre-trained models for task-specific purposes.

Supervised fine-tuning has become standard practice in NLP model development pipelines. Taking a previously trained model and adapting it to a target objective with relatively little labeled data is a practical way to build models for particular applications. The widespread success of fine-tuning has transformed the landscape of natural language processing and made it a cornerstone of developing high-performing language models.

Data for SFT

Unlike unsupervised methods, where data is not verified in advance or labeled, SFT explicitly applies labeled data to guide the model adjustment process. Such supervision gives the model clear feedback during training, which leads to more focused learning and better model performance on the task at hand.

The pre-trained LLM is customized to a specific task through labeled data, where both the input examples and their corresponding correct outputs are presented. Such labeled data enables the model to learn patterns and relationships specific to the concrete objective during the fine-tuning process.

The labeled data employed for pre-trained language model fine-tuning is usually pre-validated, meaning that it is thoroughly reviewed or annotated to ensure its quality and relevance to the target task. A validation process such as this ensures that the fine-tuning process is successful and that the model assimilates meaningful representations.
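As a rough illustration of what such labeled, pre-validated data can look like, the sketch below stores input-output pairs and applies two very simple checks. The `LabeledExample` class, the `validate` helper, and the sample texts are hypothetical and chosen only for this example; real validation pipelines usually involve human review as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledExample:
    """One supervised example: an input and its reference output."""
    prompt: str
    response: str


def validate(examples):
    """Keep examples with non-empty fields, dropping exact duplicates."""
    seen = set()
    clean = []
    for ex in examples:
        key = (ex.prompt.strip(), ex.response.strip())
        if not key[0] or not key[1] or key in seen:
            continue
        seen.add(key)
        clean.append(ex)
    return clean


# Hypothetical raw data with one duplicate and one missing label.
raw = [
    LabeledExample("Classify sentiment: 'Great battery life.'", "positive"),
    LabeledExample("Classify sentiment: 'Great battery life.'", "positive"),
    LabeledExample("Classify sentiment: 'Screen cracked in a week.'", ""),
    LabeledExample("Classify sentiment: 'Stopped working after a day.'", "negative"),
]

dataset = validate(raw)
print(f"kept {len(dataset)} of {len(raw)} examples")
```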

Why do you need SFT if you are building an LLM?

Since pre-trained language models already have extensive knowledge, why not use one as-is, without fine-tuning? These models have indeed seen a lot of data, but they are simply not customized for specific purposes.

They are like high school students who have just finished school and are about to go to university. Fine-tuning methods act as a university where the LLM gets its specialization. This approach customizes the pre-trained language model to fit the demands of a special task by training it on labeled data specially gathered for this occasion.

Supervised fine-tuning typically improves large language model performance on downstream tasks compared to applying a pre-trained model directly. This happens because the model's parameters, also called weights, are updated during fine-tuning. The domain-specific labeled data used in the process adjusts the model and further refines its predictions. However, getting such data is tricky: you need access to domain experts and an elaborate system of quality controls.

Since the pre-trained LLMs already capture a significant amount of linguistic knowledge, fine-tuning allows the model to leverage this knowledge and adapt it to the desired applications with minimal additional training.

Benefits of SFT

Efficiency with a relatively small amount of data

Fine-tuning allows the model to quickly adapt to the target task with relatively small amounts of labeled data, thereby addressing data scarcity issues common in many real-world applications. Effective learning from limited amounts of task-specific labeled data is possible due to the pre-existing knowledge in the pre-trained model.

Versatility and flexibility to fit any specific task

Fine-tuning is an agile way to customize the same pre-trained LLM for multiple applications in different domains. Such versatility enables it to meet a wide range of requirements for NLP applications, where a single model can fulfill multiple tasks. By fine-tuning with data specific to the application, the model can adeptly master each task without requiring separate models for each.

Improved model performance

Using a pre-trained model as-is usually doesn't give good results for a specific purpose. The output won’t be ideal even though the response may be logically valid. For example, a help desk response may sound harsh coming from a bare pre-trained model. To make it sound like an actual tech support employee's response, introductory phrases and polite remarks have to be added to the reply.

This is exactly the sort of issue that fine-tuning can help with. Through fine-tuning with data labeled for a particular application, the model can learn task-specific patterns and features that are critical for accurate prediction of the underlying outcome. Thus, a model can capture subtleties and interconnections specific to a particular objective that are not found in the original pre-trained model.


Cost-effectiveness

By reusing pre-trained models and needing fewer labeled examples, supervised fine-tuning can significantly reduce the time and computational resources required for data annotation and model training, resulting in a more cost-effective approach.

Fine-tuning a pre-trained model commonly demands less compute than training from scratch, because the pre-trained model is already capable of extracting pertinent features from its input. Fine-tuning mainly adjusts these features to the specifics of the new objective. As a result, the computational cost, including memory usage and training time, is usually lower.

Supervised fine-tuning step-by-step

  • Task definition and model selection. At the initial stage of SFT, data scientists define the task that the LLM should perform, such as conversational tasks like question answering, or text processing tasks like classification and translation.
    Then they choose a pre-trained model that suits all of the tasks they want it to fulfill. It should have been trained on a large dataset, ideally on tasks similar to the ones they are targeting. Popular choices include models such as BERT, GPT, and Llama;

  • Data preparation. Next, they gather a dataset that is relevant to the defined tasks. The dataset should be labeled, as supervised fine-tuning requires labeled data (input-output pairs) to provide feedback to the model during training;

  • Dataset tokenization. The text inputs are tokenized, that is, split into smaller units (tokens) and converted into numeric IDs that the model can process;

  • Fine-tuning the language model. The model is fine-tuned on the labeled dataset using supervised learning. During training, the model's parameters are updated through backpropagation, based on the gradient of a loss function computed between the model's predictions and the ground truth labels. A lower loss means that the model's outputs are closer to the training data.
    Fine-tuning also involves basic hyperparameter tuning: adjusting settings such as learning rate, batch size, and regularization strength to achieve optimal model performance. It's crucial to evaluate the model's performance on a validation set during hyperparameter tuning to avoid overfitting;

  • Evaluation of the fine-tuned LLM. Once the additional training is complete, the fine-tuned model is evaluated on a test set to assess its performance on unseen data. The test set should represent the data the model will encounter in real-world applications. Selecting appropriate evaluation metrics is a challenging task in itself. Toloka offers Deep Evaluation solutions that might help with your SFT pipelines;

  • Model deployment. Once the data scientists are satisfied with the model's performance, they integrate the fine-tuned LLM into their software infrastructure or application ecosystem. After implementation, testing and improvement of the model should continue. Specialists ensure proper monitoring and maintenance to keep the model up-to-date and performing well.
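The tokenization, training, and validation steps above can be sketched on a toy scale. The example below is not an LLM: it assumes a tiny bag-of-tokens logistic classifier, a whitespace tokenizer, a hand-written vocabulary, and all-zero starting weights standing in for a pre-trained model. All names and data are hypothetical. What it does share with real SFT is the loop itself: compute predictions, measure a loss against the labels, update the weights along the gradient, and check the loss on held-out validation data.

```python
import math


def tokenize(text, vocab):
    """Map lowercase whitespace-separated words to ids; unknown words are skipped."""
    return [vocab[w] for w in text.lower().split() if w in vocab]


def predict(weights, bias, token_ids):
    """Probability of the positive class from a bag-of-tokens logistic model."""
    z = bias + sum(weights[i] for i in token_ids)
    return 1.0 / (1.0 + math.exp(-z))


def mean_loss(weights, bias, data):
    """Average binary cross-entropy over (token_ids, label) pairs."""
    eps = 1e-12
    total = 0.0
    for ids, y in data:
        p = predict(weights, bias, ids)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(data)


def fine_tune(weights, bias, train_data, lr=0.5, epochs=50):
    """Stochastic gradient descent on the labeled data, starting from given weights."""
    weights = list(weights)  # copy so the starting weights stay untouched
    for _ in range(epochs):
        for ids, y in train_data:
            grad = predict(weights, bias, ids) - y  # d(loss)/dz for sigmoid + cross-entropy
            for i in ids:
                weights[i] -= lr * grad
            bias -= lr * grad
    return weights, bias


# Hypothetical vocabulary and labeled sentiment data (1 = positive, 0 = negative).
vocab = {"great": 0, "love": 1, "broken": 2, "refund": 3, "fast": 4, "slow": 5}
train = [(tokenize(t, vocab), y) for t, y in [
    ("great fast", 1), ("love great", 1), ("broken slow", 0), ("refund broken", 0)]]
val = [(tokenize(t, vocab), y) for t, y in [("love fast", 1), ("slow refund", 0)]]

# Stand-in for pre-trained weights (an assumption made for this toy example).
pretrained_w, pretrained_b = [0.0] * len(vocab), 0.0

val_before = mean_loss(pretrained_w, pretrained_b, val)
tuned_w, tuned_b = fine_tune(pretrained_w, pretrained_b, train)
val_after = mean_loss(tuned_w, tuned_b, val)
print(f"validation loss before: {val_before:.3f}, after: {val_after:.3f}")
```

In a real pipeline the learning rate and number of epochs would be chosen by comparing validation loss across several runs rather than fixed in advance.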

Types of supervised fine-tuning

To understand the types of supervised fine-tuning, recall that an LLM is a deep neural network made up of many layers. Every layer contains numbers called parameters, or the weights of the model. There are a huge number of such parameters, and within each layer they are organized into tables of numbers, or matrices.

There are several types of supervised fine-tuning, differentiated by how many parameters are modified during the learning process.

Full fine-tuning

In full fine-tuning, all of the model's parameters are updated using labeled data. In a way, this kind of SFT is similar to pre-training but with less data. Since all model weights are subject to modification, and there may be several billions of them, this approach requires huge computational power.

What are the pros of full fine-tuning? The model can adjust features and representations across all layers of the architecture. That gives maximum flexibility in adapting a large language model to your specific task. Full fine-tuning often yields significant performance improvements over using the pre-trained model as-is or with more constrained fine-tuning approaches.

Parameter-efficient fine-tuning (PEFT)

PEFT, another method for fine-tuning, leaves the model's fundamental language understanding largely intact, because only a small portion of the parameters is trained. The key idea is to add task-specific layers or adapters to an already trained large language model and train only those on the downstream task dataset, while the pre-trained model's weights are kept frozen. The main advantage of PEFT is that it substantially reduces the computational cost compared to full fine-tuning while still achieving competitive performance.
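To make the difference in cost concrete, here is a minimal sketch that compares the number of trainable parameters for one weight matrix under full fine-tuning and under a low-rank adapter. The low-rank form (a frozen matrix W plus a trainable product of two thin matrices B and A, as in LoRA) is one common adapter style and is an assumption here, since the article does not name a specific PEFT method. The layer size and rank are also illustrative.

```python
def full_finetune_params(d_in, d_out):
    """Every entry of the d_out x d_in weight matrix is trainable."""
    return d_in * d_out


def low_rank_adapter_params(d_in, d_out, rank):
    """Only A (rank x d_in) and B (d_out x rank) are trainable; W stays frozen."""
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def adapted_forward(W, A, B, x):
    """y = W x + B (A x): frozen base output plus the adapter's correction."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]


# Illustrative sizes for a single layer (assumed, not taken from any real model).
d_in, d_out, rank = 4096, 4096, 8
full = full_finetune_params(d_in, d_out)
adapter = low_rank_adapter_params(d_in, d_out, rank)
print(f"full: {full:,} trainable, adapter: {adapter:,} trainable "
      f"({100 * adapter / full:.2f}% of full)")

# Tiny forward pass: with B initialized to zeros, the adapter changes nothing yet.
W = [[1.0, 2.0], [3.0, 4.0]]   # frozen pre-trained weights
A = [[0.5, -0.5]]              # trainable, rank 1
B = [[0.0], [0.0]]             # trainable, starts at zero
x = [1.0, 1.0]
print(adapted_forward(W, A, B, x))
```

Starting B at zero is a common choice because the adapted model then begins training with exactly the pre-trained model's behavior.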

Instruction fine-tuning

The key idea behind instruction fine-tuning is to provide the model with labeled examples that convey the desired behavior or response. These examples serve as instructions for the model, guiding it to generate appropriate outputs based on the provided input data or query.

During fine-tuning, the model is trained on this instructional dataset, learning to generate responses that align with the provided instructions. By exposing the model to a diverse range of instruction-response pairs and fine-tuning its parameters accordingly, instruction fine-tuning helps improve the model's ability to understand and follow specific instructions.
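A minimal sketch of how an instruction-response pair might be turned into a training example follows. The prompt template is hypothetical, whitespace splitting stands in for a real tokenizer, and the loss mask (training only on response tokens, not on the instruction) is a common practice that is assumed here rather than described in the article.

```python
# Hypothetical template; real projects define their own format.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_example(instruction, response):
    """Return (tokens, loss_mask) for one instruction-response pair.

    loss_mask is 0 for prompt tokens and 1 for response tokens, so that
    only the response contributes to the training loss.
    """
    prompt = PROMPT_TEMPLATE.format(instruction=instruction)
    prompt_tokens = prompt.split()      # stand-in for a real tokenizer
    response_tokens = response.split()
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask


tokens, loss_mask = build_example(
    "Summarize the ticket in one sentence.",
    "The customer cannot log in after resetting the password.",
)
trained_on = [t for t, m in zip(tokens, loss_mask) if m == 1]
print(" ".join(trained_on))
```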

Fine-tuning vs. RAG

Supervised fine-tuning shares some common features with Retrieval Augmented Generation (RAG). Both approaches are applicable for specialized customization of language models for certain applications. However, the methods of implementation are different.

RAG focuses on combining retrieval and generation techniques to incorporate external knowledge or context into generative language models, while SFT concentrates on adapting pre-trained language models to specific use cases.

RAG allows the model to retrieve information from a predefined knowledge base and add it to the language model prompt. It works as follows:

  1. The user's prompt is sent to an embedding model, where it is converted into a numeric form (a vector);

  2. This vector is then compared with the vectors of the entries in the knowledge base;

  3. The retrieval step locates the most relevant data;

  4. The retrieved information is integrated into the prompt for the large language model as additional, specific context;

  5. The LLM generates a response based on both the original prompt and the retrieved context;

  6. The output, which draws on both the retrieved information and the original prompt, is returned to the user.
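The retrieval part of this flow can be sketched with a toy example. The code below assumes word counts as a stand-in for a real embedding model, cosine similarity for the comparison, and a hypothetical three-entry knowledge base. The final call to a language model is left out, so the sketch ends at the augmented prompt.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts. A real system would call an embedding model."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, knowledge_base, top_k=1):
    """Return the top_k entries most similar to the query."""
    q = embed(query)
    ranked = sorted(knowledge_base, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, passages):
    """Add the retrieved passages to the prompt as extra context."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical knowledge base entries.
knowledge_base = [
    "Refunds are processed within five business days.",
    "The mobile app supports offline mode on Android and iOS.",
    "Support is available by chat from 9 to 17 on weekdays.",
]

query = "How long do refunds take?"
passages = retrieve(query, knowledge_base)
prompt = build_prompt(query, passages)
print(prompt)  # this augmented prompt would then be passed to the LLM
```

Note that no model weights change anywhere in this flow: updating the knowledge base is enough to change what the model is shown.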

In SFT, the pre-trained model's parameters are adjusted based on the computed loss. On the other hand, in Retrieval Augmented Generation (RAG), the model's parameters remain unchanged during the retrieval and generation process. Also, RAG relies on additional external information, whereas SFT employs specially selected labeled data.

RAG is most useful in scenarios where information changes frequently. It is ideal for applications that access external sources, such as databases, records, and other data repositories, to improve model answers. While SFT provides deep alignment to specific styles or areas of knowledge, RAG is primarily focused on information retrieval.

It is hard to determine which approach is best for customizing large language models for specific tasks. In some cases, combining them for optimal results may be a good idea. Here are some considerations that may help you with your choice of approach.

SFT advantages

  • SFT is effective for tasks where labeled data is available and where direct control over model behavior is desired;

  • It allows for precise customization of the model's parameters to the target task, resulting in fine-grained adjustments tailored to specific task requirements;

  • SFT allows the same pre-trained model to be fine-tuned for different tasks.

Points in favor of RAG

  • RAG does not require labeled data, so it’s more fitting for cases where labeled data is insufficient or unavailable;

  • RAG is beneficial for tasks that require incorporating external knowledge or context;

  • It is particularly useful for cases where data is updated very frequently.

How do we evaluate the effectiveness of supervised fine-tuning?

Evaluating supervised fine-tuning involves assessing the performance of the fine-tuned model on never-before-seen data. Common metrics include accuracy, precision, recall, F1-score, mean squared error (MSE), and BLEU score. The choice of metrics depends on whether the task is classification, regression, machine translation, or something else.

Before selecting evaluation methods and metrics, ensure a clear understanding of the task you're fine-tuning the model for. Here's a brief explanation of each metric:

  • Accuracy measures the proportion of correctly classified instances out of the total instances in the dataset;

  • Precision denotes the model's ability to avoid false positive predictions. A high precision value means that the model makes fewer false positive errors;

  • Recall points to the model's ability to capture all positive instances. A high recall value means that the model makes fewer false negative errors;

  • The F1 Score is the harmonic mean of precision and recall. It provides a balance between precision and recall.
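The four metrics above can be computed directly from true and predicted labels. The sketch below assumes binary classification with 1 as the positive class, and the label lists are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0   # fewer false positives -> higher
    recall = tp / (tp + fn) if tp + fn else 0.0      # fewer false negatives -> higher
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```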

Besides evaluation metrics, other techniques like human evaluation combined with an automated approach are used for model assessment. For example, the deep evaluation method proposed by Toloka includes such an approach. After analyzing the scenarios of model use, Toloka selects the necessary metrics to evaluate the model's efficiency. The subsequent evaluation pipeline includes human reviews in addition to the use of AI algorithms.


Customizing large language models is a vital step in maximizing their utility and effectiveness. Pre-trained large language models, despite their strong general language understanding, may not be tailored to the subtleties of a particular task. Their true potential is unlocked through customization techniques such as supervised fine-tuning.

SFT is a training process that teaches the LLM to perform a particular task. The model learns the features of the target application and improves its performance. SFT comes in particularly handy when labeled training data is in short supply. It doesn’t require much data because the model already has fundamental knowledge acquired during pre-training. The system just needs a little push to perform certain activities better.

Pre-trained large language models do not just perform better after supervised fine-tuning; they also become more convenient and usable. By fine-tuning on task-specific data, pre-trained LLMs can achieve higher performance than the generic model, making them more effective and efficient for a wide range of NLP applications.
