Toloka Team

Feb 12, 2024

Essential ML Guide

Direct Preference Optimization (DPO): A Lightweight Counterpart to RLHF

There is a standard approach to customizing pre-trained large language models. Most commonly, it is a combination of two steps: supervised fine-tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF). Recently, however, RLHF has acquired a rival: Direct Preference Optimization (DPO), an approach that is easier to implement. Below, we take a closer look at what DPO is and discuss how it can replace RLHF.

What is DPO?

In the 2023 paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model", the authors propose a method called Direct Preference Optimization for effectively controlling large-scale unsupervised language models (LLMs) without relying on more complex approaches such as Reinforcement Learning from Human Feedback (RLHF).

DPO is a paradigm in artificial intelligence and machine learning that optimizes language models directly on human preferences. The most common and traditional method of aligning large language models with preference data is RLHF. This newer optimization approach, however, offers a faster and more efficient way to tune and train a language model to produce the preferred answers.

How Is DPO Better Than RLHF?

Standard LLM alignment process

To see which aspects make DPO better than RLHF, you first need to understand the standard sequence of LLM alignment. A typical alignment process consists of the following steps.

SFT with a Labeled Dataset

LLMs already trained on a huge general-purpose dataset undergo supervised fine-tuning (SFT) to become aligned with human expectations. In this phase, the model learns from a labeled dataset. After SFT, the model can, for example, answer questions in a certain style.

SFT requires fine-tuning datasets, normally consisting of prompts and completions. It is possible to find such datasets off-the-shelf, but in many cases it is preferable to build a separate, unique dataset of business-specific inputs. Services such as Toloka are often called on to help with such tasks.
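
To make the mechanics concrete, here is a minimal sketch of an SFT training step with the Hugging Face transformers library. The model checkpoint and the tiny sft_pairs dataset are illustrative placeholders, not a prescribed setup.

```python
# Minimal SFT sketch (illustrative): fine-tune a causal LM on prompt+completion
# pairs with the standard token-level cross-entropy objective.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM checkpoint works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

sft_pairs = [  # placeholder labeled data: prompt plus the desired completion
    {"prompt": "Summarize: The cat sat on the mat.",
     "completion": " A cat was sitting on a mat."},
]

model.train()
for pair in sft_pairs:
    text = pair["prompt"] + pair["completion"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Passing input_ids as labels makes the model compute the causal LM
    # cross-entropy loss; in practice prompt tokens are often masked out.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```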

RLHF to Improve SFT Model Results

SFT results are improved through RLHF to make the model's output more versatile and natural, which means more aligned with human requirements. In RLHF, the environment that gives the system feedback on how to perform its task better is represented by humans: they evaluate the model's behavior by rating good and bad responses. The agent, meaning the language model, aims to learn a policy, or behavior, that maximizes the cumulative reward obtained from human feedback. This process usually involves two main steps:

  1. Reward Model Training. The first step involves training a reward model (RM) that can predict the rewards or preferences expressed in human preference data. This model learns to map the agent's actions or states to the corresponding human feedback signals.

Dedicated human experts evaluate the model's responses to selected prompts to create preference data. Each record includes the prompt along with a chosen and a rejected response. The RM then learns to score the model's responses based on this information. Once it is trained, it guides the LLM through its decision-making process.

  2. Policy Optimization. Based on the predicted rewards or preferences from the RM, the agent's policy is updated using a reinforcement learning algorithm, such as Proximal Policy Optimization (PPO) or other optimization techniques. The objective is to adjust the policy in a way that increases the likelihood of selecting actions that are expected to lead to higher rewards or preferences according to the RM.

The process of collecting feedback, training the reward model, and updating the policy is repeated iteratively. With each iteration, the agent's policy becomes more aligned with human preferences, leading to improved performance and higher cumulative rewards over time.
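
As a rough sketch of the first step, the reward model is typically trained with a pairwise ranking loss that pushes the score of the chosen response above the score of the rejected one. The reward_model callable below is a placeholder for any network that maps a prompt and a response to a scalar score tensor.

```python
# Sketch of the pairwise (Bradley-Terry style) ranking loss commonly used
# to train a reward model on chosen/rejected response pairs.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_model, prompts, chosen, rejected):
    r_chosen = reward_model(prompts, chosen)      # scores for preferred replies
    r_rejected = reward_model(prompts, rejected)  # scores for rejected replies
    # Maximize the margin between chosen and rejected scores.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

The trained RM then supplies the scalar rewards that the PPO-style update in the second step tries to maximize.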

DPO Pipeline

When applying DPO, the reward model is no longer required. Before fine-tuning with DPO, the model still goes through the SFT phase. After that, a preference dataset is collected; a sketch of its typical shape is shown below. Toloka can help you gather high-quality datasets both for SFT and for DPO. If you want to know more, talk to us.
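
For illustration, a DPO preference dataset is usually just a collection of records that pair one prompt with a chosen and a rejected response; the field names and examples below are placeholders.

```python
# Illustrative shape of a preference dataset used for DPO fine-tuning.
preference_data = [
    {
        "prompt": "Explain photosynthesis in one sentence.",
        "chosen": "Photosynthesis is the process by which plants use light, "
                  "water, and carbon dioxide to produce sugars and oxygen.",
        "rejected": "It's when plants eat sunlight.",
    },
]
```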

DPO treats the policy optimization task as a binary classification problem, employing a binary cross-entropy objective. In other words, it chooses the better of two answers based on the preference data, thereby directly updating the model's policy to improve its results.

In the fine-tuning process, a frozen copy of the LLM serves as a reference whose responses are compared with those of the updating version. The updating version of the LLM, meanwhile, is guided by the preference data to improve itself. The differences obtained by comparing the estimates from these two models are used to adjust the model weights.
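
Putting these pieces together, the DPO objective is a binary cross-entropy over the log-probability ratios between the updating (policy) model and the frozen reference model. The sketch below assumes you have already computed the summed log-probability of each response under both models; it is an illustration of the objective, not a production implementation.

```python
# Sketch of the DPO loss: binary cross-entropy over policy/reference
# log-probability ratios for chosen vs. rejected responses.
import torch
import torch.nn.functional as F

def dpo_loss(logp_policy_chosen, logp_policy_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    # Log-ratios of the trainable policy vs. the frozen reference model.
    chosen_ratio = logp_policy_chosen - logp_ref_chosen
    rejected_ratio = logp_policy_rejected - logp_ref_rejected
    # Push the chosen ratio above the rejected one; beta controls how far
    # the policy is allowed to drift away from the reference model.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Libraries such as Hugging Face TRL ship a ready-made DPO trainer built around this objective, so in practice the loss is rarely written by hand.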

DPO fulfills the same function as RLHF, and according to the results of several experiments, it can align the LLM just as effectively, or even better, enhancing the quality of model outputs on tasks including dialogue, summarization, sentiment modulation, and more. DPO is a more streamlined approach to training language models from human-defined criteria.

Benefits of DPO

Simplicity. In the RLHF method, a significant portion of the complexity arises from training a reward model (RM). The DPO approach eliminates the need for a separate RM, whose training can be complex and unstable.

Stability. The RLHF approach heavily relies on the accuracy of the trained RM. However, these models can be prone to errors, biases, or inaccuracies, leading to flawed behavior of the language model. Since DPO bypasses the RM entirely, it is less prone to errors in adjusting to human preferences. By directly optimizing the policy based on preference data, DPO offers a more straightforward and stable solution.

Efficiency. DPO is computationally efficient compared to RLHF methods. Since it does not require training an RM or sampling from the language model during optimization, DPO can achieve faster convergence and lower computational overhead.

Bias mitigation. DPO directly leverages human preference data to guide the optimization process. By explicitly incorporating human preferences into the optimization, DPO helps ensure that models learn to prioritize outcomes that are desirable to humans. This reduces the risk of unintended biases in the model's behavior by aligning it more closely with human values and preferences.

DPO Implementation Challenges

Implementing Direct Preference Optimization (DPO) effectively poses several key challenges. It is worth realizing that, just as with RLHF, overfitting may occur when using DPO: the resulting model performs well on examples from the training sample but does a poor job on examples that were not in it.

If the training dataset only covers a limited range of preferences and scenarios, or predominantly contains preferences from a specific user group, DPO may cause the model to overfit to those specific pieces of data or to inadvertently learn to prioritize preferences representative of that group while neglecting others. One of the simplest ways to prevent overfitting is to collect diverse, high-quality data. Toloka has wide experience in collecting diverse datasets across multiple domains of expertise and multiple languages. Get in touch if you want to learn more.

In other words, the model may fail to generalize to unseen information as a result of DPO. Ensuring diversity in the training dataset by collecting data from a wide range of sources, demographic groups, and cultural backgrounds can help mitigate the risk of overfitting to specific preferences.

Moreover, the data used in DPO may contain implicit biases or assumptions that influence the learned preferences. That's why performing thorough analysis and validation of the training dataset to identify and mitigate biases is essential.

It is imperative to ensure that the dataset provided by humans is accurate. One way of ensuring the validity of human feedback is to give human reviewers clear rules and criteria for evaluating model responses at the very beginning of the evaluation process.

DPO algorithms may also raise ethical and privacy concerns related to the collection, storage, and use of human preferences. Protecting user privacy, obtaining informed consent, and reducing potential harm associated with collecting preference data are critical considerations when implementing DPO.

A Lightweight and Efficient Approach to AI Fine-Tuning

Like RLHF, DPO has its limitations, but its advantages are substantial. The algorithm relieves LLM alignment specialists from extensive hyperparameter tuning and from sampling data from the LLM itself to fine-tune it. DPO works without explicit reward modeling or reinforcement learning, which makes it computationally lightweight, stable, and less time-consuming.

AI models, including LLMs, are becoming more complex. Therefore, algorithms that reduce this complexity while maintaining or even improving on the results of earlier approaches, and that still allow precise control over LLMs, are highly valued at the current stage of AI development.
