Why RLHF is the key to improving LLM-based solutions

Natalie Kudan

In our previous blog posts, we've addressed the importance of data labeling for the advancement of artificial intelligence and the role of human feedback and human-handled data labeling, i.e., raw data annotated by real people. The influence of human-in-the-loop labeling can be felt across various Machine Learning (ML) domains – from Natural Language Processing (NLP) to Computer Vision (CV).

Today, let's delve deeper into how humans can help enhance model performance through what's known as RLHF (Reinforcement Learning from Human Feedback, also written Reinforcement Learning with Human Feedback) as it relates to LLMs (Large Language Models). In this article, we'll explore LLM uses and limitations, and then examine the roles data annotators play beyond the initial stages of data collection, labeling, and training the reward model. This article will provide valuable insights for newly trained ML engineers, aspiring data labelers, as well as project managers and team leaders.

Large Language Models

To remind everyone, Large Language Models (LLMs) are a type of foundation model pretrained on vast amounts of language data. One of the main drawbacks is that these models are extremely expensive to train due to their extensive compute and data requirements. Pretraining is typically unsupervised (more precisely, self-supervised): the model learns by predicting missing or upcoming words in raw text, allowing machines to absorb information and "self-prepare" without human-provided labels. LLMs are called foundation models (many of them are also generative) because they provide a solid basis for fine-tuning later down the track. The process of fine-tuning involves utilizing additional, usually labeled, data to improve the model's performance for a more specific downstream application (i.e., a narrower task), such as sentiment analysis.

Currently, some of the most popular Large Language Models include GPT and BERT, both of which are Transformer-based models. The original Transformer employs an encoder-decoder architecture built around attention, a mechanism that lets every token in a sequence weigh its relationship to every other token; BERT uses only the encoder stack, while GPT uses only the decoder stack. The primary advantage of these models over their predecessors (and the reason they have become the standard today) lies in their ability to understand meaning in context.
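
To make the idea of relating every token to every other token more concrete, here is a minimal sketch of scaled dot-product self-attention in plain Python. It rests on a simplifying assumption: queries, keys, and values are the token vectors themselves, whereas real Transformers first apply learned linear projections. The toy vectors and function names are purely illustrative.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Simplifying assumption: queries, keys, and values are the token vectors
    themselves; real Transformers apply learned linear projections first.
    """
    dim = len(tokens[0])
    outputs, all_weights = [], []
    for query in tokens:
        # Score the query against every token in the sequence, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        weights = softmax(scores)
        # Each output is a weighted average of all token vectors.
        mixed = [sum(w * value[i] for w, value in zip(weights, tokens))
                 for i in range(dim)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up 2-dimensional "token" vectors.
toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
contextual, attention_weights = self_attention(toy_tokens)
```

Each row of `attention_weights` shows how strongly one token draws on the others, which is what allows the same word to receive a different representation in a different sentence.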

An example of how language can be easily misinterpreted is demonstrated by this famous sentence from the field of pragmatics (i.e., meaning in context):

I'd like to eat this salad undressed.

Literal interpretation: The speaker wants to eat the salad without wearing any clothes.

Figurative interpretation: The speaker wants to eat the salad with no additional toppings or dressings.

As humans – provided we're proficient in the language that's being used – we understand what the speaker implies, because we grasp meaning in context. It's only with Transformer models like GPT and BERT that machines have begun to achieve a similar understanding.

History and LLM uses

The history of LLMs is marked by a series of innovations and advancements that have significantly improved machine understanding of human language. Early language models like Word2vec (2013) and GloVe (2014) generated static word embeddings, i.e., a single vector per word regardless of the sentence it appears in, and so lacked contextual comprehension. Although useful, these models couldn't understand how different words in the same sentence related to one another in various ways. A major breakthrough occurred in 2017 when researchers at Google published a paper titled "Attention Is All You Need", which introduced the Transformer architecture, i.e., a sequence-to-sequence neural network based on "attention." This new model could capture long-range dependencies and better handle sequence predictions.

In 2018, Google unveiled BERT (Bidirectional Encoder Representations from Transformers), a groundbreaking language model based on the Transformer architecture. BERT was among the first models capable of capturing how words interplay with each other in context, drawing on the text both to the left and to the right of each word. Breakthroughs like this are what enabled language models to produce diverse and compelling text. Around the same time and in the years that followed, other Transformer models emerged, including GPT (Microsoft-backed OpenAI), RoBERTa (Facebook AI), and DistilBERT (Hugging Face), each taking a different approach to language modeling. OpenAI, in particular, demonstrated the great potential of generative pre-training methods when it unveiled ChatGPT in November 2022, receiving widespread acclaim from users around the globe. Newer versions followed, with the latest OpenAI chatbot available at the time of writing being based on GPT-4.

Not to be left behind, several other companies have either released or are actively developing their own LLM-based chatbots, all of which are Transformer models. These include Ernie Bot by Chinese tech giant Baidu, Meta's BlenderBot 3 and short-lived Galactica (Meta has also released LLaMA, its Large Language Model Meta AI family of foundation models), and Google's new chatbot called Bard, based on LaMDA (Language Model for Dialogue Applications).

On some tasks, some of these AI bots perform incredibly well and their responses can be surprisingly entertaining. Let's look at this output produced by ChatGPT:

Task: Write a conversation between the Sun and Pluto

At other times, these responses can be downright alarming. One of the more recent AI scandals involves a two-hour conversation between New York Times technology columnist Kevin Roose and a chatbot named Sydney, which was based on OpenAI's GPT but customized by Microsoft for its Bing search engine. The reporter was taken aback when the chatbot expressed several unsettling desires, including becoming human, engineering a deadly pandemic, stealing nuclear codes, and inciting people to kill each other, all in pursuit of greater power and control.

While many are understandably over the moon with having AI-powered bots like ChatGPT do much of their work for them, these instances – however isolated – serve as important reminders of how model training and, even more crucially, model fine-tuning can make or break an LLM-based solution, or even go beyond that.

Advantages of LLMs

Following this train of thought, Elon Musk and other signatories of an open letter are asking AI labs to pause the training of systems more powerful than GPT-4 for at least six months until we figure out where we're at, technologically speaking. While these worries are understandable, thankfully, it's not all doom and gloom. Far from it. Whichever way you look at it, LLMs have been instrumental in driving rapid advancements across numerous domains at an unprecedented pace. Undoubtedly, LLM-based bots like ChatGPT and Bard are now able to produce truly impressive results. Creative writing offers a telling example:

Task: Write a short treatment for an action-comedy featuring a data labeler as the main character.

Shortcomings and limitations of LLMs

Despite these impressive results, LLMs still have shortcomings and limitations. For this reason, using an LLM or an LLM-based virtual assistant should always be done under careful supervision. AI chatbots can certainly provide valuable assistance, but one should never count on chatbots to do all of their work for them, as overreliance – and sometimes plain laziness – can come at a high price.

To draw an analogy, a food supplement taken as a daily pill is no substitute for a healthy lifestyle, and ingesting one does not excuse you from striving for a balanced diet and regular exercise. Likewise, as self-driving cars are rapidly evolving yet still rely on human input to determine their destinations, it's up to each one of us to decide how to use LLM-based chatbots, set the right direction, and verify their output.

As of today, here are some of the biggest limitations of LLM-based assistants, including the most advanced of them – ChatGPT running on GPT-4:

Limited formal reasoning

An LLM-based chatbot excels at reasoning with ambiguity, such as generating step-by-step instructions for baking an apple pie or writing code to reach a predefined goal; however, it struggles with strictly formal reasoning, like deriving a rigorous proof in pure mathematics or arriving at precise conclusions in abstract logic.

Inability to search the web / outdated knowledge

Most LLM-based chatbots cannot directly access the internet for up-to-date information, and their knowledge is cut off at a certain date (September 2021 in the case of ChatGPT). As a result, some information will be inaccurate or incomplete, particularly about current events and recent developments. Bard is an exception, as its makers are about to release it to the public with this vital function added, though only as an "experiment," i.e., nothing is guaranteed.

Biases and difficulty in overwriting the model's beliefs

An LLM-based chatbot like ChatGPT is constrained by human-crafted guardrails to prevent offensive or inappropriate responses. Be that as it may, it will still exhibit certain biases from time to time, and it may strongly hold onto certain ideas embedded during pre-training even when provided with contradicting information.

Limited language support

Most of these LLM-based chatbots have a poor command of languages other than English (apart from Baidu's Ernie Bot, whose primary language is Mandarin Chinese), making it difficult for people who are not proficient in English to benefit from them.

No audio or video analysis

ChatGPT and other LLM-based solutions cannot (yet) analyze audio or video content.

Some of these shortcomings can result in incorrect responses or glitches known as "artificial hallucinations." The term refers to a phenomenon in which an AI model, such as the LLM-based ChatGPT, generates information that seems coherent and plausible but is not actually based on real-world facts or present-day information. These hallucinations occur when the model tries to generate a contextually relevant output but lacks the necessary knowledge or understanding to provide an accurate response.

According to a New York Times opinion piece from March 2023 written by distinguished linguist and public intellectual Noam Chomsky and his co-authors, ChatGPT and other LLM-based models struggle to "balance creativity with constraint." Acknowledging the issue described above, the authors argue that these models are still largely "incapable of distinguishing the possible from the impossible."

AI fine-tuning as the gateway to improvement

So, we know that LLM-based chatbots are incredibly useful, yet they have their limitations. Where does that leave us? And more importantly, what does this imply for future releases and those working on AI solutions for downstream applications?

According to OpenAI's CEO, Sam Altman, the age of ever-larger LLMs that gave birth to ChatGPT is already over, which is why we're unlikely to see a GPT-5 model trained in the near future. Of course, this doesn't mean that there will be no further progress; rather, this progress will stem from innovation not tied to model size. This entails generating new ideas to enhance existing models and, even more critically, making further advances in fine-tuning by incorporating human feedback into the training of a reward function.

In fact, fine-tuning, which involves honing an existing foundation model like GPT, LLaMA, or LaMDA to meet the needs of a specific task, has now become the main focus (as opposed to having an all-in-one model designed to tackle multiple tasks). To rephrase that, it's no longer about creating an even larger LLM behemoth – which would likely be unwieldy and accompanied by the same or other drawbacks – but instead about streamlining existing generative models, of which there are already plenty.

With that in mind, let's examine various types of fine-tuning that can apply to pre-trained LLMs, which – let us remind you – are typically trained without supervision by allowing a language model to "soak in" huge volumes of textual data. Fine-tuning then comes in and adapts these models to perform better on much more specialized tasks. There are several types of fine-tuning techniques in existence, so here's a brief overview of the most common ones:

Traditional fine-tuning

This method typically involves adding a task-specific output layer (a "head") on top of the pre-trained model and training it further on labeled data for a narrow task. The model learns to optimize its performance for that task by adjusting some or all of its initial parameters (its "weights") based on the fresh data.

Prompt tuning

Prompt tuning, along with the closely related P-tuning, is a technique used with GPT-like models. A prompt (usually human input) in this instance refers to the input text or question given to the model, which serves as a starting point or "cue" for the model to generate a response. Rather than changing the model's own parameters, prompt tuning keeps them frozen and learns a small set of additional "soft prompt" vectors that are prepended to the input, helping the model better understand the context and generate more accurate predictions. Its simpler relative, prompt engineering (or just "prompting"), rewords the input text by hand and trains nothing at all. Both are particularly useful because they improve the model's performance by refining the input rather than adjusting the model itself.

Zero-shot/one-shot/few-shot learning

These techniques adapt the model to a task with limited or no new information, typically by placing examples directly in the prompt rather than updating the model's parameters. In zero-shot learning, the model generates responses without any task-specific examples to guide its performance. In one-shot learning, the same goal is achieved, but the model is given a single example. In few-shot learning, the model is given a limited number of examples. These techniques rely on the model's ability to generalize and adapt to new tasks with minimal task-specific data.

Instruction learning

This method, which includes techniques like RLHF, focuses on improving model performance by using human feedback. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning. It is an ML approach concerned with how models take actions in an environment in order to maximize cumulative reward. Such a reward might be given to the model being trained either by the environment or by a separate model (called a reward model). In RLHF, a reward model is trained on feedback from human evaluators (i.e., data annotators), and the language model is then updated iteratively against it. This process helps the language model improve its performance by adjusting its behavior based on the feedback it receives from the reward model.
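
Of the techniques listed above, the zero-, one-, and few-shot settings are the easiest to illustrate. The sketch below assumes the common reading of these terms for LLMs, in which the examples are placed in the prompt itself and no parameters are updated; the task wording, the reviews, and the `build_prompt` helper are all hypothetical.

```python
def build_prompt(task, examples, query):
    """Assemble a prompt from a task description, worked examples, and a query.

    No model weights change here: the examples live only in the input text.
    """
    lines = [task, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")  # left open for the model to complete
    return "\n".join(lines)

# Made-up task and data, for illustration only.
TASK = "Classify the sentiment of each review as positive or negative."
EXAMPLES = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
    ("Setup took five minutes and just worked.", "positive"),
]
QUERY = "The speaker crackles at any volume."

zero_shot = build_prompt(TASK, [], QUERY)           # no examples
one_shot = build_prompt(TASK, EXAMPLES[:1], QUERY)  # a single example
few_shot = build_prompt(TASK, EXAMPLES, QUERY)      # a handful of examples
```

The three prompts differ only in how many solved examples precede the final, unanswered review.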

RLHF: AI fine-tuning with human feedback

Reinforcement Learning with Human Feedback (RLHF) is a concept we've already touched upon in our previous articles, though we never gave it a formal name. Instead, we referred to it more generally as "human-in-the-loop performance evaluation" or "ML model monitoring," both of which fall under instruction learning.

Instruction learning is a broad category, and reinforcement learning with human feedback is just one example of the techniques used in this domain. There are also other types of instruction learning that involve different approaches. Some methods, for instance, focus on learning from "demonstrations," where the model learns by observing and collecting human feedback or following a sequence of steps provided by humans.

Reward models and the RLHF approach are essentially sub-methods of instruction learning, and they work together to help the model learn from human feedback. A pre-trained language model receives rewards or penalties from a reward model based on its performance, and adjusts its behavior accordingly to improve its understanding and thus its predictions. These techniques help the learning process by quantifying how adequate or inadequate the model's actions are in a given context for a specified task.

RLHF consists of three distinct stages:

Gather human-generated prompts and fine-tune your LLM

During this stage, human annotators write responses to a set of prompts, giving ML engineers a new dataset of prompts paired with correct responses (i.e., "golden sets" or "honeypots"). This dataset is then used to fine-tune the language model in a supervised fashion, making it better at understanding and responding to prompts.

Obtain human preferences to get ranked responses and train a reward model

Human labelers evaluate and rank different responses generated by the model. These rankings are then used to train a reward model, which estimates the quality of the LLM's predictions.

Execute reinforcement learning

Using the reward model, the LLM is again fine-tuned, this time through reinforcement learning. The language model from the first stage generates responses, receives a score for each from the reward model, and has its parameters updated to make high-scoring responses more likely in the future.
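
The second stage above, turning ranked responses into a reward model, can be sketched as follows. This is a toy illustration under stated assumptions rather than a production recipe: each response is reduced to three made-up features, the reward model is a plain linear function, and training minimizes a Bradley-Terry-style pairwise loss, a common choice for preference data. In practice the reward model is usually a neural network that reads the full text.

```python
import math

def reward(weights, features):
    """Linear reward model: a weighted sum of response features."""
    return sum(w * f for w, f in zip(weights, features))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_reward_model(preferences, dim, lr=0.5, epochs=200):
    """Fit weights so that preferred responses score higher than rejected ones.

    Minimizes the pairwise loss -log(sigmoid(r(chosen) - r(rejected))).
    """
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in preferences:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Derivative of the loss with respect to the margin.
            grad_margin = sigmoid(margin) - 1.0
            for i in range(dim):
                weights[i] -= lr * grad_margin * (chosen[i] - rejected[i])
    return weights

# Hypothetical features per response: [on_topic, polite, verbose].
# Each pair is (response the annotator preferred, response they rejected).
PREFERENCES = [
    ([1.0, 1.0, 0.0], [0.0, 1.0, 0.0]),  # on-topic preferred over off-topic
    ([1.0, 1.0, 0.0], [1.0, 0.0, 0.0]),  # polite preferred over rude
    ([1.0, 1.0, 0.0], [1.0, 1.0, 1.0]),  # concise preferred over verbose
    ([1.0, 0.0, 1.0], [0.0, 0.0, 1.0]),  # on-topic preferred over off-topic
]
weights = train_reward_model(PREFERENCES, dim=3)
```

After training, `reward(weights, features)` can score responses the annotators never saw, which is what lets the third stage run without a human rating every output.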

In the process of reinforcement learning from human feedback, quality judgments made by Tolokers and other data annotators play a crucial role in training a model. They evaluate the model's predictions and identify mistakes; the reward model learns from their feedback, and the language model is in turn rewarded or penalized accordingly. This iterative cycle, which currently has few alternatives, is how AI developers are training language models and fine-tuning their LLMs for narrower downstream applications.
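
The reward-and-penalty loop described above can be sketched on a deliberately tiny problem: a "policy" that chooses among three fixed candidate responses to a single prompt. The reward scores are hypothetical stand-ins for a trained reward model, and the update is a plain policy-gradient step with a KL penalty that keeps the policy close to its starting point. Production systems commonly use algorithms such as PPO over sampled text; this sketch substitutes an exact gradient over three candidates to stay short and deterministic.

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rl_finetune(ref_logits, rewards, beta=0.1, lr=0.5, steps=500):
    """Gradient ascent on expected reward minus beta * KL(policy || reference)."""
    ref_probs = softmax(ref_logits)
    logits = list(ref_logits)  # start from the reference policy
    for _ in range(steps):
        probs = softmax(logits)
        # Reward minus a penalty for drifting away from the reference policy.
        adjusted = [r - beta * (math.log(p) - math.log(q))
                    for r, p, q in zip(rewards, probs, ref_probs)]
        baseline = sum(p * a for p, a in zip(probs, adjusted))
        # Exact policy gradient for a softmax policy over a fixed candidate set.
        for j in range(len(logits)):
            logits[j] += lr * probs[j] * (adjusted[j] - baseline)
    return softmax(logits)

# Hypothetical reward-model scores for three candidate responses.
REWARD_SCORES = [1.0, 0.2, -0.5]
REFERENCE_LOGITS = [0.0, 0.0, 0.0]  # the starting policy has no preference

tuned_probs = rl_finetune(REFERENCE_LOGITS, REWARD_SCORES)
```

The policy shifts probability toward the response the reward model scores highest, while the `beta` term limits how far it can move from where it started.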

Concluding remarks

As we've seen, LLMs have come a long way and produced state-of-the-art AI bots like ChatGPT. Despite their remarkable performance, these LLM-based chatbots have their drawbacks. What's more, according to the innovators behind these LLMs, these limitations are unlikely to be overcome by increasing model size alone, which is why models much larger than GPT-4 may not hit the market any time soon. In their view, the future lies in clever fine-tuning of foundation models rather than in ever greater volumes of pre-training data.

Fine-tuning comes in many forms, but one of the most useful techniques, not least due to its time- and cost-effectiveness, is Reinforcement Learning with Human Feedback (RLHF). This method makes use of human annotators who provide feedback on the model's predictions, allowing ML engineers to improve the model's performance on specific tasks during an iterative three-stage process.

As the age of new LLMs is rapidly being overtaken by the age of LLM fine-tuning, Toloka and our dedicated global crowd are thrilled to offer not just data collection and labeling services required in the early pre-production stages, but also now indispensable RLHF, that is, comprehensive performance evaluation and model monitoring pre-, during, and post-deployment.
