Training large language models 101

by Toloka Team

Introduction

Large language models are making waves, with each new model larger than the one before and boasting impressive performance across a variety of tasks. These models have enormous potential, but they also come with considerable challenges.

If you want to learn more about the ins and outs of training large language models as well as next generation innovations, you’ve come to the right place. We take an in-depth look at some recent examples and case studies in addition to various pros/cons and practical applications to uncover the details behind this trailblazing technology. Keep reading to learn more.

What is a large language model?

A large language model (LLM) is a type of machine learning model, or more specifically a deep learning model, that is able to comprehend and generate human language via deep neural networks. In short, deep neural networks are a class of ML algorithms loosely inspired by how the human brain processes information. While there’s no set-in-stone definition of a large language model, the term generally refers to a language model with a very large number of parameters (for example, GPT-3 has 175 billion). Large language models can generate text akin to human writing and are becoming an increasingly critical component of the internet’s infrastructure. LLMs have many uses, including summarizing texts, building more effective digital search tools, and serving as chatbots.

However, we know that the internet can be a toxic place. Given that LLMs are trained on huge amounts of online data, it doesn’t take much for them to start producing potentially dangerous responses. That’s why many AI developers are working to make their models safer; reinforcement learning from human feedback (RLHF) is a key component of this. There is still a lot of work to be done before many of these conversational AI models can be deployed in everyday life.

Advantages of LLMs

LLMs play an influential role in driving rapid innovation across multiple domains. Because they have more parameters, they can capture more nuance and build a more accurate picture of the data they’re working with. This is key for natural language data, since the meaning of words depends so heavily on context. Additionally, LLMs can be trained on considerably larger datasets, and the more data a language model is trained on, the better it tends to be at adapting to new data. This matters particularly for language models, because task-specific training data is often limited.

Here’s a breakdown of the main benefits:

  • Tones and subtleties of language

LLMs can capture the intricacies of language, which allows them to better understand words in the context of a sentence or a piece of writing.

  • Wide ranging uses

LLMs are good for a wide variety of general uses, but they can also be fine-tuned to deal with narrower domains and tasks such as translating languages, answering questions, or developing chatbots.

  • Greater comprehension

The greater comprehension that comes with LLMs opens up many opportunities, including more accurate translations, improved text classification, and more natural-sounding text generation across various scenarios, programming languages included.

  • Faster training time and reduced training data

Along with greater precision, LLMs also have the potential to reduce training time and the amount of data needed to reach a given level of performance: the more parameters a model has, the more information it can learn from a given dataset.

As a rough illustration, a language model with 1 billion parameters can typically reach the same level of performance with a considerably smaller dataset than a model with 100 million parameters.

Drawbacks of LLMs

While LLMs are on the cutting edge of innovation with far-reaching, real-world applications, they still have some significant drawbacks: namely, they can be unreliable and can present erroneous information in an authoritative, overconfident tone, with potentially harmful outcomes. This can be especially dangerous when it comes to a person’s health or finances.

Here’s an overview of some of the drawbacks:

  • Bias and stereotypes

Since LLMs are trained on various sources, they can unintentionally replicate the bias in those sources. They also can’t update their knowledge without being retrained.

  • Misinterpretation and false information

Even though LLMs can generate human-like text, they don’t always understand the given context and can generate inaccurate or false information as a result.

  • Resource consumption and cost

Model training for LLMs requires significant computational resources, which equates to steep costs and energy consumption.

While the largest autoregressive transformers achieve notable results with evaluation protocols and techniques such as zero-shot, one-shot, and few-shot learning as well as fine-tuning, this comes at the cost of gigantic compute and energy requirements. However, with multiple advancements on the horizon, these models are likely to become considerably more efficient in the near future.

Training data for LLMs

Most LLMs are pre-trained: given a training dataset comprising a large corpus of text tokens, the language model learns to predict tokens from their context, so that it can then predict tokens in text it has not seen before. There are two common pre-training approaches:

  • Autoregressive (GPT-style; predicting the next word)

Given a segment of text like "I like to eat", the model predicts the next tokens, such as "vanilla pudding".

  • Masked (BERT-style)

Given a segment of text like "I like to [MASK] [MASK] pudding", the model predicts the masked tokens, such as "eat vanilla".
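The two objectives above differ mainly in how training examples are cut from the same text. The following is a minimal sketch, assuming whitespace tokenization and a random masking rate of 30%; the function names and the `mask_rate` value are illustrative choices, not taken from any particular model.

```python
import random

def autoregressive_examples(tokens):
    """Build (context, next_token) pairs: every prefix predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def masked_example(tokens, mask_rate=0.3, seed=0, mask_token="[MASK]"):
    """Hide a random subset of tokens; return the corrupted input and the hidden targets."""
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    corrupted = [mask_token if i in positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in positions}  # position -> original token
    return corrupted, targets

tokens = "I like to eat vanilla pudding".split()

ar_pairs = autoregressive_examples(tokens)
for context, target in ar_pairs:
    print(" ".join(context), "->", target)

corrupted, targets = masked_example(tokens)
print(" ".join(corrupted), targets)
```

In the autoregressive case the model only ever sees text to the left of the target, while in the masked case it sees context on both sides of each hidden token.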

LLMs can also be trained on auxiliary tasks that test their comprehension of the data distribution. One example is Next Sentence Prediction (NSP), where the model is shown pairs of sentences and has to determine whether they appear consecutively in the training corpus.
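A small sketch of how NSP training pairs might be assembled from an ordered list of sentences. The helper `nsp_pairs` and the one-negative-per-positive sampling scheme are assumptions made for illustration; real pipelines usually draw negatives from a much larger corpus.

```python
import random

def nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples from ordered sentences."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to sample a negative")
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        # Positive example: the sentence that really follows.
        pairs.append((sentences[i], sentences[i + 1], True))
        # Negative example: any sentence that is neither the first one nor its true successor.
        candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
        pairs.append((sentences[i], sentences[rng.choice(candidates)], False))
    return pairs

sentences = [
    "I went to the kitchen.",
    "I made some vanilla pudding.",
    "Then I ate it.",
    "It was delicious.",
]
pairs = nsp_pairs(sentences)
for a, b, is_next in pairs:
    print(is_next, "|", a, "|", b)
```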

Furthermore, LLMs require an enormous amount of training data in addition to robust, flexible, and highly optimized data pipelines that can easily include new sources of data. Through self-supervised learning, LLMs are pre-trained on huge amounts of unlabeled text data extracted from sources such as books, articles, and websites.

Given the vast amounts of text data on which they’re trained, LLMs have the capacity to learn complex patterns and structures found in natural language. However, what they actually do during the training process is relatively simple: they predict the next word (or token) in a sequence. A model that works this way is referred to as an “autoregressive” language model: it generates output progressively, feeding its past outputs back in as input for future predictions.
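The loop of predicting a token, appending it, and predicting again can be shown with a deliberately tiny stand-in for an LLM. The sketch below assumes a bigram count table in place of a neural network and greedy decoding (always taking the most frequent follower); both are simplifications chosen for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, prompt, max_new_tokens=5):
    """Greedy autoregressive decoding: append the most likely next token, then repeat."""
    output = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(output[-1])
        if not followers:
            break  # nothing was ever seen after this token
        output.append(followers.most_common(1)[0][0])
    return output

corpus = "i like to eat vanilla pudding . i like to eat fruit . i like to read .".split()
model = train_bigram(corpus)

generated = generate(model, ["i"], max_new_tokens=3)
print(" ".join(generated))
```

Each generated token becomes part of the input for the next step, which is the property the term “autoregressive” refers to.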

Training compute-optimal large language models

The AI lab DeepMind, owned by Alphabet, carried out research to determine the optimal model size and number of training tokens for a transformer language model under a fixed compute budget. The team trained over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens. They found that for compute-optimal training, model size and the number of training tokens should be scaled in equal proportion: for every doubling of model size, the number of training tokens should also be doubled.

Chinchilla case study

The team presented three different approaches to determine the relationship between model size and number of training tokens; all three indicated that increasing the model size and the number of training tokens in roughly equal proportions would result in better performance.

The team tested their hypothesis that model size and number of training tokens should be scaled equally by training a model called Chinchilla, which used the same compute budget as Gopher, its larger counterpart, but with fewer parameters and four times the data. They found that smaller, more optimally trained models perform better: their compute-optimal 70 billion parameter model Chinchilla, trained on 1.4 trillion tokens, outperformed Gopher (a 280 billion parameter model) while reducing inference costs significantly. Chinchilla also outperforms several other prominent models, such as GPT-3 and Jurassic-1, on a range of downstream evaluation tasks. It uses less compute for fine-tuning and inference, and improves on Gopher’s accuracy by more than 7% on the MMLU benchmark.
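The “same compute budget” claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes the widely used rule of thumb that training compute is roughly 6 × parameters × tokens; that approximation is not stated in this article. Gopher’s token count is derived here from the “four times the data” statement, so treat it as approximate.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute with the C ~= 6 * N * D rule of thumb (an assumption)."""
    return 6 * n_params * n_tokens

chinchilla_params, chinchilla_tokens = 70e9, 1.4e12  # figures quoted above
gopher_params = 280e9                                # figure quoted above
gopher_tokens = chinchilla_tokens / 4                # derived from "four times the data"

chinchilla_flops = training_flops(chinchilla_params, chinchilla_tokens)
gopher_flops = training_flops(gopher_params, gopher_tokens)

print(f"Chinchilla: {chinchilla_flops:.2e} FLOPs, "
      f"{chinchilla_tokens / chinchilla_params:.2f} tokens per parameter")
print(f"Gopher:     {gopher_flops:.2e} FLOPs, "
      f"{gopher_tokens / gopher_params:.2f} tokens per parameter")
```

Under these assumptions, a model with a quarter of the parameters trained on four times the tokens lands on the same compute estimate, which is why the two models make a fair comparison.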

Sparrow case study

DeepMind built its chatbot Sparrow on the lab’s large language model Chinchilla, training it to learn from human feedback and to search the internet for evidence that supports its responses. From its research, DeepMind concluded that an effective AI-powered chatbot needs human input to tell it how to behave, and that the model should back up its statements with information found online.

The chatbot interacted with humans and answered questions using a live Google search, and was then trained via a reinforcement learning algorithm. Following 23 rules, the model was able to provide plausible answers with supporting evidence about 78% of the time. However, participants were able to make the model break the rules about 8% of the time. When it comes to safe interactions between these artificial intelligence models and humans, there’s still a lot of work to be done.

NextGen LLMs

With AI moving at the speed of light, you may be wondering what the next generation of LLMs will look like. Startups and research groups alike are already on it. Let’s take a look at three emerging areas of innovation that will likely define the next wave of LLMs:

1. Models that can self-improve by producing their own training data

A new area of AI aims to enable LLMs to mimic the innately human ability to generate novel ideas and insights through inward reflection and deliberation. Imagine if models could generate their own ideas and original written content based on all the information they’ve previously acquired. They could then use that newfound knowledge to improve themselves even further. There are already models that can generate their own natural language instructions and fine-tune themselves accordingly.

Given that we may at some point run out of text training data, this area of innovation is of vital importance. Estimates of the world’s cumulative text data, which encompasses all the books, academic papers, articles, shared code, and more, range from 3.2 or 4.6 trillion tokens at the low end to 17.2 trillion tokens at the high end. As mentioned above, it took 1.4 trillion tokens to train DeepMind’s Chinchilla, one of today’s foremost LLMs.
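To put those figures side by side, the following quick calculation shows what share of each estimate a single Chinchilla-sized training run would consume. It uses only the numbers quoted above; the variable names are illustrative.

```python
chinchilla_tokens = 1.4e12  # training tokens quoted for Chinchilla

# Estimates of the world's total usable text, as quoted above.
estimates = {
    "3.2 trillion": 3.2e12,
    "4.6 trillion": 4.6e12,
    "17.2 trillion": 17.2e12,
}

shares = {label: chinchilla_tokens / total for label, total in estimates.items()}
for label, share in shares.items():
    print(f"If {label} tokens exist, Chinchilla used {share:.0%} of them, "
          f"leaving room for about {1 / share:.1f} runs of that size without repeating text")
```

Even at the high end of the range, only about a dozen Chinchilla-sized runs’ worth of unique text would exist, which is why self-generated training data is attracting attention.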

2. Models that can assess their own accuracy

Today’s LLMs are known for generating inaccurate, deceptive, or just plain wrong information, however assertively they present it. Such outputs are often termed “hallucinations”. However, recent advancements may soon help to overcome this challenge. Because LLMs can now retrieve information from external sources and provide references and citations, they’re already on the path to becoming more accurate. OpenAI’s WebGPT, for example, can navigate online search engines much as a human would and cite sources for its answers. Likewise, DeepMind’s Sparrow offers similar capabilities.

3. Enormous sparse expert models

While differences in size, hidden layers, training data, and more may exist between existing models, today’s key LLMs all basically have the same architecture: they’re pre-trained, self-supervised, autoregressive, densely activated, transformer-based models. However, progress is being made toward the creation of an alternative architecture referred to as “sparse expert models” — the opposite of “dense”.

The idea behind sparse expert models is that they don’t activate all their parameters for a given input, only those that are most relevant. Compared to their dense counterparts, these models can be larger yet less computationally demanding, with improved runtime. They’re also more amenable to interpretability, that is, understanding the “why” behind a model’s actions.
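One common way to realize this idea is mixture-of-experts routing, where a small router scores the experts and only the top few run for each input. The sketch below is a toy version in plain Python; the experts, the routing weights, and the `top_k=2` setting are all hypothetical values chosen for illustration, not the design of any specific model.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_forward(x, experts, router_weights, top_k=2):
    """Route input x to the top_k highest-scoring experts and mix their outputs."""
    # Router: one score per expert (dot product of x with that expert's routing vector).
    scores = [sum(w * xi for w, xi in zip(weights, x)) for weights in router_weights]
    chosen = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in chosen])
    # Only the chosen experts run; the rest stay inactive for this input.
    outputs = [experts[i](x) for i in chosen]
    mixed = [sum(g * out[d] for g, out in zip(gates, outputs)) for d in range(len(x))]
    return mixed, chosen

# Four hypothetical experts, each a simple scaling function.
experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

mixed, chosen = sparse_forward([2.0, 1.0], experts, router_weights, top_k=2)
print("active experts:", chosen, "output:", mixed)
```

With four experts and `top_k=2`, half of the expert parameters are skipped for this input, which is the source of the compute savings described above.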

Today’s very largest models by parameter count are sparse, and new models are continuing to grow in size. As an example, Google and Meta have both produced sparse models that significantly outperformed their predecessors on a wide variety of benchmarks, with gains in energy efficiency and interpretability as well.

How Toloka can help

Toloka makes working with LLMs simple and efficient. Our platform helps AI developers get their apps up and running by automating model training, tuning, deployment, and monitoring. We help developers everywhere fine-tune their pre-trained language models to align with human expectations by:

  • Collecting labeled data

Via an efficient combination of automated and human labeling in every language.

  • Training and deploying models

Integrating TensorFlow and PyTorch or auto-tuning and deploying via our ML platform.

  • Receiving human feedback

Leveraging our crowd to moderate output, assess quality, and monitor human feedback loops.

Drawing upon the latest machine learning models and the collective efforts of our diverse global crowd, we work across a variety of areas to help you get the results you need. To name a few of these areas: chatbots and AI assistants, content generation, summarization and moderation, code assistance and generation, and finance data analysis.

Our services include reinforcement learning with human feedback (RLHF), model pre-training, fine-tuning and output moderation, and human-led quality checks. Learn more about what we offer, as well as our latest insights, advice, and solutions.

Key takeaways

This is undoubtedly the age of the LLM with new advancements and innovations around every corner. With due credit to self-supervised learning, zero-shot, few-shot, and fine-tuning methods, language models are growing in size at a rapid rate. These large models require better and higher-performing hardware, software, and algorithms for training.

However, there also needs to be a greater emphasis on dataset scaling and high-quality data along with greater accountability for ethical and privacy issues, among other concerns.

Moreover, large language models offer significant potential for the future of machine learning. As datasets and computing power continue their growth, it’s highly probable that we’ll be seeing even larger and more complex models in the coming years.
