Mixture of Experts Approach for Large Language Models

by Toloka Team

There is a well-known saying: two heads are better than one. The approach we are going to examine today can be summed up by this simple but accurate idiom, which applies to almost any sphere of life. It turns out that in the AI field, too, multiple models combined are often better than one, especially when there is a lot of data and the tasks are complicated.

A mixture of experts (MoE) is an approach that was introduced in 1991 in a paper called "Adaptive Mixtures of Local Experts" by Robert A. Jacobs and his colleagues, long before the rise of the deep neural networks that have captured the minds of AI experts in recent years. Below, we will find out what a MoE is and why this approach is especially handy when dealing with complex tasks and large amounts of data.

The Need for Specialized Models

Sophisticated tasks often rely on heterogeneous data sources containing diverse content and/or modality. Traditional models may not process and leverage such diversified information efficiently, resulting in poorer model performance.

In scenarios where the data exhibits diverse and variable patterns, a single model may struggle to adequately represent all aspects of the data. This highlights the need for more sophisticated specialized models like the MoE to address the complexities inherent in such data.

Understanding Mixture of Experts

Neural networks can only handle so much information, and the number of parameters they have determines their capacity to consume data. To make neural networks more powerful, researchers have been trying to find ways to increase the number of parameters effectively.

One method gaining traction is a mixture of experts, which allows different parts of the network to activate depending on the input data. This particular feature is referred to as conditional computation. In MoE models, only certain parts of the network are activated for each piece of data, allowing for a significant increase in model capacity without a corresponding increase in computation.
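To make this concrete, consider a purely illustrative back-of-the-envelope calculation (the numbers are hypothetical and not taken from any particular model): a MoE layer with 8 experts of 100 million parameters each holds roughly 800 million parameters in total, but if the router activates only 2 experts per input, each input is processed by only about 200 million of those parameters. The model's capacity grows with the number of experts, while the compute per input stays close to that of a much smaller dense layer.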

Mixture of Experts and its Fundamental Principles

A mixture of experts (MoE) is a machine learning technique that combines the predictions of multiple specialized trainable network models, called experts, to solve complex problems. The concept of experts implies dividing the problem space into multiple regions, with each region assigned to a different expert. Each region is a categorized data set that is distinguishable according to a certain similarity metric.

In the context of machine learning and problem-solving, the problem space refers to the set of all possible states, configurations, or outcomes that a problem can take during its solving. It encompasses the range of variables, features, and conditions that define the problem and influence its solution.

Each expert model is trained to process a separate region of the problem space, using an algorithm or architecture tailored to its intended area of expertise and strengths. After training, each expert specializes in handling its own region, also referred to as a subset of the data. The predictions generated by the experts are combined to create the final output of the MoE model.

The training process utilizes the strengths of each expert so that they can capture a broad range of regularities and interconnections. This approach is especially useful when all possible input data for the model, i.e. input space, is very extensive.

An effective strategy for obtaining diverse and reliable training data for a model is to collaborate with a trusted data partner like Toloka. Whether the data is generated by human domain experts or is high-quality synthetic data, Toloka ensures a quality standard that can be relied upon at every stage of developing an LLM. If you seek optimized performance and accuracy across various tasks and domains for your LLM, reach out to Toloka.

The best overall system performance is achieved through the collaborative work of the experts, which are merged into a single model by a gating network, also known as a router, that gives each expert's prediction a weighted score. These weights are determined by the input features and the level of confidence of each expert.

A neural network may contain a vast number of such expert models, each specializing in a different area of expertise. Since the model rapidly assigns input data to these experts according to their expertise, the information fed to the model is interpreted precisely. A MoE is made up of multiple experts but is considered a single large model thanks to the gating network.
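To illustrate the principle, here is a minimal sketch in PyTorch of a gating network scoring several feed-forward experts and combining their predictions into a weighted sum. It is not the implementation of any particular production model; the layer sizes and the choice of experts are illustrative assumptions.

```python
# Minimal sketch: a gating network (router) scores every expert and the final
# output is the weighted sum of the expert predictions. Sizes are illustrative.
import torch
import torch.nn as nn

class DenseMoE(nn.Module):
    def __init__(self, input_dim=16, hidden_dim=32, output_dim=16, num_experts=4):
        super().__init__()
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, output_dim))
            for _ in range(num_experts)
        ])
        # The gating network maps the input to one score per expert.
        self.gate = nn.Linear(input_dim, num_experts)

    def forward(self, x):                                  # x: (batch, input_dim)
        weights = torch.softmax(self.gate(x), dim=-1)      # (batch, num_experts), rows sum to 1
        expert_outputs = torch.stack([expert(x) for expert in self.experts], dim=1)
        # Weighted combination of every expert's prediction.
        return (weights.unsqueeze(-1) * expert_outputs).sum(dim=1)

moe = DenseMoE()
y = moe(torch.randn(8, 16))                                # -> tensor of shape (8, 16)
```

In this dense form every expert runs on every input; the sparse variants discussed below activate only a few of them.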

MoE Types

There are two primary types of MoE: dense and sparse. Dense MoE is considered the traditional form, and its main difference from the sparse kind is that a large number of experts are active whenever input is fed to the model, whereas sparse MoE allows only a small number of experts to be active.

Sparse models are more cost-effective because fewer specialized experts are activated at the same time, so they consume fewer resources. Because they use less computational power, they are considered the more practical and relevant variant today. In what follows, we will concentrate primarily on the sparse type of MoE.
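Building on the toy DenseMoE above, the sketch below shows one way a sparse router can keep only the top-k gate scores and run just those experts, which is where the compute savings come from. Real systems batch tokens per expert and add load-balancing losses; this is only a simplified illustration.

```python
# Sparse (top-k) routing sketch: only the k highest-scoring experts run per input.
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, input_dim=16, hidden_dim=32, output_dim=16,
                 num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, output_dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(input_dim, num_experts)
        self.top_k = top_k
        self.output_dim = output_dim

    def forward(self, x):                                  # x: (batch, input_dim)
        scores = self.gate(x)                              # (batch, num_experts)
        top_vals, top_idx = torch.topk(scores, self.top_k, dim=-1)
        weights = torch.softmax(top_vals, dim=-1)          # renormalize over the chosen experts
        out = torch.zeros(x.size(0), self.output_dim)
        for e, expert in enumerate(self.experts):
            rows, slots = torch.where(top_idx == e)        # inputs routed to expert e
            if rows.numel() == 0:
                continue                                   # this expert is idle for the batch
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out
```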

Architecture of MoE models

In MoE models for NLP tasks, transformers are often employed as the underlying architecture. Typical dense transformer models use dense feed-forward network (FFN) layers in their structure. With sparse MoE, however, these FFN layers are replaced by MoE layers. This means that when a token is fed to the model, it is directed to a MoE layer, which contains both the experts and a gating network (router).

The feed-forward network layers store much of the model's learned knowledge. In large language models, these layers can contain an enormous number of parameters. In sparse MoE, the FFN layer is replaced by a set of experts in the MoE layer, so only the parameters that are actually needed are used within these expert networks, rather than all the parameters of a single dense layer. Spreading knowledge sparsely across the experts allows the model to run only parts of a complex system, making computations noticeably faster.

Experts are typically feed-forward networks, or may even be MoE models themselves. Some versions of MoE include Switch Transformers, GLaM, and V-MoE. These models have shown better scalability across different domains and are better at retaining knowledge because they introduce sparsity in the network by selecting only a subset of experts for each data point.
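As a rough sketch of this structure, the toy transformer block below uses a sparse MoE layer where a dense transformer would have its FFN sub-layer. It reuses the hypothetical SparseMoE class from the sketch above; the dimensions and layer arrangement are illustrative, not those of Switch Transformers, GLaM, or V-MoE.

```python
# Toy transformer block whose FFN sub-layer is replaced by a sparse MoE layer.
# Assumes the SparseMoE class defined in the previous sketch.
import torch
import torch.nn as nn

class MoETransformerBlock(nn.Module):
    def __init__(self, d_model=16, num_heads=4, num_experts=8, top_k=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # In a dense transformer this would be a single FFN; here it is a set of
        # expert FFNs plus a router.
        self.moe = SparseMoE(input_dim=d_model, hidden_dim=4 * d_model,
                             output_dim=d_model, num_experts=num_experts, top_k=top_k)

    def forward(self, x):                                  # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        b, s, d = x.shape
        moe_out = self.moe(x.reshape(b * s, d)).reshape(b, s, d)  # route each token
        return self.norm2(x + moe_out)
```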

The router's gating function decides which expert should handle a given input token, according to the data region it belongs to. The gating network should be able to determine which expert is best suited to the task at hand. It assigns weights to the outputs of the experts based on their relevance or expertise for the given input sample. These weights determine the contribution of each expert's output to the final prediction made by the MoE.

However, if the selection of experts isn't done well, some experts might not get enough training, which can lead to them being either too specialized or not specialized enough. This can affect the overall performance of the model.

Benefits of MoE

Handling Heterogeneous Data

MoE models can effortlessly combine and exploit heterogeneous data sources of different natures and modalities. By allowing each expert to specialize in representing a particular modality or aspect of the data, MoE can effectively deal with complex problems involving heterogeneous information.

Adaptability to Complex Data

MoE excels in cases where data is complex and can be segmented into distinct data subsets. Different models within the MoE framework can cater to various subsets of data, ensuring that predictions are optimized for each subset's unique characteristics.

Higher Accuracy and Better Generalization

MoE empowers each expert to focus on a particular area of expertise (a region of the problem space), which leads to better overall accuracy and improved generalization of the model. Unlike a single homogeneous model that tries to handle all kinds of tasks, the customized approach of MoE helps each expert achieve the best possible results in its domain, improving overall performance.

Training and Optimization

During the training process, the parameters of the gating function are trained alongside those of the expert network using backpropagation and gradient descent. The following are the key activities required to train a MoE.

Dividing Data and Experts Training

The first step is to segment the data into subsets based on regions of the input space. Then each expert within the MoE framework is trained on its associated data subset. This training involves optimizing each expert's parameters so that it can make reliable predictions for that particular segment of the dataset. Experts may be any kind of neural network, with each specializing in its own field. For example, one may be an expert in translation, another in question answering or summarization, and so on.
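As a toy sketch of this classical "divide the data, train one expert per region" idea, the example below segments a synthetic dataset with k-means clustering (one possible similarity metric) and fits a simple linear regressor as the expert for each region. Both choices are illustrative assumptions rather than a prescribed recipe.

```python
# Partition the input space into regions and train one expert per region.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))                              # toy input features
# Two synthetic "regimes" so that different regions follow different rules.
y = np.where(X[:, 0] > 0,
             X @ np.array([1.0, 2.0, 0.0, 0.0]),
             X @ np.array([0.0, 0.0, 3.0, -1.0]))

# 1. Segment the data into subsets based on regions of the input space.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_

# 2. Train one expert on each region's subset.
experts = [LinearRegression().fit(X[labels == r], y[labels == r]) for r in range(2)]

# 3. At prediction time, route each new sample to the expert for its region.
def predict(x_new):
    regions = kmeans.predict(x_new)
    return np.array([experts[r].predict(x_new[i:i + 1])[0]
                     for i, r in enumerate(regions)])
```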

Training of The Gating Network

A gating network is trained alongside the experts, based on the input data and the predictions the experts produce. It accepts input data and calculates gating weights, values indicating how much each expert contributes to the outcome. These weights are probabilities that sum to 1. Once trained, the gating network allocates weights to each expert's predictions based on the input data, so the most relevant expert's output is given the greatest weight.

Backpropagation and Gradient Descent

As previously mentioned, both the gating network and the expert networks are trained using backpropagation and gradient descent. Backpropagation computes the gradients of the loss function with respect to each model parameter, while the gradient descent algorithm updates these parameters in the direction that minimizes the loss.

The loop of backpropagation and gradient descent continues until the models approach an optimal set of parameters where losses are minimized and predictions are accurate. This whole process helps both the gating network and the experts learn from their mistakes. It's the standard way for neural networks to learn and improve their performance.
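As a minimal illustration of this loop, the sketch below trains the toy DenseMoE from the earlier example on synthetic data. Because the router and the experts live in one model, a single backward pass computes gradients for both, and one optimizer step updates all parameters; the data, loss, and learning rate are arbitrary placeholders.

```python
# Joint training of the gating network and the experts with backpropagation
# and gradient descent. Assumes the toy DenseMoE class from the earlier sketch.
import torch
import torch.nn as nn

model = DenseMoE(input_dim=16, hidden_dim=32, output_dim=16, num_experts=4)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)   # covers gate + all experts
loss_fn = nn.MSELoss()

X = torch.randn(256, 16)                                   # synthetic inputs
Y = torch.randn(256, 16)                                   # synthetic targets

for step in range(100):
    pred = model(X)
    loss = loss_fn(pred, Y)
    optimizer.zero_grad()
    loss.backward()       # backpropagation: gradients for the router and the experts
    optimizer.step()      # gradient descent: update every parameter of the MoE
```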

Expert Choice Routing

In MoE there's a potential problem with the routing or gating function. These functions might end up favoring certain experts too much, leaving others undertrained. This imbalance can lead to experts not fully developing their skills and knowledge.

To tackle this issue, a technique called regularization is introduced. It prevents too many examples from being sent to just one or a few experts, spreading the workload more evenly. This way, each expert gets a fair chance to learn and improve, making the entire MoE model more reliable.

Expert Choice (EC) routing introduces a novel regularization approach that addresses potential flaws in traditional mixture of experts models by reversing how experts and tokens are matched within the model. So, instead of tokens being assigned to experts, the router lets the experts choose the tasks they are good at.

EC Routing introduces the concept of expert capacity, regulating how many tokens an expert can process simultaneously. This capacity is determined by the average number of tokens per expert in input sequences, multiplied by the average number of experts each token can be assigned to. The latter variable is referred to as a capacity factor. Adjusting this capacity factor allows researchers to achieve optimal load balancing and control the workload distribution among experts.

After that, the EC routing approach needs to decide which expert should handle each token. It builds a special chart called the token-to-expert score matrix, whose entries are token-to-expert compatibility scores showing how well each expert is suited to each token.

These scores are used to choose the most appropriate tokens for each expert through the top-k function. Here, "k" indicates how many tokens are assigned to each expert. This ensures that each expert gets the tasks it is best at, making the model more efficient. The data is then rearranged by a permutation function so that all experts can process their tokens in parallel; in other words, the data is reorganized so that every expert receives its share of tokens fairly.
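The sketch below illustrates the core of this procedure: a token-to-expert score matrix is built, the expert capacity k is derived from the capacity factor, and each expert picks its top-k tokens. The shapes, the random router weights, and the plain softmax scores are simplifying assumptions for illustration, not the exact formulation from the Expert Choice paper.

```python
# Expert Choice routing sketch: experts pick their top-k tokens from a
# token-to-expert score matrix, with k set by the expert capacity.
import torch

num_tokens, num_experts, d_model = 16, 4, 8
capacity_factor = 2                                        # avg experts per token
# Expert capacity: average tokens per expert times the capacity factor,
# i.e. k = num_tokens * capacity_factor / num_experts (16 * 2 / 4 = 8 here).
k = num_tokens * capacity_factor // num_experts

tokens = torch.randn(num_tokens, d_model)                  # toy token representations
router_weights = torch.randn(d_model, num_experts)         # toy router parameters

# Token-to-expert score matrix: compatibility of every token with every expert.
scores = torch.softmax(tokens @ router_weights, dim=-1)    # (num_tokens, num_experts)

# Each expert chooses its k best tokens (top-k along the token dimension).
gating, token_idx = torch.topk(scores.t(), k, dim=-1)      # both (num_experts, k)

for e in range(num_experts):
    chosen = tokens[token_idx[e]]                          # (k, d_model) tokens for expert e
    # The expert network would process `chosen` here, and `gating[e]` would
    # weight its outputs when combining results across experts.
```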

Applications of MoE

A mixture of experts (MoE) has shown promising results across various real-world applications, including natural language processing (NLP), computer vision, and recommendation systems. Here are just some examples in different domains:

Natural Language Processing. Introduced by Google, the Switch Transformer is a variant of the Transformer model that incorporates MoE. It has shown significant improvements in various NLP tasks, such as language modeling and translation. The MoE mechanism allows the model to dynamically switch between different experts based on the input sequence, leading to better performance.

Computer Vision. Vision-MoE, also introduced by Google, is a variant of MoE designed specifically for vision tasks. It has been applied successfully in tasks such as image recognition, object detection, and image segmentation.

Recommendation Systems. Google's YouTube recommendation system employs MoE to personalize video recommendations for users. By analyzing user behavior and preferences, the MoE model selects and combines recommendations from different experts. This approach improves the relevance and engagement of recommended videos, enhancing the user experience on the platform.

MoE Challenges and Future Directions

Mixture of experts models are powerful architectures that combine multiple sub-models, or experts, to handle different parts of the input space. They typically involve a large number of parameters due to the presence of multiple experts. Training and inference with such models can be computationally intensive, requiring significant computational resources, training data, and time. To address the computational complexities, it is crucial to research efficient training algorithms that are scalable and can handle large-scale MoE models well.

While there are some excellent methods like EC routing for designing efficient gating functions and experts, there is still room to make them even smarter and more efficient. By exploring new ways to decide which expert to use for each part of the data, and by making these decisions more accurate, researchers can improve how MoE models perform.

MoE models have shown great results in areas like language understanding and image recognition, but there are many other areas, such as reinforcement learning or data analysis, where they have not yet been fully explored.

