
Toloka Team

Sep 23, 2024

Essential ML Guide

Breaking barriers: multilingual large language models in a globalized world

Language is a bridge connecting human society to computer models and applications. Large language models (LLMs) are transforming numerous global domains, from healthcare to education and beyond. However, their inclusivity remains limited, particularly when dealing with low-resource languages and certain non-Latin scripts.

Learn why integrating low-resource languages into LLMs is essential for responsible AI.

This gap raises questions about the models’ ability to serve a truly global audience. Beyond achieving linguistic accuracy, LLMs must also be culturally aware, reflecting the unique socio-cultural contexts of the regions they are meant to serve. Yet, these models are primarily trained on web-based data, and the web itself is often biased, especially concerning communities underrepresented online.

Different LLMs’ performance for high-resource languages (German, Spanish, French, Indonesian, and Chinese) and low-resource languages (Igbo, Kazakh, Kannada, Oriya, and Turkmen). Source: Quantifying Multilingual Performance of Large Language Models Across Languages

In this article, we’ll explore multilingual LLMs’ purpose, training process, and the significant challenges their developers face. We’ll also highlight some of the most popular multilingual LLMs and discuss Toloka’s experience in fine-tuning datasets for specific languages.

Why we need multilingual LLMs

The gap in representation for different languages is driven by both economic and technological factors. High-resource languages like English, Spanish, and Mandarin dominate the digital world, leaving many other languages underrepresented. 

In a world where over 7,000 languages exist, multilingual large language models have the potential to break down language barriers and create more inclusive digital spaces. They already enable more effective global communication, and their current capabilities are only the beginning. For instance, businesses can use these models to interact with customers in diverse languages, enhancing user experience and broadening market reach.

Most edited editions of Wikipedia over time. The chart demonstrates the unequal representation of languages across the web. Source: Wikipedia

However, building models even for several languages would require vast investments in training data collection, processing power, and linguistic expertise. 

Low-resource languages often lack the necessary datasets to train accurate models, further complicating efforts to develop LLMs that can truly master them. The linguistic diversity and unique nuances of each language add additional layers of complexity.

Evolution of multilingual LLMs over the past five years. Source: Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers

Developing effective LLMs for these languages requires more than simply scaling existing models—it demands a dedicated approach to creating new benchmarks, collecting specific language datasets, and fine-tuning techniques to address the subtleties of underrepresented linguistic landscapes.

What are multilingual models?

Unlike traditional language models, which are limited to a single language, multilingual LLMs leverage extensive multilanguage datasets to understand and produce text across different tongues simultaneously. This capability is primarily powered by advanced neural networks, often utilizing transformer architectures, which excel at learning and mimicking complex language patterns. 

The architecture of LLMs typically relies on advanced neural network frameworks, with transformers being the most prevalent design. Transformers use mechanisms such as self-attention to efficiently process and understand context within text. This architecture allows LLMs to handle large volumes of data by focusing on the relationships between words in a sentence, rather than processing words sequentially.
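To make the self-attention idea concrete, here is a minimal NumPy sketch of scaled dot-product attention over a single sequence. The matrices, dimensions, and random inputs are made up for illustration; real transformers add multiple heads, learned positional information, and many stacked layers.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over one sequence.

    X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: (d_model, d_k) projections.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise relevance between tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # context-aware token representations

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (5, 8)
```

Each output row mixes information from every token in the sentence, weighted by relevance, which is what lets the model capture word relationships without reading the text strictly left to right.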

SUTRA is an example of a transformer-based multilingual LLM architecture where the input is processed through a multilingual concept encoder, followed by the concept model and multilingual concept decoder. Source: SUTRA: Scalable Multilingual Language Model Architecture

The training process for multilingual LLMs involves ingesting vast amounts of text data from diverse linguistic sources. This enables the models to not only grasp the syntactic and semantic nuances of different languages but also to preserve these subtleties in their outputs. By deconstructing and analyzing the intricacies of each language, these models can generate translations, creative writing, and responses with an impressive degree of accuracy. 

Training a Multilingual LLM

Training multilingual LLMs involves massive amounts of data and sophisticated neural architectures. A key challenge here is achieving balance across languages—only careful curation can guarantee that none of the targeted languages is overshadowed. 

A monolingual Large Language Model vs. a multilingual Large Language Model. Source: Multilingual Large Language Model: A Survey of Resources, Taxonomy and Frontiers

The core of multilingual LLM training is self-supervised learning, a method where the model acquires the ability to predict missing words or sentences in a text. For instance, a common technique used is masked language modeling, where certain words in a sentence are hidden, and the model reconstructs these elements from the surrounding context. Another approach is next-sentence prediction, where the model learns to generate a coherent continuation of a given passage. 

These tasks allow the model to capture the contextual relationships between words and understand the syntactic and semantic nuances of different languages. By performing them across multiple languages simultaneously, the LLM develops the ability to process and generate text in various linguistic settings.
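Masked language modeling is easy to try directly with a pretrained multilingual model. The snippet below is a small illustration using the Hugging Face transformers library and the publicly available bert-base-multilingual-cased checkpoint; the example sentences are illustrative and not taken from any training pipeline described here.

```python
from transformers import pipeline

# A multilingual masked language model predicts the hidden token from context,
# regardless of the input language.
fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

for sentence in [
    "Paris is the [MASK] of France.",          # English
    "Berlin ist die [MASK] von Deutschland.",  # German
]:
    top = fill_mask(sentence)[0]
    print(sentence, "->", top["token_str"], round(top["score"], 3))
```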

Transformer neural networks excel at handling multiple streams of information at once. Their self-attention mechanisms allow them to weigh the importance of each word relative to the others in a given text. This ability is crucial when dealing with multilingual data, as the relationships between words may vary greatly across languages.

An example of a message translated into five languages and the corresponding number of tokens required for its tokenization. Source: All languages are NOT created (tokenized) equal

Tokenization is a crucial step in training LLMs, breaking down texts into smaller units—tokens—that the model can process. In languages like English, tokenization might be straightforward, with spaces and punctuation serving as natural boundaries. However, for languages with different structures, it may require advanced tokenization methods.
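One simple way to observe this imbalance is to count how many tokens the same sentence needs in different languages. The sketch below uses a multilingual tokenizer from the Hugging Face transformers library; the sentences are illustrative rough equivalents, and the exact counts will depend on the tokenizer chosen.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

sentences = {
    "English": "How is the weather today?",
    "Finnish": "Millainen sää on tänään?",
    "Hindi":   "आज मौसम कैसा है?",
}

for lang, text in sentences.items():
    tokens = tokenizer.tokenize(text)
    print(f"{lang:8s} {len(tokens):2d} tokens: {tokens}")
```

Languages that are poorly covered by the tokenizer's vocabulary tend to be split into many more, shorter pieces, which makes them more expensive to process and harder to model.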

A framework suggested by a group of KRAFTON researchers to optimize tokenization for non-Latin alphabet languages. Source: Accelerating Multilingual Language Model for Excessively Tokenized Languages

During the training process, the model builds a shared representation space for different languages, which allows it to transfer knowledge from one language to another. This is what makes multilingual LLMs so powerful—they don’t just learn individual languages in isolation; they also capture the relationships between them, which supports tasks such as translation and cross-lingual understanding.
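One way to glimpse this shared space is to embed the same sentence in two languages and compare the vectors. The sketch below mean-pools XLM-RoBERTa hidden states purely as an illustration; raw embeddings from a base model are not tuned for semantic similarity, so the score should be treated as indicative only.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base")

def embed(text):
    # Mean-pool the last hidden states over non-padding tokens.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).squeeze(0)

english = embed("The weather is nice today.")
swahili = embed("Hali ya hewa ni nzuri leo.")  # rough Swahili equivalent
print(torch.nn.functional.cosine_similarity(english, swahili, dim=0).item())
```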

Popular Multilingual LLMs

Several large language models (LLMs) have gained prominence for their ability to handle diverse linguistic tasks. They range from general-purpose systems to models built for specific use cases. The following list of recognized multilingual LLMs gives an overall idea of this variety.

mT5 (Multilingual T5)

mT5 is a multilingual variant of the popular T5 model by Google. It is trained on a vast dataset spanning over 100 languages to handle a broad array of natural language processing (NLP) tasks, such as translation, summarization, and question-answering.

Its ability to understand and generate text across a wide range of languages makes it a popular tool, especially for academic and commercial research.

M2M-100 (Many-to-Many 100)

M2M-100, developed by Meta, breaks from the conventional model of translating all languages into English as an intermediary step. Instead, it translates directly between 100 supported languages, bypassing the need for any pivot.

This approach is particularly beneficial for lesser-known language pairs, such as translations between Swahili and Amharic. Meta’s innovation in M2M-100 is reshaping how AI handles direct communication in underrepresented languages.
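Direct translation without an English pivot can be tried with the smaller M2M-100 checkpoint released on the Hugging Face Hub. The snippet below is a hedged illustration; the Swahili input sentence and the Swahili-to-Amharic direction are example choices, not taken from Meta's documentation.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

# Translate Swahili -> Amharic directly, without pivoting through English.
tokenizer.src_lang = "sw"
encoded = tokenizer("Habari ya asubuhi, karibu.", return_tensors="pt")
generated = model.generate(**encoded,
                           forced_bos_token_id=tokenizer.get_lang_id("am"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```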

mBERT (Multilingual BERT)

mBERT (Multilingual Bidirectional Encoder Representations from Transformers) is a multilingual version of Google’s BERT model. Used for tasks like machine translation, sentiment analysis, and chatbots, mBERT is highly adaptable and effective in handling many languages simultaneously.

It supports a wide range of NLP tasks by learning cross-lingual representations, making it valuable in various multilingual applications.

XLM-RoBERTa

XLM-RoBERTa, developed by Meta, is a transformer-based multilingual LLM trained on data covering 100 languages. It excels in sequence labeling and question-answering tasks.

XLM-RoBERTa is a key model for applications requiring deep linguistic understanding across various language environments.

BLOOM

BLOOM is an open-access multilingual LLM trained to support text generation in 46 natural languages and 13 programming languages. Its open-source nature has attracted a large community of researchers and developers, particularly those working in niche language processing.

The model provides accessible, state-of-the-art multilingual NLP capabilities to a broad range of users.

Aya 23

Aya 23, developed by Cohere, is an open-weight multilingual LLM optimized for performance across 35 languages. It builds on the success of previous Aya models by enhancing depth rather than breadth, allocating more capacity to fewer languages for higher-quality text generation.

Aya 23 demonstrates cutting-edge multilingual generative AI performance, making it an important tool for researchers focused on multilingual language modeling.

YAYI 2

Developed by Wenge Research, YAYI 2 is an open-source LLM designed to process tasks related to the Chinese language. With 30 billion parameters, it’s specialized for Chinese linguistic structures but also supports other languages, making it an important tool in the Chinese-speaking world.

PolyLM and XGLM

PolyLM and XGLM, along with mT5 mentioned above, represent a wave of open-source multilingual LLMs designed for public access and research. These models are critical in expanding access to multilingual AI, providing a platform for diverse applications in natural language understanding and generation.

Platforms leveraging multilingual LLMs

Duolingo

Known for its language-learning platform, Duolingo utilizes multilingual LLMs to personalize lessons and offer real-time feedback in a variety of languages. It’s an example of how multilingual models can enhance educational experiences, making language learning more interactive and effective.

Microsoft Translator

Microsoft Translator leverages multilingual LLMs to deliver real-time translation services for various languages. Widely used for travel, business, and personal communication, it showcases the practical applications of multilingual LLMs in daily life.

Fine-Tuning LLMs for Low-Resource Languages

Fine-tuning LLMs for underrepresented languages is challenging due to the lack of annotated textual training data, supervised fine-tuning question-answer pairs, and human feedback. This gap makes it difficult for LLMs to achieve the same performance across all linguistic environments.

Toloka addressed this issue when creating a Swahili-specific version of the mT5 model. The approach combined automated translation of the English Dolly dataset into Swahili, producing 15,000 question-answer pairs. Native Swahili-speaking annotators then rejected questionable translations and refined the dataset to 11,000 high-quality pairs, which were used to fine-tune the original model.

Source: HackerNoon

Combining machine translation with human validation significantly improved the model's performance on classification and generative tasks for Swahili, a language spoken by about 200 million people. Such scalable solutions can dramatically enhance LLM performance for particular regions, helping eliminate language barriers.
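The exact Toloka pipeline is not reproduced here, but a fine-tuning run of this kind might look roughly like the sketch below, using the Hugging Face transformers and datasets libraries. The swahili_pairs list is a hypothetical placeholder for the curated question-answer data, and the hyperparameters are illustrative.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

# Hypothetical placeholder for the curated Swahili question-answer pairs.
swahili_pairs = [
    {"question": "Mji mkuu wa Tanzania ni upi?",
     "answer": "Dodoma ni mji mkuu wa Tanzania."},
    # ... in the real setting, roughly 11,000 curated pairs
]

tokenizer = AutoTokenizer.from_pretrained("google/mt5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-base")

def preprocess(batch):
    # Encode questions as inputs and answers as target labels.
    inputs = tokenizer(batch["question"], truncation=True, max_length=256)
    inputs["labels"] = tokenizer(text_target=batch["answer"],
                                 truncation=True, max_length=256)["input_ids"]
    return inputs

train_set = Dataset.from_list(swahili_pairs).map(
    preprocess, batched=True, remove_columns=["question", "answer"])

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="mt5-swahili-qa",
                                  per_device_train_batch_size=4,
                                  num_train_epochs=3,
                                  learning_rate=3e-4),
    train_dataset=train_set,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```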

Final Thoughts: LLMs for multiple languages

Bridging the gap between high- and low-resource languages is not just a technological challenge but also a matter of ethics, and it offers a clear direction for ongoing development within the AI ecosystem. At the same time, the global economy increasingly demands scalable solutions that let ML models switch seamlessly between languages, especially as the workforce becomes more multilingual.

Poor language comprehension in LLMs can be exploited for jailbreaking. Source: Low-Resource Languages Jailbreak GPT-4

By tackling the challenges of language underrepresentation, the AI community can ensure that technology serves everyone, fostering more inclusive communication across diverse linguistic landscapes.

Article written by:

Toloka Team

Updated:

Sep 23, 2024
