How To Enhance LLM Evaluation For Responsible AI

Toloka Team

Are LLMs really responsible and accurate?

In 2022, Beth Stackpole reported for MIT Sloan that 79% of companies admit their implementation of responsible AI is limited in scale and scope. To create truly responsible AI, an important safety issue must be addressed: LLM (large language model) fluency in low-resource languages.

According to a study from Brown University, bad actors have found a loophole to influence LLMs to produce unethical results by:

Translating unsafe English inputs into low-resource languages[…] which provides actionable items that can get the users towards their harmful goals 79% of the time.

In this article, we will explore how LLMs are evaluated, why they don't perform equally well in all languages, and how we can improve their performance. We'll discuss the different types of data needed for LLM development and the specific challenges faced when fine-tuning LLMs for languages with limited resources.

Additionally, we'll examine the importance of human feedback in refining LLMs, especially in languages where expertise is scarce. By involving experts proficient in non-English languages and establishing rigorous evaluation criteria, we aim to narrow the gap between widely spoken languages and those with fewer resources. This approach seeks to create AI systems that are more inclusive and accountable to diverse linguistic communities.

Why Do LLMs Not Work Equally for All Languages?

The effectiveness of LLMs depends directly on the quality of the data on which they are trained. This data comes in three types:

  • Unannotated textual data: This is typically gathered from online sources and used to train a base model.

  • Question-answer pairs: These are used for Supervised Fine-Tuning (SFT), enabling the base model to follow instructions and to comprehend and respond to queries accurately.

  • Human feedback on model responses: This is needed to align the model with human expectations, either through Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO) or by applying Direct Preference Optimization (DPO) directly.
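
As a rough illustration, the three data types could be represented as records like the ones below. This is a minimal sketch, not a standard schema: the class names (`PretrainingDocument`, `SFTExample`, `PreferencePair`), the field names, and the `count_by_language` helper are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class PretrainingDocument:
    """Text gathered from online sources, used to train the base model."""
    language: str  # illustrative language code, e.g. "en" or "sw"
    text: str


@dataclass
class SFTExample:
    """A question-answer pair used for supervised fine-tuning."""
    language: str
    prompt: str
    response: str


@dataclass
class PreferencePair:
    """Human feedback: two responses to one prompt, with the preferred one marked."""
    language: str
    prompt: str
    chosen: str
    rejected: str


def count_by_language(records):
    """Count records per language to expose coverage gaps (hypothetical helper)."""
    counts = {}
    for record in records:
        counts[record.language] = counts.get(record.language, 0) + 1
    return counts


# Placeholder records; the text content is irrelevant to the count.
sft_data = [
    SFTExample("en", "(prompt written in English)", "(answer in English)"),
    SFTExample("en", "(another English prompt)", "(answer in English)"),
    SFTExample("sw", "(prompt written in Swahili)", "(answer in Swahili)"),
]
coverage = count_by_language(sft_data)
print(coverage)
```

Running the same count over each of the three record types, per language, is one simple way to see which languages are missing a data type entirely.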

Developing a multilingual LLM requires access to all three data types for each language the LLM aims to comprehend proficiently. While there may be an abundance of data available in English for training LLMs, many languages have limited resources suitable for this purpose. These languages, commonly known as low-resource languages (LRLs), are spoken by significant populations. Swahili, for example, is used by 200 million people across 14 African countries, but it is considered an LRL.

Furthermore, the scarcity of raw data is compounded by the lack of attention from the Natural Language Processing (NLP) research community toward certain languages. This results in fewer benchmarks and curated datasets available for training NLP algorithms. To create truly multilingual LLMs and inclusive AI, we need to redirect our focus and include low-resource languages despite the initial challenges this may pose.

SFT for Low-resource Languages

More commonly spoken languages like English have abundant datasets comprising thousands of question-answer pairs readily available for fine-tuning LLMs. While these datasets may not yield highly competitive LLMs, they serve as an initial step in the refinement process. On the other hand, many low-resource languages lack out-of-the-box instruction fine-tuning datasets, necessitating their creation even before basic LLM fine-tuning can start.

Creating a Supervised Fine-Tuning (SFT) dataset requires meticulous planning to include a diverse range of questions, ensuring representation across all desired categories for LLM proficiency. It’s crucial to allocate the right proportion for each skill, such as summarization, generative writing, question-answering, classification, translation, and more. Moreover, securing native speakers who are proficient in composing both questions and answers is essential.
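
The proportion planning described above can be sketched in a few lines. The example below is a sketch under assumptions: the function name `allocate_prompts` and the percentages in `target_mix` are invented for illustration and are not recommended values. It converts a target mix into whole-number prompt counts using the largest-remainder method, so the counts always add up to the requested total.

```python
def allocate_prompts(total, percent_by_skill):
    """Split `total` prompts across skills according to integer percentages.

    Uses the largest-remainder method: each skill gets the floor of its
    exact share, and any leftover prompts go to the skills whose shares
    had the largest fractional parts.
    """
    if sum(percent_by_skill.values()) != 100:
        raise ValueError("percentages must sum to 100")

    counts = {skill: total * pct // 100 for skill, pct in percent_by_skill.items()}
    remainders = {skill: total * pct % 100 for skill, pct in percent_by_skill.items()}

    leftover = total - sum(counts.values())
    by_remainder = sorted(remainders, key=remainders.get, reverse=True)
    for skill in by_remainder[:leftover]:
        counts[skill] += 1
    return counts


# Hypothetical target mix; the numbers are placeholders, not guidance.
target_mix = {
    "summarization": 20,
    "generative_writing": 25,
    "question_answering": 25,
    "classification": 15,
    "translation": 15,
}

plan = allocate_prompts(1000, target_mix)
for skill, count in plan.items():
    print(f"{skill}: {count}")
```

The same plan can be computed separately for each language, which makes it easy to check that a low-resource language covers every skill rather than only the ones that are cheapest to write.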

Typically, the creation process of an SFT dataset involves two distinct projects: question generation and prompt completion, both requiring native speakers known as AI tutors or AI trainers. Additionally, if specialization in fields like medicine, finance, or coding is desired, AI tutors with expertise in these areas are necessary. Managing a substantial number of AI tutors is often required to ensure the production of high-quality prompts and answers.

For some low-resource languages, the above process is not feasible because native-speaker experts are scarce, and alternative solutions must be explored. One viable approach in this scenario is to translate English datasets into the desired language using automatic translation and then have native speakers refine the output. This method reduces the number of annotators required and is more cost-effective.
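
A translate-then-refine workflow of this kind might be organized as follows. This is a minimal sketch: `TranslatedPair`, `build_drafts`, `apply_review`, and `final_dataset` are hypothetical names, and `fake_translate` is a stand-in for whatever machine translation system is actually used. The point it illustrates is that machine output is treated as a draft, and only pairs a native speaker has reviewed reach the final dataset.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class TranslatedPair:
    source_prompt: str            # original English prompt
    source_response: str          # original English answer
    draft_prompt: str             # machine-translated prompt
    draft_response: str           # machine-translated answer
    reviewed_prompt: Optional[str] = None    # filled in by a native speaker
    reviewed_response: Optional[str] = None  # filled in by a native speaker

    @property
    def is_reviewed(self) -> bool:
        return self.reviewed_prompt is not None and self.reviewed_response is not None


def build_drafts(
    english_pairs: List[Tuple[str, str]],
    translate: Callable[[str], str],
) -> List[TranslatedPair]:
    """Machine-translate every English pair into draft target-language text."""
    return [
        TranslatedPair(prompt, response, translate(prompt), translate(response))
        for prompt, response in english_pairs
    ]


def apply_review(pair: TranslatedPair, prompt: str, response: str) -> None:
    """Record a native speaker's corrected version of a draft."""
    pair.reviewed_prompt = prompt
    pair.reviewed_response = response


def final_dataset(pairs: List[TranslatedPair]) -> List[Tuple[str, str]]:
    """Keep only pairs that a native speaker has reviewed."""
    return [(p.reviewed_prompt, p.reviewed_response) for p in pairs if p.is_reviewed]


def fake_translate(text: str) -> str:
    """Placeholder translator; a real pipeline would call an MT system here."""
    return f"[draft] {text}"


drafts = build_drafts(
    [("English prompt 1", "English answer 1"), ("English prompt 2", "English answer 2")],
    fake_translate,
)
apply_review(drafts[0], "(corrected prompt 1)", "(corrected answer 1)")
print(final_dataset(drafts))
```

Keeping the English source next to each draft lets a reviewer check meaning as well as fluency.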

It’s important to note that creating a multilingual LLM proficient in one hundred languages requires repeating this process for each language involved.

Human Feedback for Low-resource Languages

In the final stage of training an LLM, it’s essential to align it with human expectations. This process helps to address concerns like minimizing hallucinations and ensuring the model provides useful and harmless responses. Alignment is achieved by seeking feedback from human experts, who compare two potential responses generated by the model and select the preferred one.
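
To make the comparison step concrete, the sketch below turns annotator votes on two responses into a single preference record. It rests on assumptions that go beyond the description above: that several annotators vote on each pair, that a simple majority decides, and that pairs with weak agreement are dropped. The function name `aggregate_comparison` and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter


def aggregate_comparison(prompt, response_a, response_b, votes, min_agreement=0.6):
    """Combine annotator votes ("A" or "B") into one preference record.

    Returns a dict with the chosen and rejected responses, or None when
    there are no votes or the majority is weaker than `min_agreement`.
    """
    if any(vote not in ("A", "B") for vote in votes):
        raise ValueError('votes must be "A" or "B"')
    if not votes:
        return None

    winner, winner_votes = Counter(votes).most_common(1)[0]
    agreement = winner_votes / len(votes)
    if agreement < min_agreement:
        return None  # too contested to use as a training signal

    if winner == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {
        "prompt": prompt,
        "chosen": chosen,
        "rejected": rejected,
        "agreement": agreement,
    }


record = aggregate_comparison(
    "(prompt)", "(response A)", "(response B)", votes=["A", "B", "A"]
)
print(record)
```

Records in this chosen/rejected shape are the typical input for the alignment methods mentioned earlier, whichever one is used.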

This feedback-gathering process often entails collecting hundreds of thousands or even millions of comparisons, especially for complex models like Llama 2. Depending on the specific abilities targeted for the LLM, experts in relevant fields are needed to identify inappropriate responses. Because native experts are scarce, this can be a challenge for low-resource languages. As a result, it is common for LLM developers to prioritize gathering feedback for English or a few other widely spoken languages.

Like the collection of Supervised Fine-Tuning (SFT) data, gathering human feedback to build multilingual LLMs becomes a large-scale annotation project. It involves utilizing AI tutors proficient in various languages and domain experts. Although simpler than collecting data for SFT, it requires significant volumes of annotations and often needs to be continually updated to enhance the safety and alignment of the LLM.

Creating Responsible AI

Bridging the gap between widely spoken languages and low-resource languages is essential to ensure the safety and alignment of LLMs and ultimately create more responsible AI. Although it may pose initial challenges, this disproportion can be addressed by prioritizing the involvement of experts proficient in non-English languages in the creation and evaluation of LLM datasets.

Implementing this approach effectively can ensure that there is no tolerance for unethical outcomes, particularly in languages with fewer resources, thereby closing the loophole highlighted by the Brown University report "Low-Resource Languages Jailbreak GPT-4". Without adequate attention to multilingual focus and evaluation, companies utilizing LLMs face potential legal, competitive, and reputational risks.
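
One way to make this kind of evaluation measurable is to track, per language, how often human reviewers flag a model response as unsafe. The sketch below assumes reviewers produce `(language, is_unsafe)` labels; the function names, the threshold, and the toy labels are invented purely for illustration and do not represent real results.

```python
from collections import defaultdict


def unsafe_rate_by_language(labels):
    """Compute the share of responses flagged unsafe for each language.

    `labels` is an iterable of (language, is_unsafe) tuples from human review.
    """
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for language, is_unsafe in labels:
        totals[language] += 1
        if is_unsafe:
            unsafe[language] += 1
    return {language: unsafe[language] / totals[language] for language in totals}


def languages_over_threshold(rates, max_rate):
    """List languages whose unsafe rate exceeds the accepted maximum."""
    return sorted(language for language, rate in rates.items() if rate > max_rate)


# Toy labels, made up for this example only.
toy_labels = [
    ("lang_a", False), ("lang_a", False), ("lang_a", False), ("lang_a", True),
    ("lang_b", True), ("lang_b", True), ("lang_b", False), ("lang_b", True),
]

rates = unsafe_rate_by_language(toy_labels)
flagged = languages_over_threshold(rates, max_rate=0.5)
print(rates, flagged)
```

Reporting the rate per language, rather than as one aggregate number, keeps a weak result in a low-resource language from being hidden by strong results in English.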

To improve LLM evaluation for responsible AI, consider partnering with Toloka. We specialize in LLM evaluation and SFT dataset creation, and we have access to experts in low-resource languages. Get in touch, and let's work together to build more inclusive and accountable AI systems.

The original version of this article was published on Spiceworks.
