
Introducing JEEM: A new benchmark for evaluating low-resource Arabic dialects

Apr 14, 2025

News


Modern LLMs struggle with reasoning in low-resource languages, mainly due to two key factors: a lack of training data that accurately represents the wide variety of dialects, and insufficient evaluation data to assess model performance. Current benchmarks rely on translated data and often overlook critical cultural differences. There is an evident need for language-specific data to improve the models' understanding of languages with various dialects and cultural contexts.

Arabic is an excellent example of how limited these benchmarks can be. It's the fifth most spoken language worldwide, yet it rarely appears in research benchmarks outside its standard form, Modern Standard Arabic (MSA). While many Arabs learn MSA for formal communication, they usually speak local dialects shaped by their region's culture, history, and geography. Each dialect carries unique vocabulary, expressions, and cultural subtleties that a model trained solely on MSA may not fully understand. In fact, these differences can make it difficult even for native speakers of different Arabic dialects to understand each other.

Language models need to capture the distinct words, meanings, and cultural nuances used in everyday conversation. By improving their understanding of dialects, we can enhance the models' ability to represent how language is used in real life, leading to more effective communication tools that honor and maintain local cultural identities.

The demand for a new benchmark

To address the gap in Arabic datasets, the Toloka Research team partnered with Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), the fastest-growing research organization in the Gulf region. We augmented MBZUAI's expertise in Arabic multimodal models with Toloka's data annotation capabilities to produce JEEM, a new benchmark for evaluating models in low-resource Arabic dialects.

JEEM consists of image captioning and visual question-answering tasks in four regional dialects of Arabic: Jordanian (Levantine), Egyptian, Emirati (Khaleeji), and Moroccan (Maghrebi). A high-quality dataset pioneering LLM evaluation in Arabic dialects, JEEM measures how well multimodal models handle the unique aspects of these underrepresented dialects and adapt to real-world linguistic and cultural diversity.

Why is JEEM so important?

Our benchmark is the first of its kind for several reasons. JEEM's main purposes are to:

  • Help developers evaluate models in low-resource variations of Arabic.

  • Support the development of multimodal models, especially in dialects that are underrepresented in available training data.

  • Highlight how language models can be proficient in the standard version of a language yet show significant shortcomings in other varieties of the same language.

Perhaps more importantly, JEEM demonstrates why we can't omit dialects from datasets, even if they all derive from the same language. If we do so, we overlook essential details related to different cultures, environments, and objects, causing models to misinterpret visual and linguistic clues. Testing models on JEEM provides solid evidence of the importance of including diverse dialect data in evaluation setups.

How we collected multimodal data in four dialects

JEEM's data quality rests on a comprehensive collection process. We gathered images from three primary sources: the Wikimedia Archive, Flickr, and personal collections contributed by the authors, showcasing typical scenes of daily life. Images sourced from Wikimedia and Flickr underwent manual review and filtering to ensure cultural relevance. We opted for this mixed sourcing strategy to compile a dataset that authentically represents cultural references from different regions.

We recruited native speakers of the four dialects to annotate the samples. Each annotator passed a qualification test to verify their writing skills and competence in both the target dialect and Modern Standard Arabic. Specialists reviewed the tests and selected the strongest writers to participate in the project.

Annotators were tasked with describing images, writing questions about the descriptions, and writing answers to other participants' questions. To ensure consistency in the data quality, annotators followed a detailed set of guidelines for writing concise, relevant, and natural-sounding texts.
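To make the setup concrete, a single annotated example can be thought of as one image with a dialect caption plus a set of question-answer pairs. The sketch below is our own illustration of that structure; the field names are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class QAPair:
    question: str   # written in the annotator's regional dialect
    answer: str     # written by another native speaker of the same dialect

@dataclass
class JeemExample:
    image_path: str             # image from Wikimedia, Flickr, or a personal collection
    dialect: str                # "Jordanian", "Egyptian", "Emirati", or "Moroccan"
    category: str               # one of the 13 topic categories, e.g. food and beverages
    caption: str                # detailed description written by a native speaker
    qa_pairs: list[QAPair] = field(default_factory=list)
```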

Ensuring data quality

Our data collection pipeline was the key to quality assurance in this project. For quality control, qualified specialists—native speakers of each dialect with a background in NLP or computational linguistics—reviewed each task. Depending on the quality of a submission, they could accept it, edit it themselves, or reject it and assign it to another writer.
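In code terms, the review step boils down to three outcomes per submission. Here is a minimal sketch of that logic; the function and status names are ours, purely for illustration, not part of any actual Toloka tooling.

```python
from enum import Enum

class Verdict(Enum):
    ACCEPT = "accept"   # meets the guidelines as written
    EDIT = "edit"       # reviewer fixes minor issues themselves
    REJECT = "reject"   # sent back to be rewritten by another annotator

def resolve(text: str, verdict: Verdict, reviewer_fix: str | None = None) -> str | None:
    """Return the final text after expert review, or None if the task must be reassigned."""
    if verdict is Verdict.ACCEPT:
        return text
    if verdict is Verdict.EDIT:
        return reviewer_fix if reviewer_fix is not None else text
    return None  # REJECT: the task goes back to the pool for another writer
```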

The usefulness of the final dataset depended highly on dialect diversity. For this reason, the annotators completed the tasks using the dialect of their region.

Ultimately, we developed a comprehensive benchmark encompassing four dialects and their unique characteristics in visual tasks. For example, the image captions were written to enable someone without access to the image to create a mental picture of it. This was achieved by explaining interactions between the objects, including key details, and using precise terminology. Similarly, the questions posed were designed to explore background details, potential future actions, and even emotional responses, such as whether an object looks appealing. The answers were formulated using the image and relevant cultural context.

With consistent quality control at every stage of the process, we achieved a high-quality and reliable dataset for evaluating models in Arabic dialects.

Dataset overview

The final dataset comprises 2,178 annotated images organized into 13 categories and four dialects. These categories feature topics like food and beverages, transport, nature, and sports, reflecting everyday scenes and conversations. 

The distribution of images across topics and dialects is illustrated in the graph below. The Egyptian dialect is the most common, making up almost 40% of the dataset. 

[Graph: Distribution of images across topics and dialects]
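If the image metadata were exported as a simple table with one row per image (an assumption about the layout on our side, not the published file format), the distribution above could be reproduced in a few lines of pandas:

```python
import pandas as pd

# Hypothetical metadata file: one row per annotated image,
# with "dialect" and "category" columns.
meta = pd.read_csv("jeem_metadata.csv")

# Share of each dialect among the 2,178 images
# (Egyptian accounts for almost 40% in our data).
print(meta["dialect"].value_counts(normalize=True).round(3))

# Breakdown of the 13 topic categories by dialect,
# the same view the graph above visualizes.
print(pd.crosstab(meta["category"], meta["dialect"]))
```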

If you want to learn more about the methodology, read the research paper.

How do VLMs perform on this benchmark?

We assessed five VLMs trained on Arabic data and GPT-4 for comparison. We conducted a GPT-based evaluation and a human assessment, scoring each model in consistency, relevance, fluency, and dialect on a scale of 1 to 5, where 1 indicates failing to meet the criteria and 5 signifies full compliance. This comprehensive evaluation shows how well the models produce captions that match the image, explain its key elements, and use the target dialect naturally. 
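For readers curious what the GPT-based part of this setup looks like in practice, here is a minimal sketch of an LLM-as-judge scorer. The prompt wording, judge model name, and score parsing are our simplifications for illustration, not the exact protocol from the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CRITERIA = ["consistency", "relevance", "fluency", "dialect"]

def judge_caption(reference: str, candidate: str, dialect: str) -> dict[str, int]:
    """Ask a judge model to rate a caption from 1 (fails the criterion) to 5 (fully meets it)."""
    prompt = (
        f"You are evaluating an Arabic image caption written in the {dialect} dialect.\n"
        f"Reference description of the image:\n{reference}\n\n"
        f"Candidate caption:\n{candidate}\n\n"
        "Rate the candidate from 1 to 5 on each of these criteria: "
        + ", ".join(CRITERIA)
        + ". Answer with one line per criterion in the form 'criterion: score'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model, not necessarily the one used for JEEM
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    scores: dict[str, int] = {}
    for line in response.choices[0].message.content.splitlines():
        name, _, value = line.partition(":")
        if name.strip().lower() in CRITERIA and value.strip():
            scores[name.strip().lower()] = int(value.strip()[0])
    return scores
```

A text-only judge scoring against a reference caption is a simplification; the point is the 1-to-5 scoring pattern across the four criteria.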

1. Image captioning

In addition to the criteria above, we also used automated metrics such as BLEU and BERTScore for a more thorough assessment. Because of the morphological complexity of Arabic dialects, however, these metrics are less informative than they are for English and fall short of the other evaluation criteria.
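As a rough illustration of how those automated scores can be computed, here is a minimal sketch using the sacrebleu and bert-score packages on a toy caption pair; the example sentences are invented.

```python
import sacrebleu
from bert_score import score

# Toy example: one model-generated caption vs. one human reference (Arabic).
hypotheses = ["قطة تجلس على كرسي في المطبخ"]
references = ["قطة صغيرة قاعدة على كرسي خشب في المطبخ"]

# Corpus-level BLEU relies on exact n-gram overlap, which is part of why it
# is brittle for morphologically rich dialects.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")

# BERTScore compares contextual embeddings instead of surface tokens;
# lang="ar" selects a multilingual backbone.
P, R, F1 = score(hypotheses, references, lang="ar")
print(f"BERTScore F1: {F1.mean().item():.3f}")
```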

The graph below shows how the models perform in relevance, dialect authenticity, fluency, and consistency.

[Graph: Human evaluation of image captioning for different Arabic varieties]


The findings indicate that the models are fluent in their outputs but often struggle to generate contextually relevant captions and understand dialect nuances. Here are a few interesting takeaways:

  • While the LM foundation of these models allows them to create fluent answers, they still need improvements to handle the dialectal aspects of the task.

  • The GPT-based evaluation shows that all five VLMs struggle with consistency, relevance, and dialect. This indicates that the models can't produce a caption that describes the image well.

  • AyaV ranks below GPT-4 in terms of dialect accuracy but substantially surpasses all other models.

  • GPT-4 is a superior model for image captioning in Arabic, as it even surpasses human references on some criteria in Jordanian and Egyptian.

  • That being said, GPT-4 has difficulties with grounding in the Emirati dialect, emphasizing the need for good regional coverage in VLM training.

  • Human evaluations indicated that all five models underperformed across most criteria, with particularly low scores in dialect authenticity. GPT-4 had a slight advantage over the others.

2. Visual question answering

In our evaluation of the VQA task, we focused solely on GPT-4-based evaluation, grading the models on consistency, relevance, fluency, and dialect. This approach is necessary because the questions are descriptive and may have multiple valid answers, so GPT-4 provides a more equitable evaluation setting. Let's take a look at the leaderboard, along with some key takeaways.

[Graph: GPT-4-based evaluation of question answering for different Arabic varieties]

  • Similar to the image captioning task, the models are able to produce fluent answers across all dialects.

  • The VLMs struggle again with the other criteria and underperform in consistency, relevance, and dialect. As such, they fail to generate proper answers to the presented questions.

  • GPT-4 achieves the highest scores across all metrics: 4.67 in fluency (Jordanian), 4.56 in dialect (Moroccan), 3.72 in relevance (Emirati), and 3.64 in consistency (Emirati).

  • These low scores expose the limitations current VLMs face when interpreting culturally rich visuals and expressing them in the target dialect, which further reinforces the motivation and demand for this benchmark.

What does this mean for future models?

As more researchers acknowledge the importance of monolingual or dialect-focused models, we expect to see a growing interest in datasets specific to different regions.

JEEM proves that current models can't handle different language dialects yet. At Toloka, we solve this problem by sourcing data from native speakers, even in low-resource languages. This preserves cultural nuances that might otherwise be overlooked, leading to a more realistic assessment. 

Our multimodal Arabic benchmark is a useful tool for identifying model strengths and areas for improvement. Similar benchmarks may be developed for new models to focus on other languages and dialects. 

Are you looking to enhance your model's capabilities in a target language?

Contact us for a custom dataset that fits your use case and target language. We deliver robust datasets that undergo multi-step quality control procedures and are backed by native speakers trained for data annotation in various tasks.
