JEEM
A visual question answering (VQA) benchmark for four Arabic dialects
Our evaluation shows that leading open-source Arabic models struggle with dialect-specific tasks. Better dialect understanding would help models interpret cultural and contextual cues in both text and images.

JEEM’s data structure
JEEM consists of 2196 annotated images distributed across four dialects:
Jordan (Levantine) — 606 images
Emirates (Gulf) — 150 images
Egypt (Egyptian) — 863 images
Morocco (Maghrebi) — 577 images
A smaller cross-dialect subset of 100 images is annotated by all four dialect teams for direct comparison
Images cover a range of topics: transport, food and beverages, places, nature, sports, arts and culture, education, technology, and others
Distribution of dialects in JEEM
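As a quick sanity check on the numbers above, the per-dialect shares can be computed directly from the reported counts (a minimal sketch; the dictionary labels simply mirror the list above):

```python
# Per-dialect image counts as reported for JEEM.
counts = {
    "Jordan (Levantine)": 606,
    "Emirates (Gulf)": 150,
    "Egypt (Egyptian)": 863,
    "Morocco (Maghrebi)": 577,
}

total = sum(counts.values())  # 2196 images overall

# Share of each dialect in the full set, as a percentage
# rounded to one decimal place.
shares = {d: round(100 * n / total, 1) for d, n in counts.items()}
# e.g. Egypt accounts for 39.3% of the images
```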
Our data collection process
Step 1: Image Collection
Step 2: Image Captioning
Step 3: Question Writing
Step 4: Question Answering
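The four steps above each contribute one field to an annotated sample. A hypothetical record layout mirroring that pipeline could look like this (field names are our assumptions for illustration, not the official schema):

```python
from dataclasses import dataclass

# Hypothetical sample layout following the four-step pipeline;
# field names are assumptions, not JEEM's official schema.
@dataclass
class JeemSample:
    image_path: str  # Step 1: collected image
    dialect: str     # one of the four dialect labels
    caption: str     # Step 2: dialectal image caption
    question: str    # Step 3: question written about the image
    answer: str      # Step 4: answer given in the same dialect

# Illustrative instance with placeholder content.
sample = JeemSample(
    image_path="images/food_001.jpg",
    dialect="Jordanian",
    caption="...",
    question="...",
    answer="...",
)
```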
Data samples
Cross-dialect data subset
100 images in the dataset are captioned by speakers of all four dialects for comparison. Some examples carry narrow cultural context that is easily misinterpreted even by Arabic speakers from other regions, and VLMs in general lack knowledge of these regional nuances.
For example, this image of Omani halwa is described as a different sweet depending on the annotator's region:
Jordanian
Traditional dessert... almonds... pistachios... with caraway (karawya) or molasses (dibs)
طبق حلو تقليدي... اللوز... الفستق... بالكراوية أو الدبس
Emirati
Omani halwa
حلوى عمانية
Egyptian
Pudding... chocolate... pine nuts
لبودنج... شيكولاتة... صنوبر
Moroccan
Chocolate... caramel... coconut and pistachios
شكلاط... كراميل... بالكوكو و بيسطاش
Model performance
We ran comprehensive evaluations of recent Arabic VLMs (Maya, PALO, Peacock, AIN, AyaV) alongside GPT-4.
The evaluation process covered three types of metrics:
Surface-level and embedding-based metrics (BLEU, CIDEr, ROUGE, BERTScore)
Human evaluation of image captioning
LLM-as-a-judge evaluation of image captioning and question answering
Human and LLM-based evaluations focused on the same four criteria: Consistency, Relevance, Fluency, and Dialect Authenticity.
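As a toy illustration of the surface-level family of metrics, a simplified unigram recall in the style of ROUGE-1 can be computed as below (real evaluations use the standard metric implementations, not this sketch):

```python
# Toy sketch of a surface-level metric: fraction of reference
# unigrams that also appear in the candidate caption
# (a simplified ROUGE-1-recall-style score, for illustration only).
def unigram_recall(reference: str, candidate: str) -> float:
    ref_tokens = reference.split()
    cand_tokens = set(candidate.split())
    if not ref_tokens:
        return 0.0
    hits = sum(1 for tok in ref_tokens if tok in cand_tokens)
    return hits / len(ref_tokens)
```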
Correlation analysis showed strong agreement between LLM and human judgments (see our paper for details).
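One common way to run such an analysis is a rank correlation over paired per-item scores. A minimal, dependency-free sketch with hypothetical score lists (our illustration, not the paper's exact procedure):

```python
# Minimal Spearman rank correlation between paired score lists,
# e.g. hypothetical human vs. LLM-judge ratings on a 1-5 scale.
def ranks(xs):
    # Assign average ranks, handling ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

def spearman(a, b):
    # Pearson correlation computed on the ranks of a and b.
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)
```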
Contributors
Karima Kadaoui, MBZUAI
Hanin Atwany, MBZUAI
Hamdan Al-Ali, MBZUAI
Abdelrahman Mohamed, MBZUAI
Ali Mekky, MBZUAI
Sergey Tilga, Toloka
Natalia Fedorova, Toloka
Dr. Ekaterina Artemova, Toloka
Prof. Dr. Hanan Aldarmaki, MBZUAI
Prof. Dr. Yova Kementchedjhieva, MBZUAI








