Multimodal Conversations Dataset
This dataset is designed to enhance image understanding, reasoning, and visual analysis in vision-language models (VLMs).
Size
3,500+ dialogues
Format
Each sample consists of an image paired with a 4-turn user-assistant conversation.
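To make the format concrete, here is a minimal sketch of what one sample might look like. The field names ("image", "conversation", "role", "content") are assumptions for illustration, not the dataset's published schema, and a "turn" is taken here to mean one user message plus one assistant reply.

```python
# Hypothetical record layout for one sample; all field names are assumed.
sample = {
    "image": "images/0001.jpg",  # path or URL to the paired image
    "conversation": [  # assumed: 4 turns = 4 alternating user/assistant pairs
        {"role": "user", "content": "What is happening in this scene?"},
        {"role": "assistant", "content": "A street market at dusk, with vendors closing their stalls."},
        {"role": "user", "content": "What are the stall canopies made of?"},
        {"role": "assistant", "content": "They appear to be canvas stretched over wooden frames."},
        {"role": "user", "content": "Does anything suggest the time of year?"},
        {"role": "assistant", "content": "String lights and warm clothing hint at late autumn."},
        {"role": "user", "content": "What might happen next?"},
        {"role": "assistant", "content": "The remaining vendors will likely pack up as it gets dark."},
    ],
}

def is_valid(record, turns=4):
    """Check that the conversation has `turns` alternating user/assistant pairs."""
    conv = record["conversation"]
    if len(conv) != 2 * turns:
        return False
    return all(
        msg["role"] == ("user" if i % 2 == 0 else "assistant")
        for i, msg in enumerate(conv)
    )

print(is_valid(sample))  # → True
```

A validator like `is_valid` is a common first step when loading conversation datasets, since downstream training code usually assumes strict role alternation.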
Quality
All dialogues were created and validated by trusted writers and editors, ensuring high-quality, natural interactions.
Skill group: Contextual understanding & inference
Percentages indicate the share of dialogues that exercise each skill; a dialogue can exercise several skills, so values within a group may sum to more than 100%.
Scene + Context Understanding (90.0%)
Grasping the overall environment, interactions, and relationships within a visual scene.
Cultural + Historical Understanding (27.0%)
Recognizing culturally significant symbols, practices, and temporal cues to place the image in a broader context.
Temporal and Causal Reasoning (12.0%)
Interpreting sequences, predicting outcomes, and inferring cause-and-effect relationships.
Skill group: Visual detail analysis
Object and Attribute Identification (72.0%)
Detecting and distinguishing objects and their attributes, including specific features like color and texture.
Material and Surface Recognition (31.0%)
Differentiating between various materials and understanding their visual and tactile qualities.
Comparative Visual Evaluation (28.0%)
Analyzing and comparing visual characteristics, such as size, proximity, and relationships.
Subject areas covered