Multimodal Conversations Dataset

This dataset is designed to enhance image understanding, reasoning, and visual analysis in vision-language models (VLMs).

Size

3,500+ dialogues

Format

Each sample consists of an image paired with a 4-turn user-assistant conversation.
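The record below is a minimal sketch of what one sample might look like. The field names ("image", "conversation", "role", "content") and the interpretation of "4-turn" as four user-assistant exchanges are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical sample record: an image path paired with a 4-turn conversation.
# All field names here are illustrative assumptions, not the real schema.
sample = {
    "image": "images/example_0001.jpg",
    "conversation": [
        {"role": "user", "content": "What is happening in this scene?"},
        {"role": "assistant", "content": "A street market at dusk, with vendors setting up stalls."},
        {"role": "user", "content": "What materials are the stalls made of?"},
        {"role": "assistant", "content": "Mostly wooden frames covered with canvas awnings."},
        {"role": "user", "content": "Does anything suggest the season?"},
        {"role": "assistant", "content": "Light jackets and early sunset suggest autumn."},
        {"role": "user", "content": "Compare the two nearest stalls by size."},
        {"role": "assistant", "content": "The left stall is roughly twice as wide as the right one."},
    ],
}

# Four exchanges: each turn is one user message followed by one assistant reply.
user_turns = sum(1 for t in sample["conversation"] if t["role"] == "user")
assistant_turns = sum(1 for t in sample["conversation"] if t["role"] == "assistant")
print(user_turns, assistant_turns)  # 4 4
```

A loader would typically pair each record's image file with the message list when building model inputs.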

Quality

All dialogues were created and validated by trusted writers and editors, ensuring high-quality, natural interactions.

Skill group: Contextual understanding & inference

Scene + Context Understanding (90.0%)

Grasping the overall environment, interactions, and relationships within a visual scene.

Cultural + Historical Understanding (27.0%)

Recognizing culturally significant symbols, practices, and temporal cues to place the image in a broader context.

Temporal and Causal Reasoning (12.0%)

Interpreting sequences, predicting outcomes, and inferring cause-and-effect relationships.

Skill group: Visual details analysis

Object and Attribute Identification (72.0%)

Detecting and distinguishing objects and their attributes, including specific features like color and texture.

Material and Surface Recognition (31.0%)

Differentiating between various materials and understanding their visual and tactile qualities.

Comparative Visual Evaluation (28.0%)

Analyzing and comparing visual characteristics, such as size, proximity, and relationships.

Subject areas covered

Landscape 12%

Food 9%

Nature & wildlife 8.8%

Architecture 7.8%

Street photography 4.5%


Contact us to purchase the dataset
