Multimodal SFT data for the win: Pushing model performance in the real world
As GenAI applications go multimodal, they are learning to process images, text, audio, and video at the same time. For instance, models can answer questions about a picture or extract information from a video. To handle increasingly complex multimodal inputs, these models need supervised fine-tuning (SFT) that encompasses all data types.
Toloka specializes in creating high-quality domain-specific SFT data for multimodal tasks. After fine-tuning on our data, models can scale efficiently, handling real-world scenarios with high accuracy and precision. Let's look at how multimodal SFT works and why it's an essential tool for model development.
The role of SFT in multimodal performance
The SFT process uses domain-specific, labeled data to tune a pre-trained model for optimized performance in a particular application. While pre-trained models set a solid base with a general understanding of the world, they still need examples of desired behavior to learn how to apply that knowledge efficiently.
SFT is essential in multimodal tasks because it teaches the model to cross-reference between different data modalities, such as text, video, and images, and produce contextually correct responses. It becomes crucial when the model needs to provide specific answers to domain-driven questions. The idea is to push the model beyond general multimodal understanding to a level of domain expertise that allows it to correctly interpret data and generate context-specific output.
Why current multimodal LLMs aren't enough
While multimodal LLMs have seen tremendous progress, they still struggle with tasks that require strong visual and textual understanding. Current Vision Language Models (VLMs) are sufficient for basic image captioning or video summarization, but they lack more complex skills, such as spatial and temporal reasoning, interpreting charts, graphs, and tables, fine-grained visual understanding, and navigating multilingual or cultural contexts. Moreover, large VLMs are typically trained on broad, generalized datasets that don't provide the depth needed for specialized applications, so they struggle with highly technical tasks, like interpreting detailed schematics of complex industrial processes.
To achieve better performance, they require more cross-modal training data.
Moreover, these models struggle with grounding: the ability to link outputs to reliable data sources and react to multimodal stimuli in a manner similar to human thought processes. When faced with more complicated visual data, most models give generalized, non-specific answers, leading to hallucinations or responses irrelevant to the context. Without proper grounding, current VLMs cannot reliably meet accuracy thresholds, especially in domain-specific tasks.
Creating multimodal SFT data
At Toloka, we systematically generate SFT data tailored to specific use cases. Our approach ensures that multimodal tasks involving video-based conversations, image analysis, or audio processing are backed by data that brings out the model's full potential in that context.
We source data directly from clients or from licensed datasets, comprising videos, images, and text in various formats. This includes searching for relevant data that closely matches the target domain. Our AI Tutors then provide human annotations. For instance, they write realistic conversations around video content and craft detailed, grounded questions about charts and graphs.
Below, we provide examples of multimodal training data written by our AI Tutors. Each example includes visual input and a series of questions with ground truth answers. These answers are written to meet quality criteria such as accuracy, consistency, and granularity, and they adhere to the specific use case, verified sources, and domain-specific guidelines to ensure reliability and contextual relevance.
These interactions are then used as training examples to help the model understand and respond to multimodal inputs.
In the first example, we have visual data from an image. The AI Tutor creates a possible conversation scenario around identifying objects, describing features, colors, and structures, or giving general details about the picture. A mock dialog could look like this:
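The original mock dialog appears as an image. Purely as a rough sketch of how such a sample might be structured for training, a single record could look something like the snippet below; the field names, file path, and dialog content are hypothetical, not Toloka's actual schema.

```python
# Hypothetical structure for one image-grounded SFT sample.
# Field names, the file path, and the dialog content are illustrative
# assumptions, not Toloka's actual data schema.
image_dialog_sample = {
    "modality": "image",
    "source": "client_dataset/apple_orchard.jpg",  # placeholder path
    "turns": [
        {"role": "user",
         "content": "What objects are visible in this picture?"},
        {"role": "assistant",
         "content": "The image shows an apple tree with ripe red apples, several of which have fallen onto the grass below."},
        {"role": "user",
         "content": "What color are the apples?"},
        {"role": "assistant",
         "content": "The apples are red, with a few showing yellow-green patches."},
    ],
}
```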
It takes thousands of conversations like this to train a VLM. Thanks to this data, the model learns to recognize and describe elements in an image and to give contextually accurate responses.
Another use case we cover at Toloka is teaching models analytical reasoning across various industry domains. In this case, the AI Tutors create conversations with interpretative and mathematical components, like in the simplified example below.
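The actual chart conversation is shown as an image. As a sketch only, an interpretative-plus-mathematical exchange might be captured like this; the chart, its numbers, and the schema are invented for illustration.

```python
# Hypothetical chart-reasoning SFT sample. The chart, its numbers,
# and the schema are illustrative assumptions, not real Toloka data.
chart_dialog_sample = {
    "modality": "image",
    "source": "licensed_dataset/quarterly_revenue_chart.png",  # placeholder path
    "turns": [
        {"role": "user",
         "content": "What is the trend in revenue across the four quarters shown?"},
        {"role": "assistant",
         "content": "Revenue rises steadily: 10M in Q1, 12M in Q2, 15M in Q3, and 18M in Q4."},
        {"role": "user",
         "content": "What is the average quarterly revenue?"},
        {"role": "assistant",
         "content": "(10 + 12 + 15 + 18) / 4 = 13.75, so the average is 13.75M per quarter."},
    ],
}
```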
The resulting domain-specific fine-tuning dataset will teach the model to have a deep understanding of the chart and to respond with statistically relevant answers. Each sample conversation helps the model build its ability to understand complex visual data.
In the third example, we provide the model with video content. AI Tutors create dialogs about specific segments and moments, as well as the overall theme in the video:
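As before, the actual dialog is shown as an image. A hypothetical video-grounded sample might also reference timestamps so the model learns to anchor its answers to specific segments; the schema, timestamps, and content below are assumptions for illustration.

```python
# Hypothetical video-grounded SFT sample. Timestamps, schema, and
# dialog content are illustrative assumptions only.
video_dialog_sample = {
    "modality": "video",
    "source": "client_dataset/assembly_line_demo.mp4",  # placeholder path
    "turns": [
        {"role": "user",
         "content": "What happens between 00:15 and 00:30?"},
        {"role": "assistant",
         "content": "A worker places a component on the conveyor belt, and a robotic arm fastens it in place."},
        {"role": "user",
         "content": "What is the overall theme of the video?"},
        {"role": "assistant",
         "content": "It demonstrates an automated assembly process, from part placement to final inspection."},
    ],
}
```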
This discussion focuses on understanding and explaining what is happening in the video. Models trained on such dialogs can track specific events in the video and provide responses relevant to the content.
In the examples above, AI Tutors write queries to help the model correctly identify and describe elements from different data modalities. The same process can be applied to other kinds of data, including auditory modalities such as speech, music, and sounds.
These types of dialogs are designed to improve the models' capability in interpreting data, analyzing graphs and trends, and communicating insights from different data modalities. Through iterative questioning and answering, the model learns to generate coherent responses tailored to the context of the visual data.
Dialog generation pipeline
Toloka pipelines are designed to guarantee consistent data quality. Let's look at the main stages for generating the dialogs.
Stage 1: dialog generation. First, AI Tutors write possible conversation scenarios between a user and a model based on the input, as seen in the visuals above. We then run automated checks on language, relevance to the visual data, punctuation, and adherence to specific guidelines, such as word count or question types. These automated checks reduce errors and free human auditors to focus on more substantial reasoning or logical consistency issues.
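As a minimal sketch of what such automated checks might look like in code, assuming dialog turns stored as role/content pairs; the specific thresholds and rules are assumptions, not Toloka's actual guidelines.

```python
# Minimal sketch of automated guideline checks on a generated dialog.
# Thresholds and rules are illustrative assumptions, not Toloka's
# production checks.
def run_automated_checks(turns: list[dict], max_words_per_turn: int = 120) -> list[str]:
    """Return a list of detected issues; an empty list means the dialog passes."""
    issues = []
    for i, turn in enumerate(turns):
        text = turn["content"].strip()
        if not text:
            issues.append(f"turn {i}: empty content")
            continue
        if len(text.split()) > max_words_per_turn:
            issues.append(f"turn {i}: exceeds {max_words_per_turn} words")
        if text[-1] not in ".?!":
            issues.append(f"turn {i}: missing end punctuation")
        if turn["role"] == "user" and "?" not in text:
            issues.append(f"turn {i}: user turn contains no question")
    return issues
```

Dialogs that fail checks like these are fixed or regenerated before human editors focus on reasoning and logical consistency.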
Stage 2: editing. To ensure data quality, a second group of AI Tutors reviews each example. Their assignment is to check the answers for helpfulness, harmlessness, and veracity. Based on that assessment, they either approve the data for further processing, make edits (improving clarity and correctness), or return the example for rewriting.
Stage 3: quality audit. After the editing stage, QA auditors check a sample of the data. This creates a second feedback loop: minor issues flagged by QA auditors are returned to the editing stage for slight adjustments or clarity improvements, while irrelevant or incorrect dialog examples are sent back to the generation stage for major revisions. This two-step verification system allows for ongoing improvement at multiple stages of the process.
Visualization of dialog creation pipeline
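To make the two feedback loops concrete, here is a simplified control-flow sketch of the three stages; the stage functions are trivial stand-ins, and only the loop structure mirrors the process described above.

```python
# Simplified sketch of the three-stage pipeline and its two feedback
# loops. The stage functions are trivial stand-ins; only the control
# flow mirrors the process described above.
def generate_dialog(visual_input):          # Stage 1 stand-in: AI Tutors write the dialog
    return {"input": visual_input, "turns": []}

def passes_automated_checks(dialog):        # automated language/relevance/guideline checks
    return True

def edit_dialog(dialog):                    # Stage 2 stand-in: returns (dialog, verdict)
    return dialog, "approve"                # verdict may also be "rewrite"

def audit_sample(dialog):                   # Stage 3 stand-in: QA audit of a data sample
    return "pass"                           # or "minor_issues" / "major_issues"

def produce_dialog(visual_input):
    while True:
        dialog = generate_dialog(visual_input)
        if not passes_automated_checks(dialog):
            continue                         # failed checks: regenerate
        dialog, verdict = edit_dialog(dialog)
        if verdict == "rewrite":
            continue                         # major issues: back to generation
        audit = audit_sample(dialog)
        if audit == "major_issues":
            continue                         # irrelevant or incorrect: back to generation
        if audit == "minor_issues":
            dialog, _ = edit_dialog(dialog)  # minor issues: back to editing
        return dialog                        # approved for the training set
```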
Grounding and contextual constraints
One of the significant challenges for VLMs is ensuring the model maintains tight grounding within the context. The model should stick to what it can observe and not speculate about anything that cannot be inferred from the data.
For instance, when describing the image below, annotators should choose grounded questions, such as "What color are the apples?" or "How many apples are on the ground?" because that information can be derived directly from the image.
Conversely, they must avoid questions like "What region has the best apples?" since that conclusion cannot be reached from the image alone. The problem with ungrounded questions is that the model will still try to generate a response, and because the source information is not available, this can cause hallucinations and bias. This is why we avoid this type of question when creating conversational examples.
Example of grounded conversation (first question) and ungrounded one (second question).
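As a small illustration of how this constraint can be enforced during annotation review, candidate questions can be tagged as grounded or ungrounded before they enter a dialog; the snippet below simply restates the apple example, with hypothetical labels.

```python
# Illustrative labeling of candidate questions for the apple image.
# "grounded" questions can be answered from the image alone;
# "ungrounded" ones require outside knowledge and are discarded.
candidate_questions = [
    ("What color are the apples?",         "grounded"),    # visible in the image
    ("How many apples are on the ground?", "grounded"),    # countable in the image
    ("What region has the best apples?",   "ungrounded"),  # cannot be inferred from the image
]

kept = [q for q, label in candidate_questions if label == "grounded"]
```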
Making SFT data count
Even though open VLMs perform well on general data, they often produce hallucinations and errors when handling specialized multimodal data. SFT is crucial for developing multimodal solutions, especially domain-specific applications, which require high-quality, reliable data. This is where Toloka's experience and versatility are instrumental in creating SFT data that counts.
Our pipeline combines human experts and complex verification systems to generate high-quality multimodal data, making sure the models trained on our data can handle different multimodal inputs with a high degree of contextual understanding. With thorough quality control, we create data that teaches the model to interpret and respond to any modality, including images, videos, and graphs.
For businesses in many industries, the ability to process and understand multimodal data offers a clear path to innovation. The best way to improve your model's performance in modalities that are relevant to your business is to fine-tune using data tailored to your industry. Training models to analyze and fuse different types of data can help your company automate complex tasks and make more informed data-driven decisions.
Learn more
Are you looking to maximize the potential of your AI systems with custom multimodal data? Reach out to Toloka for a solution tailored to your needs!
Article written by:
Elena Trajkova
Oleg Pavlov
Updated:
Jan 14, 2025