Stir but do not mix: Crafting Your Own Synthetic Data Pipeline for SFT

by Ivan Yamshchikov

Introduction

Welcome to the digital kitchen of AI development! Today, we're rolling up our sleeves to “cook” a synthetic data pipeline, perfect for Supervised Fine-Tuning (SFT). Just as in any culinary adventure, the right ingredients, techniques, and a pinch of inspiration are essential. Let’s get started!

Ingredients

  1. A hearty batch of source data (raw texts relevant to your domain, sourced from HuggingFace, Kaggle, or the web)

  2. A collection of text slicing tools (for chopping long texts into manageable paragraphs)

  3. A filter for refining your data stew (to keep only the most relevant chunks)

  4. An open-source LLM for profiling and seasoning. We prefer Mixtral 8x7B, but you can use whichever suits your taste

  5. Web queries crafted from profiles to fetch fresh data for our digital kitchen

  6. Some LLM-based creativity serum to generate diverse prompts

  7. Tools for textual transformation and deduplication

Cooking Instructions

Step 1: Gather Your Ingredients

Start by collecting a set of raw texts that correspond to your domain of interest. These can be open-source datasets already published online or the results of web scraping.
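
If you want a concrete starting point, a few lines with the HuggingFace datasets library are usually enough; the dataset name below is purely illustrative, so swap in a corpus from your own domain.

```python
# Illustrative only: load an open dataset from the HuggingFace Hub.
# Replace "ag_news" with a corpus that matches your domain.
from datasets import load_dataset

dataset = load_dataset("ag_news", split="train")
raw_texts = [row["text"] for row in dataset]
print(f"Collected {len(raw_texts)} raw documents")
```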

Step 2: Chop It Up

Slice your long texts into smaller, more digestible paragraphs. Think of this as prepping your vegetables for a stir-fry: you want chunks that are long enough to be meaningful but short enough to fit several into the context window of the LLM you intend to use later.
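
Here is a minimal sketch of paragraph-level chunking; the 2,000-character limit is an assumption and should be tuned to the context window of the LLM you plan to use.

```python
def chunk_text(text: str, max_chars: int = 2000) -> list[str]:
    """Group paragraphs into chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for paragraph in paragraphs:
        # Start a new chunk if adding this paragraph would exceed the limit
        if current and len(current) + len(paragraph) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += paragraph + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks
```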

Step 3: Sift Through Your Chunks

Use your filtering tool to sift through the text chunks, keeping only those that are relevant to your specific needs, such as domain and subdomains.
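
One simple way to do this is to compare each chunk's embedding against a short description of the target domain; the embedding model, the example description, and the threshold below are placeholder assumptions, not fixed recommendations.

```python
# Hypothetical relevance filter: keep chunks whose embedding is close to a
# short description of the target domain.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
domain_description = "Personal finance: budgeting, savings, credit and loans"
domain_emb = model.encode(domain_description, convert_to_tensor=True)

def filter_chunks(chunks: list[str], threshold: float = 0.3) -> list[str]:
    chunk_embs = model.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(domain_emb, chunk_embs)[0]
    return [chunk for chunk, score in zip(chunks, scores) if score >= threshold]
```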

Step 4: Profile Seasoning

Create a “profile” for each document using your LLM. This profile is like a structured summary, identifying key flavors such as subdomain, main topic, and use case.
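
A sketch of what profiling might look like, assuming Mixtral 8x7B is served behind an OpenAI-compatible endpoint (for example via vLLM); the base URL, the model name, and the exact profile fields are assumptions. The call_llm() helper defined here is reused in the sketches that follow.

```python
import json
from openai import OpenAI

# Assumption: Mixtral 8x7B served locally behind an OpenAI-compatible API
# (e.g. via vLLM). Adjust base_url and model name to your own setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

PROFILE_PROMPT = """Read the document below and return a JSON object with the
fields "subdomain", "main_topic" and "use_case". Return only the JSON.

Document:
{document}"""

def profile_document(document: str) -> dict:
    return json.loads(call_llm(PROFILE_PROMPT.format(document=document)))
```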

Step 5: Fresh from the Web

Take an LLM of your choice and generate further web queries using the profiles you’ve created. Collect the most relevant pages from the web. Clean and parse these new ingredients before mixing them back into your source data. Repeat these five steps until you have enough data. In our projects we usually need 10 to 20 thousand prompts for a domain, so you can keep this number at the back of your mind.
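
The query-generation loop could look like the sketch below, which reuses the call_llm() helper from the profiling step; web_search() is a hypothetical stand-in for whatever search API you have access to and is assumed to return cleaned page texts.

```python
QUERY_PROMPT = """You are collecting reference material for the domain below.
Write 3 short web search queries, one per line.

Subdomain: {subdomain}
Main topic: {main_topic}
Use case: {use_case}"""

def queries_from_profile(profile: dict) -> list[str]:
    raw = call_llm(QUERY_PROMPT.format(**profile))
    return [line.strip() for line in raw.splitlines() if line.strip()]

def fetch_fresh_pages(profiles: list[dict]) -> list[str]:
    pages = []
    for profile in profiles:
        for query in queries_from_profile(profile):
            # web_search() is a hypothetical stand-in for your search API;
            # it should return cleaned page texts for the query.
            pages.extend(web_search(query, top_k=3))
    return pages
```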

Step 6: Prompt Generation

Use the profiles above and cook up a variety of prompts with your LLM.
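
Continuing the same pattern, here is a hedged sketch of prompt generation from profiles; the template wording is an assumption, and call_llm() is the helper defined in the profiling step.

```python
import re

PROMPT_GEN_TEMPLATE = """You are building an instruction dataset for the
subdomain "{subdomain}". Topic: {main_topic}. Use case: {use_case}.
Write 5 varied user prompts that a real person might ask, one per line."""

def prompts_from_profile(profile: dict) -> list[str]:
    raw = call_llm(PROMPT_GEN_TEMPLATE.format(**profile))
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    # Drop leading list markers such as "1." or "-" that the model may add
    return [re.sub(r"^\s*(?:\d+[.)]|-)\s*", "", line) for line in lines]
```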

Step 7: In-Breadth Flavor Testing

Evolve your prompts in-breadth with the LLM, generating multiple similar prompts across different topics and use cases to fully explore the flavor palette. The diversity of prompts is key to the success of the whole pipeline: you want your SFT dataset to cover a variety of use cases within a given domain. Toloka implements various automated and manual checks to guarantee that the domain is covered broadly.
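
One way to implement in-breadth evolution, in the spirit of Evol-Instruct, is sketched below; the template is an assumption, and call_llm() is again the helper from the profiling step.

```python
IN_BREADTH_TEMPLATE = """Here is an existing prompt from our dataset:

"{prompt}"

Write a brand-new prompt that belongs to the same domain but covers a
different topic or use case. Keep a similar level of difficulty.
Return only the new prompt."""

def evolve_in_breadth(prompts: list[str], rounds: int = 1) -> list[str]:
    evolved, frontier = list(prompts), list(prompts)
    for _ in range(rounds):
        # Each round spawns one new prompt per prompt in the current frontier
        frontier = [call_llm(IN_BREADTH_TEMPLATE.format(prompt=p)) for p in frontier]
        evolved.extend(frontier)
    return evolved
```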

Step 8: Add Lexical Spice

Once again, use your LLM to rephrase your prompts to add lexical diversity, ensuring a rich, complex taste that avoids bland repetition.
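
A rephrasing pass can be as simple as the sketch below; keeping both the original prompt and its rephrased variant is a design choice, and call_llm() is the helper from the profiling step.

```python
REPHRASE_TEMPLATE = """Rephrase the prompt below so that it keeps exactly the
same meaning but uses different wording and sentence structure.
Return only the rephrased prompt.

Prompt: {prompt}"""

def add_lexical_diversity(prompts: list[str]) -> list[str]:
    # Keep both the original prompt and its rephrased variant
    return prompts + [call_llm(REPHRASE_TEMPLATE.format(prompt=p)) for p in prompts]
```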

Step 9: Quality Control

Filter out any bad prompts—those that are irrelevant, meaningless, or just plain rubbish. This step is crucial for maintaining the quality of your dish. Toloka has a variety of quality control tools in place and we apply relevant tools at every step in this process.
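
Toloka's actual quality-control tooling is more involved, but one simple automated check is an LLM-as-judge filter along these lines; the verdict prompt is an assumption, and call_llm() is the helper from the profiling step.

```python
JUDGE_TEMPLATE = """Is the prompt below self-contained, meaningful and relevant
to the domain "{domain}"? Answer with a single word: YES or NO.

Prompt: {prompt}"""

def keep_good_prompts(prompts: list[str], domain: str) -> list[str]:
    kept = []
    for prompt in prompts:
        verdict = call_llm(JUDGE_TEMPLATE.format(domain=domain, prompt=prompt))
        if verdict.strip().upper().startswith("YES"):
            kept.append(prompt)
    return kept
```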

Step 10: Avoid Clumping

Finally, transform your text into embeddings and filter out any closely similar samples. This ensures each prompt in your collection remains distinct and covers the relevant areas of the domain.
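
A minimal deduplication sketch with sentence embeddings; the model name and the 0.9 similarity threshold are assumptions to be tuned for your data.

```python
from sentence_transformers import SentenceTransformer, util

def deduplicate(prompts: list[str], threshold: float = 0.9) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(prompts, convert_to_tensor=True)
    kept, kept_idx = [], []
    for i, prompt in enumerate(prompts):
        if kept_idx:
            sims = util.cos_sim(embeddings[i], embeddings[kept_idx])[0]
            if sims.max() >= threshold:
                continue  # too close to a prompt we already kept
        kept.append(prompt)
        kept_idx.append(i)
    return kept
```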

Our prompt starter is ready; now we need to add a pipeline of hearty answers to it. Parts of the prompt-collection recipe can be reused here.

Step 1: Synthetic Answers

You can use the same LLM that you used for prompt generation to produce a short-list of synthetic answers for each prompt.
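
Sampling the model several times at a non-zero temperature gives a small pool of candidates per prompt, as in this sketch (call_llm() is the helper from the profiling step; the number of candidates is an assumption).

```python
def generate_candidate_answers(prompt: str, n_candidates: int = 3) -> list[str]:
    # With a non-zero sampling temperature, repeated calls return a
    # short-list of different candidate answers for the same prompt.
    return [call_llm(prompt) for _ in range(n_candidates)]
```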

Step 2: Web Condiments

To improve the quality of your answers, you need depth. Generating several web queries per prompt and saving several top search results significantly extends the range of your answers.
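
A per-prompt variant of the query-generation sketch from the prompt pipeline could look like this; web_search() is the same hypothetical search helper as before, and the query count is an assumption.

```python
ANSWER_QUERY_TEMPLATE = """Write 3 short web search queries that would help to
answer the user prompt below, one per line.

Prompt: {prompt}"""

def search_context_for_prompt(prompt: str, results_per_query: int = 3) -> list[str]:
    raw = call_llm(ANSWER_QUERY_TEMPLATE.format(prompt=prompt))
    queries = [line.strip() for line in raw.splitlines() if line.strip()]
    pages = []
    for query in queries:
        # web_search() is the same hypothetical search helper as before
        pages.extend(web_search(query, top_k=results_per_query))
    return pages
```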

Step 3: Chop-n-RAG

As in the second step of the prompt generation pipeline, we split the relevant web pages into chunks. Now you can search for the chunks that are most relevant to a given prompt and add them to the model context (similar to a RAG approach). This extra knowledge adds domain-specific information, lowers hallucinations, and improves the overall accuracy of the answers.
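
A hedged sketch of this retrieval step: embed the chunks, pick the ones closest to the prompt, and prepend them to the generation request. The embedding model, the number of retrieved chunks, and the context template are assumptions; call_llm() is the helper from the profiling step.

```python
from sentence_transformers import SentenceTransformer, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")

RAG_TEMPLATE = """Use the context below to answer the question.

Context:
{context}

Question: {question}"""

def answer_with_context(prompt: str, chunks: list[str], top_k: int = 3) -> str:
    chunk_embs = retriever.encode(chunks, convert_to_tensor=True)
    prompt_emb = retriever.encode(prompt, convert_to_tensor=True)
    scores = util.cos_sim(prompt_emb, chunk_embs)[0]
    top = scores.topk(min(top_k, len(chunks))).indices
    context = "\n\n".join(chunks[int(i)] for i in top)
    # call_llm() is the helper from the profiling sketch above
    return call_llm(RAG_TEMPLATE.format(context=context, question=prompt))
```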

Step 4: Fine-Tuning with Connoisseurs

If you are not satisfied with your synthetic answers, you can bring in expert knowledge and evaluate the answers on various scales such as style, correspondence to the domain, context relevance, etc. Once you have these evaluations, you can fine-tune the model to the domain and further increase the quality of your synthetic answers.
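
As an illustration of how the expert scores might be folded back in, here is a sketch that keeps only highly rated prompt-answer pairs and writes them out as JSONL for fine-tuning; the field names, the 1-to-5 score scale, and the cutoff are assumptions.

```python
import json

def export_sft_data(records: list[dict], path: str, min_score: float = 4.0) -> None:
    """records: [{"prompt": ..., "answer": ..., "expert_score": 1..5}, ...]"""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            if record["expert_score"] >= min_score:
                f.write(json.dumps(
                    {"prompt": record["prompt"], "completion": record["answer"]},
                    ensure_ascii=False,
                ) + "\n")
```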

Congratulations, you’ve just prepared a synthetic data pipeline for SFT! Just like in cooking, the key to success in AI development is in carefully selecting your ingredients, applying the right techniques, and not being afraid to experiment. Tune on it while it’s hot — bon appétit in the world of AI!
