Welcome to the digital kitchen of AI development! Today, we're rolling up our sleeves to “cook” a synthetic data pipeline, perfect for Supervised Fine-Tuning (SFT). Just as in any culinary adventure, the right ingredients, techniques, and a pinch of inspiration are essential. Let’s get started!
A hearty batch of source data (raw texts relevant to your domain, sourced from HuggingFace, Kaggle, or the web)
A collection of text slicing tools (for chopping long texts into manageable paragraphs)
A filter for refining your data stew (to keep only the most relevant chunks)
An open-source LLM for profiling and seasoning (we prefer Mixtral 8x7B, but you can use one to your taste)
Web queries crafted from profiles to fetch fresh data for our digital kitchen
Some LLM-based creativity serum to generate diverse prompts
Tools for textual transformation and deduplication
Start by collecting a set of raw texts that correspond to your domain of interest. These can be open-source datasets already published online or the results of web scraping.
Slice your long texts into smaller, more digestible paragraphs. Think of this as prepping your vegetables for a stir-fry: you want chunks that are long enough to be meaningful but short enough to fit several into the context window of the LLM you intend to use later.
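The chunking step above can be sketched as a simple paragraph packer. This is a minimal illustration, not the pipeline's actual code: it splits on blank lines and packs by character count, whereas in practice you would budget by tokens using your LLM's tokenizer.

```python
# Sketch: split raw documents into paragraph-sized chunks.
# max_chars stands in for a token budget; swap in a real token
# count from your model's tokenizer for production use.

def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Split text on blank lines, then pack consecutive paragraphs
    into chunks no longer than max_chars each."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk if adding this paragraph would overflow.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

A single paragraph longer than the budget is kept whole here; a production version would fall back to sentence-level splitting for such cases.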
Use your filtering tool to sift through the text chunks, keeping only those that are relevant to your specific needs, such as domain and subdomains.
Create a “profile” for each document using your LLM. This profile is like a structured summary, identifying key flavors such as subdomain, main topic, and use case.
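One way to implement profiling is to ask the LLM for a small JSON object and parse it defensively. The prompt wording below is illustrative (not Toloka's actual template), and the LLM call itself is left out: only the prompt construction and response parsing are shown.

```python
import json

# Hypothetical profile schema: subdomain, main topic, use case.
PROFILE_PROMPT = """Read the document below and return a JSON object
with exactly these keys: "subdomain", "main_topic", "use_case".

Document:
{document}

JSON:"""

def build_profile_prompt(document: str) -> str:
    """Fill the profiling template for one document."""
    return PROFILE_PROMPT.format(document=document)

def parse_profile(llm_output: str) -> dict:
    """Parse the model's JSON answer, tolerating surrounding chatter."""
    start, end = llm_output.find("{"), llm_output.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in LLM output")
    profile = json.loads(llm_output[start:end + 1])
    missing = {"subdomain", "main_topic", "use_case"} - profile.keys()
    if missing:
        raise ValueError(f"profile is missing keys: {missing}")
    return profile
```

Extracting the outermost braces before `json.loads` makes the parser robust to models that wrap their answer in prose like "Sure, here is the profile:".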
Take an LLM of your choice and generate further web queries using the profiles you’ve created. Collect the most relevant pages from the web, then clean and parse these new ingredients before mixing them back into your source data. Repeat these five steps until you have enough data. In our projects we usually need 10 to 20 thousand prompts per domain, so keep that number in the back of your mind.
Use the profiles above and cook up a variety of prompts with your LLM.
Evolve your prompts in-breadth with the LLM, generating multiple similar prompts across different topics and use cases to fully explore the flavor palette. The diversity of prompts is key to the success of the whole pipeline: you want your SFT dataset to cover a variety of use cases within a given domain. Toloka implements various automated and manual checks to guarantee that the domain is covered broadly.
Once again, use your LLM to rephrase your prompts to add lexical diversity, ensuring a rich, complex taste that avoids bland repetition.
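The two evolution steps above boil down to prompt templates fed back into the same LLM. A minimal sketch, with illustrative template wording rather than Toloka's actual prompts:

```python
# Hypothetical templates for the two evolution steps.
IN_BREADTH_TEMPLATE = (
    "Here is a prompt from the domain '{domain}':\n\n{prompt}\n\n"
    "Write a new, different prompt from the same domain that covers "
    "another topic or use case. Output only the new prompt."
)

REPHRASE_TEMPLATE = (
    "Rewrite the following prompt with different wording while keeping "
    "its meaning intact:\n\n{prompt}\n\nOutput only the rewritten prompt."
)

def evolution_requests(prompt: str, domain: str, n_breadth: int = 3) -> list[str]:
    """Build one rephrasing request plus n_breadth in-breadth
    requests for a single seed prompt; send each to your LLM."""
    requests = [REPHRASE_TEMPLATE.format(prompt=prompt)]
    requests += [
        IN_BREADTH_TEMPLATE.format(domain=domain, prompt=prompt)
        for _ in range(n_breadth)
    ]
    return requests
```

Sampling the in-breadth template several times per seed (with a non-zero temperature) is what actually fans one prompt out into many.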
Filter out any bad prompts—those that are irrelevant, meaningless, or just plain rubbish. This step is crucial for maintaining the quality of your dish. Toloka has a variety of quality control tools in place and we apply relevant tools at every step in this process.
Finally, transform your text into embeddings and filter out any closely similar samples. This ensures each prompt in your collection remains distinct and covers the relevant areas of the domain.
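The deduplication step can be sketched as a greedy pass over prompt embeddings: keep a prompt only if it is not too similar to anything already kept. How you obtain the embeddings is up to you (a sentence-embedding model, for instance); here they are simply rows of a NumPy array.

```python
import numpy as np

def deduplicate(embeddings: np.ndarray, threshold: float = 0.9) -> list[int]:
    """Return indices of rows to keep, dropping any row whose cosine
    similarity to an already-kept row is >= threshold."""
    # Normalize rows so a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    kept: list[int] = []
    for i in range(len(unit)):
        if all(unit[i] @ unit[j] < threshold for j in kept):
            kept.append(i)
    return kept
```

This greedy pass is O(n²); for tens of thousands of prompts it is still fast, and an approximate nearest-neighbor index can take over beyond that.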
Our prompt starter is ready; now we need to add a pipeline of hearty answers to it. Parts of the prompt-collection recipe can be reused here.
You can use the same LLM that you used for prompt generation to produce a shortlist of synthetic answers.
To improve the quality of your answers, you need depth. Generating several queries per prompt and saving several top search results significantly extends the range of your answers.
As in the second step of the prompt generation pipeline, split the relevant web pages into chunks. Now you can search for the chunks most relevant to a given prompt and add them to the model context (similar to the RAG approach). This extra knowledge adds domain-specific information, lowers hallucinations, and improves overall answer accuracy.
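The RAG-style retrieval described above can be sketched as a top-k cosine-similarity lookup over precomputed chunk embeddings, followed by pasting the winners into the answer-generation context. The context wording is an assumption for illustration, not a prescribed format.

```python
import numpy as np

def top_k_chunks(prompt_emb: np.ndarray, chunk_embs: np.ndarray,
                 chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are closest (by cosine
    similarity) to the prompt embedding."""
    prompt_emb = prompt_emb / np.linalg.norm(prompt_emb)
    chunk_embs = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = chunk_embs @ prompt_emb
    best = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in best]

def build_context(prompt: str, retrieved: list[str]) -> str:
    """Assemble a grounded answer-generation prompt from retrieved chunks."""
    parts = "\n\n".join(retrieved)
    return (f"Use the following reference material to answer.\n\n"
            f"{parts}\n\nQuestion: {prompt}\nAnswer:")
```

Feeding `build_context(...)` to the LLM instead of the bare prompt is what grounds the synthetic answer in real domain material.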
If you are not satisfied with your synthetic answers, you can bring in expert knowledge and evaluate the answers on various scales such as style, correspondence to the domain, context relevance, etc. Once you have these evaluations, you can fine-tune the model to the domain and increase the quality of your synthetic answers further.
Congratulations, you’ve just prepared a synthetic data pipeline for SFT! Just like in cooking, the key to success in AI development is in carefully selecting your ingredients, applying the right techniques, and not being afraid to experiment. Tune on it while it’s hot — bon appétit in the world of AI!