Cooking synthetic data for SFT: a recipe for success

Crafting Your Own Synthetic Data Pipeline for SFT

High-quality training data is a key ingredient in LLM development, as it can vastly improve model performance in practical applications. This data is typically obtained either through traditional data collection methods or through alternatives such as generative AI models that create synthetic data. Conventional methods include using open-source datasets, closed-source data collection (outsourced or in-house), web scraping, manual data collection, purchasing data, and so on. These are robust techniques that ensure the quality and reliability of the data.

On the other hand, fine-tuning LLMs often requires domain-specific data that is sometimes easier to supplement with artificially generated data. This is common in domains with limited data resources. Synthetic data has its shortcomings and is not a one-size-fits-all solution, but it is a routine workaround in scenarios where traditional data collection is challenging or costly.

By mimicking “real” data, synthetic data lessens the problem of data availability. Generating synthetic data is often more cost-effective and scalable than conventional data collection while ensuring privacy compliance by excluding sensitive information. Hence, it can be a robust addition to the process of fine-tuning LLMs.

That being said, synthetic data has several drawbacks of its own. For example, if it doesn’t undergo quality control or it’s not a good replica of real-world examples, the trained model may not perform well in practical applications. Human validation of the synthetic data can mitigate this risk and help obtain reliable results. For this reason, we chose a hybrid approach, incorporating both synthetic data and human input. 

Let’s dive into the process of building a hybrid pipeline for supervised fine-tuning (SFT). You can think of synthetic data generation as a culinary adventure in the digital kitchen of AI development. Just as a chef carefully selects ingredients and applies precise techniques to create a delicious dish, let’s roll up our sleeves to “cook” a synthetic data pipeline, perfect for SFT. With the right ingredients, techniques, and a pinch of inspiration, you can craft a dataset that satisfies your model’s appetite. Let’s get started!

Ingredients

  1. A hearty batch of open-source data (raw texts relevant to your domain, sourced from Hugging Face, Kaggle, or the web)

  2. A collection of text-slicing tools (for parsing long texts and chopping them into manageable paragraphs)

  3. A filter for refining your data stew (to keep only the most relevant chunks)

  4. An open-source LLM for data handling (filtering, profiling, generation and transformation). We prefer Mixtral 8x7B, but you can choose your favorite flavor

  5. Web search to fetch fresh data and improve answer generation using web queries crafted from profiles

  6. A splash of LLM-based hot sauce to generate diverse prompts 

  7. Tools for textual transformation and deduplication to smooth out the texture

  8. Human experts for taste testing (to evaluate the generated data and add improvements)

Cooking the prompts

Step 1. Gather your ingredients: collect raw texts

Start by collecting a set of raw texts that correspond to your domain of interest. These can be open-source datasets already published online or the results of web scraping. Filter each document from the web, and keep the ones relevant to your task. For example, if you need to generate a medical dataset, look for web texts from the medical domain. While doing so, make certain the collected data complies with all legal and ethical standards.

This step provides the foundational material for generating synthetic data tailored to your needs.

Let’s illustrate this step with a simple example. Suppose we want to prepare a synthetic pipeline in the domain of physics, more specifically about quarks and gluons. To achieve this, we must gather extensive information and key points about the topic. Our collection would include texts with interesting facts about quarks and gluons, their characteristics, underlying principles, and so on.

Step 2. Chop it up: break the texts into chunks

Slice your long texts into smaller, more digestible paragraphs. Instead of generating one generic prompt from a longer document, divide it into smaller, logical texts and generate separate prompts for each. This step increases the dataset volume and creates a more diverse dataset. Separate parts highlight different aspects of the topic, leading to more varied data and unique prompts.

Think of this as prepping your vegetables for a stir-fry: you want the perfect size chunks. For your dataset, they should be long enough to be meaningful but short enough to fit several into the context window of the LLM you intend to use later.

Going back to our physics example, we could divide a text about quarks and gluons into distinct paragraphs: “What are quarks and gluons?”, “Interesting facts about quarks and gluons” and “Quarks and gluons in the standard model”.
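To make the chunking concrete, here is a minimal sketch in Python. It splits a document on blank lines and packs paragraphs into chunks under a rough word budget; the 300-word budget and the file name are illustrative assumptions, and in practice you would tune the chunk size to the context window of the LLM you plan to use.

```python
# Minimal chunking sketch: split on blank lines, then pack paragraphs into
# chunks under a rough word budget (300 words here is an arbitrary choice).
def chunk_document(text: str, max_words: int = 300) -> list[str]:
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n_words = len(para.split())
        if current and current_len + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Example: one long document about quarks and gluons becomes several chunks.
# chunks = chunk_document(open("quarks_and_gluons.txt").read())
```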

Step 3. Sift carefully: filter texts by relevancy

Use your filtering tool to sift through the text chunks, keeping only those relevant to your specific needs, such as domain and subdomains. You want to make sure you keep useful information and discard irrelevant noise. For example, we use an LLM to check that each paragraph is informative and relevant to our needs.
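One way to implement this filter is sketched below: ask the LLM a yes/no question about each chunk and keep only the chunks it judges relevant. The `llm_generate` helper is a placeholder for whatever client you use (for example, a hosted Mixtral 8x7B), and the prompt wording is an illustrative assumption rather than the exact instruction from our pipeline.

```python
def llm_generate(prompt: str) -> str:
    """Placeholder for your LLM client (e.g. Mixtral 8x7B behind an
    OpenAI-compatible endpoint). Replace with a real call."""
    raise NotImplementedError

RELEVANCE_PROMPT = (
    "You are filtering training data for the domain: {domain}.\n"
    "Is the following paragraph informative and relevant to that domain?\n"
    "Answer with a single word, YES or NO.\n\nParagraph:\n{chunk}"
)

def filter_chunks(chunks: list[str], domain: str) -> list[str]:
    kept = []
    for chunk in chunks:
        verdict = llm_generate(RELEVANCE_PROMPT.format(domain=domain, chunk=chunk))
        if verdict.strip().upper().startswith("YES"):
            kept.append(chunk)
    return kept

# relevant_chunks = filter_chunks(chunks, domain="physics: quarks and gluons")
```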

Step 4. Choose the seasonings: generate text profiles

Open-source LLMs are ineffective at generating complex prompts directly from raw documents. You can get around this limitation by using profiling. Instruct the LLM to read each of the previously created chunks and generate a profile for it. This involves summarizing the chunk into several sentences, each corresponding to a specific property like topic, subdomain, and so on.  

Let's understand this concept through an example, where we generate a profile for a sample document from our physics collection:
Generated profile:
- Domain: Natural sciences
- Subdomain: Physics
- Topic: Quarks and Gluons, their properties, and their role in the strong nuclear force
- Use Case: Learning about the fundamental particles that make up matter, their interactions, and their study in high-energy physics experiments 

Think of it as identifying the key flavors in your dish — maybe ginger or peppers? The point is to summarize and categorize the data effectively, which improves the complexity and quality of the prompts. Additionally, based on the profiles, you can control the diversity and distribution of the dataset by selecting the percentage of each domain included in the final dataset.
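A sketch of this profiling step might look like the following. It asks the LLM to fill in the same fields shown in the example profile above (domain, subdomain, topic, use case) and parses the answer into a dictionary; the prompt template and the `llm_generate` placeholder are assumptions you would replace with your own client and wording.

```python
def llm_generate(prompt: str) -> str:
    """Placeholder for your LLM client. Replace with a real call."""
    raise NotImplementedError

PROFILE_PROMPT = (
    "Read the text below and summarize it as a profile with exactly these "
    "fields: Domain, Subdomain, Topic, Use Case. One line per field.\n\n"
    "Text:\n{chunk}"
)

def profile_chunk(chunk: str) -> dict[str, str]:
    raw = llm_generate(PROFILE_PROMPT.format(chunk=chunk))
    profile = {}
    for line in raw.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            profile[key.strip().lstrip("- ")] = value.strip()
    return profile

# profiles = [profile_chunk(c) for c in relevant_chunks]
```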

Step 5. Add fresh ingredients: source web documents

Obtaining fresh documents from the web is essential, as it enhances the scalability and diversity of the final dataset. Once you have identified your key flavors (profiles), it’s time to throw in all the fresh ingredients. You do this in a couple of steps:

  1. Like writing a shopping list based on a recipe, use the created profiles and a selected LLM to generate web queries.

These are some examples of generated search queries:

- what are quarks and gluons
- quark properties and types
- gluon's role in atoms
- history of quark discovery
- applications of quark and gluon research in physics

  2. Using these queries, search the web and find additional relevant documents.

  3. Clean and parse these new ingredients and mix them into your source data.

Repeat these steps until you have enough data. In our projects, we usually need 10 to 20 thousand prompts for a domain, so you can keep this number at the back of your mind.
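The query-generation part of this step can be sketched as follows: the LLM turns each profile into a handful of search queries, which you then pass to whatever web search API you have access to. The prompt template, the number of queries, and the `llm_generate` and `web_search` placeholders are all assumptions for illustration.

```python
def llm_generate(prompt: str) -> str:
    """Placeholder for your LLM client."""
    raise NotImplementedError

def web_search(query: str, top_k: int = 5) -> list[str]:
    """Placeholder for your web search API; returns page texts."""
    raise NotImplementedError

QUERY_PROMPT = (
    "Here is a profile of a document:\n{profile}\n\n"
    "Write {n} short web search queries that would find more documents "
    "on the same topic. One query per line."
)

def queries_from_profile(profile: dict[str, str], n: int = 5) -> list[str]:
    profile_text = "\n".join(f"{k}: {v}" for k, v in profile.items())
    raw = llm_generate(QUERY_PROMPT.format(profile=profile_text, n=n))
    return [q.strip("-• ").strip() for q in raw.splitlines() if q.strip()]

# new_documents = [page for q in queries_from_profile(profiles[0])
#                  for page in web_search(q)]
```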

Step 6. Start cooking: generate prompts

Once you have gathered your ingredients, it’s time to create a balanced dish. Use the profiles created from the source data and instruct the LLM to generate a variety of prompts that cover various aspects of the topic and are suited for the task at hand. Just as different textures and ingredients enhance the richness and flavor of a meal, you should incorporate diverse and detailed prompts to get a more robust dataset. 

Examples of generated prompts would look like this: 

  • What role do quarks and gluons play in particle physics, and how do they help form protons and neutrons?

  • How many types of quarks are there, and what are their names?
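A minimal sketch of this step: feed the LLM a source chunk together with its profile and ask for several distinct user prompts. The template and the fixed count of five prompts per chunk are illustrative assumptions.

```python
def llm_generate(prompt: str) -> str:
    """Placeholder for your LLM client."""
    raise NotImplementedError

GENERATION_PROMPT = (
    "Profile:\n{profile}\n\nSource text:\n{chunk}\n\n"
    "Write {n} distinct questions or instructions a user might ask about this "
    "material. Cover different aspects of the topic. One per line."
)

def generate_prompts(chunk: str, profile: dict[str, str], n: int = 5) -> list[str]:
    profile_text = "\n".join(f"{k}: {v}" for k, v in profile.items())
    raw = llm_generate(
        GENERATION_PROMPT.format(profile=profile_text, chunk=chunk, n=n)
    )
    return [p.strip("-• ").strip() for p in raw.splitlines() if p.strip()]
```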

Step 7. Explore the flavor palette: focus on breadth and diversity

To broaden the range of your prompts, use the LLM to generate multiple similar prompts across different topics and use cases, exploring the full flavor palette. The diversity of prompts is important to the success of the whole pipeline. You want to adapt your SFT dataset to various use cases within a given domain.

Toloka implements various automated and manual checks to guarantee that the domain is covered broadly. Here are some examples of how to broaden your collection of prompts:

  • How do electrons interact within an atom, and what role do they play in chemical bonding and the formation of molecules? I’m particularly interested in understanding the differences between ionic and covalent bonds.

  • How do neutrinos interact with matter given their weak interaction with other particles, what makes them so elusive, and what techniques do scientists use to detect and study them?

  • What advancements in detector technology and experimental techniques have allowed scientists to study high-energy particle collisions, and what recent discoveries have been made as a result of these experiments?

Step 8. Add lexical spice: rephrase monotonous prompts

After completing the steps above, you’ll probably find that many prompts are lexically very similar. For instance, numerous prompts begin with “Could you please explain [topic of interest]?” This results in a monotonous dataset with repetitive language patterns. To avoid this, use your LLM to rephrase your prompts and add lexical diversity, ensuring a rich, complex taste that avoids bland repetition.

Let’s illustrate this with an example. For the input “What role do quarks and gluons play in particle physics, and how do they help form protons and neutrons?”, you could generate the following prompts:

- Could someone help me understand the role of quarks and gluons in particle physics? I'm particularly interested in their contribution to the formation of protons and neutrons.
- I’m curious about quarks and gluons in particle physics. How do they help create protons and neutrons?
- What is the role of quarks and gluons in particle physics, and how do they contribute to the formation of protons and neutrons?
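Rephrasing can reuse the same pattern: ask the LLM for paraphrases that keep the meaning but vary the wording. As before, the prompt template and the `llm_generate` placeholder are assumptions.

```python
def llm_generate(prompt: str) -> str:
    """Placeholder for your LLM client."""
    raise NotImplementedError

REPHRASE_PROMPT = (
    "Rewrite the following prompt in {n} different ways. Keep the meaning, "
    "but vary the sentence structure, tone, and opening words. One per line.\n\n"
    "Prompt: {prompt}"
)

def rephrase(prompt: str, n: int = 3) -> list[str]:
    raw = llm_generate(REPHRASE_PROMPT.format(n=n, prompt=prompt))
    return [p.strip("-• ").strip() for p in raw.splitlines() if p.strip()]
```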

Step 9. Watch the pot: discard bad prompts

This step is crucial for maintaining the quality of your dish. 

Let’s say you are generating a recipe for an SFT pipeline on the topic of quarks and gluons, but you notice some odd instructions, such as “A group of flamingoes is called a flamboyance” or “There are four types of quarks”. The first one is irrelevant to your recipe, while the second one is incorrect and can even ruin your results. 

It is important to discard any irrelevant or meaningless prompts and only keep the ones that are useful for your task. You can perform automatic prompt filtering using the LLM. This closely resembles step 3, where you filtered raw texts, but now you are focusing on filtering generated prompts.

Step 10. Smooth out clumps: minimize repetitive prompts

You don’t want your data clumping together around the same topics. We recommend running a similarity search to minimize repetition as much as possible. A good approach is to transform the text into embeddings and filter out any closely similar samples. This ensures each prompt in the collection remains distinct and covers relevant domain areas. This step is important for avoiding repetitive samples and enhancing data diversity.
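One common way to implement this, sketched below, is to embed the prompts with a sentence embedding model and greedily drop any prompt whose cosine similarity to an already accepted prompt exceeds a threshold. The sentence-transformers library, the model name, and the 0.9 threshold are illustrative choices, not requirements of the pipeline.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

def deduplicate(prompts: list[str], threshold: float = 0.9) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    # Normalized embeddings make the dot product equal to cosine similarity.
    emb = model.encode(prompts, convert_to_numpy=True, normalize_embeddings=True)
    kept_idx: list[int] = []
    for i in range(len(prompts)):
        if all(float(np.dot(emb[i], emb[j])) < threshold for j in kept_idx):
            kept_idx.append(i)
    return [prompts[i] for i in kept_idx]

# unique_prompts = deduplicate(all_prompts)
```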

Step 11. Bring in the taste testers: add a human feedback loop 

With any new recipe, you want to have taste testers critique the dish before you serve it. The same goes for data production. 

Despite the improvements synthetic data introduces, you still need human input when developing a robust pipeline for SFT. Integrating human feedback into the process can ensure the quality and scalability of the generated data. To better simulate real-world conditions, human experts validate the synthetic data, usually by accepting, rejecting, or modifying the generated samples/prompts. If the generated prompt is good, it will be accepted; otherwise, it will be modified or discarded altogether. The experts only edit the prompts that are not quite ready for acceptance but are useful and can be easily refined.  This “human-in-the-loop” approach enhances the reliability and applicability of the data. The Toloka team uses a variety of quality control tools at every step in this process.

The ratio of human input versus synthetic data depends on specific criteria for each project. For example, you may opt for a more synthetically generated dataset if budget is a priority. If you’re focusing on quality, you may need more human-generated data. The criteria are based on set thresholds, parameters, and quality control conditions, which help set an optimal tradeoff for each use case. 

Your prompt starter is ready to serve. You can see a more formal representation of the prompt generation pipeline in the image below.

Cooking the answers

Now that you have a set of prompts, you need to add a pipeline of hearty answers to it. You can reuse parts of the prompt collection recipe here.

Step 1. Line up the condiments: use extended web queries

Whether it’s mayo, ketchup, or mint chutney, everyone appreciates a selection of condiments. But you need depth to improve the quality of answers for your dataset. 

Generate several queries per prompt and save several top search results to substantially expand the range of your answers. You could extend the above example and generate additional queries that cover more details or aspects. Similar queries to the original prompt, “How many types of quarks exist, and what are their names?” include: 

  • Which are the six types of quarks?

  • How do different types of quarks interact?

  • What role do different types of quarks play in particle physics?

With this approach, you provide more informative and detailed answers, for a more in-depth understanding of the topic. As a result, you expand the diversity and volume of the dataset, which significantly improves the quality of the data and the performance of the model itself.

Step 2. Chop-n-RAG: generate synthetic answers

Just like the second step in the prompt generation pipeline, you need to split the relevant web pages into chunks. Then you can search for the chunks most relevant to a given prompt and add them to the model context (similarly to the RAG approach). If you don’t incorporate RAG, the LLM will generate answers based solely on the information it was trained on. 

Retrieval-augmented generation (RAG) uses an information retrieval component that enables the user to first pull relevant data from external sources. The new data is then combined with the original training data, which assists the LLM in creating more accurate and relevant responses. This extra knowledge adds domain-specific information, lowers hallucinations, and improves the overall accuracy of answers.

Such a context-boosted LLM can then be used to generate a short list of synthetic answers. For example, for the prompt “How many types of quarks exist and what are their names?”, the LLM might generate an answer such as “There are six types of quarks: up, down, charm, strange, top, and bottom.”
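A simplified sketch of this chop-and-retrieve step: embed the web-page chunks, pick the few most similar to the prompt, and paste them into the model context before generating an answer. The embedding model, the top-k of 3, and the `llm_generate` placeholder are assumptions for illustration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def llm_generate(prompt: str) -> str:
    """Placeholder for your LLM client."""
    raise NotImplementedError

def answer_with_rag(prompt: str, chunks: list[str], top_k: int = 3) -> str:
    model = SentenceTransformer("all-MiniLM-L6-v2")
    chunk_emb = model.encode(chunks, convert_to_numpy=True, normalize_embeddings=True)
    prompt_emb = model.encode([prompt], convert_to_numpy=True,
                              normalize_embeddings=True)[0]
    scores = chunk_emb @ prompt_emb            # cosine similarity per chunk
    best = np.argsort(scores)[::-1][:top_k]    # indices of the most relevant chunks
    context = "\n\n".join(chunks[i] for i in best)
    return llm_generate(
        "Use only the context below to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {prompt}\nAnswer:"
    )
```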

Step 3. Add your secret sauce: get expert evaluations and fine-tune

If you are not satisfied with your synthetic answers, you can recruit knowledgeable experts to evaluate the answers on scales like style, correspondence to the domain, context relevance, etc. This is done similarly to step 11 in prompt generation, where we include humans in the process to validate or edit the answers and assess the overall performance of the pipeline.

Once you have your evaluations, you can fine-tune the model to the domain and refine the quality of your synthetic answers.

Like in cooking, the key to AI development success is carefully selecting your ingredients, applying the right techniques, and not being afraid to experiment. Tune on the data while it’s hot, and bon appétit!

Learn more 

Need help building your SFT pipeline? Reach out to Toloka, and our experts will help you create the perfect recipe!

Article written by:

Elena Trajkova

Sergei Tilga

Updated:

May 14, 2024

