LLM for code generation: a scalable pipeline to gather SFT data

by Toloka Team

Domain-specific LLMs

Many general-purpose LLMs struggle to perform in specialized areas like law, engineering, coding, or medicine. As a result, numerous companies have started developing domain-specific models optimized for their particular areas of expertise; for instance, Med-PaLM for medicine and BloombergGPT for finance. In the coding domain, there are also several fine-tuned models such as GitHub Copilot, StarCoder2, and Code Llama. While these code models are useful for practical tasks, they still have limitations and struggle with more complex problems. A common problem these companies face is building high-quality Supervised Fine-Tuning (SFT) datasets that are niche-specific and can be used to fine-tune base LLMs to excel in the desired domain. If you require assistance in gathering domain-specific datasets, Toloka can help. Feel free to reach out to us.

In this article, we introduce a pipeline that Toloka developed to collect a high-quality dataset in the coding domain at a throughput of 2,000 prompt-completion pairs per month.

SFT dataset for coding domain

The challenge involved creating an SFT dataset covering various programming languages, including Python, SQL, Scala, Go, C, C++, R, and JavaScript. Toloka aimed to generate a significant number of high-quality question-answer pairs addressing different coding challenges within this domain.

The task was particularly challenging because it required highly qualified experts, coordination of their efforts, and delivery within a given timeframe. We also had to maintain high quality throughout the project: it was crucial to craft prompts and completions that fulfilled user requests and were accurate and concise.

Pipeline for coding SFT dataset

To tackle the challenge, Toloka crafted a solution that combines human expertise with modern machine learning (ML) techniques in a data-gathering pipeline. We assembled a team of coding experts proficient in all the necessary programming languages; they form the foundation of the project, and their primary responsibility is to maintain high data quality. To assist the human experts and speed up the process, we also developed several ML algorithms that perform data extraction and automatic data checks. The pipeline is shown in the schema below, and each step is described in the subsections that follow.

  1. Collecting and summarizing the context data from documentation

The initial step of the pipeline involves extracting information from clients’ documentation regarding the problems that LLM must be able to address. This stage automatically gathers details about each coding problem, such as context, keywords, programming language, category, and subcategory. These details are then utilized in the subsequent step to generate prompts effectively.
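The extracted details for each coding problem can be represented as a small structured record. Below is a minimal sketch of what such a record might look like; the class and field names are illustrative assumptions, not Toloka's actual schema.

```python
from dataclasses import dataclass


@dataclass
class ProblemContext:
    """Metadata automatically extracted from client documentation
    for a single coding problem (hypothetical schema)."""
    context: str        # summarized documentation excerpt
    keywords: list      # e.g. ["window function", "rolling average"]
    language: str       # e.g. "SQL"
    category: str       # e.g. "data analysis"
    subcategory: str    # e.g. "time series"


# Example record passed on to the prompt-generation step
record = ProblemContext(
    context="Computing a rolling average over daily sales data.",
    keywords=["window function", "rolling average"],
    language="SQL",
    category="data analysis",
    subcategory="time series",
)
```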

  2. Generating prompts with experts

In the second step, the extracted information is passed to a human coding expert. Their task is to create the actual prompt by following specific instructions and carefully considering the context and other relevant details. An example of such a task is illustrated in the screenshot below.

Prompt generation task

To support experts in generating prompts, we use open LLMs to automatically generate synthetic data across a range of topics, use cases, and scenarios. These examples give experts a starting point to follow and spark their creativity. Ultimately, this approach improves the efficiency of the experts and enhances the diversity and quality of the prompts collected during this phase.
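The idea of seeding experts with examples can be illustrated with a simple template-based generator. This is a deliberately simplified stand-in for the open-LLM generation used in the pipeline; the templates and metadata keys are assumptions for illustration only.

```python
import random


def synthetic_prompt_examples(metadata: dict, n: int = 3, seed: int = 0) -> list:
    """Produce illustrative seed prompts an expert can draw on.
    A template-based stand-in for open-LLM synthetic generation."""
    templates = [
        "Write a {language} snippet that handles {keyword}.",
        "How would you implement {keyword} in {language}?",
        "Debug a {language} program that fails when {keyword} is involved.",
    ]
    rng = random.Random(seed)  # seeded for reproducibility
    return [
        rng.choice(templates).format(
            language=metadata["language"],
            keyword=rng.choice(metadata["keywords"]),
        )
        for _ in range(n)
    ]


examples = synthetic_prompt_examples(
    {"language": "SQL", "keywords": ["joins", "window functions"]}
)
```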

  3. Verifying prompts

In the third step, the prompts crafted in the previous step are reviewed by a second line of coding experts and by ML models. The models perform basic classification checks to confirm that the prompt corresponds to the chosen category, subdomain, and programming language. Meanwhile, human coding experts assess the prompt’s quality, suitability, and adherence to instructions, context, and provided details. Examples of the questions given to coding experts can be found in the screenshot below.

A subset of questions from the prompt verification task given to human coding experts.

In our pipeline, prompts that do not meet the required criteria are rejected and must be rewritten, whereas accepted prompts move on to the next step, where coding experts are tasked with writing completions.
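The automatic language check from this step can be sketched as a simple filter that flags prompts showing no signal of their declared programming language. This keyword heuristic is a deliberately naive stand-in for the trained classifiers the pipeline actually uses; the marker lists are illustrative assumptions.

```python
# Naive lexical markers per language; a real check would use a trained
# text classifier rather than substring matching.
LANGUAGE_MARKERS = {
    "python": ["python", "def ", "pandas", "numpy"],
    "sql": ["sql", "select", "join", "group by"],
    "go": ["golang", "goroutine", "go program"],
}


def matches_declared_language(prompt: str, declared: str) -> bool:
    """Return True if the prompt text shows any marker of the
    declared language; otherwise flag it for expert review."""
    text = prompt.lower()
    markers = LANGUAGE_MARKERS.get(declared.lower(), [])
    return any(marker in text for marker in markers)
```

A prompt that fails this check would be routed back to the first-line expert rather than silently dropped.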

  4. Writing completions with experts

The fourth step in our pipeline requires experts to follow carefully crafted instructions when writing answers to user queries. They must ensure that their completions fully address user requests and are truthful, elegant, and concise. Once an expert completes a response, it proceeds to the next step for further verification.

  5. Verifying completions with experts

The fifth step involves a second line of human coding experts verifying the completion. This ensures that the prompt completion is assessed by a different person from the one who wrote it. Once again, experts must answer detailed questions about the completion’s quality. Examples of these questions are provided in the screenshot below.

Prompt completion verification task

As in the prompt verification step, if the completion’s quality does not meet the required standards, it is rejected and must be rewritten. Otherwise, it is accepted together with the corresponding prompt and becomes a new data point in the dataset.

  6. Evaluating the created dataset

This brings us to the final, sixth step in the pipeline, where we evaluate the dataset to assess its overall quality before it is consumed by an LLM. For each data point in the dataset, we ask different models such as GPT-3.5, Llama, and Mistral to complete the prompt, then compare their answers to the responses written by our experts. We use both expert evaluation, to ensure data quality, and automatic evaluation with GPT-4, to improve cost and scalability. In both scenarios, a decision maker (an expert or a model) is presented with a side-by-side comparison of two answers and asked which one they prefer. We then compute win rates against the models mentioned above; high win rates give us confidence that the dataset will improve the LLM our client is developing.
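The win-rate aggregation can be sketched in a few lines. How ties are handled here (excluded from the denominator) is an assumption for illustration, not necessarily how the pipeline counts them.

```python
def win_rate(judgments: list) -> float:
    """Fraction of side-by-side comparisons in which the expert answer
    was preferred over the competing model's answer.

    `judgments` holds one label per comparison:
    "expert", "model", or "tie". Ties are excluded from the count.
    """
    decisive = [j for j in judgments if j != "tie"]
    if not decisive:
        return 0.0
    return sum(j == "expert" for j in decisive) / len(decisive)
```

For example, `win_rate(["expert", "expert", "model", "tie"])` counts two expert wins out of three decisive comparisons.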

In addition to ensuring data quality, we examine the distribution of the dataset across languages, technologies, and use cases to confirm it aligns with the desired parameters before providing it to the client. We also perform checks to identify and remove semantically similar samples. Only rigorously checked datasets are delivered to the client; if any criteria are not met, we continue refining the dataset until all issues are resolved.
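The near-duplicate filtering can be sketched with a greedy filter over pairwise similarity. The lexical `SequenceMatcher` ratio used here is a stdlib stand-in for the semantic-embedding similarity a production pipeline would use, and the 0.9 threshold is an arbitrary illustrative choice.

```python
from difflib import SequenceMatcher


def drop_near_duplicates(samples: list, threshold: float = 0.9) -> list:
    """Greedily keep a sample only if it is not too similar to any
    already-kept sample. Lexical stand-in for embedding-based dedup."""
    kept = []
    for sample in samples:
        if all(
            SequenceMatcher(None, sample, k).ratio() < threshold
            for k in kept
        ):
            kept.append(sample)
    return kept


deduped = drop_near_duplicates([
    "Write a Python function to sort a list.",
    "Write a Python function to sort a list!",  # near-duplicate, dropped
    "Explain SQL joins.",
])
```

Note that greedy filtering is order-dependent: the first of a duplicate pair is the one retained.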

Summary

The pipeline above demonstrates how to structure an SFT data-gathering project by combining human expertise with ML solutions, giving it an additional boost of efficiency and scalability. Collecting a domain-specific dataset poses unique challenges, requiring a large pool of expert annotators and effective coordination of their activities. While we applied this pipeline to coding, we can easily adapt it to other niches, including industries such as medicine, law, engineering, finance, pharmaceuticals, and many more. We have experts in all of those fields and extensive experience in running data annotation AI projects.

If you are interested in gathering a dataset for a specific domain, feel free to reach out to us. We can build a custom solution for you similar to the pipeline outlined in this article.
