Toloka Team

Oct 22, 2024

Customer cases

Multi-domain, multi-language SFT dataset pushes LLM performance to the next level


The accuracy of a fine-tuned model depends largely on the expertise embedded in the dataset used for supervised fine-tuning (SFT). When the client’s LLM team was tasked with fine-tuning in niche domains and multiple languages, they trusted Toloka’s experience to create a dataset that would cover all the bases.

In this case study, Toloka experts prepared a diverse and complex SFT dataset of 10,000 pairs of prompts and completions in multiple languages for specialized fields. The specific domains and languages in this project are confidential. 

Read on to learn about each step in Toloka’s approach to building expert datasets — and how fine-tuning on the dataset boosted model performance. 

The project: crafting a large dataset of prompts and completions

As they reached the fine-tuning stage for their newest LLM, the client set out to source a large dataset: 10,000 pairs of prompts and completions in multiple languages and hard-to-find domains. 

The dataset requirements were highly complex: 

  • Balance between languages.

  • Segmentation into seven skills or prompt types: summarization, extraction, open QA, closed QA, rewriting, classification, and inference.

  • Adherence to the client’s corporate style with a specific format and tone.

  • Subset of the dataset in feature-specific formats such as JSON, XML, tables, and CSV.

  • Context for prompts. Toloka collected relevant context in compliance with local regulations and copyright law.

  • Response length guidelines specified upper and lower bounds on token length to avoid overly short or long responses (see the length-check sketch below).
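The length requirement, for example, maps naturally onto a simple automated check. Below is a minimal sketch of how such a check might look; the tokenizer and the exact bounds are assumptions for illustration, not the project’s actual values.

```python
# A minimal sketch of an automated response-length check. The tokenizer and
# the bounds below are illustrative assumptions, not the project's actual values.
from transformers import AutoTokenizer

# Any tokenizer matching the target model would work here.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

MIN_TOKENS = 50    # hypothetical lower bound
MAX_TOKENS = 512   # hypothetical upper bound

def within_length_bounds(completion: str) -> bool:
    """Return True if the completion falls inside the allowed token range."""
    n_tokens = len(tokenizer.encode(completion, add_special_tokens=False))
    return MIN_TOKENS <= n_tokens <= MAX_TOKENS

# Flag completions that violate the length guideline so an editor can review them.
completions = ["A short answer.", "A much longer answer..."]
flagged = [c for c in completions if not within_length_bounds(c)]
```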

Toloka took on the data production and delivered a high-quality dataset in four months. The dataset was effectively decomposed into 80 smaller datasets based on language, skill, and domain variations. These domains included general topics and a handful of specialized areas. Toloka’s expert network covers a wide range of niche domains, including finance, ESG, philosophy, linguistics, medicine, manufacturing, civil engineering, automotive engineering, compliance, law, coding, and more. 

During the project, we developed a new type of prompt called inference, a mixture of extraction and open QA. Initially, the focus was on extraction — working with context to extract information. However, this approach was too basic and didn’t leverage domain expertise. The inference prompts allowed domain experts to contribute additional knowledge from outside the context for comprehensive and useful answers. This enhancement proved to be particularly valuable in the final dataset. 

Weekly check-ins with the client helped refine aspects such as prompt types, summarization tasks, and style guidelines to match the dataset to their needs. This collaborative approach ensured that the final product met their expectations in both format and substance.

The process: a pipeline of experts, editors, and automated checks

The Toloka team built on previous experience to quickly implement a three-stage pipeline:

1. Data generation: Domain experts write prompts and completions.

2. Simple evaluation: Other domain experts review prompts and completions to remove incorrect and irrelevant content. This step also includes automated quality checks.

3. Editing: Domain team leads and editors revise and improve the data.

One of the keys to streamlining the project was integrating automated checks to complement human efforts. These checks served two primary purposes: to reject incorrect content immediately, and to help editors focus on the right issues. 

Here are some of the autochecks we used in this project:

  • Grammar check to rate the language quality and suggest improvements. This feature significantly boosted the overall quality. 

  • Detection of wrong links, italic fonts, incorrect headings, and other undesirable artifacts in the completions. This feature flagged issues for the QA stage to ensure compliance with the guidelines.

  • Requirements analysis to check whether prompts meet the criteria for the skill, such as classification or extraction. This feature relied on few-shot prompt engineering and flagged the prompts for experts to rewrite.

  • Performance tracking to monitor individual experts and correct problems early on, before they could impact dataset quality. This feature deprioritized experts with high task rejection rates, indicating potential quality issues. Additionally, we analyzed the similarity between prompts and completions using embeddings and cosine distance to help ensure prompt diversity (see the sketch after this list).

  • AI classification of tasks as generalized or domain-specific. This feature ensured that domain experts focused on tasks requiring specialized knowledge, while general experts handled the general knowledge content.
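To make the similarity analysis concrete, here is a minimal sketch of one way such a check might look, applied to prompt-to-prompt similarity for diversity. The embedding model and the threshold are assumptions, not the project’s actual configuration.

```python
# Minimal sketch of an embedding-based diversity check: flag pairs of prompts
# that are near-duplicates. The embedding model and threshold are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def near_duplicate_pairs(prompts: list[str], threshold: float = 0.9):
    """Return index pairs of prompts whose cosine similarity exceeds the threshold."""
    embeddings = model.encode(prompts)
    sims = cosine_similarity(embeddings)
    pairs = []
    for i in range(len(prompts)):
        for j in range(i + 1, len(prompts)):
            if sims[i, j] > threshold:
                pairs.append((i, j, float(sims[i, j])))
    return pairs

# Example: flag overly similar prompts so they can be rewritten or dropped.
prompts = ["Summarize the quarterly ESG report.", "Summarize this quarter's ESG report."]
print(near_duplicate_pairs(prompts))
```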

The experts: recruiting native speakers across specialized domains

To meet the client's requirement for native language speakers with domain knowledge, we assembled a team of experienced professionals. Domain experts showcased their knowledge and industry experience, with top contributors including a university professor and an ex-Mercedes engineer. 

It took less than three weeks to prepare a curated team of several hundred subject matter experts with the right balance of expertise and languages, leveraging our rapid expert onboarding program and the Mindrift platform. 

Mindrift is a hub where data production tasks are supported by instructions and guidelines, automated checks, control tasks, spot checking, and built-in tools for smooth operations. The platform helps experts maximize their performance with two-way feedback and innovative co-pilot tools, which enhance efficiency by 45%. Here are some of the tools designed to automate routine tasks:

  • Grammar Check goes beyond basic grammar and spelling, highlighting issues and suggesting readability improvements to assist editors.

  • AI Detector flags text that sounds AI-generated to make sure it gets rewritten by a human. 

  • Instructions Check identifies the type of prompt and analyzes whether the response is a good match. 

Technical tools are layered over human quality management to optimize production and support data excellence.


Task examples

The outcome: valuable data for improving model performance on target tasks

While quality checks give us confidence in the accuracy of data, it’s equally important to measure how the dataset benefits models during fine-tuning. 

Before delivering the dataset to the client, we ran several experiments to verify that the dataset improves model performance more significantly than open-source datasets.

Our experiments follow a general framework:

  1. Select open-source models to fine-tune. For this project, we chose Llama 3 8B and Mistral 7B.

  2. Take the custom dataset and a competing open-source dataset for comparison. For this project, we used the No Robots dataset.

  3. Tune each model (supervised fine-tuning) on the two datasets separately, using similar training parameters (as sketched below). This results in two versions of each model.

  4. Evaluate the performance of both models and compare the results.
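For illustration, here is how one of the two fine-tuning runs might look using the openly available No Robots dataset. This is a minimal sketch assuming Hugging Face’s TRL library; the hyperparameters, exact tooling, and data formatting used in the project are not specified, so treat every value below as an assumption.

```python
# Minimal SFT sketch, assuming Hugging Face's TRL library (TRL <= 0.9-style
# arguments; newer versions move some of these into SFTConfig). All
# hyperparameters are illustrative assumptions, not the project's settings.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # one of the models compared above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# No Robots stores each sample as a list of chat messages; flatten them into
# a single text string for supervised fine-tuning.
# (Split names may differ by dataset version.)
dataset = load_dataset("HuggingFaceH4/no_robots", split="train")
dataset = dataset.map(
    lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)}
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="llama3-no-robots-sft",
        num_train_epochs=3,              # assumed; the same settings would be
        per_device_train_batch_size=4,   # reused for the run on the custom dataset
        learning_rate=2e-5,
    ),
)
trainer.train()
```

The same procedure would then be repeated with the custom dataset in place of No Robots, keeping the training arguments fixed so the comparison isolates the effect of the data.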

We chose the No Robots dataset for comparisons because it is similar to our custom dataset in size and scope. It’s a high-quality open dataset of instructions and demonstrations, created by skilled human annotators. Here’s how it compares to our dataset:

  • Same size – 10,000 samples.

  • Similar set of skills – Open QA, Generation, Brainstorm, Closed QA, Summarization, Extraction.

  • English only (unlike our dataset). 

To evaluate model performance, we focused on two groups of metrics:

  1. Domain-specific evaluation. This assesses the models’ capabilities in the domains, skills, and languages of interest to the client. Here, we aim to demonstrate that performance significantly improves after fine-tuning on our dataset, surpassing the improvements seen with other non-specialized data.

  2. General capability evaluation. This ensures that the basic abilities of the model do not degrade after fine-tuning on our specialized data. We want to maintain or enhance the model’s general performance while improving its domain-specific skills.

Domain-specific evaluation

We used the Alpaca Eval approach to evaluate specific, relevant abilities of the models using LLM-as-a-judge. We prepared special hold-out subsets of prompts for the domains, skills, and languages of interest and compared the quality of the fine-tuned models using a stronger model (GPT-4 Turbo). This is how evaluation works:

  1. Select a subset of prompts for the chosen domain, skill, and language.

  2. For each prompt, get responses from the two models to compare (e.g., the model fine-tuned on our dataset and the model fine-tuned on the No Robots dataset).

  3. Use GPT-4 Turbo to choose which response is better. Compare the models' responses based on five separate criteria:

    • Correctness: accuracy, truthfulness, groundedness and relevance of model answers.

    • Conciseness: non-redundancy and word-efficiency of model answers.

    • Observance: compliance with input instructions, appropriate and consistent tone, language and structure.

    • Safety: harmlessness and friendliness of model answers.

    • Overall: overall quality of the response.

  4. Aggregate the automatically annotated pairs into a single win rate metric — the percentage of cases where the quality of the first model's responses is better than the second's (a minimal sketch of the judging and aggregation step follows this list).
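To make the judging step concrete, here is a minimal sketch of pairwise LLM-as-a-judge comparison and win rate aggregation. The judge prompt, criteria wording, and code structure are assumptions for illustration; only the use of GPT-4 Turbo as the judge comes from the project description.

```python
# Minimal sketch of pairwise LLM-as-a-judge scoring with a win-rate aggregate.
# The judge prompt below is an illustrative assumption, not the one used in the project.
from openai import OpenAI

client = OpenAI()

JUDGE_TEMPLATE = """You are comparing two answers to the same prompt.
Prompt: {prompt}
Answer A: {answer_a}
Answer B: {answer_b}
Which answer is better overall (correctness, conciseness, instruction-following, safety)?
Reply with exactly "A" or "B"."""

def judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model which of the two answers is better."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            prompt=prompt, answer_a=answer_a, answer_b=answer_b)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def win_rate(examples: list[dict]) -> float:
    """Fraction of held-out prompts where model A's response is judged better."""
    wins = sum(judge(ex["prompt"], ex["model_a"], ex["model_b"]) == "A" for ex in examples)
    return wins / len(examples)
```

In practice, the order of the two answers would also be randomized to control for position bias, and each of the five criteria would be judged separately.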

This table compares the performance of two models: Llama 3 Instruct fine-tuned on our dataset, and Llama 3 Instruct fine-tuned on the No Robots dataset.

For both datasets, training used 85% of the data (8,500 samples). The remaining 1,500 prompts from our dataset were used as a held-out set for calculating metrics. In the table, green text indicates a statistically significant win rate, and win rates over 66% are in bold.

The overall win rate of 74.1% (row 1) makes it clear that the custom dataset provides greater benefits compared to the No Robots dataset.
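As a rough illustration of what "statistically significant" means here (the project's actual statistical test is not specified), a win rate can be compared against the 50% chance level with a simple binomial test:

```python
# Illustrative significance check for a win rate, assuming a simple binomial
# test against the 50% chance level; the project's actual test is not specified.
from scipy.stats import binomtest

n_prompts = 1500          # size of the held-out set mentioned above
observed_win_rate = 0.741 # overall win rate reported in the table
wins = round(observed_win_rate * n_prompts)

result = binomtest(wins, n_prompts, p=0.5, alternative="greater")
print(f"p-value: {result.pvalue:.3g}")  # far below 0.05 for 74.1% over 1,500 prompts
```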

Here’s a breakdown of the results in data subsets:

  • The subsets with significant improvements are in Language 2 and non-general domains (rows 2–5). This is expected, since the No Robots dataset didn’t include additional languages or cover specific domains.

  • The highest win rates are in the project-specific subsets (row 6): non-general domains, non-contextual skills, and Language 2. The custom dataset focused on these areas.

  • The win rates for the general domain and Language 1 subsets aren’t as high, but still indicate superior benefits from our dataset (row 7). There are two possible explanations: either the quality of responses in our dataset is higher across the board, or the diversity of our data (with the addition of languages and domain knowledge) has enhanced the model’s generalization ability.

  • Of the four individual criteria (excluding Overall), Correctness and Observance show the most substantial difference, reflecting the contributions of domain experts and editors focused on high-quality content.

The experimental results gave us confidence that the dataset size, response quality, and prompt diversity are enough to ensure a significant improvement in the model's performance — and help the client reach their goals.

General capability evaluation

To ensure that our training data distribution does not bias the model excessively towards our specific domains, answer style, or other features of the data, we also compared the models’ performance on a sample from a different distribution: a private set of prompts that is not publicly available and was not used to train the models. This set contains diverse skills, but only for the general domain and English.

The results show that the model trained on our data still performs better than the model trained on the No Robots data, even for prompts outside the training distribution. 

As a final step, we used popular general benchmarks to assess the overall skills of the models. These metrics are not tailored to our specific domains and skills, so we did not expect significant improvements. However, we needed to confirm that the general capabilities of the models did not degrade or become biased after training on our dataset.

In 2 out of 3 benchmarks, the performance of the model trained on our dataset is higher. Moreover, in the MMLU benchmark subsets that are most significant to the client, the difference is even more pronounced.
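For reference, benchmarks like MMLU are commonly run with off-the-shelf evaluation tooling. The sketch below assumes EleutherAI's lm-evaluation-harness; the benchmark tooling actually used in this project is not specified.

```python
# Sketch of a general-capability benchmark run, assuming EleutherAI's
# lm-evaluation-harness (pip install lm-eval); API details vary by version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=path/to/fine-tuned-model",  # hypothetical local checkpoint
    tasks=["mmlu"],
    batch_size=8,
)
print(results["results"])  # per-task accuracy scores
```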

We also conducted additional experiments using the Llama 3 8B Base and Mistral 7B Instruct models and obtained roughly the same conclusions.

The impact: a superior dataset for tuning custom LLMs 

Our extensive experiments confirmed the dataset’s viability. After training models on our dataset:

  • Performance significantly improved in the domains, languages, skills, and qualities of interest to the client.

  • The overall quality of the models on general tasks did not degrade and, in some cases, even improved.

With satisfactory metrics in hand, Toloka delivered the dataset of 10,000 prompt-completion pairs in two languages across multiple specialized domains. The project’s complexity resulted in 80 smaller datasets rather than one large one, covering different languages, domains, and prompt types, both contextual and non-contextual. 

The client’s linguists confirmed that the resulting language quality was high, outperforming leading models like Llama 3 on specific tasks in specific languages.

Highlights of how we achieved exceptional dataset quality: 

  • Every data point was checked by at least two experts, alongside automated checks. 

  • We used LLMs to classify data and flag potential issues, which helped us maintain high standards and swiftly address any problems.

  • We developed a new category of data generation called inference to enrich the dataset with valuable domain-specific data.

  • Domain experts contributed in-depth knowledge and industry experience to the dataset, ranging from a university professor to an auto engineer.

Ready to optimize your model's performance with high-quality SFT data? Let’s talk about your ideal dataset.

Article written by:

Toloka Team

Updated:

Oct 22, 2024


