BigCode project: Code-generating LLMs boosted by Toloka's crowd

by Amy Krishnevsky

Why BigCode is a big deal

Toloka recently labeled training data to support BigCode, an open scientific collaboration jointly led by HuggingFace and ServiceNow.

BigCode's mission is focused on the responsible development and use of large language models for code. This aligns perfectly with Toloka's values. We believe that a high-tech approach to data labeling requires a combination of science and high-quality human input obtained in a responsible and scalable way. That's why working on the BigCode project is such a natural fit for Toloka and mutually beneficial for all the teams involved.

The PII challenge

Code LLMs are making developers more productive by taking over the mundane and error-prone parts of programming. However, current systems face regulatory challenges such as GDPR, so the BigCode project turned to Toloka for help removing sensitive information from its dataset, The Stack. In practice, that meant annotating personal data such as usernames, passwords, and security tokens. With 6.4 TB of code in multiple programming languages, this was not a trivial task.

The curators set out to prepare a training dataset for a PII model that would automatically detect and mask personal data in The Stack. They identified 14 categories of sensitive data that needed to be accurately recognized. The remaining challenge was to find enough experts to label the training data.
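
To make the goal concrete, here is a minimal sketch of what "detecting and masking" means in practice. The category names, placeholder format, and regex patterns below are illustrative assumptions; the point of the project was to replace hand-written rules like these with a model trained on the labeled data.

```python
import re

# Illustrative patterns only: the trained PII model replaces rules like these
# with learned detection across all 14 categories.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def mask_pii(code: str) -> str:
    """Replace each detected span with a category placeholder."""
    for category, pattern in PATTERNS.items():
        code = pattern.sub(f"<{category}>", code)
    return code

print(mask_pii('host = "10.0.0.1"  # contact admin@example.com'))
# host = "<IP_ADDRESS>"  # contact <EMAIL>
```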

Crowd magic: 4,349 hours in 4 days

Toloka teamed up with BigCode to take on the dataset preparation. The traditional approach would be to hire a team of programmers who could read the source code and accurately label all 14 categories of data. Considering the size of the dataset (12,000 code samples), it would take either a lot of programmers or many months of effort.

Fortunately, Toloka is anything but traditional. Our team got the data labeled in just 4 days with the combined efforts of 1,399 Tolokers (some of them non-programmers) from 35 different countries. This added up to 4,349 person-hours, the equivalent of more than two years of full-time work for a single programmer.

Cracking the magic trick

The secret to making the task doable was to break it down into smaller steps, a process that we call decomposition. Let's dissect the project to see how we pulled it off.

Labels and skills

The goal was to find and label 14 categories of personal data: 7 main categories (names, email addresses, passwords, usernames, SSH and API keys, cloud IDs, and IP addresses), 6 subcategories, and an “ambiguous” category. While names and email addresses are easy to understand, some of the categories are difficult for non-experts to recognize when looking at programming code.

We used a quiz to teach Tolokers about all the categories. The quiz gave detailed instructions for each category, followed by a test. After the quiz, Tolokers were assigned a skill level for each category and granted access to tasks based on that skill. For example, if someone was good at labeling names but not IP addresses, they were only given tasks for labeling names.
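
As a rough sketch of how this skill gating works, the snippet below filters Tolokers by their per-category quiz results. The worker names, scores, and the 0.8 pass threshold are assumptions for illustration, not the project's actual settings.

```python
# Hypothetical quiz results: a score per PII category for each Toloker.
skills = {
    "toloker_1": {"NAME": 0.95, "IP_ADDRESS": 0.40},
    "toloker_2": {"NAME": 0.55, "IP_ADDRESS": 0.90},
}

PASS_THRESHOLD = 0.8  # assumed cut-off for granting access to a category

def eligible(category: str) -> list[str]:
    """Tolokers allowed to take labeling tasks for the given category."""
    return [worker for worker, scores in skills.items()
            if scores.get(category, 0.0) >= PASS_THRESHOLD]

print(eligible("NAME"))        # ['toloker_1']
print(eligible("IP_ADDRESS"))  # ['toloker_2']
```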

[Image: Project decomposition before we set it up in Toloka]

The interface

The 12,000 code samples were divided into chunks of 50 lines each. We decomposed the labeling into separate projects by category, which meant that each chunk was labeled in 7 different projects — once for each of the main categories. In any given task, the Toloker was asked to look for just one category of personal data at a time. This meant that more people could successfully contribute to the project with less expertise needed.
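
The decomposition itself is easy to express in code. Below is a minimal sketch, assuming each sample arrives as a plain string and using shorthand category identifiers of our own: splitting a sample into 50-line chunks and pairing each chunk with one main category yields the individual labeling tasks.

```python
# Shorthand identifiers for the 7 main categories (names are our own).
MAIN_CATEGORIES = ["NAME", "EMAIL", "PASSWORD", "USERNAME", "KEY", "CLOUD_ID", "IP_ADDRESS"]

def split_into_chunks(source: str, chunk_size: int = 50) -> list[str]:
    """Split one code sample into consecutive chunks of at most 50 lines."""
    lines = source.splitlines()
    return ["\n".join(lines[i:i + chunk_size]) for i in range(0, len(lines), chunk_size)]

def fan_out(chunks: list[str]) -> list[dict]:
    """One task per (chunk, category) pair: each Toloker looks for a single category."""
    return [{"category": cat, "chunk": chunk} for cat in MAIN_CATEGORIES for chunk in chunks]

# Usage: tasks = fan_out(split_into_chunks(sample_source_code))
```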

We designed a custom JavaScript interface to display the code formatting correctly. Tolokers used color highlighting to select the personal data they found for each category. Each task had a maximum of 4 subcategories to choose from, which made it easy to differentiate them with contrasting colors. The visual simplicity sped up labeling (imagine what it would be like if we tried to label all 14 categories with different colors at once). This is what it looked like:

[Images: screenshots of the labeling interface]

The steps

The project was broken down into 4 steps:

  1. Quiz. We showed Tolokers explanations and examples of how to find each category of personal data and how to handle exceptions. The quiz tested their skills in each category separately.
  2. Labeling in 7 separate projects. Each code chunk was labeled separately for each main category.
  3. Validation in 7 separate projects. Each labeled chunk was shown to other Tolokers, who decided whether the label was correct. Validation was repeated 3–5 times, depending on the confidence and consistency of the answers.
  4. Automatic aggregation. All of the labels were combined for each chunk, and labels that intersected were marked as “Ambiguous” (see the sketch after this list).
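
The aggregation step can be pictured with a small sketch. We assume each label is a character span with a category; the exact data format and merging rules used in production are not published, so this only approximates the idea that overlapping labels from different categories become "Ambiguous".

```python
def aggregate(spans: list[tuple[int, int, str]]) -> list[tuple[int, int, str]]:
    """Merge per-category spans; conflicting overlaps become AMBIGUOUS."""
    merged: list[tuple[int, int, str]] = []
    for start, end, category in sorted(spans):
        overlaps = [s for s in merged if s[0] < end and start < s[1] and s[2] != category]
        if overlaps:
            # Different categories claim the same characters: mark as ambiguous.
            merged = [s for s in merged if s not in overlaps]
            lo = min(start, *(s[0] for s in overlaps))
            hi = max(end, *(s[1] for s in overlaps))
            merged.append((lo, hi, "AMBIGUOUS"))
        else:
            merged.append((start, end, category))
    return merged

print(aggregate([(10, 25, "EMAIL"), (20, 30, "USERNAME"), (40, 50, "IP_ADDRESS")]))
# [(10, 30, 'AMBIGUOUS'), (40, 50, 'IP_ADDRESS')]
```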

Taking care of the crowd

The BigCode team and Toloka share a commitment to responsible data collection and fair treatment of crowd workers.

One of BigCode's main requirements was to pay Tolokers more than minimum wage. Together, we analyzed minimum wage rates and purchasing power in different countries and set an hourly pay rate of $7.30. The project was open to Tolokers in countries where this pay rate would have the same purchasing power as the highest minimum wage in the US ($16.50).

We wanted to reward Tolokers for their performance, so we started with a fair market price on tasks and then awarded bonuses to reach the hourly rate. When the project finished under budget, we allocated all of the remaining funds as final bonuses.
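
For a sense of the mechanics, here is a minimal sketch of the top-up logic, assuming the bonus simply fills the gap between piecework earnings and the target hourly rate. The hourly figure comes from this article, but the exact formula used in production is an assumption.

```python
TARGET_HOURLY_RATE = 7.30  # USD per hour, the rate set for this project

def top_up_bonus(hours_worked: float, base_earnings: float) -> float:
    """Bonus needed so total pay reaches the target hourly rate.

    base_earnings is what a Toloker already earned from per-task prices;
    the real bonus calculation may differ from this simple gap-filling rule.
    """
    target_total = hours_worked * TARGET_HOURLY_RATE
    return max(0.0, round(target_total - base_earnings, 2))

print(top_up_bonus(hours_worked=3.0, base_earnings=15.40))  # 6.5
```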

The project was a big hit with Tolokers. Many people welcomed the opportunity to work with real programming code and learn more about coding in the process. We received waves of positive feedback from people asking for more tasks.

[Image: Some feedback received from Tolokers]

The ethics of BigCode

We did have concerns about the ethics of sharing personal data with the crowd. BigCode took pains to make sure that all the code samples were taken from permissively licensed open sources that were already publicly available, and we made sure that Tolokers understood that the dataset would be used to train a model to handle this task automatically.

As a company, Toloka supports the BigCode goals of transparency, openness, and responsible AI. Open datasets are just part of the equation, but it's essential that those datasets are collected in a fair and transparent way — Toloka's number one priority.
