Subscribe to Toloka News
Subscribe to Toloka News
BigCode's mission is focused on the responsible development and use of large language models for code. This aligns perfectly with Toloka's values. We believe that a high-tech approach to data labeling requires a combination of science and high-quality human input obtained in a responsible and scalable way. That's why working on the BigCode project is so organic to Toloka and mutually beneficial for all the teams involved.
Code LLMs are making developers more productive by taking over the mundane and error prone parts of programming. However, current systems face regulatory challenges such as GDPR, so the BigCode project looked to Toloka for help with removing sensitive information from their dataset called The Stack. This would mean getting help with the annotation of usernames, passwords, and security tokens. With 6.4 TB of code in multiple programming languages in the dataset, this was not a trivial task.
The curators set out to prepare a PII model training dataset to automatically detect and mask personal data in the The Stack. They identified 14 categories of sensitive data that needed to be accurately recognized. The remaining challenge was to find enough experts to label the training data.
Toloka teamed up with BigCode to take on the dataset preparation. The traditional approach would be to hire a team of programmers who could read the source code and accurately label all 14 categories of data. Considering the size of the dataset (12,000 code chunks), it would take a lot of programmers or many months of effort.
Fortunately, Toloka is anything but traditional. Our team got the data labeled in just 4 days with the combined efforts of 1399 Tolokers (some non-programmers) from 35 different countries. This was a total of 4349 person-hours — the equivalent of an entire year of work for one programmer working alone.
The secret to making the task doable was to break it down into smaller steps, a process that we call decomposition. Let's dissect the project to see how we pulled it off.
The goal was to find and label 14 categories of personal data: 7 main categories (names, email addresses, passwords, usernames, SSH and API keys, cloud IDs, and IP addresses), 6 subcategories, and an “ambiguous” category. While names and email addresses are easy to understand, some of the categories are difficult for non-experts to recognize when looking at programming code.
We used a quiz to teach Tolokers about all the categories. The quiz gave detailed instructions for each category, followed by a test. After the quiz, Tolokers were assigned a skill level for each category and granted access to tasks based on skill. So if a certain person was good at labeling names but not good at IP addresses, they were only allowed to do tasks for labeling names.
The 12,000 code samples were divided into chunks of 50 lines each. We decomposed the labeling into separate projects by category, which meant that each chunk was labeled in 7 different projects — once for each of the main categories. In any given task, the Toloker was asked to look for just one category of personal data at a time. This meant that more people could successfully contribute to the project with less expertise needed.
The project was broken down into 4 steps:
The BigCode team and Toloka share a commitment to responsible data collection and fair treatment of crowd workers.
A main concern for the BigCode project was to pay Tolokers more than the minimum wage. Together, we analyzed minimum wage rates and purchasing power in different countries and set an hourly pay rate of $7.30. The project was open to Tolokers in countries where this pay rate would have the same purchasing power as the highest minimum wage in the US ($16.50).
We wanted to reward Tolokers for their performance, so we started with a fair market price on tasks and then awarded bonuses to reach the hourly rate. When the project finished under budget, we allocated all of the remaining funds as final bonuses.
The project was a big hit with Tolokers. Many people welcomed the opportunity to work with real programming code and learn more about coding in the process. We received waves of positive feedback from people asking for more tasks.
We did have concerns about the ethics of sharing personal data with the crowd. BigCode took pains to make sure that all the code samples were taken from permissively licensed open sources that were already publicly available, and we made sure that Tolokers understood that the dataset would be used to train a model to handle this task automatically.
As a company, Toloka supports the BigCode goals of transparency, openness, and responsible AI. Open datasets are just part of the equation, but it's essential that those datasets are collected in a fair and transparent way — Toloka's number one priority.