Quality control is an essential part of any crowdsourcing project because, by default, we know nothing about the performers who complete our tasks. We don't know whether they understood the task correctly, whether they're attentive enough, or whether their intentions are good. We don't even know that they aren't bots. Various quality control methods let us answer all these questions so that we can manage the crowd according to the traits we define.

Designing quality control is a multi-stage process. It starts with decomposing a task, writing instructions and designing a clear interface. All these things help to eliminate misunderstandings and guide the performer through the task. The next steps are:

  • Pre-filtering performers to match project goals
  • Creating a training project that explains all necessary rules
  • Using an entrance test to check if the guidelines were understood
  • Tracking performers' behavior in the project
  • Checking the quality of responses
  • Using smart response aggregation

The results of all these checks can be converted into specific performer attributes that reflect their quality and behavior.

Selecting performers

Though it is not a quality control mechanism in the literal sense, selecting performers is a very important step and the best investment in terms of both data quality and the requester's time. You can lay a foundation for stable data quality by explaining the guidelines and testing performers before they get access to the project. A comprehensive, clear training set and a corresponding test let you recruit new performers whenever necessary, without wasting your time on manual selection.
Pre-filtering performers means offering the task only to people who have certain attributes necessary for the task. Different crowdsourcing platforms offer different filters, but the most popular are location, age, gender, known languages, and devices. If you have good reason to believe that some of these properties will affect task performance, use these filters. Here are some examples of when filters might be useful (a minimal filtering sketch follows the list):
  • If your project involves working with content in a certain language, give access only to users with corresponding language skills.
  • If your project involves UX testing for a product that has an age-based target audience, send tasks to performers in the same age group.
  • If your project involves evaluating content that might not be accessible from some regions, restrict access for users in these regions.
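To make the idea concrete, here is a minimal sketch of what pre-filtering amounts to. The attribute and filter names are hypothetical examples, not an actual platform API; on Toloka the same checks are configured as pool filters rather than written by hand.

```python
# A minimal pre-filtering sketch. The attribute names below are illustrative,
# not a real platform schema; on a crowdsourcing platform these checks are
# usually configured as filters in the pool settings.

def matches_filters(performer: dict, filters: dict) -> bool:
    """Return True if a performer satisfies every required filter."""
    for attribute, allowed_values in filters.items():
        if performer.get(attribute) not in allowed_values:
            return False
    return True

# Only offer the task to German speakers outside restricted regions.
task_filters = {
    "language": {"DE"},
    "region": {"DE", "AT", "CH"},
}

performers = [
    {"id": "p1", "language": "DE", "region": "AT"},
    {"id": "p2", "language": "EN", "region": "US"},
]

eligible = [p["id"] for p in performers if matches_filters(p, task_filters)]
print(eligible)  # ['p1']
```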

For more information on filters available on Toloka, see our Requester's Guide.

After pre-filtering, you need to teach your performers to apply the rules and guidelines to real-life tasks. To do this, create a training set: tasks with hints that explain how to complete the task and why. Here are some suggestions for creating effective training projects:
  • Include examples for all the rules and guidelines in your instructions, even the simplest ones.
  • Don't make it too long – just cover all necessary cases.
  • In the comments, explain why a certain response is appropriate with a reference to the instructions. Don't just state the correct answer.
There's a special type of pool in Toloka intended solely for entrance training. Learn more about the available settings.
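As a small illustration of the last suggestion above, a training task typically bundles the input, the correct answer, and a hint that points back to the guidelines. The field names below are illustrative, not a specific platform schema.

```python
# A minimal sketch of a training task record. Field names are hypothetical.
training_task = {
    "input": {"image_url": "https://example.com/cat.jpg"},
    "correct_answer": "animal",
    "hint": (
        "This is an animal, not a person: per section 2.1 of the guidelines, "
        "any photo whose main subject is a pet belongs to the 'animal' class, "
        "even if people are visible in the background."
    ),
}
```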
After a performer finishes the training, redirect them to a testing set, which is a set of tasks with known answers. After a performer completes the test, you can calculate the percentage of correct answers and decide whether it's high enough for them to stay on the project. Here are some suggestions for setting up a test:
  • Testing tasks should match the topics and complexity of the training.
  • It's better to create several versions of the test and alternate between them, making it harder for performers to cheat by sharing correct answers with each other.
  • Testing tasks should be of the highest quality because you are using them to make important decisions about performers. Pay attention to examples where performers make a lot of mistakes: it's a sign that either something's wrong with the task or the concept wasn't explained well enough in the instructions.
A testing set is technically a pool that only consists of control tasks. See the step-by-step description of adding control tasks and quality calculation rules in our Requester's Guide.
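The admission decision itself is simple arithmetic. Here is a minimal sketch, assuming an 80% pass threshold and a hand-rolled data layout; on Toloka this logic is expressed through quality calculation rules rather than custom code.

```python
# A minimal sketch of admitting performers based on test results.
# The 80% threshold and the data layout are assumptions for illustration.

def passes_test(answers: dict, correct_answers: dict, threshold: float = 0.8) -> bool:
    """Return True if the share of correct test answers meets the threshold."""
    graded = [answers.get(task_id) == truth for task_id, truth in correct_answers.items()]
    accuracy = sum(graded) / len(graded)
    return accuracy >= threshold

correct_answers = {"t1": "cat", "t2": "dog", "t3": "cat", "t4": "other"}
performer_answers = {"t1": "cat", "t2": "dog", "t3": "dog", "t4": "other"}

print(passes_test(performer_answers, correct_answers))  # 0.75 < 0.8 -> False
```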

Synchronous quality control

Most quality control methods are designed to control performers' behavior and assessments with the smallest possible delay. It is important to detect low-quality performers quickly, before they produce too much useless data and waste too much money. The following sections describe some of the popular control mechanisms available on crowdsourcing platforms.
The first group of checks is designed to reveal whether a performer regularly demonstrates suspicious behavior, such as browsing through tasks too quickly and inattentively. Here are some popular approaches to controlling bot-like behavior:
  • CAPTCHA. This well-known tool helps filter out bots by displaying an image with text that the user must decipher and enter. It's a great basic mechanism, but it's better not to rely on it alone, because it only filters out bots and users who pay no attention at all. Besides, regular users also make CAPTCHA mistakes because CAPTCHAs can be complicated or simply irritating. The best practice here is to ban users only after a certain number of missed CAPTCHAs in a row (more than one).
  • Speed monitoring. For every crowd-based task there's a reasonable amount of time it takes to understand the task and make an assessment. If a certain user regularly closes tasks much faster than the average time, they're probably just clicking through them without looking carefully. The threshold is different for every project, but it is often set at 10-20% of the average completion time. If a performer submits a string of assignments under this limit, it's better to single them out and ban them from the project (see the sketch after this list).
  • Checking for certain actions. A crowd task is basically an interface, which allows you to implement any controls or checks based on JavaScript. If a task involves interacting with content, such as playing a media file, visiting a link, or typing in text, these interactions can be checked and analyzed. As discussed in the Interfaces section, some of these checks can be turned into warnings. Another option is to keep a record of performer actions and calculate performer attributes based on how diligently a user handles the task content.
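Here is a minimal speed-monitoring sketch. The 15% threshold and the window of ten recent assignments are illustrative defaults, not platform settings; real projects should tune both to their own timing data.

```python
# A minimal speed-monitoring sketch. The 15% threshold and the window of
# 10 recent assignments are assumptions for illustration.
from statistics import median

def is_suspiciously_fast(recent_seconds: list[float],
                         typical_seconds: float,
                         ratio: float = 0.15,
                         window: int = 10) -> bool:
    """Flag a performer whose recent assignments are far faster than typical."""
    if len(recent_seconds) < window:
        return False  # not enough evidence yet
    recent = recent_seconds[-window:]
    return median(recent) < ratio * typical_seconds

typical = 60.0  # average time per assignment in this project, in seconds
fast_performer = [5, 6, 4, 7, 5, 6, 5, 4, 6, 5]
print(is_suspiciously_fast(fast_performer, typical))  # True -> review or ban
```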
For most tasks it's not enough to identify users who mindlessly blast through tasks; it's also important to check whether those who work diligently are following the guidelines and producing quality data. Checking assignment quality depends on having a benchmark to compare the answers against. There are two major approaches to establishing a benchmark: majority vote and expert opinion.
  • Majority vote is based on task overlap. Overlap is when several users complete the same assignment. After all assessments are submitted, you can find which answer was selected by most users and use it as a benchmark. By comparing a particular performer's answers to benchmarks like these, you can calculate the percentage of cases when their opinion contradicts the opinion of the majority. This percentage can help you detect unstable quality in specific performers. However, majority vote mechanisms can sometimes discriminate against unusually attentive users who submit correct but unobvious answers. Majority vote calculations are also prone to being skewed by spam.
  • Control tasks (a.k.a. golden sets) solve both of these problems. A control task is a task with a known correct answer. Performers get these tasks mixed into their general task flow without knowing that they entail a special check. By adding control tasks to your project, you can determine the percentage of correct responses for the project as a whole and for individual users (a minimal sketch of both checks follows this list).
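Here is a minimal sketch of the two benchmark-based checks described above. The data layout is an assumption for illustration; Toloka computes these metrics through its built-in quality control rules.

```python
# A minimal sketch of majority-vote benchmarks and per-performer agreement.
from collections import Counter

def majority_vote(labels: list[str]) -> str:
    """Pick the label chosen by most performers for one task."""
    return Counter(labels).most_common(1)[0][0]

# Answers per task from the overlapping performers.
answers = {
    "task1": {"alice": "cat", "bob": "cat", "carol": "dog"},
    "task2": {"alice": "dog", "bob": "dog", "carol": "dog"},
}
benchmarks = {task: majority_vote(list(by_user.values()))
              for task, by_user in answers.items()}

def agreement_with_majority(user: str) -> float:
    """Share of tasks where a performer agrees with the majority answer."""
    graded = [by_user[user] == benchmarks[task]
              for task, by_user in answers.items() if user in by_user]
    return sum(graded) / len(graded)

print(benchmarks)                        # {'task1': 'cat', 'task2': 'dog'}
print(agreement_with_majority("carol"))  # 0.5

# The same idea works for control tasks: replace `benchmarks` with the
# known correct answers and compute the share of matches per performer.
```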

Toloka allows you to set up these and other quality control mechanisms as independent settings or quality control presets. They are available at both the pool and project levels.

Which overlap should I choose? 
Research shows that the optimal overlap for task assignment is 3 to 5. Beyond this, quality increases only slightly while the price increases significantly. Be sure to check the section about incremental relabeling, which is a smart way to control overlap costs.
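To see why the gains taper off, here is a back-of-the-envelope calculation under a simplifying assumption: each performer answers a binary task correctly with probability 0.8, independently of the others. The numbers are illustrative, not the cited research results.

```python
# Diminishing returns from overlap under an independent-worker assumption
# (each performer correct with probability p = 0.8 on a binary task).
from math import comb

def majority_accuracy(p: float, overlap: int) -> float:
    """Probability that the majority of an odd number of performers is correct."""
    return sum(comb(overlap, k) * p**k * (1 - p)**(overlap - k)
               for k in range(overlap // 2 + 1, overlap + 1))

for overlap in (1, 3, 5, 7, 9):
    print(overlap, round(majority_accuracy(0.8, overlap), 3))
# 1 0.8, 3 0.896, 5 0.942, 7 0.967, 9 0.98 -> gains shrink while cost grows linearly
```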

How many control tasks should I add?
If a task set consists of several hundred tasks, then 10% of the pool should be control tasks. If the set contains thousands of tasks, just 1% is enough.
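One way to read this rule of thumb as code, with the size cutoff taken as a rough assumption:

```python
# Encodes the rule of thumb above; the 1000-task cutoff is approximate.
def control_task_count(pool_size: int) -> int:
    """Suggested number of control tasks for a pool of the given size."""
    share = 0.10 if pool_size < 1000 else 0.01
    return max(1, round(pool_size * share))

print(control_task_count(500))    # 50  (10% of a few hundred tasks)
print(control_task_count(20000))  # 200 (1% of a large pool)
```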

How do I create control tasks?
There are two ways to create control tasks:

  • Send the tasks to a "trusted" crowd. Select performers who are reliable in delivering high-quality results. Then create a separate project for them to generate correct responses for tasks and launch it with high overlap. Don’t forget to use additional quality control methods, just in case.
  • Choose experts at your company who can label the data well. Have them generate control tasks to use for checking the quality of performers' responses. In other words, 10% of the tasks are completed by an internal team and used for controlling 90% of the entire labeling process.
How can I maintain the quality of control tasks?
First, here's a tip: if the quality or labeling speed is low, check the control tasks. Perhaps they contain incorrect, outdated or unclear examples. Control tasks need special maintenance because the project's quality metrics and individual performers' quality depend on them. There are two best practices:

  • Get rid of old control tasks and replace them with new ones.
  • Check for suspicious control tasks where performers constantly make mistakes. The task may contain a mistake or a corresponding guideline may be unclear.
There is one more important feature of a control task set: if possible, classes should be represented in the control set in proportions similar to the classes in the general pool of tasks. Let's say you need to determine the type of accommodation on a hotel aggregator website: family, business, casual, or luxury. Suppose luxury accommodations make up just 10% of the main pool, but performers see them in every second control task. As a result, you will not be able to check whether performers correctly label the other types of accommodation, and you risk getting poor results with noisy data.
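A simple proportion check can catch this kind of imbalance. Here is a minimal sketch using the hotel example above; the factor-of-two tolerance is an arbitrary illustrative threshold.

```python
# Compare class proportions in the control set against the main pool.
# The 2x tolerance is an illustrative threshold, not a platform rule.
from collections import Counter

def class_shares(labels: list[str]) -> dict[str, float]:
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}

main_pool = ["family"] * 40 + ["business"] * 30 + ["casual"] * 20 + ["luxury"] * 10
control_set = ["family"] * 4 + ["business"] * 3 + ["casual"] * 3 + ["luxury"] * 10

pool_shares = class_shares(main_pool)
control_shares = class_shares(control_set)

for label, pool_share in pool_shares.items():
    control_share = control_shares.get(label, 0.0)
    if control_share > 2 * pool_share or control_share < pool_share / 2:
        print(f"'{label}' is over- or under-represented: "
              f"{control_share:.0%} in control vs {pool_share:.0%} in pool")
# 'luxury' is over- or under-represented: 50% in control vs 10% in pool
```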

Asynchronous quality control

Synchronous quality control methods are only applicable to tasks where there's a single correct answer. But not all crowd tasks are like that: some demand a creative approach or content processing and can have a variety of correct solutions. Tasks like these can be checked via assignment review. There are two possible ways to review a task:
  • The review can be done by someone on the requester's side. A crowdsourcing platform may even have this option in its UI. But this is only a viable option for very small data volumes (or for requesters with unlimited resources).
  • Other crowd performers can do the review. This requires starting a new project with the labeled data transferred from the first project and asking performers whether each task was completed correctly. After verification, incorrect tasks are sent back to be redone, payments for correctly completed tasks are sent to the performers, and performer quality can be determined (see the sketch after this list).
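Here is a minimal sketch of routing review verdicts from such a verification project. The data layout and function names are hypothetical; on Toloka this corresponds to the assignment review workflow configured in the platform settings.

```python
# A minimal sketch of routing crowd review verdicts. Layout is hypothetical.

def route_verdicts(assignments: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split reviewed assignments into accepted (pay) and rejected (redo)."""
    accepted = [a for a in assignments if a["verdict"] == "correct"]
    rejected = [a for a in assignments if a["verdict"] != "correct"]
    return accepted, rejected

reviewed = [
    {"assignment_id": "a1", "performer": "alice", "verdict": "correct"},
    {"assignment_id": "a2", "performer": "bob", "verdict": "incorrect"},
]

accepted, rejected = route_verdicts(reviewed)
# Pay for accepted assignments, send rejected ones back for relabeling,
# and update each performer's acceptance rate from these verdicts.
print(len(accepted), len(rejected))  # 1 1
```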
See the Requester's Guide for a step-by-step description of assignment review settings.
All of this data about performers' individual quality and their approach to certain tasks can become the basis for further quality enhancement via smart aggregation.