Principled Design for Crowdsourcing

by Toloka Team on Mar 3rd, 2021
New series: Crowd Science Seminars

Introducing Crowd Science Seminars: a series of international online meetups highlighting scientific research related to crowdsourcing, organized bi-weekly by the Toloka team. We’ll be using the blog to share the main ideas and key takeaways from each event. This article is the first in the series, zeroing in on the crowdsourcing aspects of each topic.

In our first post, we cover a recent talk by Ivan Stelmakh, who presented his research with Nihar Shah and Aarti Singh on the importance of “principled design of human decision-making systems”. Stelmakh examined crowdsourcing as one example of a human decision-making system, which is what we’ll focus on here.

What is a human decision-making system?

A human decision-making system uses input from a group of individuals to solve complex problems. Here are some specific examples:

  • Crowdsourcing. Platforms like Toloka or Amazon Mechanical Turk maintain a crowd of contributors who perform tasks for compensation. If you need to complete a huge task, like label a dataset with thousands of images, you can pay the crowd to do it for you. A large number of people work side by side to solve complex tasks.
  • Performance reviews. Companies usually check employee performance at regular intervals, and these reviews are often based on evaluations made by other employees.
  • Academic peer review. Scientific research articles go through peer review before being approved for publication: each article is evaluated by several researchers who work in the same field and are trusted to validate the credibility of the research.
  • Peer grading in education. Students grade each other's homework. Online learning platforms like Coursera often use this method to save instructors' time and effort and reinforce learned material.

All of these systems are beneficial because they capitalize on shared knowledge and insights. However, as human decisions are prone to biases and errors, these systems also have inherent pitfalls which need to be accounted for.

Why is principled design important?

Human systems tend to have systematic problems. In a 2016 article for Nature, Drummond Rennie makes a strong point: the very system responsible for validating the rigor of scientific research, academic peer review, is itself unscientific. And if peer review is not scientific, how can the results of performance reviews or crowdsourcing be considered reliable?

Stelmakh's research focuses on the principled design of human decision-making systems: he designs tools and techniques to address the problems that inevitably arise in large systems where humans make decisions. At the seminar, he presented three main problems to tackle:

  1. Noise (incorrect results)
  2. Strategic behavior (cheating or manipulation)
  3. Bias (systematic deviations in decision making)

Let's look at how these issues can be managed in the crowdsourcing context.


Noise

Problem: People are not always careful and often make mistakes.

Solution: Rely on the accuracy of the group, not individuals. The Golden Rule of crowdsourcing is to assign each task to multiple performers and aggregate their responses. The crowdsourcing industry has put a lot of effort into developing aggregation methods that take results from multiple people and combine them to boost the overall quality of data. One of the simplest approaches is majority vote: examine multiple responses to the same task and assume that the most popular answer is correct. Even a basic approach like this goes a long way toward reducing noise.
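As a minimal sketch, majority vote takes only a few lines of Python (the `majority_vote` helper and the sample labels are illustrative, not part of any platform's API):

```python
from collections import Counter

def majority_vote(responses):
    """Return the most common answer among performer responses for one task."""
    counts = Counter(responses)
    answer, _ = counts.most_common(1)[0]
    return answer

# Five performers labeled the same image; three say "cat", two say "dog"
labels = ["cat", "cat", "dog", "cat", "dog"]
print(majority_vote(labels))  # → cat
```

Note that with an even number of performers a tie is possible, which is one reason an odd overlap (3, 5, ...) is a common choice in practice.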

Strategic behavior

Problem: In crowdsourcing, the performer's goal is often to earn as much money as possible, which might not match up with the requester's goal to get high-quality data. Performers sometimes generate spam, use scripts to do their tasks, or click through tasks randomly to speed up the process or “game the system”.

Solution: Design a system where the incentives for performers are aligned with the goals of the organizers. A simple approach is to check the performer's accuracy by using a set of questions that you know the correct answers to (called a golden set). The performer’s quality rating can be directly connected to their pay rate, which motivates them to put more effort into the task.
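To make this concrete, here is a hypothetical sketch of how a requester might score a performer against a golden set and tie the pay rate to the result. The function names and pay thresholds are made up for illustration, not taken from any platform:

```python
def golden_set_accuracy(performer_answers, golden_answers):
    """Fraction of golden (control) tasks the performer answered correctly."""
    correct = sum(
        1 for task, truth in golden_answers.items()
        if performer_answers.get(task) == truth
    )
    return correct / len(golden_answers)

def pay_rate(accuracy, base_rate=0.01):
    """Tiered pay per task; thresholds here are purely illustrative."""
    if accuracy >= 0.9:
        return base_rate * 1.5   # bonus tier for high accuracy
    if accuracy >= 0.7:
        return base_rate         # standard tier
    return 0.0                   # below threshold: filter the performer out

golden = {"t1": "A", "t2": "B", "t3": "A"}
answers = {"t1": "A", "t2": "C", "t3": "A"}
acc = golden_set_accuracy(answers, golden)  # 2 of 3 correct
```

Because the golden tasks look identical to regular ones, a performer who clicks through randomly gets caught by a low accuracy score rather than by manual inspection.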


Bias

Problem: All people have biases, whether they are aware of them or not. Manifestations of these biases are often too subtle to detect, yet they can hurt the quality of collected data. For example, performers can be influenced by aspects of the task design without realizing it: missing information that causes doubt, a preference for the options listed first, or similarity between response options.

Solution: Crowdsourcing projects can take a practical approach to reducing opportunities for bias. Careful instructions reduce uncertainty and simplify the task for performers. To avoid social influence, performers are usually not allowed to see answers given by other people. Another common strategy is a randomized interface that shuffles the order of response options for each performer.
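A randomized interface can be as simple as shuffling the options independently for each performer. A hypothetical sketch, with names of our own invention:

```python
import random

def options_for_performer(options, performer_id):
    """Return the response options in a shuffled order for this performer.

    Seeding the RNG with the performer's id keeps the order stable for
    that performer across page loads while still varying across performers.
    """
    rng = random.Random(performer_id)
    shuffled = list(options)
    rng.shuffle(shuffled)
    return shuffled
```

Averaged over many performers, any individual preference for the first position is spread evenly across all options instead of systematically favoring one.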

That's not all...

Human error in data labeling can have far-reaching consequences — if you use crowdsourcing to collect a large dataset and then train algorithms with it, any problems in the data (like bias or noise) will also appear in the machine learning models. However, there are many ways to control quality and compensate for human error in crowdsourcing, with ongoing research in this area.

This post covered only an overview of the crowdsourcing aspect of Ivan Stelmakh's research. For the full analysis and more applications of these problems and solutions, watch the presentation:
