Aggregating categorical responses via crowdsourcing: two classical algorithms

by Toloka Team

Successful aggregation of data lies at the heart of crowdsourcing. This hands-on demo briefly looks at how categorical responses can be aggregated using two classical algorithms, Majority Vote and Dawid-Skene, to help you meet your data labeling goals even faster. The demo is based on the research presented at ICML 2021 by one of our key developers, Dmitry Ustalov.

We’ll be using Crowd-Kit, an open-source computational quality control library that offers efficient implementations of various quality control methods, including aggregation, uncertainty, agreement, and more. But feel free to use any of the alternatives: spark-crowd (Scala instead of Python), CEKA (Java instead of Python), or Truth Inference (Python, but it supports only categorical and numeric responses and doesn’t have an all-in-one API). Crowd-Kit is designed to work with Python data science libraries like NumPy, SciPy, and Pandas, while providing a familiar programming experience built on well-known concepts and APIs. It’s also platform-agnostic: as soon as you provide the data as a table of performers, tasks, and responses, the library will deliver high-quality output regardless of the platform you use.

In this demonstration, we will aggregate some responses provided by crowd performers. For the project, they had to indicate whether the link to a target website was correct or not. Given that we asked multiple performers to annotate each URL, we needed to choose the correct response, considering performers’ skills and task difficulties. That’s why we needed aggregation. It’s a vast research topic, and there are many methods available for performing this task, most of which are based on probabilistic graphical models. Implementing them efficiently is another challenging task.

Our demo will use Google Colab, but any other Python programming environment would work fine. First, we need to install the Crowd-Kit library from the Python Package Index. We’ll also need annotated data. We’ll be using Toloka Aggregation Relevance datasets with two categories: relevant and not relevant. These datasets contain anonymous data that is safe to work with. We’ll need the Crowd-Kit dataset downloader to download them from the Internet as Pandas data frames. Again, feel free to use a different source of annotated data; open datasets are, naturally, fair play as well. Now we’re ready to go.
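
Here’s a rough sketch of the setup. The identifier relevance-2 refers to the two-category relevance dataset; note that newer Crowd-Kit releases name the performer column worker:

```python
# Install Crowd-Kit from the Python Package Index first:
# pip install crowd-kit

from crowdkit.datasets import load_dataset

# Download the two-category relevance dataset as Pandas objects:
# df holds the raw responses, df_gt the ground truth labels.
df, df_gt = load_dataset('relevance-2')

df.head()
```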

The load_dataset function returns a pair of elements. The first element is the Pandas data frame with the crowdsourced data. The second element is the ground truth dataset, when one is available. The data frame (df) has three columns: performer, task, and label. The label is 0 if the given performer rated the document in the given task as non-relevant, and 1 otherwise. The ground truth dataset (df_gt) is a Pandas series that contains the correct responses to the tasks, with task identifiers as its index. Let's proceed with the aggregation using majority vote, a very simple heuristic method.
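
In Crowd-Kit, this takes a couple of lines (a sketch; agg_mv is just our variable name):

```python
from crowdkit.aggregation import MajorityVote

# The most frequent label per task wins; ties are broken randomly.
agg_mv = MajorityVote().fit_predict(df)

agg_mv.head()
```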

This simple heuristic works extremely well, especially with small datasets, so it’s always a good idea to try it. Note that the ties are broken randomly to avoid bias towards the first occurring label.

However, the classical majority vote approach does not take the performers’ skills into account. Sometimes it’s useful to weigh every performer’s contribution to the final label proportionally to their agreement with the aggregate. This approach is called Wawa (Worker Agreement With Aggregate), and Crowd-Kit offers it too. Internally, it computes the majority vote and then re-weighs the performers’ votes by the fraction of their responses that match that majority vote.
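
The code follows the same fit_predict pattern, roughly:

```python
from crowdkit.aggregation import Wawa

# Majority vote first, then each performer's vote is re-weighted by
# the fraction of their responses that agree with the majority.
agg_wawa = Wawa().fit_predict(df)

agg_wawa.head()
```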

Now we perform the same operation with Dawid-Skene.
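
A sketch with 10 EM iterations (the iteration count here is an arbitrary choice):

```python
from crowdkit.aggregation import DawidSkene

# Run the Dawid-Skene expectation-maximization for 10 iterations.
agg_ds = DawidSkene(n_iter=10).fit_predict(df)

agg_ds.head()
```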

Dawid-Skene is another classical aggregation approach in crowdsourcing, originally designed in the late 1970s for probabilistic modeling of medical examinations. The code is virtually the same: we create an instance, set the number of algorithm iterations, call fit_predict, and obtain the aggregated results.

Now, let's evaluate the quality of our aggregation. We will use the well-known F1 score from the scikit-learn library.
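
For example, scoring the majority vote results could look like this:

```python
from sklearn.metrics import f1_score

# Compare the aggregated labels with the ground truth on the
# tasks where the correct answer is known.
print(f1_score(df_gt, agg_mv[df_gt.index]))
```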

In this dataset, the ground truth labels are available only for a subset of tasks, so we need to use index slicing. This also allows us to perform model selection using well-known and reliable tools like Pandas and scikit-learn together with Crowd-Kit.
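
One way to compare all three methods on the labeled subset:

```python
# Evaluate each aggregation method with the same metric.
for name, agg in [('Majority Vote', agg_mv),
                  ('Wawa', agg_wawa),
                  ('Dawid-Skene', agg_ds)]:
    print(name, f1_score(df_gt, agg[df_gt.index]))
```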

In our experiment, the best quality was achieved by the Dawid-Skene model. Having selected the model, we want to export all of the aggregated data so it can be used in downstream applications.

We’ll now use Pandas to save the aggregation results as a TSV file, after transforming the series to a data frame and specifying the desired column name.
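
A sketch, assuming we keep the Dawid-Skene results and call the column label:

```python
# Turn the aggregated series into a data frame with a named column
# and write it out; task identifiers go into the index column.
agg_ds.to_frame('label').to_csv('aggregated.tsv', sep='\t')
```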

Let’s take a look inside: each row pairs a task identifier with its aggregated label.
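
A quick check with Pandas (using the file name from the previous step):

```python
import pandas as pd

# Read the exported TSV back to inspect the aggregated labels.
print(pd.read_csv('aggregated.tsv', sep='\t').head())
```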

We’ve obtained aggregated data in just a few lines of code. Great work!

A version of this article was originally published on Hackernoon.
