Guide: how to evaluate the quality of annotated data

Boris Tseytlin

How do you know if the collected data is any good?

We at Toloka naturally know a thing or two about that. In this guide, I will describe how to estimate the quality of your data annotations, discuss which metrics to use, and cover common issues that might arise.

For the sake of simplicity, let's focus on classification tasks: there are sample images or texts, and the annotator needs to find an appropriate label for each one. This is the most common type of crowdsourcing task. All metrics in this article are selected with classification in mind, but metrics aren't as important as principles. The principles of quality evaluation we are going to discuss are universal.

First, some definitions.

  • Data/dataset/labeling: Multiple samples plus crowd labels. "Crowd labels" means that each sample has labels from multiple crowd workers. For example, if we're identifying pornographic content in pictures, then images are the samples, and the annotator responses are "prohibited content" and "no prohibited content" labels. With an overlap of three, there will be three labels per sample.
  • True label/class: The true category the sample belongs to. It usually can't be observed directly.
  • Predicted label/class: The category the sample belongs to according to an annotator. In other words, their response to a task.
  • Aggregated label/class: A sample label obtained by majority vote or other aggregation of multiple annotator labels.
  • Labeling quality: The extent to which the resulting labels meet the original objective. For example, let's say a dataset was collected to train a model. In this case, the label quality is good if a model can be trained on these labels and perform well.
  • Metric: a quality measurement expressed as a number. A metric filters and compresses the information in the dataset, leaving only the relevant signal.

If you are familiar with classification metrics in machine learning then you can skip the introductions and jump straight to the "Approaches to evaluating labeling quality" section.

Below I'll describe accuracy as an example of a simple metric. After this we will get into the principles behind creating and selecting metrics for quality evaluation in crowdsourcing. Finally, we will discuss specific metrics for classification tasks and things to note when using them.

I love code and examples, so we will have code and examples. We will test all methods on a crowd-labeled toy dataset with available true labels: IMDB Movie Reviews. The task is to determine whether movie reviews are positive or negative.

You can view the entire code here and the data here and here.

Code for data loading and preparation:

import pandas as pd
import numpy as np
import plotly.express as px
import crowdkit

df_gt = pd.read_csv('imdb_crowd/train.csv')  # True train labels
df_a06 = pd.read_csv('imdb_crowd/train_crowd_alpha06.csv')  # Crowd-annotated train

# Split into regular tasks and control (golden) tasks; copy to avoid modifying views
is_golden = ~pd.isnull(df_a06['GOLDEN:result'])
df_a06, df_a06_golden = df_a06[~is_golden].copy(), df_a06[is_golden].copy()

# Attach the true label to every crowd response by matching on the review text
true_task_labels = df_gt[['text', 'label']].drop_duplicates().set_index('text')['label'].to_dict()
df_a06['true_label'] = df_a06['INPUT:text'].apply(lambda t: true_task_labels[t])
df_a06_golden['true_label'] = df_a06_golden['INPUT:text'].apply(lambda t: true_task_labels[t])

def rename_columns(df):
    return df.rename(columns={
        'ASSIGNMENT:task_id': 'task',
        'OUTPUT:result': 'label',
        'ASSIGNMENT:worker_id': 'worker',
        'GOLDEN:result': 'golden_label',
    })

df = rename_columns(df_a06)
df_golden = rename_columns(df_a06_golden)
df.shape, df_golden.shape

Accuracy: the simplest quality metric

Accuracy is a popular, simple, and intuitive metric.

Let's say our task is to determine the sentiment of movie reviews. There are two classes: positive reviews and negative reviews. We have the correct labels and labels given by crowd annotators for 100 reviews. Majority vote is applied to annotator responses to get one label for each task. Annotators correctly labeled 35 of 50 positive reviews and 45 of 50 negative reviews.

Accuracy is the share of correct responses.

$$accuracy = \frac{35+45}{100} = 0.8$$

This metric is easy to interpret: annotators correctly label 80% of reviews.

Now let's take a look at another set of 100 reviews, of which 10 are positive and 90 negative. The annotators labeled 1 positive and 89 negative reviews correctly.

$$accuracy = \frac{89+1}{100} = 0.9$$

The metric increased. But is the labeling better? No. The annotators labeled almost all reviews as negative, giving nearly the same response in every case.

Is that even possible in practice?
It is! If you set up a project with no quality control, annotators can simply click through with the same response to earn their money faster.

The metric increased, while the labeling quality did not. The metric doesn't work for the job at hand. One of the classes has many more samples than the other, the classes are imbalanced, and the accuracy metric doesn't take that into account.
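To make this concrete, here is a minimal sketch in plain Python that rebuilds the imbalanced example and compares it with a baseline that never looks at the review at all. The `accuracy` helper and the "always negative" annotator are my own illustration, not part of the IMDB code.

```python
def accuracy(true_labels, predicted_labels):
    """Share of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Imbalanced example from the text: 10 positive and 90 negative reviews.
true_labels = ["pos"] * 10 + ["neg"] * 90

# Crowd labels: 1 of 10 positives and 89 of 90 negatives are labeled correctly.
crowd_labels = ["pos"] * 1 + ["neg"] * 9 + ["neg"] * 89 + ["pos"] * 1

# Hypothetical baseline: a click-through annotator who always answers "neg".
always_negative = ["neg"] * 100

crowd_accuracy = accuracy(true_labels, crowd_labels)
baseline_accuracy = accuracy(true_labels, always_negative)
print(crowd_accuracy, baseline_accuracy)
```

Both numbers come out the same, so on this dataset accuracy cannot tell the crowd apart from someone who never reads the reviews.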

This example is here to demonstrate that we shouldn't treat metrics as black boxes, just thoughtlessly feeding numbers into them. What we see as good labeling depends on the task. An appropriate metric must be selected based on the task. Let's see how to do that.

Approaches to evaluating labeling quality

Control tasks quality evaluation

To evaluate labeling quality, we need to check annotator responses. They have to be compared with something.

One popular method is to use control tasks. They are samples we know the correct labels for. While labeling, annotators get a mix of regular and control tasks to label. As a result, we can compare their responses on control tasks to the correct labels.

Sample code for IMDB:

from sklearn.metrics import accuracy_score
control_task_labels = df_golden['golden_label']
crowd_labels = df_golden['label']
accuracy_score(control_task_labels, crowd_labels)
# Output: 0.853

By averaging the accuracies of individual annotators, we get an estimate of the overall quality. For example, if the average annotator accuracy on control tasks is 0.8, we can expect about 80% of labels to be correct.

Control tasks work well for quality evaluation when:

  1. Control tasks are labeled correctly.
  2. Control tasks are representative in terms of class, complexity, and the number of samples. In other words, they look similar to regular tasks.
  3. Annotators respond to control tasks the same way they respond to regular tasks.

In real life, these conditions are often not met. If a task ended up in crowdsourcing, it means clean labels are not available. In that case, control tasks are created by the requester, and they might contain errors. There's no guarantee that control tasks cover all classes and are representative overall. To make matters worse, annotators might adapt to control tasks and respond to them differently, for example by memorizing them.

Where do control tasks usually come from?
A subset of samples is selected and labeled by many annotators each, for example ten. The tasks where all annotators were in agreement are turned into control tasks. Naturally, this means simpler examples are selected for control tasks since it's more likely annotators will give the same responses to them.

The main problem with control tasks is that there aren't many of them. For each project, there will be many annotators who completed two, one, or even zero control tasks. It's impossible to get a reliable evaluation when the sample size is so small.

Let's see how control task accuracy reflects actual accuracy in the case of IMDB.


In this plot, on the X-axis is annotator accuracy on control tasks, while on the Y-axis is annotator accuracy on true labels. Each point represents a single annotator. Correlation coefficient: 0.33, R²: 0.24.

Code for the plot:

def compute_worker_metrics(df, true_col, metric, min_answers=1):
    # Collect each worker's labels and the reference labels for the same tasks
    worker_labels = df[['worker', 'label', true_col]].groupby('worker').agg({'label': list, true_col: list})
    # Compute the metric per worker; workers with too few answers get None and are dropped
    worker_metrics = worker_labels.apply(lambda row: metric(row[true_col], row.label) if len(row.label) >= min_answers else None, axis=1)
    worker_metrics = worker_metrics.dropna()
    return worker_metrics

worker_true_accuracy = compute_worker_metrics(df, 'true_label', accuracy_score)
worker_golden_accuracy = compute_worker_metrics(df_golden, 'golden_label', accuracy_score)
worker_metrics = pd.DataFrame({
    'true_accuracy': worker_true_accuracy,
    'golden_accuracy': worker_golden_accuracy,
})
fig = px.scatter(worker_metrics, x='golden_accuracy', y='true_accuracy', trendline="ols")
results = px.get_trendline_results(fig)

Note the weird "columns" of points on the plot. For all the annotators in each column, the accuracy on control tasks is the same, but the true quality of their responses is very different. According to control tasks, many annotators have an accuracy of 1, the maximum, while the true quality of their responses is poor.

Despite all the limitations, control tasks work well for evaluating quality more often than not. However, there are usually not enough of them, so the estimate might be unreliable. To get more reliable estimates we need to add something else.

Consistency-based quality evaluation

Another approach is to evaluate quality without control tasks. This approach is based on a simple idea: if many annotators read the same instructions, got the same task, and gave the same response, the resulting label is likely to be correct. If their responses are different, most likely there is a problem. Annotator agreement, also called consistency, correlates strongly with quality. We can create a metric based on this.

For example, let's aggregate responses using majority vote (MV): for each task, we select the most frequent response. Next, we can calculate how often annotators give the same response as the majority. Basically we calculate annotator accuracies compared to MV labels. We can average that over all annotators to get an estimate of overall quality.
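Here is a toy sketch of that procedure in plain Python, before we run it on real data. The tasks, workers, and labels below are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (task, worker, label) triples with an overlap of three.
responses = [
    ("t1", "w1", "pos"), ("t1", "w2", "pos"), ("t1", "w3", "neg"),
    ("t2", "w1", "neg"), ("t2", "w2", "neg"), ("t2", "w3", "neg"),
    ("t3", "w1", "pos"), ("t3", "w2", "neg"), ("t3", "w3", "neg"),
]

# Majority vote: the most frequent label for each task.
labels_by_task = defaultdict(list)
for task, _, label in responses:
    labels_by_task[task].append(label)
mv_labels = {task: Counter(labels).most_common(1)[0][0]
             for task, labels in labels_by_task.items()}

# Per-annotator agreement with the majority (accuracy against MV labels).
hits = defaultdict(list)
for task, worker, label in responses:
    hits[worker].append(label == mv_labels[task])
worker_mv_accuracy = {w: sum(h) / len(h) for w, h in hits.items()}

# Averaging over annotators gives a rough estimate of overall quality.
overall = sum(worker_mv_accuracy.values()) / len(worker_mv_accuracy)
print(mv_labels, worker_mv_accuracy, overall)
```

With an odd overlap and two classes there are no ties. In other setups you would need to decide how to break them.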

Unlike evaluation with control tasks, this approach doesn't depend on having a reliable, representative set of control tasks. There are still things to keep in mind, though. Consistency may be high for consistent but incorrect labels. In theory, organized fraud can manipulate the majority, although this has never been a problem in my experience.

Let's go back to IMDB Movie Reviews and see how well MV accuracy correlates with actual accuracy.


In this plot, annotator accuracy on MV labels is on the X-axis, true accuracy on the Y-axis. Correlation coefficient: 0.96, R²: 0.66.

Code for the plot:

from crowdkit.aggregation import MajorityVote

# Aggregate crowd responses: one majority-vote label per task
agg = MajorityVote()
mv_labels = agg.fit_predict(df).to_dict()
df['mv_label'] = df['task'].apply(lambda t: mv_labels[t])

# Per-worker accuracy against true labels and against MV labels
worker_true_accuracy = compute_worker_metrics(df, 'true_label', accuracy_score)
worker_mv_accuracy = compute_worker_metrics(df, 'mv_label', accuracy_score)
worker_metrics = pd.DataFrame({
    'true_accuracy': worker_true_accuracy,
    'mv_accuracy': worker_mv_accuracy,
})
fig = px.scatter(worker_metrics, x='mv_accuracy', y='true_accuracy', trendline="ols")
results = px.get_trendline_results(fig)

Looks much better than when we used control tasks! That's mostly due to the fact that we have a lot more responses for each annotator.

Eyesight-based evaluation

Metrics inevitably lose information and shouldn't be trusted blindly. After calculating the metrics, don't be shy about looking at the labels with your own eyes.

There are other heuristics worth paying attention to. Typical red flags are:

  • Responses submitted too fast.
  • A lot of overdue task suites.
  • Too many or too few tasks per annotator.

The eyeball test tells you if the result makes sense. You can't evaluate quality accurately that way, but you can sanity-check your calculations.

So what should we do?

There's no such thing as a perfect method. Control tasks are few and consistency can be skewed. Use both methods, evaluating quality using control tasks as well as consistency.

Rule of thumb: if you have good labeling, quality on control tasks and based on consistency will coincide.

Typical issues:

  • Quality is low on control tasks, high on consistency: annotators might not have understood the task, so think about clarifying the instructions.
  • Quality is low on consistency, high on control tasks: most likely your control tasks are too simple.
  • The metrics are good, but you can see that isn't the case: good luck, you're going to need it.

Now that we're familiar with the core principles behind evaluating labeling quality, let's move on to specific classification metrics.

Classification metrics

Confusion matrix

The confusion matrix is the mother of most classification metrics. We create it by comparing predicted labels with the correct responses. The correct responses could come from control tasks, majority vote, or somewhere else.

Let's revisit the earlier movie review classification example. There were 100 reviews, 50 positive and 50 negative. The annotators correctly labeled 35 positive reviews and 45 negative reviews.

The confusion matrix looks like this:


Each row represents a true class. The sum of the elements in the first row is the number of positive reviews in the dataset. Each column is a predicted class. The sum of the elements in the first column is the number of times the annotators labeled a review as positive. Each cell contains the number of times a sample of row class was labeled as a column class. For example, the first cell shows the number of positive reviews that were correctly recognized as positive.

This might be confusing (pun not intended) the first time. Try asking yourself what each cell represents before continuing.

The confusion matrix shows both the number of errors and what the errors actually were. We can see that the annotators incorrectly identified 15 positive reviews as being negative (false negative errors) and 5 negative reviews as being positive (false positive errors).

All the elements of the confusion matrix are displayed from left to right:

  • True positive: The true label and the predicted label are both positive.
  • False negative: The true label is positive, and the predicted label is negative.
  • False positive: The true label is negative, and the predicted label is positive.
  • True negative: The true label and the predicted label are both negative.
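As a quick sanity check, the matrix for this example can be rebuilt from the counts given in the text. This is a plain-Python sketch; the label lists are constructed by hand to match those counts.

```python
from collections import Counter

# Rebuild the example: 50 positive and 50 negative reviews,
# with 35 positives and 45 negatives labeled correctly.
true_labels = ["pos"] * 50 + ["neg"] * 50
predicted_labels = ["pos"] * 35 + ["neg"] * 15 + ["neg"] * 45 + ["pos"] * 5

# Count (true, predicted) pairs; rows are true classes, columns are predicted.
pair_counts = Counter(zip(true_labels, predicted_labels))
classes = ["pos", "neg"]
matrix = [[pair_counts[(t, p)] for p in classes] for t in classes]

# Read the four cells from left to right, top to bottom.
tp, fn = matrix[0]
fp, tn = matrix[1]
print(matrix)
print("TP, FN, FP, TN =", tp, fn, fp, tn)
```

The first row holds the true positives and false negatives, and the second row holds the false positives and true negatives.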

Let's revisit accuracy. Accuracy is calculated as the sum of the diagonal elements of the confusion matrix divided by the total sum of the elements:

$$accuracy = \frac{TP+TN}{TP+TN+FP+FN}$$

Accuracy values range from 0 to 1: the higher, the better.

Accuracy is vulnerable to class imbalance because it only counts the overall share of correct responses and ignores how the errors are distributed between classes. But could there be a better metric?


Precision, recall, F-score

Precision is the ratio of samples labeled as positive that were actually positive. In other words: how many positive predictions were correct?

$$precision = \frac{TP}{TP+FP}$$

Precision values range from 0 to 1: the higher, the better.

Precision depends on false positive errors. High precision means that if annotators label a sample as positive, it most likely is positive.

Recall is the ratio of positive samples in the dataset that were correctly recognized as positive. How many positive samples were we able to identify in the dataset?

$$recall = \frac{TP}{TP+FN}$$

Recall values range from 0 to 1: the higher, the better.
Recall depends on false negative errors. High recall means that if the dataset contains a positive sample, it is likely to be identified as positive.

Precision and recall are at odds. It's easy to reach a precision value of 1.0 since you just need to correctly label one sample as positive and all the others as negative. But then recall will be close to zero. It's also easy to reach a recall value of 1.0: you just need to predict only positive labels all the time. But then precision will be no higher than the share of positive samples in the dataset, which is low when positives are rare.

Can we maximize them both at the same time? We can by using the F1 score.

F1-score is the harmonic mean of precision and recall.

$$F_1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$$

Like precision and recall, F1-score ranges from 0 to 1. Because it is a harmonic mean, it is dominated by the smaller of the two values: if precision is low, F1-score will be low regardless of recall, and vice versa. F1-score is high only when both precision and recall are high.

Let's take a look at an example of calculating these metrics for two confusion matrices.


Case A:

$$accuracy = (35+45)/100 = 0.8$$

$$precision = 35/(35+5) = 0.875$$

$$recall = 35/(35+15) = 0.7$$

$$F_1 = 2 \cdot (0.875 \cdot 0.7)/(0.875+0.7) = 0.78$$

Case B:

$$accuracy = (1+90)/100 = 0.91$$

$$precision = 1/(1+0) = 1$$

$$recall = 1/(9+1) = 0.1$$

$$F_1 = 2 \cdot (1 \cdot 0.1)/(1+0.1) = 0.18$$

In the first case, both accuracy and the F1-score are reasonably high (around 0.8), which means the labeling is decent. In the second case, the classes are imbalanced and the labeling is practically useless. Accuracy is high, which is misleading, while the F1-score is low. This correctly shows how poor the labeling is. F1-score is resistant to class imbalance.
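If you want to check these numbers yourself, here is a small sketch in plain Python. The helper name is mine, and it deliberately doesn't handle the zero-division corner cases (no positive predictions, or no positive samples), which you would need to cover in real code.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 computed from confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Case A: balanced classes, 35 of 50 positives found, 5 false positives.
case_a = precision_recall_f1(tp=35, fp=5, fn=15)

# Case B: imbalanced classes, only 1 of 10 positives found, no false positives.
case_b = precision_recall_f1(tp=1, fp=0, fn=9)

print([round(v, 3) for v in case_a])
print([round(v, 3) for v in case_b])
```

True negatives never appear in these formulas, which is why the large negative class in case B can't inflate the F1-score.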

Beyond F1-score

F1-score assumes that precision and recall are equally important. Is that always true? The more general F-beta score lets you weight them differently:

$$F_{\beta} = (1+\beta^2) \cdot \frac{precision \cdot recall}{(\beta^2 \cdot precision) + recall}$$

In some tasks, one error type is more serious than the other. For example, if your crowd annotators are diagnosing cancer using X-rays, first of all, you're a danger to society, and second, false negatives are much worse for you than false positives. Scaring healthy people isn't nice, but it's still better than missing someone who actually has the disease.

In that case, the F2-score comes in handy, as it treats recall as twice as important as precision and therefore penalizes false negatives more heavily than false positives.

$$F_2 = 5 \cdot \frac{precision \cdot recall}{(4 \cdot precision) + recall}$$

Adjusting the β parameter adapts the metric for your task. Decide whether precision or recall is more important in your case and select a metric accordingly.
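To see what β does in practice, here is a minimal sketch that applies the F-beta formula to the precision and recall from case A (0.875 and 0.7). The function name and the choice of β values are just for illustration.

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

precision, recall = 0.875, 0.7

f_half = f_beta(precision, recall, beta=0.5)  # leans toward precision
f_one = f_beta(precision, recall, beta=1.0)   # the usual F1-score
f_two = f_beta(precision, recall, beta=2.0)   # leans toward recall

print(round(f_half, 3), round(f_one, 3), round(f_two, 3))
```

Since recall is the weaker of the two here, the score drops as β grows and recall gets more weight.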

When there are more than two classes

We can also use a confusion matrix for multi-class classification.

Let's say that positive and negative reviews are being joined by a third class: neutral reviews. In that case, the confusion matrix might look like this:


In this case, we have to talk about precision and recall of each separate class. For example, when we're calculating precision for the positive class, both negative samples and neutral samples labeled as positives are false positives. Just think of reducing the task to separating one particular class from the others, and then calculate the metrics like we did when we only had two classes.


Here are the precision and recall metrics for all classes in this example:

$$precision_{positive} = 35/(35+5+2) = 0.83$$

$$recall_{positive} = 35/(35+10+5) = 0.7$$

$$precision_{negative} = 40/(40+10+2) = 0.77$$

$$recall_{negative} = 40/(40+5+5) = 0.8$$

$$precision_{neutral} = 46/(46+5+5) = 0.82$$

$$recall_{neutral} = 46/(46+2+2) = 0.92$$

To get a single metric, we can calculate the F1 score for each class and average them. There are multiple ways to average metrics by class. This one is called macro average.

$$F1_{positive} = 2 \cdot (0.83 \cdot 0.7)/(0.83 + 0.7) = 0.76$$

$$F1_{negative} = 2 \cdot (0.769 \cdot 0.8)/(0.769 + 0.8) = 0.78$$

$$F1_{neutral} = 2 \cdot (0.82 \cdot 0.92)/(0.82 + 0.92) = 0.87$$

$$F1_{macro} = (0.76 + 0.78 + 0.87)/3 = 0.8$$
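The same calculation can be sketched in plain Python. Note that the matrix below is my reconstruction from the terms in the precision and recall formulas above, so treat the layout as an assumption.

```python
classes = ["positive", "negative", "neutral"]

# Rows are true classes, columns are predicted classes. The counts are
# reconstructed from the precision and recall formulas in the text.
matrix = [
    [35, 10, 5],   # true positive reviews
    [5, 40, 5],    # true negative reviews
    [2, 2, 46],    # true neutral reviews
]

f1_scores = {}
for k, name in enumerate(classes):
    tp = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)  # column sum
    truly_k = sum(matrix[k])                        # row sum
    precision = tp / predicted_as_k
    recall = tp / truly_k
    f1_scores[name] = 2 * precision * recall / (precision + recall)

# Macro average: every class counts equally, regardless of its size.
f1_macro = sum(f1_scores.values()) / len(f1_scores)
print({name: round(f1, 2) for name, f1 in f1_scores.items()}, round(f1_macro, 2))
```

Each pass of the loop is the "one class against the rest" reduction: the column sum gives everything labeled as the class, and the row sum gives everything that truly belongs to it.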

Matthews correlation coefficient (MCC)

F1-score is a very popular metric. It is nice and resistant to class imbalance. But it's still not perfect (what ever is?). Consider two confusion matrices:


The only difference is that we switched the positive and negative labels in the second matrix. What would the F1-score be for these two cases? In the first case, it would be very high, while it would be very low in the second case. But the labeling didn't change! The thing is that F1-score ignores true negatives, which means it depends strongly on what class we designate as positive.

Is there a metric that takes both positive and negative classes into account for labeling quality? Of course, there is, otherwise I wouldn't be asking this rhetorical question.

We need the Matthews correlation coefficient, which is yet another metric based on the confusion matrix.

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$$

This metric sees true and predicted classes as random variables, calculating the Pearson correlation coefficient between them.

Its values range from -1 to +1, where -1 is a complete mismatch between predictions and the truth, 0 is no better than random guessing, and +1 is perfect labeling.

MCC doesn't depend on which class is designated as positive, it's resistant to class imbalance, and it doesn't require changes when you have multiple classes. It is considered one of the best ways to express the confusion matrix as a single number. It's a good idea to use MCC until you have a reason not to. But don't forget that classes are not equally important for some tasks, in which case it is better to use precision and recall.

Example of calculating the above metrics for IMDB:

from sklearn.metrics import accuracy_score, matthews_corrcoef, classification_report

# Quality measured against control (golden) tasks
print(f"Control task accuracy: {accuracy_score(df_golden['golden_label'], df_golden['label']):.3f}")
print(f"Control task MCC: {matthews_corrcoef(df_golden['golden_label'], df_golden['label']):.3f}")
print(f"Control task Precision, Recall, F1-score:\n{classification_report(df_golden['golden_label'], df_golden['label'])}")

# Quality measured against majority vote labels
print(f"MV accuracy: {accuracy_score(df['mv_label'], df['label']):.3f}")
print(f"MV MCC: {matthews_corrcoef(df['mv_label'], df['label']):.3f}")
print(f"MV Precision, Recall, F1-score:\n{classification_report(df['mv_label'], df['label'])}")
# Output:
# Control task accuracy: 0.854
# Control task MCC: 0.709
# Control task precision, recall, F1 score:
# precision recall f1-score support
# neg 0.88 0.82 0.85 8379
# pos 0.83 0.89 0.86 8424
# accuracy 0.85 16803
# macro avg 0.86 0.85 0.85 16803
# weighted avg 0.86 0.85 0.85 16803
# MV accuracy: 0.911
# MV MCC: 0.820
# MV precision, recall, F1 score:
# precision recall f1-score support
# neg 0.91 0.90 0.90 30795
# pos 0.91 0.92 0.92 36408
# accuracy 0.91 67203
# macro avg 0.91 0.91 0.91 67203
# weighted avg 0.91 0.91 0.91 67203

Evaluating annotator skills

Evaluating labeling quality is closely related to evaluating annotator skills. "Skill" is shorthand for the quality of an individual annotator's responses. The average skill is a good quality metric on its own. Additionally, you can use skills to select good annotators and as weights for metrics and aggregation methods.

There are two ways to evaluate annotator skill levels: control tasks and consistency with other annotators.

Accuracy is the most common metric for annotator skills, but this is mostly due to tradition and to accuracy being very intuitive. Unfortunately, it's vulnerable to class imbalance and produces uneven, hard-to-use distributions, which you can see from the breakdown of skills by MV in the IMDB dataset:


I recommend using MCC, balanced accuracy, or the F1-score to evaluate annotator quality unless you specifically need accuracy.

Distribution of annotator MCC by MV:


Much smoother. With this distribution we can separate annotators by skill in a much more fine-grained manner.

Note: MV overestimates annotator skills. Annotators can score higher by agreeing with the majority even if their response is wrong. The problem is exacerbated when datasets are class-imbalanced.

Distribution of annotator MCC by control tasks:


We can see that the distribution generally follows the true distribution. But there are few responses, the histogram has a lot of gaps, and many annotators get the maximum skill value.

If we remove annotators with fewer than five responses to control tasks, we get a better result:


There are significantly fewer annotators with skill values of 0 or 1, but there are too few responses to control tasks for a representative sample anyway.

The best way to evaluate annotators is to average skills by MV and control tasks:


This way, the distribution we get is very close to the true distribution.

Code for plots:

from sklearn.metrics import accuracy_score, matthews_corrcoef

def compute_worker_metrics(df, true_col, metric, min_answers=1):
    # Collect each worker's labels and the reference labels for the same tasks
    worker_labels = df[['worker', 'label', true_col]].groupby('worker').agg({'label': list, true_col: list})
    # Compute the metric per worker; workers with too few answers get None and are dropped
    worker_metrics = worker_labels.apply(lambda row: metric(row[true_col], row.label) if len(row.label) >= min_answers else None, axis=1)
    worker_metrics = worker_metrics.dropna()
    return worker_metrics

worker_true_accuracy = compute_worker_metrics(df, 'true_label', accuracy_score)
worker_mv_accuracy = compute_worker_metrics(df, 'mv_label', accuracy_score)
worker_golden_accuracy = compute_worker_metrics(df_golden, 'golden_label', accuracy_score)
worker_true_mcc = compute_worker_metrics(df, 'true_label', matthews_corrcoef)
worker_mv_mcc = compute_worker_metrics(df, 'mv_label', matthews_corrcoef)
worker_golden_mcc = compute_worker_metrics(df_golden, 'golden_label', matthews_corrcoef)
worker_golden_mcc_no_outliers = compute_worker_metrics(df_golden, 'golden_label', matthews_corrcoef, min_answers=5)
worker_metrics = pd.DataFrame({
    'true_accuracy': worker_true_accuracy,
    'golden_accuracy': worker_golden_accuracy,
    'mv_accuracy': worker_mv_accuracy,
    'true_mcc': worker_true_mcc,
    'golden_mcc': worker_golden_mcc,
    'golden_mcc_no_outliers': worker_golden_mcc_no_outliers,
    'mv_mcc': worker_mv_mcc,
})
worker_metrics['mean_accuracy'] = (worker_metrics['golden_accuracy'] + worker_metrics['mv_accuracy'])/2
worker_metrics['mean_mcc'] = (worker_metrics['golden_mcc_no_outliers'] + worker_metrics['mv_mcc'])/2

# One overlaid histogram per comparison; call fig.show() after each line to display it
fig = px.histogram(worker_metrics, x=['true_accuracy', 'mv_accuracy'], barmode='overlay')
fig = px.histogram(worker_metrics, x=['true_accuracy', 'golden_accuracy'], barmode='overlay')
fig = px.histogram(worker_metrics, x=['true_accuracy', 'mean_accuracy'], barmode='overlay')
fig = px.histogram(worker_metrics, x=['true_mcc', 'mv_mcc'], barmode='overlay')
fig = px.histogram(worker_metrics, x=['true_mcc', 'golden_mcc'], barmode='overlay')
fig = px.histogram(worker_metrics, x=['true_mcc', 'golden_mcc_no_outliers'], barmode='overlay')
fig = px.histogram(worker_metrics, x=['true_mcc', 'mean_mcc'], barmode='overlay')

The consistency metric

The consistency metric is a crowdsourcing-specific, agreement-based metric that uses annotator accuracy. For each task, it equals the posterior probability that the MV label is correct given the annotators' labels and accuracies. Consistency averaged over tasks is a good measure of agreement and can be used as a quality metric.

Understanding consistency

Let's find out where this metric comes from and how to calculate it.

We'll say we have a classification task where n annotators gave a response to one task. What is the probability that the majority vote of their responses is the true label?


$z = 1 \dots K$: classes

$z_{MV}$: label by majority vote

$z_T$: true label

$y_1 \dots y_n$: annotator responses

$s_1 \dots s_n$: annotator skills

$$P(z_{MV} = z_T \mid y_1, \dots, y_n) = \frac{\prod_{i=1}^{n} s_i^{\delta(z_{MV} = y_i)} \left(\frac{1-s_i}{K-1}\right)^{\delta(z_{MV} \neq y_i)}}{\sum_{z=1}^{K} \prod_{i=1}^{n} s_i^{\delta(z = y_i)} \left(\frac{1-s_i}{K-1}\right)^{\delta(z \neq y_i)}}$$

Whoa, scary stuff. Let's decrypt it.

We can conceptually describe the whole formula like this:

$$P(\text{MV label is correct} \mid \text{responses}) = \frac{P(\text{responses} \mid \text{MV label is correct}) \, P(\text{MV label is correct})}{P(\text{responses})}$$

P(responses|MV label is correct): The probability of observing such responses if the MV label is correct.

P(responses): The probability of observing such responses in general regardless of the correct label.

P(MV label is correct): The prior probability that the MV label is correct. We assume that all labels are equally probable, so the probability is 1/K. This term cancels out when we expand the numerator and denominator, so we'll skip it.

Let's focus on the numerator. There are two possibilities for annotators: their label is the same as the majority or it isn't. By the definition of skill, if $z_{MV}$ is the true label, the probability that the annotator will assign it equals $s_i$.

$$P(y_i = z_{MV} \mid z_{MV} = z_T) = s_i$$

What is the probability that the annotator will assign a label besides $z_{MV}$?

When we have only two classes:

$$P(y_i \neq z_{MV} \mid z_{MV} = z_T) = 1 - s_i \quad \text{when } K = 2$$

When there are more than two classes, we generally assume all other labels are equally probable:

$$P(y_i \neq z_{MV} \mid z_{MV} = z_T) = (1 - s_i)/(K-1)$$

This means that:

$$P(y_i \mid z_{MV} = z_T) = \begin{cases}
s_i, & \text{if}\ y_i = z_{MV} \\
(1-s_i)/(K-1), & \text{if}\ y_i \neq z_{MV}
\end{cases}$$

The delta function is a more succinct way to formulate this:

$$\delta(y_i = z_{MV}) = \begin{cases}
1, & \text{if}\ y_i = z_{MV} \\
0, & \text{otherwise}
\end{cases}$$

$$P(y_i \mid z_{MV} = z_T) = s_i^{\delta(y_i = z_{MV})} \cdot \left(\frac{1-s_i}{K-1}\right)^{\delta(y_i \neq z_{MV})}$$

We got the probability of observing the $y_i$ label from the $i$-th annotator, whatever it may be, provided that the $z_{MV}$ label is true.

What is the probability of observing all labels we received if $z_{MV}$ is true? Since labels are independent, it's just the product of the probabilities of labels from all annotators:

$$P(y_1 \dots y_n \mid z_{MV} = z_T) = \prod_{i=1}^{n} s_i^{\delta(y_i = z_{MV})} \cdot \left(\frac{1-s_i}{K-1}\right)^{\delta(y_i \neq z_{MV})}$$

This is the probability from the numerator: P(responses|MV label is correct). In probability theory, this is called the observed data likelihood.

Let's take a look at the denominator:

$$P(y_1 \dots y_n) = \sum_{z=1}^{K} \prod_{i=1}^{n} s_i^{\delta(z = y_i)} \left(\frac{1-s_i}{K-1}\right)^{\delta(z \neq y_i)}$$

We take each label in turn as the correct one, get the probability of observing our responses under that assumption, and sum it all up. As a result, we get the probability of observing our labels regardless of which one is correct, P(responses), up to the constant prior factor 1/K that we skipped earlier because it cancels out.
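Here is how the formula might look in plain Python. This is a minimal sketch rather than a reference implementation: the function name, the integer encoding of labels, and the example values are my own assumptions.

```python
from collections import Counter

def consistency(labels, skills, n_classes):
    """Posterior probability that the majority-vote label is the true one.

    labels: annotator responses encoded as integers 0..n_classes-1
    skills: per-annotator probability of giving the correct label
    """
    def likelihood(candidate):
        # P(responses | candidate is the true label), assuming independent
        # annotators whose errors are spread evenly over the wrong classes.
        result = 1.0
        for label, skill in zip(labels, skills):
            if label == candidate:
                result *= skill
            else:
                result *= (1 - skill) / (n_classes - 1)
        return result

    mv_label = Counter(labels).most_common(1)[0][0]
    # The uniform prior 1/K cancels, so we only sum the likelihoods.
    total = sum(likelihood(z) for z in range(n_classes))
    return likelihood(mv_label) / total

# Hypothetical task: two annotators say 1, one says 0, with different skills.
value = consistency(labels=[1, 1, 0], skills=[0.8, 0.7, 0.6], n_classes=2)
print(round(value, 3))
```

The nested `likelihood` function is the product from the numerator, and the sum over all candidate labels is the denominator.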

Practical application

Consistency gives a quality measure from 0 to 1: the higher, the better. In real life, values above 0.9 are considered good. However, as always with metrics, you should be aware of potential problems.

Try plugging the following values into the formula:

$$y = [1, 0, 1], \quad s = [1.0, 1.0, 1.0], \quad K = 2$$

If you try to calculate consistency, you will get zero in the denominator, so the value is undefined. The thing is that if an annotator's skill level is 1.0, the implication is that their probability of error is zero. It's impossible for them to be wrong, so if they disagree with a majority of other annotators who also cannot be wrong, the answer is undefined. In practice, we can never be 100% sure that annotators will never make mistakes. Smoothing skills so that they are never exactly 0 or 1 avoids those situations.

Consistency also depends on the number of classes and the balance between them. Compare consistency for K=2 and K=1000 for the following dataset:

$$y = [1, 0, 1], \quad s = [0.9, 0.9, 0.9]$$

The consistency is different. In the second case, it's higher. That makes sense: two annotators choosing the same label out of a thousand is much stronger evidence than two annotators choosing the same label out of two. However, that means consistency cannot be interpreted without taking into account the number of classes.
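Both experiments can be checked numerically. The sketch below re-declares a compact consistency function so that it runs on its own. The `eps` clipping is just one simple way to smooth skills and is my own assumption, not a prescribed method.

```python
def consistency(labels, skills, n_classes, eps=0.0):
    # Optional smoothing: keep every skill inside [eps, 1 - eps].
    skills = [min(max(s, eps), 1 - eps) for s in skills]

    def likelihood(candidate):
        p = 1.0
        for y, s in zip(labels, skills):
            p *= s if y == candidate else (1 - s) / (n_classes - 1)
        return p

    mv_label = max(set(labels), key=labels.count)
    total = sum(likelihood(z) for z in range(n_classes))
    # With perfect skills and disagreement, every likelihood is zero.
    return likelihood(mv_label) / total if total > 0 else float("nan")

# Perfect skills plus a disagreement: undefined without smoothing.
undefined = consistency([1, 0, 1], [1.0, 1.0, 1.0], n_classes=2)
smoothed = consistency([1, 0, 1], [1.0, 1.0, 1.0], n_classes=2, eps=0.01)

# Same responses and skills, different number of classes.
two_classes = consistency([1, 0, 1], [0.9, 0.9, 0.9], n_classes=2)
many_classes = consistency([1, 0, 1], [0.9, 0.9, 0.9], n_classes=1000)

print(undefined, round(smoothed, 4))
print(round(two_classes, 4), round(many_classes, 4))
```

How much you smooth is a judgment call: a larger `eps` makes the metric more cautious about annotators with near-perfect track records.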

Note on statistical significance

Now that we've analysed some metrics, we should remember that differences in metric values aren't always significant.

Imagine two datasets. We evaluate their quality using accuracy on control tasks, with all the necessary assumptions being fulfilled. There are 10 control tasks in the first dataset and 10,000 in the second. Accuracy for the first dataset is 90%, while it's 80% for the second.

The metric for the first dataset is higher. Does that mean it's better labeled? Probably not. The evaluation based on 10 control tasks could be high simply by chance. Point estimates can be misleading, and the problem isn't always that obvious. Is there a significant difference if one accuracy value is 90% and the other is 95%? Or 90% and 91%? 90% and 90.1%? We need more information.

This is where confidence intervals and statistical tests come into the picture. If you have access to a computer, you can skip the statistics course and bootstrap some confidence intervals.

In its simplified form, the algorithm goes like this:

  1. Set the confidence level (95%, for example, which corresponds to a significance level alpha of 0.05).
  2. Repeat many times (10,000, for example):
  • Select a subsample of annotator responses via sampling with replacement.
  • Calculate the metric we need for the subsample (Accuracy,