Why Inter-Rater Reliability Isn’t Enough: Enhancing Data Quality Metrics

by Magdalena Konkiewicz

Introduction

If you’ve been involved in the data annotation industry, you are likely familiar with the concept of inter-rater reliability. Inter-rater reliability measures the consistency among different raters when they assess the same set of data. It is an important concept in research fields like psychology, medicine, and education, where subjective judgments are needed. There is a general belief in the data labeling industry that high inter-rater reliability indicates high-quality data.

In this article, we will explain why this is not always the case. We will first explain basic methods to calculate inter-rater reliability, such as joint probability agreement, Cohen’s kappa, and Fleiss’ kappa, and then discuss their limitations. Finally, we will show you better ways to control and assess data quality in annotation projects.


What is inter-rater reliability?

Inter-rater reliability, or inter-rater agreement, refers to the degree of agreement between different raters when scoring the same data. Let's use a practical example to visualize this concept better.

Imagine a scenario where a high school English class submits essays on a literary analysis of a novel. The marks are either pass or fail, and the class has five students. The teacher decides to have two different English teachers grade each essay independently to ensure fairness. In four cases, both teachers agree and assign the same mark (three passes and one failure), but in one case, one teacher assigns a failure while the other considers it a pass. The table below visualizes this scenario.

            Grader A    Grader B
Student 1   pass        pass
Student 2   pass        pass
Student 3   pass        pass
Student 4   failure     failure
Student 5   pass        failure

Joint probability agreement

The most straightforward method to calculate inter-rater reliability is joint probability agreement: we simply divide the number of agreements by the total number of samples.

p_{o} = \frac{\text{number of agreements}}{\text{total number of samples}}

In our example, the percent agreement is 0.8. We have four agreements and five samples, so 4/5 = 0.8.

This metric can range from 0 to 1; 0 indicates no agreement and 1 indicates perfect agreement.
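
To make the calculation concrete, here is a minimal Python sketch of the percent agreement for the essay example, using the grader labels from the table above.

```python
# Joint probability agreement (percent agreement) for the essay example.
grader_a = ["pass", "pass", "pass", "failure", "pass"]
grader_b = ["pass", "pass", "pass", "failure", "failure"]

agreements = sum(a == b for a, b in zip(grader_a, grader_b))
p_o = agreements / len(grader_a)
print(p_o)  # 0.8
```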

Cohen’s kappa coefficient

Estimating inter-rater reliability with joint probability agreement has one important disadvantage: it does not take into account that the graders could agree by chance. Imagine that you and your coworker each flip a coin. There is quite a high chance you will both get the same result (two heads or two tails). Similarly, when scoring a sample, random coincidence between raters should be accounted for. This is why statisticians came up with a better method to estimate inter-rater reliability, called Cohen's kappa.

k = \frac{p_{o} - p_{e}}{1 - p_{e}}

where

p_{o} is the joint probability agreement we calculated above,

p_{e} is the hypothetical probability of a random match between the raters.

To calculate p_{e}, we sum the probabilities of the two raters giving the same result by chance: in our case, both raters assigning a pass or both assigning a failure.

p_{e} = P(\text{both raters pass}) + P(\text{both raters fail})

where we can calculate the individual probabilities in the following way.

P(\text{both raters pass}) = P(\text{Rater A pass}) \cdot P(\text{Rater B pass})
P(\text{both raters fail}) = P(\text{Rater A fail}) \cdot P(\text{Rater B fail})

Let's plug in the numbers for our essay grading example.

p_{e} = \frac{4}{5} \cdot \frac{3}{5} + \frac{1}{5} \cdot \frac{2}{5} = 0.48 + 0.08 = 0.56

The probability of both graders agreeing by chance is 0.56. Now we can calculate the Cohen's kappa score.

k = \frac{p_{o} - p_{e}}{1 - p_{e}} = \frac{0.8 - 0.56}{1 - 0.56} = 0.545

In our scenario, the kappa statistic is 0.545, which indicates weak agreement. This metric ranges from -1 to 1, with -1 representing total disagreement between raters and 1 representing perfect agreement.
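
If you want to reproduce this calculation in code, the snippet below is a small sketch that assumes scikit-learn is installed; its cohen_kappa_score function applies the same chance correction we derived by hand.

```python
from sklearn.metrics import cohen_kappa_score

grader_a = ["pass", "pass", "pass", "failure", "pass"]
grader_b = ["pass", "pass", "pass", "failure", "failure"]

# Cohen's kappa corrects the raw agreement (0.8) for chance agreement (0.56).
kappa = cohen_kappa_score(grader_a, grader_b)
print(round(kappa, 3))  # ~0.545
```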

Fleiss' kappa and other metrics used for inter-rater reliability

An important disadvantage of Cohen's kappa is that it can only be used with two raters. So in our example, if we had more than two teachers grading the essays, it would not be possible to estimate inter-rater reliability with this metric. There is an extended version that works with more than two raters, called Fleiss' kappa. We will not go into detail about how it is calculated, but the logic is very similar to Cohen's kappa, with a tweak to accommodate multiple raters.
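
For readers who want to experiment with more than two raters, the sketch below uses the Fleiss' kappa implementation from statsmodels (assuming the library is installed) and adds a hypothetical third grader to our essay example.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows are essays, columns are raters; 1 = pass, 0 = failure
# (a hypothetical third grader is added to illustrate multiple raters).
ratings = np.array([
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
    [1, 0, 1],
])

# aggregate_raters converts raw ratings into per-item category counts.
counts, _ = aggregate_raters(ratings)
print(fleiss_kappa(counts))
```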

You may encounter other ways to calculate inter-rater agreement, like Pearson's r, Kendall's τ, Spearman's ρ, or the intra-class correlation coefficient. These are valid approaches used in research and industry, and the appropriate choice depends on the type of data being analyzed. While all of these metrics are sound statistical methods for measuring agreement between raters, none of them on its own is an estimator of data quality, and the rest of this article explains why.

Why inter-rater reliability is not a good metric for controlling data quality

Data annotation projects often use an approach similar to our essay scoring example. Multiple annotators label the same data, and the final labels are delivered with an inter-rater reliability score reflecting agreement across labelers for the dataset. Many data providers assume that having a high inter-rater reliability score means high-quality data. It seems logical that if many people agree on something, they must be right. This may be true in some cases, especially in simple or relatively trivial annotation tasks, but it does not always hold in more complex projects that require specific skills or attention to detail.

It is easy to imagine a real-life scenario where annotators agree with each other but are wrong. For example, annotators are asked to mark the color of traffic lights visible in a picture. If the traffic light is far away and partially covered by a traffic sign, less attentive annotators might mark the photo as having no traffic light at all. Only one annotator, paying close attention to detail, might correctly identify the traffic light’s color. This scenario would result in a high inter-rater reliability score but an incorrect answer.

Similar situations frequently occur in data annotation. Correctly labeling these harder cases distinguishes high-quality datasets from average ones. This is why Toloka does not use inter-rater reliability as a quality measure. Instead, we have developed a more complex process to understand and measure data quality accurately.

Better ways of ensuring quality

When working with clients, we receive an unlabeled dataset and return a dataset where each data point has a corresponding label. It sounds simple on the surface, but behind each label, there is a stack of technical decisions, statistics, and quality control mechanisms. Let’s take a closer look at the techniques we use to ensure data quality at Toloka and how we ensure our clients receive high-quality data.

Project decomposition

Our clients often come to us with complex annotation tasks, so the first step is task decomposition. Breaking up a complex task makes the annotators’ job easier and allows us to assign appropriate annotators to each subproject according to the skills required. A simple example illustrates this best.

The task is to label mechanical tools visible in a picture. This could be broken into two projects: in project one, annotators identify whether there are mechanical tools in the picture, and in project two, annotators label the actual tools. These are now separate projects and require annotators with different skills. The project identifying whether tools are present does not require any specialized knowledge, but the second one does: we need expert labelers who can identify specific tools and record their proper names. At Toloka, we often deal with projects that are broken into many more subprojects through further decomposition.

Understanding and measuring annotator skills

Using this decomposition approach and assigning annotators to projects according to their skill levels allows us to control the desired level of confidence in aggregated labels. Several annotators label the same instance, and we aggregate the results, taking their skill levels into account. This provides a label with an associated confidence score. If the confidence is too low, we ask more annotators with the required skills to label the instance. This approach lets us balance quality with cost-effectiveness, as we only use as many annotators as needed to reach the desired confidence level.
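
To make the general idea concrete, here is a minimal sketch of skill-weighted label aggregation with a confidence threshold. It is an illustration of the principle only, not Toloka's production algorithm; the annotator names, skill weights, and threshold are hypothetical.

```python
from collections import defaultdict

def aggregate(labels, skills, threshold=0.85):
    """Skill-weighted vote: returns (label, confidence, needs_more_annotators)."""
    weights = defaultdict(float)
    for annotator, label in labels.items():
        weights[label] += skills[annotator]
    total = sum(weights.values())
    best_label = max(weights, key=weights.get)
    confidence = weights[best_label] / total
    return best_label, confidence, confidence < threshold

# Hypothetical example: three annotators with different skill levels label one image.
labels = {"ann_1": "red", "ann_2": "red", "ann_3": "no_light"}
skills = {"ann_1": 0.9, "ann_2": 0.7, "ann_3": 0.5}
print(aggregate(labels, skills))  # ('red', ~0.76, True) -> ask another skilled annotator
```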

This approach also means we need to monitor our annotators' skill levels. We use initial tests to check annotators' skills before they join a project. During annotation, we mix control tasks into the normal tasks that annotators work on and use them for continuous monitoring.
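
The control-task mechanism can be pictured with a short sketch: compare an annotator's answers on control (golden) tasks with the known correct labels and track the resulting accuracy. This is a simplified illustration with hypothetical task IDs, not our actual monitoring pipeline.

```python
def control_task_accuracy(responses, golden):
    """Share of control tasks an annotator answered correctly."""
    checked = [(task, label) for task, label in responses.items() if task in golden]
    if not checked:
        return None  # the annotator has not seen any control tasks yet
    correct = sum(golden[task] == label for task, label in checked)
    return correct / len(checked)

# Hypothetical control tasks mixed into an annotator's normal queue.
golden = {"t1": "cat", "t7": "dog"}
responses = {"t1": "cat", "t3": "dog", "t7": "cat", "t9": "cat"}
print(control_task_accuracy(responses, golden))  # 0.5 -> flag for review or retraining
```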

Final audit with meaningful quality metrics

As a final step, if the project was broken into parts, we recombine the results to produce the final labels. To ensure the quality of the dataset, we audit a small percentage of it and compute estimated metrics such as accuracy, precision, recall, and F-score. These metrics provide a better estimation of the overall quality of the dataset compared to measuring the inter-rater reliability. This is particularly true when dealing with complex annotation tasks, where annotators may often agree with each other and still be wrong.
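
For the audited sample, these metrics can be computed with standard tooling. The sketch below assumes a binary labeling task, with auditor verdicts treated as ground truth, and that scikit-learn is available.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical audited sample: auditor verdicts vs. delivered labels.
audited_truth = [1, 0, 1, 1, 0, 1, 0, 1]
delivered     = [1, 0, 1, 0, 0, 1, 1, 1]

precision, recall, f1, _ = precision_recall_fscore_support(
    audited_truth, delivered, average="binary"
)
print(accuracy_score(audited_truth, delivered), precision, recall, f1)
# 0.75 0.8 0.8 0.8 for this toy sample
```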

Summary

Traditional inter-rater reliability metrics are insufficient for ensuring data quality, especially for complex annotation tasks. These metrics are often favored for their ease of computation rather than their correlation with data quality. A more thorough approach involves task decomposition, skill-based annotation, rigorous quality control, and regular quality audits.

If you’re planning to annotate large datasets with a focus on quality, consider partnering with Toloka and take advantage of our experience and expertise. Book a demo today.

Article written by:
Magdalena Konkiewicz