Build a SQuAD Labeling Project with Toloka Kit

by Toloka Team

Apr 28, 2021

We recently announced the launch of Toloka Kit – a Python library for data labeling projects that can help data scientists and ML engineers build scalable ML pipelines. In this article, we're going to talk about how Toloka Kit can help tackle one of the most popular problems in NLP – question answering – by labeling the SQuAD2.0 dataset.

What is SQuAD?

The Stanford Question Answering Dataset (SQuAD) is used to test NLP models and their ability to understand natural language. SQuAD2.0 consists of a set of paragraphs from Wikipedia articles, along with 100,000 question-answer pairs derived from these paragraphs, and 50,000 unanswerable questions. To show good results on SQuAD2.0, a model must not only answer questions correctly, but also determine whether a question has an answer in the first place, and refrain from responding if it doesn't.

SQuAD2.0 is the most popular question answering dataset: it's been cited in over 1,000 articles, and in the three years since its release, 85 models have been published on its leaderboard.

Using Toloka Kit to label SQuAD

At Toloka, we believe that crowdsourcing can be extremely useful in solving Q&A tasks. If you're building a virtual assistant, a chitchat bot, or any other system that's supposed to answer questions posed in natural language, you need to train your model on a dataset like SQuAD2.0. But using an open dataset is not always an option (for instance, there may be nothing available in the language you're working with). You can use Toloka Kit and the power of the crowd to build your own dataset and make your labeling process easier and more flexible.

Here's how we went about solving our Q&A task with Toloka's Python library and SQuAD2.0. Our task was to get the correct answer to a question based on a fragment of a Wikipedia article. The answer is either a segment of text copied from the corresponding passage or, if the passage doesn't contain the answer, no answer at all. Here's an example of text, question, and answer:

Beyoncé Giselle Knowles-Carter (/bi:'jɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".

question: When did Beyonce start becoming popular?

answer: [in the late 1990s]
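To make the "answer is a segment of the passage" constraint concrete, here is a minimal sketch of how one such example could be stored. The field names (question, answers, answer_start, is_impossible) are modeled on the public SQuAD2.0 JSON files; the helper make_record is hypothetical and is not part of Toloka Kit.

```python
from typing import Optional


def make_record(context: str, question: str, answer: Optional[str]) -> dict:
    """Build a SQuAD2.0-style record (hypothetical helper for illustration).

    Pass answer=None for an unanswerable question.
    """
    if answer is None:
        return {"question": question, "answers": [], "is_impossible": True}

    # The answer must be copied verbatim from the passage.
    start = context.find(answer)
    if start == -1:
        raise ValueError("answer must be an exact span of the context")

    return {
        "question": question,
        "answers": [{"text": answer, "answer_start": start}],
        "is_impossible": False,
    }


context = (
    "Born and raised in Houston, Texas, she performed in various singing "
    "and dancing competitions as a child, and rose to fame in the late 1990s "
    "as lead singer of R&B girl-group Destiny's Child."
)
record = make_record(
    context, "When did Beyonce start becoming popular?", "in the late 1990s"
)
```

Rejecting answers that are not exact spans at this stage mirrors the rule the Tolokers have to follow when they answer.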

We created two projects for our labeling pipeline:

Project 1: Tolokers answer the questions.
Project 2: A different group of Tolokers verifies the first group's answers.

Based on the results of Project 2, we accepted or rejected the responses from Project 1 until we got the correct answers to all the questions. This is a tricky process, but with Toloka Kit it took only one cell in a Jupyter notebook.
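The accept-or-reject step can be sketched in plain Python. This is only an illustration of the decision logic, not the code from our tutorial: the function name decide_assignments, the min_votes and accept_share parameters, and their default values are all assumptions, and the resulting decisions would still have to be passed to whatever accept and reject calls your labeling client provides.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def decide_assignments(
    verdicts: Iterable[Tuple[str, bool]],
    min_votes: int = 3,         # assumed: verifications needed before deciding
    accept_share: float = 0.5,  # assumed: share of "yes" votes required
) -> Dict[str, str]:
    """Map each Project 1 assignment to 'accept', 'reject', or 'pending'.

    verdicts holds (assignment_id, is_correct) pairs collected in Project 2.
    """
    votes = defaultdict(list)
    for assignment_id, is_correct in verdicts:
        votes[assignment_id].append(is_correct)

    decisions = {}
    for assignment_id, flags in votes.items():
        if len(flags) < min_votes:
            # Not enough verifications yet; wait for more.
            decisions[assignment_id] = "pending"
        elif sum(flags) / len(flags) > accept_share:
            decisions[assignment_id] = "accept"
        else:
            decisions[assignment_id] = "reject"
    return decisions


verdicts = [
    ("a1", True), ("a1", True), ("a1", False),
    ("a2", False), ("a2", False), ("a2", True),
    ("a3", True),
]
decisions = decide_assignments(verdicts)
```

Questions whose answers are rejected go back to Project 1 for a new response, and the loop repeats until every question has an accepted answer.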

The initial results weren't particularly impressive: an exact match score of 0.4 and an F1 score of 0.43. We identified four main problems:

Dishonest performers. Since Project 2 was easy to perform (the Tolokers only had to choose between "yes" and "no"), quite a few of them didn't put in the effort and responded completely at random. Unfortunately, we had no way to get enough control tasks, so we resorted to banning performers who responded too quickly.
Unclear task description. We realized we had to explicitly tell Tolokers that the answer must be an exact match to the corresponding piece of text from the paragraph. To fix this, we wrote detailed instructions for both projects and added training pools and control tasks to Project 1 that were based on a training dataset.
Response aggregation. With multiple responses to each question on our hands, we had to come up with an optimal approach to choosing a single answer. After testing various aggregation options, we decided to use majority vote to determine whether the question has an answer at all, and if it does, choose the shortest one.
Poor results on unanswerable questions. Tolokers were actually doing really well on answerable questions (F1 = 0.8) but performed poorly on questions that were unanswerable (F1 = 0.33). They simply did not believe that a question might not have an answer. We added examples of unanswerable questions to the training pool, and it worked!
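The aggregation rule described above (majority vote on whether the question has an answer, then the shortest answer) can be sketched as follows. This is a minimal illustration, not the tutorial code: the function name aggregate_answers is hypothetical, a missing answer is represented as None or a blank string, and treating a tie as "answerable" is an assumption the article does not specify.

```python
from typing import Optional, Sequence


def aggregate_answers(responses: Sequence[Optional[str]]) -> Optional[str]:
    """Pick one answer from several Tolokers' responses to one question.

    None or a blank string means the Toloker marked the question
    as unanswerable.
    """
    answers = [r.strip() for r in responses if r and r.strip()]
    no_answer_votes = len(responses) - len(answers)

    # Step 1: majority vote on answerability.
    # Assumption: a tie counts as "answerable".
    if not answers or no_answer_votes > len(answers):
        return None

    # Step 2: among the submitted answers, keep the shortest one.
    return min(answers, key=len)


chosen = aggregate_answers(["in the late 1990s", "late 1990s", None])
```

In this call two of the three Tolokers gave an answer, so the question is treated as answerable and the shorter span, "late 1990s", is kept.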

After these adjustments, we saw the exact match score go up to 0.67 and the F1 score to 0.72, which is similar to results shown by models in 2018. These are the baseline figures you can get with Toloka Kit, but there is definitely room for improvement with more tweaking.
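For readers unfamiliar with the two metrics, here is a rough sketch of how exact match and F1 are usually computed for a single SQuAD-style answer. It follows the general approach of the public SQuAD evaluation script (lowercasing, stripping punctuation and articles, then comparing tokens), but it is a simplified illustration and may differ in details from the code we used. An empty string stands for "no answer".

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize_answer(prediction).split()
    true_tokens = normalize_answer(truth).split()

    # Unanswerable case: full credit only if both sides are empty.
    if not pred_tokens or not true_tokens:
        return float(pred_tokens == true_tokens)

    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)


em = exact_match("late 1990s", "in the late 1990s")
f1 = f1_score("late 1990s", "in the late 1990s")
```

Here the prediction misses the word "in", so exact match is 0 while F1 is about 0.8 (precision 1, recall 2/3). The scores reported for a whole dataset are averages of these per-question values.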

Launch a labeling project of your own

You can (and by all means should!) launch your own SQuAD2.0 labeling project with Toloka Kit. You'll find all the code and instructions in our free tutorial. What's more, you can use it to build a dataset in any language (not only English, which is the original language of SQuAD2.0). We would be thrilled to see you beat our results!

Even though this project is still a work in progress, we're already seeing promising results and we're certain that with incremental changes and improvements we can even beat SOTA models. So, if you have any ideas on how to improve this labeling project's architecture, settings, instructions, or result aggregation methods, or if you have any other suggestions, feel free to contribute to our GitHub repo.

Try using Toloka Kit in your labeling projects and send us feedback. We're putting a ton of effort into developing and improving it with the ultimate goal of turning it into the most convenient tool for working with Toloka, so we appreciate all the input we can get.

Updated: Feb 22, 2024
