At Toloka, we believe that crowdsourcing can be extremely useful in solving Q&A tasks. If you’re building a virtual assistant, a chitchat bot, or any other system that’s supposed to answer questions posed in natural language, you need to train your model on a dataset like SQuAD2.0. But using an open dataset is not always an option (for instance, there may be nothing available in the language you’re working with). You can use Toloka Kit and the power of the crowd to build your own dataset and make your labeling process easier and more flexible.
Here’s how we went about solving our Q&A task with Toloka’s Python library and SQuAD2.0. Our task was to get the correct answer to a question based on a fragment of a Wikipedia article. The answer is either a segment of text taken from that passage, or the question has no answer in the passage at all. Here’s an example of text, question, and answer:
Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
question: When did Beyonce start becoming popular?
answer: [in the late 1990s]
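To make the "answer is a segment of the passage" idea concrete, here is a minimal sketch of how one such example could be stored as a simplified SQuAD2.0-style record. The helper name `make_squad_record` is hypothetical, the passage is shortened to one sentence, and the real dataset nests these records inside articles and paragraphs and gives each question an `id`, all of which is left out here.

```python
from typing import Optional


def make_squad_record(context: str, question: str, answer: Optional[str]) -> dict:
    """Build a simplified SQuAD2.0-style question record (hypothetical helper)."""
    if answer is None:
        # Unanswerable questions carry no answer spans and are flagged as impossible.
        return {"question": question, "answers": [], "is_impossible": True}
    start = context.find(answer)
    if start == -1:
        # The answer has to be an exact substring of the passage.
        raise ValueError("answer must be an exact substring of the passage")
    return {
        "question": question,
        "answers": [{"text": answer, "answer_start": start}],
        "is_impossible": False,
    }


# Shortened version of the passage above, for illustration only.
context = (
    "Born and raised in Houston, Texas, she performed in various singing and "
    "dancing competitions as a child, and rose to fame in the late 1990s as "
    "lead singer of R&B girl-group Destiny's Child."
)
record = make_squad_record(
    context, "When did Beyonce start becoming popular?", "in the late 1990s"
)
# A made-up question with no answer in the passage.
unanswerable = make_squad_record(context, "Hypothetical question?", None)
```

Storing the character offset alongside the text means the span can always be recovered by slicing the passage, which is a convenient sanity check on crowd-submitted answers.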
We created two projects for our labeling pipeline:
- Project 1: Tolokers answer the questions.
- Project 2: A different group of Tolokers verifies the first group’s answers.
Based on the results of Project 2, we accepted or rejected the responses from Project 1, repeating the cycle until we had a correct answer to every question. This is a tricky process, but with Toloka Kit it took only one cell in a Jupyter notebook.
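The accept-or-reject loop described above can be sketched in plain Python. This is not the Toloka Kit code from our notebook: `run_pipeline`, `is_accepted`, and the two callbacks are hypothetical names, the strict majority vote is an assumed aggregation rule, and the "crowd" is simulated so the example runs on its own.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def is_accepted(verdicts: List[bool]) -> bool:
    """Assumed rule: accept only on a strict majority of positive verdicts."""
    return sum(verdicts) > len(verdicts) / 2


def run_pipeline(
    questions: Iterable[str],
    collect_answer: Callable[[str], str],                # stands in for Project 1
    collect_verdicts: Callable[[str, str], List[bool]],  # stands in for Project 2
    max_rounds: int = 5,
) -> Tuple[Dict[str, str], List[str]]:
    """Re-ask rejected questions until all are accepted or rounds run out."""
    accepted: Dict[str, str] = {}
    pending = list(questions)
    for _ in range(max_rounds):
        if not pending:
            break
        still_pending = []
        for question in pending:
            answer = collect_answer(question)
            if is_accepted(collect_verdicts(question, answer)):
                accepted[question] = answer
            else:
                # Rejected: the question goes back to Project 1 next round.
                still_pending.append(question)
        pending = still_pending
    return accepted, pending


# Simulated crowd: the first answer is wrong, the second one is right.
attempts = {"q1": iter(["wrong guess", "in the late 1990s"])}
gold = {"q1": "in the late 1990s"}


def fake_answer(question: str) -> str:
    return next(attempts[question])


def fake_verdicts(question: str, answer: str) -> List[bool]:
    # Three simulated verifiers who all agree with the gold answer.
    return [answer == gold[question]] * 3


accepted, pending = run_pipeline(["q1"], fake_answer, fake_verdicts)
```

In a real pipeline the two callbacks would wait for assignments from the corresponding Toloka projects, and the cap on rounds keeps a question nobody can answer from cycling forever.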