The goal of Knowledge Base Question Answering (KBQA) is to generate an answer to a given question using a collection of facts stored in a knowledge base. Among other things, KBQA can help modern search engines provide a ready-made answer to a query in addition to the conventional list of relevant links. As with other machine learning tools, the efficacy of KBQA systems relies on high-quality data.
However, collecting data for KBQA poses two unique challenges:
- The data collection pipeline must be efficient so that the system can adjust to the rapidly changing world.
- The data needs to account for the variety of languages spoken by the system users.
In this talk, Vladislav Korablinov introduces RuBQ, the first Russian-language KBQA dataset, which consists of 1,500 questions of varying difficulty that feature diverse vocabulary. Vladislav describes the efficient data collection pipeline designed in this work, focusing on the role of crowd performers in streamlining the task. He also shares insights on the caveats of the data collection process and reflects on the role of online labor platforms in scaling the task to a larger number of questions. The talk is based on joint work with Pavel Braslavsky.