The Toloka team presents an online tutorial based on KDD 2019.
In this tutorial, leading researchers and engineers from Yandex share a portion of our unique industry experience in efficient data labeling via crowdsourcing. Most ML projects require training data, and often this data can only be obtained through human labeling. As new applications of AI emerge, there is an ever-growing demand for human-labeled data collected through nontrivial tasks. Large-scale data production requires a technological pipeline that can successfully manage quality control and smart distribution of tasks between performers.
We will introduce you to data labeling via public crowdsourcing marketplaces and present the key techniques for efficiently collecting labeled data. This will be followed by a practice session, where participants will choose one real label collection task, experiment with selecting settings for the labeling process, and launch their own labeling project on Toloka, one of the world's largest crowdsourcing marketplaces. During the tutorial, all projects will run on the real Toloka crowd. Participants will also receive feedback and practical advice on making their projects more efficient. We invite beginners, advanced specialists, and researchers to learn how to collect high-quality labeled data, and do so efficiently.
Part 0: Introduction
Part I: Main components of data collection via crowdsourcing
Part II: Label collection projects to be done (practical session)
Part III: Introduction to Toloka for requesters
Coffee Break
Part IV: Setting up and running label collection projects (practical session)
Part V: Theory on efficient aggregation, incremental relabeling, and pricing
Part VI: Discussion of results from the projects and conclusions