In this tutorial, we present a portion of our unique industry experience in efficient data labeling via crowdsourcing, shared by leading researchers and engineers from Yandex. Most ML projects require training data, and often this data can only be obtained through human labeling. As new applications of AI emerge, there is an ever-growing demand for human-labeled data collected through nontrivial tasks. Large-scale data production requires a technological pipeline that can successfully manage quality control and smart distribution of tasks among performers.
We will introduce you to data labeling via public crowdsourcing marketplaces and present the key techniques for efficiently collecting labeled data. This will be followed by a practice session, where participants will choose one real label-collection task, experiment with selecting settings for the labeling process, and launch their own labeling project on Toloka, one of the world's largest crowdsourcing marketplaces. During the tutorial, all projects will run on the real Toloka crowd. Participants will also receive feedback and practical advice on making their projects more efficient. We invite beginners, advanced specialists, and researchers to learn how to collect high-quality labeled data and to do so efficiently.