Most ML projects require training data, and often this data can only be obtained through human labeling. As new applications of AI emerge, there is ever-growing demand for human-labeled data for nontrivial tasks. Large-scale data production requires a technological pipeline that can manage quality control and distribute tasks intelligently among performers.

In this tutorial, we introduce you to data labeling via public crowdsourcing marketplaces and present the key techniques for efficiently collecting labeled data. In the practice session, participants choose one real label collection task, experiment with selecting settings for the labeling process, and launch their own labeling project on Toloka, one of the world’s largest crowdsourcing marketplaces. During the tutorial, all projects run on the real Toloka crowd. Participants also receive feedback and practical advice on making their projects more efficient.


Alexey Drutsa
Head of Efficiency & Growth Division
Valentina Fedorova
Olga Megorskaya
Evfrosiniya Zerminova
Technical Product Manager


— The concept of crowdsourcing
— Crowdsourcing task examples
— Crowdsourcing platforms
— Yandex crowdsourcing experience
Part I: Main components of data collection via crowdsourcing
— Decomposition for an effective pipeline
— Task instruction & interface: best practices
— Quality control techniques
Part II: Label collection projects (practical session)
— Dataset and required labels
— Discussion: how to collect labels?
— Data labeling pipeline for implementation
Part III: Introduction to Toloka for requesters
— Main types of instances
— Project: creation & configuration
— Pool: creation & configuration
— Tasks: uploading & golden set creation
— Statistics in flight and results downloading
Part IV: Setting up & running label collection projects (practical session)
— Create data labeling projects
— Configure them
— Run them on real performers in real time
Part V: Theory on efficient aggregation, incremental relabeling, and pricing
— Aggregation models
— Incremental relabeling
— Performance-based pricing
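
As a rough illustration of the first two Part V topics, here is a minimal Python sketch that is not taken from the tutorial materials. It assumes plain majority voting as the aggregation model and a hypothetical stopping rule for incremental relabeling; the names `majority_vote`, `needs_more_labels`, `min_share`, and `max_labels` are illustrative only, and the tutorial may cover more sophisticated models.

```python
from collections import Counter


def majority_vote(labels):
    """Return (winning_label, share_of_votes) for the labels of one task."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


def needs_more_labels(labels, min_share=0.8, max_labels=5):
    """Hypothetical stopping rule for incremental relabeling.

    Ask for another label while the leading answer holds less than
    `min_share` of the votes, but never exceed `max_labels` per task.
    """
    if len(labels) >= max_labels:
        return False
    if not labels:
        return True
    _, share = majority_vote(labels)
    return share < min_share


# Made-up answers from three performers per task.
answers = {
    "img_1": ["cat", "cat", "cat"],
    "img_2": ["cat", "dog", "dog"],
}

for task_id, labels in answers.items():
    label, share = majority_vote(labels)
    print(task_id, label, round(share, 2), needs_more_labels(labels))
```

In this sketch, tasks on which performers agree stop collecting labels early, while contested tasks receive extra labels up to a fixed cap, which is the basic idea behind spending the labeling budget where it is most needed.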
Part VI: Discussion of results and conclusions
— Project results
— Ideas for further work and research
— References to literature and other tutorials
Mon Aug 02 2021 12:29:20 GMT+0300 (Moscow Standard Time)