Overview
Most ML projects require training data, and often this data can only be obtained through human labeling. As new applications of AI emerge, there is ever-growing demand for human-labeled data collected for nontrivial tasks. Large-scale data production requires a technological pipeline that can successfully manage quality control and smart distribution of tasks among performers.
In this tutorial, we introduce you to data labeling via public crowdsourcing marketplaces and present the key techniques for efficiently collecting labeled data. In the practice session, participants choose one real label collection task, experiment with selecting settings for the labeling process, and launch their own labeling project on Toloka, one of the world’s largest crowdsourcing marketplaces. During the tutorial, all projects run on the real Toloka crowd. Participants also receive feedback and practical advice on making their projects more efficient.
Speakers
Alexey Drutsa
Toloka, Head of Efficiency & Growth Division
Valentina Fedorova
Toloka, Analyst
Evfrosiniya Zerminova
Toloka, Technical Product Manager
Schedule
- The concept of crowdsourcing
- Crowdsourcing task examples
- Crowdsourcing platforms
- Yandex crowdsourcing experience
- Decomposition for an effective pipeline
- Task instruction & interface: best practices
- Quality control techniques
Part II: Label collection projects (practical session)
- Dataset and required labels
- Discussion: how to collect labels?
- Data labeling pipeline for implementation
- Main types of instances
- Project: creation & configuration
- Pool: creation & configuration
- Tasks: uploading & golden set creation
- Statistics in flight and results downloading
Participants:
- Main types of instances
- Project: creation & configuration
- Pool: creation & configuration
- Tasks: uploading & golden set creation
- Statistics in flight and results downloading
- Aggregation models
- Incremental relabeling
- Performance-based pricing
- Project results
- Ideas for further work and research
- References to literature and other tutorials
Slides
Part 1: "Main components of data collection via crowdsourcing"
Part 3: "Introduction to Toloka for requesters"
Part 4: "Setting up and running label collection projects"
Part 5: "Theory on efficient aggregation, incremental relabelling, and pricing"
Part 6: "Results & Conclusions"