
Tutorial at WSDM 2020

In this tutorial, we present some key techniques for efficiently collecting labeled data, including aggregation, incremental relabeling, and dynamic pricing.
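As a taste of the aggregation topic, below is a minimal sketch of majority-vote aggregation over noisy crowd labels. The data format and function name are hypothetical and used only for illustration; the tutorial itself covers more advanced techniques.

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """Aggregate crowd answers per task by simple majority voting.

    `labels` is an iterable of (task_id, worker_id, label) triples;
    this format is hypothetical and chosen only for this example.
    """
    votes = defaultdict(Counter)
    for task_id, _worker_id, label in labels:
        votes[task_id][label] += 1
    # Pick the most frequent answer for every task.
    return {task_id: counts.most_common(1)[0][0]
            for task_id, counts in votes.items()}

raw_labels = [
    ("img_1", "w1", "cat"), ("img_1", "w2", "cat"), ("img_1", "w3", "dog"),
    ("img_2", "w1", "dog"), ("img_2", "w4", "dog"),
]
print(majority_vote(raw_labels))  # {'img_1': 'cat', 'img_2': 'dog'}
```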


Overview

In this tutorial, we present a portion of our unique industry experience in efficient data labeling via crowdsourcing. The majority of ML projects require training data, and often this data can only be obtained through human labeling. Moreover, as more applications of AI appear, more nontrivial tasks for collecting human-labeled data arise. Producing such data at scale requires building a technological pipeline that addresses quality control and the smart distribution of tasks among performers.

We introduce data labeling via public crowdsourcing marketplaces and present the key components of efficient label collection. This is followed by a practice session in which participants choose a real label collection task, experiment with settings for the labeling process, and launch their label collection project on Toloka, one of the world's largest crowdsourcing marketplaces.

Speakers

Alexey Drutsa
Head of Efficiency & Growth Division, Toloka

Valentina Fedorova
Analyst, Toloka

Olga Megorskaya
CEO, Toloka

Evfrosiniya Zerminova
Technical Product Manager, Toloka

Dmitry Ustalov
Head of Research, Toloka

Daria Baidakova
Director of Educational Programs, Toloka

Schedule

— The concept of crowdsourcing
— Crowdsourcing task examples
— Crowdsourcing platforms
— Toloka crowdsourcing experience

— Decomposition for an effective pipeline
— Task instruction & interface: best practices
— Quality control techniques

— Dataset and required labels
— Discussion: how to collect labels?
— Data labeling pipeline for implementation

— Main types of instances
— Project: creation & configuration
— Pool: creation & configuration
— Tasks: uploading & golden set creation
— Monitoring in-flight statistics and downloading results

Participants:
— create, configure, and run data labeling projects with real performers in real time

— Detailed examination of quality control techniques
— Comprehensive overview of best practices for creating a functional interface

Participants:
— create, configure, and run data labeling projects with real performers in real time

— Incremental relabeling to save money (a minimal sketch follows this list)
— Performance-based pricing
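To illustrate the incremental relabeling idea, here is a hedged sketch of a simple stopping rule: keep requesting additional labels for a task only while its current answers remain ambiguous. The thresholds and function name are hypothetical assumptions, not the rule used in the tutorial.

```python
from collections import Counter

def needs_more_labels(votes, min_labels=3, margin=2, max_labels=5):
    """Decide whether a task should be sent to another performer.

    Hypothetical stopping rule: collect at least `min_labels` answers,
    then stop once the leading label is ahead by `margin` votes or the
    `max_labels` budget is exhausted.
    """
    if len(votes) >= max_labels:
        return False
    if len(votes) < min_labels:
        return True
    counts = Counter(votes).most_common(2)
    lead = counts[0][1] - (counts[1][1] if len(counts) > 1 else 0)
    return lead < margin

print(needs_more_labels(["cat", "cat"]))         # True: fewer than 3 answers so far
print(needs_more_labels(["cat", "cat", "cat"]))  # False: clear majority, stop and save money
print(needs_more_labels(["cat", "dog", "cat"]))  # True: still ambiguous, request another label
```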

— Results of your projects
— Ideas for further work and research
— References to literature and other tutorials

Slides

Introduction
Part 1: Main components of data collection
Part 2: Label collection projects to be done via crowdsourcing
Part 3: Introduction to Toloka for requesters
Part 4: Setting up and running label collection projects
Part 5: Interface and quality control
Part 6: Theory on efficient aggregation
Part 7: Setting up and running label collection projects
Part 8: Theory on incremental relabeling and pricing
Part 9: Results and conclusions
