Practice of Efficient Data Collection via Crowdsourcing on a Large Scale
The Academic Fringe Festival presents a tutorial based on KDD 2019
Thu, April 8th, 2021 | 10:00 - 13:30 CET

Overview

In this tutorial, we present a portion of our industry experience in efficient data labeling via crowdsourcing, shared by leading researchers and engineers from Yandex. Most ML projects require training data, and often this data can only be obtained through human labeling. As new applications of AI emerge, demand keeps growing for human-labeled data for nontrivial tasks. Large-scale data production requires a technological pipeline that handles quality control and distributes tasks intelligently among performers.

We will introduce you to data labeling via public crowdsourcing marketplaces and present the key techniques for efficiently collecting labeled data. This will be followed by a practice session, where participants will choose one real label collection task, experiment with selecting settings for the labeling process, and launch their own labeling project on Toloka, one of the world's largest crowdsourcing marketplaces. During the tutorial, all projects will run on the real Toloka crowd. Participants will also receive feedback and practical advice on making their projects more efficient. We invite beginners, advanced specialists, and researchers to learn how to collect high-quality labeled data, and do so efficiently.

2D object detection
3D object detection
Moving object tracking
Topics
  • Key components of crowdsourcing for efficient data labeling
  • Decomposition approach
  • Performer selection and training
  • 2D object segmentation demo
  • Hands-on practice session: object segmentation pipeline
  • Advanced crowdsourcing techniques: aggregation, incremental relabeling & pricing

Speakers

Dmitry Ustalov
Toloka
Analyst / Software Developer at Toloka
Daria Baidakova
Toloka
Education & Customer Success Team Lead
Sergey Koshelev
Yandex
Crowd Solutions Architect
Polina Smirnova
Toloka
Educational Project Manager

Organizers

Ujwal Gadiraju
Delft University of Technology
Assistant Professor
Simo Hosio
University of Oulu
Associate Professor
Natalie Fedorova
Toloka
Educational Project Manager
Polina Smirnova
Toloka
Educational Project Manager

Schedule

10:00 - 10:30
Part 0: Introduction
— The concept of crowdsourcing
— Crowdsourcing task examples
— Crowdsourcing platforms
— Yandex crowdsourcing experience
10:30 - 11:00
Part I: Main components of data collection via crowdsourcing
— Decomposition for an effective pipeline 
— Task instruction & interface: best practices 
— Quality control techniques

11:00 - 11:15
Part II: Label collection projects to be done (practical session)
— Dataset and required labels 
— Discussion: how to collect labels? 
— Data labeling pipeline for implementation
11:15 - 11:50
Part III: Introduction to Toloka for requesters
— Main types of instances 
— Project: creation & configuration 
— Pool: creation & configuration 
— Tasks: uploading & golden set creation 
— Statistics in flight and downloading results
11:50 - 12:00
Coffee Break
12:00 - 13:00
Part IV: Setting up and running label collection projects (practical session)
— You create, configure, and run data labeling projects on real performers in real time
13:00 - 13:20
Part V: Theory on efficient aggregation, incremental relabeling, and pricing
— Aggregation models 
— Incremental relabeling to save money 
— Performance-based pricing
13:20 - 13:30
Part VI: Discussion of results from the projects and conclusions
— Results of your projects 
— Extensions to work on after the tutorial 
— References to literature and other tutorials