Tutorial at KDD 2019

In this tutorial, we introduce you to data labeling via public crowdsourcing marketplaces and present the key techniques for efficiently collecting labeled data.

Overview

Most ML projects require training data, and often this data can only be obtained through human labeling. As new applications of AI emerge, there is an ever-growing demand for human-labeled data for nontrivial tasks. Large-scale data production requires a technological pipeline that can manage quality control and distribute tasks intelligently among performers.

In the practice session, participants choose a real label collection task, experiment with settings for the labeling process, and launch their own labeling project on Toloka, one of the world’s largest crowdsourcing marketplaces. During the tutorial, all projects run on the real Toloka crowd. Participants also receive feedback and practical advice on making their projects more efficient.

Speakers

Alexey Drutsa
Toloka, Head of Efficiency & Growth Division
Valentina Fedorova
Toloka, Analyst
Olga Megorskaya
Toloka, Head of Toloka
Evfrosiniya Zerminova
Toloka, Technical Product Manager

Schedule

Introduction

  • The concept of crowdsourcing
  • Crowdsourcing task examples
  • Crowdsourcing platforms
  • Yandex crowdsourcing experience

Part I: Main components of data collection via crowdsourcing

  • Decomposition for an effective pipeline
  • Task instruction & interface: best practices
  • Quality control techniques
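
One common quality control technique is checking performers against a golden set, that is, tasks whose correct answers are already known. The sketch below is only an illustration of that idea, not the tutorial's own method or a Toloka API: the function names, the `(performer, task, label)` tuple layout, and the 0.7 threshold are all assumptions made for the example.

```python
from collections import defaultdict


def golden_set_accuracy(answers, golden):
    """Return {performer_id: share of correct answers on golden tasks}.

    answers: iterable of (performer_id, task_id, label) tuples (assumed layout).
    golden:  dict mapping task_id -> known correct label.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for performer_id, task_id, label in answers:
        if task_id not in golden:
            continue  # regular task with no known answer, skip it
        total[performer_id] += 1
        if label == golden[task_id]:
            correct[performer_id] += 1
    return {p: correct[p] / total[p] for p in total}


def performers_below(accuracy, threshold=0.7):
    """List performers whose golden-set accuracy is under the threshold."""
    return sorted(p for p, acc in accuracy.items() if acc < threshold)


# Toy data: t1 and t2 are golden tasks, t3 is a regular task.
answers = [
    ("p1", "t1", "cat"), ("p1", "t2", "dog"), ("p1", "t3", "cat"),
    ("p2", "t1", "dog"), ("p2", "t2", "dog"), ("p2", "t3", "dog"),
]
golden = {"t1": "cat", "t2": "dog"}
accuracy = golden_set_accuracy(answers, golden)
flagged = performers_below(accuracy)
```

In practice the threshold and the number of golden tasks needed before acting on the score depend on the task and should be tuned per project.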

Part II: Label collection projects (practical session)

  • Dataset and required labels
  • Discussion: how to collect labels?
  • Data labeling pipeline for implementation

Part III: Introduction to Toloka for requesters

  • Main types of instances
  • Project: creation & configuration
  • Pool: creation & configuration
  • Tasks: uploading & golden set creation
  • Statistics in flight and results downloading

Part IV: Setting up and running label collection projects

Participants:

  • Main types of instances
  • Project: creation & configuration
  • Pool: creation & configuration
  • Tasks: uploading & golden set creation
  • Statistics in flight and results downloading

Part V: Theory on efficient aggregation, incremental relabeling, and pricing

  • Aggregation models
  • Incremental relabeling
  • Performance-based pricing
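
To make the first two topics concrete, here is a minimal sketch of majority-vote aggregation combined with a simple incremental relabeling rule. It is an illustration under stated assumptions, not necessarily the models covered in the tutorial: the function names, the use of vote share as a confidence score, and the `min_confidence` and `max_overlap` defaults are all hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return (winning_label, share_of_votes) for one task.

    Ties go to the label that appeared first in the list.
    """
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(labels)


def needs_more_labels(labels, min_confidence=0.8, max_overlap=5):
    """Decide whether to request one more label for a task.

    Stop when the overlap budget is spent or the leading label's
    vote share reaches min_confidence (both values are assumptions).
    """
    if len(labels) >= max_overlap:
        return False
    if not labels:
        return True
    _, confidence = majority_vote(labels)
    return confidence < min_confidence


label, confidence = majority_vote(["cat", "cat", "dog"])
ask_again = needs_more_labels(["cat", "cat", "dog"])
```

Vote share treats every performer as equally reliable. More elaborate aggregation models instead weight each answer by an estimate of the performer's skill, which usually lets the same quality be reached with fewer labels per task.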

Part VI: Results & conclusions

  • Project results
  • Ideas for further work and research
  • References to literature and other tutorials

Slides

Introduction
Part 1: "Main components of data collection via crowdsourcing"
Part 3: "Introduction to Toloka for requesters"
Part 4: "Setting up and running label collection projects"
Part 5: "Theory on efficient aggregation, incremental relabelling, and pricing"
Part 6: "Results & Conclusions"
