INFOTEC: Practice of efficient data collection

Oct 18, 2021, 10:00 UTC

Overview

In this tutorial, leading researchers and engineers from Toloka share part of their industry experience in efficient data labeling via crowdsourcing. Most ML projects require training data, and often this data can only be obtained through human labeling. As new applications of AI emerge, demand keeps growing for human-labeled data collected in nontrivial tasks. Large-scale data production requires a technological pipeline that manages quality control and distributes tasks intelligently among performers.

We introduce you to data labeling via public crowdsourcing marketplaces and present the key techniques for efficiently collecting labeled data. This is followed by a practice session, where participants choose one real label collection task, experiment with selecting settings for the labeling process, and launch their own labeling project on Toloka, one of the world's largest crowdsourcing marketplaces. During the tutorial, all projects are run on the real Toloka crowd. Participants also receive feedback and practical advice on making their projects more efficient. We invite beginners, advanced specialists, and researchers to learn how to collect high-quality labeled data, and do so efficiently.

Topics

Key components of crowdsourcing for efficient data labeling
Decomposition approach
Performer selection and training
2D object segmentation demo
Hands-on practice session: object segmentation pipeline
Advanced crowdsourcing techniques: aggregation, incremental relabeling & pricing

Speakers

Daria Baidakova

Toloka, Director of Educational Programs

Nikita Pavlichenko

Toloka, Research Scientist

Sergey Koshelev

Toloka, Crowd Solutions Architect

Polina Smirnova

Toloka, Educational Project Manager

Organizers

Aljona Johnson

Toloka, Educational Project Manager

Carlos Josuè Lavandeira Portillo

INFOTEC, Deputy Director of Innovation & Knowledge

Schedule

10:00 - 10:15

Part 0: Introduction
— The concept of crowdsourcing
— Crowdsourcing task examples
— Crowdsourcing platforms
— Yandex crowdsourcing experience

10:15 - 10:45

Part I: Main components of data collection via crowdsourcing
— Decomposition for an effective pipeline
— Task instruction & interface: best practices
— Quality control techniques

10:45 - 11:00

Part II: Label collection projects to be done (practical session)
— Dataset and required labels
— Discussion: how to collect labels?
— Data labeling pipeline for implementation

11:00 - 11:40

Part III: Introduction to Toloka for requesters
— Main types of instances
— Project: creation & configuration
— Pool: creation & configuration
— Tasks: uploading & golden set creation
— Viewing in-progress statistics and downloading results

11:40 - 12:00

Coffee Break

12:00 - 13:00

Part IV: Setting up and running label collection projects (practical session)
— You create, configure, and run data labeling projects on real performers in real time

13:00 - 13:20

Part V: Theory on efficient aggregation, incremental relabeling, and pricing
— Aggregation models
— Incremental relabeling to save money
— Performance-based pricing

13:20 - 13:30

Part VI: Discussion of results from the projects and conclusions
— Results of your projects
— Extensions to work on after the tutorial
— References to literature and other tutorials
