Overview

In this tutorial, designed specifically for the readers of The Sequence, leading researchers and engineers from Toloka will share their industry experience in achieving efficient natural language annotation with crowdsourcing. We will introduce data labeling via public crowdsourcing marketplaces and present the key components of efficient label collection. Then, in the practice session, participants will choose one real language resource production task, experiment with selecting settings for the labeling process, and launch their label collection project on Toloka, one of the world’s largest crowdsourcing marketplaces. During the tutorial session, all projects will be run on the real Toloka crowd. We will also present useful quality control techniques and give the attendees an opportunity to discuss their own annotation ideas.
Topics
  • Reasons for collecting and labeling data via crowdsourcing for NLP: pros & cons
  • Key components of crowdsourcing for efficient data labeling
  • Decomposition approach
  • Performer selection and training
  • Hands-on practice session: audio transcription
  • Advanced crowdsourcing techniques: aggregation, incremental relabeling & pricing
  • Example of how to aggregate crowdsourced texts using the Crowd-Kit Python library

Speakers

Dmitry Ustalov
Toloka
Analyst / Software Developer
Daria Baidakova
Toloka
Director of Educational Programs
Natalie Fedorova
Toloka
Educational Project Manager
Nikita Pavlichenko
Toloka
Analyst / Software Developer

Organizers

Natalie Fedorova
Toloka
Educational Project Manager
Daria Baidakova
Toloka
Director of Educational Programs

Schedule

16:00 – 16:15
Part 0: Introduction
— The concept of data labeling via crowdsourcing
— Crowdsourcing task examples
— Crowdsourcing platforms
— Yandex crowdsourcing experience
16:15 – 16:45
Part I: Key Components for Efficient Data Collection 
— Decomposition for effective pipeline 
— Task instruction & interface: best practices 
— Quality control techniques
16:45 – 17:45
Part II: Practice, Part I
— Dataset and required labels 
— Discussion: how to collect labels? 
— Data labeling pipeline for implementation 
— You create, configure, and run data labeling projects with real performers in real time
17:45 – 18:30
Break
18:30 – 19:00
Part III: Advanced techniques
— Incremental relabeling 
— Dynamic pricing
19:00 – 19:30
Break
19:30 – 19:45
Part IV: Practice, Part II
— Finishing up label collection
— Results aggregation
19:45 – 20:15
Part V: Conclusion
— Results of your projects 
— Ideas for further work and research 
— References to literature and other tutorials 
— Q&A

Text Aggregation Example

We will share an example of how to aggregate crowdsourced texts using the Crowd-Kit library for Python.
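As a preview, here is a minimal sketch of what such an aggregation might look like using Crowd-Kit's ROVER aggregator for noisy transcriptions; this is not the official tutorial code. The toy data, worker names, and column layout ("task", "worker", "text") are illustrative assumptions, and exact column names may vary between Crowd-Kit releases.

```python
# A minimal sketch (not the official tutorial code) of aggregating
# crowdsourced transcriptions with Crowd-Kit's ROVER aggregator.
# The data below is toy data; column names follow recent Crowd-Kit releases.
import pandas as pd
from crowdkit.aggregation import ROVER

# Three workers transcribed the same audio snippet slightly differently.
df = pd.DataFrame(
    [
        {"task": "audio_1", "worker": "w1", "text": "thank you for calling toloka"},
        {"task": "audio_1", "worker": "w2", "text": "thank you for calling toloka today"},
        {"task": "audio_1", "worker": "w3", "text": "thankyou for calling toloka"},
    ]
)

# ROVER aligns token sequences and votes position by position, so it needs
# tokenize/detokenize callables; whitespace splitting is enough for this sketch.
rover = ROVER(
    tokenizer=lambda text: text.split(),
    detokenizer=lambda tokens: " ".join(tokens),
)

aggregated = rover.fit_predict(df)  # pandas Series of aggregated texts, indexed by task
print(aggregated.loc["audio_1"])
```

ROVER-style alignment and position-wise voting suits the audio transcription task used in the practice session, which is why it serves as the example here.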

Register today

Don't miss it: Monday, August 2, 2021