UNAM: Crowdsourcing natural language data

In this tutorial, leading researchers and engineers from Toloka share their unique industry experience in achieving efficient natural language annotation with crowdsourcing.

Overview

In this tutorial, leading researchers and engineers from Toloka will share their unique industry experience in achieving efficient natural language annotation with crowdsourcing. We will introduce data labeling via public crowdsourcing marketplaces and present the key components of efficient label collection. In the practice session, participants will choose a real language resource production task, experiment with settings for the labeling process, and launch their label collection project on Toloka, one of the world’s largest crowdsourcing marketplaces. All projects will run on the real Toloka crowd during the session. We will also present useful quality control techniques and give attendees an opportunity to discuss their own annotation ideas.

Topics

  • Reasons for collecting and labeling data via crowdsourcing for NLP
  • Key components of crowdsourcing for efficient data labeling
  • Decomposition approach
  • Performer selection and training
  • Hands-on practice session: audio transcription
  • Advanced crowdsourcing techniques: aggregation, incremental relabeling & pricing
  • Example of how to aggregate crowdsourced texts using the Crowd-Kit Python library (a minimal sketch follows this list)
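
To give a flavor of the last topic, here is a minimal sketch of aggregating several crowdsourced transcriptions of the same audio clip with Crowd-Kit's ROVER text aggregator. The toy data, worker IDs, and expected output are illustrative assumptions, not tutorial materials; the column names (task, worker, text) follow the current Crowd-Kit convention, and the library is installed with pip install crowd-kit.

    import pandas as pd
    from crowdkit.aggregation import ROVER

    # Toy transcriptions: three crowd workers transcribed the same audio snippet.
    answers = pd.DataFrame([
        {"task": "audio_1", "worker": "w1", "text": "please call stella"},
        {"task": "audio_1", "worker": "w2", "text": "please call stela"},
        {"task": "audio_1", "worker": "w3", "text": "please call stella now"},
    ])

    # ROVER aligns the tokenized transcriptions and votes on each position.
    rover = ROVER(
        tokenizer=lambda text: text.split(),           # split a transcription into tokens
        detokenizer=lambda tokens: " ".join(tokens),   # join the winning tokens back into text
    )
    aggregated = rover.fit_predict(answers)  # pandas Series indexed by task
    print(aggregated["audio_1"])  # likely output for this toy data: "please call stella"

The same fit_predict call scales to thousands of tasks. For categorical labels rather than free-form text, Crowd-Kit also provides aggregators such as Majority Vote and Dawid-Skene, which are closer to the advanced techniques discussed later in the tutorial.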

Speakers

Dmitry Ustalov
Toloka, Analyst / Software Developer

Daria Baidakova
Toloka, Director of Educational Programs

Natalie Fedorova
Toloka, Educational Project Manager

Organizers

Natalie Fedorova
Toloka, Educational Project Manager

Aljona Johnson
Toloka, Educational Project Manager

Dr. Jesús Savage Carmona
UNAM, Full-Time Professor C (Profesor de Tiempo Completo Titular C)

Schedule

10:00 – 10:15

Part 0: Introduction
— The concept of data labeling via crowdsourcing
— Crowdsourcing task examples
— Crowdsourcing platforms
— Yandex crowdsourcing experience

10:15 – 10:45

Part I: Key Components for Efficient Data Collection
— Decomposition for an effective pipeline
— Task instruction & interface: best practices
— Quality control techniques

10:45 – 11:45

Part II: Practice part I
— Dataset and required labels
— Discussion: how to collect labels?
— Data labeling pipeline for implementation
— You create, configure, and run data labeling projects with real performers in real time

11:45 – 12:15

Break

12:15 – 12:40

Part III: Advanced techniques
— Incremental relabeling
— Dynamic pricing

12:40 – 13:00

Part IV: Practice part II
— Finishing up label collection
— Results aggregation

13:00 – 13:20

Part V: Conclusion
— Results of your projects
— Ideas for further work and research
— References to literature and other tutorials
— Q&A
