Overview
In this tutorial, leading researchers and engineers from Toloka share their unique industry experience in achieving efficient natural language annotation with crowdsourcing. We will introduce data labeling via public crowdsourcing marketplaces and present the key components of efficient label collection. Then, in the practice session, participants will choose one real language resource production task, experiment with settings for the labeling process, and launch their label collection project on Toloka, one of the world's largest crowdsourcing marketplaces. During the tutorial session, all projects will be run on the real Toloka crowd. We will also present useful quality control techniques and give attendees an opportunity to discuss their own annotation ideas.
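One common quality control technique in crowd annotation is aggregating the labels of several independent workers per item. Below is a minimal sketch of majority-vote aggregation; the function name and the sentiment labels are illustrative and do not reflect Toloka's actual API.

```python
from collections import Counter

def majority_vote(annotations):
    """Aggregate per-item crowd labels by majority vote.

    annotations: dict mapping an item id to the list of labels
    collected from different workers for that item.
    Ties are broken by the label encountered first.
    """
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in annotations.items()}

# Hypothetical example: three workers label two sentences for sentiment.
crowd_labels = {
    "sent-1": ["pos", "pos", "neg"],
    "sent-2": ["neg", "neg", "neg"],
}
print(majority_vote(crowd_labels))  # {'sent-1': 'pos', 'sent-2': 'neg'}
```

In practice, marketplaces also weight workers by measured skill or use probabilistic models such as Dawid-Skene, but the overlap-and-aggregate pattern above is the basic building block.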
Topics
Reasons for collecting and labeling data via crowdsourcing for NLP: pros & cons
Key components of crowdsourcing for efficient data labeling