In this tutorial, leading Yandex researchers and engineers share their unique industry experience in efficient data annotation (labeling) for self-driving cars. We present the data processing pipeline required for the cars to learn how to behave autonomously on the roads, and we also demonstrate the crucial role of data annotation in making the learning process effective. This is followed by an introduction to public crowdsourcing marketplaces and key crowdsourcing techniques for efficient annotation: task decomposition, quality control methods, aggregation, incremental relabeling, and others.
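To make three of these techniques concrete, here is a minimal sketch of golden-set quality control, majority-vote aggregation, and a simple relabeling rule. It is an illustration under stated assumptions, not the pipeline presented in the tutorial: the function names, the toy labels, and the thresholds (0.5 for worker accuracy, 0.7 for confidence) are all hypothetical, and "incremental relabeling" is read here simply as "request more labels only for tasks whose aggregated answer is still uncertain".

```python
from collections import Counter, defaultdict


def worker_accuracy(labels, golden):
    """Share of golden (known-answer) tasks that each worker labeled correctly."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, worker, label in labels:
        if task in golden:
            total[worker] += 1
            correct[worker] += int(label == golden[task])
    return {worker: correct[worker] / total[worker] for worker in total}


def majority_vote(labels, skip_workers=frozenset()):
    """Aggregate labels per task; return {task: (label, share of votes)}."""
    votes = defaultdict(Counter)
    for task, worker, label in labels:
        if worker not in skip_workers:
            votes[task][label] += 1
    result = {}
    for task, counter in votes.items():
        label, count = counter.most_common(1)[0]
        result[task] = (label, count / sum(counter.values()))
    return result


def needs_relabeling(aggregated, min_confidence=0.7):
    """Tasks whose winning label has too small a share of the votes."""
    return sorted(t for t, (_, conf) in aggregated.items() if conf < min_confidence)


# Hypothetical toy data: (task id, worker id, label). "g1" is a golden task.
labels = [
    ("g1", "w1", "car"), ("g1", "w2", "car"), ("g1", "w3", "pedestrian"),
    ("t1", "w1", "car"), ("t1", "w2", "car"), ("t1", "w3", "pedestrian"),
    ("t2", "w1", "pedestrian"), ("t2", "w2", "car"), ("t2", "w3", "car"),
]
golden = {"g1": "car"}

accuracy = worker_accuracy(labels, golden)
# Assumed threshold: drop workers who fail more than half of the golden tasks.
banned = {worker for worker, acc in accuracy.items() if acc < 0.5}
aggregated = majority_vote(labels, skip_workers=banned)
to_relabel = needs_relabeling(aggregated)
```

In this toy run, worker `w3` fails the golden task and is excluded, task `t1` is then resolved unanimously, and task `t2` is left with a split vote, so it is the only one sent back for more labels. Production pipelines typically replace plain majority vote with models that weight workers by estimated skill.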
In the practice session, participants choose one real label collection task, experiment with selecting settings for the labeling process, and launch their own labeling project on Toloka, one of the world’s largest crowdsourcing marketplaces. During the tutorial, all projects run on the real Toloka crowd. Participants also receive feedback and practical advice on making their projects more efficient.
— The concept of crowdsourcing
— Crowdsourcing task examples
— Crowdsourcing platforms
— Yandex crowdsourcing experience
— Reasons for crowdsourcing
— The kind of data we collect and label
— Most common tasks and their applications
— Decomposition for an effective pipeline
— Task instruction & interface: best practices
— Quality control techniques
— Project: creation & configuration
— Pool: creation & configuration
— Tasks: uploading & golden set creation
— In-flight statistics and downloading results
— Demos of 2D and 3D object segmentation tasks
— Performer training and selection for complex tasks
— Dataset and required labels
— Discussion: how to collect labels?
— Data labeling pipeline for implementation
› Run data labeling projects with real performers in real time
— Aggregation models
— Incremental relabeling
— Performance-based pricing
— Project results
— Ideas for further work and research
— References to literature and other tutorials
PyData Global 2022
Crowd-Kit: A scikit-learn for crowdsourced annotations. The talk includes a demonstration of Crowd-Kit, an open-source computational quality control library.
WSDM 2023 Crowd Science Workshop
CANDLE: Collaboration of Humans and Learning Algorithms for Data Labeling.