Modern Web applications employ sophisticated Machine Learning models to rank news, posts, products, and other items presented to or contributed by users. To keep these models useful, one has to constantly train, evaluate, and monitor them on freshly annotated data, which crowdsourcing can provide.
In this tutorial we present a portion of our six-year experience in solving real-world tasks with human-in-the-loop pipelines that combine the efforts of humans and machines. We introduce data labeling via public crowdsourcing marketplaces and present the critical components of efficient data labeling. Then, we run a practical session in which participants address a challenging real-world Information Retrieval task for e-Commerce, experiment with settings for the labeling process, and launch their label-collection projects on a real crowd during the session. We present useful quality control techniques and give attendees an opportunity to discuss their annotation ideas. The methods and techniques described in this tutorial apply to any crowdsourced data and are not bound to any specific crowdsourcing platform.
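One widely used quality control technique is seeding the task stream with control ("golden") tasks whose correct answers are known in advance, then measuring each worker's accuracy on them. A minimal sketch in plain Python (the function names and the 0.7 threshold are illustrative, not from the tutorial):

```python
def worker_accuracy(annotations, golden):
    """Accuracy of each worker on control ("golden") tasks.

    annotations: list of (worker, task, label) tuples.
    golden: dict mapping control task id -> known true label.
    """
    hits, total = {}, {}
    for worker, task, label in annotations:
        if task not in golden:
            continue  # only control tasks count toward accuracy
        total[worker] = total.get(worker, 0) + 1
        hits[worker] = hits.get(worker, 0) + (label == golden[task])
    return {w: hits[w] / total[w] for w in total}

def filter_workers(annotations, golden, threshold=0.7):
    """Drop annotations from workers below the accuracy threshold.

    Workers who answered no control tasks are dropped as well,
    since their quality cannot be estimated.
    """
    acc = worker_accuracy(annotations, golden)
    return [(w, t, l) for w, t, l in annotations
            if acc.get(w, 0.0) >= threshold]
```

In practice, crowdsourcing platforms apply such rules continuously while the project runs, banning or retraining low-accuracy workers on the fly.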
The Role of Human-in-the-Loop in Building Search Engines
Ranking and Quality Metrics
Hands-On Practice Session
Results Aggregation and Integration into the ML Pipeline
Results Discussion and Conclusion
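Results aggregation resolves disagreement between overlapping annotations before the labels enter the ML pipeline. A minimal pure-Python sketch of the classic Dawid-Skene EM aggregator, which weights workers by their estimated reliability (Crowd-Kit ships production implementations of this and other methods; the code below is an illustration under simplifying assumptions, not the tutorial's implementation):

```python
def dawid_skene(annotations, n_iter=20):
    """Aggregate noisy labels with Dawid-Skene EM.

    annotations: list of (worker, task, label) tuples.
    Returns: dict mapping task -> most probable true label.
    """
    workers = sorted({w for w, _, _ in annotations})
    tasks = sorted({t for _, t, _ in annotations})
    labels = sorted({l for _, _, l in annotations})

    # Initialize task posteriors with a soft majority vote.
    post = {t: {l: 0.0 for l in labels} for t in tasks}
    for _, t, l in annotations:
        post[t][l] += 1.0
    for t in tasks:
        s = sum(post[t].values())
        post[t] = {l: post[t][l] / s for l in labels}

    for _ in range(n_iter):
        # M-step: class priors and per-worker confusion matrices.
        prior = {l: sum(post[t][l] for t in tasks) / len(tasks) for l in labels}
        conf = {w: {l: {k: 1e-6 for k in labels} for l in labels}  # smoothed
                for w in workers}
        for w, t, obs in annotations:
            for true in labels:
                conf[w][true][obs] += post[t][true]
        for w in workers:
            for true in labels:
                s = sum(conf[w][true].values())
                conf[w][true] = {k: conf[w][true][k] / s for k in labels}

        # E-step: recompute posteriors over true labels.
        new_post = {}
        for t in tasks:
            scores = {l: prior[l] for l in labels}
            for w, t2, obs in annotations:
                if t2 != t:
                    continue
                for true in labels:
                    scores[true] *= conf[w][true][obs]
            s = sum(scores.values())
            new_post[t] = {l: scores[l] / s for l in labels}
        post = new_post

    return {t: max(post[t], key=post[t].get) for t in tasks}
```

The aggregated labels can then be fed directly into model training or evaluation, closing the human-in-the-loop cycle.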
PyData Global 2022
Crowd-Kit: A scikit-learn for crowdsourced annotations. The talk includes a demonstration of Crowd-Kit, an open-source computational quality control library.
WSDM 2023 Crowd Science Workshop
CANDLE: Collaboration of Humans and Learning Algorithms for Data Labeling.