In our tutorial, we present a systematic view on using Human-in-the-Loop to obtain scalable offline evaluation processes and, in particular, high-quality relevance judgements.
Modern Web services widely employ sophisticated Machine Learning techniques to rank news, posts, products, and other items presented to the users or contributed by them. These techniques are usually built on offline data pipelines and rely on a numerical approximation of the relevance of the displayed content. In our hands-on tutorial, we present a systematic view on using Human-in-the-Loop to obtain scalable offline evaluation processes and, in particular, high-quality relevance judgements. We will introduce the ranking problem to the attendees, discuss the commonly used ranking quality metrics, and then focus on a Human-in-the-Loop-based approach to obtaining relevance judgements at scale. More precisely, we will present a thorough introduction to pairwise comparisons, demonstrate how these comparisons can be obtained using Crowdsourcing, and organize a hands-on practice session in which the attendees will obtain high-quality relevance judgements for search quality evaluation. Finally, we will discuss the obtained relevance judgements, point out directions for further studies, and answer questions asked during the tutorial.
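To make the ranking quality metrics mentioned above concrete, the following is a minimal sketch of computing DCG and NDCG from graded relevance judgements; the function names and the toy judgement values are ours for illustration and are not part of the tutorial materials.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float], k: int) -> float:
    """Discounted Cumulative Gain over the top-k items in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based, hence +2
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg(relevances: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Toy example: graded judgements for documents listed in their ranked order.
judgements = [3, 2, 3, 0, 1, 2]
print(round(ndcg(judgements, k=5), 4))
```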
Introduction to Offline Evaluation
Ranking and Quality Metrics
Hands-On Practice Session I
Hands-On Practice Session II
Final Remarks and Conclusion
PyData Global 2022
Crowd-Kit: A scikit-learn for crowdsourced annotations. The talk includes a demonstration of Crowd-Kit, an open-source computational quality control library.
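As a hedged sketch of what such a demonstration might look like, the snippet below aggregates crowdsourced pairwise comparisons with the Bradley-Terry aggregator from the crowd-kit package; the toy data, the column layout, and the choice of aggregator are our assumptions and do not reproduce the contents of the original talk.

```python
# Assumes `pip install crowd-kit pandas`; data below is purely illustrative.
import pandas as pd
from crowdkit.aggregation import BradleyTerry

# Pairwise comparisons collected from crowd workers: each row records which of
# the two shown documents ("left" vs. "right") the worker judged more relevant.
comparisons = pd.DataFrame(
    [
        {"worker": "w1", "left": "doc_a", "right": "doc_b", "label": "doc_a"},
        {"worker": "w2", "left": "doc_a", "right": "doc_b", "label": "doc_a"},
        {"worker": "w3", "left": "doc_b", "right": "doc_c", "label": "doc_c"},
        {"worker": "w1", "left": "doc_a", "right": "doc_c", "label": "doc_a"},
        {"worker": "w2", "left": "doc_b", "right": "doc_c", "label": "doc_b"},
    ]
)

# Fit the Bradley-Terry model to obtain a score per document; sorting by the
# score yields an aggregated relevance ranking of the compared documents.
scores = BradleyTerry(n_iter=100).fit_predict(comparisons)
print(scores.sort_values(ascending=False))
```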
WSDM 2023 Crowd Science Workshop
CANDLE: Collaboration of Humans and Learning Algorithms for Data Labeling.