Overview
Training and deploying machine learning models rely on large amounts of human-annotated data. As human labeling becomes increasingly expensive and time-consuming, recent research has developed several strategies to speed up annotation and reduce cost and human workload: generating synthetic training data, active learning, and hybrid labeling. This tutorial is oriented toward practical applications: we will present the basics of each strategy, highlight their benefits and limitations, and discuss real-life case studies in detail. Additionally, we will walk through best practices for managing human annotators and controlling the quality of the final dataset. The tutorial includes a hands-on workshop in which attendees will be guided through implementing a hybrid annotation setup. It is designed for NLP practitioners from both research and industry backgrounds who are involved in, or interested in, optimizing data labeling projects.
Part 1: Introduction (20 min)
This section motivates the need for large labeled datasets and introduces the key concepts used throughout the tutorial.
Part 2: LM workflows (30 min)
This section will demonstrate best practices for common workflows involving language models (LMs) and large language models (LLMs). These workflows aim to (i) build efficient LMs with acceptable performance that are optimized for labeling data, and (ii) generate synthetic data for data augmentation.
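To make (ii) concrete, below is a minimal sketch of LLM-based synthetic data generation for augmentation. The seed examples and prompt are toy illustrations, and `call_llm` is a hypothetical stub standing in for whichever chat-completion client is used.

```python
# A minimal sketch of LLM-based synthetic data generation for augmentation.
# `call_llm` is a hypothetical stand-in for any chat-completion client.
import json

SEED_EXAMPLES = [
    {"text": "The battery dies within an hour.", "label": "negative"},
    {"text": "Setup took two minutes, works flawlessly.", "label": "positive"},
]

PROMPT_TEMPLATE = (
    "You generate training data for sentiment classification.\n"
    "Here are labeled examples:\n{examples}\n"
    "Write one new, distinct review that would be labeled '{label}'. "
    "Return only the review text."
)

def call_llm(prompt: str) -> str:
    """Stub so the sketch runs; replace with your provider's client."""
    return "Placeholder review text."

def generate_synthetic(n: int, label: str) -> list[dict]:
    """Prompt the LLM n times to produce new examples for one class."""
    examples = "\n".join(f"- {e['text']} -> {e['label']}" for e in SEED_EXAMPLES)
    prompt = PROMPT_TEMPLATE.format(examples=examples, label=label)
    return [{"text": call_llm(prompt), "label": label} for _ in range(n)]

print(json.dumps(generate_synthetic(3, "negative"), indent=2))
```

In practice, generated examples are typically deduplicated and filtered, e.g. by a classifier or human spot checks, before being added to the training set.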
Part 3: Active learning with LMs (40 min)
This section presents active learning (AL) for data annotation. We discuss key strategies for both generative and non-generative AL, along with their applications, advantages, and limitations.
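As a concrete reference point, here is a minimal sketch of pool-based AL with uncertainty (entropy) sampling, one common non-generative strategy. The texts, labels, and annotation budget k are illustrative toy values.

```python
# A minimal sketch of pool-based active learning with uncertainty (entropy)
# sampling; the texts, labels, and batch size k are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled_texts = ["great product", "terrible quality", "loved it", "broke on arrival"]
labels = [1, 0, 1, 0]
unlabeled_pool = ["works as expected", "never buying again", "decent for the price"]

# Fit a cheap proxy model on the current labeled set.
vec = TfidfVectorizer().fit(labeled_texts + unlabeled_pool)
clf = LogisticRegression().fit(vec.transform(labeled_texts), labels)

# Score the unlabeled pool: higher entropy = more model uncertainty.
probs = clf.predict_proba(vec.transform(unlabeled_pool))
entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)

k = 2  # annotation budget per AL round
for i in np.argsort(-entropy)[:k]:  # most uncertain items go to annotators
    print(f"annotate: {unlabeled_pool[i]!r} (entropy={entropy[i]:.3f})")
```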
Part 4: Quality control and managing human workers (30 min)
This section focuses on quality control and best practices for working with human annotators.
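For illustration, the sketch below computes two common quality signals covered in this section: pairwise inter-annotator agreement (Cohen's kappa) and per-annotator accuracy on hidden gold questions. The annotator names and labels are toy data.

```python
# A minimal sketch of two quality-control signals; annotator names,
# labels, and gold answers are toy data.
from sklearn.metrics import cohen_kappa_score

ann_a = ["pos", "neg", "pos", "pos", "neg"]
ann_b = ["pos", "neg", "neg", "pos", "neg"]

# Pairwise inter-annotator agreement (chance-corrected).
print("Cohen's kappa:", round(cohen_kappa_score(ann_a, ann_b), 3))

# Gold-question check: each annotator against hidden gold labels.
gold = ["pos", "neg", "pos", "pos", "pos"]
for name, anns in {"annotator_a": ann_a, "annotator_b": ann_b}.items():
    acc = sum(a == g for a, g in zip(anns, gold)) / len(gold)
    print(f"{name} gold accuracy: {acc:.2f}")
```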
Part 5: Hybrid pipelines (40 min)
This section presents the development of hybrid pipelines, i.e., effectively combining human and model labeling to achieve the best balance of quality, cost, and speed.
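A minimal sketch of one such pipeline follows, assuming a confidence-threshold routing rule: items the model labels with high confidence are auto-accepted, while the rest are queued for human annotation. The classifier, toy data, and 0.9 cutoff are illustrative; in practice the threshold would be tuned on held-out data to hit a target quality/cost trade-off.

```python
# A minimal sketch of hybrid human/model routing by model confidence.
# The classifier, toy data, and 0.9 cutoff are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_texts = ["great product", "terrible quality", "loved it", "broke on arrival"]
train_labels = [1, 0, 1, 0]
incoming = ["absolutely fantastic", "not sure how I feel about this"]

vec = TfidfVectorizer().fit(train_texts + incoming)
clf = LogisticRegression().fit(vec.transform(train_texts), train_labels)

for text, p in zip(incoming, clf.predict_proba(vec.transform(incoming))):
    conf = p.max()
    # Auto-accept confident model labels; escalate the rest to humans.
    route = "model label" if conf >= 0.9 else "human queue"
    print(f"{text!r}: confidence={conf:.2f} -> {route}")
```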
Part 6: Limitations (20 min)
This section addresses the challenges of labeling tasks with LMs, the various reasons behind these difficulties, and future research directions for overcoming these limitations.
Part 7: Hands-on session: Hybrid data annotation (30 min)
In this hands-on session, we will implement a hybrid approach on a real-world dataset and demonstrate improvements in annotation quality.