Reinforcement learning from human feedback (RLHF) has dramatically improved the real-world performance and user experience of large machine learning models. However, this approach has remained largely out of reach for most academic researchers.
Our tutorial covers the two essential parts of a successful RLHF project: the core machine learning techniques behind RLHF, and the human-in-the-loop methodology for collecting human feedback data. As a take-home assignment, you will run your first RLHF experiment on a simple creative task: generating top-5 lists of items the user requests (for instance, top ideas for a Thanksgiving meal).
Introduction
Technical overview
Break
Human annotation for RLHF
Conclusion and Q&A
The technical overview will break down the RLHF process into three primary stages: language modeling to obtain the base policy, modeling human preferences, and optimizing the policy with RL. This part of the tutorial will describe potential research questions, pitfalls, and tips for successful projects.
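To make the second and third stages concrete, the sketch below (plain PyTorch; the function names, tensor shapes, and the KL coefficient beta are illustrative assumptions rather than the tutorial's prescribed setup) shows the pairwise Bradley-Terry loss commonly used for preference modeling and the KL-penalized reward that the RL stage typically optimizes.

```python
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: push the reward model to score the
    human-preferred completion above the rejected one."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def kl_penalized_reward(reward: torch.Tensor,
                        policy_logprob: torch.Tensor,
                        ref_logprob: torch.Tensor,
                        beta: float = 0.1) -> torch.Tensor:
    """Reward used during RL fine-tuning: the learned reward minus a
    KL penalty that keeps the policy close to the base model."""
    return reward - beta * (policy_logprob - ref_logprob)

# Toy example with made-up reward-model scores for two prompt/completion pairs.
chosen = torch.tensor([1.2, 0.3])     # scores for the preferred completions
rejected = torch.tensor([0.4, -0.1])  # scores for the rejected completions
loss = preference_loss(chosen, rejected)  # scalar; smaller when chosen > rejected
print(loss)
```

The KL term is what distinguishes RLHF fine-tuning from naive reward maximization: without it, the policy can drift into degenerate outputs that exploit the learned reward model.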
In this section, we will address challenges in collecting human-produced texts and performing human preference scoring, and we will focus on three aspects of human annotation for RLHF.
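As an illustration of what preference scoring produces, a single pairwise comparison can be stored as a small record like the one below. The field names and values are hypothetical; the tutorial does not prescribe a particular schema.

```python
# A hypothetical pairwise preference record; field names are illustrative only.
comparison = {
    "prompt": "Give me a top-5 list of Thanksgiving side dishes.",
    "completion_a": "1. Mashed potatoes\n2. Stuffing\n3. Green beans\n4. Cranberry sauce\n5. Rolls",
    "completion_b": "Turkey is great. Also pie.",
    "preferred": "a",            # the annotator's choice
    "annotator_id": "anno_042",  # useful for measuring inter-annotator agreement
}
```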
To apply what you learn about RLHF methods, you'll have the opportunity to complete a short practical assignment after the tutorial. The assigned task will involve generating top-5 lists for a user's query. You'll use example lists as human demonstration data at the supervised fine-tuning stage, then run preference modeling and RL fine-tuning to make the model's outputs more helpful and faithful to the prompt.
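If you want a head start on the assignment, the supervised fine-tuning stage amounts to standard causal-language-model training on prompt-plus-list demonstrations. The sketch below is a minimal, single-example illustration using Hugging Face transformers and PyTorch; the base model name, demonstration text, and hyperparameters are placeholder assumptions, not the assignment's required setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder base model; the assignment may use a different checkpoint.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One human demonstration: a prompt followed by the desired top-5 list.
demonstration = (
    "List the top 5 ideas for a Thanksgiving meal.\n"
    "1. Roast turkey\n2. Stuffing\n3. Mashed potatoes\n"
    "4. Cranberry sauce\n5. Pumpkin pie"
)

# Standard causal-LM objective: predict each token of the demonstration.
# (In practice you would batch many demonstrations and often mask the
# prompt tokens so the loss is computed only on the list itself.)
inputs = tokenizer(demonstration, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"SFT loss on one demonstration: {outputs.loss.item():.3f}")
```

After this stage, the same demonstration-and-comparison data format feeds the preference-modeling and RL fine-tuning steps described in the technical overview.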