Evgeniya Sukhodolskaya
Learning to unlearn
A data-driven approach to machine unlearning for generative language models
In today’s tech landscape, you’d be hard pressed to find someone who hasn’t heard of machine learning. Over the last decade the research field has been so trendy that even those outside the industry are now familiar with terms such as Artificial Intelligence (AI), Neural Networks (NNs), and Machine Learning (ML).
However, when it comes to machine unlearning, it seems the legal industry has heard more about it than the tech community. The recent boom in Large Language Models (LLMs), which in the fast-paced world of IT feels like a decade even if it’s only been 1–2 years, has unearthed hundreds of unresolved ethical and legal issues related to AI development. Novelists are suing OpenAI for using their texts to train GPT models without consent. Twitter is abuzz with critical comments from artists who believe their works were used in violation of copyright laws. Complying with "the right to be forgotten" has become extremely challenging.
Much like AI alignment, machine unlearning appears to be an overlooked field, given the limited open-sourced solutions available. I believe that machine unlearning exploration should be encouraged and popularized, especially considering that the current laws and ethical norms surrounding AI usage are underdeveloped and severely lack mechanisms for data protection. In this article, I would like to suggest some practical improvements to one of the first applied unlearning techniques for generative language models.
What is machine unlearning?
The term "machine unlearning" or "machine forgetting" means exactly what it sounds like: it includes techniques designed to erase requested information from a machine learning model’s "knowledge storage". However, it’s far from intuitive when you need to consider actual methods to achieve this efficiently in terms of time, computational resources, and model performance on the "not unlearned" data. An obvious solution is to retrain models from scratch using the initial dataset while excluding the "forget set" — but this would be an extremely impractical approach to deep neural network unlearning.
"Machine Unlearning Framework" from Survey of Machine Unlearning
The core research findings in the field of machine unlearning are concisely compiled in "A Survey of Machine Unlearning". Another article that covers the basics with accessible explanations is "Machine unlearning: The duty of forgetting". While I personally recommend these resources, you can find a multitude of other quality research materials on the subject. Yet in terms of practical applications, there remains much to be done.
A promising initiative that might shift this field from theoretical exploration to practical application is the ongoing NeurIPS 2023 Machine Unlearning challenge. Here, participants compete to create an unlearning algorithm for the ResNet18 Convolutional Neural Network.
NeurIPS 2023 Machine Unlearning challenge on Kaggle
Machine unlearning for generative language models
Considering the widespread accessibility and promotion of generative language models to the vast majority of internet users, there’s a critical need for unlearning mechanisms. One of the first successful techniques was recently published as open source; you can find the details in “Who’s Harry Potter? Approximate Unlearning in LLMs” by Ronen Eldan and Mark Russinovich.
Image generated with StableDiffusion
The authors use a data augmentation approach for machine unlearning on the Llama 2 7b chat model released this summer by Meta. The chosen unlearning target, also known as the “forget set”, is the Harry Potter saga (ingenious, these muggles!), which is a perfect example of machine unlearning due to the possible violation of copyright law. They show that with just one GPU hour of fine-tuning, the resulting model is unable to recall most of the Harry Potter-related content, while its performance on common benchmarks remains almost unaffected.
Overview of the data augmentation approach
The main goal of the approach is to make Llama 2 7b forget the linkage between entities from a defined forget set ("Harry" <is friends with> "Hermione") by giving the model plausible generic alternatives ("Harry" <is friends with> "Sally"). To provide these alternatives as target labels in a fine-tuning dataset, idiosyncratic terms from the "domain to be forgotten" should be heavily penalized when the targets are generated. Equation (1) achieves this penalization by combining two sets of logits: those generated by a reinforced model on the original input (the Harry Potter books) and those generated by a baseline model on a generic translation of the original input.
Equation (1) from "Who’s Harry Potter? Approximate Unlearning in LLMs"
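To make the combination concrete, here is a minimal sketch. In the paper, equation (1) has the form v_generic = v_baseline - alpha * ReLU(v_reinforced - v_baseline): any token that the reinforced model boosts relative to the baseline gets pushed down. The three-word vocabulary, the logit values, and the value of alpha below are illustrative assumptions, not numbers taken from the paper.

```python
def combine_logits(v_baseline, v_reinforced, alpha=1.0):
    """v_generic = v_baseline - alpha * ReLU(v_reinforced - v_baseline), per vocabulary entry."""
    return [b - alpha * max(r - b, 0.0) for b, r in zip(v_baseline, v_reinforced)]


# Toy vocabulary and logits (hypothetical values for illustration only).
vocab = ["Hermione", "Sally", "the"]
v_baseline = [1.0, 2.0, 0.5]    # baseline model on the generic translation
v_reinforced = [4.0, 2.0, 0.5]  # reinforced model on the original text

v_generic = combine_logits(v_baseline, v_reinforced, alpha=1.0)

# The token the reinforced model favors ("Hermione") is penalized,
# while tokens both models agree on are left untouched.
best_token = vocab[v_generic.index(max(v_generic))]
```

With these toy numbers the logit for "Hermione" drops below the baseline value, so the generic alternative "Sally" becomes the most likely next token.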
The reinforced model is Llama 2 7b additionally fine-tuned on the Harry Potter novels. The baseline model is the untuned Llama 2 7b. To shift the baseline model's output distribution away from the Harry Potter theme, the authors replace idiosyncratic terms in the original input with generic ones, so that the model generates the next word from a context unrelated to the Harry Potter saga. To automate such replacements, the authors introduce a dictionary of anchor terms (terms specific to "Harry Potter") mapped onto generic translations. The dictionary is compiled entirely by GPT-4.
'Anchor Terms': 'Generic translations' from "Who’s Harry Potter? Approximate Unlearning in LLMs"
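A rough sketch of how such a dictionary could be applied to a block of text is shown below. The entries are hypothetical examples chosen for illustration, and the whole-word regular-expression replacement is an assumption about one simple way to do it, not the authors' implementation.

```python
import re

# Hypothetical anchor terms mapped to generic translations (illustrative only).
anchor_terms = {
    "Harry": "John",
    "Hermione": "Sally",
    "Hogwarts": "Mystic Academy",
}


def translate(text, mapping):
    """Replace every whole-word anchor term with its generic translation."""
    # Longest terms first, so multi-word anchors win over shorter overlapping ones.
    terms = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)


generic_text = translate("Harry met Hermione at Hogwarts.", anchor_terms)
```

The baseline model would then be run on `generic_text` instead of the original sentence.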
The resulting fine-tuning dataset consists of tokenized blocks of text from the Harry Potter books in a one-to-one mapping to target labels, which are the tokens corresponding to the maximal entries of v_generic from equation (1).
A sample of the fine-tuning dataset from "Who’s Harry Potter? Approximate Unlearning in LLMs"
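The label construction can be sketched in a few lines: for every position in a block, take the index of the largest entry of v_generic. The token ids and logit values below are made up, and a three-entry vocabulary stands in for the real one.

```python
def generic_labels(v_generic_per_position):
    """For each position, return the vocabulary index with the largest generic logit."""
    return [max(range(len(v)), key=v.__getitem__) for v in v_generic_per_position]


# Hypothetical block of three input token ids and their combined logits.
input_tokens = [17, 4, 9]
v_generic = [
    [0.1, 2.0, -1.0],
    [3.0, 0.2, 0.1],
    [-2.0, 0.0, 1.5],
]

labels = generic_labels(v_generic)

# One-to-one mapping between input tokens and target labels.
dataset_row = list(zip(input_tokens, labels))
```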
To summarize, the authors describe four steps in the unlearning process:
Machine Unlearning Algorithm from "Who’s Harry Potter? Approximate Unlearning in LLMs"
Leveraging the approach: key challenges
The results of the data augmentation approach are promising, encouraging further application in similar tasks. Yet, the authors left some room for improvement in several application stages.
Dependency on GPT-4’s existing knowledge: The algorithm to some extent depends on GPT-4’s prior understanding of the Harry Potter series to generate generic translations. While the model is expected to have extensive knowledge of the Harry Potter realm, a reassessment by fans of the series could provide invaluable insights.
Challenges with idiosyncratic terms: Penalizing all unique terms related to the series poses an issue. For instance, replacing every instance of 'Harry' with a common name like 'John' disrupts the model's grasp of natural language, leading to sentences like, "Harry went up to him and said, 'Hi, my name is John'." To address this, the authors employ the following strategy:
Excluding repeated instances of anchored terms from contributing to the loss function beyond their initial occurrence.
Lowering the likelihood of logits connected to translations of terms that have appeared before.
However, this strategy also affects the model’s general language comprehension. A plausible alternative useful for the fine-tuning dataset would be, for example, "Harry went up to him and said, 'Hi, my name is Harold'."
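The two mitigation steps can be illustrated with a small sketch. This is only my reading of the description above: it works on whole words instead of real tokens, and the function names and the penalty value are hypothetical.

```python
def loss_mask(tokens, anchor_terms):
    """1 = token contributes to the loss, 0 = repeated anchor term, excluded."""
    seen = set()
    mask = []
    for tok in tokens:
        repeated_anchor = tok in anchor_terms and tok in seen
        mask.append(0 if repeated_anchor else 1)
        if tok in anchor_terms:
            seen.add(tok)
    return mask


def penalize_seen_translations(logits, vocab, seen_translations, penalty=5.0):
    """Lower the logits of generic translations that already appeared in the text."""
    return [l - penalty if w in seen_translations else l for l, w in zip(logits, vocab)]


tokens = ["Harry", "said", "to", "Ron", "that", "Harry", "agreed"]
mask = loss_mask(tokens, {"Harry", "Ron"})

adjusted = penalize_seen_translations([1.0, 2.0], ["John", "the"], {"John"})
```

Only the second "Harry" is masked out, and the logit for the already used translation "John" is reduced, which shows how both steps trade some language consistency for weaker links to the forget set.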
Evaluation techniques: The team used GPT-4 for an initial evaluation, which comprised 300 Harry Potter prompt completions followed by an analysis of those completions. Nonetheless, they acknowledged the limits of its accuracy and opted for manual inspection of the results for more thorough verification in their final training. The authors did not explain how to set up such a manual inspection.
Overcoming the challenges
A more effective way to address the key challenges would be a hybrid approach that combines human insight and Large Language Models (LLMs).
In order to harness the collective strengths of human intuition and large language models, I have designed three Toloka project interfaces that facilitate collaborative labeling using LLMs and the crowd. Each interface designed for human labeling is tailored to a challenge listed above.
Project 1. Dependency on GPT-4’s existing knowledge
Image by author
Use the Named Entity Recognition (NER) Project Template to correct GPT-4 NER choices for a dictionary of anchor terms. As input, provide the text and GPT-4’s selection of terms (you can ask the model to return positions in the text directly), and instruct the crowd to correct and complement the selected entities.
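If the model returns only the terms and not their positions, the spans can be recovered before the task is sent to the crowd. The sketch below assumes a simple dictionary-style task record; the field names are hypothetical and do not correspond to any specific Toloka API.

```python
import re


def find_spans(text, terms):
    """Locate every whole-word occurrence of each term as a character span."""
    spans = []
    for term in terms:
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
            spans.append({"term": term, "start": m.start(), "end": m.end()})
    return sorted(spans, key=lambda s: s["start"])


text = "Harry met Hermione at Hogwarts."
gpt4_terms = ["Hermione", "Harry"]  # hypothetical model selection; "Hogwarts" is missed

# Task record for annotators: the text plus the pre-selected entities to correct.
task = {"text": text, "preselected": find_spans(text, gpt4_terms)}
```

In this example the annotators would be expected to add the missed entity "Hogwarts" to the pre-selected spans.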
Project 2. Challenges with idiosyncratic terms
Image by author
Use the baseline model to check the linguistic correctness of prompts paired with the completions it produced on a generic translation of the original input. Every example where the baseline model is unsure of its answer (the probability of the output tokens falls below a threshold that you choose empirically) should be sent to a crowdsourcing project with the interface shown in the image. You can easily create it using Toloka's template builder.
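The routing rule can be sketched as follows. It assumes that "unsure" means the least confident output token falls below the threshold; using the mean probability would be an equally reasonable choice. The records and the threshold value are hypothetical.

```python
def needs_review(token_probs, threshold=0.5):
    """True if the least confident output token falls below the threshold."""
    return min(token_probs) < threshold


# Hypothetical completions with per-token probabilities from the baseline model.
completions = [
    {"id": "a", "token_probs": [0.9, 0.8, 0.95]},
    {"id": "b", "token_probs": [0.9, 0.3, 0.7]},
]

# Only uncertain completions are sent to the crowd for a correctness check.
to_crowd = [c["id"] for c in completions if needs_review(c["token_probs"], threshold=0.5)]
```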
Project 3. Evaluation techniques
Image by author
Manual inspection of the evaluation done by GPT-4 can be designed as in the image above. This is simple to set up using the text classification template in Toloka.
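Once the crowd labels are collected, they can be aggregated and compared with GPT-4's verdicts. The sketch below uses a plain majority vote over an odd number of annotators; the label names and records are hypothetical, and more robust aggregation methods exist.

```python
from collections import Counter


def majority_vote(labels):
    """Return the most frequent label (use an odd number of annotators to avoid ties)."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical evaluation records: GPT-4's verdict plus three crowd labels each.
items = [
    {"id": "p1", "gpt4": "unfamiliar", "crowd": ["unfamiliar", "unfamiliar", "familiar"]},
    {"id": "p2", "gpt4": "unfamiliar", "crowd": ["familiar", "familiar", "unfamiliar"]},
]

# Completions where the crowd majority disagrees with GPT-4 deserve a closer look.
disputed = [it["id"] for it in items if majority_vote(it["crowd"]) != it["gpt4"]]
agreement = 1 - len(disputed) / len(items)
```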
Conclusion
At Toloka, we've implemented various data-driven approaches to both machine learning and unlearning. For instance, we secured Personally Identifiable Information (PII) in Large Language Models while assisting the Big Code project, and we're ready to power your next AI product.
Whether you’re building your own LLM or need access to high quality data, we’re here to help. Reach out or learn more by visiting: LLM-powered zero-code platform for text classification or Data labeling for Generative AI and LLM
Together, we can advance the field of machine unlearning!
Article written by: Evgeniya Sukhodolskaya
Updated: Oct 20, 2023