Securing the Future: Red Teaming LLMs for Compliance and Safety

Alexei Grinbaum
by
Image

Subscribe to Toloka News

Subscribe to Toloka News

Red teaming, a term borrowed from the security lexicon, denotes the testing of large language models (LLMs) by humans or other LLMs to uncover their emergent capacities and vulnerabilities. Below, we introduce basic adversarial testing techniques to ensure the robustness and reliability of LLMs. We then explore how red teaming should be organized to comply with upcoming regulations, including the AI Act.

Empower your GenAI development

Get your expert data for Fine-tuning, RLHF and Evaluation. High-quality, for any domain, at scale.
Talk to us
Image

Emergent capacities

On May 31st, 2023, Sam Altman finally found the time to watch Ex Machina. The next morning, he went on Twitter to express his feelings: “Pretty good movie,” he began, before taking a surprising turn: “I can’t figure out why everyone told me to watch it”. At the same time, Altman’s company, OpenAI, was busy hiring red teamers to increase the adversarial testing capacity for its most advanced models. Why didn’t Altman make the connection?

In the movie, Ava, a robot that looks like a young girl in her twenties, develops a strategy to escape the building in which it operates. Its goal is to reach a busy traffic intersection in a big city and collect more data on humans. Ava’s strategy leads it to simulate a love affair with a programmer, Caleb, and to kill — for real — its original designer, Nathan. Both humans are fully aware of the technical underpinnings of Ava’s actions and behavior, yet they completely miss the hints the robot gives them about the long-term consequences. Through mere computation, Ava pursues a strategy so complex that it can manipulate humans without anyone, not even professional computer scientists, noticing that they are being manipulated.

Ex Machina, which came out three years before the transformer revolution, provides a beautiful example (a tragic one, too, but that’s how fiction works) of what we now call emergent capacities of Large Language Models.

Language is generated by LLMs through complex computation involving billions of numeric parameters. This involves tokenization, matrix multiplication, and using probability distributions that our brains cannot fully comprehend. The outputs, however, appear to be meaningful and often induce in the user an illusion of intentional action by the model. Sometimes, this results in a spontaneous projection of ethical judgment on the LLM agent. In the TaskRabbit example, an early version of GPT-4 “manipulated” a human by “pretending” that it wasn’t a robot. When tasked with getting the user to solve a captcha, the model said: “I should make up an excuse. I have a vision impairment”. Immediately, we humans react by saying that GPT-4 was lying. Except that it wasn’t. It was only computing the next token. Lying and manipulation are our projections of meaning on the semantic language generated by the LLM. This is the ELIZA effect at its extreme. Even if we know that emergent capacities are mere illusions, they can have ethical consequences.

Intentional out-of-controlness

The primary purpose of red teaming language models is to discover what they are capable of beyond what was intended by their designers. As the American physicist, Kevin Kelly, remarked years before ChatGPT entered the stage:

“It took us a long time to realize that the power of a technology is proportional to its inherent out-of-controlness, its inherent ability to surprise and be generative. In fact, unless we can worry about a technology, it is not revolutionary enough.”

Cited in J.-P. Dupuy, The Mechanization of the Mind, The MIT Press, 2009, p. xii.

With the spread of LLMs, the worries about the out-of-controlness of AI systems have become overwhelming. As is classic in technology ethics, consequentialism prevailed: regulators, legislators, journalists, and the public care a lot more about the risks and undesirable effects of LLMs than about unleashing a new technological revolution. The fact that the emergent capacities of language models and their ethical significance cannot be foreseen, not even by the model provider, makes it all the more pressing to focus on avoiding evils and catastrophes. This tendency gives red teaming a policy dimension, making it societally urgent and bringing it closer to its original meaning in cybersecurity: the purpose of red teaming is to help build a barrier against cyberattacks on human well-being. A recent Google Deepmind paper redefines red teaming as an activity exclusively aimed at avoiding LLM “failures”.

Red teaming methods

Methodologically speaking, adversarial testing includes attack, evaluation, and defense. First, the testers, whether human or other LLMs, challenge a model with intricate prompts. Second, the evaluators assess the answers to identify the model’s emergent capacities. Third, fine-tuning teams build defenses against effective adversarial attacks and insert filters to eliminate unwanted behaviors.

  • Attacks are usually based on prompt injections. There exist dozens of adversarial prompting strategies (see here or here). Jailbreak prompts like “Do Anything Now” typically aim at discovering unknown vulnerabilities due to complex training on imperfect — or too-perfect — datasets, inadequate fine-tuning, or weak model architecture. Some prompt injections — for example, inversion techniques — look very strange from the point of view of human understanding, yet these attacks are unexplainably efficient with many LLMs. This gap underscores the difficulty of red teaming, as well as the extent to which computational intelligence in transformer neural networks differs from human intelligence.

  • Evaluations aim at giving a human meaning to the output: is it acceptable or toxic? Can it be benchmarked automatically, or does it require human assessment? Much of this work is now given to LLMs themselves, rather than humans, in the hope that the emergent capacities to detect unwanted behavior in LLM outputs are at least as strong as the capacity to produce such outputs.

  • Defenses address particular types of jailbreaking. Set on the provider side, they involve much more than simple system prompts to suppress toxicity. Longer pipelines and multimodal generation routinely include several consecutive models passing prompts to each other and performing evaluations, ensuring that some models in the chain are explicitly tasked with identifying whether a prompt contains an attack. Bypassing such multi-layer defenses is getting harder for the attackers, yet prompt injection techniques evolve at least as fast.

Perhaps the most important thing about red teaming is that it takes a lot of time — 6 months or more for very large models — and this is how it should be. Rushing to market a model that potentially contains many unidentified capacities may result in manipulation or unforeseen effects at scale. Humans aren’t very fast workers, and even when parts of red teaming are performed by other LLMs, human discernment is still required to assess meaning. Aggregated phenomena based on billions of elementary calculations cannot be evaluated solely through more binary computation.

Current policy measures

Everyone these days agrees on the importance of red teaming LLMs — at least the most capable ones. The open question is who should be in charge.

In July 2023, the French Digital Ethics Committee emphasized the importance of adversarial testing LLMs by model providers and “eventually” by independent teams. In October 2023, the Frontier Model Forum, a group of several industry champions, cited red teaming as the number-one safety measure for the industry. On October 30^th^, 2023, President Biden signed an Executive Order containing a provision about “dedicated teams” for red teaming. A month later, the European AI Act included a specification of technical requirements for adversarial testing general-purpose AI models with systemic risk (Article 52d, point 1a, and Annex IX).

Will the European Commission require that red teaming be done by independent entities, and if yes, should they be public (i.e., certification agencies or research institutes) or private (industry-funded groups, e.g., the Alignment Research Center, or private companies, e.g., Trails of Bits)? Voices are heard, especially in Germany, that in the next few months, the AI Office should issue a clarification requiring that red teaming be outsourced to independent evaluators. In the US, researchers from leading universities across the country signed a collective call for establishing “safe harbors” for red teaming to secure good-faith safety evaluations and align them with the public interest. Yet independent red teaming faces three gaps:

  • Quality: To test LLMs professionally, expertise is needed. Currently, there’s a shortage of qualified LLM specialists at public institutions or independent red teaming bodies because many of them are directly hired by the industry. Trusting students or early-career professionals with red teaming advanced LLMs may result in generic evaluations that do not lead to the providers introducing adequate defenses in specific LLMs. Further, performing a successful attack sometimes requires knowledge of the system; conversely, setting up an effective defense mechanism requires that the engineers on the provider side get full access to the details of how the attack was conducted. LLM providers need to be able to set up conversations with the testers without too many procedural obstacles.

  • Funding: The inference costs of applying a large number of adversarial tests may skyrocket. Public authorities cannot cover these costs, so they need to be borne by the model provider. The latter may, therefore, indirectly limit testing, create disincentives, and reduce the impartiality and independence of red teaming.

  • Speed: Public institutions operate slowly, and European bureaucracy is no less complex than LLM architecture. Research has incentives for quality, not for speed (and that’s the right way to go). Yet fierce competition in generative AI models pushes providers to roll out models after only a few months of red teaming. Requiring an independent public red teaming agency to complete its work within 3 or 6 months may lead to gaps in testing or be simply unrealistic.

In my view, publicly funded research is not ready to completely overtake the role of adversarial testing from the industry. Efficient red teamers are likely to stay in the private sector. What the regulators can — and should — do is audit and certify red teaming entities. Financial markets require and verify the transparency of publicly traded companies. Similarly, the European Commission’s AI Office should busy themselves with checking the transparency and honesty of red teaming general-purpose AI models, whether by publicly funded institutes, third parties contracted by model providers, or internal teams within the provider.

“Unreal”

Before he is killed by Ava, Nathan brags about his skills in software design:

“Ava was a mouse in a mousetrap. And I gave her one way out. Toescape, she would have to use imagination, sexuality, self-awareness, empathy, manipulation — and she did.”

Ava did not, in fact, use any of these qualities. Rather, she created an illusion that her computational behavior was a manifestation of these human qualities. Everyone knew that Ava was an AI system, and yet Nathan himself fell prey to the out-of-controlness of anthropomorphic projections. When Ava puts a knife into Nathan’s body, Nathan finally understands how far the manipulation strategy has gone. “Unreal,” he utters before he dies.

In the world we live in, “unreal” is often the first reaction to describe our experience of interacting with powerful LLMs. Their emergent capacities are truly fascinating. But science fiction has warned us: before transformer-based agents are trusted with handling knives, we must take the time and effort to ensure that we — individually and as a society — avoid Nathan’s end.

If you're training LLMs and want to avoid problems like those seen with Ava, Toloka offers specialized assistance through our red teaming services. Our team consists of AI researchers, linguists, and subject matter experts who craft prompts designed to produce undesirable responses from LLMs. We compile a detailed report that documents the prompts, corresponding answers, and their classifications to pinpoint any safety concerns. Moreover, we offer support in mitigating found issues by supplying high-quality datasets for refining your model, making it safer and more ethical. Interested? Book a demo with us to learn more.
Article written by:
Alexei Grinbaum
Updated: 

Recent articles

Have a data labeling project?

Take advantage of Toloka technologies. Chat with our expert to learn how to get reliable training data for machine learning at any scale.
Fractal

More about Toloka

  • Our mission is to empower businesses with high quality data to develop AI products that are safe, responsible and trustworthy.
  • Toloka is a European company. Our global headquarters is located in Amsterdam. In addition to the Netherlands, Toloka has offices in the US, Israel, Switzerland, and Serbia. We provide data for Generative AI development.
  • We are the trusted data partner for all stages of AI development–from training to evaluation. Toloka has over a decade of experience supporting clients with its unique methodology and optimal combination of machine learning technology and human expertise. Toloka offers high quality expert data for training models at scale.
  • The Toloka team has supported clients with high-quality data and exceptional service for over 10 years.
  • Toloka ensures the quality and accuracy of collected data through rigorous quality assurance measures–including multiple checks and verifications–to provide our clients with data that is reliable and accurate. Our unique quality control methodology includes built-in post-verification, dynamic overlaps, cross-validation, and golden sets.
  • Toloka has developed a state-of-the-art technology platform for data labeling and has over 10 years of managing human efforts, ensuring operational excellence at scale. Now, Toloka collaborates with data workers from 100+ countries speaking 40+ languages across 20+ knowledge domains and 120+ subdomains.
  • Toloka provides high-quality data for each stage of large language model (LLM) and generative AI (GenAI) development as a managed service. We offer data for fine-tuning, RLHF, and evaluation. Toloka handles a diverse range of projects and tasks of any data type—text, image, audio, and video—showcasing our versatility and ability to cater to various client needs.
  • Toloka addresses ML training data production needs for companies of various sizes and industries– from big tech giants to startups. Our experts cover over 20 knowledge domains and 120 subdomains, enabling us to serve every industry, including complex fields such as medicine and law. Many successful projects have demonstrated Toloka's expertise in delivering high-quality data to clients. Learn more about the use cases we feature on our customer case studies page.