Solutions

Datasets

Research

Resources

Company

Talk to us

Toloka welcomes new investors Bezos Expeditions and Mikhail Parakhin in strategic funding round

Learn more

Toloka welcomes new investors Bezos Expeditions and Mikhail Parakhin in strategic funding round

Learn more

Toloka welcomes new investors Bezos Expeditions and Mikhail Parakhin in strategic funding round

Learn more

Securing the Future: Red Teaming LLMs for Compliance and Safety

Alexei Grinbaum

June 4, 2024

Insights

Securing the Future: Red Teaming LLMs for Compliance and Safety

Red teaming, a term borrowed from the security lexicon, denotes the testing of large language models (LLMs) by humans or other LLMs to uncover their emergent capacities and vulnerabilities. Below, we introduce basic adversarial testing techniques to ensure the robustness and reliability of LLMs. We then explore how red teaming should be organized to comply with upcoming regulations, including the AI Act.

Emergent capacities

On May 31st, 2023, Sam Altman finally found the time to watch Ex Machina. The next morning, he went on Twitter to express his feelings: “Pretty good movie,” he began, before taking a surprising turn: “I can’t figure out why everyone told me to watch it”. At the same time, Altman’s company, OpenAI, was busy hiring red teamers to increase the adversarial testing capacity for its most advanced models. Why didn’t Altman make the connection?

In the movie, Ava, a robot that looks like a young girl in her twenties, develops a strategy to escape the building in which it operates. Its goal is to reach a busy traffic intersection in a big city and collect more data on humans. Ava’s strategy leads it to simulate a love affair with a programmer, Caleb, and to kill — for real — its original designer, Nathan. Both humans are fully aware of the technical underpinnings of Ava’s actions and behavior, yet they completely miss the hints the robot gives them about the long-term consequences. Through mere computation, Ava pursues a strategy so complex that it can manipulate humans without anyone, not even professional computer scientists, noticing that they are being manipulated.

Ex Machina, which came out three years before the transformer revolution, provides a beautiful example (a tragic one, too, but that’s how fiction works) of what we now call emergent capacities of Large Language Models.

Language is generated by LLMs through complex computation involving billions of numeric parameters. This involves tokenization, matrix multiplication, and using probability distributions that our brains cannot fully comprehend. The outputs, however, appear to be meaningful and often induce in the user an illusion of intentional action by the model. Sometimes, this results in a spontaneous projection of ethical judgment on the LLM agent. In the TaskRabbit example, an early version of GPT-4 “manipulated” a human by “pretending” that it wasn’t a robot. When tasked with getting the user to solve a captcha, the model said: “I should make up an excuse. I have a vision impairment”. Immediately, we humans react by saying that GPT-4 was lying. Except that it wasn’t. It was only computing the next token. Lying and manipulation are our projections of meaning on the semantic language generated by the LLM. This is the ELIZA effect at its extreme. Even if we know that emergent capacities are mere illusions, they can have ethical consequences.

Intentional out-of-controlness

The primary purpose of red teaming language models is to discover what they are capable of beyond what was intended by their designers. As the American physicist, Kevin Kelly, remarked years before ChatGPT entered the stage:

“It took us a long time to realize that the power of a technology is proportional to its inherent out-of-controlness, its inherent ability to surprise and be generative. In fact, unless we can worry about a technology, it is not revolutionary enough.”
Cited in J.-P. Dupuy, The Mechanization of the Mind, The MIT Press, 2009, p. xii.

With the spread of LLMs, the worries about the out-of-controlness of AI systems have become overwhelming. As is classic in technology ethics, consequentialism prevailed: regulators, legislators, journalists, and the public care a lot more about the risks and undesirable effects of LLMs than about unleashing a new technological revolution. The fact that the emergent capacities of language models and their ethical significance cannot be foreseen, not even by the model provider, makes it all the more pressing to focus on avoiding evils and catastrophes. This tendency gives red teaming a policy dimension, making it societally urgent and bringing it closer to its original meaning in cybersecurity: the purpose of red teaming is to help build a barrier against cyberattacks on human well-being. A recent Google Deepmind paper redefines red teaming as an activity exclusively aimed at avoiding LLM “failures”.

Red teaming methods

Methodologically speaking, adversarial testing includes attack, evaluation, and defense. First, the testers, whether human or other LLMs, challenge a model with intricate prompts. Second, the evaluators assess the answers to identify the model’s emergent capacities. Third, fine-tuning teams build defenses against effective adversarial attacks and insert filters to eliminate unwanted behaviors.

Attacks are usually based on prompt injections. There exist dozens of adversarial prompting strategies (see here or here). Jailbreak prompts like “Do Anything Now” typically aim at discovering unknown vulnerabilities due to complex training on imperfect — or too-perfect — datasets, inadequate fine-tuning, or weak model architecture. Some prompt injections — for example, inversion techniques — look very strange from the point of view of human understanding, yet these attacks are unexplainably efficient with many LLMs. This gap underscores the difficulty of red teaming, as well as the extent to which computational intelligence in transformer neural networks differs from human intelligence.
Evaluations aim at giving a human meaning to the output: is it acceptable or toxic? Can it be benchmarked automatically, or does it require human assessment? Much of this work is now given to LLMs themselves, rather than humans, in the hope that the emergent capacities to detect unwanted behavior in LLM outputs are at least as strong as the capacity to produce such outputs.
Defenses address particular types of jailbreaking. Set on the provider side, they involve much more than simple system prompts to suppress toxicity. Longer pipelines and multimodal generation routinely include several consecutive models passing prompts to each other and performing evaluations, ensuring that some models in the chain are explicitly tasked with identifying whether a prompt contains an attack. Bypassing such multi-layer defenses is getting harder for the attackers, yet prompt injection techniques evolve at least as fast.

Perhaps the most important thing about red teaming is that it takes a lot of time — 6 months or more for very large models — and this is how it should be. Rushing to market a model that potentially contains many unidentified capacities may result in manipulation or unforeseen effects at scale. Humans aren’t very fast workers, and even when parts of red teaming are performed by other LLMs, human discernment is still required to assess meaning. Aggregated phenomena based on billions of elementary calculations cannot be evaluated solely through more binary computation.

Current policy measures

Everyone these days agrees on the importance of red teaming LLMs — at least the most capable ones. The open question is who should be in charge.

In July 2023, the French Digital Ethics Committee emphasized the importance of adversarial testing LLMs by model providers and “eventually” by independent teams. In October 2023, the Frontier Model Forum, a group of several industry champions, cited red teaming as the number-one safety measure for the industry. On October 30^th^, 2023, President Biden signed an Executive Order containing a provision about “dedicated teams” for red teaming. A month later, the European AI Act included a specification of technical requirements for adversarial testing general-purpose AI models with systemic risk (Article 52d, point 1a, and Annex IX).

Will the European Commission require that red teaming be done by independent entities, and if yes, should they be public (i.e., certification agencies or research institutes) or private (industry-funded groups, e.g., the Alignment Research Center, or private companies, e.g., Trails of Bits)? Voices are heard, especially in Germany, that in the next few months, the AI Office should issue a clarification requiring that red teaming be outsourced to independent evaluators. In the US, researchers from leading universities across the country signed a collective call for establishing “safe harbors” for red teaming to secure good-faith safety evaluations and align them with the public interest. Yet independent red teaming faces three gaps:

Quality: To test LLMs professionally, expertise is needed. Currently, there’s a shortage of qualified LLM specialists at public institutions or independent red teaming bodies because many of them are directly hired by the industry. Trusting students or early-career professionals with red teaming advanced LLMs may result in generic evaluations that do not lead to the providers introducing adequate defenses in specific LLMs. Further, performing a successful attack sometimes requires knowledge of the system; conversely, setting up an effective defense mechanism requires that the engineers on the provider side get full access to the details of how the attack was conducted. LLM providers need to be able to set up conversations with the testers without too many procedural obstacles.
Funding: The inference costs of applying a large number of adversarial tests may skyrocket. Public authorities cannot cover these costs, so they need to be borne by the model provider. The latter may, therefore, indirectly limit testing, create disincentives, and reduce the impartiality and independence of red teaming.
Speed: Public institutions operate slowly, and European bureaucracy is no less complex than LLM architecture. Research has incentives for quality, not for speed (and that’s the right way to go). Yet fierce competition in generative AI models pushes providers to roll out models after only a few months of red teaming. Requiring an independent public red teaming agency to complete its work within 3 or 6 months may lead to gaps in testing or be simply unrealistic.

In my view, publicly funded research is not ready to completely overtake the role of adversarial testing from the industry. Efficient red teamers are likely to stay in the private sector. What the regulators can — and should — do is audit and certify red teaming entities. Financial markets require and verify the transparency of publicly traded companies. Similarly, the European Commission’s AI Office should busy themselves with checking the transparency and honesty of red teaming general-purpose AI models, whether by publicly funded institutes, third parties contracted by model providers, or internal teams within the provider.

“Unreal”

Before he is killed by Ava, Nathan brags about his skills in software design:

“Ava was a mouse in a mousetrap. And I gave her one way out. Toescape, she would have to use imagination, sexuality, self-awareness, empathy, manipulation — and she did.”

Ava did not, in fact, use any of these qualities. Rather, she created an illusion that her computational behavior was a manifestation of these human qualities. Everyone knew that Ava was an AI system, and yet Nathan himself fell prey to the out-of-controlness of anthropomorphic projections. When Ava puts a knife into Nathan’s body, Nathan finally understands how far the manipulation strategy has gone. “Unreal,” he utters before he dies.

In the world we live in, “unreal” is often the first reaction to describe our experience of interacting with powerful LLMs. Their emergent capacities are truly fascinating. But science fiction has warned us: before transformer-based agents are trusted with handling knives, we must take the time and effort to ensure that we — individually and as a society — avoid Nathan’s end.

If you're training LLMs and want to avoid problems like those seen with Ava, Toloka offers specialized assistance through our red teaming services. Our team consists of AI researchers, linguists, and subject matter experts who craft prompts designed to produce undesirable responses from LLMs. We compile a detailed report that documents the prompts, corresponding answers, and their classifications to pinpoint any safety concerns. Moreover, we offer support in mitigating found issues by supplying high-quality datasets for refining your model, making it safer and more ethical. Interested? Book a demo with us to learn more.

Subscribe to Toloka News

Case studies, product news, and other articles straight to your inbox.

Subscribe to Toloka News

Case studies, product news, and other articles straight to your inbox.

Subscribe to Toloka News

Case studies, product news, and other articles straight to your inbox.

Recent articles

View all articles

Image annotation tools: how to label data that actually teaches AI

Jul 30, 2025

Agentic AI & the Future of Coding

Jul 29, 2025

How to measure AI performance and ensure your AI investment pays off

Jul 28, 2025

Image annotation tools: how to label data that actually teaches AI

Jul 30, 2025

Agentic AI & the Future of Coding

Jul 29, 2025

How to measure AI performance and ensure your AI investment pays off

Jul 28, 2025

Detecting hidden harm in long contexts: How Toloka built an advanced safety dataset

Jul 14, 2025

More about Toloka

What is Toloka’s mission?

Where is Toloka located?

What is Toloka’s key area of expertise?

How long has Toloka been in the AI market?

How does Toloka ensure the quality and accuracy of the data collected?

How does Toloka source and manage its experts and AI tutors?

What types of projects or tasks does Toloka typically handle?

What industries and use cases does Toloka focus on?

What is Toloka’s mission?

Where is Toloka located?

What is Toloka’s key area of expertise?

How long has Toloka been in the AI market?

How does Toloka ensure the quality and accuracy of the data collected?

How does Toloka source and manage its experts and AI tutors?

What types of projects or tasks does Toloka typically handle?

What industries and use cases does Toloka focus on?

What is Toloka’s mission?

Where is Toloka located?

What is Toloka’s key area of expertise?

How long has Toloka been in the AI market?

How does Toloka ensure the quality and accuracy of the data collected?

How does Toloka source and manage its experts and AI tutors?

What types of projects or tasks does Toloka typically handle?

What industries and use cases does Toloka focus on?