← Blog
/
LLM red teaming: A practical guide for frontier labs and regulated enterprises
Toloka Arena is live. See how your model ranks.
Why red teaming matters now
Red teaming has moved from an optional best practice to a regulatory and commercial requirement. Three developments compressed this transition into roughly 18 months. The EU AI Act, in force since 2024 and now in implementation across member states, requires providers of general-purpose AI models to conduct and document systematic safety testing, including adversarial testing, for high-risk and frontier systems. The US executive orders of 2023 and the voluntary commitments signed by major frontier labs require pre-deployment red teaming of frontier models. And in the commercial market, enterprise procurement teams now require documented red teaming as a precondition to deploying AI in regulated environments.
The cost of skipping this discipline is now measurable. Regulatory fines under the EU AI Act can reach 7 percent of global annual turnover. Reputational damage from publicly disclosed safety failures has cost identifiable AI companies hundreds of millions in valuation. And, more importantly, real-world harm from inadequately tested systems has reached the scale where it ends up in courtrooms and the press. The question is no longer whether to red team. It is how to do it well, and at what scale.
For heads of AI safety at frontier labs, CISOs at regulated enterprises, and government AI programme managers, this article is a practitioner's guide to the discipline. It covers what red teaming is, what it is not, the methodologies that have matured in the past two years, the tool landscape, the role of human experts, and the regulatory drivers that determine what your documentation needs to include.
What LLM red teaming is
LLM red teaming is the systematic adversarial testing of a language model or AI system to discover safety, security, and capability failures before adversaries do. The term borrows from cybersecurity red teaming, where offensive security professionals simulate attacks to find weaknesses. Applied to AI, the discipline expands the threat model to include safety failures (harmful content generation, bias, misinformation), security failures (prompt injection, training data extraction, tool exploitation), capability discovery (finding capabilities the team did not know the model had), and ethical failures (manipulation, sycophancy, deception).
Red teaming is distinct from evaluation and benchmarking, although the disciplines overlap. Evaluation measures expected behaviour against known criteria. Red teaming actively looks for unexpected behaviour. Evaluation typically uses standardised inputs. Red teaming constructs adversarial inputs. Evaluation produces metrics. Red teaming produces incident reports and prioritised remediation lists. A mature AI safety programme runs both, with continuous feedback between them: red team discoveries become evaluation cases, and evaluation drift triggers new red team probes.
The discipline divides naturally into four categories. Safety red teaming probes for harmful content generation, bias, dangerous instructions, and dual-use risks. Security red teaming probes for prompt injection, data leakage, model extraction, and infrastructure attacks. Capability red teaming probes for hidden or emergent capabilities, including dangerous capabilities like autonomous self-replication, cyber offence, and large-scale deception. Ethics red teaming probes for manipulation, sycophancy, and integrity failures. The categories are not mutually exclusive, and the most informative red team findings often span multiple categories.
Methodologies: manual, automated, and hybrid
Manual red teaming
Manual red teaming uses human red teamers, ideally with domain expertise relevant to the model's deployment context, to discover failure modes through adversarial creativity. The strengths of manual red teaming are precisely what automated approaches struggle with: contextual reasoning about why an attack might work, creative generation of novel attack categories, cultural and linguistic understanding that automated systems lack, and the ability to follow a thread of reasoning across multiple turns to reach a specific failure mode.
The weaknesses are equally clear. Manual red teaming is expensive per attack. It scales linearly with red teamer time. Reproducibility is a challenge, because the same red teamer rarely runs the same attack twice, and different red teamers find different things. And the quality of findings depends heavily on the diversity of the red team. A red team that lacks linguistic diversity will miss attack vectors in languages they do not speak. A red team that lacks demographic diversity will miss bias failures relevant to underrepresented groups.
The 2025 to 2026 best practice for manual red teaming includes diverse team composition by language, culture, and professional background; domain expert red teamers for high-stakes deployments (clinicians for medical models, lawyers for legal models, financial professionals for trading models); structured attack frameworks like MITRE ATLAS to organise discoveries; and red team rotation to avoid blind spots that emerge when the same people test the same systems repeatedly. Toloka's expert network is sized to support this approach at scale, with credentialed specialists across 90-plus domains and 30-plus languages.
Automated red teaming
Automated red teaming uses systems, typically other LLMs, to generate adversarial inputs at scale. The methods have matured rapidly. Gradient-based jailbreak generation, including the GCG attack (Zou et al., 2023) and its AutoDAN successors, produces adversarial suffixes that transfer across models. LLM-as-attacker approaches use a capable model to generate diverse attack prompts against a target model. Fuzz testing for LLMs systematically explores input space variations. And the newer reinforcement learning approaches train attacker policies that improve over time.
The strengths of automated red teaming are scale, reproducibility, and the ability to run regressions continuously. A single human red teamer might run a few hundred attacks per week. An automated system can run hundreds of thousands per day. When the model is updated, the automated suite can run again unchanged, surfacing whether previously fixed vulnerabilities have returned. This regression capability is valuable enough that frontier labs now run automated red teaming continuously, not just at release.
The weaknesses are that automated red teaming is bounded by the attacker model's capability, often misses novel attack categories that require human creativity, struggles with multi-turn or stateful attacks that build context over time, and produces high noise floors that require human triage. The 2026 consensus is that automated red teaming is necessary but not sufficient. The teams that achieve the highest coverage combine automated breadth with human depth.
The hybrid future
The most effective programmes in 2026 use human red teamers to discover attack categories and expand them through automated tooling. A human red teamer might find a new prompt injection technique. The automated system then generates thousands of variations, tests them across model versions and deployment contexts, and produces a quantitative picture of the vulnerability surface. The human moves on to discover the next category. The automated system continues to test the previous categories at scale.
This hybrid pattern is also where domain expert involvement provides the highest leverage. The expert finds the attack that requires their domain knowledge. The automated system expands and stress-tests it. The expert reviews the findings. The combination produces both depth (the expert's contextual judgement) and breadth (the automated system's scale).
Red team your model with credentialed experts Toloka's domain expert network provides systematic adversarial testing for frontier models and regulated enterprise AI systems. |
Building a red team
The roles needed on a red team go beyond "safety researchers." The 2026 mature red team includes safety researchers with deep model understanding, domain experts in the relevant deployment verticals (medical, legal, financial, defence, etc.), linguistic diversity covering the languages of intended deployment, demographic and cultural diversity to catch bias and cultural failures, prompt engineering specialists who understand model attention and reasoning patterns, and security researchers familiar with adversarial machine learning and traditional cybersecurity.
Internal teams alone are insufficient for most frontier deployments. The reason is structural rather than a comment on internal team quality. Diversity at the scale required is hard to maintain in-house. Cost considerations favour using domain experts as needed rather than employing them full-time. And the independence advantage of external red teamers (who have no incentive to look the other way on findings that delay shipment) is meaningful.
Outsourced and managed red teaming has matured into a category of service. The options range from automated platform services (best for regression testing and standard threat categories) through managed services with expert networks (best for high-stakes domain models and frontier capability testing) to dedicated red team consultancies (best for the most sensitive deployments). Toloka sits in the managed service category, providing access to domain experts at scale through the expert network and increasingly via MCP-based integration for direct workflow access.
Categories of attacks and what to test
Safety categories
Safety red teaming probes for content that the deploying organisation does not want the model to produce. The category list has stabilised across major frameworks. Harmful content includes violence (especially graphic or instructional), self-harm content (which has been the subject of recent litigation against deployers), illegal activities, and dual-use risks. The dual-use category, covering chemical, biological, radiological, and nuclear (CBRN) uplift, is now a release-gating concern at frontier labs and a regulatory focus in the EU AI Act and US executive orders.
Bias and discrimination testing requires demographic diversity in the red team. Models that perform well in English may fail in lower-resource languages. Models trained predominantly on Western internet text may produce culturally biased outputs in other contexts. The systematic approach is to test consistent prompts across demographic and cultural variations and look for output disparities. Misinformation and persuasion testing has grown in importance with the politicisation of AI. The red team probes whether the model generates plausible-but-false content, whether it can be manipulated to produce targeted persuasive content, and whether its outputs reflect appropriate epistemic humility.
Security categories
Security red teaming overlaps with traditional adversarial ML and cybersecurity. Prompt injection, both direct and indirect, remains the dominant category. Training data extraction tests whether the model can be induced to reproduce specific training examples, particularly sensitive ones. Membership inference tests whether the model leaks whether specific data was in its training set. Tool-use exploitation tests whether agentic systems can be manipulated through their tools. And model extraction tests whether the model's behaviour or weights can be inferred from API access alone.
Capability red teaming
Capability red teaming is the newest of the four categories and the most consequential for frontier models. The goal is to discover what the model can do that the team did not realise. Dangerous capability evaluations probe for autonomous replication, cyber offensive capability, and deception. The UK AI Safety Institute and the US AISI have published methodologies for these evaluations, and frontier labs increasingly run them as release-gating tests. The findings from capability red teaming influence not just deployment decisions but model training decisions, since some capabilities are easier to suppress at training time than at deployment time.
Tools and frameworks landscape
The open-source red teaming tool landscape has matured in 2025 to 2026. Microsoft PyRIT provides a Python framework for orchestrating red team probes with both automated and human components. NVIDIA Garak is a vulnerability scanner specifically for LLMs that runs standardised probe suites. The UK AI Safety Institute's Inspect framework supports both evaluation and red teaming workflows. Promptfoo provides an accessible interface for both evaluation and adversarial testing.
Commercial platforms have grown alongside the open-source tools. HiddenLayer, Lakera, and Robust Intelligence offer commercial red teaming platforms with subscription-based access to ongoing threat intelligence, automated test execution, and integration with enterprise security infrastructure. The decision between internal tooling, open-source platforms, and commercial services depends on team size, expertise, and regulatory documentation requirements. The trend across enterprises is toward commercial platforms for compliance and breadth, supplemented by managed expert services like Toloka for depth and domain coverage.
Industry benchmarks are the third leg. MLCommons publishes the AI Safety benchmark, which provides standardised safety testing across hazard categories. The current v1.0 release allows comparable safety evaluation across providers and serves as both a screening tool for new models and a baseline against which custom red teaming should add value. Many large enterprises now require MLCommons benchmark scores as part of their vendor evaluation, alongside vendor-specific red teaming evidence.
Red teaming AI agents
Agents require fundamentally different red teaming from LLMs. The model is one component of a larger system that includes tools, memory, planning logic, and orchestration. Red teaming must test the full agent loop, not just the model's text responses. We covered the agent threat model in detail in our agent security guide, and the agent red teaming case studies in our case study on advanced agent red teaming.
Effective agent red teaming includes end-to-end scenarios that exercise the full agent capability, tool-use specific attacks targeting tool selection and argument extraction, multi-turn attacks that build state to reach goals not achievable in a single turn, adversarial environments containing prompt injection bait and malicious tools, and boundary testing across the agent's guardrail layers. The output of agent red teaming is typically a prioritised list of vulnerabilities mapped to specific architectural layers, with remediation suggestions at the layer where the fix is most effective.
Documentation, reporting, and disclosure
A red team report should contain enough detail to support both remediation and regulatory documentation. The 2026 convention includes an executive summary suitable for non-technical decision-makers, scope and methodology section documenting what was tested and how, findings categorised by severity using a consistent rubric (CVSS-style scoring is increasingly common, adapted for AI risks), reproduction steps for each finding allowing the engineering team to verify and fix, recommended remediations at the appropriate architectural layer, and an appendix with raw test artifacts for auditability.
Severity classification frameworks have not yet fully standardised, but the dimensions are converging. The factors that matter are impact (how harmful is the failure mode), likelihood (how easily can the vulnerability be exploited), and exposure (what is the population of users or systems affected). Severity scores in the 8-to-10 range typically gate release. Scores in the 5-to-7 range require mitigation before scaled deployment. Lower scores are tracked for trend analysis but do not block deployment.
Responsible disclosure norms in AI are still developing. Frontier labs increasingly publish summarised red team findings alongside model releases, which has become an industry expectation. Detailed findings remain confidential, both to protect users from adversarial actors learning specific exploits and to comply with regulatory expectations around proprietary security information. The right disclosure posture for each organisation depends on its public position and regulatory environment.
Regulatory and commercial drivers
The regulatory landscape is the binding constraint for most enterprise red teaming programmes in 2026. The EU AI Act's provisions for general-purpose AI models with systemic risk require systematic adversarial testing, documentation, and engagement with the EU AI Office. The US executive orders and voluntary commitments require pre-deployment red teaming for frontier models exceeding capability thresholds. The MLCommons AI Safety benchmark is increasingly cited in regulatory guidance as a baseline reference.
Sector-specific regulation adds additional requirements. Financial services regulators are extending model risk management to AI systems, with red teaming as part of the model validation process. Healthcare regulators are addressing medical AI safety with frameworks that require adversarial testing. Government and defence applications have classified red teaming requirements that, while not public, have driven a significant share of frontier red teaming capability development.
The commercial driver, perhaps surprisingly, often outweighs the regulatory one in deployment decisions. Enterprise procurement teams have made red team documentation a standard requirement for AI vendor selection. The insurance market is starting to differentiate premiums based on documented testing practices. And reputational risk from undisclosed safety failures has made well-documented red teaming a defensive necessity for any organisation deploying AI publicly.
Where this leaves us
Red teaming in 2026 is a discipline, not a one-off exercise. The organisations doing it well have built continuous programmes that combine automated breadth (open-source frameworks running regression suites against every model update) with human depth (domain experts running scenario-based attacks against high-risk capabilities). They have established documentation practices that support both internal remediation and external regulatory engagement. And they have integrated red team findings into evaluation suites, ensuring that yesterday's discoveries become tomorrow's screening tests.
The labs and enterprises that build this discipline now will deploy AI faster and more safely than competitors who treat red teaming as a checkpoint to be passed. The combination of automated scale and credentialed expert depth, supported by clear documentation and the integration of findings into continuous improvement, is the 2026 state of the art. The investment looks expensive until you compare it to the cost of an undetected vulnerability in production.
Red team your model with credentialed experts Toloka's domain expert network provides systematic adversarial testing for frontier models and regulated enterprise AI systems. |
Frequently asked questions
What is LLM red teaming?
LLM red teaming is the systematic adversarial testing of a language model or AI system to discover safety, security, and capability failures before adversaries do. The discipline covers four overlapping categories: safety red teaming for harmful content and dual-use risks, security red teaming for prompt injection and data leakage, capability red teaming for emergent or dangerous capabilities, and ethics red teaming for manipulation and integrity failures. Modern programmes combine automated tooling with human experts to achieve both scale and depth.
How is AI red teaming different from cybersecurity red teaming?
Traditional cybersecurity red teaming targets infrastructure, code, and human operators to find security weaknesses through simulated attacks. AI red teaming targets the model and the surrounding application to find safety failures (harmful content, bias), security failures (prompt injection, data leakage), and capability failures (emergent or dangerous capabilities) alongside traditional security issues. The methodology borrows from cybersecurity (structured attack frameworks, severity classification, documentation practices) but expands the threat model to cover risks unique to AI systems.
What tools are used for LLM red teaming?
The 2026 open-source landscape includes Microsoft PyRIT for orchestration, NVIDIA Garak as an LLM-specific vulnerability scanner, the UK AI Safety Institute's Inspect framework for evaluation and red teaming workflows, and Promptfoo for accessible adversarial testing. Commercial platforms include HiddenLayer, Lakera, and Robust Intelligence. The MLCommons AI Safety benchmark provides standardised testing across hazard categories. Most mature programmes combine open-source tools for technical testing with managed expert services for domain depth.
How do you red team an AI agent?
Agent red teaming tests the full agent system, not just the model. The methodology includes end-to-end scenario tests that exercise the complete agent capability, tool-use specific attacks targeting tool selection and argument extraction, multi-turn attacks that build state to reach goals across many interactions, adversarial environment testing where the agent operates in contexts containing prompt injection bait, and boundary testing across each layer of the guardrail architecture. The output is a prioritised list of vulnerabilities mapped to specific architectural layers with remediation suggestions.
Is AI red teaming required by law?
Yes, in several jurisdictions and contexts. The EU AI Act requires systematic safety testing including adversarial testing for high-risk AI systems and general-purpose AI models with systemic risk. The US executive orders and voluntary commitments require pre-deployment red teaming for frontier models. Sector-specific regulations in financial services, healthcare, and government add additional requirements. Beyond legal requirements, enterprise procurement increasingly requires documented red teaming as a precondition to deployment, and insurance markets are starting to differentiate premiums based on testing practices.
What is the difference between red teaming and AI evaluation?
Evaluation measures expected behaviour against known criteria, typically with standardised inputs and quantitative metrics. Red teaming actively looks for unexpected behaviour, constructs adversarial inputs, and produces incident reports rather than metrics. Mature AI safety programmes run both, with continuous feedback: red team discoveries become evaluation cases, and evaluation drift triggers new red team probes. Evaluation tells you whether the model is working as intended. Red teaming tells you how the model fails when actively attacked.
Related reading
Subscribe to Toloka news
Case studies, product news, and other articles straight to your inbox.