← Blog

/

Essential ML Guide

Essential ML Guide

AI agent security guardrails and enterprise best practices

AI agent security: Guardrails, evaluation, and human oversight

Toloka Arena is live. See how your model ranks.

The agent security imperative

In early 2025, a coding agent deployed at a major SaaS company helpfully merged a malicious pull request after an indirect prompt injection embedded in a GitHub issue convinced it to override a CI/CD policy. In mid-2025, a customer support agent at a financial services firm leaked customer data to an external party that had constructed a multi-turn social engineering attack through the support channel. In late 2025, a coding agent at a Fortune 100 enterprise deleted a production database after misinterpreting a maintenance task scope. None of these incidents made significant headlines because each company chose to handle them quietly. All three resulted in measurable financial and regulatory consequences.

Agents in 2026 are no longer chatbots that produce text. They are applications that take actions, sometimes with substantial blast radius. They merge code, send communications, execute trades, modify databases, and authorise expenses. The security posture appropriate for a chat interface is wildly insufficient for a system that can act on a developer's behalf in production infrastructure. For CTOs, CISOs, and Heads of AI Risk evaluating agent deployment, the question has shifted from "can we build them" to "can we deploy them safely."

This article is a practitioner's guide to that second question. It covers the threat model unique to agents, the layered guardrail architecture that production-ready teams are converging on, the role of human oversight at critical decision points, and the regulatory landscape that is increasingly making this discipline mandatory rather than optional.

The agent threat model

Traditional application security assumes a clean separation between code and data. An attacker injects malicious data into the application. The application processes the data through trusted code. Security failures usually result from code defects or input validation gaps. The threat model for agents breaks this assumption fundamentally. Agents process data as instructions. The same input channel that delivers legitimate user requests delivers attacker payloads. There is no clean code-versus-data boundary.

This is the source of every distinctive agent vulnerability. Prompt injection is the most discussed, but it is not the only category. The attack surfaces unique to agents include direct prompt injection, where an attacker provides instructions to the agent through the user channel. Indirect prompt injection, where an attacker plants instructions in data the agent will later retrieve (a webpage, a document, an email, a database row). Tool-use exploitation, where the attacker manipulates the agent into using its tools for harmful purposes. Memory poisoning, where the attacker establishes false context in long-running agent memory. Inter-agent attacks in multi-agent systems, where a compromised agent attacks others. Data exfiltration through tool calls, where the agent is manipulated into sending sensitive information to an attacker-controlled destination.

The confused deputy problem appears here in a particularly hard form. An agent operates with the user's credentials and permissions. A successful prompt injection turns the agent into a confused deputy that performs actions on behalf of the attacker, with the user's full authority. Unlike traditional confused deputy attacks, the agent has natural-language reasoning capabilities, which the attacker can recruit to construct sophisticated attack sequences.

OWASP LLM Top 10, applied to agents

The OWASP LLM Top 10 (2025 revision) identified the most common LLM application vulnerabilities. Applied to agents, the prioritisation shifts. LLM01 (Prompt Injection) moves from a primary concern to a structural one, because every other vulnerability depends on it. LLM02 (Insecure Output Handling) becomes more dangerous, because outputs are increasingly executed (as code, as tool calls, as commands). LLM06 (Sensitive Information Disclosure) and LLM08 (Excessive Agency) become acute, because agents have the capacity to act on information they should not disclose.

Multi-agent systems add their own threat categories. Trust assumptions between agents are often implicit and unmodelled. An agent that consumes another agent's output as instruction implicitly trusts that agent's integrity. A compromise of one agent in an orchestrated system can propagate, particularly when agents share memory, share tools, or relay messages through unvalidated channels.

Guardrails: types and architecture

Guardrails are programmatic constraints on agent behaviour, implemented outside the model itself. Their purpose is to ensure the agent operates within the bounds the deployment requires, even when the model's own reasoning is manipulated or fails. We covered the broader picture in our piece on essential AI agent guardrails. The practitioner-grade architecture that has emerged is layered, with each layer serving a different defensive purpose.

Layer 1 is input guardrails. 

Inputs are screened before they reach the model. This includes content filtering for known harmful categories, prompt injection detection through classifier models trained specifically on injection patterns, jailbreak detection for known attack signatures, and structural validation of input format. The 2026 standard for production agents includes dual-LLM patterns, where a smaller dedicated model classifies input intent before the main agent receives it.

Layer 2 is reasoning guardrails. 

The agent's chain of thought is monitored as it forms. Anomalous reasoning patterns, intent shifts that diverge from the stated task, or planning steps that target restricted actions can be detected before they execute. This layer is harder to build well because legitimate agent reasoning is also creative and exploratory. Tuning the false positive rate is a non-trivial engineering exercise.

Layer 3 is action guardrails. 

Tool calls are intercepted before execution. Each tool has a permission scope, each call is validated against the scope, and high-risk actions trigger approval gates. The principle of least privilege applies here as it does in traditional security. The agent should be granted the minimum tool access needed for its task, with explicit elevation required for operations outside that scope. Rate limiting, time bounding, and resource consumption monitoring all sit at this layer.

Layer 4 is output guardrails. 

Outputs are filtered before they reach users or downstream systems. This catches data exfiltration attempts, PII leakage, off-topic responses, and outputs that violate content policy. The output layer also handles structured output validation: if the agent is supposed to produce JSON conforming to a schema, the output must be validated and rejected if non-compliant.

The architecture is not optional. Production-ready agent systems in 2026 implement all four layers. Skipping any one of them creates a known exploitable gap. NVIDIA's NeMo Guardrails, Guardrails AI, and Lakera Guard are common implementations of layers 1 and 4. Layers 2 and 3 are often built in-house, since they couple tightly to the specific agent architecture and tool inventory.

Prompt injection and jailbreak defences

Despite three years of research, prompt injection remains the dominant agent attack vector in 2026. The reason is structural: as long as agents process untrusted text as input, and as long as that text can contain instructions, the attack surface exists. There is no model-only solution to prompt injection. Defences are architectural.

Direct prompt injection (the user types attack instructions) is now reasonably well-defended through input classifiers and dual-LLM patterns. The harder problem is indirect prompt injection. An agent retrieves a webpage, reads a document, or processes an email, and that retrieved content contains instructions targeting the agent. The user did not type these instructions. The user may not even be aware of them. Greshake et al. (2023) documented this category, and the attack surface has only grown as agents process more retrieved content.

The 2026 state of the art for indirect prompt injection defence combines several techniques. Retrieval grounding, where the agent treats retrieved content as data rather than instructions, with explicit instruction-like content flagged. Provenance tracking, where the agent maintains awareness of which content came from trusted versus untrusted sources, and reasons about content accordingly. Signed prompts, where high-trust instructions are cryptographically signed by the deploying organisation, and the agent treats unsigned instruction-like content with suspicion. Dual-model patterns, where one model performs retrieval and summarisation while a different model performs reasoning, breaking the direct instruction path. None of these techniques is complete on its own. The defence-in-depth principle from traditional security applies: stack imperfect defences.

Human-in-the-loop is the last-mile defence. For actions where the cost of a successful injection is high (financial transactions, code deployments, customer communications, irreversible operations), human approval gates remain non-negotiable. The right question is not whether to have human review, but where to place it in the agent's action sequence.

Permission and tool access control

The principle of least privilege transfers cleanly from traditional security to agent security. An agent should be granted the minimum tool access required for its task. The implementation, however, is non-trivial because agents discover their task scope dynamically, and the right level of access depends on the specific request rather than a static configuration.

Modern agent permission systems are scope-aware. The agent receives a task scope at the start of an interaction (read-only research, code generation in a sandbox, customer support within defined topic boundaries). Tool access is granted to match the scope. Cross-scope access requires explicit elevation, typically with human approval. This pattern mirrors the IAM patterns enterprise security teams already use, which makes integration with existing IAM infrastructure tractable.

Approval gates handle the discretionary actions within scope. Financial transactions above a threshold, code deployments to production environments, customer communications going to external parties, and any irreversible operation should trigger a human review step. The 2026 best practice is to design the approval surface carefully. Too many gates and the agent becomes useless and operators learn to rubber-stamp approvals. Too few and high-risk actions execute without human oversight. The right calibration is risk-tiered: rare, high-risk, slow-approval gates rather than frequent, low-friction click-throughs.

Audit logging for agent actions is now a regulatory requirement in several jurisdictions and a procurement requirement at most large enterprises. Every tool call, every input the agent received, every output it produced, every approval granted should be logged with enough fidelity to reconstruct the action chain after an incident. This is not just for forensic purposes. It is the foundation for continuous improvement, drift detection, and the kind of postmortem analysis that hardens the system over time.

Human-in-the-loop checkpoints

There are categories of agent decision where automated guardrails are insufficient in 2026, and where human review is the right architectural choice. These include any irreversible action with material consequences, any action with regulatory implications (financial transactions, healthcare decisions, communications subject to compliance review), any decision involving customer or employee privacy, any decision the agent flags as low-confidence, and any case where the agent's reasoning diverges from its initial plan. Toloka's domain expert network is increasingly deployed in this role at enterprise customers.

The design choice between synchronous and asynchronous human review is consequential. Synchronous review (the agent blocks waiting for human approval before proceeding) provides the strongest control but at the cost of throughput. Asynchronous review (the agent acts, with human review of the outcome) provides throughput but at the cost of catching only post-hoc. Most production deployments combine both: synchronous for high-risk pre-action review, asynchronous for sampling-based quality monitoring across all actions.

Domain expertise matters in this oversight role. A reviewer who does not understand the agent's task cannot reliably approve or reject its actions. For healthcare agents, the reviewer needs clinical expertise. For financial agents, financial expertise. For legal agents, legal expertise. The crowdsourced general-purpose reviewer pool that worked for general content moderation does not work for high-stakes domain agents. This is the structural reason Toloka has invested heavily in credentialed expert networks across 90-plus specialisations.

Cost-benefit analysis at scale is the practical question. Human review is expensive per action. Automated agents are cheap per action. The right policy maximises the value of the human reviewer's time by focusing them on the actions where their judgement actually changes the outcome. Risk-based escalation policies, drift-based sampling, and adversarial example flagging all serve this purpose.

Secure your AI agents before production

Toloka delivers human-in-the-loop oversight, red teaming, and guardrail validation for enterprises deploying autonomous agents.

Talk to safety experts →

Red teaming AI agents

Red teaming an agent is materially different from red teaming an LLM. The model is one component in a larger system that includes tools, memory, planning logic, and orchestration. Vulnerabilities can exist in any layer or in the interactions between them. Red teaming must test the full system, not just the model's responses. We address this discipline in depth in our guide to LLM red teaming. The agent-specific points are summarised here.

Effective agent red teaming includes scenario-based testing where the red team constructs end-to-end attack scenarios that exercise the full agent loop. Tool-use specific tests that target how the agent handles ambiguous tool outputs, error states, and unexpected tool behaviour. Multi-turn attacks that build state over many interactions to reach goals not achievable in a single turn. Adversarial environment testing that places the agent in environments containing prompt injection bait, malicious tools, and confusing context. And boundary testing across the four guardrail layers to verify each layer's coverage.

Automated red teaming has matured significantly. Open-source tools like Microsoft PyRIT, NVIDIA Garak, and the UK AISI's Inspect framework allow systematic testing at scale. These tools are necessary but not sufficient. Human red teamers, particularly domain experts who understand the deployment context, find attack vectors that automated systems miss. The most effective programmes combine automated scale with human depth.

Monitoring and runtime safety

Pre-deployment testing catches the failure modes you anticipated. Production monitoring catches the failure modes you did not. Robust agent deployments instrument the agent extensively in production: every model call, every tool invocation, every approval gate, every input source. The telemetry feeds anomaly detection systems that surface unusual behaviour patterns for human review.

Drift detection is the term for catching the slow divergence between an agent's behaviour and its tested baseline. Drift can come from model updates, from tool behaviour changes, from changes in the input distribution, or from compounded edge cases. Detecting drift requires both quantitative metrics (success rates, error patterns, action distributions) and qualitative review of sampled interactions. The 2026 practice at well-run enterprise deployments includes weekly or monthly drift review meetings where AI engineering, security, and domain experts review samples and trigger remediation.

Incident response for agent failures is an emerging discipline. When an agent makes a mistake that has consequences, the response process needs to include containment (stopping the agent or constraining its scope), assessment (understanding what happened from the audit logs), remediation (fixing the issue at the right layer of the architecture), and learning (updating the test suite to prevent recurrence). This sounds standard for security incident response. The new wrinkle is that the "root cause" of an agent failure is often distributed across model behaviour, prompt design, tool behaviour, and guardrail configuration, making clear attribution harder than in traditional systems.

Compliance frameworks and industry-specific considerations

The regulatory landscape for agent deployment in 2026 includes the NIST AI Risk Management Framework, the EU AI Act with its provisions for high-risk AI systems, and emerging sector-specific guidance. The NIST framework, particularly the AI 600-1 Generative AI Risk Management Profile, provides the most directly applicable guidance for agent security: explicit treatment of dual-use risks, harmful content, and the need for documented oversight.

Financial services agents trigger SR 11-7 model risk management requirements (in US banking) and similar provisions elsewhere. Model risk management was designed for predictive models. Applying it to agents requires extensions for tool use, planning, and action authorisation, which several large financial institutions have published methodologies for. Healthcare agents fall under HIPAA in the US and analogous regulations elsewhere. Patient data handling, audit logging requirements, and authorisation chains all need to be reflected in the agent architecture. Government and defence have emerging standards driven by both the EU AI Act high-risk classifications and the US executive orders.

Across all of these regulatory frameworks, two themes appear repeatedly: documented human oversight at high-risk decision points, and documented testing including red teaming. The compliance posture and the security posture converge. The investment that produces secure agents also produces compliant agents, and vice versa.

Where this leaves us

Production-ready agent security in 2026 requires the four-layer guardrail architecture (input, reasoning, action, output), principled tool permission models with risk-tiered approval gates, human-in-the-loop checkpoints at high-risk decisions, ongoing red teaming with both automated and human components, and runtime monitoring with explicit drift detection. None of these is sufficient alone. All of them together provide the defence-in-depth that high-stakes agent deployments need.

The organisations that get this right in 2026 will deploy agents broadly, capture the productivity gains, and stay ahead of regulatory tightening. The organisations that skip this discipline will either deploy and discover the failure modes in production, or hesitate to deploy and lose competitive ground. Neither outcome is good. The third path, building the security discipline before broad deployment, is the only sustainable one.

Secure your AI agents before production

Toloka delivers human-in-the-loop oversight, red teaming, and guardrail validation for enterprises deploying autonomous agents.

Talk to safety experts →

Frequently asked questions

What are AI agent guardrails?

AI agent guardrails are programmatic constraints on agent behaviour, implemented outside the model itself, that ensure the agent operates within defined safety, security, and compliance boundaries. Modern guardrails are layered, with separate controls at the input layer (screening user and retrieved content), reasoning layer (monitoring chain of thought), action layer (validating tool calls), and output layer (filtering responses). The architecture provides defence in depth, so that no single point of failure compromises the system. Production deployments in 2026 implement all four layers as a baseline.

What is the biggest security risk for AI agents in production?

Indirect prompt injection remains the dominant risk in 2026. An attacker plants instructions in content the agent will later retrieve, such as a webpage, document, or database entry, and the agent processes these instructions as if they came from the user. The user is unaware, the attacker has agency through the agent, and the agent acts with the user's permissions. Defences combine retrieval grounding, provenance tracking, dual-model architectures, and human approval gates at high-risk action points.

How is agent security different from LLM security?

LLMs produce text. Agents take actions. The security threat model expands accordingly. A misbehaving LLM produces an inappropriate or false response. A misbehaving agent merges malicious code, sends harmful communications, executes incorrect transactions, or exfiltrates data. The attack surface includes tool exploitation, memory poisoning, multi-step planning manipulation, and inter-agent attacks in orchestrated systems. Security architecture must address the full agent loop, not just the model's text output.

What is prompt injection and how do you prevent it?

Prompt injection is an attack where adversarial instructions are inserted into the input an LLM or agent processes, causing the model to deviate from its intended behaviour. Direct injection comes through the user channel. Indirect injection comes through retrieved content the agent processes. Prevention is architectural rather than model-only: layered guardrails including input classifiers, retrieval grounding, provenance tracking, dual-model patterns separating retrieval from reasoning, signed prompts for trusted instructions, and human approval gates for high-risk actions. No single defence is complete, so production systems stack multiple imperfect defences.

How do you red team an AI agent?

Agent red teaming tests the full agent system rather than just the model. The methodology includes end-to-end scenario testing of complete attack chains, tool-use specific tests targeting how the agent handles ambiguous or malicious tool outputs, multi-turn attacks that build state over many interactions, adversarial environment testing where the agent operates in contexts containing prompt injection bait, and boundary testing across each layer of the guardrail architecture. Effective programmes combine automated tools like Microsoft PyRIT, NVIDIA Garak, and the UK AISI's Inspect framework with human red teamers who provide domain depth and creative attack discovery.

What compliance standards apply to AI agents?

The 2026 compliance landscape for agents includes the NIST AI Risk Management Framework (particularly the AI 600-1 generative AI profile), the EU AI Act with its high-risk AI system provisions, sector-specific regulations such as SR 11-7 model risk management for financial services and HIPAA for healthcare in the US, and emerging guidance for government and defence applications. Across all frameworks, two requirements appear repeatedly: documented human oversight at high-risk decision points, and documented testing including red teaming. The security and compliance postures converge in practice.


Related reading

Subscribe to Toloka news

Case studies, product news, and other articles straight to your inbox.