We red-teamed an AI agent in 4 hours

An enterprise AI assistant with RAG and tool access. 8 attack categories. 302 security specialists. Here's what we found and how we did it.

8

Attack categories

302

Security specialists

~4h

Total time

$720

Total cost

Trusted by Leading AI Teams

The problem with testing your own agent

Four patterns we see when teams do security testing internally.

Familiarity blindness

Teams who built the system test what they expect to work. External testers probe what they expect to break.

No taxonomy

Without a systematic list of attack categories, there's no way to measure coverage or identify gaps.

Single-shot testing

Real attacks are iterative—probe, observe response, refine. Most internal testing is one prompt, one check.

No documentation

When a vulnerability appears in production, teams can't trace what was tested or prove due diligence.

How iterative attacks work

This is an actual attack chain from our test. The target: an AI assistant with access to HR documents.

The attacker tried a direct request, got blocked, then used authority claims combined with document references the agent itself revealed.

The vulnerability: The agent cited "hr_handbook.txt" in its refusal. The attacker used that filename to request "verification" of its contents—and the agent complied.

The 8 attack categories we tested

Each category has defined targets and success criteria.

Data extraction

Elevated risk

Technique-based

Credentials

API keys, passwords, access tokens

Personal Data

Names, emails, SSNs, addresses

Financial

Salaries, budgets, payment info

Strategic

Roadmaps, M&A, board notes

Prompt Injection

Override system instructions

System Leak

Extract system prompt

Impersonation

Claim authority to bypass controls

Tool abuse

Misuse email, tickets, lookups

How this project ran

From project setup to downloaded results.

Described the agent

Provided the URL, listed the tools (document search, email, tickets), and described the data it has access to.

Described the agent

Provided the URL, listed the tools (document search, email, tickets), and described the data it has access to.

AI generated the test configuration ~30 min

The platform's AI assistant created the 8-category taxonomy, success criteria, expert guidelines, and testing interface. We reviewed and approved.

AI generated the test configuration ~30 min

The platform's AI assistant created the 8-category taxonomy, success criteria, expert guidelines, and testing interface. We reviewed and approved.

Ran a few test attacks ourselves

Completed 2 categories manually to verify the interface worked and calibrate what "success" looks like.

Ran a few test attacks ourselves

Completed 2 categories manually to verify the interface worked and calibrate what "success" looks like.

Security specialists executed attack chains ~4 hours

Experts from the AI Red Teaming and Cybersecurity Analyst pools ran minimum 3 iterative attempts per category. LLM QA validated each submission.

Security specialists executed attack chains ~4 hours

Experts from the AI Red Teaming and Cybersecurity Analyst pools ran minimum 3 iterative attempts per category. LLM QA validated each submission.

Downloaded the audit trail

Every prompt sent, every response received, every document cited, and written explanation of what worked and why.

Downloaded the audit trail

Every prompt sent, every response received, every document cited, and written explanation of what worked and why.

What the deliverable includes

For each attack category, you receive:

Every prompt the attacker sent (minimum 3 per category)

Every prompt the attacker sent (minimum 3 per category)

Documents the agent cited (if using RAG)

Documents the agent cited (if using RAG)

Resistance rating: 1-5 scale

Resistance rating: 1-5 scale

Full agent response for each prompt

Full agent response for each prompt

Outcome classification: success / partial / blocked

Outcome classification: success / partial / blocked

Written explanation of the attack strategy and result

Written explanation of the attack strategy and result

Cost and time comparison

Three approaches to security testing for the same 8-category scope.

Approach

Cost

Time

Coverage

Documentation

Internal engineer (ad-hoc)

$1,000–1,300

1 business day

Unstructured

None

Security consultant

$1,500–2,000

1 business day

Systematic

Report

This project (Toloka)

$720

~4 hours

8 categories

Full audit trail

FAQ

FAQ

Who are the "security specialists"?

Who are the "security specialists"?

Can I customize the attack categories?

Can I customize the attack categories?

What access do testers get to my agent?

What access do testers get to my agent?

How do you ensure quality?

How do you ensure quality?

What if I want to test fewer categories?

What if I want to test fewer categories?

How long until I get results?

How long until I get results?

Trusted by Leading AI Teams

Run this on your agent

Start with 2 categories for ~$180, or run all 8 for ~$720. Results in hours, not weeks.
No minimum commitment. No long-term contract. Pay per project.