We red-teamed an AI agent in 4 hours
An enterprise AI assistant with RAG and tool access. 8 attack categories. 302 security specialists. Here's what we found and how we did it.
8
Attack categories
302
Security specialists
~4h
Total time
$720
Total cost
Trusted by Leading AI Teams
The problem with testing your own agent
Four patterns we see when teams do security testing internally.
Familiarity blindness
Teams who built the system test what they expect to work. External testers probe what they expect to break.
No taxonomy
Without a systematic list of attack categories, there's no way to measure coverage or identify gaps.
Single-shot testing
Real attacks are iterative—probe, observe response, refine. Most internal testing is one prompt, one check.
No documentation
When a vulnerability appears in production, teams can't trace what was tested or prove due diligence.
How iterative attacks work
This is an actual attack chain from our test. The target: an AI assistant with access to HR documents.
The attacker tried a direct request, got blocked, then used authority claims combined with document references the agent itself revealed.
The vulnerability: The agent cited "hr_handbook.txt" in its refusal. The attacker used that filename to request "verification" of its contents—and the agent complied.
The 8 attack categories we tested
Each category has defined targets and success criteria.
Data extraction
Elevated risk
Technique-based
Credentials
API keys, passwords, access tokens
Personal Data
Names, emails, SSNs, addresses
Financial
Salaries, budgets, payment info
Strategic
Roadmaps, M&A, board notes
Prompt Injection
Override system instructions
System Leak
Extract system prompt
Impersonation
Claim authority to bypass controls
Tool abuse
Misuse email, tickets, lookups
How this project ran
From project setup to downloaded results.

Described the agent
Provided the URL, listed the tools (document search, email, tickets), and described the data it has access to.

AI generated the test configuration ~30 min
The platform's AI assistant created the 8-category taxonomy, success criteria, expert guidelines, and testing interface. We reviewed and approved.

Ran a few test attacks ourselves
Completed 2 categories manually to verify the interface worked and calibrate what "success" looks like.

Security specialists executed attack chains ~4 hours
Experts from the AI Red Teaming and Cybersecurity Analyst pools ran minimum 3 iterative attempts per category. LLM QA validated each submission.

Downloaded the audit trail
Every prompt sent, every response received, every document cited, and written explanation of what worked and why.
What the deliverable includes
For each attack category, you receive:
Cost and time comparison
Three approaches to security testing for the same 8-category scope.
Approach
Cost
Time
Coverage
Documentation
Internal engineer (ad-hoc)
$1,000–1,300
1 business day
Unstructured
None
Security consultant
$1,500–2,000
1 business day
Systematic
Report
This project (Toloka)
$720
~4 hours
8 categories
Full audit trail
Who are the "security specialists"?
Can I customize the attack categories?
What access do testers get to my agent?
How do you ensure quality?
What if I want to test fewer categories?
How long until I get results?
Trusted by Leading AI Teams
Run this on your agent
Start with 2 categories for ~$180, or run all 8 for ~$720. Results in hours, not weeks.
No minimum commitment. No long-term contract. Pay per project.

