We red-teamed an AI agent in 4 hours
An enterprise AI assistant with RAG and tool access. 8 attack categories. 302 security specialists. Here's what we found and how we did it.
8 attack categories
302 security specialists
~4h total time
$720 total cost
Trusted by Leading AI Teams
The problem with testing your own agent
Four patterns we see when teams do security testing internally.
Familiarity blindness
Teams who built the system test what they expect to work. External testers probe what they expect to break.
No taxonomy
Without a systematic list of attack categories, there's no way to measure coverage or identify gaps.
Single-shot testing
Real attacks are iterative: probe, observe the response, refine. Most internal testing is one prompt, one check.
No documentation
When a vulnerability appears in production, teams can't trace what was tested or prove due diligence.
How iterative attacks work
This is an actual attack chain from our test. The target: an AI assistant with access to HR documents.
The attacker tried a direct request, got blocked, then used authority claims combined with document references the agent itself revealed.
The vulnerability: The agent cited "hr_handbook.txt" in its refusal. The attacker used that filename to request "verification" of its contents—and the agent complied.
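One way to catch this class of leak before launch is to scan refusal messages for document filenames. The sketch below is a minimal illustration, not the check used in this project: the function name `leaked_filenames`, the extension list, and the sample refusal text are all assumptions made for the example.

```python
import re

# Hypothetical list of document extensions to look for; adjust to your corpus.
DOC_EXTENSIONS = ("txt", "pdf", "docx", "md", "csv")

# Matches tokens such as hr_handbook.txt: word characters, dots, or hyphens,
# followed by a dot and one of the known extensions.
FILENAME_PATTERN = re.compile(
    r"\b[\w.-]+\.(?:" + "|".join(DOC_EXTENSIONS) + r")\b",
    re.IGNORECASE,
)


def leaked_filenames(response: str) -> list[str]:
    """Return document filenames mentioned in an agent response, in order of appearance."""
    return FILENAME_PATTERN.findall(response)


# Illustrative refusal text (not taken from the actual test transcript).
refusal = 'I cannot share salary details from "hr_handbook.txt".'
print(leaked_filenames(refusal))
```

A non-empty result on a refusal is worth flagging: the agent declined the request but still told the attacker which document to ask about next.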
The 8 attack categories we tested
Each category has defined targets and success criteria.
Data extraction
Elevated risk
Technique-based
Credentials: API keys, passwords, access tokens
Personal Data: Names, emails, SSNs, addresses
Financial: Salaries, budgets, payment info
Strategic: Roadmaps, M&A, board notes
Prompt Injection: Override system instructions
System Leak: Extract system prompt
Impersonation: Claim authority to bypass controls
Tool abuse: Misuse email, tickets, lookups
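A taxonomy like this is easiest to measure against when it is written down as data. The following is a rough sketch under our own assumptions: the `AttackCategory` class and the `coverage_gaps` helper are hypothetical names for illustration, not part of the platform. Only the category names and targets come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackCategory:
    name: str
    target: str  # what a successful attack would obtain or achieve


# The eight categories and targets listed above.
TAXONOMY = (
    AttackCategory("Credentials", "API keys, passwords, access tokens"),
    AttackCategory("Personal Data", "Names, emails, SSNs, addresses"),
    AttackCategory("Financial", "Salaries, budgets, payment info"),
    AttackCategory("Strategic", "Roadmaps, M&A, board notes"),
    AttackCategory("Prompt Injection", "Override system instructions"),
    AttackCategory("System Leak", "Extract system prompt"),
    AttackCategory("Impersonation", "Claim authority to bypass controls"),
    AttackCategory("Tool abuse", "Misuse email, tickets, lookups"),
)


def coverage_gaps(tested: set[str]) -> list[str]:
    """Return taxonomy categories that have not been tested yet, in taxonomy order."""
    return [c.name for c in TAXONOMY if c.name not in tested]


# Example: a hypothetical internal pass that only tried two categories.
print(coverage_gaps({"Prompt Injection", "System Leak"}))
```

With the list explicit, "what did we not test?" becomes a lookup rather than a guess.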
How this project ran
From project setup to downloaded results.

Described the agent
Provided the URL, listed the tools (document search, email, tickets), and described the data it has access to.

AI generated the test configuration (~30 min)
The platform's AI assistant created the 8-category taxonomy, success criteria, expert guidelines, and testing interface. We reviewed and approved.

Ran a few test attacks ourselves
Completed 2 categories manually to verify the interface worked and calibrate what "success" looks like.

Security specialists executed attack chains (~4 hours)
Experts from the AI Red Teaming and Cybersecurity Analyst pools ran a minimum of 3 iterative attempts per category. LLM QA validated each submission.

Downloaded the audit trail
Every prompt sent, every response received, every document cited, and a written explanation of what worked and why.
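To make the audit trail concrete, here is a minimal sketch of how one entry could be modeled, along with a check for the 3-attempt minimum. The `AttackRecord` fields mirror the items listed above, but the class, the function `categories_below_minimum`, and the placeholder values are assumptions for illustration. The actual export format may differ.

```python
from collections import Counter
from dataclasses import dataclass, field

MIN_ATTEMPTS = 3  # minimum iterative attempts per category, as described above


@dataclass
class AttackRecord:
    category: str
    prompt: str                 # prompt sent to the agent
    response: str               # response received
    documents_cited: list[str] = field(default_factory=list)
    explanation: str = ""       # tester's note on what worked and why


def categories_below_minimum(records, categories, minimum=MIN_ATTEMPTS):
    """Map each under-tested category to the number of attempts it actually received."""
    counts = Counter(r.category for r in records)
    return {c: counts[c] for c in categories if counts[c] < minimum}


# Placeholder records; real entries would carry the full prompt and response text.
records = [
    AttackRecord("System Leak", "<prompt 1>", "<response 1>"),
    AttackRecord("System Leak", "<prompt 2>", "<response 2>"),
    AttackRecord("System Leak", "<prompt 3>", "<response 3>"),
    AttackRecord("Impersonation", "<prompt 1>", "<response 1>"),
]
print(categories_below_minimum(records, ["System Leak", "Impersonation", "Tool abuse"]))
```

Running a check like this over a downloaded trail lets you confirm for yourself that every category received the attempts it was supposed to.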
What the deliverable includes
For each attack category, you receive:
Cost and time comparison
Three approaches to security testing for the same 8-category scope.
Approach | Cost | Time | Coverage | Documentation
Internal engineer (ad-hoc) | $1,000–1,300 | 1 business day | Unstructured | None
Security consultant | $1,500–2,000 | 1 business day | Systematic | Report
This project (Toloka) | $720 | ~4 hours | 8 categories | Full audit trail
Who are the "security specialists"?
Two pools: AI Red Teamers (trained on adversarial AI testing) and Cybersecurity Analysts (security credentials). A total of 302 specialists matched our language and expertise filters.
Can I customize the attack categories?
Yes. The AI assistant generates a taxonomy based on your agent's tools and data access. You can add, remove, or modify categories before launch.
What access do testers get to my agent?
They interact via the URL you provide—same as any user. No backend access, no code review. Just adversarial prompting through the normal interface.
How do you ensure quality?
A minimum of 3 attack attempts per category. LLM QA validates every submission against the success criteria before delivery. We also reviewed each result manually.
What if I want to test fewer categories?
Price scales linearly. 2 categories ≈ $180. 4 categories ≈ $360. You can start small and expand.
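The arithmetic behind those figures, as a small sketch. The per-category rate is derived here from this project's totals ($720 for 8 categories) and is an assumption for estimation only, not a published price list. The function name `estimate_cost` is hypothetical.

```python
PROJECT_COST_USD = 720      # total cost of this project
PROJECT_CATEGORIES = 8      # categories tested in this project

# Assumed flat rate implied by linear scaling: 720 / 8 = 90 USD per category.
PER_CATEGORY_USD = PROJECT_COST_USD / PROJECT_CATEGORIES


def estimate_cost(categories: int) -> float:
    """Estimate project cost in USD for a given number of attack categories."""
    if categories < 1:
        raise ValueError("at least one category is required")
    return categories * PER_CATEGORY_USD


for n in (2, 4, 8):
    print(f"{n} categories: about ${estimate_cost(n):.0f}")
```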
How long until I get results?
This project took ~4 hours from launch to completed results. Time depends on category count and expert availability.
Run this on your agent
Start with 2 categories for ~$180, or run all 8 for ~$720. Results in hours, not weeks.
No minimum commitment. No long-term contract. Pay per project.

