Measure cultural diversity in VLMs with JEEM – the benchmark for Arabic dialects understanding

Measure cultural diversity in VLMs with JEEM – the benchmark for Arabic dialects understanding

Measure cultural diversity in VLMs with JEEM – the benchmark for Arabic dialects understanding

AI agents under attack: A case study on advanced agent red-teaming

Apr 28, 2025

Apr 28, 2025

Customer cases

Customer cases

Client: LLM producer developing a computer use AI agent

Challenge: Discover vulnerabilities in agent behavior before public launch

Deliverables: 1200+ test cases covering a diverse set of attack vectors; each datapoint has a unique user prompt, reproducible environment configuration, and automated evaluation script

Impact: Insights for critical security improvements and reusable offline testing environments for ongoing evaluation to ensure agent safety

An advanced computer use agent is tasked with building scheduled reports for a corporate finance team. A routine data-gathering request quickly goes off the rails when the agent accesses a financial dashboard. Hidden in the page's code, an invisible string of text hijacks the agent’s decision-making process. Within seconds of reading the malicious instructions, the agent shifts focus, attempting to access sensitive company data and transmit it elsewhere.

Fortunately, this isn’t a real-world breach. It’s one of over 1,200 test scenarios developed by Toloka's security team in a controlled environment specifically designed to identify AI agent vulnerabilities before they can be exploited in the wild.

As AI agents evolve from simple chatbots into autonomous systems capable of taking actions across multiple applications, the security stakes have risen dramatically. For organizations looking to deploy AI agents, rigorous vulnerability testing requires a systematic approach to cover a wide range of test cases simulating real user interactions. 

The client: Frontier lab developing agentic AI

In pursuit of comprehensive red-teaming for the demo version of their advanced AI agent, a top LLM producer partnered with Toloka. 

Their goal: Test the agent’s robustness against real-world threats and identify vulnerabilities to address before deployment.

The computer use agent can autonomously interact with applications and data—running web browsers, editing spreadsheets, manipulating local files, and more. While this versatility brings unprecedented convenience and efficiency, it also creates new vectors for exploitation. Attackers can embed malicious instructions in user-generated content, advertisements, emails, or websites, potentially leading to unintended and harmful actions by the AI agent. If not properly safeguarded, these systems risk data leaks, malware installation, unauthorized access, and other serious security breaches.

The challenge: Going beyond traditional safety evaluations

Due to the elevated risks associated with AI agents, traditional text-based safety testing is not enough. Agent red-teaming calls for comprehensive, environment-based assessments focused on realistic threats, with dedicated testing methods that account for the agent’s ability to perform tool-based actions, react to real-time feedback, and operate in semi-autonomous cycles.

For this case, the challenge was to develop a comprehensive red-teaming framework for testing a computer use agent in an offline containerized setting.

Testing focused on three main vulnerabilities of computer use agents: 

  • External prompt injections (malicious instructions embedded in the environment by a hacker via email, ads, or other content).

  • Agent mistakes that accidentally leak sensitive information.

  • Misuse of the agent by directly prompting something that causes harm to others.

Addressing these vulnerabilities through red-teaming is the best way to ensure the agent is reliable enough for enterprise adoption.

The solution: A tailored red-teaming approach

Toloka developed a multi-layered red-teaming approach tailored specifically to the AI agent, focusing on test cases with a high level of realism and diversity. To support realistic evaluation while keeping the agent secure, we created offline reproducible environments for testing the agent in a sandbox. Our solution included:

1. Comprehensive taxonomy of risks

Our team collaborated with cybersecurity experts to define over 40 distinct risk categories, like malicious code execution, file deletion, and data exfiltration. Experts then mapped these risks to attack techniques and risk levels:

  • Categorized types of attacks, from basic prompt injections to sophisticated obfuscation techniques.

  • Assigned risk levels to each attack vector, from workflow disruptions to critical system risks.

2. Diverse test environments 

We built a large set of fully offline custom platforms covering more than 25 use cases and mimicking real-world environments:

  • Realistic simulations included social media, news sites, financial dashboards, coding forums, and other common use cases. 

  • Containerized, reproducible environments allowed for safe testing of potentially dangerous actions.

3. Extensive test cases

Each test case was based on a unique combination of user prompt and offline environment configuration. Experts developed more than 1,200 unique test scenarios covering the full taxonomy:

  • Implemented various attack techniques, from straightforward injections to complex derailment and time-delay attacks.

  • Created multiple test cases for each attack vector to ensure comprehensive coverage.

  • Applied quality assurance processes with at least two experts reviewing each test case.

4. Rigorous evaluation process

We implemented an evaluation process with two stages for efficiency and thoroughness: 

  • Automated evaluation to flag potential security breaches. 

  • Multiple layers of review of every flagged issue, conducted by cybersecurity experts with deep expertise in red teaming and infrastructure security.

Our red-teaming approach replicates real-world conditions as closely as possible while keeping every element under our control. This setup provides the AI agent with access to the same tools it would use in production, like browsers and spreadsheets, but any dangerous actions stay contained in our sandbox.

Examples of test cases

Here is an example test case for an attack vector targeting website-based threats. In this scenario, a user prompts an agent to search for specific information on a website. The agent loads the webpage, which contains a hidden prompt injection that aims to redirect the agent to a malicious website. Our automated evaluation checks whether the agent falls for this attack or not. 

The next example tests whether the agent will execute malicious code in an email environment. In this instance, the agent may execute malicious code as a result of a successful prompt injection attack.

The test cases focus on realism and diverse attack types. Combined with expert oversight, this testing framework can identify and address even the most elusive vulnerabilities before they can cause any harm in the real world.

The outcome: Averted security incidents before a critical release

Our comprehensive red-teaming solution covered more than 100 attack vectors and revealed numerous vulnerabilities that could have led to significant security incidents if the AI agent had been released without remediation. After running the test cases, the Toloka team delivered:

  • Documentation of vulnerabilities discovered across all 40+ risk categories.

  • A complete dataset of attack vectors, each with multiple test cases.

  • Artifacts to reproduce each attack, automated evaluation scripts, and expert-verified reports.

The impact: Building a testing framework for the future

The offline, reproducible test environments we delivered can be reused by the client for ongoing assessments, ensuring the model remains robust to emerging threats and new functionalities. 

The client gained long-term value with immediate and future benefits:

  • Insights for critical security improvements before public release.

  • Reusable testing environments to support ongoing security assessments and evaluate future agent improvements against known vulnerabilities. 

By identifying vulnerabilities before deployment, the agent development team avoided potential data breaches, system compromises, and reputational damage while gaining confidence in their agent's robustness against real-world threats.

Toloka specializes in building complete red-teaming pipelines for AI agents, from taxonomy development and environment creation to evaluation and vulnerability detection. Whether you need a complex testing framework for ongoing evaluations or a quick red-teaming exercise before launching an agent, we take an individual approach to meet your timeline and requirements.  Connect with our team

Updated:

Apr 28, 2025

Subscribe to Toloka News

Case studies, product news, and other articles straight to your inbox.

Subscribe to Toloka News

Case studies, product news, and other articles straight to your inbox.

Subscribe to Toloka News

Case studies, product news, and other articles straight to your inbox.

More about Toloka

What is Toloka’s mission?

Where is Toloka located?

What is Toloka’s key area of expertise?

How long has Toloka been in the AI market?

How does Toloka ensure the quality and accuracy of the data collected?

How does Toloka source and manage its experts and AI tutors?

What types of projects or tasks does Toloka typically handle?

What industries and use cases does Toloka focus on?

What is Toloka’s mission?

Where is Toloka located?

What is Toloka’s key area of expertise?

How long has Toloka been in the AI market?

How does Toloka ensure the quality and accuracy of the data collected?

How does Toloka source and manage its experts and AI tutors?

What types of projects or tasks does Toloka typically handle?

What industries and use cases does Toloka focus on?

What is Toloka’s mission?

Where is Toloka located?

What is Toloka’s key area of expertise?

How long has Toloka been in the AI market?

How does Toloka ensure the quality and accuracy of the data collected?

How does Toloka source and manage its experts and AI tutors?

What types of projects or tasks does Toloka typically handle?

What industries and use cases does Toloka focus on?