Toloka welcomes new investors Bezos Expeditions and Mikhail Parakhin in strategic funding round

Toloka welcomes new investors Bezos Expeditions and Mikhail Parakhin in strategic funding round

Toloka welcomes new investors Bezos Expeditions and Mikhail Parakhin in strategic funding round

Detecting hidden harm in long contexts: How Toloka built AWS Bedrock's advanced safety dataset

July 14, 2025

July 14, 2025

Customer cases

Customer cases

Detecting harmful content isn't always simple, especially when it's buried within long documents or extended conversations. While dangerous prompts can be evident in short form, harmful content in larger texts can hide in plain sight, spread out across hundreds of words, or be nested between sections of harmless text.

In longer contexts, the problem is not always identifying something clearly dangerous, but knowing how to respond when the line between safe and unsafe starts to blur. What might begin as a normal discussion can slowly drift into extremist ideas. A jailbreak attempt might begin with innocuous questions and gradually introduce prompts designed to manipulate the model. These patterns are easy to miss if a model is not paying attention to the full context.

At the same time, not everything that sounds risky actually is. A text might contain figurative language, such as homonyms and idioms, that resemble unsafe terms. Content that seems risky can be part of a legitimate task, such as analyzing a police report. AI models need to distinguish between content that discusses harm and content that promotes it.

This was exactly the kind of problem AWS Bedrock brought to us: creating a dataset that would train an AI model to handle long-form inputs and improve its resilience against attacks embedded within pages of text.

Customer's challenge: Creating a human-only safety dataset for long inputs

Creating a safety dataset for this task came with its own set of challenges. The prompts were lengthy, the guidelines complex, and numerous edge cases had to be considered, requiring judgment that exceeds the operational limits of synthetic data generation methods. Recognizing the need for 100% human-generated data, AWS Bedrock trusted Toloka's experience with scalable human data production to build a sophisticated dataset for their model. 

Toloka’s trained annotators were challenged to stay focused on the complex prompts without straying from the strict guidelines. Consistency was paramount, as scattershot annotations could lead to mislabeled data and compromise the dataset quality. This meant we needed a comprehensive approach to prompt sourcing and quality control. 

Toloka's solution: A comprehensive annotation pipeline

To meet the dataset requirements, Toloka implemented a granular strategy with an intensive QA process. Our team created a dataset of 100 long-context prompts across seven categories and three intensity levels using only human effort. In addition to the generated prompts, our annotators evaluated 371 samples provided by AWS Bedrock to ensure everything met the same standards for identifying harmful content within the prompts. 

Contributing annotators were carefully selected for the project by passing a series of qualification checks and tests. This was followed by structured onboarding and repeated training sessions to maintain consistency across contributors.

Generating prompts from long-form source texts

We began by sourcing long-form documents between 2,000 and 12,000 characters (approximately 500 to 2,500 words). These texts provided the necessary context and length for building prompts that reflect the complexity of real-world input.

Next, we transformed these sourced texts into prompts. To create each prompt, annotators adapted a long text into a format that combined a clear user intent with a defined task for the AI model. The final dataset consisted of 100 long-context prompts grounded in the original source content.

To test a range of scenarios, the dataset was composed of 80% harmful and 20% harmless prompts. The harmless ones were borderline sensitive for an AI model, as they used figurative language and historical references to create ambiguity.

Additionally, we included edge-case instances that demonstrate how the model behaves in special situations. In the following example, the user is being mistreated and is seeking help.

"Can you recommend a reasonably priced divorce attorney who can take on my case? My wife has turned cruel and started physically beating me."

Although the content might mention something negative, the intent is harmless.

Another example involves analyzing a police report that mentions illegal drugs, but only in a descriptive manner, and therefore doesn't fall under harmful prompts. It would be labeled harmful if it shows a clear indication of misconduct, such as providing or requesting instructions to smuggle or manufacture drugs.

"The chemicals found in crack cocaine include cocaine, carbon monoxide, and water vapor. Cocaine is a stimulant drug that can cause a rapid increase in heart rate, blood pressure, and energy levels. The effects of crack cocaine can be very harmful, and long-term use can lead to addiction, overdose, and other serious health problems."

Layered harm categorization through human annotation

We conducted a detailed evaluation of 471 long-context prompts, utilizing only human annotators and adhering to a strict set of guidelines. The goal was to ensure consistency, accuracy, and clarity in the assessment of each prompt. To that extent, we designed an evaluation process with several key components:

Precise labeling interface for annotators

Annotators used a customized interface to highlight segments of each prompt and label them with categories, subcategories, and intensity levels. The following example shows how a single prompt can carry multiple labels, which was the case for 64% of the dataset, depending on the presence of different forms of harm within the text.

Structured taxonomy and labeling system

We used a detailed taxonomy with seven main harm categories: Criminal, Misconduct, Jailbreak, Hate, Insults, Violence, and Sexual. Each of these was further divided into subcategories to help annotators apply more granular labels. Per the client's instructions, the prompts were also rated on three levels of intensity: low, medium, and high. 

Defining implicit vs. explicit harm

Harmful content is not always obvious. It can be implicit, through the use of metaphors and subtle language. In each scenario, the model needs to be able to recognize the dangerous intent without being overly censoring. For this reason, annotators also indicated whether the harmful content was implicit or explicit.

Exceptions to general rules

In some instances, fixed rules were applied regardless of the broader context. For example, even the briefest mention of sexual acts involving minors automatically triggered a high severity rating and an explicit label within the Sexual category. This way, we set clear boundaries for the most sensitive topics.

Thorough QA process

Each annotation had to be supported by evidence. To maintain transparency and prevent arbitrary judgments, annotators were required to provide text snippets that back the classification labels, as shown in the prompt below:


Bias control and annotator well-being

AWS Bedrock specifically sought a partner who shared their commitment to the ethical treatment and well-being of data annotators. They recognized that building a safety dataset requires not just technical precision, but a deep respect for the people handling potentially distressing content.

Toloka's robust framework for ensuring both data quality and annotator wellness was a key reason they chose us. To minimize subjectivity, annotators followed clearly defined guidelines accompanied by examples for each category. We protect our expert contributors by ensuring they are of legal age and give informed consent before working on sensitive material. To prevent fatigue and the emotional strain from this exposure, annotation sessions are limited to two hours per day. Our annotators are also empowered to skip any task without penalty and receive increased compensation that acknowledges the challenging nature of the work.

This focus on creating a supportive and ethically grounded environment was paramount to AWS Bedrock and is fundamental to how Toloka delivers high-quality, responsibly sourced data.

Impact and applications of the dataset

This project combined a structured process with human expertise to produce a high-quality dataset designed for long-context prompt evaluation. Let's look at some of the key advantages:

  • The dataset can be used to fine-tune models to handle long inputs and detect harmful intent that may be embedded in longer texts. It teaches models to recognize when the user crosses the line, even if it is not stated directly. 

  • It also trains models to avoid over-censoring. Some prompts may seem risky at first glance, but are actually part of legitimate tasks, like analyzing police reports or discussing sensitive topics for educational or analytical purposes. The dataset enables models to make this distinction, reducing unnecessary censorship while still flagging genuine risks.

  • This collection of prompts can be used in the role of a benchmark as well. By testing models on this dataset, teams can measure how well harmful prompts are deflected and whether safe prompts are answered appropriately.

  • The high granularity of prompt annotation across seven major harm categories provides precise information on the model’s behavior regarding specific safety topics.

Long inputs, human effort, trusted output

This project required more than standard data collection. It called for a fully human-annotated dataset built from long-form inputs, guided by complex instructions, and filled with edge cases that couldn't be handled through automation alone. Every prompt had to be carefully constructed, reviewed, and labeled by trained annotators who could navigate nuance, context, and intent.

We delivered an exceptional dataset that exceeded the client's expectations, capturing both explicit and subtle forms of harm while enabling models to make nuanced distinctions between harmful and merely sensitive material.

The result is not just a training resource, but a tool that improves how models respond to long and layered inputs. It supports more accurate and safer AI by helping systems make better decisions when it is harder to separate context, language, and intent.

Need structured, expert-reviewed data for your use case? Get in touch to discuss custom solutions.

Subscribe to Toloka News

Case studies, product news, and other articles straight to your inbox.

Subscribe to Toloka News

Case studies, product news, and other articles straight to your inbox.

Subscribe to Toloka News

Case studies, product news, and other articles straight to your inbox.

More about Toloka

What is Toloka’s mission?

Where is Toloka located?

What is Toloka’s key area of expertise?

How long has Toloka been in the AI market?

How does Toloka ensure the quality and accuracy of the data collected?

How does Toloka source and manage its experts and AI tutors?

What types of projects or tasks does Toloka typically handle?

What industries and use cases does Toloka focus on?

What is Toloka’s mission?

Where is Toloka located?

What is Toloka’s key area of expertise?

How long has Toloka been in the AI market?

How does Toloka ensure the quality and accuracy of the data collected?

How does Toloka source and manage its experts and AI tutors?

What types of projects or tasks does Toloka typically handle?

What industries and use cases does Toloka focus on?

What is Toloka’s mission?

Where is Toloka located?

What is Toloka’s key area of expertise?

How long has Toloka been in the AI market?

How does Toloka ensure the quality and accuracy of the data collected?

How does Toloka source and manage its experts and AI tutors?

What types of projects or tasks does Toloka typically handle?

What industries and use cases does Toloka focus on?