
Toloka podcast: How RL Gyms are redefining data for AI agents

October 21, 2025

Insights

In our latest Toloka Live conversation, CEO Olga Megorskaya and Renaud du Breuil-Hélion de La Guéronnière, Toloka’s Director of Agentic Programs, unpack how data creation has gone from crowdsourcing to expert-led pipelines that now fuel the most advanced AI agents.

Talking through the industry's shift from pre-training and supervised fine-tuning (SFT) to reinforcement learning (RL), they look at what it takes to build controlled, realistic environments where agents learn through verifiable rewards rather than static human feedback. These environments are known as RL Gyms.

The conversation sheds light on how teams are moving beyond the chase for more data to focus on creating structured and realistic systems that teach models to reason.

The conversation covers:

  • Why reinforcement learning has become the new scaling frontier for AI.

  • How private, domain-specific benchmarks outperform public ones that quickly saturate.

  • What it takes to reach 99% data quality, and why each extra percentage point costs exponentially more.

  • The rise of orchestrated teams bringing together different kinds of expertise to build data that mirrors how agents operate in the real world.

Olga frames the shift as a natural next step in Toloka’s evolution, from simple labeling tasks to complex, expert-led data creation, while Renaud focuses on the technical frontier. 

They reveal how leading labs calibrate failure rates for faster learning and build “virtual companies” where agents can safely practice complex tasks. It’s a lively look at how data work is changing, with humans and machines advancing side by side rather than trading places.

Both also get into:

  • How to design RL Gyms that stay effective as models improve.

  • Why targeting ~50% success rates produces the best RL outcomes.

  • Striking the right balance between human and synthetic data in building reliable AI agents.

  • Why deterministic tools matter for safety and continuous learning.

  • How to build controlled environments that reflect the real world without exposing sensitive systems.

If you’re building AI agents or exploring new ways to create the data that trains them, this discussion offers a grounded look at where the field is heading and how expert-orchestrated RL data is becoming the backbone of reliable, safe systems.


Olga: Welcome everyone. I’m Olga, CEO and founder of Toloka. Joining me today is Renaud, our Director of Agentic Programs, and one of the few people who’s been working with data for AI agents long before it became a buzzword.

Today we’re talking about the evolution of training data, covering how we moved from crowdsourcing and labeling to the new era of RL Gyms, virtual environments, and reinforcement learning for AI agents.

A few years ago, it was enough to have any human provide a helpful signal to AI. Then you needed a PhD-level expert. Now, you need practicing professionals working in realistic, controlled environments to generate data that teaches AI how to reason and act.

Renaud: Most of the customers we’re working with now are focused on what we call the latest scaling law, which is reinforcement learning.

Pre-training? That’s mostly done. The internet has been scooped up. SFT? Also fairly mature. There’s plenty of it. The next leap in capability comes from reinforcement learning, which is creating environments where agents can train themselves through verifiable feedback rather than human preference.

Everything is shifting in that direction. The leading labs, the ones furthest along in capability, are building RL Gyms because that’s where there’s still headroom for progress.

Olga: Right. And what’s interesting is seeing how different teams across the industry approach this. Some still need SFT data for tool use; others are already into custom benchmarks and continuous learning for production agents. You mentioned benchmarks earlier. There’s a boom in public ones, yet we keep seeing demand for private, custom benchmarks. Why is that?

Renaud: Two reasons. First, public benchmarks saturate fast. As soon as they’re released, models start overfitting to them. It’s not about memorizing data so much as it’s about gaming the evaluation.

Second is volume. There’s still far less high-quality RL or benchmark data than pre-training or SFT data. If you want your model to learn meaningfully, you need a lot more of it – and at much higher quality.

That’s why private benchmarks matter. They’re harder to overfit to and can reflect realistic domains rather than idealized ones. But the tradeoff is cost, and reaching 99% quality is exponentially harder than 90%.

Olga: Yes. That’s what we see, too. Each additional percentage point of data quality costs disproportionately more effort, from specialists to verification to annotation depth. That’s why you rarely see open-source benchmarks reaching that standard.

So, what kinds of data are most in demand now?

Renaud: Almost everything fits one high-level pattern involving user intent, an agent with access to tools, and a verifiable outcome. What varies is where the agent fails and what tools it uses.

Take coding, for example. You’re giving an agent access to something very sensitive, like your codebase or your server. Safety becomes vital here. So we build environments that teach safe tool use while grading effectiveness.

Then there’s customer support. Safety isn’t as much of a concern there, but policy compliance and process adherence are. The agent needs to learn to navigate realistic workflows and jargon, and those environments have to reflect that complexity.

Another area is browser or computer use, which tests multimodal understanding like clicking the right buttons, interpreting screens, following complex state changes. These are surprisingly difficult because they involve both reasoning and perception.

And finally, knowledge retrieval. RAG systems handle single queries well, but agents still fail when they have to reason across many documents or chain tool calls together. That’s where RL Gyms help build resilience.

We focus on safety and effectiveness across all of these domains.
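
To make the pattern Renaud describes concrete, here is a minimal sketch (in Python) of how a single RL Gym task could be represented: a user intent, a set of callable tools, and a programmatic verifier. The field names and the toy verifier are illustrative assumptions, not Toloka's actual schema.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Tool:
        name: str
        run: Callable[[str], str]  # what the agent may call; signature is illustrative

    @dataclass
    class Task:
        intent: str                    # what the user wants
        tools: list[Tool]              # the agent's action space
        verify: Callable[[str], bool]  # deterministic check of the final outcome

    # A toy coding task: the reward comes from a verifiable check on the result,
    # not from a human preference rating.
    task = Task(
        intent="Rename the function foo to bar everywhere in the module",
        tools=[Tool("read_file", run=lambda path: open(path).read())],
        verify=lambda source: "def bar(" in source and "def foo(" not in source,
    )

What varies across the domains above is mostly which tools the task exposes and what the verifier checks: test suites for coding, policy compliance for support, screen state for browser use.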

Olga: What’s fascinating to me is that we now deliberately target specific failure rates for reinforcement learning. It’s counterintuitive, as you actually want your agent to fail around half the time.

Renaud: Exactly. For RL, 50% failure is ideal. Too easy and there’s nothing to learn from. Too hard and the gradient disappears. So we design tasks to hit that sweet spot.

Benchmarks and RL Gyms may look similar, but their goals differ. Benchmarks are meant to be tough and to last as long as possible before they’re solved. RL Gyms are tuned for progress. Both have shelf lives, but for different reasons.
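
As a rough illustration of that tuning, here is a toy calibration loop (in Python) that nudges task difficulty toward the roughly 50% success rate Renaud mentions. The run_task stub and the update rule are invented for the sketch; a real gym would measure success from actual agent rollouts.

    import random

    TARGET_SUCCESS = 0.5  # the sweet spot: hard enough to fail, easy enough to learn
    STEP = 0.1            # how aggressively to re-tune difficulty between batches

    def run_task(difficulty: float) -> bool:
        # Stub: pretend harder tasks succeed less often.
        return random.random() > difficulty

    difficulty = 0.2
    for batch in range(10):
        success_rate = sum(run_task(difficulty) for _ in range(200)) / 200
        # Succeeding too often -> make tasks harder; failing too often -> easier.
        difficulty += STEP * (success_rate - TARGET_SUCCESS)
        difficulty = min(max(difficulty, 0.0), 1.0)
        print(f"batch {batch}: success={success_rate:.2f}, difficulty={difficulty:.2f}")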

Olga: And when we talk about realism, virtual companies and controlled environments are a big part of that. Can you walk through how those work?

Renaud: Sure. Let’s say you’re building an agent for healthcare. You start by scoping, which involves understanding what tools exist, what policies apply, what “success” means. That requires business SMEs, from doctors to administrators – people who actually know the workflows.

Next, coders implement those tools and prompts as realistic systems. Then, context engineers design test cases that expose weaknesses, places where the agent might break, misunderstand, or misuse a tool.

Finally, annotators review transcripts to find out why the agent failed. Was it a logic issue? A safety lapse? A bad prompt? They close the loop.

You have orchestrated teams instead of one person labeling data, with each bringing specialized expertise to a pipeline that mirrors reality.

Olga: Which is such a change from Toloka’s early days. Back then, an annotation took 30 seconds. Now, a single data item can take hours, even days, and involve multiple experts.

But one thing hasn’t changed, and that’s the need for humans.

Renaud: We still do, at least for now. You can’t rely purely on synthetic data. Before a model launches, you don’t have user traces yet. You need humans to define the edge cases, the rewards, and the constraints.

Synthetic helps, but hybrid pipelines combining humans and AI work best. AI handles volume; humans set the boundaries.

Certain tasks are perfectly suited to AI, while others still need humans. As AI improves, that balance shifts. But you always need a human in the loop for reliability.

Olga: And you’re famous internally for pushing automation as far as it can go. But even you admit that for 99% quality, human oversight is non-negotiable.

Renaud: Absolutely. You can’t reach multiple nines of reliability without humans in the loop.

Olga: So what about user-behavior data? We get that question all the time. Can’t agents just learn from real interactions?

Renaud: Sometimes. But most of the time, you don’t even have user data yet. RL happens before launch. Even when you do, user traces tend to show where the model succeeds, not where it fails. RL is about the opposite. It’s those 10% of edge cases you didn’t cover.

Behavioral data can help diversify your dataset, but you can’t build an entire training strategy on it.

Olga: What about constraints, like physics or business rules?

Renaud: Keep deterministic checks inside the tools and not the agent. If something must never happen, make it a tool rule. Let the agent focus on the decisions that aren’t black and white. That’s how you get predictable, repeatable training signals.
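
A minimal sketch of that principle in Python, using an invented refund cap as the hard rule: the constraint lives in the tool, so it holds no matter what the agent decides.

    MAX_REFUND = 100.0  # invented policy limit, for illustration only

    class PolicyViolation(Exception):
        pass

    def refund_tool(amount: float) -> str:
        # The cap is enforced here, deterministically, rather than
        # trusting the agent to remember the rule.
        if amount <= 0 or amount > MAX_REFUND:
            raise PolicyViolation(f"refund of {amount} is out of policy")
        return f"refunded {amount:.2f}"

The agent still decides whether a refund is warranted, which is the gray area worth training on; whether the amount is allowed is not its call, so the training signal stays repeatable.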

Olga: I like that. Determinism where possible, exploration where needed.

Before we wrap, I want to point out something: the progress we’re seeing in AI agents is incredible. Just a year ago, tool-use agents were naive; now they’re sophisticated. But that leap only happened because of the data beneath them: the environments, the experts, and the feedback systems that teams like ours build.

Renaud: Totally. The foundation of capable, safe AI isn’t model size so much as it’s the quality of the environments and feedback loops around it.

Olga: Exactly. Thank you, Renaud. 

AI workflows start with the right data

Toloka helps teams build the data systems and environments that let agents improve in real-world conditions, where it matters most. Get in touch

