
How we build virtual companies to forge enterprise-ready AI

September 5, 2025


Insights


Demos look convincing because they strip away the rough edges. In reality, business systems are messy, full of quirks and dependencies that don’t show up in a staged setting. We build virtual companies that reproduce that mess—safely—so you can test, break, fix, and improve in a realistic environment before rolling out. 

Most testing happens in generic sandboxes that don't reflect how work actually gets done. Public benchmarks often rely on public data, which misses the important context of internal operations. And real work is rarely a single code commit; it's a chain of actions: updating documentation, checking in with a manager on Slack, and then writing the commit.

At Toloka, we've pioneered a different approach. Rather than building simple virtual environments, we create full virtual companies. The goal is not to test AI on isolated tasks but to model the intricate web of real work: the unique personas, proprietary tools, and multi-step workflows that define your business.

Let’s pull back the curtain on our unique methodology and show how we construct these high-fidelity virtual companies to provide the most rigorous, realistic, and scalable evaluation available in the market for enterprise AI agents.

The blueprint for a virtual company

Creating a faithful digital twin of a business isn’t something you can do with off-the-shelf methods. It takes a structured approach that captures the way your people work and the systems they rely on, without losing sight of the bigger picture. That’s why we’ve developed a meticulous, multi-stage methodology, with each step designed to maximize realism and test coverage.

Deconstructing your business DNA: Domain mapping and taxonomy design

Before we build anything, we map everything. It begins with decomposing your business space using the MECE (Mutually Exclusive, Collectively Exhaustive) principle. The outcome is a comprehensive taxonomy covering every persona, functional domain, user attribute, and system state. That blueprint guarantees full coverage of your operations with no redundant overlap, so no corner of the business is left untested.
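To make the MECE idea concrete, here is a minimal sketch in Python. The domain and workflow names are invented for illustration (this is not Toloka's actual taxonomy format); it simply checks that a domain-to-workflow map is mutually exclusive and collectively exhaustive:

```python
from itertools import combinations

# Hypothetical taxonomy: each functional domain maps to the workflows it owns.
# MECE holds when no two domains share a workflow (mutually exclusive) and,
# together, the domains cover every workflow in the business (collectively exhaustive).
taxonomy = {
    "sales":   {"lead_intake", "quote_approval"},
    "support": {"ticket_triage", "refund_request"},
    "finance": {"invoice_matching", "expense_audit"},
}

def is_mece(taxonomy, universe):
    """Return (mutually_exclusive, collectively_exhaustive) for a domain->workflows map."""
    exclusive = all(a.isdisjoint(b) for a, b in combinations(taxonomy.values(), 2))
    covered = set().union(*taxonomy.values())
    return exclusive, covered == set(universe)

universe = {"lead_intake", "quote_approval", "ticket_triage",
            "refund_request", "invoice_matching", "expense_audit"}
print(is_mece(taxonomy, universe))  # (True, True)
```

A failing check points directly at the flaw: overlapping domains break exclusivity, and an uncovered workflow breaks exhaustiveness.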

Building the world: High-fidelity company representations

With the blueprint in hand, we build out a living, working model of your business. Using our hybrid manual-synthetic pipeline, we generate realistic data, tools, and company policies without the security risks or licensing costs of using your live production data. Our "three-in-a-box" model brings together Domain Experts, Solution Engineers, and ML Engineers to seed and shape the environment, which is then validated and signed off by an independent quality team. What you get is a rich, authentic world for your AI agent to inhabit.

Flipping the switch: Test-bed construction

The virtual company isn’t a flat model on a page. It’s a working replica of your technical environment, complete with the moving parts your AI will need to navigate. We containerize the entire representation, wiring in your real company connectors where available and using mock-MCP endpoints where necessary. In practice, the setup mirrors the real-world architecture your agent will encounter, making the tests not only theoretically sound but technically relevant.
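As an illustration of the mocking idea, the sketch below stubs out a single CRM lookup tool with deterministic, canned state. The `MockTool` class and the CRM records are hypothetical, not Toloka's actual connector or MCP interface:

```python
# A minimal sketch of a mocked tool endpoint: where a real connector isn't
# available, the agent calls a deterministic stub that answers from seeded
# state. All names here (MockTool, the CRM example) are illustrative.
class MockTool:
    def __init__(self, name, handler):
        self.name = name
        self.handler = handler  # callable taking a dict of arguments

    def call(self, **kwargs):
        return self.handler(kwargs)

# Seeded state for a pretend CRM inside the virtual company.
crm_records = {"ACME-042": {"owner": "dana", "status": "open"}}

lookup_account = MockTool(
    "crm.lookup_account",
    lambda args: crm_records.get(args["account_id"], {"error": "not_found"}),
)

print(lookup_account.call(account_id="ACME-042"))  # {'owner': 'dana', 'status': 'open'}
```

Because the stub's responses are fully determined by the seeded state, the same agent behavior always produces the same tool outputs, which is what makes automated grading possible later on.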

Charting the course: Instruction and golden-trajectory creation

Defining success starts with our Subject-Matter Experts (SMEs), who author "golden trajectories"—the ideal, expert-level paths for completing complex user journeys. These trajectories codify what top-tier human performance looks like. While we use synthetic generation to achieve scale, every trajectory is manually validated and corrected by our experts to create a clear, high-quality benchmark for the AI to strive for.
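One simple way to picture a golden trajectory is as an ordered list of tool calls that an agent's trace must contain, in order. The sketch below, with invented step names, illustrates that idea; it is not Toloka's actual validation logic:

```python
# Sketch: a golden trajectory as an ordered list of (tool, key-arguments) steps.
# An agent run passes if its trace follows the expert path in order,
# even with harmless extra steps in between. Step names are illustrative.
golden = [
    ("docs.update", {"page": "runbook"}),
    ("slack.notify", {"channel": "#eng-managers"}),
    ("git.commit", {"branch": "main"}),
]

def matches_golden(trace, golden):
    """True if the golden steps appear in the trace as an ordered subsequence."""
    it = iter(trace)
    return all(step in it for step in golden)

agent_trace = [
    ("docs.update", {"page": "runbook"}),
    ("search.query", {"q": "commit format"}),   # harmless extra step
    ("slack.notify", {"channel": "#eng-managers"}),
    ("git.commit", {"branch": "main"}),
]
print(matches_golden(agent_trace, golden))  # True
```

An agent that commits before notifying its manager would fail this check, which captures the "chain of actions" nature of real work rather than just the final artifact.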

The gauntlet: Toloka’s two-stage quality assurance

Every test goes through two layers of QA:

  1. Specification QA: An independent review team validates every test case, artifact, and annotated log to confirm each is built exactly to spec.

  2. Stakeholder QA: We conduct weekly joint review sessions with you, our client, against a shared Quality Framework. This collaborative process guarantees alignment and builds confidence in the results.

Our philosophy: Why we aim for failure

Here's something that might seem counterintuitive: we design our benchmarks to be intentionally challenging, targeting a failure rate of ≥50%.

Vanity metrics hold little weight, and they definitely don't build better AI. An easy test that results in a 99% success rate tells you what you already know. A challenging test that surfaces failures is where true learning begins. Pushing models into harder territory exposes weaknesses and creates a graded curriculum for Reinforcement Learning (RL) training.

Our internal research also shows that roughly half of the failures on existing benchmarks are basic errors that will disappear as base models evolve. We focus on the hard, domain-specific problems that represent the genuine long-term barrier to enterprise adoption. We don't create benchmarks to make AI look good; we create proving grounds that make AI become good.

Research presented at ICML 2025

The end game: Tau-style automatic test cases ("RL Gyms")

The end result of this entire methodology is more than a report card: it's a dynamic, fully automated evaluation sandbox we call a "Tau-Style Gym." These gyms are the ultimate training environment for enterprise AI. They:

  • Produce deterministic reward signals essential for Reinforcement Learning.

  • Emit massive volumes of rich agentic traces for analysis and fine-tuning.

  • Serve as an enterprise-grade benchmark with fully automated grading.

The result is a powerful, self-reinforcing loop: Execute → Collect Traces → Retrain → Re-evaluate. It’s the most economical and efficient path to building a high-quality data foundation for training capable, enterprise-ready AI agents.
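The deterministic-reward idea can be sketched as assertions over the final environment state: the same trajectory always earns the same score, which is what makes the signal usable for RL. The state keys below are invented for illustration, not Toloka's actual grading schema:

```python
# Sketch: a test case grades the final environment state with fixed checks,
# yielding a reproducible, graded reward in [0, 1]. Keys are illustrative.
def reward(final_state):
    checks = [
        final_state.get("ticket_status") == "resolved",
        final_state.get("refund_issued") is True,
        final_state.get("policy_violations", 0) == 0,
    ]
    return sum(checks) / len(checks)

state = {"ticket_status": "resolved", "refund_issued": True, "policy_violations": 0}
print(reward(state))  # 1.0
```

Grading the environment's end state, rather than the agent's wording, is what keeps the signal deterministic: two different phrasings of the same successful run score identically.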

Tailored to your reality: A fully customizable proving ground

Every business is unique, so every virtual company we build is fully customizable across multiple dimensions. Our methodology is not a rigid product but a flexible framework designed to create the precise proving ground you need. You can think of these as dials we can turn to perfectly match your operational reality and evaluation goals.

Scale the complexity

Your workflows aren't simple, so your tests shouldn't be either. We can scale the simulation to meet your needs, from a baseline of 15 to 20 tools per agent to complex mega-workflows involving 50-plus tools. We can model entire source systems like SAP or Salesforce in granular detail, creating dedicated agents with over 20 mocked tools to cover specific functionalities.

Choose your automation level

We meet you where you are by delivering everything from meticulously crafted manual test cases with full transcripts to semi-automated golden-trace generation, all the way to the end state of a fully autonomous RL Gym.

Measure what matters to you

A simple pass or fail is often not enough. We configure the evaluation to provide a multi-dimensional view of agent performance, tracking the metrics that matter to your business. This includes granular success criteria like success@1 and recovery rates (recovery@K, steps‑to‑recovery), operational efficiency KPIs such as latency (p50/p95) and cost per task, and critical compliance measures like policy adherence and custom safety scores.
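As a rough illustration, a few of these metrics can be computed directly from per-run logs. The record fields below are invented for the example, not Toloka's actual telemetry schema:

```python
import statistics

# Sketch: computing success@1, median latency (p50), and cost per task
# from hypothetical per-run records.
runs = [
    {"success_first_try": True,  "latency_s": 4.1, "cost_usd": 0.12},
    {"success_first_try": False, "latency_s": 9.8, "cost_usd": 0.31},
    {"success_first_try": True,  "latency_s": 5.0, "cost_usd": 0.15},
    {"success_first_try": False, "latency_s": 7.2, "cost_usd": 0.22},
]

success_at_1 = sum(r["success_first_try"] for r in runs) / len(runs)
p50_latency = statistics.median(r["latency_s"] for r in runs)
cost_per_task = sum(r["cost_usd"] for r in runs) / len(runs)

print(f"success@1={success_at_1:.2f} p50={p50_latency:.1f}s cost=${cost_per_task:.2f}")
```

Recovery metrics like recovery@K follow the same pattern, just counted over runs that failed first and then succeeded within K retries.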

Adapt to any industry

Our methodology is industry-agnostic, meaning we can quickly customize our framework to model any business space.

Match your modality and language

Your AI agent doesn't just live in the cloud or speak English. Our virtual companies can be configured to test for a variety of deployment targets (including on-device and edge agents), modalities (voice, vision, document OCR), and languages (with support in over 40 languages).

Flexible engagement models

Whether you need a one-off data corpus for a time-boxed project or a continuous service subscription that provides a live Data Flywheel for constant model improvement, we tailor the partnership to your goals and provide an engagement model that matches your priorities.

Build your AI's future on a real foundation

If you want to know whether an AI agent is fit for purpose, it has to be tested against a faithful representation of how your organization works in the real world. Generic benchmarks and simplified sandboxes can no longer provide the signal needed to bridge the gap between potential and performance.

Toloka brings together the right methodology and expertise to build a virtual company, a proving ground that’s scalable, rigorous, and designed for enterprise-ready AI.

Move beyond the sandbox and contact us to learn how we can build a virtual company to perfect your AI agents.

Subscribe to Toloka News

Case studies, product news, and other articles straight to your inbox.


More about Toloka

What is Toloka’s mission?

Where is Toloka located?

What is Toloka’s key area of expertise?

How long has Toloka been in the AI market?

How does Toloka ensure the quality and accuracy of the data collected?

How does Toloka source and manage its experts and AI tutors?

What types of projects or tasks does Toloka typically handle?

What industries and use cases does Toloka focus on?
