Creating domain-ready datasets: How Toloka's hybrid approach generates realistic and high-quality data

Training domain-specific AI models often hits a familiar roadblock: a lack of access to high-quality, realistic, and diverse datasets. In highly regulated industries, using real data isn’t always an option due to privacy concerns, compliance constraints, or simply the unavailability of complete datasets. To build systems for question answering, search, or data analysis, models must be trained on examples that accurately reflect the full complexity of real business environments.
Solving this problem means moving past off-the-shelf datasets and creating interconnected systems of entities, documents, and user interactions that align with a coherent organizational narrative. The challenge is to do this at scale while maintaining both realism and quality.
This was the central focus of a recent project for our client, a leading LLM developer, who tasked us with creating a synthetic financial organization. The result was a Salesforce instance with entities that represent realistic personas, customers, contracts, internal documents, and business activities.
Toloka's solution: Combining LLMs with expert oversight
We began by defining the anatomy of a modern financial organization, identifying its core entities and the data types supported by the Salesforce online connector. We generated the data with a hybrid pipeline that combined an LLM with human oversight. The LLM was instructed to generate content from detailed specifications matching the company's profile. Each generated item was then reviewed by domain experts against a predefined set of quality criteria; the experts could accept the content as is, make corrections, or reject it entirely.
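The accept/correct/reject review step is the core of this hybrid approach, so it helps to make it concrete. The sketch below is an illustration rather than Toloka's actual tooling: the class and function names are assumptions, and the LLM call and expert decision are passed in as placeholders.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    ACCEPT = "accept"
    CORRECT = "correct"
    REJECT = "reject"


@dataclass
class GeneratedItem:
    entity_type: str                      # e.g. "Account", "Contract", "Knowledge article"
    content: str                          # draft produced by the LLM
    verdict: Optional[Verdict] = None
    final_content: Optional[str] = None


def run_pipeline(specs, generate_fn, review_fn):
    """Generate items from detailed specs, then route each one through expert review.

    generate_fn(spec) -> draft content (an LLM call in the real pipeline).
    review_fn(item)   -> (Verdict, corrected content or None), an expert decision.
    """
    accepted = []
    for spec in specs:
        item = GeneratedItem(entity_type=spec["entity_type"],
                             content=generate_fn(spec))
        verdict, corrected = review_fn(item)
        item.verdict = verdict
        if verdict == Verdict.ACCEPT:
            item.final_content = item.content
        elif verdict == Verdict.CORRECT:
            item.final_content = corrected    # expert rewrite replaces the draft
        else:
            continue                          # rejected drafts are dropped or regenerated
        accepted.append(item)
    return accepted
```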
We generated test environments using simulated accounts populated with realistic emails and attachments. All test content was plausible but fabricated, avoiding references to actual projects, companies, or individuals. The project was carried out in two phases:
First phase: Company and entity generation
In the first phase, we created the digital skeleton of a fictional 30-person company, comprising entities such as "Accounts", "Contracts", "Campaigns", "Knowledge articles", "Partners", and "Opportunities", among others. We generated 100 "Accounts" and 100 "Documents", each designed to reflect real-world formats and use cases.
Each account differed in the number and types of associated entities to simulate organizational diversity. The documents were designed to represent different business lines (Sales, Operations, Marketing, Human Resources, Legal, etc.) and came in multiple formats (text, RTF, PDF, DOC, HTML, PPT, Excel). To better simulate real data scenarios, we varied document length and complexity (see the spec sketch after this list) by including:
Multi-page documents (at least 2-3 pages or more than 2000 words).
Documents containing a mix of images, tables, bulleted lists, and sections.
Extended documents that reach 4+ pages.
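These document specifications can be expressed as structured parameters that the generation prompts are built from. The snippet below is a hypothetical illustration of such a spec; the field names and values are assumptions, not the project's actual configuration.

```python
# Illustrative generation specs; the real prompt parameters may differ.
document_specs = [
    {
        "business_line": "Marketing",
        "format": "PDF",
        "length": "multi_page",        # at least 2-3 pages or 2,000+ words
        "elements": ["tables", "bulleted_lists", "images", "sections"],
    },
    {
        "business_line": "Legal",
        "format": "DOC",
        "length": "extended",          # 4+ pages
        "elements": ["sections"],
    },
]

account_spec = {
    "entity_type": "Account",
    "count": 100,
    # Vary the number and types of linked entities per account
    # to simulate organizational diversity.
    "linked_entities": ["Contracts", "Opportunities", "Campaigns"],
}
```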
After the LLM generated the entities, experts reviewed the data to ensure it was factually correct and contextually relevant. Beyond creating "believable" content, we ensured it could realistically be used to train and test internal enterprise AI models.
Second phase: Q&A generation
With the synthetic company in place, the second phase focused on creating a Q&A dataset aligned with the generated content. We produced 500 Q&A pairs covering three types of user queries (a schema sketch follows the examples below):
Factoid queries: Direct questions with specific answers found within a single entity. For example: "What is the location of Crane and Green?". The answer is linked to the entity ID where the information is found; in this case, it would read something like: "The company Crane and Green is located in San Francisco, California."
Aggregation queries: Questions that require combining data from multiple entities, such as "Which accounts have more than 100 employees?". In this case, we listed all relevant accounts in the answer.
Summarization queries: Questions that focus on condensing existing content and extracting the key points into a short summary. For example, we might generate a query like this: "Give a brief summary of the marketing plan for Crane and Green.", and the answer would be: "Focuses on digital ad campaigns, a Q3 product rollout, and regional brand awareness efforts."
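Concretely, each pair can be stored as a small record that ties the question and answer to the entity IDs that ground them. The schema below is a hedged illustration, assuming a flat dataclass and a hypothetical record ID; it is not the client's actual export format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QAPair:
    query_type: str   # "factoid", "aggregation", or "summarization"
    question: str
    answer: str
    # IDs of the synthetic Salesforce entities that ground the answer;
    # aggregation queries typically reference several entities.
    source_entity_ids: List[str] = field(default_factory=list)


# Factoid example from the text above; the record ID is hypothetical.
factoid = QAPair(
    query_type="factoid",
    question="What is the location of Crane and Green?",
    answer="The company Crane and Green is located in San Francisco, California.",
    source_entity_ids=["account-0042"],
)
```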
As with the document generation, all Q&A content was reviewed by domain experts. Answers were validated for correctness, and in cases where the LLM output was incorrect or insufficient, experts stepped in to rewrite or refine the responses.
In addition to being syntactically correct, the Q&A pairs were grounded in the context of the fictional company, making them well suited for evaluating enterprise applications such as internal assistants or data retrieval systems.
Adding complexity with multimodal data
To further enrich the dataset and test more complex AI interactions, we extended the initial text-based data by incorporating multimodal content, including:
100 standalone images
100 standalone tables
90 knowledge articles with embedded tables and images
This opened the door to a new class of queries that require interpreting visual and tabular data, a crucial step in building AI systems that can understand business reports, dashboards, and presentations. The QA process for this multimodal content included checks against the following preset criteria (a checklist sketch follows the list):
Contextual relevance: Does the image align with the article's subject?
Structural completeness: Are the tables filled with plausible data?
Overall document coherence: Does the multimodal document flow logically?
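In practice, these checks can be tracked as a simple per-document checklist. Below is a minimal sketch, assuming one boolean verdict per criterion; the class and field names are illustrative, not the actual review tooling.

```python
from dataclasses import dataclass


@dataclass
class MultimodalReview:
    document_id: str
    contextual_relevance: bool      # images align with the article's subject
    structural_completeness: bool   # tables are filled with plausible data
    document_coherence: bool        # the multimodal document flows logically

    def passes(self) -> bool:
        """A document clears review only if every preset criterion is met."""
        return all([self.contextual_relevance,
                    self.structural_completeness,
                    self.document_coherence])


review = MultimodalReview("kb-article-017", True, True, False)
print(review.passes())  # False -> routed to an expert for correction
```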
Despite the complexity, 80% of the generated data met all quality benchmarks without requiring expert intervention.
Moving toward reasoning-based queries
Building on the success of the initial dataset, we are now developing a fourth type of query: reasoning-based questions. These are more advanced and require the model to infer insights by connecting multiple pieces of information.
For instance, to answer the question "Which accounts are likely to expand next year?", the model needs to draw on both accounts and opportunities. We support this by developing multiple intermediate queries, such as "Which accounts have shown consistent revenue growth over the past year?" or "Which clients have recently opened new offices or increased hiring?".
We define the ground truth for the indicators that suggest expansion potential, providing a clear basis for generating the data needed to answer such questions accurately.
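One way to represent this is to attach the supporting sub-queries and ground-truth indicators to the top-level question, so the data needed to answer it can be generated and checked systematically. The sketch below is an assumption-laden illustration; the indicator names and the matching rule are not the project's actual ground-truth definition.

```python
# Illustrative decomposition of a reasoning-based query; the indicator
# names are assumptions, not the project's ground truth.
reasoning_query = {
    "question": "Which accounts are likely to expand next year?",
    "sub_queries": [
        "Which accounts have shown consistent revenue growth over the past year?",
        "Which clients have recently opened new offices or increased hiring?",
    ],
    # Ground-truth indicators that suggest expansion potential.
    "expansion_indicators": ["consistent_revenue_growth", "new_offices_or_hiring"],
}


def answer_reasoning_query(query, accounts):
    """Return the names of accounts that satisfy every expansion indicator.

    `accounts` is a list of dicts carrying boolean flags keyed by indicator name,
    e.g. {"name": "Crane and Green", "consistent_revenue_growth": True, ...}.
    """
    indicators = query["expansion_indicators"]
    return [a["name"] for a in accounts
            if all(a.get(flag, False) for flag in indicators)]
```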
Practical outcomes and cross-domain potential
The result of this project is a high-fidelity dataset that mirrors the depth and complexity of real business environments without compromising privacy or compliance. We generated and validated the data to serve practical use cases, such as teaching an AI assistant to answer internal policy questions or analyze financial performance.
Although this project primarily focused on the financial sector, the underlying methods are domain-agnostic. The same approach can be used to generate synthetic data for healthcare, legal, insurance, or manufacturing applications, wherever there's a need for intelligent systems trained on credible and compliant data.
Building AI, but missing the data?
Our hybrid approach delivers high-quality, domain-specific datasets without the compliance risks. Contact us for a custom solution.