The human difference in high-stakes AI evaluation

on May 18, 2026


Enterprise-grade evaluation datasets don't usually get built without a project manager, a procurement cycle, and a multi-week setup process. The assumption is that the complexity demands it.

That assumption is wrong.

A benchmark is only as good as the tasks in it. And creating tasks that truly challenge frontier models in domains that require expert reasoning demands the same level of expertise the model is being tested on.

That's a harder constraint than it sounds. A question about Basel IV output floors or DIFC legal grounding isn't good enough for model evaluation in finance or law unless it demands multi-step reasoning across specialized knowledge that no amount of prompt engineering can shortcut. Creating expert-level evaluation tasks correctly requires someone who knows the domain.

The Toloka Platform handles specialized evaluation by routing tasks to verified professionals in 90+ disciplines. That routing happens in the platform, removing the need for separate procurement cycles or managed project overhead. Domain expertise is built directly into the pipeline from the start.

Here’s what that looks like in practice across two very different domains and languages: a French institutional finance benchmark designed to resist frontier models, and a legal dataset validated against DIFC source documents to gold-standard grounding precision. Both ran entirely self-serve.

French institutional finance: designing questions that break frontier models

Field	Value
Language	French
Domain coverage	Regulatory, corporate finance, quantitative markets
Question format	First-person practitioner scenarios
Task structure	Question + golden answer
Answer length	Up to 3,000 characters
Quality control	LLM QA
End use	AI evaluation benchmark

Any question in such a benchmark only has value if a general-purpose LLM can't answer it reliably. "Explain CVA in OTC derivatives" has been in pre-training data a thousand times over. It doesn't tell you anything about a model's capability in high-stakes financial contexts. The only questions worth having in an evaluation dataset are the ones that demand regulatory expertise to answer correctly, which means they can only be written by someone who has it.

Writing questions from real scenarios

Domain experts working through the Toloka Platform generated Q&A pairs across three areas:

Regulatory compliance: MiFID II/MiFIR, Basel III/IV, Solvency II, AIFMD, EMIR/SFTR
Corporate finance: IFRS 9, 15, 16, 17, consolidation and business combinations under IFRS 3, M&A valuation and purchase price allocation, LBO structures and covenant analysis
Quantitative markets: exotic options and volatility surfaces, XVA calculations (CVA, DVA, FVA), CLO/CDO mechanics, yield curve construction and inflation products

Every question had to be written from a practitioner's first-person perspective, be it a CFO working through a specific Basel IV output floor calculation against their IRB approach, or a risk manager deciding how negative MBS convexity should inform a dynamic hedging strategy. Each question required multi-step reasoning across regulatory and technical knowledge, the kind of problem where even knowing the domain isn't enough without understanding how its moving parts interact.

Answers ran up to 3,000 characters of technically accurate, current content in professional French, producing a dataset that frontier models couldn't shortcut their way through.

Where LLM QA stops and experts step in

The LLM QA layer caught issues in real time. Where it flagged something that needed deeper judgment, like a question that might not be hard enough or an answer that needed checking against current 2024 regulatory standards, that task moved to human expert review.
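
The escalation step can be pictured as a small routing rule. What follows is a minimal sketch under stated assumptions, not the platform's actual logic: the flag kinds ("difficulty", "regulatory_currency", "formatting"), the `Task` and `QAFlag` types, and the three outcomes are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class QAFlag:
    kind: str  # hypothetical label, e.g. "difficulty" or "regulatory_currency"
    note: str  # short explanation produced by the automated QA pass


@dataclass
class Task:
    task_id: str
    flags: list[QAFlag] = field(default_factory=list)


# Assumption: these flag kinds need deeper judgment than automated QA can give.
NEEDS_EXPERT = {"difficulty", "regulatory_currency"}


def route(task: Task) -> str:
    """Decide where a task goes after the automated QA pass."""
    if not task.flags:
        return "accepted"
    if any(flag.kind in NEEDS_EXPERT for flag in task.flags):
        return "expert_review"
    # Assumption: any other flag goes back to the author for revision.
    return "revise"


flagged = Task("q-017", [QAFlag("difficulty", "question may not be hard enough")])
print(route(flagged))
```

The design choice worth noting is that one judgment-type flag is enough to escalate. Automated QA narrows the queue, and the expert makes the final call.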

DIFC legal validation: exact sufficiency as a quality standard

Field	Value
Language	English
Document source	DIFC legal documents
Task type	Question + answer + reference documents
Quality standard	Exact page citation sufficiency
Answer formats	Number, boolean, names, date (YYYY-MM-DD), free text
Quality control	Human expert review
End use	Gold standard dataset for international RAG benchmark

The legal project had a different structure but the same underlying logic: a wrong answer isn't recognizably wrong unless you understand the source material well enough to check it.

Experts on the Toloka Platform validated synthetically generated Q&A pairs against DIFC source documents, verifying that every factual claim was grounded in the cited pages and that the page citations were exactly right.

Why exact sufficiency is difficult in practice

An answer can be factually correct and still fail. If the answer draws on information from pages one and five but only cites pages one and two, it gets rejected because the grounding is incomplete. The inverse applies with equal force: citing page four when only pages one and three are needed also fails, because unnecessary references undermine the integrity of a dataset being built as a gold standard for an international RAG benchmark competition.
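
Exact sufficiency comes down to set equality between the pages an answer cites and the pages it actually needs. The sketch below is illustrative only. The function name and return shape are assumptions, and in the real project the "required" set is what an expert establishes by reading the source documents, not something computed.

```python
def check_citations(cited: set[int], required: set[int]) -> dict:
    """Compare cited pages with the pages the answer actually relies on."""
    missing = sorted(required - cited)  # grounding is incomplete
    extra = sorted(cited - required)    # unnecessary references
    return {"ok": not missing and not extra, "missing": missing, "extra": extra}


# Answer draws on pages 1 and 5 but cites pages 1 and 2.
print(check_citations(cited={1, 2}, required={1, 5}))

# Page 4 is cited although only pages 1 and 3 are needed.
print(check_citations(cited={1, 3, 4}, required={1, 3}))
```

Reporting missing and extra pages separately keeps the rejection reason explicit, since a single citation list can fail in both directions at once.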

Format compliance added another layer. Dates had to be in ISO 8601 format — "2024-03-15" not "March 15, 2024" — regardless of how the source document presented them. Numbers needed to be pure numerals with no currency symbols or units. Free text answers were capped at 280 characters in a professional register. A correct answer in the wrong format failed the same way a wrong answer did.
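
The three format rules stated above are mechanical enough to check in code. This is a rough sketch, assuming answers arrive as strings tagged with a type. The type labels and function names are hypothetical. Allowing a minus sign or decimal point in numbers is also an assumption, since the source only says "pure numerals". Professional register is a human judgment and is not checked here.

```python
import re
from datetime import date

MAX_FREE_TEXT_CHARS = 280


def is_iso_date(value: str) -> bool:
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2024-02-30
    except ValueError:
        return False
    return True


def is_plain_number(value: str) -> bool:
    """Digits only: no currency symbols, units, or thousands separators."""
    return re.fullmatch(r"-?\d+(\.\d+)?", value) is not None


def check_format(answer_type: str, value: str) -> bool:
    if answer_type == "date":
        return is_iso_date(value)
    if answer_type == "number":
        return is_plain_number(value)
    if answer_type == "free_text":
        return len(value) <= MAX_FREE_TEXT_CHARS
    raise ValueError(f"unsupported answer type: {answer_type}")


print(check_format("date", "2024-03-15"))      # True
print(check_format("date", "March 15, 2024"))  # False
```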

How quality is assessed

Validators were assessed on:

Grounding quality
Page accuracy
Format compliance
The clarity of their reasoning

A correct call supported by vague reasoning still failed. The point was auditability, with every decision traceable to specific text in a way another validator could independently verify.
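
One way to picture auditability is a decision record in which every piece of reasoning points at a page and quotes the text it relies on, so that a second validator can confirm the quote is really there. The sketch below is an assumption-laden illustration. The `Evidence` and `Decision` types, the `pages` mapping, and the placeholder page text are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    page: int
    quote: str  # exact wording the validator relied on


@dataclass(frozen=True)
class Decision:
    verdict: str  # "accept" or "reject"
    evidence: tuple[Evidence, ...]


def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case so line breaks don't cause misses."""
    return " ".join(text.split()).lower()


def is_auditable(decision: Decision, pages: dict[int, str]) -> bool:
    """True if every quote can be found on the page it points to."""
    if not decision.evidence:
        return False  # a verdict with no supporting text cannot be re-checked
    return all(
        item.quote.strip() != ""
        and item.page in pages
        and _normalize(item.quote) in _normalize(pages[item.page])
        for item in decision.evidence
    )


# Placeholder page text, not taken from any real document.
pages = {1: "Placeholder clause: the fee is payable\nwithin 30 days."}
decision = Decision("accept", (Evidence(1, "the fee is payable within 30 days"),))
print(is_auditable(decision, pages))
```

A substring check is deliberately strict. It cannot judge whether the reasoning is sound, only whether it is traceable, and traceability is what lets another validator check the call independently.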

Self-service, all the way down

Both projects ran entirely on the Toloka Platform. The teams configured their projects, and the platform's routing logic matched each task to the right level of domain expertise (institutional finance specialists for the French benchmark, legal experts for the DIFC validation) without a separate procurement cycle or managed service escalation.

The cases above aren't exceptions reserved for managed engagements. They're examples of what any team can run directly: verified domain specialists, LLM QA, full pipeline control, no minimums. The complexity of the task doesn't determine the access model. You get the same expert network and the same quality stack whether you're running a one-off experiment or a production benchmark pipeline.

When the benchmark demands expertise, the platform should already have it — and hand you the controls.

