How Toloka evaluates search relevance in the age of AI search

November 25, 2025

For the first time in almost two decades, search is fundamentally changing. The new generation of AI-driven search goes beyond retrieving web pages and uses reasoning and synthesis with the aim of delivering richer, more helpful answers.

At Toloka, we work with teams building advanced search systems to evaluate how well their engines respond to real user intent. Over time, we have refined a methodology that balances automation with expert human judgment.

Below, we outline how that approach works and why human evaluation remains integral to AI search.

Why relevance remains the foundation of search

No matter how advanced a search engine becomes, its value still depends on relevance. Models learn to rank and interpret results more accurately through exposure to judged examples of what “good” looks like in an answer.

When evaluations are consistent and well-defined, they provide a stronger signal for measuring relevance and understanding how search quality changes over time.

Producing context-aware results in multiple languages remains a difficult balance for anyone building modern search. Automated checks can measure surface-level accuracy, but subtle qualities such as intent and credibility still depend on human judgment.

To address that, our framework combines two layers of evaluation: static golden sets for automated checks and ongoing evaluation of live model outputs.
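To make the two layers concrete, here is a minimal Python sketch of how they might fit together. The function names, golden-set fields, and sampling rate are illustrative assumptions, not a description of Toloka’s actual tooling.

```python
from dataclasses import dataclass
import random

# Hypothetical illustration of the two evaluation layers: a static golden set
# for automated regression checks, plus sampling of live outputs for expert
# human review. All names, fields, and thresholds are invented for this sketch.

@dataclass
class GoldenExample:
    query: str
    expected_top_url: str  # the URL a well-behaved engine should rank first

def auto_eval(golden_set: list[GoldenExample], search_fn) -> float:
    """Fraction of golden queries whose top result matches the expected URL."""
    hits = 0
    for example in golden_set:
        results = search_fn(example.query)
        if results and results[0] == example.expected_top_url:
            hits += 1
    return hits / len(golden_set) if golden_set else 0.0

def sample_for_human_review(live_queries: list[str], rate: float = 0.05) -> list[str]:
    """Route a small random slice of live traffic to expert annotators."""
    return [q for q in live_queries if random.random() < rate]
```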

Step one: identifying user intent

Every evaluation begins by understanding the user’s intent behind the query. Without that, it’s impossible to judge whether a result answers the question being asked.

We categorize queries by intent, such as navigational or informational, but annotators go further by examining how specific phrasing changes the meaning. A search for “company careers” points to employment opportunities, while “company homepage” means something else entirely. The same distinction applies when users include platforms in their queries: “reddit tennis racket mistakes” doesn’t seek a single Reddit page, but relevant discussions within it.

Clarifying the intent lets annotators judge whether search results meet expectations. For navigational queries, the first link should lead directly to the intended destination. For informational ones, results need to come from credible, well-maintained sources. For transactional intent, users should be able to complete an action without additional steps.

The process often reveals queries that are ambiguous or impossible to satisfy. A query like “best smartphones under $500”, for example, combines research intent with an interest in buying. Recognizing those indicators early on helps prevent flawed data from skewing the evaluation and keeps the overall assessment more reliable.
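As an illustration of step one, here is a hypothetical label schema for recording intent. The category names (navigational, informational, transactional) come from the discussion above; the field names and the mixed-intent convention are assumptions made for this sketch.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical intent-labeling schema; the categories follow the post,
# everything else is illustrative only.

class Intent(Enum):
    NAVIGATIONAL = "navigational"
    INFORMATIONAL = "informational"
    TRANSACTIONAL = "transactional"

@dataclass
class IntentJudgment:
    query: str
    intents: list[Intent]       # more than one entry marks mixed intent
    unanswerable: bool = False  # no existing page could reasonably satisfy it
    notes: str = ""             # annotator's reading of the specific phrasing

# Example: the mixed research-plus-purchase query mentioned above.
example = IntentJudgment(
    query="best smartphones under $500",
    intents=[Intent.INFORMATIONAL, Intent.TRANSACTIONAL],
    notes="Research intent combined with an interest in buying.",
)
```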

Step two: assessing relevance and quality

Once the intent is defined, the next task is to evaluate how well the search results align with it.

To determine how two systems differ in quality, we can compare both individual webpages from query results and complete search engine results pages (SERPs).

For the second case, annotators primarily focus on the top portion of each SERP, where user attention is concentrated and model errors are most visible.

Each link is rated according to four key signals: 

  1. Relevance itself measures how directly the result answers the query.

  2. Helpfulness considers whether the content is usable and presented in the right language. 

  3. Trustworthiness focuses on the authority of the source, especially when it comes to sensitive or expert topics. 

  4. Freshness matters when the subject depends on up-to-date information, such as news or live data.

Each principle is flexible rather than formulaic and depends on the query type and domain. A historical topic may score highly even if it’s years old, while a sports update might be downgraded within hours. Evaluators provide short explanations alongside their labels, describing what worked or failed. Those notes become invaluable for refining both automated scorers and annotation guidelines later.
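To illustrate, here is one way a per-link judgment with these four signals and the accompanying rationale could be structured. The 0–2 scale, field names, and top-of-SERP averaging are assumptions for the sketch, not Toloka’s actual rubric.

```python
from dataclasses import dataclass

# Hypothetical record for a single rated link, mirroring the four signals
# described above. The 0-2 scale and field names are illustrative only.

@dataclass
class LinkJudgment:
    query: str
    url: str
    relevance: int        # 0 = off-topic, 1 = partially relevant, 2 = directly answers
    helpfulness: int      # usable content, presented in the right language
    trustworthiness: int  # source authority, weighted up for sensitive topics
    freshness: int        # only meaningful when the topic is time-sensitive
    rationale: str        # short note on what worked or failed

def serp_score(judgments: list[LinkJudgment], top_k: int = 5) -> float:
    """Average relevance over the top of the SERP, where attention concentrates."""
    top = judgments[:top_k]
    return sum(j.relevance for j in top) / len(top) if top else 0.0
```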

The human element behind automation

Toloka’s evaluation pipelines are built for efficiency, but they rely on people with a deep understanding of language and context. AI tutors (highly trained annotators) review results in more than ten languages. Each tutor goes through calibration rounds where feedback from Toloka’s specialists and the client’s team helps align interpretations of the same guidelines.

Automation supports the process rather than replacing it: machine-learning models help flag inconsistencies and check annotation coverage. The result is higher throughput, with expert reviewers staying in control of the final judgment.
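A minimal sketch of the kind of consistency check such automation might run: flag items where annotators disagree so that an expert reviewer makes the final call. The data layout and tolerance are assumptions for illustration.

```python
from collections import defaultdict

def flag_disagreements(labels: list[tuple[str, str, int]], tolerance: int = 0) -> list[str]:
    """labels: (item_id, annotator_id, score). Returns item_ids needing expert review."""
    by_item: dict[str, list[int]] = defaultdict(list)
    for item_id, _annotator, score in labels:
        by_item[item_id].append(score)
    return [
        item_id
        for item_id, scores in by_item.items()
        if max(scores) - min(scores) > tolerance
    ]

# Example: annotators split on "q42", so it goes back to an expert reviewer.
flagged = flag_disagreements([
    ("q42", "a1", 2), ("q42", "a2", 0),
    ("q7", "a1", 1), ("q7", "a2", 1),
])
```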

That balance is central to projects with AI search companies whose products operate across different regions and languages. A literal translation might be accurate linguistically but fail to match local expectations. Native-language reviewers can catch nuances that automated metrics would typically miss, giving the client confidence that model quality reflects how users search.

Dealing with real-world edge cases

Not every query fits cleanly into a single intent or produces a straightforward answer. Some combine multiple goals, like wanting to compare and then buy a product. Others are effectively unanswerable. Generative or nonsensical inputs, where no existing webpage could reasonably satisfy the query, are also common in real-world data.

Rather than discarding these examples, our annotators flag them with context so they can be analyzed separately. By doing this, the client better understands where their model’s boundaries lie and avoids drawing false conclusions from noise.
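For illustration, here is a hypothetical way to flag such edge cases with context and set them aside for separate analysis rather than discarding them; the reason categories and helper names are assumptions drawn from the examples above.

```python
from dataclasses import dataclass

@dataclass
class EdgeCaseFlag:
    query: str
    reason: str   # e.g. "mixed_intent", "unanswerable", "nonsensical"
    context: str  # free-text note explaining the annotator's call

def split_edge_cases(judgments, flags: list[EdgeCaseFlag]):
    """Separate flagged queries from the main evaluation pool for separate analysis."""
    flagged_queries = {f.query for f in flags}
    main = [j for j in judgments if j.query not in flagged_queries]
    edge = [j for j in judgments if j.query in flagged_queries]
    return main, edge
```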

Location adds another layer of complexity, since search is global and local at the same time. A political query might have different relevance criteria in the US, UK, or Canada, and a local business search in one language might still surface multilingual results.

Because AI tutors are native speakers, they can distinguish between acceptable variation and genuine error. Native understanding builds confidence that quality metrics reflect how users in each market experience search.

From evaluation to actionable insight

Relevance evaluation shows how a search engine behaves in practice and where its results can improve.

Feedback from our evaluators is used to improve scoring systems and refine how relevance is measured over time. The outcome is a stable process in which the search engine can evolve without losing its grounding in human expectations.

The collaboration model is iterative, and early phases often reveal small misalignments in definitions or weighting, which are refined together. Once calibration stabilizes, the framework can operate continuously and feed fresh data to sustain model quality at scale.

A continuous loop of human and machine learning

Evaluating search relevance at scale means understanding when a result genuinely answers the question, and Toloka’s evaluators turn that insight into signals models can learn from.

The systems that power modern search may be increasingly autonomous, but relevance remains a human benchmark. With each evaluation cycle, that benchmark becomes more defined and guides the AI models toward the kind of results people trust.

Get in touch to discuss how our methodology strengthens your own search relevance evaluation.
