Evaluation experiments for confident decisions: a case study on new search engine functionality

Toloka Team

August 21, 2023

Customer cases

About the client

Our client, a search engine developer, added new generative AI functionality to the search results page. The goal was to find out which version of the new feature users prefer, to ensure a successful launch. The development team wanted an explicit signal from real people to help them make the right decision before going live with the product update.

Challenge

The new feature detects an object in a search query and uses a language model to generate an object answer: a brief summary of the object. For instance, if a user searches for [Kia Rio], the model outputs a list of information about the car and a set of images. This answer is shown separately from the search results, in a special section on the right side.

The team developed two versions of the feature: one with a list of details, and one with mostly images. They assumed that users prefer images, but they needed to compare the versions and confirm which one is better.

The client sometimes uses A/B testing to track user behavior in production (by measuring clicks and other actions), but that method wouldn’t provide useful metrics for this feature. Since the answer is shown next to the search results, they expected users to get information without clicking on it. The goal was to get an explicit signal about the user experience before launching the feature.

Solution

We set up a side-by-side project to compare the two versions and asked the Toloka crowd to choose the option they prefer. In the evaluation task for the query [Kia Rio], for example, participants were asked which variant is more informative.

By posing this question to a large group of people, we explicitly measured user preference. We were able to directly ask about specific parts of the screen and obtain concrete results for the client.
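As a rough illustration of how side-by-side votes can be turned into a preference measure, the sketch below takes a majority vote per query and then reports the share of queries each variant won. The case study does not describe the aggregation method actually used, so the function name, the "A"/"B" labels, and the sample votes are all hypothetical.

```python
from collections import Counter


def aggregate_side_by_side(votes):
    """Aggregate (query, choice) votes, where choice is "A" or "B".

    Returns a dict of per-query winners (None on a tie) and the share
    of decided queries won by each variant.
    """
    per_query = {}
    for query, choice in votes:
        per_query.setdefault(query, Counter())[choice] += 1

    winners = {}
    for query, counts in per_query.items():
        a, b = counts["A"], counts["B"]
        winners[query] = "A" if a > b else "B" if b > a else None

    # Ties carry no preference signal, so they are left out of the shares.
    decided = [w for w in winners.values() if w is not None]
    if not decided:
        return winners, {}
    share = {v: decided.count(v) / len(decided) for v in ("A", "B")}
    return winners, share


# Hypothetical votes: three participants per query.
sample_votes = [
    ("kia rio", "A"), ("kia rio", "A"), ("kia rio", "B"),
    ("eiffel tower", "B"), ("eiffel tower", "B"), ("eiffel tower", "A"),
    ("python", "A"), ("python", "A"), ("python", "A"),
]
winners, share = aggregate_side_by_side(sample_votes)
```

Majority voting is only one option. A production setup would typically also randomize which variant appears on the left and weight votes by participant quality, neither of which is shown here.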

Occasionally, Tolokers identified cases where the model generated uninteresting or irrelevant results. As an extra benefit of the evaluation process, these queries were passed back to the client’s team to analyze and identify areas for improving the language model. The client uses a similar process to systematically detect issues in search results on a large scale with dissatisfaction analytics (DSAT).

Business impact

The end results were surprising. The initial assumption was that users would like to see more images, but the version with the list of details was actually preferred in 75% of comparisons, with high statistical significance. For the client, this was a clear indicator of which version to implement in production.
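To make "high statistical significance" concrete, here is a minimal sketch of one common way to check a side-by-side result: an exact two-sided binomial test against the null hypothesis of no preference (a 50/50 split), plus a Wilson confidence interval for the win rate. The case study reports neither the number of comparisons nor the test that was used, so the figure of 200 comparisons below is purely hypothetical, as are the function names.

```python
from math import exp, lgamma, log, sqrt


def binomial_log_pmf(k, n, p):
    """Log-probability of exactly k successes in n trials (0 < p < 1)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))


def binomial_two_sided_p(wins, n, p0=0.5):
    """Exact two-sided p-value: total probability of all outcomes
    that are no more likely than the observed one under p0."""
    pmf = [exp(binomial_log_pmf(k, n, p0)) for k in range(n + 1)]
    threshold = pmf[wins] * (1 + 1e-7)  # tolerance for float rounding
    return min(1.0, sum(p for p in pmf if p <= threshold))


def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    p_hat = wins / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical example: 150 wins out of 200 comparisons (75%).
p_value = binomial_two_sided_p(150, 200)
low, high = wilson_interval(150, 200)
```

With these assumed numbers, the p-value is far below conventional thresholds and the whole interval lies above 50%, which is the kind of evidence that supports a clear choice between the two versions. With far fewer comparisons, the same 75% win rate would be much less conclusive.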

Benefits of side-by-side comparisons

Side-by-side comparisons are an effective tool for confident decision-making based on direct human feedback. This type of evaluation is often overlooked, but it’s versatile and straightforward enough to apply in a wide variety of scenarios. For evaluating search performance, it is also a fast and accurate way to measure aspects like the freshness and diversity of search results, the overall quality of results and ranking, and the effect of visual design and formatting on user experience. All of these contribute to user satisfaction just as much as search relevance does.
