Evaluation experiments for confident decisions: a case study on new search engine functionality
About the client
Our client, a search engine developer, added new generative AI functionality to the search results page. The goal was to determine which version of the new feature users prefer, to ensure a successful launch. The development team wanted an explicit signal from real people to help them make the right decision before going live with the product update.
Challenge
The new feature detects an object in a search query and uses a language model to generate an object answer: a brief summary of the object. For instance, if a user searches for [Kia Rio], the model outputs a list of facts about the car and a set of images. This answer is shown separately from the search results, in a special section on the right side of the page.
The team developed two versions of the feature: one with a list of details, and one with mostly images. They assumed that users would prefer images, but they needed to compare the two versions and confirm which one was better.
The client sometimes uses A/B testing to track user behavior in production (by measuring clicks and other actions), but that method wouldn’t provide useful metrics for this feature. Since the answer is shown next to the search results, they expected users to get information without clicking on it. The goal was to get an explicit signal about the user experience before launching the feature.
Solution
We set up a side-by-side project to compare the two versions and asked the Toloka crowd to choose the option they liked best. The image shows the evaluation task for the query [Kia Rio], where participants were asked which variant was more informative.
By posing this question to a large group of people, we explicitly measured user preference. We were able to directly ask about specific parts of the screen and obtain concrete results for the client.
Occasionally, Tolokers identified cases where the model generated uninteresting or irrelevant results. As an extra benefit of the evaluation process, these queries were passed back to the client’s team to analyze and identify areas for improving the language model. The client uses a similar process to systematically detect issues in search results on a large scale with dissatisfaction analytics (DSAT).
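For readers curious how side-by-side votes can be turned into a launch decision, here is a minimal sketch of the kind of aggregation involved: a majority vote per query, followed by an exact binomial test against a 50/50 baseline. The vote data and function names below are illustrative assumptions, not the client's actual pipeline.

```python
from collections import Counter
from math import comb

def majority_preference(votes):
    """Return the option ('A' or 'B') most annotators chose for one query."""
    return Counter(votes).most_common(1)[0][0]

def binomial_p_value(wins, n, p=0.5):
    """One-sided exact binomial test: probability of seeing `wins` or more
    wins out of `n` queries if both versions were equally preferred."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(wins, n + 1))

# Hypothetical per-query annotator votes:
# A = detailed-characteristics version, B = image-heavy version.
per_query_votes = [
    ["A", "A", "B"],
    ["A", "B", "A"],
    ["B", "B", "A"],
    ["A", "A", "A"],
    # ... one list of votes per evaluated query
]

winners = [majority_preference(v) for v in per_query_votes]
wins_a, n = winners.count("A"), len(winners)
print(f"Version A preferred on {wins_a}/{n} queries "
      f"(p = {binomial_p_value(wins_a, n):.3f} vs. a 50/50 split)")
```

With only a handful of example queries the test cannot reach significance; in the real project the comparison covered a much larger set of queries and annotators, which is what allowed the observed preference to be statistically significant.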
Business impact
The end results were surprising. The initial assumption was that users would want to see more images, but the version with detailed characteristics actually won 75% of the comparisons, with high statistical significance. For the client, this was a clear indicator of which version to implement in production.
Benefits of side-by-side comparisons
Side-by-side comparisons are an effective tool for confident decision-making based on direct human feedback. This type of evaluation is often overlooked, but it's versatile and straightforward enough to apply in a wide variety of scenarios. For evaluating search performance, it is also a fast and accurate way to measure aspects like freshness and diversity of search results, the overall quality of results and ranking, and how visual design and formatting affect user experience. All of these factors contribute to user satisfaction just as much as search relevance does.
Article written by: Toloka Team
Updated: Aug 21, 2023