Toloka Team
Boosting revenue via DSAT: offline evaluation to enhance e-commerce search performance
About the client
Any online marketplace relies on its product search to help customers find items, and search drives a large portion of sales. Our client, one of the largest e-commerce platforms in the EMEA region, aimed to boost their revenue by improving the quality of search results.
Challenge
The client’s team knew that their site search was underperforming, but they didn’t know exactly why. They chose the most efficient path to uncovering the underlying problems: offline evaluation of search results.
The goal was to perform dissatisfaction analytics (DSAT), which involves in-depth analysis of specific cases where a machine learning model or product fails to provide an acceptable level of service. DSAT is a high-precision tool that can help teams identify pain points and enhance their product.
Solution
The Toloka team set up a DSAT process to identify problems in the client’s product search.
The overall process has 5 steps:
Select a stratified sample of search queries to analyze for search relevance.
Label relevance for pairs of queries and search results.
Extract results labeled as “irrelevant” and categorize them to find the queries that represent fixable problems or pain points.
Label the queries with fixable problems and identify which issues are most prevalent.
Fix issues in search algorithms and measure search quality to track improvements.
Ideally, these steps are repeated on a regular basis (every quarter, for example).
Let’s look at how we handled each step.
Step 1. Sampling search queries
The client provided data on search queries, and we selected a stratified sample of 20,000 queries based on their frequency. The resulting dataset contained a balanced mix of high-, average-, and low-frequency queries.
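The sampling step can be sketched in a few lines of Python. This is a minimal illustration, not the client's actual pipeline: the frequency cutoffs and the equal per-stratum split are assumptions chosen for clarity.

```python
import random


def stratified_query_sample(query_freqs, n_total, hi_cutoff=1000, lo_cutoff=50, seed=42):
    """Split queries into high/mid/low frequency strata and draw an
    equal share from each, so rare queries are not drowned out by
    popular ones. Cutoff values here are illustrative only."""
    strata = {"high": [], "mid": [], "low": []}
    for query, freq in query_freqs.items():
        if freq >= hi_cutoff:
            strata["high"].append(query)
        elif freq >= lo_cutoff:
            strata["mid"].append(query)
        else:
            strata["low"].append(query)

    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    per_stratum = n_total // len(strata)
    sample = []
    for queries in strata.values():
        # A stratum may hold fewer queries than its quota
        sample.extend(rng.sample(queries, min(per_stratum, len(queries))))
    return sample
```

In practice the strata sizes and cutoffs would be tuned to the marketplace's real query-frequency distribution.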
Step 2. Labeling search relevance
Once we acquired our sample of search queries, we paired them with search results and labeled search relevance using Toloka’s global crowd. To ensure labeling accuracy, we used high overlap: multiple annotators rated each query–result pair, and their judgments were aggregated.
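Aggregating overlapping judgments can be as simple as a majority vote per query–result pair. The sketch below assumes binary relevant/irrelevant labels; Toloka's production pipelines support more sophisticated aggregation models, and this is only the simplest baseline.

```python
from collections import Counter


def aggregate_labels(judgments):
    """Majority vote over the overlapping judgments for one
    (query, result) pair. Returns the winning label and the share
    of annotators who chose it, as a rough confidence score."""
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(judgments)
```

With an overlap of three, two matching votes are enough to settle a pair; higher overlap raises confidence at higher labeling cost.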
Step 3. Categorizing “irrelevant” results
About 5% of the search queries showed no relevant products in the top six search results. We focused on this set (about 1,000 queries) and asked the crowd to categorize them using a series of questions. This helped weed out pointless search sessions, such as nonsense queries and searches for products that aren’t sold on the site. The result was a clearly defined set of queries where the product search should have shown relevant items, but failed.
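Extracting the failed queries from the labeled data is a straightforward filter. In this sketch, the input structure and the "relevant"/"irrelevant" label names are assumptions; the top-six window matches the cutoff described above.

```python
def failed_queries(labeled_results, top_k=6):
    """labeled_results maps each query to the ordered list of
    aggregated labels for its search results. A query 'fails' when
    none of its top_k results was labeled relevant."""
    failed = []
    for query, labels in labeled_results.items():
        if not any(label == "relevant" for label in labels[:top_k]):
            failed.append(query)
    return failed
```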
The image shows the overall process and the questions used for categorizing results.
Overview of the DSAT process and questions for categorizing failed searches.
Step 4. Identifying issues
Each “problematic” query in the final set was labeled to identify the problem. Three main issues were discovered:
Wrong category: search results were shown for the wrong category of products (like books instead of electronics).
Wrong sorting: search results were sorted incorrectly, with irrelevant items at the top.
Typos: misspelled words in the query were not detected and the intent was misunderstood.
After we identified the percentage of failed searches affected by each type of issue, the team was able to prioritize which issues to tackle first.
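Once each failed query carries an issue label, computing prevalence is a simple frequency count. The issue names below are placeholders for the categories found in this project.

```python
from collections import Counter


def issue_prevalence(issue_labels):
    """Compute the share of failed searches affected by each issue
    type, sorted from most to least common so the team can
    prioritize which fixes to tackle first."""
    counts = Counter(issue_labels)
    total = len(issue_labels)
    return [(issue, count / total) for issue, count in counts.most_common()]
```

The output ranking is exactly what feeds the prioritization decision: the most prevalent issue type is usually the first candidate for a fix.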
Main issues discovered and their prevalence, used for prioritizing improvements.
Business impact
Offline evaluation with DSAT helped to pinpoint three main issues to focus on for improving product search. The team then used search quality metrics in an improvement cycle to measure the impact of changes in the target areas.
The result was an 8% improvement in overall search relevance, with a clear connection to growth in gross merchandise value (GMV) for the marketplace.
Updated:
Sep 25, 2023