Toloka Team
Boosting revenue via DSAT: offline evaluation to enhance e-commerce search performance
About the client
Any online marketplace relies on its product search to help customers find items, and search drives a large portion of sales. Our client, one of the largest e-commerce platforms in the EMEA region, aimed to boost their revenue by improving the quality of search results.
Challenge
The client’s team knew that their site search was underperforming, but they didn’t know exactly why. They chose the most efficient path to uncovering the underlying problems: offline evaluation of search results.
The goal was to perform dissatisfaction analytics (DSAT), which involves in-depth analysis of specific cases where a machine learning model or product fails to provide an acceptable level of service. DSAT is a high-precision tool that can help teams identify pain points and enhance their product.
Solution
The Toloka team set up a DSAT process to identify problems in the client’s product search.
The overall process has 5 steps:
Select a stratified sample of search queries to analyze for search relevance.
Label relevance for pairs of queries and search results.
Extract results labeled as “irrelevant” and categorize them to find the queries that represent fixable problems or pain points.
Label the queries with fixable problems and identify which issues are most prevalent.
Fix issues in search algorithms and measure search quality to track improvements.
Ideally, these steps are repeated on a regular basis (every quarter, for example).
Let’s look at how we handled each step.
Step 1. Sampling search queries
The client provided data on search queries, and we selected a stratified sample of 20,000 queries based on their frequency. The resulting dataset contained a balanced mix of high-, average-, and low-frequency queries.
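The sampling step can be sketched in a few lines of Python. This is a minimal illustration, not the client's actual pipeline: the frequency cutoffs and the equal per-stratum split are assumptions chosen for clarity.

```python
import random


def stratified_query_sample(query_freqs, n_total, hi_cutoff=1000, lo_cutoff=50, seed=42):
    """Split queries into high/mid/low frequency strata and draw an
    equal share from each, so rare queries are not drowned out by
    popular ones. Cutoff values here are illustrative only."""
    strata = {"high": [], "mid": [], "low": []}
    for query, freq in query_freqs.items():
        if freq >= hi_cutoff:
            strata["high"].append(query)
        elif freq >= lo_cutoff:
            strata["mid"].append(query)
        else:
            strata["low"].append(query)

    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    per_stratum = n_total // len(strata)
    sample = []
    for queries in strata.values():
        # A stratum may hold fewer queries than its quota
        sample.extend(rng.sample(queries, min(per_stratum, len(queries))))
    return sample
```

In practice the strata sizes and cutoffs would be tuned to the marketplace's real query-frequency distribution.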
Step 2. Labeling search relevance
Once we acquired our sample of search queries, we paired them with search results and labeled search relevance using Toloka’s global crowd. To ensure labeling accuracy, we used high overlap: multiple annotators rated each query–result pair, and their judgments were aggregated.
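Aggregating overlapping judgments can be as simple as a majority vote per query–result pair. The sketch below assumes binary relevant/irrelevant labels; Toloka's production pipelines support more sophisticated aggregation models, and this is only the simplest baseline.

```python
from collections import Counter


def aggregate_labels(judgments):
    """Majority vote over the overlapping judgments for one
    (query, result) pair. Returns the winning label and the share
    of annotators who chose it, as a rough confidence score."""
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    return label, votes / len(judgments)
```

With an overlap of three, two matching votes are enough to settle a pair; higher overlap raises confidence at higher labeling cost.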
Step 3. Categorizing “irrelevant” results
About 5% of the search queries showed no relevant products in the top six search results. We focused on this set (about 1,000 queries) and asked the crowd to categorize them using a series of questions. This helped weed out pointless search sessions, such as nonsense queries and searches for products that aren’t sold on the site. The result was a clearly defined set of queries where the product search should have shown relevant items, but failed.
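Extracting the failed queries from the labeled data is a straightforward filter. In this sketch, the input structure and the "relevant"/"irrelevant" label names are assumptions; the top-six window matches the cutoff described above.

```python
def failed_queries(labeled_results, top_k=6):
    """labeled_results maps each query to the ordered list of
    aggregated labels for its search results. A query 'fails' when
    none of its top_k results was labeled relevant."""
    failed = []
    for query, labels in labeled_results.items():
        if not any(label == "relevant" for label in labels[:top_k]):
            failed.append(query)
    return failed
```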
The image shows the overall process and the questions used for categorizing results.
Overview of the DSAT process and questions for categorizing failed searches.
Step 4. Identifying issues
Each “problematic” query in the final set was labeled to identify the problem. Three main issues were discovered:
Wrong category: search results were shown for the wrong category of products (like books instead of electronics).
Wrong sorting: search results were sorted incorrectly, with irrelevant items at the top.
Typos: misspelled words in the query were not detected and the intent was misunderstood.
After we identified the percentage of failed searches affected by each type of issue, the team was able to prioritize which issues to tackle first.
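Once each failed query carries an issue label, computing prevalence is a simple frequency count. The issue names below are placeholders for the categories found in this project.

```python
from collections import Counter


def issue_prevalence(issue_labels):
    """Compute the share of failed searches affected by each issue
    type, sorted from most to least common so the team can
    prioritize which fixes to tackle first."""
    counts = Counter(issue_labels)
    total = len(issue_labels)
    return [(issue, count / total) for issue, count in counts.most_common()]
```

The output ranking is exactly what feeds the prioritization decision: the most prevalent issue type is usually the first candidate for a fix.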
Main issues discovered and their prevalence, used for prioritizing improvements.
Business impact
Offline evaluation with DSAT helped to pinpoint three main issues to focus on for improving product search. The team then used search quality metrics in an improvement cycle to measure the impact of changes in the target areas.
The result was an 8% improvement in overall search relevance, with a clear connection to growth in gross merchandise value (GMV) for the marketplace.
Updated:
Sep 25, 2023