Data labeling to improve the quality of search results in an online store

by Toloka Team

About Ozon

Ozon is Russia's leading multi-category e-commerce platform, which offers more than 9 million SKUs across 24 different categories.

Use Case

Ozon uses Toloka to create reference samples, which serve several purposes:

  • To evaluate the quality of the new search engine.
  • To determine the most effective ranking model.
  • To improve the quality of the search algorithm using machine learning.

Test run

Ozon employees created the first test sample manually: they took 100 search queries and labeled the results themselves. Even this small sample helped identify problems in the search engine and define the evaluation criteria. The company considered building its own tool for evaluating search quality and hiring and training its own assessors, but that would have taken too long, so it chose a ready-made crowdsourcing platform instead.

Training turned out to be the hardest stage of the task for performers: even Ozon employees failed the first test task. Using feedback from the team, a new test was developed. Training now progressed from simple to complex, and the tasks accounted for the performer qualities that mattered to the company.

To eliminate errors, Ozon did a test run. The task consisted of three blocks: training, a control set with a 60% threshold for correct answers, and the main task with an 80% threshold for correct answers. To improve the quality of the sample, each task was offered to five tolokers.
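The gating and overlap described above can be sketched in a few lines of Python. This is only an illustration, not Ozon's or Toloka's actual implementation: the 60% and 80% thresholds and the overlap of five come from the article, while the function names, the data shapes, and the rule of treating a tied vote as "no majority" are assumptions.

```python
from collections import Counter

CONTROL_THRESHOLD = 0.60  # control set threshold (from the article)
MAIN_THRESHOLD = 0.80     # main task threshold (from the article)
OVERLAP = 5               # each task was offered to five tolokers


def accuracy(answers, golden):
    """Share of golden tasks answered correctly.

    `answers` and `golden` map task id -> label (hypothetical shapes).
    Golden tasks the performer did not answer count as wrong.
    """
    if not golden:
        raise ValueError("golden set must not be empty")
    correct = sum(1 for task_id, label in golden.items()
                  if answers.get(task_id) == label)
    return correct / len(golden)


def passes(answers, golden, threshold):
    """True if the performer meets the threshold on the golden tasks."""
    return accuracy(answers, golden) >= threshold


def majority_label(labels):
    """Return (label, vote share) for the most common label.

    The label is None when the top two labels are tied (an assumption;
    the article does not say how ties were handled).
    """
    ranked = Counter(labels).most_common()
    top_label, top_votes = ranked[0]
    share = top_votes / len(labels)
    if len(ranked) > 1 and ranked[1][1] == top_votes:
        return None, share
    return top_label, share
```

For example, a performer with three correct answers out of four golden tasks would clear the control threshold but not the main one, and five votes of which three say "good" would aggregate to "good" with a 0.6 vote share.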

[Figure: test run statistics]

Main launch

The scenario of the main launch was more complex: it involved new tolokers as well as those who had earned the necessary skill during the test stage. Newcomers went through the standard procedure, while experienced tolokers were admitted to the main tasks straightaway. For the main launch, two additional skills were added: the percentage of correct answers in the main sample and the majority vote score. As before, each task was offered to five tolokers.
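The article does not define how the majority vote score was calculated. One plausible reading, sketched below, is the share of a toloker's answers that match the majority label among all responses to the same task. The function name, the `(task_id, worker_id, label)` tuple format, and the decision to skip tasks with a tied vote are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def majority_vote_scores(responses):
    """Compute a per-worker agreement score.

    `responses` is an iterable of (task_id, worker_id, label) tuples.
    Returns {worker_id: share of that worker's answers that match the
    task's majority label}. Tasks whose top vote is tied are skipped,
    since they have no clear majority to compare against.
    """
    by_task = defaultdict(list)
    for task_id, worker_id, label in responses:
        by_task[task_id].append((worker_id, label))

    agreed = Counter()
    judged = Counter()
    for votes in by_task.values():
        ranked = Counter(label for _, label in votes).most_common()
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            continue  # tied vote: no clear majority
        winner = ranked[0][0]
        for worker_id, label in votes:
            judged[worker_id] += 1
            if label == winner:
                agreed[worker_id] += 1
    return {worker: agreed[worker] / judged[worker] for worker in judged}
```

A score like this could then be compared against a cutoff to decide who keeps access to the main tasks; the article does not state which cutoff, if any, was used.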

[Figure: main launch statistics]

Now the Ozon task on Toloka looks like this:

[Figure: Ozon task on Toloka]

The toloker sees the search query and nine products from the search results. Their task is to rate the results by choosing one of the following:

  • "perfect"
  • "good"
  • "might fit"
  • "doesn't fit"
  • "page not found"

The last value helps identify technical problems on the website. To simulate user behavior as accurately as possible, the developers recreated the interface of the online store in an iframe.
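Graded labels like these can be turned into a single ranking score per query. The article does not say which metric Ozon used; the sketch below uses NDCG with linear gains purely as a common example. The numeric gain assigned to each label, and the choice to drop "page not found" results from the ranking score and report them separately, are assumptions.

```python
import math

# Hypothetical numeric gains for the relevance labels in the task.
GAIN = {"perfect": 3, "good": 2, "might fit": 1, "doesn't fit": 0}
BROKEN = "page not found"  # signals a technical problem, not relevance


def dcg(gains):
    """Discounted cumulative gain with a log2 position discount."""
    return sum(gain / math.log2(pos + 2) for pos, gain in enumerate(gains))


def ndcg(labels):
    """NDCG for one query, given labels in ranked order.

    Broken pages are removed before scoring, so later results move up
    one position (an assumption made for this sketch).
    """
    gains = [GAIN[label] for label in labels if label != BROKEN]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0


def broken_share(labels):
    """Share of results flagged as broken pages."""
    return labels.count(BROKEN) / len(labels) if labels else 0.0
```

With this setup, a result list already ordered from "perfect" down to "doesn't fit" scores 1.0, and the same labels in reverse order score lower, which is the kind of signal needed to compare two ranking models on the same queries.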

In parallel with the launch on Toloka, the search queries were also labeled using rules. The focus was on popular queries, so that their search results could be improved first.

Labeling with rules made it possible to get data faster using a small number of queries, and the results for top queries were good. But there were also disadvantages: ambiguous queries can't be evaluated using rules, and there are many controversial situations. This method also proved rather expensive in the long term.

Manual labeling doesn't have those disadvantages. In Toloka, you can collect the opinions of a large number of tolokers and get more granular evaluations, which lets you analyze search results more deeply. After the initial setup, the platform works stably and processes large amounts of data.

Manual labeling

Manual labor and AI aren't mutually exclusive. The more AI develops, the more manual labor is needed to train it. On the other hand, the more training neural networks get, the more routine tasks can be automated, so people don't have to do them.

Almost any task, even a large one, can be divided into many small ones and done with the help of crowdsourcing. Most of the tasks solved in Toloka are the first step toward training models and automating processes with manually collected data.
