Data pipeline for e-commerce price matching: a case study

Toloka Team

Competitor pricing information is a strategic weapon in e-commerce, and inaccurate information can ultimately lead to revenue loss. To sharpen its pricing strategy, Yandex.Market turned to Toloka for help with large-scale price matching.


Yandex.Market is a large online marketplace with a huge assortment of products. They need to continually collect information on the catalogs and prices from other major retailers in order to meet the following business goals:

  • Perform market analysis to learn about competitors’ product assortment and pricing.
  • Estimate the overlap in product categories.
  • Analyze competitor prices, find the market equilibrium, and dynamically adjust pricing where necessary.

Automated algorithms match items on retailer websites, but they don’t consistently perform well enough to achieve the business outcomes set out by Yandex.Market. Human-labeled data is needed to improve match quality and cover missing matches. For instance, human annotators are better at handling issues like matching identical items that have different names.

In pursuit of high-quality data, the company started off with an in-house labeling team, but it proved expensive to maintain. After considering other available options, they chose Toloka for its quality, speed, and affordability. An important deciding factor was the API, which allowed them to integrate Toloka with existing internal pipelines.

Tasks in Toloka were designed to serve three purposes:

  • Assess the quality of automatic matching.
  • Improve the quality of existing matches by removing incorrect matches.
  • Increase match coverage by finding more URLs of matching items on competitor sites.
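
The three purposes above can be expressed as simple metrics over human-reviewed matches. The following is a minimal sketch, not Yandex.Market's actual implementation: the `ReviewedMatch` structure, its field names, and the definitions of precision and coverage are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewedMatch:
    """An automatic match plus a human verdict (hypothetical structure)."""
    product_id: str       # item in our own catalog
    competitor_url: str   # matched item on a competitor site
    is_correct: bool      # verdict from human labeling


def match_precision(reviewed: list[ReviewedMatch]) -> float:
    """Assess quality: share of automatic matches confirmed by annotators."""
    if not reviewed:
        return 0.0
    return sum(m.is_correct for m in reviewed) / len(reviewed)


def keep_confirmed(reviewed: list[ReviewedMatch]) -> list[ReviewedMatch]:
    """Improve quality: drop matches that annotators marked as incorrect."""
    return [m for m in reviewed if m.is_correct]


def coverage(catalog_ids: set[str], reviewed: list[ReviewedMatch]) -> float:
    """Measure coverage: share of catalog items with a confirmed match."""
    if not catalog_ids:
        return 0.0
    covered = {m.product_id for m in reviewed if m.is_correct} & catalog_ids
    return len(covered) / len(catalog_ids)
```

Tracking coverage separately from precision makes it visible when removing incorrect matches leaves products without any competitor URL, which is where the third purpose (finding more URLs) comes in.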


When Toloka stepped in, there were two task components for the crowd performers to tackle:

  • Find and save a URL link – Tolokers identify specific products on various e-commerce websites.
  • Check and compare – Tolokers decide whether a pair of products with different URLs are the same.
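
For the check-and-compare task, a common crowdsourcing practice is to show the same pair of products to several performers and aggregate their answers. The article does not say how verdicts were combined in this project, so the sketch below is only an illustration: majority voting, the label names, and the `min_agreement` threshold are all assumptions.

```python
from collections import Counter


def aggregate_verdicts(verdicts: list[str], min_agreement: float = 0.6) -> str:
    """Return the majority label, or 'needs_review' when agreement is too low.

    `verdicts` holds one label per performer, e.g. "same" or "different"
    (hypothetical label names).
    """
    if not verdicts:
        return "needs_review"
    label, votes = Counter(verdicts).most_common(1)[0]
    if votes / len(verdicts) < min_agreement:
        return "needs_review"
    return label
```

Pairs that come back as `needs_review` could be sent to additional performers or to an in-house expert instead of being written to the pricing system.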

An efficient pipeline was designed to be compatible with the company’s internal processes. The pipeline includes two directions of interaction, from the dynamic pricing system to Toloka and back, with data labeling in the middle.


Four major steps are embedded within the pipeline:

  • The preparatory stage
  • Data collection
  • Quality control
  • Labeling and accuracy check

The company uses automated pre-labeling to prepare and verify each pool of URL links before sending them to Toloka for human labeling. All outdated and visibly erroneous matches are removed. The remaining links with potential matches are left for Tolokers to analyze.
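
A pre-labeling filter of this kind might look like the sketch below. The article does not define what counts as outdated or visibly erroneous, so the checks here (a 30-day age limit, a malformed URL, a non-positive price, a price ratio above 5) and all field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class CandidateMatch:
    """A candidate match awaiting human review (hypothetical structure)."""
    product_id: str
    competitor_url: str
    our_price: float
    competitor_price: float
    last_seen: date  # when the competitor page was last crawled


def is_outdated(match: CandidateMatch, today: date, max_age_days: int = 30) -> bool:
    """Assumed rule: a match is outdated if the page was not seen recently."""
    return today - match.last_seen > timedelta(days=max_age_days)


def is_visibly_erroneous(match: CandidateMatch, max_price_ratio: float = 5.0) -> bool:
    """Assumed rules: malformed URL, non-positive price, or implausible price gap."""
    if not match.competitor_url.startswith(("http://", "https://")):
        return True
    if match.our_price <= 0 or match.competitor_price <= 0:
        return True
    high = max(match.our_price, match.competitor_price)
    low = min(match.our_price, match.competitor_price)
    return high / low > max_price_ratio


def prepare_pool(candidates: list[CandidateMatch], today: date) -> list[CandidateMatch]:
    """Keep only the candidates worth sending for human labeling."""
    return [
        m for m in candidates
        if not is_outdated(m, today) and not is_visibly_erroneous(m)
    ]
```

Filtering cheaply before labeling means Tolokers spend their time only on links that could plausibly be real matches.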


The Toloka pipeline provides 2.5% better coverage of key products compared to using automated solutions and in-house annotators. Tolokers work on demand to find competitor products in priority categories, the groups of products that contribute the most to GMV and have the greatest business value for Yandex.Market. Because the project is ongoing, labeling accuracy and speed continue to improve as Tolokers build their skills. To maintain quality, the project's honeypots (control tasks) are updated regularly to make sure that the URL links are still active and all of the listed products remain in stock.
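
The honeypot refresh described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the `Honeypot` and `PageStatus` structures are hypothetical, and `check_page` stands in for whatever crawler or availability check the pipeline really uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Honeypot:
    """A control task with a known correct answer (hypothetical structure)."""
    url: str
    expected_answer: str


@dataclass(frozen=True)
class PageStatus:
    is_active: bool  # the link still resolves to a product page
    in_stock: bool   # the product is still available


def refresh_honeypots(
    honeypots: list[Honeypot],
    check_page: Callable[[str], PageStatus],
) -> list[Honeypot]:
    """Drop control tasks whose link is dead or whose product is out of stock."""
    fresh = []
    for honeypot in honeypots:
        status = check_page(honeypot.url)
        if status.is_active and status.in_stock:
            fresh.append(honeypot)
    return fresh


def honeypot_accuracy(answers: dict[str, str], honeypots: list[Honeypot]) -> float:
    """Share of answered control tasks where a performer gave the expected answer."""
    graded = [h for h in honeypots if h.url in answers]
    if not graded:
        return 0.0
    correct = sum(answers[h.url] == h.expected_answer for h in graded)
    return correct / len(graded)
```

Stale honeypots would penalize performers who answer correctly about a page that has since changed, so refreshing them keeps the accuracy estimate meaningful.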