
Building Shopify's Product Catalogue at AI Speed
Challenge:
Shopify needed high-quality labeled data to match millions of products to a taxonomy. The work had to be completed in days at 95% accuracy or better, despite noisy, user-generated data.
Solution:
Toloka built a sophisticated hybrid pipeline where AI models handled routine classifications while trained annotators focused on high-disagreement cases.
Impact:
Shopify was able to secure reliable source-of-truth data in its most strategic product segment, powering Shopify Catalog, the global catalog behind its agentic commerce efforts.
The taxonomy problem
Shopify’s Catalog team was continually developing a taxonomy system to bring structure to the broad mix of product descriptions on the platform. It sounded simple… until the work began.
The sheer dimensionality of the data was staggering. The team worked with over 10,000 distinct product categories and more than 22,000 active collections. These collections weren’t standardized taxonomy nodes either. They were merchant-defined groupings used to organize storefronts in bespoke ways. A single product could appear in multiple collections, some structural and others promotional, like “Best Deals of the Week.” On top of that, the data wasn’t static: merchants were constantly creating, editing, and reshuffling products and collections in real time.
Each merchant named and grouped products in their own way. A candle might sit under Home Décor for one seller and under Gifts for another, depending on how they structured their storefront. The task was to interpret that intent while preserving the merchant’s original logic.
For every collection, the system had to determine whether it reflected a true taxonomy category and, if so, map it to the closest matching node in Shopify’s structured hierarchy while staying as faithful as possible to the merchant’s original intent. Apply that decision-making across tens of thousands of collections and the complexity rises quickly.
Volume meets urgency
The shift toward conversational shopping accelerated the schedule, but the volume made standard approaches impossible. With 22,000 active product collections requiring immediate processing, human-only annotation wasn’t realistic; even a skilled team couldn't meet the deadline given the review cycles required.
Automated classification also wasn’t enough. While models could handle routine tagging, the target taxonomy contained 10,000 possible labels. Models often hallucinated or lost context when navigating a tree that deep, especially when the input data was "noisy." Shopify needed a way to cover this vast volume without giving up the judgement needed to place ambiguous products.
The solution: an ensemble approach
To solve this, Toloka built a workflow where models and humans worked in the same stream.
Overcoming the Context Limit
The immediate technical hurdle was the taxonomy size. With up to seven levels of depth and 10,000+ categories, the data structure was too large to fit into a standard LLM prompt context. To solve this, the engineering team built a pipeline running two distinct candidate-generation methods in parallel:
Vector-Based RAG: We indexed all taxonomy options in a vector database to find semantically similar categories. This allowed the model to identify specific concepts—like "hunting gear"—even if they were buried deep in the tree structure.
Structural Tree Search: Simultaneously, a "greedy tree search" algorithm navigated the taxonomy top-down, selecting candidate categories level-by-level, mimicking how a human browses a catalog.
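As a rough illustration of the two methods, here is a minimal sketch with a toy four-node taxonomy and bag-of-words cosine similarity standing in for a real embedding model and vector database (all names and data below are hypothetical):

```python
import math
from collections import Counter

# Toy taxonomy: flattened paths stand in for Shopify's real 7-level hierarchy.
TAXONOMY = [
    "Apparel > Outerwear > Hunting Jackets",
    "Apparel > Outerwear > Rain Jackets",
    "Sporting Goods > Hunting > Hunting Gear",
    "Home > Decor > Candles",
]

def _bow(text):
    """Bag-of-words vector; a real pipeline would use learned embeddings."""
    return Counter(text.lower().replace(">", " ").split())

def _cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def vector_candidates(query, k=2):
    """Stand-in for vector-DB retrieval: rank every taxonomy node by similarity,
    regardless of where it sits in the tree."""
    q = _bow(query)
    return sorted(TAXONOMY, key=lambda n: _cosine(q, _bow(n)), reverse=True)[:k]

def greedy_tree_candidates(query):
    """Greedy top-down walk: at each level, descend into the best-matching child,
    mimicking how a human browses a catalog."""
    q = _bow(query)
    children = {}
    for path in TAXONOMY:
        parts = path.split(" > ")
        for i in range(len(parts)):
            parent = " > ".join(parts[:i])
            children.setdefault(parent, set()).add(" > ".join(parts[:i + 1]))
    node = ""  # start at the (virtual) root
    while node in children:
        node = max(children[node],
                   key=lambda c: _cosine(q, _bow(c.split(" > ")[-1])))
    return [node]
```

Note how `vector_candidates("hunting gear")` surfaces the deeply nested “Hunting Gear” node directly, while the greedy walk only sees one level at a time and can stall when no top-level branch matches the query.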
The Power of Ensembling
Individually, these methods hovered around 60% accuracy. However, when combined, they covered each other’s blind spots. For example, while a structural search might get stuck at a generic "Clothing" level, the vector search would correctly identify the semantic context of "Hunting." By ensembling the candidates from both methods and reranking them, the automated baseline quality improved.
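The ensembling step itself can be sketched in a few lines: pool the candidates from every method, deduplicate, and rerank. The two one-line classifiers and the overlap-based reranker below are hypothetical stand-ins chosen to show the complementary blind spots described above:

```python
def ensemble_candidates(query, methods, rerank_score, top_k=1):
    """Pool candidates from every method, dedupe, then rerank with a shared scorer."""
    pool = {c for method in methods for c in method(query)}
    return sorted(pool, key=lambda c: rerank_score(query, c), reverse=True)[:top_k]

# Hypothetical stand-ins illustrating the blind spots.
def tree_search(query):
    return ["Apparel > Clothing"]        # stuck at a generic level

def vector_search(query):
    return ["Sporting Goods > Hunting"]  # captures the semantic context

def overlap_score(query, candidate):
    q = set(query.lower().split())
    c = set(candidate.lower().replace(">", " ").split())
    return len(q & c) / len(c) if c else 0.0

best = ensemble_candidates("hunting jacket", [tree_search, vector_search], overlap_score)
```

In this toy case the reranker prefers the vector candidate because it shares the “hunting” token with the query; in the real pipeline the reranker would be a model scoring each candidate against the full collection context.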
Operationalizing to 95%
To get to at least 95% accuracy, we moved beyond simple human review and applied intelligent segmentation and data hygiene.
1. Targeting Disagreement & Hygiene: We analyzed where the two models disagreed and identified those instances as high-risk zones. Humans were directed specifically to these disagreements. Simultaneously, we engineered the pipeline to handle real-world data noise.
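In code, that routing rule is simple: items where the two methods agree are auto-accepted, and disagreements are escalated. The two lambda classifiers below are hypothetical placeholders for the RAG and tree-search outputs:

```python
def route_for_review(items, classify_a, classify_b):
    """Auto-accept items where both methods agree; escalate disagreements to humans."""
    auto_accepted, human_review = [], []
    for item in items:
        a, b = classify_a(item), classify_b(item)
        if a == b:
            auto_accepted.append((item, a))
        else:
            human_review.append((item, (a, b)))  # high-risk zone
    return auto_accepted, human_review

# Hypothetical classifiers for illustration only.
classify_a = lambda item: "Candles" if "candle" in item else "Clothing"
classify_b = lambda item: "Candles" if "candle" in item else "Hunting Gear"

auto, review = route_for_review(["soy candle", "camo jacket"], classify_a, classify_b)
```

The payoff of this design is that annotator hours are spent only where the models are genuinely uncertain, rather than spread evenly over the whole dataset.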
2. Addressing Multimodal Gaps: Some distinctions couldn’t be resolved from text alone. For example, separating “infant” from “toddler” apparel often depended on visual cues in the product image rather than the description. While vision-language models were evaluated, their performance at this taxonomy depth wasn’t consistent enough for high-accuracy targets. For these categories, items were routed to trained annotators who reviewed both text and images, closing a gap that automated classification alone couldn’t reliably handle.
3. Defining the ‘Allowed Limit of Error’: Achieving 95% accuracy required aligning on what should count as an error. Many “incorrect” labels turned out to be valid outliers, merchant-driven edge cases that didn’t meaningfully disrupt the browsing experience. A collection titled “Light Blue Hoodies,” for example, might include a single light blue pet hoodie. Situations like this forced a decision: either enforce strict alignment to the human apparel taxonomy or recognize that collections can contain small, intentional deviations. After reviewing these clusters together, we agreed that up to 10% variance could be treated as genuine outliers. With that threshold in place, the model was no longer penalized for placements that mirrored how merchants curate their collections, even when those groupings went beyond a strict taxonomy boundary. That alignment materially improved the effective quality score.
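One way to encode that agreement (a sketch under assumed scoring rules, since the exact formula wasn’t published): confirmed outliers, up to the 10% cap, are excluded from the denominator rather than counted as model errors, while any excess beyond the cap still counts against the score.

```python
def effective_accuracy(labels, outlier_cap=0.10):
    """labels: per-item verdicts, each "correct", "error", or "outlier".
    Outliers within the cap are excluded from scoring; excess outliers
    remain in the denominator and so drag the score down."""
    n = len(labels)
    if n == 0:
        return 1.0
    outliers = labels.count("outlier")
    allowed = min(outliers, int(outlier_cap * n))  # within the agreed 10% variance
    denom = n - allowed
    return labels.count("correct") / denom

# 20 items: 17 correct, 1 genuine error, 2 merchant-driven outliers.
strict = 17 / 20  # 0.85 under strict scoring
adjusted = effective_accuracy(["correct"] * 17 + ["error"] + ["outlier"] * 2)
```

Under strict scoring this batch sits at 85%, but once the two confirmed outliers are excluded the effective score rises to 17/18, which is the kind of shift the alignment on error definitions produced.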
Quality as a living system
Quality wasn’t treated as a final checkpoint; it ran through the process from the first batch onward. A dedicated team used automatic QA checks to surface emerging trends. When several annotators struggled with the same type of item, it signalled that the taxonomy needed clearer explanation.
Every annotation fed back into the models, sharpening their sense of placement as the dataset expanded. Over time, the share of cases that could be resolved more confidently by the models increased incrementally. Annotators were then able to concentrate on the more ambiguous decisions where contextual judgement was still essential.
How the work paid off for Shopify
The result was high-quality, ground-truth product data that enables Shopify’s AI development, ensuring accurate product categorization across its global Catalog.
The project showed what it takes to move fast without breaking the integrity of the data. Strong models and skilled annotators helped, but the edge came from a workflow built for unstructured data. Broken links, outliers, and visual noise were all treated with the same rigour as the code itself.
Talk to us and see how we can help with your data challenges.