Offline evaluation for ML and beyond: 5 insights on what your ML product is missing

by Alexey Dolotov


When is it worth investing in offline evaluation for your ML model or product?

Many ML teams assume that offline evaluation is so complex and expensive that it can’t pay off, even in the long term. After all, you’re already getting feedback from real users in production to steer your product development, right? Or maybe you’re collecting some automated offline metrics and have decided that’s “good enough” — especially when your team doesn’t have the bandwidth or expertise to set up “real” offline evaluation.

The truth is that you can do more for your ML-based product with offline evaluation, and it’s easily within reach. We’ll tell you how offline evaluation works and why your product needs it, with some surprising insights and real-world examples of how companies are benefiting from it.

What is offline evaluation, anyway?

Model evaluation is an important part of ML production both before and after deployment. You can evaluate model accuracy and business value metrics after making changes to the model, or compare it to other models and benchmarks.

You’re most likely already using some form of online evaluation based on behavior statistics collected from real users in production, like user clicks, checkout conversions, and site search terms. The main disadvantage is that implicit signals from user behavior are prone to distortion — you can’t know for sure when the user is happy or not. It can also take time to collect online statistics (more than 7 days on average), which means you might discover problems late in the game or delay time to market due to long experiment cycles.

Offline evaluation gives you explicit signals about user preferences without launching your product and exposing real users to it. This is the most systematic way to detect model drift and find patterns where improvement is needed. Better yet, you can collect this feedback within 24 hours and fix problems before they have a negative impact on business value.

Here’s a comparison summary:

  • Online evaluation: implicit signals from real user behavior, prone to distortion; statistics take time to collect (more than 7 days on average); requires exposing real users to the product.
  • Offline evaluation: explicit signals about user preferences; feedback within 24 hours; no need to launch the product or put the user experience at risk.
Ideally, you should use both online and offline evaluation to monitor the quality of your ML product. The tricky part is deciding which metrics to calculate and how to get them.

What are the offline options?

There are three basic types of solutions for offline evaluation. You can try out the one that best fits your goals, or use them all at different stages of your product development. On mature product teams, offline evaluation often forms the backbone of the whole development process.

Dissatisfaction analytics (DSAT). This type of evaluation involves in-depth analysis of specific cases where the ML model or product fails to provide the intended level of service. As a best practice, many development teams use DSAT to identify pain points and enhance their product.

Evaluation metrics. Regularly monitoring an ML model or product can shine a light on performance changes due to new launches, model degradation, or external factors. Ongoing collection of offline metrics can help you identify areas for improvement, control progress with KPIs, and compare model performance to benchmarks and competitor products.

Experiments. For comparing model or product versions to choose which one is better, offline A/B testing and other experiments offer much higher precision and faster decision making than online experiments. Even nascent products can easily benefit from offline experimentation.

If you’re wondering where the data comes from, human data labeling and automated data labeling are both viable options. Toloka collects offline evaluation data using a combination of automated solutions, LLMs, human annotators, and domain experts when needed — that’s how we ensure efficiency and accuracy.

Real-world success stories using offline evaluation

Toloka has over 10 years of experience with offline evaluation, developing high-precision experiments and metrics for clients. Here are a few recent case studies that illustrate how offline evaluation is applied to real-world scenarios.

DSAT for an e-commerce website

One of the largest e-commerce platforms in the EMEA region found that at least 20% of sales depend on the quality of search results. Online behavior metrics weren’t giving them enough information about where the search feature was failing. They set out to discover underlying problems in their product search and pinpoint areas for improvement using dissatisfaction analytics (DSAT).

Toloka sampled search queries from the client’s site, labeled search relevancy, and identified the top 3 problems that were interfering with product searches. DSAT was implemented as part of a product improvement cycle that resulted in 8% better search relevancy with a clear connection to GMV growth. Read the case study.
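As a rough illustration of the DSAT workflow described above, the sketch below aggregates relevance labels per query category to surface the worst-performing areas of search. The function name, categories, and toy data are all assumptions for illustration, not the client’s actual pipeline.

```python
from collections import defaultdict

def top_dsat_categories(labeled_queries, k=3):
    """Rank query categories by share of irrelevant results.

    labeled_queries: iterable of (category, is_relevant) pairs,
    e.g. produced by annotators judging sampled search queries.
    Returns the k categories with the highest failure rates.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for category, is_relevant in labeled_queries:
        totals[category] += 1
        if not is_relevant:
            failures[category] += 1
    rates = {c: failures[c] / totals[c] for c in totals}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Toy example: labels from a hypothetical annotation batch
labels = [
    ("typos", False), ("typos", False), ("typos", True),
    ("brand", True), ("brand", True), ("brand", False),
    ("long-tail", False), ("long-tail", True),
]
print(top_dsat_categories(labels, k=2))
```

Ranking categories by failure rate like this is what turns raw labels into an actionable “top 3 problems” list for the product team.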


Evaluation metrics for a voice assistant

An ML team developing a voice assistant needed to monitor model performance and choose the best model before releasing new versions. The goal was to evaluate the validity and accuracy of the voice assistant’s responses in conversations with users.

Toloka set up custom evaluation metrics and continuous monitoring of model performance with human data labeling for early detection of model degradation and tracking KPIs. The team also ran offline A/B testing to decide which model version to implement. By choosing and launching the right model in just one experiment, the client was able to offset the cost of performance monitoring for an entire year. Read the case study.
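A minimal sketch of the kind of continuous monitoring described above: compare each new batch of labeled metric values against a baseline and flag degradation when a rolling average drops below tolerance. The function name, window size, and the weekly accuracy numbers are illustrative assumptions.

```python
def detect_degradation(scores, baseline, tolerance=0.05, window=3):
    """Flag model degradation from a stream of offline metric values.

    scores: chronological per-batch metric values (e.g. response
    accuracy judged by annotators); baseline: reference value from
    the launched model. Returns the index of the first window whose
    mean falls more than `tolerance` below baseline, else None.
    """
    for i in range(len(scores) - window + 1):
        window_mean = sum(scores[i:i + window]) / window
        if window_mean < baseline - tolerance:
            return i
    return None

# Hypothetical weekly accuracy measurements for a voice assistant
weekly_accuracy = [0.91, 0.90, 0.92, 0.86, 0.84, 0.83]
print(detect_degradation(weekly_accuracy, baseline=0.90))
```

Averaging over a window instead of alerting on a single batch keeps one noisy annotation batch from triggering a false alarm.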


Experiments for new search engine functionality

Our client, a search engine developer, tested two versions of a new generative AI feature on the search results page to find out which one users liked better.

Toloka used side-by-side comparison to ask the crowd directly about the options. The crowd showed a 75% preference for one of the variants, which gave a strong explicit signal to the development team to choose the best version for production. Read the case study.


5 insights on why your ML needs offline evaluation

Now that we’ve looked at a high-level overview of offline evaluation, here are 5 pivotal insights that might surprise you:

  • You need offline evaluation even if you aren’t training ML with human-labeled data. If your product isn’t mature enough to be impacted by ML training, now is a great time to start offline evaluation. Before you can improve anything, you need to measure it. Put the pieces of the puzzle into place with offline evaluation, then continue using it to oversee progress as you make improvements.
  • A/B testing with offline evaluation usually pays off after 1-2 experiments. One good experiment can increase revenue enough to justify the cost of evaluation for a whole year (yes, you read that right!). Offline A/B testing gives you much higher “resolution” than online testing. You can pinpoint the absolute best version in your experiments and make fast decisions about your product with full confidence.
  • Offline evaluation gives you control of decision making before your product goes into production. You’ll know exactly what makes your users happy instead of wasting resources while you wait for behavior data and guess where to make improvements. As a bonus, you avoid the risks of online experiments, like potentially losing money or audience loyalty by showing undesirable versions to your users.
  • Data labeling is just as important for offline evaluation as it is for model training. Data labeling is an efficient way to collect real human feedback. In fact, over half of the data labeling projects on Toloka are used for offline evaluation. Some ML teams use the same labeled data for training their model after evaluation.
  • LLMs are a great tool to democratize offline evaluation (and we're excited about that)! Collaboration between LLMs and human annotators makes evaluation more affordable, faster, and in some cases — more accurate. We're developing state-of-the-art fine-tuned models and complex Human-LLM pipelines to handle the most common types of offline evaluation and make it accessible to any team.

Where do we go from here?

Toloka offers bespoke services for offline evaluation with custom metrics and model monitoring, using human data labeling and LLMs. You don’t need to wait until you are ready to invest in data labeling for ML training. For many ML products, offline evaluation with DSAT is a solid approach to start enhancing performance.

Toloka started doing offline evaluation for Big Tech companies before ML training was even a thing. We don’t think you’ll find another team with stronger technologies or better expertise for building effective offline metrics to track the business value of your product.

Reach out to our experts to discuss what type of evaluation solution will work for you.
