
Toloka Team

Feb 3, 2023

Customer cases

Calculating evaluation metrics for a voice assistant
About the client

Our client develops one of the leading voice assistants in their target region. The ML team continually monitors the overall performance of the system, which consists of a complex combination of LLMs and other technologies. They also need metrics to choose the best models before releasing new versions.

The client’s goal was to evaluate the validity and accuracy of the voice assistant’s responses in conversations with users on an ongoing basis.

Challenge

Traditional “online” A/B testing involves deploying multiple versions of the model and assessing performance based on user interactions. This type of evaluation is impractical for a voice assistant. One concern is that it can take weeks of experimenting to get enough data to make a decision. This can also disrupt the customer experience if one of the versions is worse than the base version or has problems that slipped through testing undetected. But the biggest drawback is that we can only get an implicit signal, and an ambiguous one at that. There’s no clear way to interpret user behavior, especially in chit-chat mode, to understand how the user feels about the conversation.

Offline evaluation is a better approach because it collects immediate human feedback without exposing users to unacceptable responses from the model. But the real strength of this method is that we directly ask people for opinions, providing an explicit signal, which means we can reach a statistically significant decision with a relatively small number of labeled conversations.

The client had two goals for model evaluation:

  • Set up “offline” A/B testing as a data-driven approach to choose the best model versions for faster launches.

  • Develop evaluation metrics and run continual offline evaluation to measure validity and accuracy of the voice assistant’s responses.

Solution

The client consulted with Toloka to design an efficient evaluation process using a combination of auto labeling and human feedback.

The evaluation process involves 4 steps:

Step 1: Label the conversation type: general purpose (“chit-chat”) or action

Step 2: Label quality of chit-chat conversations

Step 3: Detect categories for action conversations

Step 4: Split action categories into 3 datasets

[Image: Evaluation metrics for a voice assistant]

The resulting datasets are used for training, evaluating, and monitoring the model.

Step 1. Label the conversation type: general purpose or action

Voice assistants and chatbots have two main modes of conversation: general-purpose dialogues (chit-chat mode) and action dialogues (conversations where the user expects a specific result).

Chit-chat does not have a specific goal. For instance, you could ask the voice assistant, “What do you think about Banksy’s art?” There is no right or wrong answer here, but there are good and bad responses.

Action conversations have a goal: the user wants to find information, play a song, turn on the lights, or hear the weather forecast. The answer or action can usually be auto-labeled as right or wrong if we know the expected result.

We use human data labeling to identify the type of conversation and associate it with an action before evaluation.
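As a rough illustration of how conversations might be routed once this type label is attached, here is a minimal Python sketch. The field names and label values are hypothetical, not the client's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record produced by the conversation-type labeling step.
@dataclass
class Conversation:
    conversation_id: str
    transcript: list[str]       # alternating user / assistant turns
    conversation_type: str      # "chit_chat" or "action" (assumed label values)

def route(conversations: list[Conversation]):
    """Split type-labeled conversations into the two evaluation tracks."""
    chit_chat = [c for c in conversations if c.conversation_type == "chit_chat"]
    action = [c for c in conversations if c.conversation_type == "action"]
    return chit_chat, action
```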

Step 2. Label quality of chit-chat conversations

The conversations labeled as chit-chat are then labeled for quality. Initially, conversations were tagged for only two extremes:

  • Failed conversation. For instance, the voice assistant didn’t understand the question at all.

  • Great conversation. For instance, the voice assistant answered a question much better than expected, or cracked a good joke.

Additional metrics were introduced to reflect more nuanced aspects of conversation quality:

  • Appropriateness: To evaluate how well the voice assistant's responses align with the context, ensuring that interactions feel natural and appropriate.

  • User engagement: To measure the entertainment value of the voice assistant's responses.

  • Matching the character: To measure how well the responses align with the voice assistant's established persona and tone of voice.

  • Compliance with ethical norms: To ensure that responses do not violate ethical norms or offend users in any way.

These metrics are measured separately and not aggregated in any way.
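Because each metric is reported on its own, the aggregation can stay simple. The sketch below shows one possible way to tally crowd votes: majority-vote each (conversation, metric) pair, then report the share of positive labels per metric separately. The metric names and label values are illustrative assumptions.

```python
from collections import Counter, defaultdict

# Hypothetical crowd labels: one vote per (conversation, metric) pair per annotator.
labels = [
    # (conversation_id, metric, vote)
    ("conv_1", "appropriateness", "good"),
    ("conv_1", "appropriateness", "good"),
    ("conv_1", "appropriateness", "bad"),
    ("conv_1", "user_engagement", "bad"),
    ("conv_2", "appropriateness", "good"),
    ("conv_2", "user_engagement", "good"),
]

def metric_shares(labels):
    """Majority-vote each (conversation, metric), then report the share of 'good' per metric."""
    votes = defaultdict(Counter)
    for conv_id, metric, vote in labels:
        votes[(conv_id, metric)][vote] += 1
    per_metric = defaultdict(list)
    for (conv_id, metric), counter in votes.items():
        winner, _ = counter.most_common(1)[0]
        per_metric[metric].append(winner == "good")
    # Each metric gets its own score; nothing is combined into a single number.
    return {m: sum(v) / len(v) for m, v in per_metric.items()}

print(metric_shares(labels))
# e.g. {'appropriateness': 1.0, 'user_engagement': 0.5}
```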

Step 3. Detect categories for action conversations

For action conversations, the client selects categories for continuous evaluation.

[Image: Categories for continuous evaluation]

The challenge is to get enough real-life examples in each category to develop strong metrics. The Toloka crowd provides large-scale human labeling for this purpose.

The client’s team prioritizes scenarios where the voice assistant needs improvement, such as setting an alarm. They measure the performance of the model and identify the percentage of correct responses, categorized by scenario and device type.
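A minimal pandas sketch of this breakdown, assuming the labeled results are available as a table; the scenario, device_type, and correct column names and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical labeled results for action conversations.
df = pd.DataFrame({
    "scenario":    ["set_alarm", "set_alarm", "weather", "weather", "play_music"],
    "device_type": ["speaker",   "phone",     "speaker", "phone",   "speaker"],
    "correct":     [True,        False,       True,      True,      False],
})

# Percentage of correct responses, broken down by scenario and device type.
report = (
    df.groupby(["scenario", "device_type"])["correct"]
      .mean()
      .mul(100)
      .rename("pct_correct")
      .reset_index()
)
print(report)
```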

Step 4. Split action categories into 3 datasets

The categorized conversations are divided into 3 datasets for different purposes, each containing several thousand labeled items.

  • Dev dataset: used for developing the voice assistant and training models.

  • Accept dataset: used for checking the model quality and day-to-day system performance.

  • KPI dataset: used for calculating key performance indicator metrics. The KPI dataset deserves attention as an ideal way for management to monitor the product — and the only way to know for certain if the product is making progress or not.

Here is a more detailed breakdown:

[Image: 3 datasets of action categories]
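One simple way to produce such a split is a seeded random partition of the categorized conversations. The proportions in the sketch below are placeholders, since the case only states that each dataset holds several thousand items, not how the split is chosen.

```python
import random

def split_datasets(conversations, seed=42, dev_frac=0.6, accept_frac=0.2):
    """Randomly split categorized conversations into dev / accept / KPI datasets."""
    rng = random.Random(seed)
    shuffled = conversations[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_dev = int(n * dev_frac)
    n_accept = int(n * accept_frac)
    return {
        "dev": shuffled[:n_dev],                     # model development and training
        "accept": shuffled[n_dev:n_dev + n_accept],  # day-to-day quality checks
        "kpi": shuffled[n_dev + n_accept:],          # KPI tracking for management
    }
```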

Offline A/B testing: auto-labeling augmented by human labeling

For maximum efficiency, the client uses a combination of auto-labeling and human labeling. The A/B testing process has two stages.

  1. At the first stage, there are too many competing model versions to test them all on real users. All model versions are tested in the action categories and the results are analyzed with auto-labeling. Labeling accuracy isn’t as good as with human labeling, but it’s an efficient way to roughly select the best 2 or 3 model versions to continue to the next stage.

  2. The second stage evaluates chit-chat mode, which requires collecting real-life conversations. The winning model versions are run in production for a short time on real users to collect data. A random sample of conversations is then selected from each experimental model and labeled with the Toloka crowd to get an explicit signal. The best model is chosen based on this data, as sketched below.
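With explicit crowd labels, the final choice between the remaining versions can rest on a standard significance test. The sketch below uses a two-proportion z-test on hypothetical counts of conversations rated “good”; it is one possible approach, not necessarily the client’s exact method.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(good_a, n_a, good_b, n_b):
    """Two-sided z-test for the difference in 'good conversation' rates of two model versions."""
    p_a, p_b = good_a / n_a, good_b / n_b
    p_pool = (good_a + good_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts of conversations rated "good" by the crowd for two candidate versions.
z, p = two_proportion_ztest(good_a=412, n_a=500, good_b=376, n_b=500)
print(f"z = {z:.2f}, p = {p:.4f}")   # a small p-value suggests the difference is not chance
```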

Toloka’s contribution

Toloka provides continuous human labeling for offline evaluation with thousands of labels per day. With custom metrics, the client can use a relatively small number of conversations to get explicit signals. Human labeling is focused on chit-chat conversations and labeling new action categories that are added by the team for evaluation.

The graph below shows performance metrics for action conversations by dataset. The team regularly updates the categories to refocus on underperforming areas and keep the metrics useful. The sharp drop near the beginning of this graph represents one of these category changes. Other fluctuations on the graph may be related to new conversations added, model fixes, or other factors.

[Image: Performance metrics for action conversations by dataset]

The calculated offline metrics are useful for early detection of model degradation and tracking KPIs.
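One simple way to turn these daily metrics into an early-warning signal is to compare each day’s accuracy against a trailing baseline. The sketch below is a minimal example with an assumed 7-day window and 3-point threshold; the client’s actual alerting logic is not described in this case.

```python
import pandas as pd

def flag_degradation(daily_accuracy: pd.Series, window: int = 7, drop_pct: float = 3.0) -> pd.Series:
    """Flag days where accuracy falls more than `drop_pct` points below the trailing mean.

    The window and threshold are illustrative defaults, not the client's settings.
    """
    baseline = daily_accuracy.shift(1).rolling(window).mean()
    return daily_accuracy < (baseline - drop_pct)

# Usage: daily_accuracy is a date-indexed Series of accuracy in percent,
# e.g. computed daily from the accept dataset.
```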

Business impact

Consistent monitoring allows the team to compare model versions before launch and track changes that occur after the launch as well. The management team can see daily updates on model performance and spot model degradation before it becomes an issue.

The client gained real value when a single successful A/B experiment balanced out the cost of monitoring model performance for an entire year. The expense of running experiments is negligible compared to the time and money saved on selecting the best model, especially considering the extensive resources invested in training and developing large language models to run the voice assistant.

The team now has a regular, predictable, and reliable process for offline A/B experiments that helps them make confident data-driven decisions and roll out updates faster.

Article written by:

Toloka Team

Updated:

Feb 3, 2023
