Calculating evaluation metrics for a voice assistant
About the client
Our client develops one of the leading voice assistants in their target region. The ML team continually monitors the overall performance of the system, which consists of a complex combination of LLMs and other technologies. They also need metrics to choose the best models before releasing new versions.
The client’s goal was to evaluate the validity and accuracy of the voice assistant’s responses in conversations with users on an ongoing basis.
Challenge
Traditional “online” A/B testing involves deploying multiple versions of the model and assessing performance based on user interactions. This type of evaluation is impractical for a voice assistant. One concern is that it can take weeks of experimenting to get enough data to make a decision. This can also disrupt the customer experience if one of the versions is worse than the base version or has problems that slipped through testing undetected. But the biggest drawback is that we can only get an implicit signal, and an ambiguous one at that. There’s no clear way to interpret user behavior, especially in chit-chat mode, to understand how the user feels about the conversation.
Offline evaluation is a better approach because it collects immediate human feedback without exposing users to unacceptable responses from the model. But the real strength of this method is that we ask people directly for their opinions, providing an explicit signal, which means we can make a statistically sound decision based on a relatively small sample of labeled conversations.
The client had two goals for model evaluation:
Set up “offline” A/B testing as a data-driven approach to choose the best model versions for faster launches.
Develop evaluation metrics and run continual offline evaluation to measure the validity and accuracy of the voice assistant’s responses.
Solution
The client consulted with Toloka to design an efficient evaluation process using a combination of auto-labeling and human feedback.
The evaluation process involves 4 steps:
Step 1: Label the conversation type: general purpose (“chit-chat”) or action
Step 2: Label quality of chit-chat conversations
Step 3: Detect categories for action conversations
Step 4: Split action categories into 3 datasets
The resulting datasets are used for training, evaluating, and monitoring the model.
Step 1. Label the conversation type: general purpose or action
Voice assistants and chatbots have two main modes of conversation: general-purpose dialogues (chit-chat mode) and action dialogues (conversations where the user expects a specific result).
Chit-chat does not have a specific goal. For instance, you could ask the voice assistant, “What do you think about Banksy’s art?” There is no right or wrong answer here, but there are good and bad responses.
Action conversations have a goal: the user wants to find information, play a song, turn on the lights, or hear the weather forecast. The answer or action can usually be auto-labeled as right or wrong if we know the expected result.
We use human data labeling to identify the type of conversation and associate it with an action before evaluation.
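Since several annotators typically review each conversation, their individual judgments need to be combined into a single type label. The snippet below is a minimal sketch of majority-vote aggregation; it is not the client's actual aggregation logic, and the label names are assumptions.

```python
from collections import Counter

# Illustrative only: aggregate overlapping annotator judgments for one
# conversation into a single type label ("chit_chat" or "action") by majority vote.
def aggregate_conversation_type(labels: list[str]) -> str:
    """labels: type judgments for one conversation from several annotators."""
    counts = Counter(labels)
    winner, _ = counts.most_common(1)[0]
    return winner

# Example: three annotators labeled the same conversation.
print(aggregate_conversation_type(["action", "action", "chit_chat"]))  # -> "action"
```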
Step 2. Label quality of chit-chat conversations
The conversations identified as chit-chat are then labeled for quality. Initially, conversations were tagged only at two extremes:
Failed conversation. For instance, the voice assistant didn’t understand the question at all.
Great conversation. For instance, the voice assistant answered a question much better than expected, or cracked a good joke.
Additional metrics were introduced to reflect more nuanced aspects of conversation quality:
Appropriateness: To evaluate how well the voice assistant's responses align with the context, ensuring that interactions feel natural and appropriate.
User engagement: To measure the entertainment value of the voice assistant's responses.
Matching the character: To measure how well the responses align with the voice assistant's established persona and tone of voice.
Compliance with ethical norms: To ensure that responses do not violate ethical norms or offend users in any way.
These metrics are measured separately and are not aggregated in any way; a sketch of how the separate labels can be stored and reported follows below.
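The sketch below shows one way these per-conversation labels could be kept and reported so that each metric stays separate. The record structure and field names are illustrative assumptions, not the client's schema.

```python
from dataclasses import dataclass

# Hypothetical per-conversation quality record: each metric is stored on its own
# and never combined into a single score.
@dataclass
class ChitChatQualityLabels:
    appropriateness: bool
    user_engagement: bool
    matches_character: bool
    ethical_compliance: bool

def per_metric_rates(batch: list[ChitChatQualityLabels]) -> dict[str, float]:
    """Share of conversations passing each metric, reported separately."""
    n = len(batch)
    return {
        "appropriateness": sum(x.appropriateness for x in batch) / n,
        "user_engagement": sum(x.user_engagement for x in batch) / n,
        "matches_character": sum(x.matches_character for x in batch) / n,
        "ethical_compliance": sum(x.ethical_compliance for x in batch) / n,
    }
```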
Step 3. Detect categories for action conversations
For action conversations, the client selects categories for continuous evaluation.
The challenge is to get enough real-life examples in each category to develop strong metrics. The Toloka crowd provides large-scale human labeling for this purpose.
The client’s team prioritizes scenarios where the voice assistant needs improvement, such as setting an alarm. They measure the performance of the model and identify the percentage of correct responses, categorized by scenario and device type.
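As a rough illustration of this kind of reporting, the snippet below computes the percentage of correct responses grouped by scenario and device type. The column names and sample data are hypothetical.

```python
import pandas as pd

# Hypothetical labeled results for action conversations.
df = pd.DataFrame({
    "scenario":    ["set_alarm", "set_alarm", "weather", "play_music"],
    "device_type": ["speaker",   "phone",     "speaker", "phone"],
    "is_correct":  [True,        False,       True,      True],
})

# Percentage of correct responses per scenario and device type.
report = (
    df.groupby(["scenario", "device_type"])["is_correct"]
      .mean()
      .mul(100)
      .rename("pct_correct")
      .reset_index()
)
print(report)
```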
Step 4. Split action categories into 3 datasets
The categorized conversations are divided into 3 datasets for different purposes, each containing several thousand labeled items:
Dev dataset: used for developing the voice assistant and training models.
Accept dataset: used for checking the model quality and day-to-day system performance.
KPI dataset: used for calculating key performance indicator metrics. The KPI dataset deserves special attention: it gives management an ideal way to monitor the product, and it is the only way to know for certain whether the product is making progress. A minimal example of this split is sketched below.
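Here is a minimal sketch of how categorized conversations could be divided into the three datasets. The split proportions and field names are assumptions for illustration only, not the client's actual ratios.

```python
import random
from collections import defaultdict

# Illustrative split of categorized conversations into dev, accept, and KPI
# datasets, keeping each category represented in every split.
def split_datasets(conversations, seed=42, dev=0.6, accept=0.2):
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for conv in conversations:
        by_category[conv["category"]].append(conv)

    splits = {"dev": [], "accept": [], "kpi": []}
    for items in by_category.values():
        rng.shuffle(items)
        n = len(items)
        n_dev, n_accept = int(n * dev), int(n * accept)
        splits["dev"].extend(items[:n_dev])
        splits["accept"].extend(items[n_dev:n_dev + n_accept])
        splits["kpi"].extend(items[n_dev + n_accept:])
    return splits
```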
Offline A/B testing: auto-labeling augmented by human labeling
For maximum efficiency, the client uses a combination of auto-labeling and human labeling. The A/B testing process has two stages.
In the first stage, there are too many competing model versions to test them all on real users. All model versions are tested on the action categories, and the results are analyzed with auto-labeling. Labeling accuracy isn’t as good as with human labeling, but it’s an efficient way to roughly select the best 2 or 3 model versions to continue to the next stage.
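The idea behind auto-labeling action conversations can be illustrated as follows: when the expected result of a request is known, the model's output can be compared against it automatically. The field names and comparison logic below are simplified assumptions.

```python
# Minimal sketch of auto-labeling for action conversations: score the model's
# response automatically by comparing it to the known expected result.
def auto_label(conversation: dict) -> bool:
    """Return True if the assistant produced the expected action/result."""
    expected = conversation["expected_result"]   # e.g. {"action": "set_alarm", "time": "07:00"}
    actual = conversation["model_result"]        # structured output of the candidate model
    return actual == expected

def accuracy(conversations: list[dict]) -> float:
    labels = [auto_label(c) for c in conversations]
    return sum(labels) / len(labels)
```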
The second stage evaluates chit-chat mode, which requires collecting real-life conversations. The winning model versions are run in production for a short time on real users to collect data.
A random sample of conversations is selected from each experimental model and labeled by the Toloka crowd to get an explicit signal. The best model is chosen based on this data.
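As a simplified illustration of how such a decision could be made statistically (not necessarily the client's exact procedure), the snippet below compares two candidate models by the share of conversations the crowd labeled as good, using a two-sided two-proportion z-test. The numbers are made up.

```python
import math

# Compare two candidate models by their "good conversation" rates from crowd labels.
def two_proportion_ztest(good_a: int, n_a: int, good_b: int, n_b: int) -> float:
    """Return the two-sided p-value for H0: both models have the same good-rate."""
    p_a, p_b = good_a / n_a, good_b / n_b
    pooled = (good_a + good_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Example: model A rated good in 420 of 500 conversations, model B in 390 of 500.
p_value = two_proportion_ztest(420, 500, 390, 500)
print(f"p-value: {p_value:.4f}")  # a small p-value suggests the difference is not chance
```

With a few hundred labeled conversations per model, a difference of several percentage points can already reach significance, which is why the explicit signal from offline labeling supports fast, confident decisions.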
Toloka’s contribution
Toloka provides continuous human labeling for offline evaluation with thousands of labels per day. With custom metrics, the client can use a relatively small number of conversations to get explicit signals. Human labeling is focused on chit-chat conversations and labeling new action categories that are added by the team for evaluation.
The graph below shows performance metrics for action conversations by dataset. The team regularly updates the categories to refocus on underperforming areas and keep the metrics useful. The sharp drop near the beginning of this graph represents one of these category changes. Other fluctuations on the graph may be related to new conversations added, model fixes, or other factors.
The calculated offline metrics are useful for early detection of model degradation and tracking KPIs.
Business impact
Consistent monitoring allows the team to compare model versions before launch and track changes that occur after the launch as well. The management team can see daily updates on model performance and spot model degradation before it becomes an issue.
The client gained real value when a single successful A/B experiment balanced out the cost of monitoring model performance for an entire year. The expense of running experiments is negligible compared to the time and money saved on selecting the best model, especially considering the extensive resources invested in training and developing large language models to run the voice assistant.
The team now has a regular, predictable, and reliable process for offline A/B experiments that helps them make confident data-driven decisions and roll out updates faster.
Article written by:
Toloka Team
Updated:
Feb 3, 2023