Toloka Team
Training a voice assistant in new languages: speech-to-text, audio recording, answer relevance, and request classification
About the client
The client’s conversational voice assistant can search online, play music, control smart home devices, and chat with the user. With over 30 million users per month on average and an average daily use time of 75–100 minutes, the assistant plays a vital role in helping people with everyday requests.
Challenge
Ongoing work on the voice assistant focuses on areas such as activating at the right moment, recognizing speech accurately even with background noise, and knowing when to continue or end a conversation.
In addition, the client wanted to expand into new markets and languages, which would require retraining the voice assistant from scratch. Directly translating requests and responses into a new language isn’t accurate enough to support natural conversation, because linguistic nuances get in the way.
Human input is crucial for collecting training data and evaluating current output from the voice assistant. The client needed large volumes of human-labeled data for multiple scenarios, including speech-to-text, request classification, answer relevance evaluation, and audio recording.
The client’s primary business goals included:
Refining the relevance of answers in multiple languages for better user satisfaction.
Expanding into new markets with functionality in new languages.
Boosting the overall accuracy of responses in Arabic, a new language for the assistant.
Tracking quality metrics for the labeled data on a daily basis.
The Arabic language presented a significant challenge for the voice assistant. Written Arabic differs considerably from spoken Arabic, which has multiple dialects with their own variations in vocabulary and pronunciation. Arabic script also usually omits short vowels, which complicates reading and pronunciation and makes human input even more critical.
Solution
The client leveraged Toloka’s crowd to help train the voice assistant to perform various tasks in multiple languages.
To evaluate the assistant’s responses, Tolokers performed binary classification, labeling whether each response was “good” or “complete”.
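Binary labels like these turn directly into a simple quality metric: the share of evaluated responses labeled good. The sketch below shows one way to compute that share per day, in line with the goal of tracking quality daily. It is only an illustration under stated assumptions: the `Judgment` record, its field names, and the rule of one label per response are hypothetical, and the client’s actual schema and aggregation (for example, combining several Tolokers’ labels for the same response) are not described in this case study.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass
class Judgment:
    """One crowd label for one assistant response (hypothetical schema)."""
    day: date
    response_id: str
    is_good: bool  # binary label: True if the response was judged good


def daily_accuracy(judgments: list[Judgment]) -> dict[date, float]:
    """Share of responses labeled good, computed separately for each day.

    Assumes exactly one label per response; real pipelines often collect
    several labels per response and aggregate them first.
    """
    good: dict[date, int] = defaultdict(int)
    total: dict[date, int] = defaultdict(int)
    for j in judgments:
        total[j.day] += 1
        good[j.day] += int(j.is_good)
    # Return days in chronological order so the metric reads as a time series.
    return {day: good[day] / total[day] for day in sorted(total)}


labels = [
    Judgment(date(2023, 7, 1), "r1", True),
    Judgment(date(2023, 7, 1), "r2", False),
    Judgment(date(2023, 7, 2), "r3", True),
]
daily = daily_accuracy(labels)
```

With these toy labels, the first day scores 0.5 and the second day 1.0. In practice, each day would contain many thousands of labels.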
The crowd provided vital input for speech recognition, request classification, and answer relevance in different languages. They also made audio recordings of spoken language, including Arabic. The client used the data collected to train models, track metrics, and evaluate quality.
Here’s what a typical workflow looked like:
The client used existing scripts of interactions with the voice assistant, like setting an alarm, turning on music, or chatting about the weather.
They created tasks for the Toloka crowd with instructions translated into Arabic, asking Tolokers to pretend they were talking to the voice assistant in a particular scenario (for instance, setting an alarm for weekdays only).
Arabic-speaking Tolokers followed the instructions and recorded voice requests on their phones. The audio recordings were given to other Tolokers to transcribe into text and classify by intent during manual review.
The resulting data was fed to speech recognition models.
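The workflow above chains three crowd stages (scenario prompt, recording, transcription plus intent label) into one training record. The sketch below models that join in plain Python to make the data flow concrete. Every class, field name, and scenario label in it is hypothetical. The case study does not describe the client’s data formats, only the order of the steps.

```python
from dataclasses import dataclass


@dataclass
class ScenarioTask:
    task_id: str
    instruction: str  # scenario instruction translated into Arabic
    scenario: str     # hypothetical scenario label, e.g. "set_alarm"


@dataclass
class Recording:
    task_id: str
    audio_path: str   # voice request recorded on a Toloker's phone


@dataclass
class Annotation:
    audio_path: str
    transcript: str   # transcription typed by a different Toloker
    intent: str       # intent class assigned during manual review


@dataclass
class TrainingExample:
    audio_path: str
    transcript: str
    intent: str
    scenario: str


def build_examples(tasks, recordings, annotations):
    """Join the three stages; recordings without an annotation are skipped."""
    scenario_by_task = {t.task_id: t.scenario for t in tasks}
    annotation_by_audio = {a.audio_path: a for a in annotations}
    examples = []
    for rec in recordings:
        ann = annotation_by_audio.get(rec.audio_path)
        if ann is None:
            continue  # not yet transcribed and classified by the crowd
        examples.append(TrainingExample(
            audio_path=rec.audio_path,
            transcript=ann.transcript,
            intent=ann.intent,
            scenario=scenario_by_task[rec.task_id],
        ))
    return examples
```

Keeping the original scenario next to the annotated intent makes it easy to spot recordings whose reviewed intent does not match the prompt they were recorded for.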
Results
With the help of Toloka, the client was able to obtain data for model training, assess current quality, and track metrics for various stages. The client improved the performance of the voice assistant by boosting the accuracy of responses.
When a voice assistant is first developed for a new language, its baseline accuracy is approximately 12%. With human labelers involved at every step of the way, the client was able to build an efficient pipeline for model training and quality assessment. Thanks to the crowd’s input, the assistant’s responses in Arabic ended up being about 62% accurate (a little under two correct responses for every three requests), which was a major step up from the baseline. For comparison, the original version of the voice assistant, which has been under development for many years, has an accuracy rate of approximately 77%.
In terms of throughput, Toloka processed an average of 50,000 unique tasks in Arabic per day, sometimes reaching 400,000 tasks per day. At the peak, the volume reached 970,000 unique tasks.
With the help of Toloka, the client was able to enter new markets by branching out into new languages and improving metrics across all key areas.
Article written by:
Toloka Team
Updated:
Jul 13, 2023