Natalie Kudan
Sound recognition technology
A few decades ago, people marveled at the possibility of hearing someone's voice over the phone and couldn't believe it was a reality. These days, although that no longer surprises anyone, there are even more intelligent technologies like voice recognition. Now you can speak into the microphone of your device, and the system will identify what you said and provide you with written text.
Speech recognition (or voice recognition) is a technology by means of which it is possible to transform human speech into text. It may operate autonomously, or learn the pronunciation characteristics of a particular user. In the last few years, speech recognition has evolved rapidly and is employed in countless areas of our daily lives as well as in more specialized professional fields. Apart from speech identification, sound recognition technologies are applied in:
Voice recognition software assisting people with disabilities or senior citizens with limited hearing, or those who, for one reason or another, cannot type.
Smart home device voice control systems that employ voice commands.
Music recognition.
Identification of wildlife species such as birds, fish, and mammals.
Automated detection and identification of alarms in supervisory, monitoring, and acoustic environment systems, and so on.
This article introduces you to automatic speech recognition technology, which involves the digital processing of audio signals, the recognition of human speech, and its transcription into text. It is a rapidly expanding sector of applications that rely on neural networks.
You will find this information useful if you are a newcomer to the field of speech recognition. It may also be of use to those who are just intrigued by modern AI technology and natural language processing in particular. It will give you an insight into the origins of sound recognition systems, how modern systems work in general, what types of speech recognition technology using neural networks exist, and how to train a voice recognition model. The goal is to provide an overview of speech recognition technology and why it is so important in today's world.
How do voice recognition systems work?
A common shortcoming of all the early efforts at automated speech recognition was their very basic approach: spoken phrases were identified as solid audio samples, which were then matched against a collection of examples or, in other words, a reference database of words. As a result, any variations in pace, pitch, and pronunciation of spoken words dramatically degraded the quality of recognition.
Specialists realized that machines had to be taught to pick up individual sounds, phonemes, or syllables, and then construct words from them. Such an approach helps neutralize the speaker-dependence problem, where recognition quality varies drastically depending on the person talking.
The implementation of neural networks and deep learning approaches presented an opportunity to fundamentally enhance the quality of coherent speech recognition, the key success factor being the richness and quality of training sets for building acoustic and linguistic speech models. Neural networks' training capability has significantly raised the level of speech recognition quality.
These algorithms are familiar with the pattern of typical word sequences found in a natural conversation and therefore are capable of perceiving the linguistic structure of the target language. Moreover, each new piece of voice information impacts the processing quality for the following one, thus training the neural network and minimizing the mistakes.
The overall principles of this technology can be described with the example of how voice search works on a smartphone. The device doesn't actually hear whole words, but rather an unstructured audio signal with no distinct boundaries. The speech recognition software analyzes a phrase pronounced by a human based on this uninterrupted digital signal in the following way:
The gadget records a voice inquiry.
The neural network analyzes the flow of speech.
The sound wave is then fragmented into phonemes.
The neural network consults its learned patterns and matches the phonemes to a character, syllable, or word.
A sequence of words familiar to the software is generated, with unknown words inserted according to the context.
The information from the two previous steps is integrated, resulting in the transcription of speech into text.
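The steps above can be sketched as a toy Python example. The phoneme inventory, lexicon, and input are invented for illustration; in a real system, a neural network derives the phoneme groups from the raw audio signal rather than receiving them ready-made.

```python
# A toy sketch of the recognition pipeline described above.
# The phoneme symbols and lexicon below are invented for illustration.

PHONEME_LEXICON = {
    ("HH", "EH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}

def phonemes_to_text(phoneme_groups):
    """Match each group of phonemes against the lexicon and assemble a sentence."""
    words = []
    for group in phoneme_groups:
        # Known sequences map directly to a word; unknown ones become a
        # placeholder that a language model would resolve from context.
        words.append(PHONEME_LEXICON.get(tuple(group), "<unk>"))
    return " ".join(words)

print(phonemes_to_text([["HH", "EH", "L", "OW"], ["W", "ER", "L", "D"]]))  # -> hello world
```

The dictionary lookup stands in for the pattern-matching step; the `<unk>` placeholder marks where a real system would fall back on context.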
Modern speech recognition methods
Previous speech recognition approaches mainly specialized in the manual extraction of speech features and traditional tools such as Gaussian Mixture Models (GMM), Dynamic Time Warping (DTW) algorithms, and Hidden Markov Models (HMM). Nowadays neural network-driven algorithms hold a prominent position in speech recognition technology.
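As an illustration of one of these traditional tools, here is a minimal sketch of Dynamic Time Warping. It scores how well two sequences align when they differ in pace; the short numeric sequences below stand in for extracted speech features.

```python
# A minimal sketch of Dynamic Time Warping (DTW): the cost of the best
# alignment between two sequences that may differ in speaking pace.

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) DTW with absolute-difference cost."""
    inf = float("inf")
    # dp[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    dp = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    dp[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # stretch a
                                  dp[i][j - 1],      # stretch b
                                  dp[i - 1][j - 1])  # match
    return dp[len(a)][len(b)]

# The same contour spoken more slowly still aligns with zero cost,
# while a reversed contour accumulates a much larger cost.
print(dtw_distance([1, 2, 3, 4], [1, 2, 2, 3, 3, 4]))
print(dtw_distance([1, 2, 3, 4], [4, 3, 2, 1]))
```

Pace variation is exactly the weakness of rigid sample matching that DTW was designed to absorb, which is why it appears among the traditional tools above.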
Neural networks are computing systems representing a collection of individual units linked as neuron-like elements. They all serve relatively simple purposes. The two major benefits of utilizing a neural network are its learning capability and generalization of the acquired knowledge. The machine learning process serves to train neural networks.
The process of learning provides the neural network with the power to detect complex correlations between the input and output data. The network can provide accurate answers from incomplete or distorted data when generalizing information. Once a large amount of information is acquired, the network gradually gains resistance to errors.
However, the training procedure is a rather labor-intensive process requiring a massive sample of training data. Moreover, the learning process does not guarantee a successful result in all cases, but despite the disadvantages of this method, it is one of the most frequently used ones for speech recognition.
The process of speech recognition with neural networks
Speech is composed of sounds and text is composed of letters. The major task of a neural network during voice detection is to comprehend which letter corresponds to the sound on an audio recording. The letters are then assembled into words, while the words are assembled into complete sentences. Developers train a neural network on a processed dataset to teach it to identify letters by sound.
The dataset contains voice recordings, but raw, unlabeled data won't be of much help. Annotation experts first label such recordings with text, meaning they attach text to audio fragments. The training dataset thus consists of a collection of audio with text annotations, where one audio fragment is commonly no longer than 10 seconds. Pairs of audio and text are fed to the neural network as input, and it must learn to match the audio track to certain letters and words.
After training, the artificial intelligence is supposed to be able to perform a similar task as the annotation experts have done before: it divides the voice recording into short segments and attempts to correctly predict each letter that corresponds to a sound.
Once the likely letters in the voice recording have been estimated, the AI attempts to work out which word they form. The neural network has a dictionary against which it compares the probable characters. As a result, it creates a set of identified words, which it then puts together into sentences. Apart from the recognition itself, it is important that the output text makes sense, is coherent, and is properly arranged.
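The letter-to-word assembly just described can be sketched with a greedy, CTC-style decoding step. The per-frame labels, blank symbol, and tiny dictionary below are invented for illustration; a real system scores many candidate words with a language model instead of a simple membership check.

```python
# A sketch of turning per-frame letter predictions into a word:
# collapse repeated letters, drop the "blank" symbol, then check the
# result against a small dictionary of known words.

BLANK = "-"
DICTIONARY = {"cat", "cap", "hat"}

def collapse_frames(frames):
    """Collapse per-frame letters into a word: c c - a a - t t -> 'cat'."""
    letters = []
    prev = None
    for symbol in frames:
        if symbol != prev and symbol != BLANK:
            letters.append(symbol)
        prev = symbol
    return "".join(letters)

def decode(frames):
    word = collapse_frames(frames)
    # A real system would consult a language model here; this sketch only
    # checks membership in a tiny dictionary of identified words.
    return word if word in DICTIONARY else "<unk>"

print(decode(["c", "c", "-", "a", "a", "-", "t", "t"]))  # -> cat
```

The blank symbol lets the model emit "no new letter" for frames that fall between sounds, which is why repeated letters must be collapsed before the dictionary lookup.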
The cohesion and meaningfulness of the recognized text depend, in particular, on the number and accuracy of texts that the neural network managed to process at the learning stage. The greater the number and quality of voice recordings processed by the AI, including different intonations and emotions, narrators, and contextual content, the better the prediction will be. Experts have to collect a dataset of at least a few thousand hours of audio for proper recognition quality. Such data is then presented to the algorithm, which trains the voice recognition model.
Types of speech recognition technology
Recurrent Neural Networks
Currently, one of the finest technologies for building speech recognition software comes in the form of recurrent neural networks, the basis for many modern voice, music, image, face, object, and text identification services. They are extremely efficient at word processing, as well as at predicting the most likely contextual terms when those have not been recognized.
Recurrent neural networks have the ability to process sequential chains of events in time, or consecutive spatial chains, and to use their internal memory to handle sequences of arbitrary length. Recurrent neural networks are therefore especially useful for complex tasks where something integral is separated into parts, such as speech recognition.
After prolonged training on a database of diverse pronunciation types, recurrent neural networks can accurately differentiate phonemes and assemble words from them regardless of the type and quality of the pronunciation, and can correctly predict words from context when background noise or vague pronunciation prevents them from being identified explicitly.
Yet, a plain recurrent neural network predicts a missing word relying only on the immediate context of approximately five words. The long short-term memory (LSTM) architecture for recurrent neural networks, designed in 1997, was created specifically to give such networks the ability to account for context at a greater distance from the data segment being processed. A neural network with LSTM can consider the entire text being recognized when it has to guess a word.
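The gating mechanism that lets an LSTM carry context over long distances can be sketched in plain Python. The scalar weights below are invented toy values shared by all gates for brevity, whereas a real LSTM learns a separate weight matrix for each gate.

```python
# A sketch of a single LSTM cell step for scalar input and state.
# Toy weights; a real cell uses learned weight matrices per gate.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w=0.5, u=0.5, b=0.0):
    """One LSTM time step (all gates share the same toy weights here)."""
    f = sigmoid(w * x + u * h_prev + b)          # forget gate: how much old memory to keep
    i = sigmoid(w * x + u * h_prev + b)          # input gate: how much new info to write
    o = sigmoid(w * x + u * h_prev + b)          # output gate: how much memory to expose
    c_tilde = math.tanh(w * x + u * h_prev + b)  # candidate memory content
    c = f * c_prev + i * c_tilde                 # cell state carries long-range context
    h = o * math.tanh(c)                         # hidden state passed to the next step
    return h, c

# Run a short sequence through the cell; the cell state accumulates context.
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c)
print(round(h, 4), round(c, 4))
```

The key design point is the cell state `c`: because it is updated additively and the forget gate can stay close to 1, information can survive many time steps instead of fading after a handful of words.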
Convolutional Neural Networks
A convolutional neural network refers to a particular artificial neural network architecture targeted at the accurate recognition of objects and faces in photos, as well as speech recognition technology.
This network architecture is much more precise in object recognition in images, as it takes the two-dimensional topology of the picture into account. However, convolution is a versatile process that specialists employ for any signal, be it a video, an audio signal, or an image.
The essence of convolution is that each segment of the image is multiplied element-wise by the convolution kernel, with the results summed up and recorded at the corresponding position of the output image. Despite its primary application to images, convolution can also be applied to text and audio.
Audio data may be converted into an image called a spectrogram. A spectrogram is a visual representation of an audio signal as a time-varying spectrum of frequencies. CNNs are capable of analyzing this kind of representation of sound signals by means of convolution layers for speech recognition.
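Building such a representation can be sketched with a naive discrete Fourier transform over short frames of the signal. The frame size, hop, and test tone below are invented for illustration; a real pipeline adds windowing, a fast FFT, and log-mel scaling.

```python
# A sketch of turning audio samples into a spectrogram-like representation:
# slice the signal into frames and compute magnitude spectra with a naive DFT.

import cmath
import math

def frame_spectrum(frame):
    """Naive DFT magnitudes for one frame of samples (first half of the bins)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) for k in range(n // 2)]

def spectrogram(signal, frame_size=8, hop=4):
    """List of per-frame spectra: a time-frequency 'image' a CNN can process."""
    return [frame_spectrum(signal[start:start + frame_size])
            for start in range(0, len(signal) - frame_size + 1, hop)]

# A pure tone at 2 cycles per frame: its energy concentrates in one
# frequency bin of every frame of the spectrogram.
tone = [math.sin(2 * math.pi * 2 * t / 8) for t in range(32)]
spec = spectrogram(tone)
print(len(spec), max(range(4), key=lambda k: spec[0][k]))
```

Stacking these per-frame spectra side by side yields the two-dimensional time-frequency picture on which convolution layers operate.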
How to train a voice recognition model?
Overall, no matter what the goals of your project are, the process of creating a voice recognition model will consist of the following steps.
Define the task
For voice recognition projects, experts have to determine exactly what the model has to learn and what kind of audio data it needs, along with figuring out how to collect and label it. At this stage, appropriate instructions and documents regulating the development of the future voice recognition model must be drafted.
The goals of the project will determine the software you will need to collect and annotate data, along with the type of audio information. For example, the training material type may depend on whether the model will be used to recognize phone calls or audiobooks.
Choose an ML model
A suitable speech recognition system may be selected for training from already existing ones and modified to suit your needs if necessary. Many tech companies provide their own, so-called proprietary recognition systems. Most of these are fee-based; however, you can test some for free.
Alternatively, you may use open-source speech recognition software. Some of these open-source and proprietary tools contain previously uploaded and trained datasets for voice recognition and generation of the required texts. Others merely deliver an engine with no set of data, and developers have to build training modules on their own. However, some projects may require programmers to develop custom speech recognition models from scratch. Such development is typically feasible only for larger companies that seek higher levels of security.
Collect audio data
An ML model which is expected to recognize sounds must be presented with a large collection of audio voice recordings, accompanied by text. That is why at this vital stage, specialists gather large collections of audio data to train their future ML models.
Many specialized websites already have pre-labeled, ready-to-use speech datasets. In addition, crowdsourcing platforms provide services for collecting new datasets specifically for each project. Users record audio messages according to the specified task and get remuneration for it.
Organizations may also assemble a dataset from recorded conversations (provided they have a proper legal process established for doing so). The availability of real conversations is crucial in a dataset for speech recognition, since people do not recite text in real life; a neural network trained exclusively on deliberately recited speeches does a poor job of recognizing real conversations.
To ensure the finest performance of the final speech recognition system, experts have to collect audio files that would maximally reflect the essence of the targeted data that the model will have to recognize in the future. They have to train the models by adapting them to the specifics of their tasks. Each application field contains unique features. This should be taken into consideration during the development and preparation of datasets.
Label audio data
To train the system, a large number of matched recordings and texts are necessary. Therefore, audio data has to be labeled so that the model knows the result it is expected to produce. Annotators mostly employ audio labeling software, in which fragments of audio recordings are assigned matching text data.
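The output of this labeling step is often stored as a manifest pairing each audio fragment with its transcript. The file names and texts below are invented for illustration; real manifests commonly also record duration and speaker ID.

```python
# A sketch of labeled training data after annotation: each audio fragment
# is paired with its transcript, one JSON object per line ("JSON lines").
# The file paths and transcripts are invented examples.

import json

manifest = [
    {"audio": "clips/0001.wav", "text": "turn on the lights"},
    {"audio": "clips/0002.wav", "text": "what is the weather today"},
]

# Serialize one entry per line, as many training pipelines expect.
lines = [json.dumps(entry) for entry in manifest]
print(lines[0])
```

Keeping one entry per line makes such manifests easy to stream, shuffle, and split during training.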
Train the model
Once all the data is labeled, experts use it to train their speech recognition model.
Set up model monitoring
Once the experts achieve the desired performance of the voice recognition model, continuous quality control should be established. This step is critical for monitoring recognition consistency so that the model's voice recognition performance stays top-notch.
Conclusion
Speech recognition systems have come a long way from primitive robots to sophisticated neural networks. Currently, speech recognition techniques based on the latter are at the height of their popularity. They continue to evolve, providing more and more advanced techniques for converting speech into text.
These days, it is hard to imagine the modern world without voice recognition technology: some people cannot live without the voice assistants in their smartphones or smart homes, while others cannot live and work properly without speech recognition solutions. For instance, for people with hearing impairments or for those who have trouble typing, specialized systems can convert spoken language into text. This is why the development of speech recognition technology is so important. Not only does it improve life and take it to the next level, but it also allows people with disabilities to lead fuller lives, work, and grow.
About Toloka
Toloka is a European company based in Amsterdam, the Netherlands that provides data for Generative AI development. Toloka empowers businesses to build high quality, safe, and responsible AI. We are the trusted data partner for all stages of AI development from training to evaluation. Toloka has over a decade of experience supporting clients with its unique methodology and optimal combination of machine learning technology and human expertise, offering the highest quality and scalability in the market.
Article written by:
Natalie Kudan
Updated:
Feb 8, 2023