Natalie Kudan
Supervised machine learning and the usage of labeled data
Machine learning is a specialized approach to training computers without the need for explicit programming. The crucial difference from conventional programming is that the developer does not have to write highly structured code in order to instruct the system.
Any kind of learning becomes easier when someone helps you, that is, supervises you. Likewise, in machine learning, a machine learns faster if something guides it through the processes it needs to understand. Supervised learning is considered the most extensively studied and most common type of machine learning, but it is not the only one; the other types are discussed in more detail below.
But why train machines? Machine learning is applied in situations where humans do not know precisely how to design an algorithm to solve a problem. These include tasks such as recognizing traffic signs and reading handwritten text, which people solve every day without clearly understanding how they do it.
Machine learning algorithms are employed to handle such tasks in order to help humans simplify and often speed up their work. When the algorithm is clear, on the other hand, developers simply write conventional software code.
Machine learning utilizes data and algorithms to emulate human behavior and represents one of the areas of Artificial Intelligence. AI refers to an artificially created system that is equipped with the capability to replicate the intellectual and cognitive processes performed by human beings.
Below, we take a closer look at supervised and unsupervised machine learning, explain what labeled and unlabeled data are and why they are needed, and briefly touch upon machine learning models.
Supervised and unsupervised learning
Machine learning algorithms fall into two main groups: supervised learning (a concept closely related to active learning) and unsupervised learning. The two groups are used for different purposes. The key distinguishing feature of a supervised learning algorithm is that the machine learns from labeled training data. Algorithms of this type need a supervising mechanism, and the labels play that role.
Meanwhile, in unsupervised machine learning, the machine becomes smarter with the help of unlabeled data. So there are no correct choices assigned to the data. The machine has to find dependencies between objects or information on its own. As the name suggests, these algorithms have to learn "on their own".
Labeled data comes from the labeling process, in which certain objects in raw data, such as videos or images, are identified and labeled. It helps machine learning models make accurate predictions and estimates. Commonly, the data is labeled by humans. A machine learning model is a program or file that has been trained to identify particular kinds of patterns.
Supervised learning
Supervised learning is the most common and researched kind of machine learning since it is much easier to train a machine with labeled training data. With this type of machine learning, certain experts generate what is called a training dataset. They label data to produce a set of examples and correct answers from an expert's point of view.
As an example, the set may consist of pictures of highways in which every object is assigned a label. A dataset can also consist of videos, texts, or audio files, depending on the type of data a model has to work with.
After labeling, ML engineers feed the resulting labeled dataset to the ML model. The purpose is to train the model, from examples, to figure out what it will need to predict in the future. Labeled training data provides the information that supervises the model's predictions. Supervised learning is used for two kinds of tasks, depending on the type of prediction desired.
Classification
The goal of classification is to sort objects into pre-defined categories. Under the supervision of labeled training data, the machine learns to identify the relevant properties in the unlabeled data it will receive after successful training. Classification is suitable for sorting articles by topic or language, images by their content, music by its genre, users by their preferences, and messages in people's mailboxes.
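To make the idea concrete, here is a minimal sketch of a classifier in plain Python. It is only an illustration: it uses a simple nearest-centroid rule, and the two features (number of links and number of exclamation marks in a message) as well as the "spam" and "ham" labels are invented for this example.

```python
from collections import defaultdict
from math import dist


def fit_centroids(features, labels):
    """Average the feature vectors of each class seen in the labeled data."""
    grouped = defaultdict(list)
    for x, y in zip(features, labels):
        grouped[y].append(x)
    return {
        label: tuple(sum(column) / len(points) for column in zip(*points))
        for label, points in grouped.items()
    }


def predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: dist(centroids[label], x))


# Invented labeled training data: (links, exclamation marks) per message.
train_x = [(8, 5), (7, 6), (9, 4), (0, 1), (1, 0), (2, 1)]
train_y = ["spam", "spam", "spam", "ham", "ham", "ham"]

centroids = fit_centroids(train_x, train_y)

# New, unlabeled messages are classified using what was learned.
print(predict(centroids, (6, 5)))
print(predict(centroids, (1, 1)))
```

The labels in train_y are what makes this supervised: without them, the model would have no notion of which group is which.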
Regression
Regression is about predicting a number, not a trait or category as in classification tasks. The machine attempts to draw a line that reflects the average dependency. With regression, we may calculate the predicted cost of a car by its mileage, the chance of occurrence of traffic jams based on the time of day, the expected demand for a product, and so on.
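As a rough illustration of drawing such a line, the sketch below fits a straight line to a handful of made-up car listings using ordinary least squares. The mileage and price figures are invented, and the function name fit_line is our own.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented labeled examples: mileage in thousands of km, price in thousands.
mileage = [20, 50, 80, 110, 140]
price = [18.5, 14.5, 12.5, 8.5, 6.0]

slope, intercept = fit_line(mileage, price)

# Predict a number (the price) for a car the model has never seen.
predicted = slope * 100 + intercept
print(f"price drops by about {-slope:.3f} per thousand km")
print(f"predicted price at 100 thousand km: {predicted:.2f}")
```

Unlike the classification case, the output here is a number on a continuous scale rather than one of a fixed set of categories.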
Supervised machine learning: labeled data
Computer systems have evolved enormously over the past twenty to thirty years, and so has what they are capable of, including progress in artificial intelligence. Some of their capabilities were unimaginable only a few years ago.
However, without human help, AI-based computer systems are very limited and often require a data labeling process in order to become useful.
For large companies and individuals alike, data is the most useful resource. All of a person's experience, accumulated over a lifetime, can be called data, and for many people that experience has not come easily. Similarly, the data annotation process is often one of the most expensive and time-consuming parts of a project. Although seemingly trivial, labeling requires a huge amount of time and energy.
Data labeling as a concept and a field of expertise has emerged due to the need to provide large volumes of specifically prepared data as input for training systems. In the process of labeling, a qualitative transformation is carried out: raw data is supplemented with labels and turned into labeled datasets.
A labeled dataset consists of processed and structured information. In addition to the already mentioned labeled images, video, and sound files, datasets are often presented in a table format. The rows of such a table are called objects or data points, and the columns are called features. Altogether, they are the labeled set of data, which is the basis for machine learning.
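A tiny, hypothetical example of such a table is shown below. The column names and values are invented; the last column holds the label, which is what makes the set labeled.

```python
# Hypothetical labeled dataset in table form (all values are invented).
# Each row is one data point; the last column is the label added by a human.
columns = ["mileage_km", "age_years", "engine_l", "label"]
rows = [
    (120_000, 8, 1.6, "budget"),
    (30_000, 2, 2.0, "premium"),
    (95_000, 6, 1.4, "budget"),
    (15_000, 1, 3.0, "premium"),
]

feature_names = columns[:-1]
features = [row[:-1] for row in rows]  # what the model receives as input
labels = [row[-1] for row in rows]     # the "correct answers" it learns from

for x, y in zip(features, labels):
    print(dict(zip(feature_names, x)), "->", y)
```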
Labeling data is quite a routine and tedious process. For instance, in order to train an application to recognize plants from a photo, it is necessary to identify vegetation in hundreds or even thousands of images. By doing this, the ML model is presented with information it can use to be able to identify plants in any photo.
The task is more complicated because an ML engineer responsible for this model has to plan which kind of data they need, how much of it, and what features it should have. And all of that before the actual labeling process starts, because you cannot simply train the model with random annotated data: to avoid biases and incorrect training, you need to plan it in advance.
Process of data labeling
In a company that labels data manually, the role of labeling expert may be filled by specially hired specialists with narrow expertise: for example, medical professionals who recognize features on X-rays, or traffic experts who label videos or photos for future autonomous car software.
Yet you don't have to have special knowledge to label data. Not all projects and content require expert knowledge. Often companies hire people on so-called crowdsourcing platforms, where a large number of contributors label a large amount of data needed for a company project in a relatively short amount of time. This is one of the fastest and most financially affordable ways to acquire a lot of data assigned with labels.
However, to choose a team of responsible people who will execute their tasks properly, the contractor has to come up with a reliable QA system. This system will have to check the skills of the participants before they start their work on the project and afterward will check the final data for inaccuracies and errors.
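One possible shape for such a QA system is sketched below, under our own assumptions: contributors first answer a few control tasks with known answers, only those above an accuracy threshold are admitted, and their labels are then combined by majority vote, with low-agreement items sent for manual review. All names, thresholds, and data are hypothetical.

```python
from collections import Counter

# Hypothetical control tasks with known correct answers.
GOLD = {"img1": "cat", "img2": "dog", "img3": "cat"}
MIN_ACCURACY = 0.6    # assumed admission threshold
MIN_AGREEMENT = 0.75  # assumed share of votes needed to skip manual review

# Invented answers from four contributors (control tasks plus real tasks).
answers = {
    "ann_a": {"img1": "cat", "img2": "dog", "img3": "cat", "img4": "dog", "img5": "cat"},
    "ann_b": {"img1": "cat", "img2": "dog", "img3": "cat", "img4": "dog", "img5": "dog"},
    "ann_c": {"img1": "cat", "img2": "dog", "img3": "dog", "img4": "dog", "img5": "cat"},
    "ann_d": {"img1": "dog", "img2": "cat", "img3": "dog", "img4": "cat", "img5": "dog"},
}


def accuracy_on_gold(contributor_answers):
    """Share of control tasks the contributor labeled correctly."""
    correct = sum(contributor_answers.get(t) == label for t, label in GOLD.items())
    return correct / len(GOLD)


# Step 1: check skills before the work starts.
qualified = [name for name, a in answers.items() if accuracy_on_gold(a) >= MIN_ACCURACY]

# Step 2: aggregate the final labels and flag doubtful items.
final_labels, needs_review = {}, []
for task in ("img4", "img5"):
    votes = [answers[name][task] for name in qualified]
    label, count = Counter(votes).most_common(1)[0]
    final_labels[task] = label
    if count / len(votes) < MIN_AGREEMENT:
        needs_review.append(task)

print(qualified, final_labels, needs_review)
```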
Also, companies or startups turn to qualified outsourcing organizations or to a crowdsourcing platform for data set creation or data labeling tasks, involving hundreds or thousands of employees who label the data. Then, to eliminate the possibility of errors, QA managers review and double-check the result of labeling.
To accomplish a task as extensive as data labeling, labeling experts are aided by ML engineers or data scientists, and use special software. Some companies develop it from scratch to achieve maximum confidentiality, while others prefer to use ready-made solutions.
In addition to the above approaches, it is possible to resort to a simpler approach and find a ready-to-use open-source dataset for your tasks.
The types of data that may require labeling are very diverse. For instance, a dataset may consist of audio files with voice recordings. To create software that can recognize human speech, such data has to be accompanied by transcripts and then fed to an ML model. A wide range of such recordings is needed for the future software to work efficiently and accurately.
A comparable example would be creating software for self-driving cars that will have to identify items on the road, such as pedestrians, road signs, markings, borders, and so on. The ML model training that would be at the core of such an application requires a massive number of labeled video and/or photo files containing all these objects with the appropriate labels.
Types of data labeling
Various data annotation techniques have been proposed for a wide range of purposes. Here are some of the most popular types these days.
Data processing for computer vision models
Object detection: identifying 2D and 3D objects, for example with bounding boxes.
Object classification: sorting images into categories based on their content.
Image segmentation: selecting objects in images.
Computer vision (CV) refers to the computer science branch which attempts to duplicate human eyesight and perception. In other words, computer vision tries to analyze images and videos to retrieve information from what it sees.
A CV system trained on such data is employed to automatically perform image recognition, classify images, find similar images, locate objects, assist in navigation, identification, and so on.
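For instance, an object detection label is often stored as a bounding box plus a class name. The record below is a hypothetical format, not a standard one; the file name, field names, and coordinates are invented.

```python
from dataclasses import dataclass


@dataclass
class BoundingBox:
    """One labeled object: class name plus pixel coordinates of its box."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


def is_valid(box: BoundingBox, width: int, height: int) -> bool:
    """Basic sanity check: the box is non-empty and lies inside the image."""
    return 0 <= box.x_min < box.x_max <= width and 0 <= box.y_min < box.y_max <= height


# Invented annotation for a single highway photo.
annotation = {
    "image": "highway_0001.jpg",  # hypothetical file name
    "width": 1280,
    "height": 720,
    "objects": [
        BoundingBox("car", 100, 400, 300, 520),
        BoundingBox("road_sign", 900, 150, 960, 210),
    ],
}

for box in annotation["objects"]:
    ok = is_valid(box, annotation["width"], annotation["height"])
    print(box.label, box.area(), "valid" if ok else "invalid")
```

A check like is_valid is the kind of automatic rule that can catch obvious labeling mistakes before the data reaches a model.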
Natural language processing (NLP)
Text recognition: transcription of printed or handwritten text.
Named entity recognition and semantic segmentation: assigning labels to different parts of text.
Text classification.
Sentiment analysis: analyzing the tone of texts.
Like CV, natural language processing extends what computers can do, except that instead of helping them see the world around them, it helps them comprehend and interpret human language.
An NLP model (often a text recognition and/or classification model) can be employed to create voice assistants, analyze the mood of a speaker, perform optical character recognition, translation, and much more.
Audio processing
Audio recognition: transcribing audio files into text.
Assigning labels to sounds, etc.
Audio file labeling is the foundation for speech recognition technology and is employed in such fields as voice assistant development, voice-to-text conversion, and so on.
Data plays a major role in machine learning. There has to be enough of it, often a great deal, and it has to be of high quality. Projects are often canceled or postponed indefinitely simply because the necessary data cannot be collected.
Supervised machine learning is simply impossible without labeled data. Previously, we have described how you can label the necessary data, but if these methods are not available to you in any way, you can use unlabeled data.
Unsupervised learning
As we have already learned, supervised learning is best suited for problems where there is an impressive set of reliable labeled data to train the algorithm. But that is not always the case. Lack of data is one of the most common problems in machine learning. This is where unsupervised learning comes in: a method of machine learning in which a model is trained to identify patterns and hidden dependencies in unlabeled datasets without any human oversight.
Unsupervised learning is less frequently applied in reality. However, this type of ML should not be considered to be inferior or superior to supervised learning. It is just employed for different kinds of tasks. There are tasks or circumstances where there is simply no other choice. For example, if labeled data cannot be found for a project, or if the project budget does not allow for high-quality training data labeling.
Despite this, unsupervised learning algorithms may be employed to solve more complex processing problems compared to supervised learning. At the same time, the result of this type of ML is frequently unexpected and there are often no obvious regularities. Since machines lack labels to learn from, the goal of unsupervised learning is to discover patterns in the data and aggregate them. It solves 3 types of problems:
Clustering
The task of clustering is to group items by characteristics that are not known in advance. Clustering is essentially classification, but with no classes known beforehand. The clustering process aims to find similar objects and group them into clusters.
Some examples of clustering can be found in photo apps that identify people's faces in photos and group them into photo albums. The app doesn't even know people's names, although it can distinguish them by their facial features.
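A minimal sketch of clustering is shown below, using the well-known k-means algorithm on invented 2D points. Real face grouping works on far richer features; here, taking the first k points as the starting centers is a simplification chosen to keep the example deterministic.

```python
from math import dist


def k_means(points, k, iterations=10):
    """Plain k-means: assign points to the nearest center, then recompute centers."""
    centroids = list(points[:k])  # naive, deterministic initialization
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(column) / len(cluster) for column in zip(*cluster))
            if cluster else centroids[i]  # keep the old center if a cluster is empty
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Invented, unlabeled points: no one has told the algorithm which group is which.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]

centroids, clusters = k_means(points, k=2)
for number, cluster in enumerate(clusters):
    print(f"cluster {number}: {cluster}")
```

Note that the output contains cluster numbers only. Just as the photo app does not know people's names, the algorithm finds the groups but cannot say what they mean.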
Association rule learning
This ML method discovers rules that describe how items in a dataset occur together. It is used for recommendations in online stores and for finding correlations between customer purchases. Retailers may learn which items were bought together and apply that information to increase their sales.
Association rules allow us to find out when and under what circumstances customers purchase certain product combinations. Information regarding past shopping patterns and the timing of those purchases helps to establish a discount program and generate personalized offers to boost sales.
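The sketch below illustrates the idea on five invented shopping baskets. It computes two measures commonly used for association rules: support (how often a pair of items appears together) and confidence (how often the second item appears when the first one does). The basket contents and the helper name rule_stats are our own.

```python
from collections import Counter
from itertools import combinations

# Invented purchase history: each set is one customer's basket.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(
    pair for basket in baskets for pair in combinations(sorted(basket), 2)
)


def rule_stats(antecedent, consequent):
    """Support and confidence of the rule 'antecedent -> consequent'."""
    pair = tuple(sorted((antecedent, consequent)))
    support = pair_counts[pair] / len(baskets)
    confidence = pair_counts[pair] / item_counts[antecedent]
    return support, confidence


support, confidence = rule_stats("bread", "butter")
print(f"bread -> butter: support={support:.2f}, confidence={confidence:.2f}")
```

A retailer could use rules with high support and confidence to decide which products to bundle or discount together.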
Dimensionality reduction
This kind of method assembles particular features into higher-level abstractions. It entails a data transformation so that the amount of data is reduced. This technique is employed to remove uninformative and redundant data which complicates processing from the set. It reduces the size of memory that is required for the dataset and speeds up the work of ML algorithms.
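As a deliberately simple illustration, the sketch below removes columns that carry no information (the same value in every row) or that duplicate another column. This is feature selection, one of the simplest forms of dimensionality reduction; techniques such as principal component analysis go further and combine features into new ones. The column names and values are invented.

```python
from statistics import pvariance


def reduce_features(rows, names, min_variance=1e-9):
    """Drop constant columns and exact duplicates of earlier columns."""
    columns = list(zip(*rows))
    kept_indices, seen = [], set()
    for index, column in enumerate(columns):
        if pvariance(column) <= min_variance:
            continue  # uninformative: the value never changes
        if column in seen:
            continue  # redundant: exact copy of a column already kept
        seen.add(column)
        kept_indices.append(index)
    kept_names = [names[i] for i in kept_indices]
    reduced_rows = [tuple(row[i] for i in kept_indices) for row in rows]
    return kept_names, reduced_rows


# Invented dataset with one duplicated and one constant column.
names = ["height_cm", "height_copy", "country_code", "weight_kg"]
rows = [
    (170, 170, 1, 65),
    (180, 180, 1, 80),
    (165, 165, 1, 55),
]

kept_names, reduced_rows = reduce_features(rows, names)
print(kept_names)
print(reduced_rows)
```

The reduced table has half as many columns, so it takes less memory and gives later algorithms less to process, without losing any information that was actually present.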
Unlabeled data
As mentioned earlier, data labeling requires time and sometimes the specialized knowledge of a skilled professional. For instance, it would take a huge amount of time for a person to label ten thousand photos of handwritten text to solve the problem of handwriting recognition.
In contrast, unlabeled input data may be used in its original form, that is, in its unprepared state. That is why it is also called raw data.
Unlabeled data is used not only in unsupervised learning but also in semi-supervised learning, an approach that combines a small amount of labeled data with a comparatively larger amount of unlabeled data.
Training a semi-supervised learning model starts with training on a small amount of labeled data. After a thorough learning process, a large amount of unlabeled data is fed to the algorithm. The model should independently determine the labels for this unlabeled data; these are referred to as pseudo-labels, or synthetic labels. Then the previously labeled set of data is blended with the unlabeled data, which now has pseudo-labels and so has in effect become labeled as well. The model can now be trained on this mixed data.
Another branch of machine learning, reinforcement learning, uses very little pre-collected data. Such algorithms do not require labels for learning, since they allow the computer to develop an optimal strategy on its own as it interacts with the surrounding environment.
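The steps above can be sketched as follows. The model here is a simple nearest-centroid classifier chosen for brevity, and all points and labels are invented; real pipelines often keep only the pseudo-labels the model is confident about, which this sketch does not do.

```python
from math import dist


def fit(points, labels):
    """Nearest-centroid 'model': one average point per class."""
    grouped = {}
    for p, y in zip(points, labels):
        grouped.setdefault(y, []).append(p)
    return {
        y: tuple(sum(column) / len(ps) for column in zip(*ps))
        for y, ps in grouped.items()
    }


def predict(model, p):
    return min(model, key=lambda y: dist(model[y], p))


# Invented data: very few labeled points, more unlabeled ones.
labeled_x = [(1.0, 1.0), (9.0, 9.0)]
labeled_y = ["a", "b"]
unlabeled_x = [(1.5, 2.0), (2.0, 1.0), (8.0, 9.5), (9.5, 8.0)]

# Step 1: train on the small labeled set.
model = fit(labeled_x, labeled_y)

# Step 2: let the model assign pseudo-labels to the unlabeled data.
pseudo_y = [predict(model, p) for p in unlabeled_x]

# Step 3: blend both sets and retrain on the mixed data.
mixed_x = labeled_x + unlabeled_x
mixed_y = labeled_y + pseudo_y
model = fit(mixed_x, mixed_y)

print(pseudo_y)
print(predict(model, (3.0, 3.0)))
```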
In reinforcement learning, the system repeatedly acts, perceives feedback from the environment, and improves its behavior, and this cycle continues until it arrives at the optimal solution to the problem at hand. The reward is the most important element, as it is the feedback that indicates how well the algorithm is working at the current moment in time.
There is also the concept of transfer learning, an area of machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. For example, a pre-trained model that can recognize cars could be reused when trying to recognize trucks.
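To show how a reward can drive learning, here is a minimal sketch of tabular Q-learning, one well-known reinforcement learning algorithm, in a made-up environment: a corridor of five cells where the agent earns a reward of 1 for reaching the last cell. The environment, the hyperparameters, and the purely random exploration are our own simplifying assumptions.

```python
import random

N_STATES = 5                 # corridor cells 0..4; reaching cell 4 ends an episode
ACTIONS = ("left", "right")
ALPHA, GAMMA = 0.5, 0.9      # assumed learning rate and discount factor


def step(state, action):
    """Hypothetical environment: move one cell, reward 1.0 on reaching the goal."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward


rng = random.Random(0)  # fixed seed so the run is repeatable
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(500):  # episodes
    state = 0
    while state != N_STATES - 1:
        action = rng.choice(ACTIONS)  # explore by acting at random
        next_state, reward = step(state, action)
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # Q-learning update: move the estimate toward reward + discounted future value.
        q[(state, action)] += ALPHA * (reward + GAMMA * best_next - q[(state, action)])
        state = next_state

# The learned strategy: in each cell, take the action with the highest value.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)
```

No labels are involved at any point: the only signal the agent receives is the reward returned by the environment.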
Thus, not all ML algorithms necessarily need labeled data to accomplish tasks. As we have discussed, unlabeled data may be used in unsupervised learning, semi-supervised learning, and reinforcement learning. Only its quantity will be different.
However, it is worth remembering that in some cases unlabeled data may provide significant help and considerably increase the algorithm's precision, whereas in other cases it may even reduce the quality of the solution to the problem.
To figure out which data you need to create ML models, you should define the goals of the project before embarking on it, and thoroughly examine the available data. Undoubtedly, using unlabeled data saves time and budget on data labeling. That's why unsupervised learning, semi-supervised learning, and reinforcement learning have become great solutions in many fields involving terabytes of unlabeled data.
Conclusion
Machine learning has become firmly embedded in our lives in recent years, and it will only evolve faster. Labeled data plays a huge role in the development of ML algorithms. Supervised machine learning is impossible without it, and it is the type of machine learning that is considered the most widespread and thoroughly studied. The computer is given a ready-made dataset, from which it learns how to make future predictions.
Unsupervised learning, on the other hand, employs unlabeled data. Compared to supervised learning, its task is defined less clearly. It is more challenging for a machine to solve such a task because it uses unlabeled data, which means that the system has no predetermined correct answer and must identify the patterns on its own.
However, a well-thought-out course of action helps developers obtain the solutions they need with unsupervised learning algorithms. Unlabeled data is also widely used in semi-supervised learning and in reinforcement learning.
In addition, supervised learning models tend to be more precise than unsupervised ones, although they require direct human involvement and accurate data labeling. In contrast, unsupervised models explore unlabeled data on their own. But they still require some degree of human input to help verify the output.
The use of unlabeled data may substantially save time and money for the company since the labeling process is very time-consuming. Nevertheless, the results of unsupervised learning, semi-supervised learning, and reinforcement learning algorithms should also be validated by humans to avoid errors and inaccuracies.
Therefore, when choosing an approach to machine learning, everything depends on how we formulate the problem that needs to be solved for a particular project. Furthermore, many tasks are easily defined as one type of learning and then can be transformed into another.
As the world changes dramatically, machine learning is becoming more and more integrated into our daily lives. Understanding the basics will help you navigate the world better and allow you to make better judgments about today's ML technologies. Humans have significantly expanded their capabilities with AI, and soon, these capabilities will become even greater.
Article written by: Natalie Kudan
Updated: Mar 24, 2023