What is data labeling in machine learning and how does it work?

by Natalie Kudan


Data is becoming increasingly valuable as artificial intelligence (AI) makes its way into our daily lives. Today, everybody talks about algorithms and new machine learning models pre-trained on billions of parameters, but very few people realize that what actually powers machine learning is the process of acquiring training data. So how and where does it all start?

The shift towards ML and AI technologies relies heavily on properly labeled data, which algorithms use to identify issues and suggest solutions. In other words, before data can be used to train models, it has to be labeled.


What is the data labeling process?

Data labeling for machine learning and AI development means processing unlabeled data, such as images, videos, text, and audio, and adding one or more meaningful labels that give models the context they need to make accurate predictions.

Such labeled data can be used as a training dataset, as well as for testing and verifying the performance of a given machine learning algorithm. Data quality is often more important than quantity: carefully selected and properly balanced sets of high-quality data are crucial in supervised learning. For example, a large but imbalanced training dataset, one with a heavily uneven number of examples per class, can lead to biased, low-quality predictions from a model trained on it.

To avoid such cases, people responsible for creating the model (e.g. data scientists or machine learning engineers) need to develop and set up a proper data labeling process. They carefully plan which kind of data they need, how it should be labeled, which data labeling tools are the most fitting, and how to set up quality assurance to get accurately labeled data.

Some machine learning models employ an even more complex data labeling process, which involves active learning. In ML, active learning is a special case in which an algorithm can interactively request to label new data points with the desired outputs. Such a process might involve acquiring synthetic data through automated data labeling, getting human-labeled data, or both.
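
To make the idea concrete, here is a minimal sketch of one common active learning strategy, uncertainty sampling, assuming a scikit-learn-style classifier; the `request_labels` function is a hypothetical stand-in for sending items to human annotators.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def request_labels(items):
    """Hypothetical stand-in for sending items to human annotators
    (e.g. via a crowdsourcing platform) and collecting their labels."""
    raise NotImplementedError

def active_learning_round(model, X_labeled, y_labeled, X_pool, batch_size=100):
    # Train on whatever labeled data we have so far.
    model.fit(X_labeled, y_labeled)

    # The smaller the top predicted probability, the less certain the model is.
    probabilities = model.predict_proba(X_pool)
    uncertainty = 1.0 - probabilities.max(axis=1)

    # Pick the items the model is least sure about and request human labels for them.
    query_idx = np.argsort(uncertainty)[-batch_size:]
    new_labels = request_labels(X_pool[query_idx])

    # Fold the newly labeled points back into the training set.
    X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_pool = np.delete(X_pool, query_idx, axis=0)
    return model, X_labeled, y_labeled, X_pool

# Example setup (purely illustrative):
# model = LogisticRegression()
# model, X_l, y_l, X_p = active_learning_round(model, X_l, y_l, X_p, batch_size=50)
```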

Now let's look at some data labeling examples. Labels may indicate whether a picture contains a building or a vehicle, what was said on an audio recording, or whether an X-ray reveals a fracture.

What are the common data labeling tasks for machine learning?

In ML, data labeling is used to provide machine learning models with data to train on. For instance, if we want to train a machine learning model to find defects on roads, we first need to show it images of cracks or abrasions. The annotation would consist of polygons highlighting the flaws and tags identifying them.

Let's now take a look at the most common AI domains and the types of data labeling that go with them.

Computer vision

Simply put, data labeling in computer vision is the process of annotating images so that computers can "see" the world around them. Computer vision models usually require large amounts of labeled image data and/or video data. Some examples of data labeling in this scenario are image classification, side-by-side comparison, and object detection (with bounding boxes or polygons).


Depending on the job you want the machine learning model to perform, the data labeling task could be to classify images into predefined categories, detect the location of objects, identify key points in an image, or generate a bounding box that completely encloses an object in an image.
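
As a concrete illustration, an object detection label is often stored as a simple record tying an image to class names and box coordinates. The schema below (field names, boxes as [x, y, width, height] in pixels) is just one common convention, loosely inspired by the COCO format, and the file names are made up.

```python
import json

# One labeled image with two detected road defects; all values are illustrative.
annotation = {
    "image": "road_0421.jpg",
    "width": 1920,
    "height": 1080,
    "objects": [
        {"label": "crack",    "bbox": [312, 540, 220, 48]},   # [x, y, width, height]
        {"label": "abrasion", "bbox": [1104, 610, 95, 130]},
    ],
}

print(json.dumps(annotation, indent=2))
```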

The labels can then be used as input data to train a computer vision model to automatically analyze images. For example, an image moderation model can evaluate whether an image contains any violations.

Natural language processing

Natural language processing (NLP) refers to the analysis of human languages and their forms during human-machine interaction. For example, you might want to recognize key parts of speech or text in images or perform sentiment analysis to determine the emotional tone behind a text.


Labeled data in natural language processing is used to train machine learning models to perform such tasks. Spam detection, machine translation, speech recognition, text summarization, virtual assistants, and chatbots are all examples of how text data labeling is used in natural language processing solutions.
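
For text, a labeled example can be as simple as a piece of raw text paired with a class label, or token-level tags for tasks like named entity recognition. The records below are generic illustrations and not tied to any particular annotation tool.

```python
# Document-level labels, e.g. for sentiment analysis.
sentiment_examples = [
    {"text": "The delivery was fast and the packaging was great.", "label": "positive"},
    {"text": "The product broke after two days.", "label": "negative"},
]

# Token-level labels for named entity recognition, using the common BIO tagging scheme.
ner_example = {
    "tokens": ["The", "order", "was", "shipped", "from", "Berlin", "on", "Monday", "."],
    "tags":   ["O",   "O",     "O",   "O",       "O",    "B-LOC",  "O",  "B-DATE", "O"],
}
```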

Audio processing

The process of converting unstructured sounds, like speech or wildlife noises, into a structured format that can be used in machine learning is known as "audio processing." Often, audio processing requires the manual transcription of sounds into text, from which you can derive additional information about the audio by adding tags and categorizing it. As a result, you get labeled data for machine learning that can be incorporated into a variety of audio processing models.
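
A transcription label usually pairs an audio file, or a time segment of it, with the transcribed text plus any extra tags. The record below is again a generic illustration with made-up field names.

```python
# One transcribed and tagged audio file; all field names and values are illustrative.
audio_label = {
    "audio_file": "call_0113.wav",
    "segments": [
        {"start_sec": 0.0, "end_sec": 4.2, "speaker": "agent",
         "text": "Hello, how can I help you today?"},
        {"start_sec": 4.2, "end_sec": 7.8, "speaker": "customer",
         "text": "I'd like to check the status of my order."},
    ],
    "tags": ["customer_support", "english"],
}
```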


How is labeled training data used in AI and machine learning?

AI technologies rely heavily on a data labeling workforce, or, to be more specific, on people who manually label datasets for machine learning. Here are a few examples.

When we ask Alexa or another voice assistant to play our favorite music, how do they know what we mean? Under the hood, they rely on deep learning models that have learned to hear and speak to people. And that means the computer needs to be either pre-trained or fed hundreds of thousands of hours of human speech with different voices and accents. This process involves a significant amount of text data labeling.

Another example is self-driving cars and autonomous vehicles. They are also fed millions of images, including photos of pedestrians, vehicles, and traffic signs.

Search engines are another application of data labeling for AI. Let's take a look at how search engine technologies have progressed to include data labeling. About ten years ago, the industry saw a significant uptick in search technology, with particularly impressive results from tech giants like Google and Yahoo. But progress then plateaued: no machine learning model can keep improving by relying on the same signal, which in the case of search engines was click probability and click counts.

Today, new generations of e-commerce companies are seeing about 50% of their revenue come from direct search results. Leveraging the power of search demands a more reliable system that can manage product visibility and match suitability. And that's where data tagging steps in, helping to find and evaluate the search results most relevant to a specific case.

However, it does not end there. High-quality data labeling relies heavily on human-generated labels, or the "human-in-the-loop" model, which refers to human supervision and validation of a machine learning model's results. In other words, human judgment is used to train, refine, and test ML models, and it guides the data labeling process so that models are fed the datasets most relevant to a given project.

Here's an example. When you're trying to find something using a search engine or buying a product online, where do you think the most relevant search results come from? Simply put, they come from human annotators. Their job is to decide which search result is the most or least relevant for a given user query, and their judgments are then used to train the search algorithm.

Let's break down how ML data labeling is usually set up in companies and take a look at the potential drawbacks.

How do AI-powered businesses label data today?

The short answer is: manual labeling. Ironic, right? It's a job that can only be done manually, by a human. Let's say a business decides to develop its own machine learning model and needs to label a collection of datasets. What's their first step? Nearly every company and AI solution starts with a very simple procedure: employees label the data in their datasets by hand.

An in-house approach to data labeling (also called internal labeling), which is when the company's employees produce a labeled dataset, is very common and works well for startups through their first MVP, but problems arise when businesses try to scale further. For industries like insurance and healthcare that need to secure the highest quality labels possible, a different labeling approach is necessary.

Scaling data labeling from a few in-house labelers to an industrial solution would require large managed data labeling teams of hundreds, if not thousands, of labelers, as well as dozens of managers to supervise this huge workforce.

That's a lot of human resources to manage, not to mention a lot of operational tasks and the pressure of maintaining data quality. Driving data labeling quality for in-house labeling means dramatically increasing the time involved in labeling that data, making the entire data labeling process slow and costly. So, are there alternative ways to label data for machine learning?

Fortunately, there are, and one of them is crowdsourcing. Crowdsourcing refers to data labeling that employs a large number of freelancers who have signed up with a crowdsourcing platform. Simply put, AI teams post unlabeled data and tasks for labeling, and people choose and complete tasks they are interested in.

What are the common problems with ensuring high-quality labeled data?

ML model performance depends on the quality of the training data. And that's where the human element comes into play. So, how do you ensure the accuracy of human-powered data labeling for AI?

The most common approach is to fact-check by comparing random samples of people's labels against ground truth, usually obtained from a highly trusted source or expert. This method works well for classification tasks, where an annotator's label can be compared directly against the known correct one.
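
In practice, this check often boils down to computing each annotator's accuracy on a small set of control ("golden") tasks mixed in with their regular work. A minimal sketch, assuming answers arrive as (annotator, task, label) records:

```python
from collections import defaultdict

def annotator_accuracy(responses, golden):
    """responses: iterable of (annotator_id, task_id, label) tuples;
    golden: dict mapping task_id -> known correct label."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for annotator, task, label in responses:
        if task in golden:  # only control tasks with known answers count
            total[annotator] += 1
            correct[annotator] += int(label == golden[task])
    return {a: correct[a] / total[a] for a in total}

golden = {"t1": "cat", "t2": "dog"}
responses = [("ann_7", "t1", "cat"), ("ann_7", "t2", "cat"), ("ann_9", "t1", "cat")]
print(annotator_accuracy(responses, golden))  # {'ann_7': 0.5, 'ann_9': 1.0}
```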

However, since it's sometimes impossible to determine the baseline or ground truth you're comparing against, it can't be applied to all tasks. To put it another way, predicting whether a person will provide accurate and reliable labels is extremely difficult.

The reasons behind poor quality data labeling can range from a bad mood to a lack of time, the latter potentially due to the annotator looking to churn through as many tasks as possible.

Another factor sabotaging accuracy and consistency in labeling data is data subjectivity. For example, a typical search solution for an e-commerce site, whose job is to pair user queries and products, requires thousands of labels representing human judgments. The question is, to what extent will they be relevant?

For maximum accuracy, you need large amounts of relevant data. But asking different people will most likely get you different answers, because subjective judgments vary from person to person. As a result, labeler inputs collected at this scale require further processing in order to extract results that are useful for subsequent labeling.

As you can see, there are numerous factors that contribute to the accuracy of the large-scale data labeling process. The problem is that not every company can find the time, the people, and the money to develop something that sophisticated, whether its goals are commercial or scientific.

And here's where purpose-built platforms for scaling and accelerating data labeling come in. They have all the required resources and take a sophisticated, purely technical approach to data labeling for AI. Let's take a closer look at one of them.

How does Toloka ensure accurate data labeling for AI?

Toloka provides a data labeling platform where millions of annotators (Tolokers) from all around the world perform tasks posted by AI teams and companies. The platform brings these two audiences together, and its smart technologies transform the crowd into computing power.

The platform provides AI-powered businesses with the tools they need to manage the quality of data labeling and allows them to smoothly build a pipeline that delivers high-quality labeled data for machine learning.

Toloka's central philosophy is that the collective effort of annotators should be viewed as a dedicated computing resource, and data labeling pipelines should be developed with an eye toward pure technical efficiency. Now let's dig into how the problems we outlined earlier are addressed by Toloka.

Smart matching

Toloka's data labeling procedure, like the vast majority of others, operates under the assumption that human participants will, on average, provide mostly correct answers, with the possibility of some errors.

With the help of Toloka's algorithms, you can configure labeling pipelines with both task characteristics and annotators' skills and experience levels taken into account, allowing for more accurate error prediction and better labeling results.

On top of that, Toloka matches data annotators to specific task types based on the rules specified by the requester: age, spoken language, location, and more. Toloka's algorithms make predictions about how well an annotator will complete their task and the type of task best suited for them. These smart matching mechanisms drive task recommendations for Tolokers as well, ultimately enhancing the accuracy of data labeling. These algorithms are supported by Toloka's 10+ years of research and industry expertise.
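
Conceptually, the rule-based part of this matching works like a filter over annotator profiles combined with a skill threshold. The sketch below is a deliberately simplified illustration with hypothetical field names, not Toloka's actual data model or algorithm.

```python
# Hypothetical annotator profiles and task requirements, for illustration only.
annotators = [
    {"id": "a1", "languages": {"EN", "DE"}, "country": "DE", "skill": 0.92},
    {"id": "a2", "languages": {"EN"}, "country": "US", "skill": 0.71},
]
task_requirements = {"language": "DE", "min_skill": 0.80}

def eligible(annotator, req):
    # Keep annotators who speak the required language and meet the skill threshold.
    return (req["language"] in annotator["languages"]
            and annotator["skill"] >= req["min_skill"])

matched = [a["id"] for a in annotators if eligible(a, task_requirements)]
print(matched)  # ['a1']
```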

Handling non-trivial tasks

A quality control approach based on ground truth is typically used in cases where data labeling is trivial – for example, in image classification tasks. However, since it can be hard to obtain ground truth for more complex labeling tasks like object detection or speech recognition, Toloka employs a post-verification method to ensure high-quality labeling results.

Let's take a look at how post-verification works. A task is assigned to the first annotator. After the initial annotator completes the task, their result is used to generate a new task, which is then distributed to other Tolokers for verification. If a majority of other annotators agree that the task was completed correctly, it is considered verified. If the task is deemed to have been completed incorrectly, the loop is repeated until it is considered correct.
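
A highly simplified sketch of that loop is shown below; the callables are hypothetical stand-ins for the platform's real task assignment and vote collection mechanics.

```python
def post_verification(task, label_task, collect_votes, n_verifiers=5, max_rounds=3):
    """Simplified post-verification loop.

    label_task(task) -> a single annotator's result for the task;
    collect_votes(task, result, n) -> list of booleans from n other annotators
    saying whether they consider the result correct. Both are hypothetical."""
    for _ in range(max_rounds):
        result = label_task(task)                         # first annotator labels the task
        votes = collect_votes(task, result, n_verifiers)  # other annotators verify the result
        if sum(votes) > len(votes) / 2:                   # majority says it is correct
            return result                                 # accepted as verified
        # otherwise the task goes back for relabeling in the next round
    return None  # could not verify within the allowed number of rounds
```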

Spreading tasks

In Toloka, you can delegate microtasks to thousands of independent Tolokers and combine their efforts into error-resistant pipelines by cross-validating each step. Each task is delegated to multiple individuals, and then sophisticated mathematical models are applied to the votes to extract the most reliable signal, factoring in each Toloker's past experience.
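
The simplest version of such aggregation is a majority vote, optionally weighting each answer by the annotator's track record. Here is a minimal sketch; the skill weights are hypothetical inputs rather than values computed by Toloka. More advanced methods, such as Dawid-Skene-style models, follow the same idea but also estimate each annotator's reliability from the data itself.

```python
from collections import defaultdict

def weighted_majority_vote(responses, skills):
    """responses: iterable of (task_id, annotator_id, label) tuples;
    skills: dict annotator_id -> weight reflecting past accuracy (hypothetical)."""
    scores = defaultdict(lambda: defaultdict(float))
    for task, annotator, label in responses:
        scores[task][label] += skills.get(annotator, 0.5)  # unknown annotators get 0.5
    # For each task, pick the label with the highest total weight.
    return {task: max(label_scores, key=label_scores.get)
            for task, label_scores in scores.items()}

responses = [("t1", "a1", "cat"), ("t1", "a2", "dog"), ("t1", "a3", "cat")]
skills = {"a1": 0.9, "a2": 0.6, "a3": 0.8}
print(weighted_majority_vote(responses, skills))  # {'t1': 'cat'}
```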

Platform-wide anti-fraud system

In addition, Toloka monitors data labeling behavior in ways that are less sophisticated but nonetheless crucial for anti-fraud protection: detecting answers given too fast, monitoring mouse movements, detecting the device used for completing labeling tasks, and determining the number of devices connected to a particular Toloker.

Responsive pricing

Toloka operates as an open market where data labeling requesters set the prices themselves. Access to statistics drives a pricing recommendation model for requesters as well as performance-based payment for Tolokers.

On top of that, price-per-performance optimization significantly improves the quality of data labeling outcomes. It's simple: the better a Toloker performs labeling tasks, the more tasks they have access to.

Is the cost of a task, however, a good indicator of its quality? A higher price will indeed attract more Tolokers and motivate them to complete tasks more quickly, allowing you to label training data in less time. Increasing task costs may, however, have side-effects. For example, since fraudsters are mostly drawn to well-paid tasks, increasing the price could encourage even more fraud. To avoid this, Toloka offers recommendations for keeping prices within the market's standard range.

As a result, there is no direct correlation between price and quality. What really matters is so-called dynamic pricing. The higher the performer's quality, the more they are paid, and this has a statistically significant impact on overall labeling quality.

Global crowdsourcing

To achieve inclusive and diverse training data that is representative of the entire global population, Toloka provides access to millions of annotators from over 100 countries across all time zones and speaking over 40 languages.

How does data labeling work with Toloka?

Let's start with what data labeling looks like for Toloka annotators. Becoming a Toloker is straightforward: anyone can register on the platform and gain access to tasks. Once registered, Tolokers get instructions on how to label datasets for machine learning and can complete training, where they work through a training set and a validation set with pre-loaded ground-truth labels, just like an ML model would.

All the standard quality control methods for data labeling, like vote aggregation and comparison with ground truth, are applied at every step of the data labeling process. Requesters don't need to worry about the number of people at the top of the funnel. Anybody who passes the tests with a certain accuracy (for example, over 99%) gets access to production pools and can start earning money.

So, we've covered the annotators. But what is it like for ML teams to work in Toloka? To meet the needs of ML teams, also known as "requesters", Toloka offers the following product categories.

One is an infrastructure-as-a-service platform that provides all the tools needed to set up data labeling independently. For example, if a company has a lot of unlabeled data to process but does not have its own in-house data labelers, it can use Toloka to build pipelines with millions of people already available from every country, speaking any language it might require.

For businesses that have their own labelers responsible for producing ground truth, Toloka provides all of the infrastructure required for managing in-house data labeling. Parallel to in-house labeling, some labeling tasks can be distributed to and completed by the Toloka crowd.

As an infrastructural platform, Toloka effectively empowers scalable and fully automated human-in-the-loop ML pipelines that are integrated into business processes. Data labeling tasks can be sent to Toloka daily and collected to retrain ML models. Pipelines can be frequently tested and optimized for quality.
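
At a high level, such a pipeline is a recurring loop: collect new data, send it for labeling, pull back the aggregated results, and retrain. The sketch below uses hypothetical functions in place of real platform and training calls; it only illustrates the shape of the loop.

```python
import time

def collect_unlabeled_data():     # hypothetical: pull new raw data from production
    raise NotImplementedError

def send_for_labeling(items):     # hypothetical: create labeling tasks on the platform
    raise NotImplementedError

def fetch_labeled_results():      # hypothetical: download accepted, aggregated labels
    raise NotImplementedError

def retrain_and_deploy(dataset):  # hypothetical: retrain the model and roll it out
    raise NotImplementedError

def daily_human_in_the_loop():
    while True:
        send_for_labeling(collect_unlabeled_data())
        labeled = fetch_labeled_results()
        if labeled:
            retrain_and_deploy(labeled)
        time.sleep(24 * 60 * 60)  # repeat once a day
```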

Additionally, Toloka offers custom solutions to companies that don't want to build their own labeling process. This includes custom-built solutions that support data-related processes throughout the machine learning lifecycle, from unlabeled data collection and data annotation to model training, deployment, and monitoring.

With a service level agreement in place, companies that don't have the engineering capacity or expertise themselves can work with one of Toloka's crowd solution architects. They dive deep into the project to design and build custom data labeling pipelines to meet current and changing needs.

What is an adaptive machine learning model?

Toloka supports various types of data labeling, which makes it a general-purpose environment for AI and ML development. As a data labeling AI platform, it has the tools to support almost any solution and AI domain, including NLP, speech recognition technologies, self-driving cars, and search engine solutions.

Additionally, businesses and industries can use a collection of pre-trained models out of the box, adapt them to fit their needs, and build their own data labeling solutions on top of the Toloka platform.

So what are these adaptive machine learning models?

In a nutshell, Toloka provides a number of pre-trained models which can be used out of the box or adapted to your data streams automatically. Our automated service can handle model tuning, evaluation, deployment, and monitoring. With it, you can save time and avoid the repetitive tasks of a typical machine learning lifecycle.

Adaptive ML models are supported by Toloka's expertise in both crowd science and machine learning, and can produce high-quality results and maximize throughput:

- You can achieve reliable accuracy thanks to background human-in-the-loop processes that keep model accuracy stable over time.
- You can continuously improve, optimize, and retrain these models: model evaluation and maintenance use HITL processes for retraining and updates.
- You don't need to invest in infrastructure: the models are available via an API with low latency for model predictions.
- You can easily obtain both raw data and labeled ground truth datasets thanks to seamless integration with the Toloka data labeling platform.

Wrapping up

Successful machine learning models are built on large amounts of high-quality training data. Accurate and thorough data collection, data labeling, and application of that data in model training ensure that models perform to their full capacity.

However, for models to learn how to make the best decisions, they need humans to manually label the training data. This can make the process costly, complicated, and time-consuming, as well as prevent large-scale integration of ML models.

Integrating data labeling with human-in-the-loop and crowdsourcing tools is one way to improve labeling efficiency and thus overcome this barrier. Thanks to platforms like Toloka that facilitate rapid and scalable AI and ML development, from data collection and annotation to model training, deployment, and monitoring, this task is made easier.

Article written by:
Natalie Kudan