Reasons to consider data labeling outsourcing

by Natalie Kudan

In this post, we'll cover the data labeling process in the context of the ML pipeline, from data collection to model training and monitoring. You'll learn what options there are for data-driven ML: from using ready-made datasets to collecting your own data with an in-house team or data labeling outsourcing.


Why do we need labeled data?

Creating an artificial intelligence-powered product is no easy feat – the road from inception to unveiling it to the public is paved with numerous challenges. Specialists from a multitude of fields bring their expertise on board to ensure that a downstream application can satisfy the end user and can therefore successfully compete in the marketplace. A typical AI product pipeline (i.e., the so-called "ML value chain") can be roughly divided into model-oriented and data-oriented parts. It normally looks something like this:

[Figure: the ML value chain pipeline]

Selecting the right model and "fine-tuning" it (i.e., improving it) is crucial to achieving success when you're preparing a downstream application (i.e., an AI product that serves a particular purpose). We've talked about "foundation models" that serve as the basis for model fine-tuning in our previous posts here (Natural Language Processing / NLP) and here (Computer Vision / CV). But it's also important not to adopt the "model-centric approach" exclusively and disregard data work, because no matter how good a machine learning model is (i.e., a training model acting as instructions for AI), it always needs training data to function – which is why training data is sometimes called "food for AI."

In fact, one could argue that good data is even more important than a machine learning model, because the pipeline displayed above consists of three stages that focus on data-related work and an additional three stages that focus on data-related work alongside model-related work. So, in effect all six stages of the ML value chain require high-quality data without exception. Those who understand this organize their business processes accordingly and ultimately put out a successful AI product free of biases and ethical issues. This is referred to as adopting the "data-centric approach" to AI, which we at Toloka consider the right way to move forward.

Data collection

Now that we've decided to proceed using the data-centric approach, the question is what do we do next, that is, who's doing what and how exactly? As our pipeline suggests, any machine learning project has to start with training data collection. A number of methodologies currently exist that allow AI product developers to obtain raw (i.e., unlabeled) data. Let's take a look at the pros and cons of each approach.

Using ready-made libraries and datasets

This is a methodology that some AI product makers utilize, and it can sometimes provide acceptable datasets. The main advantage of it is simplicity. You don't have to do anything from scratch – simply take something pre-made and ready. There are, however, two major drawbacks.

First of all, other product developers are likely to use the same data as you, which makes your data generic by definition. To rephrase that, you can't hope for an exclusive product if your data isn't exclusive. And secondly, you have to trust someone else as far as ensuring that the data is actually valid and up-to-date. If it's not, you're basically back to square one (and hopefully not when you're already too far in).
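To give a sense of how low the barrier to entry is with this option, here is a minimal sketch in Python using the Hugging Face datasets library; the "imdb" dataset is just an illustrative example, not a recommendation for any particular project.

```python
# A minimal sketch of reusing a ready-made dataset via the Hugging Face
# "datasets" library; the "imdb" dataset is only an illustrative example.
from datasets import load_dataset

dataset = load_dataset("imdb", split="train")  # downloads and caches a public dataset
print(dataset[0]["text"][:200])                # peek at the first raw sample
print(dataset.features)                        # inspect the schema you inherit as-is
```

The convenience is obvious, but note that you inherit the dataset exactly as it was published – the same data, the same schema, and the same potential staleness as everyone else using it.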

Crawling and scraping

This approach is about finding and extracting useful data samples available on the web. Here, too, AI product developers take something that's already there, without creating anything new. The advantage is that with this approach, you stand a better chance of obtaining a more unique dataset compared to the first option.
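As an illustration only, here is a minimal scraping sketch in Python using the requests and BeautifulSoup libraries; the URL and the fields being collected are hypothetical placeholders, and any real crawl should respect the site's terms of use and robots.txt.

```python
# A minimal scraping sketch using requests + BeautifulSoup. The URL is a
# placeholder; always check the site's terms of use and robots.txt first.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/catalog"          # hypothetical page to collect samples from
html = requests.get(url, timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Collect image URLs and their alt texts as raw (unlabeled) samples.
samples = [
    {"image_url": img.get("src"), "alt_text": img.get("alt", "")}
    for img in soup.find_all("img")
    if img.get("src")
]
print(f"Collected {len(samples)} raw samples")
```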

However, with this approach you also have a limited degree of control over the quality of your extracted data. And, what's worse, this data hasn't been "cleared," meaning that different parts of your dataset may contain copyrighted materials and sensitive personal information, which in some cases may result in serious repercussions, including lawsuits. For this reason, data scraping is banned in some countries.

Synthetic data generation

This option implies generating new data synthetically, that is, creating artificial data that resembles real data. The advantage is that those who choose this route can indeed get their hands on exclusive data. One of the drawbacks is that, depending on the particulars of your dataset requirements (i.e., what exactly you need and how much of it), quite a bit of computational power may be required – something not every product developer possesses.
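For a flavor of what simple synthetic generation looks like, here is a minimal sketch using scikit-learn's make_classification; the sample count, dimensionality, and class balance below are arbitrary illustrations, not a recipe.

```python
# A minimal sketch of synthetic data generation with scikit-learn.
# The shape and class balance below are illustrative, not a recipe.
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=10_000,    # how much data you need
    n_features=20,       # dimensionality of each synthetic sample
    n_informative=10,    # features that actually carry signal
    weights=[0.8, 0.2],  # mimic a class imbalance seen in the real task
    random_state=42,
)
print(X.shape, y.mean())  # (10000, 20) and the share of the positive class
```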

Another issue, and a more serious one potentially, is that this data may be substantially divorced from the real world. This might not be such an issue if you were designing an AI product, say, as a training practice. However, if you need a real AI product that's up to the challenge out there in the real world (i.e., a product that can meet the needs of end users), it must be trained on real-world data (or some data approximating it to a reasonable extent), which is seldom the case when using synthetic data generation.

Outsourcing to individuals or companies

This option is more about "who" rather than "how." Outsourcing means handing a task (in this case, data collection and labeling) over to data labeling service providers. This may be an individual or a group of individuals that you, as an AI developer, put together yourself (for example, through LinkedIn). Or it may be a ready team of individuals – in other words, a company – that can accept your data collection task as a turnkey challenge (i.e., from nothing to a ready set).

How exactly your outsourced team will get you the data may differ from team to team, but they're likely to use one of the methods we mentioned earlier. The advantage is that your outsourced data labeling team will probably know more about obtaining raw data than your own team, provided you have no prior experience. You also wouldn't need to worry about this stage of the pipeline, at least in theory.

The main disadvantage is that your company may incur significant expenses when using data annotation services. These costs are likely to be higher than with some of the other options, and at the end of the day it will still be your job to make sure that the quality of your new dataset is acceptable (which may or may not be the case).

Crowdsourcing

Crowdsourcing is a particular type of outsourcing that can be utilized throughout different stages of the ML value chain. This approach is becoming more and more popular as more AI developers acknowledge its time- and cost-effectiveness. Whereas regular outsourcing can be more expensive than other approaches and often takes more time, the price tag on crowdsourcing tasks is usually much more reasonable, and these tasks can be carried out in a matter of days or even hours.

This is possible due to what's known as "aggregation." Rather than having a few narrowly trained specialists tackle various data-related tasks one after another, crowdsourcing has a lot of people pulling on the digital rope at the same time, and their efforts are put together by data managers known as "data annotation specialists" or Crowd Solutions Architects (CSAs). The same subtask may be completed by several people, and the end result is an amalgam of their accumulated "best of" efforts, which serves to ensure dataset quality.
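As a simplified illustration of aggregation, here is a sketch of plain majority voting over overlapping labels. The item and contributor IDs are hypothetical, and real platforms typically use more sophisticated, skill-weighted aggregation, but the basic idea is the same.

```python
# A simplified sketch of aggregation: the same item is labeled by several
# contributors, and overlapping answers are merged by majority vote.
from collections import Counter, defaultdict

# Hypothetical raw annotations: (item_id, contributor_id, label)
annotations = [
    ("img_001", "w1", "cat"), ("img_001", "w2", "cat"), ("img_001", "w3", "dog"),
    ("img_002", "w1", "dog"), ("img_002", "w4", "dog"), ("img_002", "w5", "dog"),
]

labels_per_item = defaultdict(list)
for item_id, _, label in annotations:
    labels_per_item[item_id].append(label)

aggregated = {
    item_id: Counter(labels).most_common(1)[0][0]  # most frequent answer wins
    for item_id, labels in labels_per_item.items()
}
print(aggregated)  # {'img_001': 'cat', 'img_002': 'dog'}
```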

In the context of data collection, "spatial crowdsourcing" is the most ubiquitous type; it's also known as "field" or "feet-on-street" tasks. One of the major upsides of this approach is that completely new data is generated (as opposed to generic stock datasets), and this data comes from the real world (as opposed to synthetic options). In this scenario, "crowd contributors," as they're known, visit places and objects of interest in person and take photos (e.g., pets, cafes, or billboards), make videos (e.g., moving traffic), record sounds (e.g., voices), or write text (e.g., descriptions of floor plans) in real time, that is, "on location."

Data processing

Data processing, also known as data preparation, is the stage of the ML value chain during which collected data is prepared for labeling (not to be confused with "data preprocessing," a stage that's sometimes inserted into the pipeline immediately before machine learning model training). If you're interested in knowing more about data preparation, we recommend having a look at some other posts in this blog that address this stage in more detail. Suffice it to say, data preparation involves data cleaning and the removal of corrupted or faulty files, irregularities, duplicates, missing values, and other issues.
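In practice, the basic cleaning steps often boil down to a few lines of code. Here is a minimal sketch with pandas; the file name and column names are hypothetical placeholders.

```python
# A minimal data-cleaning sketch with pandas; the file name and column
# names are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("raw_samples.csv")

df = df.drop_duplicates()                  # remove exact duplicates
df = df.dropna(subset=["text", "source"])  # drop rows with missing key fields
df = df[df["text"].str.len() > 0]          # drop empty / degenerate records

df.to_csv("clean_samples.csv", index=False)
print(f"{len(df)} rows left after cleaning")
```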

It's also during this stage that machine learning engineers gauge their datasets to find the right balance between bias and variance (i.e., the "bias-variance trade-off") in order to end up with a model that's neither too tailored to its training data ("overfitting") nor too simplistic ("underfitting"). When things tilt too far in one of the unwanted directions, techniques like data augmentation can be used to even things out. This is done to have an optimally performing model.
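To make the idea of augmentation concrete, here is a tiny sketch with NumPy that flips an image horizontally and jitters its brightness; the random image stands in for a real photo, and real pipelines usually rely on dedicated augmentation libraries.

```python
# A tiny augmentation sketch with NumPy: flip an image horizontally and
# jitter its brightness to add variety to an underrepresented class.
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in for a real photo

flipped = image[:, ::-1, :]                                     # horizontal flip
jitter = rng.uniform(0.8, 1.2)                                  # random brightness factor
brightened = np.clip(image.astype(np.float32) * jitter, 0, 255).astype(np.uint8)

augmented_batch = [image, flipped, brightened]                  # 3x the original sample
```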

Data labeling process

When an AI developer has collected enough relevant data and this data has been cleaned up and augmented if necessary, it now has to be "annotated" or "labeled." This means the data basically has to be "explained" in a way that a machine will understand before an ML engineer can proceed with feeding it into the training model. Some foundation models make use of large quantities of unlabeled data; however, when an AI developer prepares a downstream application aiming to solve a particular user-oriented problem, a foundation model always has to be retrained (i.e., "fine-tuned") using labeled data.

Depending on what sort of downstream application is required, different types of data labeling may take place: for instance, annotating images by drawing outlines around objects and shapes (bounding boxes or polygons), transcribing speech from audio files, annotating video, providing titles or summaries for written texts, and so on.
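To show what the output of such labeling can look like, here is an illustrative, loosely COCO-like record for a single labeled image; the exact schema varies between tools and platforms, and all names and coordinates below are made up.

```python
# An illustrative (loosely COCO-like) record for one labeled image.
# The exact schema varies between tools and platforms.
import json

labeled_sample = {
    "image": "street_0042.jpg",
    "width": 1280,
    "height": 720,
    "annotations": [
        {"label": "car",        "bbox": [412, 305, 180, 95]},   # [x, y, width, height]
        {"label": "pedestrian", "bbox": [820, 290, 45, 130]},
    ],
}
print(json.dumps(labeled_sample, indent=2))
```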

There are three main types or rough categories of data labeling:

  • Human-in-the-loop labeling, i.e., manual labeling carried out by human annotators.
  • Synthetic labeling, i.e., data labeling carried out by machines.
  • Hybrid labeling, i.e., a format that combines elements of the first two types.

Both synthetic labeling and, to a lesser extent, hybrid labeling come with the same set of advantages and shortcomings as the synthetic data generation we've already discussed. These options are somewhat less of a "hassle," which is good, but at the same time they provide less labeling accuracy and little "real-worldness," while demanding a lot of computational power in return.

On the other hand, human-in-the-loop annotation can be divided into two major camps: in-house labeling (done by an internal team) and outsourcing data labeling (of which crowdsourcing is arguably the most effective type), which raises a question:

What to do: go for in-house labeling or outsource to data labeling services?

The main advantage of the in-house option is that it offers the greatest degree of control and data security – you have your own annotation team that handles all aspects of data labeling in the same space where your ML engineers prep their code. Consequently, as a project manager, you have a great deal of leverage.

However, there are also a number of major disadvantages to this approach. To start with, in-house labeling is a slow process, because you normally have a finite number of team members dedicated to this stage of the pipeline, who usually have other roles within the company and often have to learn data labeling from scratch. In contrast to crowdsourcing, there's no aggregation, which means each task component has to be tackled one at a time rather than in parallel.

Another issue is that in-house labeling is expensive. One reason is that time is money: as a general rule, a slow process costs more, and your data-labeling progress directly affects your product's time to market. Another reason is that an in-house team is all about "do it yourself," which means supplying your team with everything they may require, including annotation tools, subscriptions, and even software training if necessary.

Crowdsourced data labeling

With crowdsourcing, the situation is very different. Among its drawbacks is the fact that as an AI product developer, you have less control over the data-labeling stage of your pipeline compared to the in-house route (though close supervision during crowdsourced data annotation is allowed and even encouraged by platforms like Toloka).

On the plus side, your time to market is reduced substantially as the whole process takes a lot less time. As a result of this and the fact that you don't need to provide any training for your staff or purchase multiple software tools, the final bill for this stage of the pipeline also ends up being significantly lower.

At the same time, AI product developers gain access to crowd contributors with specific profiles and skill sets that often cannot be found elsewhere, let alone in the regular office setting of an in-house team based in one location. For instance, a particular dataset may require data labelers who speak a relatively uncommon African language, someone who can identify parts of a motorboat engine, or someone who can take photos and label street signs in Quebec. The power of the global crowd makes this possible, and few alternatives can step up to these challenges within the same timeframe or at the same cost.

All praise aside, this approach requires a great deal of attention and expertise on the part of the crowdsourcing platform. To ensure that a crowdsourcing task is carried out successfully, the following steps should be taken by those who offer this type of data labeling services:

Task decomposition

Larger tasks should be broken into more manageable pieces, with each one treated as a separate project task.
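As a minimal sketch of decomposition, the snippet below splits a large pool of items into small "task pages" that a contributor can finish in one sitting; the page size and file names are arbitrary examples.

```python
# A minimal decomposition sketch: split a large pool of items into small
# "task pages" that a contributor can finish in one sitting.
def split_into_pages(items, page_size=10):
    """Yield consecutive chunks of `page_size` items."""
    for start in range(0, len(items), page_size):
        yield items[start:start + page_size]

all_items = [f"img_{i:04d}.jpg" for i in range(1, 101)]   # 100 images to label
pages = list(split_into_pages(all_items, page_size=10))   # 10 pages of 10 tasks each
print(len(pages), len(pages[0]))
```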

Clear instructions

In order to avoid confusion, misunderstandings, or personal biases, detailed instructions with clear examples should be provided to data annotators. From our experience, the better the instructions, the more accurate the results.

Intuitive interface

The interface a platform uses (usually one it has built itself) should allow data annotators to carry out and submit labeling tasks in the simplest and fastest way possible. When building such an interface, user experience (UX) design is always taken into account, that is, how effectively data annotators can interact with it.

Quality control

Quality control tools should be configured and integrated into labeling projects to ensure high-quality results. They include mechanisms like CAPTCHAs (proceed only after a deciphered word has been entered), speed monitoring (accept a submission only when it's clear that a sufficient amount of time has been spent on it), and action checking (proceed only after certain actions, like clicking a link or scrolling down, have taken place).
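As a simplified illustration of one such rule, the sketch below applies a speed-monitoring check: submissions completed implausibly fast are flagged for review. The threshold and the submission records are hypothetical.

```python
# A simplified speed-monitoring rule: submissions completed implausibly
# fast are flagged for rejection or review. The threshold is arbitrary.
MIN_SECONDS_PER_PAGE = 15  # hypothetical minimum plausible time for one task page

submissions = [
    {"contributor": "w1", "page": "p1", "seconds_spent": 42},
    {"contributor": "w2", "page": "p1", "seconds_spent": 3},   # suspiciously fast
]

accepted = [s for s in submissions if s["seconds_spent"] >= MIN_SECONDS_PER_PAGE]
flagged = [s for s in submissions if s["seconds_spent"] < MIN_SECONDS_PER_PAGE]
print(f"accepted: {len(accepted)}, flagged: {len(flagged)}")
```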

Flexible pricing

The best possible price should be worked out that reflects a fair compromise between how much crowd contributors would like to be paid for a task and how much an AI application developer (a "requester") is willing to pay.

Verification of results

After final submissions by data annotators, the results should be aggregated and statistical tests should be run to ensure quality and accuracy.
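One common check of this kind is scoring each contributor against control ("golden") tasks with known answers and keeping only those above an accuracy threshold. The sketch below shows the idea with made-up items and an arbitrary threshold; production checks are typically more involved.

```python
# A simple verification sketch: score each contributor on control ("golden")
# tasks with known answers and keep only those above an accuracy threshold.
golden = {"img_010": "cat", "img_027": "dog", "img_033": "cat"}  # known answers

contributor_answers = {
    "w1": {"img_010": "cat", "img_027": "dog", "img_033": "cat"},
    "w2": {"img_010": "dog", "img_027": "dog", "img_033": "dog"},
}

THRESHOLD = 0.8
trusted = []
for worker, answers in contributor_answers.items():
    correct = sum(answers.get(item) == label for item, label in golden.items())
    accuracy = correct / len(golden)
    if accuracy >= THRESHOLD:
        trusted.append(worker)
print(trusted)  # ['w1']
```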

Model evaluation before and after deployment

AI product developers evaluate the performance (i.e., the predictions) of their machine learning models both before deployment (aka post-training or initial model evaluation) and after deployment, in production (aka model monitoring). Since all AI downstream applications are made for end users who are real people, human-in-the-loop evaluation is considered the industry standard. For this reason, crowdsourcing for the purposes of ML model evaluation has become one of the most sought-after methodologies.

It works much the same as crowdsourced data labeling, except that the goal here is not to get the data ready for model fine-tuning, but rather to see whether the now trained model can perform well when encountering new and previously unseen data. Two routes can be taken:

Straight-up evaluation

Data annotators are shown predictions (i.e., responses) made by the model, and they have to rate the model's precision and accuracy. For example, a model for Computer Vision may be required to name colors of different objects. The job of the annotators who are evaluating this model would be to say whether the model's labels corresponding to differently colored objects are correct, that is, whether this really is violet, this really is beige, and so on. This will provide enough information for ML engineers to understand how well their downstream application is doing.
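In the simplest case, the annotators' verdicts are rolled up into a single accuracy figure, as in the sketch below; the items, predicted colors, and verdicts are invented for illustration.

```python
# A sketch of straight-up evaluation: annotators mark each model prediction
# as correct or not, and the verdicts are rolled up into an accuracy score.
verdicts = [
    {"item": "obj_01", "model_said": "violet", "annotator_verdict": "correct"},
    {"item": "obj_02", "model_said": "beige",  "annotator_verdict": "correct"},
    {"item": "obj_03", "model_said": "teal",   "annotator_verdict": "incorrect"},
]

accuracy = sum(v["annotator_verdict"] == "correct" for v in verdicts) / len(verdicts)
print(f"Human-judged accuracy: {accuracy:.2f}")  # 0.67
```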

Evaluation with fine-tuning

This is a more elaborate version of the first option that entails two stages. The first stage is to give human annotators new data – the same data that the model encountered post-training. Following our previous example, the annotators respond to this data by assigning color names to differently colored objects. The second stage is to compare their responses to the model's responses (for example, using a pairwise or side-by-side comparison).

The downside of this second option is that it takes a bit more time, but the greatest benefit is that it allows AI application developers not just to gauge their model's performance – they can also fine-tune their ML model if necessary, because they now have an additional labeled dataset with high-quality responses provided by human annotators (i.e., "golden sets" or "honeypots").
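Here is a minimal sketch of that second stage under the same color-naming example: the model's answers are compared with the human reference labels item by item, agreement gives a quality score, and the disagreements (with the human label attached) become candidates for further fine-tuning. All labels below are made up.

```python
# A sketch of the second stage: compare model responses with the human
# reference labels item by item. Agreement gives a quality score, and the
# disagreements (with the human label attached) can feed further fine-tuning.
human_labels = {"obj_01": "violet", "obj_02": "beige", "obj_03": "navy"}
model_labels = {"obj_01": "violet", "obj_02": "beige", "obj_03": "teal"}

matches = {k: human_labels[k] == model_labels[k] for k in human_labels}
agreement = sum(matches.values()) / len(matches)
fine_tune_set = [
    {"item": k, "label": human_labels[k]}
    for k, ok in matches.items() if not ok
]
print(f"agreement: {agreement:.2f}, new labeled items: {len(fine_tune_set)}")
```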

Bottom line

A number of approaches that tackle the data-oriented parts of the ML value chain are utilized by AI product developers today. While the in-house route gives project managers more control, generally this approach is not time- and cost-effective. Outsourcing offers a viable alternative, with crowdsourcing being considered by many the fastest, the most affordable, and often the most reliable of all outsourcing options.

Crowdsourcing can be utilized during data collection (spatial crowdsourcing, feet-on-street, or field tasks), during data labeling for a variety of domains and specific applications (e.g., NLP or CV), and during all types of performance evaluation – be it post-training evaluation or model monitoring in production. In addition, model evaluation through crowdsourcing can assist with further model fine-tuning whenever necessary.

Importantly, crowdsourcing also allows ML practitioners, data scientists, and social science researchers to access a hard-to-reach demographic all over the world that remains largely inaccessible via any other means.

Article written by:
Natalie Kudan