
Natalie Kudan

Mar 6, 2023


Essential ML Guide


A guide to data annotation services

If you have heard of data annotation (or labeling) services and want to know more about how they might help you, your company, or your industry, this article is for you. It answers three questions: what these services are (with example use cases), who can benefit from them, and how.

What are data annotation services?

"Annotation" is the process of adding comments, notes, or explanations to something, for example, images or text. Data annotation is the process of labeling data, be it text, images, videos, audio, or any other format, so that computers can understand and interpret that data.

Data annotation companies provide these services by taking on large quantities of a client's data and annotating it. The process can include labeling, categorizing unstructured data, correcting, verifying, and more. Check out our article on data labeling to learn more.

Who uses data labeling services and why?

Data annotation is essential for anyone developing artificial intelligence and machine learning algorithms that rely on supervised learning: companies, researchers, ML engineers, and more. Products and tools powered by machine learning are only as good as the data they are trained on. So when a company wants to improve its ML model, it will naturally look to its training data. Whether that means sourcing new data or improving the quality of existing data, many organizations are ready to outsource this important but potentially resource-intensive task to a data annotation company.

And nowadays, it's not just large, cutting-edge tech companies who require this service. Machine learning is already running in the background of so many elements of modern life.

Data annotation is especially popular with certain industries, such as:

  • Retail and e-commerce.

  • Social science research.

  • Product & marketing research.

  • Healthcare and automation of medical data processing.

Mainstream sectors like retail and marketing use data annotation to attract new and repeat customers. They may outsource the acquisition of crucial data to a data annotation company and use it to improve their online search relevance or recommender systems ("people who bought this product also liked these products").

How do data annotation services work?

There are three main approaches to fulfilling data labeling needs: companies can label data with their in-house team (their own employees), outsource it to a company that provides data collection and/or annotation, or use a data annotation platform.

The latter approach is known as crowdsourcing, and for it to work, you need three key components:

Requesters – those who need to collect high-quality data. To get high-quality labels, the requester needs to create labeling tasks that are easy to understand and complete, which involves breaking them down into more manageable parts and writing clear instructions. This can be a difficult job to do, but that's what platforms like Toloka are for – our data annotation experts can help requesters build data annotation pipelines that work.

Annotators – large numbers of individuals, from across the globe, who will annotate and label data for a small reward. Since the tasks are decomposed into smaller parts, even people without any particular qualification or skill set can complete them successfully. That being said, many data annotation tasks filter who can complete them, so that only annotators with a certain education, language, location, device, age, or gender can complete that specific task.

Platform – a service like Toloka that brings requesters and annotators together.

What is Toloka?

Toloka provides a unified environment to support fast and scalable AI/ML development. Among many other things, Toloka uses crowdsourcing and microtasking to provide data labeling and annotation services, at scale.

Toloka's data labeling platform can be used by businesses and researchers to collect or label data to optimize machine learning models (used in anything from voice recognition to autonomous vehicles and search algorithms). The quality and size of a dataset is crucial to maximize the accuracy of a machine learning model.

Data labeling tasks

We can break down the kinds of data annotation services available on Toloka's platform into the following data labeling subgroups:

  • Image annotation

  • Video annotation

  • Audio annotation

  • Text annotation

  • Data enrichment

  • Surveys

  • Field tasks

Before annotators, or Tolokers as we call them, can begin their work, the requester needs to create a task for them: set up the task interface (what the task will look like) and provide instructions and examples for the Tolokers to follow. The requester can then specify attributes they require of their Tolokers – such as location or language – to avoid biased results and improve data diversity. The requester can also add quality control methods, such as golden sets or post-verification, to ensure the data they receive meets the standard they require.
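To illustrate the golden-set idea, here is a minimal sketch (not Toloka's actual implementation; the task IDs, labels, and accuracy threshold are invented for the example). Control tasks with known answers are mixed into the pool, and annotators whose accuracy on those tasks falls below a threshold are filtered out:

```python
# Minimal sketch of golden-set quality control: compare one annotator's
# answers on control tasks (with known "golden" labels) to a threshold.
golden = {"task_1": "cat", "task_2": "dog", "task_3": "cat"}  # known answers

def passes_golden_check(answers, golden_labels, min_accuracy=0.8):
    """answers: dict of task_id -> label given by one annotator."""
    checked = [t for t in answers if t in golden_labels]
    if not checked:
        return True  # annotator hasn't seen any control tasks yet
    correct = sum(answers[t] == golden_labels[t] for t in checked)
    return correct / len(checked) >= min_accuracy

annotator_answers = {"task_1": "cat", "task_2": "dog", "task_3": "dog"}
print(passes_golden_check(annotator_answers, golden))  # 2/3 < 0.8 -> False
```

In practice a platform would apply such a rule continuously and combine it with other signals (response time, post-verification), but the core mechanism is this accuracy check against known answers.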

Now we will discuss in more detail what sort of tasks Tolokers are asked to complete, and what those tasks look like.

I am not a robot

These first couple of example tasks we're picking out are ones you may already be familiar with:

  • Recognize text in an image.

  • Recognize objects in a photo.

The data here are images. Object detection in an image can look like a multiple-choice array of images, where the Toloker needs to select the images with a certain object in them. Many websites ask you to complete similar tasks to "prove you're not a robot". Such labeled data is frequently used in computer vision models.

Another kind of object recognition task is one where Tolokers see a single image at a time and need to identify and outline specific objects within that image (there is an editor for them to draw an outline). Requesters can then take all the annotated images and use this dataset for training computer vision software.

Data labeling task examples

Below we cover some more common tasks from the main data labeling subgroups; with most of them, it's easy to see why the resulting data is useful to the industries involved:

Image annotation

A "Product search relevance" task presents Tolokers with an image, a box for the search query, and radio buttons – for example, "relevant" or "not relevant". This type of data annotation will help to improve ML algorithms behind e-commerce search page results by rating how relevant products are to specific search queries.

In an "Image comparison (side-by-side)" task, Tolokers are asked to compare two images shown side by side and select one of the possible response options. Requesters might use this type of image annotation service when they need to collect opinions from large groups of people, for example to:

  • Improve user experience by understanding which design users like best.

  • Test out which images are more impactful for targeted ads.

  • Decide on the best images for various kinds of publications.

Other popular tasks include image classification, selecting objects with bounding boxes, polygons, or key points, and so on.

Video annotation

"Video classification" is the type of data annotation task where Tolokers are given a video player and a few different response options. Tolokers must watch the video clip and select from the available response options. Requesters use this kind of task to get training data for several purposes:

  • Sorting video clips into categories.

  • Moderating content.

  • Detecting imperfections in the video – both audio and visual.

  • Rating video clips based on levels of enjoyment or other measures.

A "Side-by-side" comparison of two videos could be used to help editors or marketing specialists decide which is the best opening frame for a video, or to help decide which video looks more realistic. Of course, it can also be used to train ML models to make decisions like that on their own.

Text annotation

A "Generating product descriptions" task can help e-commerce companies to quickly write many accurate product descriptions. For this task, the Toloker is presented with an image of the product and the product name and asked to come up with a product description to put in the text input area.

Other text-based tasks focus on "Sentiment analysis and content moderation". These tasks provide Tolokers with a text and several response options. Tolokers read the text and select from the available options (they might then add specifics via an additional question with more checkboxes). The requesters' aims for this task can be varied. They might want their ML model to learn to:

  • Moderate comments and nicknames on social media.

  • Check product reviews in a store, ads on a site, or posts on social networks.

  • Check for the presence of a brand or company name.

There are various end uses for the datasets created: improving user/customer experience; complying with regulations; conducting brand and consumer research – to name a few. There are other popular types of text annotation used for natural language processing, for example text classification, semantic segmentation, named entity recognition, and more.

Audio annotation

"Transcribing audio recordings" tasks provide the data needed to improve speech recognition models. Tolokers completing these tasks see only an audio player and a text input area: they listen to the recording and write down what they hear. The same Tolokers might transcribe many different recordings, or many Tolokers might transcribe the same recording. Either way, the mass of transcriptions produced is vital annotated data from which machines can "learn" to accurately interpret speech – meaning fewer mistakes made by voice recognition software and better communication in numerous fields.

This is an example of the kind of task where a requester should specify a certain attribute for the Tolokers, like "language" (education, skillset, and device may also be relevant depending on the content and format of the audio recording).
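When many Tolokers transcribe the same recording, their answers need to be aggregated into one result. A very simple approach (an illustrative sketch, not Toloka's production aggregation method) is to normalize the transcriptions and keep the most common one along with its level of support:

```python
from collections import Counter

def aggregate_transcriptions(transcriptions):
    """Pick the most frequent transcription after light normalization
    (lowercasing and collapsing whitespace). Returns (answer, support)."""
    normalized = [" ".join(t.lower().split()) for t in transcriptions]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)

answers = ["Hello world", "hello  world", "hello word"]
print(aggregate_transcriptions(answers))  # majority: 'hello world', 2/3 support
```

Real systems typically go further – weighting votes by each annotator's track record, or comparing transcriptions by edit distance rather than exact match – but majority voting is the baseline these methods build on.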

Other kinds of audio tasks include:

"Audio classification" – where Tolokers are again given an audio player but, in place of a free-type input area, there is a choice of given answers, from which they must choose.

"Voice recording" – where Tolokers are given text to read aloud and a voice recorder button. Datasets produced this way can be useful for training text-to-speech software, such as that used in translation apps.

Field tasks

Field tasks send our most intrepid Tolokers out on a mini mission. For example, a "Spatial Crowdsourcing" task (completed in the Toloka mobile app) requires Tolokers to select a point on the map, go to the location, take photos and/or write a comment.

The aim of a field task might not be connected to machine learning, but instead to improving processes and quality assurance. For example, an urban field task might ask Tolokers to go to the entrances of metro stations, take a photo, and assess each station's cleanliness. Several potential requesters would find this data useful – digital maps, transport companies, city councils, and city tourism boards.

Data enrichment

Another popular task among e-commerce requesters is a "Product photo search". Tolokers might be asked to search for product photos online or to search for instances of a fashion brand's logo. This task gives Tolokers links to search online, a product description, and an upload area.

The dataset produced can be used in brand awareness research, which informs the marketing activities behind any big brand.

The benefits of high-quality training data

Most of the requesters using Toloka are working with machine learning models, which either enhance their business and product or may form the whole basis for their product. Today's world has countless uses for machine learning – from image and speech recognition to weather modeling and medical diagnoses, and even self-driving cars.

Machine learning models are only as good as the data they learn from. The more data you have, and the better quality that data is, the more accurate your ML model will be. Better data leads to better machine learning, which leads to better real-world practical and financial results.

Starting off with high quality human-labeled training data ultimately means search algorithms are quicker and more accurate; it means that speech recognition software makes fewer mistakes; and your self-driving car is safer.

Getting humans to process and label large quantities of data can be incredibly expensive and time-consuming. Businesses, data scientists, and other practitioners may end up sacrificing quality and accuracy simply to avoid such costs.

This is why outsourcing to an efficient data annotation company matters: it makes human-labeled data accessible in large quantities at a viable cost.

Can data annotation services help you?

We hope this overview of the data annotation services offered by Toloka has given you a clearer understanding of what a data labeling service is, along with some ideas about its potential uses for you and your industry.


