How to annotate language data

Natalie Kudan

From chatbots to virtual assistants, language data annotation plays a key role in our daily interactions and is critical to developing intelligent human language technologies. Without human annotators, computers can only do so much. Human language is incredibly varied and complex. We express ourselves in countless ways (verbally and in written form) in different languages, dialects, and accents. It’s easy to see why companies continually turn to annotators to ensure the accuracy and quality of their training data.

So, if you’re looking to build specialized language corpora (training and test datasets) for a natural language app or simply wish to learn more about annotating language data, this article is for you. We cover a variety of topics under this umbrella including types of language data annotation tasks, real-life natural language processing applications powered by machine learning, and common techniques and processes.

But before we jump straight in, let’s start off with some basics. Language annotation refers to the process of labeling data in different formats (text, video, and audio) so that it can be used for machine learning. Annotators add metadata or notes to the data so that natural language processing and other language-based AI models can interpret the sentence or document as a whole. Now that you’re somewhat familiar with this term, let’s get into the details.

What is natural language processing in machine learning

Language- and speech-based AI models rely on natural language processing (NLP). Under the umbrella of AI, NLP can be defined as the merging of language studies, computer science, and computational linguistics, which together form the building blocks of applications that facilitate human interactions with machines through natural language. In short, NLP annotation aims to fill the gap between human communication and computer understanding by helping computers comprehend, interpret, and manipulate human language. NLP spans several application areas that depend on annotated data, including question answering systems, summarization, machine translation, automatic speech recognition, and document classification.

Question answering systems (QAS)

Voice-controlled personal assistants like Siri are a good starting point for question answering systems, but they still can’t fully understand natural language, only certain phrases here and there. As the technology advances, you should be able to ask digital devices any question in any language and get a valid response, such as what time the new movie you want to see is showing in your town.


Summarization

Imagine having an app on your device that can analyze a group of documents, extract the main ideas, and produce a concise summary of their content. While it’s at it, it could even build a PowerPoint presentation for you to share at your next meeting. That’s what summarization does. Oh, the power of technology!

Machine translation

The foundation, basic building blocks, holy grail…however you want to define it, machine translation was one of the first major areas of NLP research and engineering. Translation programs such as Google Translate keep improving with time, and eventually they’ll be able to translate for you in real time, say when you’re at the airport waiting to catch your next international flight.

Speech recognition

An area of contention to be sure, yet there have been a lot of advances in developing models that can recognize questions and commands. However, most of them are limited to narrow domains. (You’ve probably encountered those annoying automated answering systems where the robotic voice on the other end of the line puts you through a series of aggravating questions before you can speak to an actual person. And should you stray from the predetermined script? Look out!) The good news is, with technological advances, these models are sure to improve.

Document classification

A jewel in the treasure trove of NLP, document classification aims to identify in which category a document should be placed. Think spam filters or movie reviews. Things that make your life easier! This is definitely one of the many perks of NLP and artificial intelligence development.

How is language data used in machine learning

The internet is an excellent example of why we need labeled data to train ML models for natural language processing. Ever wonder how all those articles, blogs, forums, and social posts are being communicated? The Web hosts a multitude of media including text, images, video, and audio recordings — and language is what allows you to understand the content of each of these channels. While computers are adept at conveying this information to you, they aren’t so great at understanding the language aspect.

Theoretical and computational linguistics work together to decode the innate nature of language and capture the computational elements of linguistic structures. Human language technologies aim to transform these insights and algorithms into programs that can improve the way we interact with devices through language. With the tremendous amount of data available nowadays, linguistic annotation and modeling issues are seen as machine learning tasks.

Nevertheless, a computer can’t just be given a large amount of data and be expected to learn how to respond. Providing machine learning algorithms with poorly selected training data can slow them down and lead to mistakes or erroneous outcomes. Annotators play a critical role in accurately prepping the data for the relevant task at hand, such as adding metadata tags (labels), so that the computer can more easily identify patterns and inferences. Otherwise, things could go haywire. Next, we’ll investigate how to go about annotating language data for machine learning.

How to annotate text or audio data for machine learning

Typically, those who annotate language data fall into one of two categories: full-time specialists or freelancers who complete micro tasks via crowdsourcing platforms. Either way, their goal is the same: to ensure that the target reader (an AI model or machine learning model) can better understand key segments of the data. These individuals carefully annotate the language data according to its meaning and context by adding labels and metadata when needed. When done on a large scale, text annotation requires a heavy lift in terms of managing human capital. The good news is that crowdsourcing provides a convenient alternative by delegating labeling tasks to individuals all around the world. Once you sign up on a crowdsourcing platform like Toloka, you can choose from numerous tasks and earn money.

If you’re the specialist working behind the scenes to set up the data labeling task, make sure that it’s appropriately formulated so that thousands of freelancers can complete it correctly within a matter of hours. It’s important to think strategically about what you’re hoping to achieve and what information is most applicable. Be sure to also correctly configure the data annotation pipeline and quality control steps so that you can scale text annotation and get large volumes of marked-up data quickly and conveniently, saving you time and money.
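As a rough illustration of one quality control step, the sketch below aggregates overlapping labels from several annotators by majority vote. The article does not say which aggregation method a given platform uses, so treat this as an assumption; the item IDs and labels are made up for the example.

```python
from collections import Counter

def majority_vote(labels):
    """Return (winning_label, agreement_share) for one item's labels.

    Ties go to the label that appears first in the list.
    """
    counts = Counter(labels)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(labels)

# Hypothetical crowd answers: item id -> labels from three annotators.
crowd_answers = {
    "review_1": ["positive", "positive", "negative"],
    "review_2": ["negative", "negative", "negative"],
}

aggregated = {
    item: majority_vote(labels) for item, labels in crowd_answers.items()
}

for item, (label, agreement) in aggregated.items():
    print(item, label, round(agreement, 2))
```

Items with a low agreement share are natural candidates for extra review or for sending to more annotators.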

Platforms like Toloka offer a wide range of benefits when it comes to annotating language data, two of which include controlling data labeling accuracy and building a reliable pipeline. Our platform supports a wide variety of data labeling tasks, including search relevance, text classification, sentiment analysis, utterance collection, audio data collection, transcription, and more.

Types of language data annotation tasks

There are a number of different types of text annotations used in machine learning. Here are just a few of the most common ones.

Entity annotation

This process is key to training NLP models like those used to develop chatbots. In short, entity annotation involves locating, extracting, and tagging entities (or individual words or phrases) such as names and keywords within text. Annotators help locate and assign predetermined labels to the relevant entities. Entity annotation combined with entity linking facilitates an enhanced learning process for NLP models.

Examples: named entity recognition (as in proper names), key phrase (or keyword) tagging in text data, and part-of-speech tagging (Remember your grammar lessons at school? Think verbs, nouns, adjectives, and more.)
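One common way to store entity annotations is as character spans with a label. The sketch below assumes that format; the sentence, the label set (PERSON, LOCATION, ORG), and the helper name make_span are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class EntitySpan:
    start: int  # index of the first character of the entity
    end: int    # index one past the last character
    label: str  # one of the predetermined labels

def make_span(text, phrase, label):
    """Build a span for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return EntitySpan(start, start + len(phrase), label)

text = "Sarah moved to Paris to work for Acme."

# What an annotator's output might look like for this sentence.
annotations = [
    make_span(text, "Sarah", "PERSON"),
    make_span(text, "Paris", "LOCATION"),
    make_span(text, "Acme", "ORG"),
]

for span in annotations:
    print(text[span.start:span.end], span.label)
```

Storing offsets rather than the words themselves keeps the original text untouched and lets several annotation layers point at the same document.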

Entity linking

Once entities within a specific text have been located and annotated, entity linking connects those entities to larger data repositories. In such a task, an annotator assigns a specific identity to an entity from the text, for example, a company name, or a location. Used to create a better user experience and improve search results, entity linking involves assigning annotators to link labeled entities to a URL that hosts more info about that specific entity.

Examples: end-to-end entity linking (named entity recognition combined with entity disambiguation to clear things up!), entity disambiguation (as mentioned, to distinguish between entities and gain clarity by linking entities to knowledge databases).
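To make entity disambiguation a little more concrete, here is a toy sketch that links a mention to a knowledge-base URL by counting keyword overlap with the surrounding text. The knowledge base, its URLs (example.org), and the overlap heuristic are assumptions made for this example; in a real project, human annotators or a trained model would make this choice.

```python
import re

# Hypothetical mini knowledge base: surface form -> candidate entries.
KNOWLEDGE_BASE = {
    "Paris": [
        {
            "url": "https://example.org/kb/paris_france",
            "keywords": {"france", "seine", "louvre"},
        },
        {
            "url": "https://example.org/kb/paris_texas",
            "keywords": {"texas", "usa"},
        },
    ],
}

def link_entity(mention, context):
    """Return the URL of the candidate whose keywords best match the context.

    Returns None when the mention is not in the knowledge base.
    Ties go to the first candidate listed.
    """
    candidates = KNOWLEDGE_BASE.get(mention, [])
    if not candidates:
        return None
    context_words = set(re.findall(r"\w+", context.lower()))
    best = max(candidates, key=lambda c: len(c["keywords"] & context_words))
    return best["url"]

print(link_entity("Paris", "We visited the Louvre in Paris, France."))
print(link_entity("Paris", "Paris is a small town in Texas."))
```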

Text classification/categorization

This is a rather broad category with a lot of subsets. In essence, text classification involves adding a single label to an entire body or line of text. Annotators read and analyze an assigned text to determine the main topic, context, and sentiment. Then they classify the text according to a list of predetermined categories.

Examples: document classification (used to help sort or recall content), product categorization (mostly self-explanatory: products or services are sorted into categories to improve search results and user experience, and e-commerce sites use this a lot!), sentiment annotation (classifying text based on the emotions or opinions presented).

Sentiment annotation/analysis

This one’s all about emotional intelligence. Heard of that term before? It’s pretty multilayered. No wonder it’s one of the most challenging parts of machine learning. It can even be hard for humans to dissect! (You know those text messages where you’re left wondering what the other person actually meant and trying to read between the lines. Was it sarcasm, a joke, or something else?) So, you can bet it’s practically impossible for a machine to figure out. Well, sentiment annotation tries to address this dilemma. AI models are trained with sentiment-annotated text data provided by annotators who label emotion, opinion, or sentiment within a body of text such as a social media post.

Example: As a real-world use case, sentiment analysis can help companies improve their products and strategies by staying abreast of consumer reviews and feedback.

Linguistic/corpus annotation

What is a corpus in NLP? It’s a collection of text or audio organized into datasets. The process of labeling a corpus involves tagging language data in these texts or audio recordings; annotators identify and flag grammatical, semantic, or phonetic elements in this data. Linguistic annotation can be used to develop AI training datasets for a wide range of NLP solutions like chatbots, search engines, translation apps, and more.

Examples: discourse annotation (as in “Sarah got a promotion. She was proud of it.”, where “She” is linked back to “Sarah” and “it” to “a promotion”), part-of-speech tagging (which we talked about earlier), phonetic annotation (labeling intonation, stress, and natural pauses), and semantic annotation (labeling the meanings of words and phrases).

Applications of machine learning-powered NLP

These days, computers seem to be able to perform an unlimited number of tasks. However, one area that they still haven’t mastered is NLP. Without human annotators an AI model can’t gain a deeper understanding of the natural flow of language — and there’s no end to the variety of text annotations that can be applied. Annotators play a vital role in checking the accuracy and quality of the annotated text used for training models. NLP-based AI encompasses voice assistants, automatic translation, chatbots, and search engines.

Lately, there’s been a shift toward using neural networks for NLP, including the application of word embeddings to capture the semantics of words and end-to-end learning for higher-level tasks like answering questions. In the section below, we cover several examples of the most common NLP tasks. Keep reading to learn more.

Common NLP techniques and processes

Given that the majority of data is unstructured — and text based — we advise you to familiarize yourself with basic NLP techniques and processes in order to derive valuable insights from text data.

What is a text corpus

A corpus (or text corpus) is a large, structured collection of machine-readable texts or documents that have been produced in a natural communicative setting. Getting into the weeds, here’s a breakdown of its structure: a corpus comprises documents, documents contain paragraphs, paragraphs encompass sentences, and sentences hold smaller units known as “tokens”.
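The corpus, document, paragraph, sentence, and token hierarchy can be sketched as nested lists. The splitting rules below (blank lines between paragraphs, sentence breaks after ".", "!" or "?") are deliberately naive assumptions, and the two sample documents are invented.

```python
import re

# A tiny made-up corpus: document id -> raw text.
corpus = {
    "doc_1": "Sarah got a promotion. She was proud of it.\n\nHer team celebrated.",
    "doc_2": "Spam filters sort email.",
}

def split_document(text):
    """Return the document as paragraphs -> sentences -> tokens."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    structured = []
    for paragraph in paragraphs:
        # Naive sentence split: break on whitespace that follows . ! or ?
        sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        # Tokens are runs of word characters or single punctuation marks.
        structured.append([re.findall(r"\w+|[^\w\s]", s) for s in sentences])
    return structured

structured_corpus = {
    doc_id: split_document(text) for doc_id, text in corpus.items()
}

for doc_id, paragraphs in structured_corpus.items():
    for p_index, sentences in enumerate(paragraphs):
        for sentence_tokens in sentences:
            print(doc_id, p_index, sentence_tokens)
```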

Tokens and tokenization

Tokens can be words, phrases, n-grams, or symbols. Tokenization refers to the process of dividing unprocessed text into these smaller units. The tokens can then be mapped to numbers and fed into an NLP model. Two common approaches are “whitespace tokenization” (where the text is split into words wherever whitespace occurs) and “regular expression tokenization” (where a regular expression pattern is used to extract tokens).
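Here is a minimal sketch of both approaches using only Python’s standard library, followed by a simple mapping from tokens to numbers. The sample sentence, the regular expression, and the first-seen ID scheme are choices made for this example rather than a standard.

```python
import re

text = "To be, or not to be!"

# Whitespace tokenization: punctuation stays attached to the words.
whitespace_tokens = text.split()

# Regular expression tokenization: words and punctuation become
# separate tokens (lowercased first so "To" and "to" match).
regex_tokens = re.findall(r"\w+|[^\w\s]", text.lower())

# Map each distinct token to an integer ID in order of first appearance.
vocab = {}
for token in regex_tokens:
    vocab.setdefault(token, len(vocab))
token_ids = [vocab[token] for token in regex_tokens]

print(whitespace_tokens)  # ['To', 'be,', 'or', 'not', 'to', 'be!']
print(regex_tokens)       # ['to', 'be', ',', 'or', 'not', 'to', 'be', '!']
print(token_ids)          # [0, 1, 2, 3, 4, 0, 1, 5]
```

Note how the whitespace version treats “be,” and “be!” as different tokens, while the regular expression version recognizes both as “be”.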

Lemmatization

A systematic process for reducing the inflected forms of a word to a single base form, lemmatization draws on vocabulary, word structure, part-of-speech tags, and grammar. The output is a lemma, a.k.a. the dictionary form of the word. Annotators specify the relevant part-of-speech tag for a given word, and lemmatization only works correctly if the word has the right part-of-speech tag assigned to it. A simple way to find the lemma is by looking the word up in a dictionary, but a rule-based system may be needed for more complex cases.
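A toy sketch of dictionary-based lemmatization with a rule-based fallback might look like the following. The lemma dictionary, the tag names (VERB, NOUN, ADJ), and the single plural rule are all illustrative assumptions; a production lemmatizer would use a far larger lexicon and rule set.

```python
# Toy lemma dictionary keyed by (word, part-of-speech tag).
LEMMA_DICT = {
    ("was", "VERB"): "be",
    ("better", "ADJ"): "good",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
    ("mice", "NOUN"): "mouse",
}

def lemmatize(word, pos):
    """Look the word up by its tag; fall back to one simple plural rule."""
    word = word.lower()
    lemma = LEMMA_DICT.get((word, pos))
    if lemma is not None:
        return lemma
    # Fallback rule for regular plural nouns, e.g. "cats" -> "cat".
    if (pos == "NOUN" and len(word) > 3
            and word.endswith("s") and not word.endswith("ss")):
        return word[:-1]
    return word

# The same surface form gets a different lemma depending on its tag.
print(lemmatize("meeting", "VERB"))
print(lemmatize("meeting", "NOUN"))
print(lemmatize("cats", "NOUN"))
```

The “meeting” example shows why the part-of-speech tag matters: without it, the lookup cannot tell the verb from the noun.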

Stemming

Stemming is a simple, rule-based process for stripping inflections from a token; the output serves as the “stem” or “root” of the word. However, stemming can sometimes produce made-up or incomplete words, and many search engines treat words with the same stem as synonyms, so keep your eyes peeled for anything that looks off!
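The following toy suffix stripper shows the idea, including the made-up and incomplete stems mentioned above. It is not the Porter stemmer or any other published algorithm; the suffix list and the minimum stem length of three characters are assumptions chosen for the example.

```python
# (suffix, replacement) pairs, checked in order; longer suffixes first.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")]

def stem(token):
    """Strip the first matching suffix if at least 3 characters remain."""
    token = token.lower()
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)] + replacement
    return token

for word in ["connected", "connecting", "connects", "studies", "running"]:
    print(word, "->", stem(word))
```

The three forms of “connect” collapse to one stem, while “studies” becomes the made-up “studi” and “running” becomes the incomplete “runn”.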

Part-of-speech tagging

As mentioned earlier, part-of-speech tags describe the grammatical role each word plays in a sentence and how it relates to the words around it; machine learning models learn from these tags. Annotators mark up each word in a text (corpus) with a particular part of speech according to its definition and context. The tags capture the word’s context, function, and usage, so every word in a sentence ends up with its own part-of-speech tag.
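As a sketch of how annotated tags can feed a model, the example below stores sentences as (word, tag) pairs and builds a most-frequent-tag baseline from them. The sentences, the tag names, and the NOUN default for unseen words are assumptions for illustration; real taggers also take the surrounding context into account.

```python
from collections import Counter, defaultdict

# Hypothetical annotator output: each sentence is a list of (word, tag).
annotated_sentences = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("book", "NOUN"), ("is", "VERB"), ("good", "ADJ")],
    [("they", "PRON"), ("book", "VERB"), ("flights", "NOUN")],
    [("a", "DET"), ("good", "ADJ"), ("book", "NOUN")],
]

# Count how often each word received each tag.
tag_counts = defaultdict(Counter)
for sentence in annotated_sentences:
    for word, pos in sentence:
        tag_counts[word][pos] += 1

def tag_tokens(tokens, default="NOUN"):
    """Give each token its most frequent annotated tag, else the default."""
    tagged = []
    for token in tokens:
        key = token.lower()
        if key in tag_counts:
            pos = tag_counts[key].most_common(1)[0][0]
        else:
            pos = default
        tagged.append((token, pos))
    return tagged

print(tag_tokens(["The", "book", "barks"]))
```

Because this baseline ignores context, it would tag “book” as a noun even in “they book flights”, which is exactly the kind of case where context-aware annotation earns its keep.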

But remember, this is just the beginning. There’s a lot more information out there on these techniques and processes, so keep reading and expanding your knowledge base — continuing with our Toloka blog where we house lots of useful information and timely insights.

As we wrap up…

Here are some key takeaways to reflect upon: NLP helps computers communicate with people in their native language and scales other language-related tasks. For instance, NLP allows computers to read text, hear speech, interpret its meaning, measure sentiment, and extract the main ideas.

Keeping in mind the massive amount of unstructured data that’s produced on a daily basis from medical records to social posts, automation is the next step in holistically evaluating text and speech data. However, there’s still a need for syntactic and semantic understanding as well as domain knowledge that only human annotators and NLP can support — in addition to training and quality control of machine learning models. Namely, NLP helps to address ambiguity in language and adds effective numeric structure to various applications such as speech recognition or text analytics.

To learn more about how Toloka can help you annotate language data for machine learning, visit our blog.

Article written by:
Natalie Kudan
