Natalie Kudan
What is text annotation in machine learning?
Used widely across ML-powered businesses, text annotation helps to solve Natural Language Processing (NLP) tasks for machine learning models. If you’re interested in learning more about text annotation in machine learning and how to annotate language data, this article is for you. We’ve outlined some key text annotation examples to help you gain a better understanding — and illustrated how you can employ crowdsourcing for increased productivity.
Before we dig deeper, it’s important to review the basics of machine learning and how text annotation fits into it.
What is machine learning?
No doubt you’ve been hearing a lot about ML — it’s everywhere these days and it’s changing the world as we know it. ML is behind chatbots, virtual assistants, translation apps, your social media feeds, the shows you’re recommended to watch, and more. It powers autonomous vehicles and propels advancements in medical innovations, including gene-based technologies and customized therapy treatments — and that’s just the beginning. It’s used across all kinds of industries and can be applied to any scenario where large quantities of data need to be processed quickly.
ML focuses on the use of data and algorithms to imitate the way humans learn, gradually improving in accuracy. In other words, ML aims to build algorithms that can learn from data, identify patterns, and make predictions. ML and other next-generation technologies will inevitably change the way people connect, interact, and evolve.
Advances in NLP have highlighted the increasing demand for textual data in data science and across ML-powered industries. Text annotation provides datasets for training machine learning models to process text or audio data, recognize the contents of documents, and understand the underlying emotions within them.
What is text annotation?
Text annotation involves assigning labels to a text document or different elements of its content. Even though there has been remarkable progress in ML, language is sometimes difficult to understand and decode, even for humans. Text annotation can help with this: sentence components are highlighted by specific criteria to prepare datasets to train a model that can effectively recognize the language, context, or sentiment behind the words.
Language is both nuanced and complex, often involving common expressions and colloquialisms, as well as specialized forms such as idioms, metaphors, sarcasm, and rhetorical questions that are culturally specific and require an understanding of context to interpret correctly, something machines still struggle with. Take, for example, the expression “it’s a piece of cake!” While the intended meaning is that something is simple or easy to accomplish, an NLP model is likely to take this at face value: a literal piece of cake. Accurate text annotations help AI models comprehend the key information in the data provided, resulting in a more faithful interpretation of the text.
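To make this more concrete, here is a minimal sketch of what annotated text samples might look like in practice. The field names and labels are purely illustrative and don’t correspond to any specific platform’s schema.

```python
# Hypothetical annotated samples: each text snippet carries a label that tells
# a model whether the phrase is meant literally or as an idiom.
annotated_samples = [
    {"text": "It's a piece of cake!", "label": "idiom", "meaning": "easy to do"},
    {"text": "I'd like a piece of cake.", "label": "literal", "meaning": "a slice of cake"},
]

for sample in annotated_samples:
    print(f'{sample["text"]!r} -> {sample["label"]} ({sample["meaning"]})')
```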
What is text annotation used for?
Despite the growing number of tasks that computers can now be taught to carry out, NLP remains an area where they fall short. Without annotators, an AI model can’t gain a deeper understanding of the natural flow of language, so companies continue to turn to human annotators to ensure the accuracy and quality of the text used for training. NLP-based AI includes voice assistants, automatic translation, chatbots, and search engines; beyond these, there’s practically no limit to the variety of text annotations that can be implemented.
Over the past several years there has been a significant shift toward using neural networks for NLP. Popular techniques include the use of word embeddings to reflect the semantics of words and end-to-end learning for higher-level tasks such as answering questions. Below are some examples of the most common tasks in NLP.
Text and speech processing
There are several subcategories under the umbrella of text and speech processing:
Optical character recognition (OCR)
OCR involves the process of converting images of typed, handwritten, or printed text into corresponding machine-encoded text — for example, from a scanned document or a photo of a document. An OCR program extracts and repurposes data from scanned documents, camera images and image-only PDFs, so that the original content can be edited. With the help of AI, OCR software can incorporate more advanced methods of intelligent character recognition (ICR), such as determining languages or handwriting styles.
Occasionally referred to as text recognition, OCR is an efficient technology that saves time, money, and resources by using automated data extraction and storage capabilities.
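As a rough illustration, a basic OCR call in Python might use the open-source Tesseract engine through the pytesseract package. The file name below is a placeholder, and the Tesseract binary itself has to be installed separately.

```python
# Minimal OCR sketch: convert an image of a document into machine-encoded text.
from PIL import Image
import pytesseract

image = Image.open("scanned_page.png")          # hypothetical scanned document
text = pytesseract.image_to_string(image, lang="eng")
print(text)
```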
Speech recognition
Also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (not to be confused with voice recognition), speech recognition converts audible words from a sound clip into readable text. In other words, it translates speech from a verbal to a text-based format.
Speech recognition aims to overcome several challenges. For example, there are rarely pauses between words in natural speech, which makes speech segmentation an important subtask. Additionally, sounds tend to blend into one another in a process known as coarticulation. Accents and dialects must also be accounted for, which makes it essential for the software to recognize many different pronunciations of the same words as identical. More advanced software incorporates AI and ML to tackle these challenges.
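For illustration, a minimal speech-to-text sketch with the Python SpeechRecognition package might look like this. The audio file name is a placeholder, and the default recognizer sends the audio to a hosted recognition service, so it needs an internet connection.

```python
# Minimal ASR sketch: transcribe a short WAV clip into text.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("meeting_clip.wav") as source:   # hypothetical audio file
    audio = recognizer.record(source)              # read the entire clip

print(recognizer.recognize_google(audio))          # hosted speech-to-text service
```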
Morphological analysis
In NLP, morphological analysis examines the internal structure of words. It covers multiple subtasks, including lemmatization, morphological segmentation, part-of-speech tagging, and stemming; a short code sketch illustrating several of them follows the subsections below.
Lemmatization
Lemmatization involves removing inflectional endings to restore the normalized, dictionary-based form of a word, known as a lemma.
Morphological segmentation
The separation and classification of words into individual morphemes (the smallest meaningful element of a linguistic expression). This can be a difficult task depending on the complexity of the language.
Part-of-speech (POS) tagging
This process involves annotating the functional elements of speech within the text data. Many words can serve as multiple parts of speech such as adjectives, nouns, adverbs, verbs, etc. Take for example the word “book”, which can serve as both a noun and a verb, i.e., “to read a book” vs. “to book a flight”.
Stemming
Stemming involves reducing inflected (or derived) words to a base form – for example, “close” is the root for “closed”, “closing”, “close”, “closer”, etc. While the results are similar to lemmatization, stemming adheres to rules, not a dictionary.
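As a rough sketch, the NLTK library can demonstrate part-of-speech tagging, lemmatization, and stemming side by side. The sample sentence is arbitrary, and the exact names of the downloadable NLTK resources can vary slightly between versions.

```python
# Morphological analysis sketch with NLTK: POS tagging, lemmatization, stemming.
import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer

# One-time downloads of the tokenizer, tagger, and lemma dictionary.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")
nltk.download("wordnet")

words = nltk.word_tokenize("The books were closed before closing time")
print(nltk.pos_tag(words))                               # part-of-speech tags

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
print([lemmatizer.lemmatize(w.lower()) for w in words])  # dictionary-based lemmas
print([stemmer.stem(w) for w in words])                  # rule-based stems
```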
Syntactic analysis
An analysis that focuses on the grammatical structure of sentences and how their parts relate to one another, syntactic analysis covers the following subtasks (a brief code example follows them):
Grammar induction
The process of learning a formal grammar from a set of observations, which can then be used to build a model that accounts for the characteristics of the observed text.
Sentence breaking (or sentence boundary disambiguation)
The process of deciding where sentences begin and end.
Parsing
The process of analyzing a string of symbols to determine its grammatical structure according to the rules of a formal grammar.
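As a brief example, sentence breaking and dependency parsing can both be demonstrated with spaCy. This assumes the small English model en_core_web_sm is installed, and the sample text is arbitrary.

```python
# Syntactic analysis sketch with spaCy: sentence boundaries and a dependency parse.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Parsing is useful. It shows how the words in a sentence relate.")

for sent in doc.sents:                       # sentence boundary disambiguation
    print(sent.text)

for token in doc:                            # dependency relation of each token
    print(token.text, token.dep_, token.head.text)
```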
Higher-level applications of natural language processing
Automatic summarization, dialogue management, document AI, grammatical error correction, machine translation, natural language generation (NLG), natural language understanding (NLU), question answering, and text-to-image generation are just some examples of higher-level NLP applications. A few are described below, followed by a short code sketch.
Automatic summarization
Produces a concise, readable summary of a longer text, such as a research paper or article.
Document AI
Enables a computer to “understand” the content of documents, extract entities and data from various document types, and perform document classification.
Machine translation
Allows you to automatically translate text from one human language to another.
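As a quick illustration, two of these higher-level tasks, summarization and machine translation, can be sketched with Hugging Face pipelines. Pretrained models are downloaded on the first run, and the input text is an arbitrary example.

```python
# Higher-level NLP sketch: summarization and English-to-French translation.
from transformers import pipeline

summarizer = pipeline("summarization")
translator = pipeline("translation_en_to_fr")

article = (
    "Text annotation assigns labels to documents so that machine learning "
    "models can learn to recognize entities, sentiment, and intent in language."
)

print(summarizer(article, max_length=30, min_length=10)[0]["summary_text"])
print(translator("Machine translation converts text between languages.")[0]["translation_text"])
```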
Types of text annotation
There are various types of text annotation projects. For example, you can label text data using named entity recognition, sentiment analysis, speech recognition, text and intent classification, text recognition, and more.
Here’s an outline of some of the most common types of text annotation.
Named entity recognition (NER)
NER is one of the most commonly used types of semantic annotation. It involves identifying spans of text, typically proper nouns, and labeling them as entities. In essence, NER is entity annotation: different entity mentions within the text are labeled.
Tagging entities in text is also known as entity annotation, extraction, chunking, or identification. Common categories include names of organizations, locations, and persons, as well as numerical values. Sometimes entity linking is performed as well, to define how tagged entities are related to each other.
Use cases: information extraction, search, document indexing, customer support automation.
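For a concrete picture of what an NER model produces, here is a short sketch with spaCy’s pretrained English model. The sentence and company name are invented, and the entity labels come from the model itself rather than from this code.

```python
# NER sketch with spaCy: print every entity span and its predicted label.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Acme Corp hired annotators in Berlin last March to label support tickets.")

for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. "Acme Corp" ORG, "Berlin" GPE, "last March" DATE
```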
Sentiment analysis
Sentiment analysis requires labeling texts by assigning various sentiment categories (most commonly, positive or negative). It can be used for a variety of purposes, from understanding customer reviews to spam filtering.
Sentiment annotation is used to create training datasets for ML models performing sentiment analysis. By extracting human feelings and impressions from text, sentiment annotation categorizes opinions expressed in a text sample on a scale of positive, neutral, and negative.
Use cases: spam detection, email filtering, analyzing customer reviews.
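As a small example, a pretrained sentiment model can be applied with a Hugging Face pipeline. The reviews below are invented, and the default model is English-only.

```python
# Sentiment analysis sketch: classify short reviews as POSITIVE or NEGATIVE.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")
reviews = [
    "The delivery was fast and the support team was great.",
    "The product broke after two days, very disappointing.",
]

for review, result in zip(reviews, sentiment(reviews)):
    print(review, "->", result["label"], round(result["score"], 3))
```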
Intent analysis
Intent annotation is used to train ML models to categorize user queries into relevant predefined intents. Use annotated text data to train your chatbot, voice assistant, or any other conversational agent to better understand your users.
Use cases: chatbots, voice assistants, conversational agents.
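To illustrate, intent-annotated training data might look something like the sketch below. The intent names and field names are hypothetical.

```python
# Hypothetical intent-annotated utterances for training a conversational agent.
training_data = [
    {"utterance": "I want to change my delivery address", "intent": "update_address"},
    {"utterance": "Where is my order?", "intent": "track_order"},
    {"utterance": "Cancel my subscription, please", "intent": "cancel_subscription"},
]

for example in training_data:
    print(f'{example["utterance"]!r} -> {example["intent"]}')
```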
Content moderation
Protect users and your brand image from inappropriate content such as hate speech or violence, and customize moderation to fit your business values and content policies.
Use cases: social media monitoring, review and comment moderation.
Text classification
Classify or categorize entire texts with predefined category tags.
Text classification involves annotating an entire body or line of text with a single label. Categories and tags are assigned to contextual data within lines or blocks of text. This is generally used for labeling topics, detecting spam, and analyzing intent and emotion in a text message or comment.
Use cases: e-commerce, cataloging and recommendations, content moderation, optimized chatbots, web pages and social media posts.
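As a compact sketch, a simple text classifier can be trained with scikit-learn using TF-IDF features and a linear model. The tiny labeled dataset below is invented purely for illustration.

```python
# Text classification sketch: spam vs. not_spam on a toy annotated dataset.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "Win a free prize, click this link now",
    "Limited offer, claim your reward today",
    "Can we move our meeting to Thursday?",
    "Here are the notes from yesterday's call",
]
labels = ["spam", "spam", "not_spam", "not_spam"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["Claim your free reward now"]))   # expected: ['spam']
```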
How to annotate language data
Generally, either full-time specialists perform the labeling by hand, which requires a great deal of time and expense, or freelancers step in and perform small tasks using crowdsourcing platforms. When labeling text, your aim is to ensure that the target reader (i.e., an ML model) can better comprehend key pieces of the data.
Large-scale text annotation requires a lot of human resources to manage. To scale up while maintaining the quality of in-house data labeling, companies need to dramatically increase the time and resources they spend, making the entire process slow and costly.
Fortunately, there are alternatives, and one of them is crowdsourcing. Crowdsourcing engages a large number of freelancers who pick up labeling tasks on a crowdsourcing data annotation platform. ML teams post data labeling tasks, and people choose and complete the tasks they want to do to earn money.
How to speed up text annotation with crowdsourcing
A data labeling task must be properly formulated so that thousands of people can complete it within hours. Having a specialist correctly configure the annotation pipeline and its quality control allows you to scale text annotation and obtain large volumes of labeled data quickly and economically.
Toloka provides a data labeling platform with tools to control labeling accuracy and build a consistent pipeline for acquiring training data for ML. The platform supports various data labeling tasks, including search relevance, text classification, sentiment analysis, utterance collection, audio data collection, transcription, and more.
Furthermore, Toloka has ready-made task templates for labeling and step-by-step tutorials for various text annotation tasks. Check out our documentation for a tutorial on sentiment analysis and content moderation.
Updated:
Dec 13, 2022