Natalie Kudan
Image recognition tasks: the most common uses across the industry
As we’ve discussed in many previous posts on this blog, both data collection and data labeling (aka data annotation) play an essential role in Machine Learning (ML) and AI application development, as well as in some more traditional domains not powered by AI (e.g., stock trading, linguistics, and archeology).
At Toloka, our chosen data collection and data-labeling methodology is known as crowdsourcing, which is a type of global-crowd-powered outsourcing on a large scale. Apart from data collection and labeling for machine learning model training, Toloka’s data annotators (aka crowd contributors aka “Tolokers”) also collect and annotate data for machine learning model evaluation after training and machine learning model monitoring after deployment. To rephrase that, we help AI developers make sure that their AI products do exactly what they’re supposed to do, both before and after the products are released.
Data collection and labeling for machine learning
There are a number of common data collection and data-labeling applications that Toloka’s clients (aka “requesters”) are involved with. They include:
Natural Language Processing (NLP)
Text classification
Text summarization
Named entity recognition (NER)
Machine translation
Speech recognition
Speech synthesis
Question answering
Social Media
Sentiment analysis
Content moderation
Hate speech detection
Opinion mining
Brand reputation management
E-commerce
Product categorization / search-and-filter / rating systems
Product recommendation / recommender systems
Customer segmentation
Price optimization
Inventory management and forecasting
Supply chain optimization
Healthcare
Medical image analysis (e.g., X-rays, CT scans, MRIs, etc.)
Electronic health record (EHR) annotation and classification
Disease diagnosis and prediction
Drug discovery and development
Patient monitoring and risk assessment
Medical chatbots and virtual assistants
Legal
Contract analysis
Legal document summarization
Legal document classification
Intellectual property analysis
Case outcome prediction
Legal research and analytics
Cybersecurity
Intrusion detection
Malware classification
Threat intelligence
Video surveillance
Access control
Data collection and labeling for computer vision
One of the most ubiquitous (and cutting-edge) ML applications can be seen in the domain known as Computer Vision (CV). The term is quite self-explanatory – computer vision essentially enables machines to “see” and recognize the world around them in order to execute a set of predetermined functions. Among some of the most common computer vision tasks, including those performed by Tolokers on our platform, are:
Image classification
Image segmentation
Image transcription
Object detection (including bounding boxes and polygons)
Optical character recognition (OCR)
Facial recognition (including keypoints)
Image recognition technology (i.e., knowing what a digital image shows more generally) and object recognition technology (i.e., the ability to identify objects in digital images) are some of the most popular CV tasks that have multiple real-world uses.
At this point, we won’t be going into the details of a typical image recognition pipeline, that is, how the whole process is organized and carried out from start to finish, beginning with a foundation model and ending with a ready downstream application. If you want to know more about this topic, we invite you to have a look at this blog post. The present article will instead briefly cover some of the most common image recognition methodologies and then focus on the more practical aspects and specific use cases of image recognition and object recognition technology.
Also note that when we talk about image recognition, we don’t necessarily mean that there’s an actual robotic camera with a lens that’s physically “looking” at something (though this is one possibility, of course). Rather, in this context, image recognition refers to any scenario in which a system can accurately process digital images, understand them, and successfully fulfill its intended function.
What’s under the hood of image recognition technology
Among the most well-known methodologies of image recognition today, many of them based on deep learning techniques and deep neural networks, are the following. We list them here along with their main strengths and shortcomings:
Convolutional Neural Networks (CNNs)
This is by far the most popular type of neural network for pretrained image recognition algorithms. It works by learning patterns and features in a digital image through stacked convolutional layers, typically interleaved with pooling layers that downsample the result. CNNs are good at recognizing complex patterns and visual objects regardless of their position, which is great for most image identification systems.
Pros: High accuracy, ability to recognize complex patterns.
Cons: Can be computationally expensive to train, requiring the purchase and operation of high-end hardware, namely Central Processing Units (CPUs) and Graphics Processing Units (GPUs), which can be costly both time- and money-wise.
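To make the idea concrete, here is a deliberately minimal sketch of the two core CNN operations, convolution and pooling, in plain Python. The "image", kernel values, and sizes are all hypothetical; a real CNN learns its kernels from data and stacks many such layers.

```python
def conv2d(image, kernel):
    """Valid 2-D convolution of a single-channel image with a small kernel."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: downsamples the feature map."""
    h, w = len(fmap) // size * size, len(fmap[0]) // size * size
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, w, size)] for i in range(0, h, size)]

# A toy 6x6 "image" with a vertical edge, and a kernel that responds to it.
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
kernel = [[-1, 1]]

features = conv2d(image, kernel)  # 6x5 feature map, activates at the edge
pooled = max_pool(features)       # 3x2 downsampled map
```

The pooling step is part of what makes CNNs tolerant to small shifts in an object's position: nearby activations collapse into the same pooled cell.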
Recurrent Neural Networks (RNNs)
This type of network is designed to process data sequentially (e.g., frame by frame) and capture long-term dependencies and relationships, which is useful for video frames and other sequential image recognition applications.
Pros: Good at processing sequential data.
Cons: Can also be computationally expensive.
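As a sketch of the recurrence itself, each step folds the previous hidden state into the new one. The tiny weight matrices and per-frame inputs below are hypothetical; a real RNN learns its weights during training.

```python
import math

def rnn_step(x, h_prev, W, U):
    """One recurrent update: h_t = tanh(W @ x_t + U @ h_{t-1})."""
    return [math.tanh(sum(W[i][j] * x[j] for j in range(len(x))) +
                      sum(U[i][k] * h_prev[k] for k in range(len(h_prev))))
            for i in range(len(h_prev))]

# Hypothetical network: 2-dim per-frame features, 2-dim hidden state.
W = [[0.5, 0.0], [0.0, 0.5]]  # input-to-hidden weights
U = [[0.1, 0.0], [0.0, 0.1]]  # hidden-to-hidden (recurrent) weights

h = [0.0, 0.0]
for frame in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:  # a short "video"
    h = rnn_step(frame, h, W, U)  # state now summarizes all frames so far
```

Because `h` is carried forward between frames, the final state depends on the whole sequence, which is exactly what video-oriented recognition needs.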
Support Vector Machines (SVMs)
These are linear models that find a “hyperplane” (a decision boundary that generalizes a line to higher dimensions) in a high-dimensional space in order to separate different object classes within images.
Pros: Simpler than neural networks and more computationally efficient.
Cons: May not perform as well on more complex image recognition and object detection tasks.
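The decision rule itself is simple: once the hyperplane's parameters are learned, classification is just a sign check. A minimal sketch, with hypothetical weights and image features:

```python
def svm_predict(x, w, b):
    """Classify a feature vector by which side of the hyperplane w.x + b = 0 it lies on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical learned parameters separating two image classes by two
# features (say, average brightness and edge density).
w, b = [1.0, -1.0], 0.0

svm_predict([0.9, 0.2], w, b)  # bright, few edges  -> class  1
svm_predict([0.1, 0.8], w, b)  # dark, many edges   -> class -1
```

Training an SVM means choosing `w` and `b` to maximize the margin between classes; prediction stays this cheap, which is where the computational efficiency comes from.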
Deep Belief Networks (DBNs)
These are a type of deep learning image recognition network that’s composed of multiple layers of “hidden units” that learn hierarchical representations used in image processing. This means that these image recognition algorithms can identify which patterns are more and which are less significant.
Pros: Good at hierarchical learning and feature extraction.
Cons: Can also be expensive to train (CPUs, GPUs, etc.).
K-Nearest Neighbors (KNNs)
This approach uses a non-parametric method (i.e., it doesn’t assume any particular distribution shape/type) in order to find the nearest training examples to a new image and then classify that image based on its nearest “neighbors.”
Pros: Simple and easy to implement.
Cons: Can be computationally expensive with larger datasets.
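Since the method is non-parametric, there is no training step at all; classification is a distance search over labeled examples, which is also why it slows down on large datasets. A minimal sketch with hypothetical 2-D feature vectors:

```python
from collections import Counter

def knn_classify(query, examples, k=3):
    """Label a query by majority vote among its k nearest labeled examples."""
    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    nearest = sorted(examples, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical feature vectors extracted from already-labeled images.
examples = [([0.10, 0.20], "cat"), ([0.20, 0.10], "cat"), ([0.15, 0.25], "cat"),
            ([0.90, 0.80], "dog"), ([0.80, 0.90], "dog")]

knn_classify([0.12, 0.18], examples)  # nearest neighbors are all "cat"
```

Every prediction scans the full example set, so the cost grows with dataset size, matching the "expensive with larger datasets" caveat above.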
Transfer Learning models
These models are first trained on large datasets for a particular task and subsequently fine-tuned for a different task using smaller datasets.
Pros: Useful for more specialized image recognition and object identification tasks.
Cons: May not perform well on all target tasks (i.e., may require lengthy fine-tuning periods in order to achieve optimal performance).
Autoencoders
Autoencoders work by learning a condensed version of image data (“encoding”) and then proceed with reconstructing the original data (“decoding”). This can be especially useful for “denoising,” that is, cleaning “noisy” (corrupted or distorted) images during object detection tasks in various image identification systems.
Pros: Good for denoising and compression.
Cons: May not perform as well as other models on image classification tasks.
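The encode-then-decode idea can be illustrated with a deliberately trivial stand-in for the learned mappings. A real autoencoder learns both the encoder and decoder from data; here "compression" is just pairwise averaging, which happens to cancel symmetric noise:

```python
def encode(pixels):
    """Compress by averaging adjacent pixel pairs (stand-in for a learned encoder)."""
    return [(pixels[i] + pixels[i + 1]) / 2 for i in range(0, len(pixels), 2)]

def decode(code):
    """Reconstruct by duplicating each code value (stand-in for a learned decoder)."""
    return [v for v in code for _ in range(2)]

clean = [1.0, 1.0, 0.0, 0.0]            # the underlying signal
noisy = [1.2, 0.8, 0.1, -0.1]           # the same signal with added noise

reconstructed = decode(encode(noisy))   # close to the clean signal
```

The half-length output of `encode` is the "bottleneck": forcing the data through it discards per-pixel noise while keeping the coarse structure, which is the intuition behind autoencoder-based denoising.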
Faster R-CNN, Mask R-CNN, You Only Look Once (YOLO), and Single Shot MultiBox Detector (SSD)
These are all specific implementations of object detection and image recognition models based on CNNs:
Faster R-CNN
This type of network uses a region proposal network to suggest object locations and a classification network to predict class labels, i.e., it basically suggests where the objects of interest are and what they are.
Pros: High accuracy for object detection tasks, can be easily fine-tuned for specific use cases.
Cons: Requires a large amount of training data, computationally expensive.
Mask R-CNN
It is a modified version of Faster R-CNN that predicts not only the location and class of an object in image recognition tasks, but also its precise shape or boundary as a set of pixels (called a “mask”).
Pros: Predicts both the location and shape of objects in image recognition tasks, high accuracy for object detection tasks.
Cons: Requires a large amount of training data, also computationally expensive.
You Only Look Once (YOLO)
This is a very popular one-stage model (not least among our requesters) that predicts class labels and bounding boxes of objects in an image. In other words, it can draw an outline of each object to show where it’s located and tell us what kind of object it is exactly (e.g., a traffic cone, a person, a tree, etc.).
Pros: Fast inference time (i.e., rapid responses), low memory usage.
Cons: Lower accuracy compared to other models such as Faster R-CNN, may have difficulty detecting small objects.
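Downstream of a detector in this family, predicted boxes given as (center, width, height) are typically converted to corner coordinates and scored against ground truth with intersection-over-union (IoU). A minimal sketch with hypothetical box values:

```python
def to_corners(box):
    """Convert a (center_x, center_y, width, height) box to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

pred  = to_corners((1.0, 1.0, 2.0, 2.0))  # hypothetical predicted box
truth = to_corners((1.0, 1.0, 2.0, 2.0))  # hypothetical ground-truth box
iou(pred, truth)                          # 1.0 for a perfect match
```

IoU is the standard yardstick for detector accuracy: a prediction usually counts as correct only when its IoU with a ground-truth box exceeds a threshold such as 0.5.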
Single Shot MultiBox Detector (SSD)
This is a one-stage model similar to YOLO, as it also identifies objects and informs us of their location. The difference is that SSD uses different scales to detect objects, which can improve accuracy but may also make image identification systems that rely on it run slower.
Pros: Offers a good balance between accuracy and performance speed, can detect objects of different sizes.
Cons: May struggle with identifying small objects, can be computationally expensive for larger models.
To sum up, choosing the best image recognition model depends on a number of factors, including the size of your dataset and the complexity of the objects being recognized. In other words, it all depends on the specific requirements of your task at hand. It’s also crucial to carefully consider the trade-off between accuracy and computational resources when selecting a methodology: for better or worse, there’s simply no magic pill here.
In most cases, at least some computational expense is unavoidable, which is why optimizing your ML pipeline and having a robust work environment is essential, as is reducing your data-associated costs, including those for data collection and data labeling. With that in mind, let’s now look at how this image recognition and object detection technology is being utilized by some of our requesters.
What is image recognition used for
Product development and moderation
Case 1
AliExpress, a global e-commerce giant, faced the significant challenge of complying with the constantly changing and differing regulations in each of the ten CIS (Commonwealth of Independent States) countries it serves. To address this challenge, the company needed a highly scalable and efficient moderation system that would filter out illegal, culturally inappropriate, or inferior products from its platform.
The company partnered with Toloka with the goal of creating one such system that would take into account each country’s local laws and cultural differences. The first step was defining the five categories of products to be moderated, ranging from items that can be sold without restrictions to products that are prohibited by law.
To handle the large volume of products on the platform, AliExpress prioritized items for moderation based on the frequency of views. These selected items were sent to a pre-trained ML model that predicted whether each item was acceptable or not, providing a confidence score to indicate the reliability of the verdict. Items with a high confidence score were added directly to the database, while uncertain ones were sent to Tolokers for human-handled verification.
The annotation process consisted of three stages, with all crowd contributors taking an exam before getting assigned to the final task. Inconsistencies in answers were sent to a second task level, where response accuracy was rechecked.
The platform’s partnership with Toloka (i.e., ML pipeline optimization with human verification) cut the cost per verified item in half and improved their image recognition system (specifically, its moderation efficiency) 500-fold: from only 200 items per day to 100,000, while maintaining a 98.7% label quality.
Case 2
Neatsy is an innovative app that creates a digital model of feet (yes, you read it right). It features a 3D scanner that generates a 3D model of the customer’s foot, which it then uses to suggest the most appropriate footwear.
The app is powered by a sophisticated neural network designed to effectively isolate human feet from the ground. Training the 3D scanner’s underlying model required over 50,000 labeled images. Acquiring this many labeled images of feet within a tight deadline was a daunting challenge. The company explored several options before deciding to partner with Toloka, seeking a robust and speedy solution capable of handling image labeling at scale.
A special training pool was created for our crowd contributors, followed by a competency test for each Toloker before they could join the taskforce. Three weeks after that, the company received all 50,000 labeled images.
These labeled images played a crucial role in retraining Neatsy’s neural network, resulting in a significant improvement in segmentation quality as compared to the baseline – the app’s 3D scanner increased its accuracy by 12%.
Case 3
An up-and-coming smartphone company faced a challenge in developing an intuitive CV algorithm that could detect and recognize different hand gestures from a vast database of images. However, the small in-house team lacked the resources (and the time) required to quickly label and classify all of the images, leading to serious delays in the product’s time to market, with the “runway” beginning to run out (i.e., investment covering ongoing expenses).
To overcome this hurdle, the company sought the assistance of our crowd contributors to scale up the labeling volume. In just under two days, Tolokers were able to deliver 2,000 new labels with a high accuracy rate of 93.5%. As a result, the company was able to improve its image recognition software and deliver a more user-friendly experience to its customers.
Human face and body recognition
Case 1
A Japanese startup approached Toloka with a challenge of labeling human faces in 34,000 images from various TV shows. Unsurprisingly, the company was looking for a solution that was fast, affordable, and dependable.
The project faced its first challenge when it came to defining what should be considered a human face. This wasn’t clear since some of the images contained anime characters, drawings, computer-generated imagery, and humanoid androids. After some deliberation, it was determined that human faces would encompass all characters in the images except for animated figures and manga.
To tackle this, Toloka assigned different pay rates for processing images of varying complexity and tasked all crowd contributors with learning and labeling images of increasing difficulty levels. Moderators, in turn, checked for quality control, ensuring that each image was correctly labeled. The entire task was completed in three stages – introduction and practice, labeling, and quality control – with a smooth learning curve that allowed our crowd contributors to deliver high-quality work on more complex images.
Over a three-week period, Tolokers labeled and submitted images with 65,000 faces at a cost of approximately $0.015 per face. This was estimated to be roughly 2.5 times cheaper than any non-crowdsourcing solution available on the market at the time of project completion, while maintaining quality at or above the market average. Other methods like CVAT (Computer Vision Annotation Tool), which the startup team had previously considered, would have required more labor expenses (i.e., highly trained specialists) and more time.
Utilizing the power of Toloka’s global crowd made it possible to label tens of thousands of faces at a fraction of the expected cost by using a much larger pool of non-expert participants and aggregating their results instead, i.e., overlapping multiple submissions per image to reach higher accuracy levels.
Case 2
IVI is a video streaming platform that offers over 20,000 movies and TV series to its users, with personalized recommendations as one of its key features. Personalization of title posters is an important aspect of IVI’s marketing strategy. To achieve this, multiple posters need to be created for each title.
However, the process of handcrafting posters is very time-consuming. To save time, IVI created a tool called Parker, which generates high-quality title previews and posters automatically. Though this approach saves time, it has a major drawback – Parker can often produce posters with awkward-looking facial expressions when caught mid-frame, making them largely unusable. With no viable solution in sight and a pressing deadline for new releases, IVI turned to Toloka.
To reliably retrain Parker’s algorithm, IVI needed about 72,000 labeled images. To meet this requirement, the company extracted high-quality facial images from 18 different films, with each actor having 10 to 100 images.
The task of labeling this data involved five stages: decomposition, instructions, task interface, quality control, and aggregation. Tolokers were tasked with answering a binary classification question “Is this a normal facial expression or not?” in two phases. The first phase was to confirm that each image indeed contained a human face, while the second phase was to answer the main question.
IVI leveraged Toloka’s ready image classification template for all phases and subphases of the labeling task. Each page displayed 8-10 images and 3 radio buttons (graphical controls that allow a single selection), letting Tolokers use their keyboard and mouse for inputs. To minimize errors and exclude bots or unscrupulous annotators, the project had four quality control mechanisms in place: a training exam, hidden tests, overlap, and assurance tools (various techniques used to make sure that crowd contributors are real people who are paying close attention).
All of the submitted responses were aggregated to get the final results. The assignment took four days – 11 times faster than the estimated in-house alternative. The accuracy rate fluctuated somewhat initially and subsequently climbed to and stayed at 90% toward the end of the project.
This significant improvement to Parker’s image recognition algorithm through “human-in-the-loop” data labeling (i.e., human-handled annotation) allowed the platform to identify and prioritize attractive-looking posters and previews over inferior ones.
Case 3
When it comes to missing person cases, law enforcement agencies have a variety of tools and tactics at their disposal. One such tool that has become increasingly popular in recent years is the use of drones, or unmanned aircraft systems (UAS), for surveillance and search-and-rescue operations across the United States and other parts of the world.
In Virginia, the Loudoun County Sheriff’s Office purchased an Indago drone that was deployed for search within 2 minutes of the first call and proved more agile than a helicopter. The drone was able to enter narrower and harder-to-reach spaces closer to the ground, ultimately leading to the quick discovery of a missing person. Likewise, in Connecticut, the Hartford Police Department has been able to locate several missing persons across large areas along the Connecticut River since the inclusion of UAS devices into their search-and-rescue operations.
However, simply having drones and accompanying navigation equipment is not always enough. With a large volume of raw and often noisy (i.e., inconclusive) drone imagery, finding a missing person can feel like finding a needle in a haystack. For this reason, post-surveillance data labeling – that is, processing search imagery after the fact – has become an essential component of drone surveillance and image recognition in missing person cases.
Liza Alert, a non-profit organization that searches for missing persons across Eastern Europe and Central Asia, has partnered with Toloka to utilize the power of our global crowd to carry out post-surveillance data labeling. Liza Alert teams use drones to search for missing persons on-site, often in the densely forested wilderness or secluded rural areas, and the imagery is then passed on to Tolokers for further processing.
Our crowd contributors label the data and identify areas where the missing person may be located, which helps with rescue efforts. Simultaneously, this provides new annotated data that can be used to train and fine-tune future ML-powered search-and-rescue applications that combine object detection and facial recognition technologies.
As of 2021, Tolokers had processed over 330,000 drone images collected by Liza Alert, 49,000 of which contained people. This led to over 200 search missions and ultimately saved the rescue organization 330 working days, a significant improvement.
The use of surveillance drones and crowdsourced data labeling has proved to be a powerful combination in missing person cases. With the ability to collect and process large volumes of data quickly and efficiently, law enforcement agencies and rescue organizations are better equipped now than ever to locate missing persons and bring them to safety.
Concluding remarks
As we’ve seen, creating image recognition tools and object detection algorithms is a rather complex process that may rely on any number of CV models, depending on the specifics of a given AI downstream application. In most cases, it comes down to the trade-off between time and accuracy, as well as computational resources (mainly hardware-related) that are required to carry out ML training and fine-tuning.
Because of this, it’s important to have a well-structured ML pipeline in place along with a robust infrastructure that can facilitate work at every stage. It’s equally important to possess the ability to transition from one stage to the next without any incompatibility issues. Lastly, in addition to having a sustainable work environment, it’s crucial for AI developers to obtain the right data that has its uses not only during ML training, but also during evaluation and monitoring, both pre- and post-deployment (i.e., before and after an image recognition application hits the market).
When done right, image recognition can have far-reaching, industry-disrupting effects on different ML-backed fields, among them e-commerce, autonomous driving, smartphone apps, digital media, as well as search-and-rescue operations involving surveillance drones. We’re thrilled that these cutting-edge image recognition and object detection technologies are being designed, built, and implemented by many of our requesters the world over, with the data- and platform-related support coming from Toloka’s own crowd contributors, data scientists, and ML engineers.
Updated:
Mar 9, 2023