How does image recognition work?

by Natalie Kudan

As we’ve seen in other posts on this blog, machine learning (ML) supports AI applications across numerous industries. At its core, machine learning always involves two components – a machine learning model (or training algorithm) that serves as a set of coded instructions, and annotated data that is fed into the model and serves as the basis for its learning.

Today, we’ll talk about computer vision (CV); namely, how image recognition works, how AI applications are trained to carry out image recognition tasks, how AI image recognition software is used in business, and what role annotated data plays in all of this. If you are interested in image recognition for business, or you’d like to become a data annotator who tackles image recognition tasks – read on! This article aims to make highly technical processes understandable to those who have little to no background in ML.

Computer vision and AI-assisted image recognition

We should start off by defining a few terms. Computer vision is a field of artificial intelligence that deals with systems that can “see” and understand the world around us. Much of it has to do with what’s known as image processing – interpretation and manipulation of visual data.

Among other subfields or “tasks,” computer vision and image processing include what’s known as “image recognition,” which is about being able to grasp what an image shows and categorizing its content into object classes. In layman’s terms, AI image recognition ultimately comes down to naming or describing an image (e.g., “this is a bicycle.”).

Sometimes, image recognition is used synonymously with object recognition or object detection, particularly in non-scientific publications; strictly speaking, however, this isn’t accurate. While there’s an overlap, object detection is normally a more complex task, as it involves locating and identifying multiple objects within one digital image. In other words, object detection includes image recognition, but not necessarily the other way around. All in all, image recognition as a computer vision task plays a key role in processing images in general and object detection in particular.

Typical image recognition pipeline

As is the case with the NLP-related AI applications that we covered in this article, AI applications trained to carry out image recognition follow a similar workflow or “pipeline.” This means that the same stages exist within their machine learning “life cycle” from start to finish:

  • Data collection
  • Data preparation
  • Data labeling
  • Data preprocessing
  • Model training
  • Model evaluation
  • Model deployment
  • Model monitoring

As you can see, the machine learning life cycle can be divided into two large segments – the one that deals with the data, and the one that deals with the model. Let’s look at each one in more detail.

Data collection for an image recognition system

Data collection refers to the process of obtaining a dataset required for ML model training. In the case of an image recognition model, the dataset needs to contain images whose type and content are dictated by our future downstream application, i.e., what exactly we need to accomplish with our artificial intelligence solution. For instance, if we were preparing an image recognition algorithm for airport security, we would need a dataset with images of potentially hazardous materials, firearms, threatening poses, and so on.

File formats

It’s normally at this stage that we also need to decide whether we want to use vector or raster images. Raster images are made up of a grid of pixels, each one carrying certain values, such as color and intensity. Common formats we’ve all heard of include JPEG, PNG, TIFF, BMP, and GIF. These images have fixed resolutions, meaning that they will lose quality when scaled up or down.

Vector images, with file formats like SVG and EPS, are different, because they are made up of lines and shapes that are defined in terms of mathematical equations. This characteristic makes vector images infinitely scalable, i.e., these images will not lose quality when scaled up or down.

Each type has its benefits and drawbacks. If our AI application for image recognition requires fixed high-resolution images that contain fine details and very slight differences in color and intensity, then going for raster images may be the way to proceed. Conversely, if our AI solution needs to have a degree of flexibility, that is, possess the ability to continuously resize or edit images, then choosing a vector format may be better.
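
For those who prefer to see this in practice, here is a minimal sketch that uses the Pillow library to inspect and resize a raster image; the file name is a hypothetical placeholder.

```python
# A small sketch using the Pillow (PIL) library to inspect a raster image.
# "product_photo.jpg" is a hypothetical file name – any JPEG or PNG will do.
from PIL import Image

img = Image.open("product_photo.jpg")
print(img.format, img.size, img.mode)    # e.g., JPEG (1920, 1080) RGB

# Raster images have a fixed pixel grid: shrinking and then restoring the
# original dimensions discards detail, which is why resolution matters for
# applications that depend on fine-grained differences.
small = img.resize((192, 108))
restored = small.resize(img.size)        # noticeably blurrier than the original
restored.save("restored.jpg")
```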

Data collection methodologies

There are different ways that AI product makers can collect their data. Among the most common methodologies are using stock images or photo libraries, scraping, and crowdsourcing.

Stock images or photo libraries

Using ready-made libraries is a great, easy-to-execute option that can often help artificial intelligence developers. However, as always, there’s a limitation: you’re basically stuck with whatever images are available to everyone else, which may or may not meet your specific requirements. These “ready-made” datasets may also contain inherent biases and subjective data, that is, they may not always reflect real-life situations or accurately represent target populations.

Scraping

Scraping refers to the process of using automated tools to find (known as “crawling”) and then extract data from the web – in this case, image data. The main disadvantage of scraping has to do with legal and ethical issues. For example, you may end up with copyrighted materials in your dataset and private or sensitive information. Some jurisdictions have gone as far as to restrict scraping in certain areas of the IT sector for this very reason, though this approach still remains widespread.
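
As a rough illustration, a scraping workflow often boils down to fetching a list of image URLs and saving the files locally. The sketch below uses the requests library; the URLs are placeholders, and in a real project you would first check the site’s terms of service and copyright status.

```python
# A minimal, hypothetical sketch of downloading images from a list of URLs.
import os
import requests

image_urls = [
    "https://example.com/images/sign_01.jpg",   # placeholder URLs
    "https://example.com/images/sign_02.jpg",
]

os.makedirs("raw", exist_ok=True)
for i, url in enumerate(image_urls):
    response = requests.get(url, timeout=10)
    if response.ok:
        with open(f"raw/{i:04d}.jpg", "wb") as f:
            f.write(response.content)            # save the raw image bytes
```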

Crowdsourcing

The main advantage of crowdsourcing in the context of data collection – and spatial crowdsourcing at Toloka in particular – is that it involves creating completely new data offline. These are known as “field” or “feet-on-street” tasks. In this scenario, crowd contributors (i.e., data annotators) physically visit various places of interest and take photos of target objects.

This works tremendously well for AI-assisted image recognition systems that rely on information available to the general public. For example, the image recognition software behind self-driving vehicles can be trained with photos of road signs and traffic lights. In other cases – for instance, ML-powered applications that rely on medical images of internal organs – crowdsourcing can assist with data labeling more than with data collection due to domain specificity.

Data preparation for image recognition

Now that we’ve collected our image data, we need to prepare it for labeling. This process entails a number of substeps. One of them is data cleaning, which involves removing corrupted/unreadable images, unnecessary duplicates, and other inconsistencies and errors, such as missing values or incorrect file names. Another big part of data preparation is known as data augmentation. This is a crucial step that’s aimed at making datasets more balanced in order to combat underfitting and overfitting.
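
To make the cleaning step more concrete, here is a hedged sketch that removes unreadable files and exact duplicates from a hypothetical “raw” folder using Pillow and file hashing; real pipelines typically also handle near-duplicates and metadata issues.

```python
# A simplified data-cleaning sketch: drop corrupted images and exact duplicates.
import hashlib
from pathlib import Path
from PIL import Image, UnidentifiedImageError

seen_hashes = set()
for path in Path("raw").glob("*.jpg"):
    try:
        with Image.open(path) as img:
            img.verify()                     # raises if the file is corrupted
    except (UnidentifiedImageError, OSError):
        path.unlink()                        # remove unreadable images
        continue

    digest = hashlib.md5(path.read_bytes()).hexdigest()
    if digest in seen_hashes:                # byte-for-byte duplicate
        path.unlink()
    else:
        seen_hashes.add(digest)
```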

Before we can explain what these terms mean, we first need to understand the so-called “bias-variance trade-off” – a key concept not only in computer vision but across all domains of ML. It basically means that we need to strike the right equilibrium between a model that is too specific and one that is too general, because a substantial shift toward either extreme will result in a poorly performing model.

Bias vs. variance

If our model has low bias and high variance, it means it recognizes specific features as opposed to general patterns. In other words, this model won’t recognize different versions of the same object after training. A good example would be an ML model trained to recognize different types of clothing. Let’s say that one of the object classes in the training dataset contained examples of a very particular kind – all of the ties were multicolored and had stickpins.

As a result, now when the ML model encounters a white tie without a stickpin during deployment (when new data is introduced), it fails to recognize this item as a tie. This is the case because it has come to associate all ties with stickpins and multi-colored designs by default. This is known as overfitting.

If our model has high bias and low variance, it means it can recognize only very general patterns as opposed to specific features. In other words, this model will have problems identifying objects from different object classes if they appear similar. Going back to our previous example about clothing, let’s imagine that our dataset wasn’t varied enough – it had too few examples.

As a result, the model may be able to recognize that an item, such as a tie, is indeed a piece of clothing, but it may not be able to distinguish between different types of clothing, such as ties and pants. Or, it may even confuse a tie with a non-clothing item that happens to look similar, such as a chest tattoo in a similar shape. This is known as underfitting.

Viral examples of overfitting and underfitting

Here are some of the best-known examples of AI image recognition algorithms’ “epic fails” that have become popular memes by now. They’re all about how AI image recognition software – when not properly trained – can struggle to tell the difference between our favorite food and our favorite pets:

  • Is it a labradoodle or a piece of crispy chicken?
  • Is it a corgi’s bottom or a loaf of white bread?
  • Is it a chocolate croissant or a sloth?

Let’s take the first example to explain. Perhaps the model was trained on a dataset that’s not representative of the real-world distribution of labradoodles; for example, all of the labradoodles or dogs in general that the model encountered were black. This is overfitting. As far as the model is concerned, a dog looks a certain way, and it’s always black. If something looks like a labradoodle but isn’t black, then it cannot be a dog. However, it also fits another category – crispy chicken – and that category offers a better match by color.

The same confusion may arise because of underfitting. Let’s say the model was trained on a small dataset, resulting in low variance. Now, it can only detect basic characteristics. And if these characteristics seem to overlap, our image recognition system won’t be able to distinguish between different object classes. Consequently, as far as this model is concerned, there’s no difference between the golden curls on the fur of those labradoodles and the golden crispy skin on those pieces of fried chicken. These traits are not indicative of any one object class in particular.

Data augmentation

When either of the two scenarios is likely to unfold, and we understand that we’re either too low on variance or too low on bias, we can use data augmentation to even things out. In the case of overfitting, the idea is to create more versions of the same image with minor changes. This may entail intentionally adding some “noise” to the image (i.e., variations or fluctuations), applying random rotations, using techniques like flipping and cropping, and so on. The aim is to produce a series of similar images and not allow our model to cling to inconsequential features of object classes.
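
In code, the augmentations listed above might look something like the following torchvision sketch; the exact transforms and parameters are illustrative assumptions rather than a recipe.

```python
# A sketch of image augmentation with torchvision transforms.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                 # flipping
    transforms.RandomRotation(degrees=15),                  # random rotations
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),    # cropping
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # color variation
    transforms.ToTensor(),
    # Additive pixel "noise" applied to the tensor representation
    transforms.Lambda(lambda x: (x + 0.02 * torch.randn_like(x)).clamp(0, 1)),
])
```

Applying this pipeline to the same image several times yields slightly different variants, which discourages the model from latching onto inconsequential details such as a stickpin on every tie.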

In contrast to a dataset that’s too specific (low bias), a dataset that’s too general (low variance) normally poses a bigger problem. The reason is that it’s easier to make different versions of an existing image in order to boost bias when we already have one accurate representation of that object class. It’s harder (though sometimes possible) to come up with a whole new object class and boost variance when there are no representations of that object class in the dataset. Therefore, the best way to deal with low variance is ultimately to collect more images using methods like crowdsourcing.

Data labeling for image recognition

Data labeling is arguably one of the most important stages of the whole machine learning pipeline. This is the case because no matter how brilliant our ML model is, our image recognition application will only go as far as the training data we use. We already covered different data-labeling methodologies in this article about the role of data annotators. If you haven’t read it yet, we invite you to do so to get a better understanding of the global data-labeling landscape.

At Toloka, our chosen data-labeling methodology is crowdsourcing, which is often considered one of the most time- and cost-effective approaches. The reason it’s affordable and fast is that, rather than relying on a smaller, highly specialized team of data annotators, crowdsourcing relies on a large number of crowd contributors from all over the world who are essentially “average Joes.” The trick is that their individual contributions get aggregated (basically put together and verified) by experienced data annotation analysts in order to obtain the desired outcome.
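
As a simplified illustration of aggregation, overlapping labels from several contributors can be reduced to a single answer by majority vote; real projects typically use more sophisticated schemes (for example, weighting annotators by skill), so treat the snippet below and its data as purely hypothetical.

```python
# Majority-vote aggregation of overlapping crowd labels (illustrative only).
from collections import Counter

raw_labels = {
    "img_001.jpg": ["tie", "tie", "scarf"],     # three contributors per image
    "img_002.jpg": ["pants", "pants", "pants"],
}

aggregated = {
    image: Counter(votes).most_common(1)[0][0]  # the most frequent label wins
    for image, votes in raw_labels.items()
}
print(aggregated)   # {'img_001.jpg': 'tie', 'img_002.jpg': 'pants'}
```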

When it comes to data labeling for AI-assisted image recognition applications, the following annotation tasks are completed by Toloka’s crowd contributors on a regular basis:

Image segmentation

Crowd contributors draw boundaries of a desired object or a group of objects within every image in the dataset.

Image classification

Crowd contributors classify images in the dataset by matching their content to predetermined object classes (e.g., clothes, food, tools, etc.) or other descriptive categories (e.g., architecture, sports, family time, etc.).

Image transcription

Crowd contributors transcribe text from images, which could be billboards, letterheads, receipts, or other types of content within the dataset.

Bounding boxes

Our contributors identify target objects within every image in the dataset that match certain object classes and use bounding boxes to mark their exact location.
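
For illustration, a single exported bounding-box annotation might look roughly like the record below; the exact schema varies between projects and labeling tools, so the field names here are assumptions.

```python
# A hypothetical bounding-box annotation record for one image.
annotation = {
    "image": "street_042.jpg",
    "objects": [
        {"label": "traffic_light", "bbox": [312, 80, 364, 190]},  # x_min, y_min, x_max, y_max in pixels
        {"label": "road_sign", "bbox": [518, 122, 600, 204]},
    ],
}
```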

Polygon

Similarly to the previous task, our contributors identify target objects within every image in the dataset that match certain object classes, but this time they draw pixel-perfect polygons around each shape.

Keypoint

Crowd contributors identify and label various anatomical components, facial features, expressions, gestures, and emotions in every image in the dataset that contains a human face.

Side-by-side

Our contributors look at two images next to each other and perform a pairwise comparison, that is, select the better one of the two based on specific criteria (e.g., “Which of the two objects has a round shape?”).

If you want to know more about how data labeling is carried out, both at Toloka and in general, you should check out this article on data labeling for ML and this one that discusses annotation of images.

Data preprocessing for image recognition

The next step in the ML life cycle is data preprocessing – the stage when we get our labeled dataset ready for model training. The main idea is to make sure that everything is consistent and evened out, so that no errors arise in the training stage.

In some ways, preprocessing is similar to data preparation, but it’s positioned after data annotation in the pipeline, not before. In addition to offering solutions to other problems, some well-known techniques can be applied during preprocessing in order to address issues related to the aforementioned bias-variance trade-off (a short example follows the list below). These may apply to images, as well as other data types:

  • Feature engineering (constructing specific data features that are more informative and relevant to the AI application in order to prevent underfitting).
  • Regularization (adding constraints or “penalties” to the ML model in order to reduce complexity and prevent overfitting).
  • Cross-validation (splitting the data into fractions or “folds” and training the ML model on a different fold each time in order to prevent overfitting or underfitting).
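
Here is a brief cross-validation sketch with scikit-learn, using the library’s built-in 8x8 digits dataset as a stand-in for a real image dataset; the model choice and fold count are illustrative.

```python
# K-fold cross-validation on a small image-like dataset (illustrative only).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)           # flattened 8x8 grayscale digit images
model = LogisticRegression(max_iter=1000)

scores = cross_val_score(model, X, y, cv=5)   # 5 folds: train on 4, validate on 1
print(scores.mean(), scores.std())            # consistent scores across folds suggest
                                              # no severe overfitting or underfitting
```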

Model selection and training for image recognition

Now, we’ve arrived at another pivotal point in the ML lifecycle. This is when we need to come up with an ML model that we’ll be feeding our annotated data into. In theory, we could create our ML model from scratch, but this is a major undertaking that requires a significant amount of highly specialized expertise, and in most cases, a PhD in computer science (or a few of them).

More often than not, however, artificial intelligence product developers don’t build things from scratch. As is the case with NLP-based solutions and Large Language Models (Transformers) like BERT and GPT, when it comes to computer vision and image recognition, there are also available models out there that we can use for “fine-tuning.” In other words, we can take a pretrained model and retrain it to perform a more specific image recognition task.
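
As a hedged sketch of what fine-tuning can look like in practice, the snippet below takes a ResNet-18 pretrained on ImageNet (via torchvision), freezes its feature extractor, and swaps in a new classification layer for a hypothetical five-class task.

```python
# Fine-tuning a pretrained CNN for a custom image recognition task (sketch).
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)  # pretrained on ImageNet

# Freeze the pretrained feature extractor
for param in model.parameters():
    param.requires_grad = False

# Replace the final layer with one sized for our own object classes
num_classes = 5                                    # e.g., five clothing categories
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new layer is trained on our annotated data; the rest is reused as-is.
```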

Today, most successful pretrained models for AI-assisted image recognition are based on Convolutional Neural Networks (CNNs). Some others may also be based on the following:

  • Recurrent Neural Networks (RNNs) are designed to process image data sequentially and capture long-term dependencies and relationships, which is particularly useful for things like video frames.
  • Support Vector Machines (SVMs) are linear models that work by finding the “hyperplane” (basically a decision boundary) in a high-dimensional space in order to separate different object classes within images.
  • Artificial Neural Networks (ANNs) are modeled on the human brain – they use interconnected processing nodes to learn patterns in images.
  • Deep Belief Networks (DBNs) are a type of deep neural network composed of multiple layers of “hidden units” that are trained to learn hierarchical representations used in image processing.
  • K-Nearest Neighbors (KNNs) use a non-parametric method to find the nearest training examples to a new image and then classify that image based on its nearest “neighbors.”
  • Transfer Learning models refer to any neural network that’s trained on a large dataset for a particular task and then fine-tuned for a different task (such as another type of image recognition) using a smaller dataset.
  • Autoencoders can learn a compressed representation of image data (“encoding”) and then reconstruct this data (“decoding”), which is why they’re sometimes used for “denoising” (i.e., cleaning) during image recognition.

CNNs: deep learning image recognition

By far the most popular architecture for pretrained image recognition models is the Convolutional Neural Network (CNN). These networks are called convolutional because they use the mathematical operation known as “convolution” to learn specific patterns and features in the images they encounter.

Like all deep learning networks, CNNs are composed of multiple layers of interconnected “neurons” that transform all incoming data through computations. The “convolutional” layers use special filters that find important features in an image, such as corners and edges. The “pooling” layers downsample the image data, which makes the model more resilient to changes such as variations in the orientation of objects within the image. Eventually, all of the extracted features are put together to classify the image. One of the key strengths of CNNs is their ability to recognize increasingly complex patterns as data travels through the network’s layers, as well as their ability to recognize visual objects irrespective of their position.
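
To make the convolutional and pooling layers less abstract, here is a minimal CNN written in PyTorch; the layer sizes are arbitrary and chosen purely for illustration.

```python
# A tiny CNN illustrating convolutional and pooling layers (illustrative only).
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # convolution: edges, corners
            nn.ReLU(),
            nn.MaxPool2d(2),                              # pooling: downsample feature maps
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # deeper layer: more complex patterns
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)  # assumes 224x224 input

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)          # put the extracted features together
        return self.classifier(x)        # class scores for the image

logits = TinyCNN()(torch.randn(1, 3, 224, 224))   # one random 224x224 RGB "image"
print(logits.shape)                               # torch.Size([1, 10])
```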

One critical aspect of this type of neural network is the presence of “weights.” A weight is essentially the degree of influence or importance that a particular connection between neurons (i.e., a “synapse”) has in the network. During training, the CNN identifies appropriate weights for particular features in the image through a constant feedback loop between the input and the output; this continues until the network’s predictions line up with the expected results. The output of each layer is passed on to the subsequent layer of the neural network and ultimately used to make a prediction – “we are looking at a pink elephant.”

Popular CNN-based pretrained models on the image recognition market include the following (a short usage sketch follows the list):

  • Faster R-CNN (Region-based CNN) is a two-stage pretrained model that uses a CNN to generate candidate object regions, which are then passed through a separate network to classify those regions and refine bounding boxes. It’s known for its accuracy, but it can take a long time to retrain.
  • You Only Look Once (YOLO) is a one-stage model that utilizes a CNN to predict class labels and bounding boxes of objects in an image. It is known for its fast inference time (i.e., quick delivery) and low memory usage, but it also has lower accuracy compared to Faster R-CNN.
  • Single Shot MultiBox Detector (SSD) is a one-stage model that uses a single CNN to predict bounding boxes and class labels of objects in an image. It’s generally known for its good balance between accuracy and performance speed.
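
As a brief usage sketch, here is roughly how a pretrained Faster R-CNN from torchvision can be loaded and run on one image; the random tensor stands in for a real, normalized photo, and the label indices follow the COCO classes the model was trained on.

```python
# Running a pretrained Faster R-CNN detector from torchvision (sketch).
import torch
from torchvision import models

model = models.detection.fasterrcnn_resnet50_fpn(
    weights=models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
)
model.eval()

image = torch.rand(3, 480, 640)           # stand-in for a real RGB image scaled to [0, 1]
with torch.no_grad():
    predictions = model([image])[0]       # dict with boxes, labels, and confidence scores

keep = predictions["scores"] > 0.8        # keep only confident detections
print(predictions["boxes"][keep], predictions["labels"][keep])
```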

In addition, there are also some cloud-based options to perform image recognition that allow users to leverage the power of CNN-based models. The most well-known among them are:

  • Amazon Rekognition
  • Cloud Vision API (Google)
  • Azure Custom Vision Service (Microsoft)

It’s important to remember that these three are not standalone image models; instead, they provide a platform for using trained image recognition models as a service. Those who decide to go for this option will still need to provide these cloud-based services with annotated data.

On the plus side, the whole process is easier in this scenario since these are what’s known as “turnkey” solutions – they basically do most things, including training, for you. However, the downside is that these solutions provide fewer degrees of freedom, meaning that customization options and fine-tuning are limited. As a result, these may or may not work well depending on the particulars of a given image recognition application.
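
For a sense of what the “turnkey” route looks like, here is a hedged sketch of calling Google’s Cloud Vision API from Python; it assumes the google-cloud-vision client library is installed and credentials are already configured, and the file name is a placeholder.

```python
# Requesting labels for one image from a cloud-based recognition service (sketch).
from google.cloud import vision

client = vision.ImageAnnotatorClient()

with open("photo.jpg", "rb") as f:
    content = f.read()

response = client.label_detection(image=vision.Image(content=content))
for label in response.label_annotations:
    print(label.description, label.score)   # e.g., "Bicycle 0.97"
```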

Model evaluation, deployment, and monitoring for image recognition

Now that we have retrained our CNN-based foundation model on annotated data to meet the requirements of a specific image recognition task, we need to make sure that our AI solution actually works. Model evaluation, deployment, and monitoring are three distinct stages, but we’re going to combine them into one thread here for the purposes of simplicity. The reason we can do this is because, though each stage is unique in its own right, all three boil down to making sure that our AI-assisted image recognition solution is doing exactly what it’s supposed to do – immediately before, during, and after its release.

It is during these stages that Toloka’s crowd contributors come back into the picture. In addition to the previously covered stages of data collection and data labeling, human annotators play a huge role in gauging performance of AI-assisted image recognition solutions. This has to do with the fact that during model evaluation, deployment, and monitoring, AI solutions always face new, previously unseen data. As a result, additional data labeling is usually required in order to see how well the model’s doing by comparing its output to ground truth (i.e., what we know to be true).

In the explanations below, we’ll be using the terms “evaluation” and “monitoring” more broadly and interchangeably. From the perspective of data annotation, they entail similar actions on the part of data labelers, irrespective of when exactly after fine-tuning this “testing” takes place. There are two ways that crowd contributors can assist with gauging ML model performance.

Evaluation/monitoring

This is a one-stage process that involves rating the model’s predictions directly. Going back to one of the previous examples, let’s say that our AI-assisted image recognition solution for airport security had to sort through incoming images and identify any potential weapons or unlawful behavior.

In this case, crowd contributors may be given images with accompanying captions provided by the AI solution and asked to rate the model’s responses in every image, for instance: “There’s a weapon in the image [says the model]: MATCH / MISS” or “No threatening poses or unlawful behavior can be seen in the image [says the model]: MATCH / MISS.” These results allow machine learning engineers working on the project to see how accurate their image recognition solution is by counting the number of matches provided by the retrained model.
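
Scoring the model then becomes simple arithmetic, as in the hypothetical sketch below: count the MATCH verdicts and divide by the total number of evaluated images.

```python
# Turning crowd verdicts into an accuracy estimate (hypothetical data).
crowd_verdicts = {
    "frame_0001.jpg": "MATCH",
    "frame_0002.jpg": "MATCH",
    "frame_0003.jpg": "MISS",
    "frame_0004.jpg": "MATCH",
}

matches = sum(1 for verdict in crowd_verdicts.values() if verdict == "MATCH")
accuracy = matches / len(crowd_verdicts)
print(f"Model accuracy on evaluated images: {accuracy:.0%}")   # 75%
```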

Evaluation/monitoring with future fine-tuning

This is a two-stage process that involves getting crowd contributors to produce ground truth in the form of golden sets (i.e., “part a”). Annotators are given the same images that the model faced after retraining, and they answer questions like: “Is there a weapon anywhere in the image? YES/NO” or “Is anyone in the image displaying a threatening pose or attempting to harm anyone? YES/NO”

After this, crowd contributors need to compare their responses to the model’s responses (i.e., “part b”). This part of the evaluation could take a number of forms. One of the most common is the aforementioned side-by-side task that involves comparing a series of images displayed in pairs (the model vs. ground truth: “Are the two captions/responses the same or different?”).

This could be done by the same group of contributors that tackled “part a” of the evaluation process (i.e., producing golden sets) or a different group of annotators. As a result, just like in the one-stage process, ML engineers working on the project will again know how accurate their model is, but this time, they will also have golden sets at their disposal that can be used to further fine-tune the model if necessary.

Applications of image recognition technology

That was our complete ML life cycle. As for real-world use, fine-tuned and evaluated models for image recognition have a number of applications that offer useful solutions across multiple domains. They include:

Security and surveillance

As per our example seen throughout the article, security and surveillance is a domain where AI-assisted image recognition has started to play a major role. When it comes to security, such as airport security, image recognition technology is being used to process surveillance footage. This tends to boost both the accuracy and the speed of identifying suspicious activities and objects. Security personnel can be quickly alerted to potential threats. Other forms of surveillance include finding missing persons with image-recognition-trained drones (UAS).

Image recognition technology is also being used to grant user access to devices and platforms, the best example being your iPhone’s facial recognition (yes, it’s machine learning, in case you didn’t know). Amazon has recently started to use this technology to verify the identities of its online vendors.

Healthcare

As we mentioned earlier, this is one of the most promising domains for ML-backed image recognition, which can help improve patient diagnosis and/or treatment by assisting doctors with analyzing and interpreting medical images such as MRIs, MEGs, CT scans, and X-rays. The goal is to detect any abnormalities or irregularities in these images more accurately and more efficiently, and also to monitor patient progress.

Self-driving vehicles

One of the most industry-disrupting applications of image recognition technology is self-driving vehicles, which we also mentioned earlier. Image recognition allows autonomous cars to “see” and understand their environment. Incoming imagery from the vehicle’s onboard cameras is processed and used for safe navigation – to identify other vehicles, pedestrians, traffic lights, road signs, and potential obstacles. Among companies that have started to use this technology in their automobiles are Tesla, Waymo (Google), Cruise (General Motors), and Yandex.

E-marketplaces and online communities

AI-assisted image recognition technology is being used in e-commerce to help shoppers find relevant products (think of our example about clothing). If a user doesn’t know the name of a particular product or its exact model, but they have a picture of it, they can easily conduct a search. This is basically like Shazam (which also uses ML, of course), but for imagery instead of audio data. One of the first companies to use it was Google with its image search feature, and this technology has since been adopted by other companies like eBay. The very same AI-assisted technology also helps human moderators detect and remove graphic or unsuitable images from web platforms and online communities (i.e., content moderation).

Manufacturing

AI-assisted image recognition technology is also being used in manufacturing to bolster quality control and increase production efficiency. Rather than having live personnel perform quality control on the line – a practice that’s both tedious and prone to human error – ML-backed image recognition solutions can analyze images of finished products to identify any defects or deviations and quickly alert shift managers to potential issues. The same image recognition technology can also be implemented to monitor manufacturing processes from start to finish in order to identify streamlining opportunities.

Agriculture

AI-assisted image recognition technology has also begun to play an important role in agriculture. By looking at the images of field crops, AI solutions can quickly identify areas of concern such as pests, diseases and fungi, or nutrient deficiencies. In addition, this technology can help optimize expenditures by helping businesses rework irrigation schedules and reduce water usage. Likewise, image recognition can be used to monitor the well-being of livestock, for instance, detecting when farm animals are in heat.

Concluding remarks

As we’ve seen, ML-backed image recognition is already assisting multiple industries and business domains. At the core of this technology are pretrained image recognition models like SSD and YOLO that are based on the Convolutional Neural Network (CNN) architecture. Another big part of image recognition is having the right data, which has to be collected, annotated, and subsequently fed into these models in order to retrain and fine-tune them for specific downstream applications.

Different data collection approaches exist. Among them is feet-on-street crowdsourcing offered by Toloka. Data labeling for image recognition solutions can also be carried out in various ways, with crowd-assisted data annotation for computer vision being one of the most affordable and time-effective methods. Since new data must always be used after model fine-tuning, data labelers – including those from Toloka – also play a crucial role in the final stages of the ML life cycle, during which model performance is repeatedly tested.

We recommend that you do more research on the topic and get in touch with us if you require any assistance with data collection, data labeling, or model evaluation for your specific AI-assisted image recognition solution. We’d also be happy to talk to you if you’re considering integrating ML-backed image recognition into your existing business to improve efficiency and sales or cut costs.

Article written by:
Natalie Kudan