
Toloka Team

Jul 6, 2024

Essential ML Guide

Transformer Architecture: Redefining Machine Learning Across NLP and Beyond

Transformer models represent a notable shift in machine learning, particularly in natural language processing (NLP) and computer vision. The transformer neural network architecture introduced a novel approach to capturing dependencies across input sequences. This innovation enables models to process data in parallel, significantly enhancing computational efficiency.

The core of the transformer model is the self-attention mechanism, which allows it to weigh the importance of different elements of the input sequence. As a result, transformers demonstrate a more nuanced and context-aware understanding of data than their predecessors, such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs).


Timeline for the development of transformer-based large language models. Source: Entropy (2023)


In NLP, large language models (LLMs) like BERT and GPT leverage self-attention to capture long-range dependencies and context, leading to superior performance in machine translation, sentiment analysis, and text generation. Vision Transformers (ViTs) apply self-attention to learn patterns within an image, advancing image classification and object detection.

The scalability of transformers enables the creation of large-scale pre-trained models that can later be fine-tuned for diverse applications across various domains.


Vision Transformer can offer a better localization of the target lesion than convolutional neural network (CNN)-based models. Source: Nature


This article reviews the impact of transformer neural networks, their core components, and their diverse applications.

What Is Transformer Architecture?

A transformer is a deep neural network that analyzes sequential data, tracks its context, and generates corresponding new data. Transformers, trained on large amounts of unlabeled data in a self-supervised manner, can extract patterns from raw text to understand and produce human-like text.

The model was originally described in the paper Attention Is All You Need, published in 2017 by eight researchers from Google. Their innovative idea aimed to improve existing approaches to machine translation by having a neural network understand entire sentences at once, instead of the traditional method of reading and translating them word by word in sequence.


In May 2024, Steven Levy published an article in WIRED about the legendary “Attention” paper, its eight authors, and the birth of the original transformer model.

Before the advent of transformers, sequence modeling tasks were predominantly handled by recurrent neural networks, which were effective but suffered from poor parallelization and vanishing gradient problems. Convolutional neural networks were also adapted for NLP tasks, but since they were originally designed for image processing, they were not naturally suited to sequential data.

Transformer models retain the encoder-decoder framework, but unlike traditional RNNs and CNNs, they do not use recurrence or convolutions to process sequential data. Instead, transformers rely on a mathematical method known as self-attention to understand the relationships between different elements and thus learn the data's meaning and context.


An illustration of the main components of the transformer model from the original paper. Source: NIPS'17


When a transformer interprets a word in a text, it evaluates all other words in the same sentence. The attention mechanism allows the model to identify words and entire fragments that directly affect the meaning of a particular word, which yields a much more precise understanding of context.

The impact of the newly emerged approach was first notable in machine translation, but transformers were quickly adapted for various NLP tasks. Since 2017, transformer neural networks have become foundational for many of the most remarkable AI projects, standing behind ChatGPT, BERT, and the Google Gemini series.


In December 2023, two of the transformer architecture co-inventors, Ashish Vaswani and Niki Parmar, raised $56 million for their startup Essential AI. They aim to make it possible to carry out analytics using natural language commands. Photo credit: Andrew Gessler on behalf of March Capital. Source: Financial Post


General pretrained transformers can undergo a process known as transfer learning: a version fine-tuned for a particular task, often with the help of human annotators, performs far more effectively on that task.
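To make the idea concrete, here is a minimal fine-tuning sketch using the Hugging Face Transformers library; the base checkpoint, dataset, and hyperparameters are illustrative assumptions, not a recipe from this article.

```python
# A minimal transfer-learning sketch: start from a pretrained checkpoint and
# fine-tune it on a labeled text-classification dataset.
# "bert-base-uncased", the IMDB dataset, and all hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)          # pretrained encoder + new classification head

dataset = load_dataset("imdb")                  # any labeled text dataset works here

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=16),
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
)
trainer.train()                                  # only this fine-tuning step runs on your own data
```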

How Does Transformer Architecture Work?

Key Elements

Here are the key components of the transformer neural network architecture.

Encoder-Decoder

The transformer architecture consists of an encoder and a decoder, each composed of multiple identical layers. The encoder processes the input, creating its matrix representation, while the decoder iteratively generates the output sequence.

All encoders share an identical structure, processing the input sequentially and passing it along to the next layer in the chain. Similarly, all decoders follow the same structure, receiving input from the final encoder and the preceding decoder. The number of these layers can be scaled depending on the project's requirements.


Global structure of Encoder-Decoder. Image by Josep Ferrer. Source: DataCamp
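As a rough sketch of this layout, the snippet below stacks identical encoder and decoder layers with PyTorch's built-in modules; the dimensions and layer counts are illustrative, not those of any particular production model.

```python
import torch
import torch.nn as nn

d_model, nhead = 512, 8
enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead)
encoder = nn.TransformerEncoder(enc_layer, num_layers=6)   # N identical encoder layers
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)   # N identical decoder layers

src = torch.rand(10, 32, d_model)    # (source length, batch, d_model)
tgt = torch.rand(20, 32, d_model)    # (target length, batch, d_model)
memory = encoder(src)                # matrix representation of the input
output = decoder(tgt, memory)        # decoder attends to the encoder output
```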


Multi-head Attention Mechanism

Both the encoder and decoder layers contain self-attention mechanisms to estimate the importance of different words in a sentence relative to each other, regardless of their positions. This mechanism enables transformers to capture long-range dependencies effectively.


An illustration from the original paper Attention Is All You Need. Source: NIPS'17


Multi-head attention enhances the model's ability to focus on different parts of the input sequence simultaneously. Using multiple attention heads, a transformer can capture various contextual relationships, improving its overall understanding of the sequence.
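A minimal sketch of multi-head self-attention with PyTorch's nn.MultiheadAttention is shown below; the embedding size, number of heads, and input shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=512, num_heads=8)

x = torch.rand(10, 32, 512)               # (sequence length, batch, embedding dim)
attn_output, attn_weights = mha(x, x, x)  # query, key, and value all come from x (self-attention)
print(attn_output.shape)                  # torch.Size([10, 32, 512])
print(attn_weights.shape)                 # torch.Size([32, 10, 10]) — weights averaged over heads
```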

Feed Forward Neural Networks

Feed-forward neural networks add non-linearity and complexity to the representations created by the encoder. Each position's representation is transformed through a sequence of fully connected layers, enabling the model to learn complex relationship patterns.

Each encoder and decoder layer includes a position-wise feed-forward network, which applies two linear transformations with a ReLU activation in between. This network transforms the attention outputs into a more helpful representation for subsequent layers.
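A sketch of that position-wise feed-forward block appears below; the sizes follow the original paper (d_model = 512, inner dimension 2048), but the class itself is only an illustration.

```python
import torch.nn as nn

class PositionwiseFeedForward(nn.Module):
    """Two linear transformations with a ReLU in between, applied at every position."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):     # x: (seq_len, batch, d_model)
        return self.net(x)    # the same weights are applied independently at each position
```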

Positional Encodings

Positional encodings are added to provide information about the position of each word in the sequence. They ensure the model can differentiate between sequences where word order matters.

By incorporating positional information, the transformer can understand the relationships between words and maintain context, which is essential for text generation or translation tasks.


A diagram of a sinusoidal positional encoding with parameters 𝑁 = 10000, 𝑑 = 100. Source: Cosmia Nebula
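A small sketch of the sinusoidal encoding scheme from the original paper is given below; the sequence length and dimension match the diagram above but are otherwise illustrative, and the result is simply added to the token embeddings.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model, n=10000):
    # PE(pos, 2i)   = sin(pos / n^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / n^(2i/d_model))
    pe = np.zeros((seq_len, d_model))
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    div = n ** (np.arange(0, d_model, 2) / d_model)    # one frequency per pair of dimensions
    pe[:, 0::2] = np.sin(positions / div)
    pe[:, 1::2] = np.cos(positions / div)
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=100)   # added to the token embeddings
```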


Residual Connections and Layer Normalization

Residual (skip) connections add each sub-layer's input to its output, so information from earlier layers is not lost as data passes through the stack. They also help mitigate the vanishing gradient problem common in deep neural networks.


Residual learning frameworks were known long before transformers. Source: Deep Residual Learning for Image Recognition


Layer normalization standardizes the inputs to each sub-layer, keeping activations on a stable scale. This technique significantly enhances the model's ability to learn complex patterns and relationships in the data.
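The resulting "Add & Norm" step can be sketched in a few lines; the post-norm ordering follows the original paper, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

d_model = 512
attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=8)
norm = nn.LayerNorm(d_model)

x = torch.rand(10, 32, d_model)       # sub-layer input
attn_out, _ = attention(x, x, x)      # sub-layer output
x = norm(x + attn_out)                # residual (skip) connection, then layer normalization
```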

Cross-Attention Mechanism

The cross-attention layer is a special feature of the decoder, enabling the model to refer to different parts of the input sequence while generating the output.


Cross-attention in the transformer decoder, as shown in the Attention is All You Need paper. Source: Vaclav Kosar’s blog


This mechanism allows the model to consider the relevant context while generating a particular word. Such interaction between the encoder and decoder helps produce coherent text.
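A minimal cross-attention sketch is shown below: queries come from the decoder's own states, while keys and values come from the encoder output. All shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8)

decoder_states = torch.rand(20, 32, 512)   # (target length, batch, d_model)
encoder_memory = torch.rand(10, 32, 512)   # (source length, batch, d_model)

out, weights = cross_attn(query=decoder_states,
                          key=encoder_memory,
                          value=encoder_memory)
# weights: (batch, target length, source length) — each generated position
# attends over the positions of the input sequence
```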

Softmax Layer

The softmax layer is crucial for converting raw output scores into probabilities. In the context of language models, it allows the model to predict the likelihood of each possible next token in a sequence, choosing the highest probability one as the following word in the generated text.
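In code, this step is a single softmax over the raw scores; the toy five-token vocabulary below is purely illustrative.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.1, 0.3, -1.0, 4.2, 0.7])   # raw scores over a toy vocabulary
probs = F.softmax(logits, dim=-1)                    # probabilities that sum to 1
next_token_id = torch.argmax(probs)                  # greedy choice: the highest-probability token
```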

Transformer Model Workflow

Tokenization and Embedding

The first step in the transformer's workflow is splitting the input text into smaller units called tokens. These can be words, punctuation marks, prefixes, suffixes, characters, and so on. Tokenization simplifies the original data by mapping each element to a known token from the model's vocabulary.

Then, the model converts each token into a dense vector representation. This embedding maps each token to a high-dimensional space, capturing semantic information and relationships between tokens.

In the case of transformers, these embeddings are supplemented by the positional encodings mentioned earlier. The information about the order of tokens in the sequence allows the model to turn that sequence, e.g., a piece of text, into a meaningful vector that differs from a simple sum of the vectors representing each individual token.


Positional encoding keeps track of the positions of the words in a sentence. Otherwise, the model would see each piece of text as a random collection of words and signs far from the original context. Source: Transformer Architecture explained
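A toy sketch of tokenization and embedding is shown below; the vocabulary, sentence, and embedding size are illustrative (real models use learned subword vocabularies).

```python
import torch
import torch.nn as nn

vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, ".": 4}
sentence = "the cat sat .".split()
token_ids = torch.tensor([[vocab.get(w, 0) for w in sentence]])   # tokenization: (batch=1, seq_len=4)

embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=16)
token_vectors = embedding(token_ids)      # embedding: (1, 4, 16) dense vectors
# positional encodings (see above) are then added to token_vectors before the encoder
```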


Encoding

Once the words are converted into tokens and then vectors, the embedded tokens enter the encoder, where each token, for example a word, follows its own processing path.


Each encoder layer consists of a multi-head attention mechanism and a position-wise feed-forward neural network. Source: Jay Alammar on GitHub


The self-attention layer allows the encoder to consider all words in an input sentence while encoding a particular word, making each word's path dependent on the others. In contrast, the feed-forward layer operates independently for each word, allowing parallel processing.

Each token's representation is processed through fully connected layers with ReLU activations, enabling the model to learn intricate patterns.

Self-Attention

Self-attention calculates the importance of each token in relation to others in the sequence. The mechanism generates three vectors for each token: Query, Key, and Value. The dot product of the Query and Key vectors determines the attention scores, which are then normalized using the softmax function to obtain the attention weights.

These weights are used to compute a weighted sum of the Value vectors, producing the self-attention output. This process allows the model to capture dependencies regardless of the tokens' positions in the sequence.


As we encode the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism focuses on "The Animal" and bakes a part of its representation into the encoding of "it." Source: Jay Alammar on GitHub
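A single-head sketch of the Query/Key/Value computation described above is given below; the projection matrices are random placeholders, and the scaling by the square root of the key dimension follows the original paper.

```python
import torch
import torch.nn.functional as F

d_model, d_k = 512, 64
x = torch.rand(10, d_model)              # 10 token representations (batch omitted for clarity)

W_q = torch.rand(d_model, d_k)           # learned projections in a real model
W_k = torch.rand(d_model, d_k)
W_v = torch.rand(d_model, d_k)

Q, K, V = x @ W_q, x @ W_k, x @ W_v      # Query, Key, and Value vectors per token
scores = Q @ K.T / (d_k ** 0.5)          # scaled dot products: (10, 10)
weights = F.softmax(scores, dim=-1)      # attention weights, one row per token
output = weights @ V                     # weighted sum of the Value vectors
```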


Decoding

The decoder also consists of multiple identical layers with three main components:

  • A masked multi-head self-attention mechanism

  • A multi-head cross-attention mechanism

  • A position-wise feed-forward neural network

The masked self-attention ensures that the decoder predicts the next word using only the words that precede it, preventing it from looking ahead. Thanks to the look-ahead mask, the model can process the entire target sequence in a single pass during training while still learning to predict words one at a time, which significantly speeds up training.

The cross-attention mechanism enables the decoder to focus on relevant parts of the encoder's output, while the position-wise feed-forward network processes each token independently through fully connected layers. Together, they enhance the generation of coherent and contextually accurate text.
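The look-ahead mask itself is easy to visualize: in the sketch below it is an upper-triangular matrix of -inf values that blocks attention to future positions. The sizes are illustrative.

```python
import torch
import torch.nn as nn

seq_len = 5
# -inf above the diagonal: position i may attend only to positions 0..i
causal_mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
print(causal_mask)

masked_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8)
tgt = torch.rand(seq_len, 32, 512)
out, _ = masked_attn(tgt, tgt, tgt, attn_mask=causal_mask)   # masked self-attention
```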

Output Generation

The vectors produced by the decoder stack still need to be converted into words or other symbols a human can understand. The linear layer, followed by the softmax layer, does this.

The linear layer is a fully connected neural network that maps the representation produced by the stack of decoders into a vocabulary space. It generates a logit vector with thousands of elements, each corresponding to the score of a unique word in the model's vocabulary, so the numerical representation can be mapped back to words.

The softmax layer then converts the scores computed by the linear layer into probabilities, with the token associated with the highest probability chosen as the next word or symbol in a sentence.


The softmax layer chooses the first word for a story. Source: Transformer Architecture Explained
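Put together, the output head is a linear projection to vocabulary size followed by a softmax; the vocabulary size below is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 32000
output_head = nn.Linear(d_model, vocab_size)      # the "linear layer"

decoder_output = torch.rand(1, d_model)           # representation of the last decoded position
logits = output_head(decoder_output)              # one score per vocabulary entry
probs = F.softmax(logits, dim=-1)                 # the "softmax layer"
next_token_id = torch.argmax(probs, dim=-1)       # highest-probability token becomes the next word
```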


Summary

So, the usual workflow for a transformer model looks like this:

  1. Words are turned into tokens (tokenization).

  2. Tokenized words are turned into numbers (embeddings).

  3. The order in which they appear is taken into account (positional encoding).

  4. Every token gets a vector that is input into the model.

  5. The encoder processes these vectors using self-attention mechanisms and feed-forward neural networks to create contextual representations.

  6. The decoder takes these contextual representations, applies cross-attention to focus on relevant parts of the encoder's output, and generates the output sequence step-by-step.

  7. The linear layer maps the decoder's output to a logit vector, and the softmax layer converts this into probabilities to predict the next word or symbol in the sequence.


The diagram depicts the primary components of the model as presented in the original Attention Is All You Need paper with additional comments. Source: Cross Validated
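For completeness, the steps above can be strung together with PyTorch's built-in nn.Transformer; the vocabulary size, dimensions, and toy inputs are illustrative assumptions, and the positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 512
embed = nn.Embedding(vocab_size, d_model)
model = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)
output_head = nn.Linear(d_model, vocab_size)

src_ids = torch.randint(0, vocab_size, (10, 1))   # steps 1-2: source token ids (length, batch)
tgt_ids = torch.randint(0, vocab_size, (7, 1))    #            target tokens generated so far
src, tgt = embed(src_ids), embed(tgt_ids)         # step 4: one vector per token

# steps 5-6: encoder builds contextual representations, masked decoder attends to them
tgt_len = tgt_ids.size(0)
tgt_mask = torch.triu(torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)
hidden = model(src, tgt, tgt_mask=tgt_mask)

# step 7: linear layer + softmax turn the last position into next-token probabilities
probs = torch.softmax(output_head(hidden[-1]), dim=-1)
```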


Transformer Models Use Cases

Initially, the transformer's inventors aimed to address the challenges of sequence-to-sequence tasks, particularly translation. They wanted a model that could handle long-range dependencies more effectively and process data in parallel, overcoming the limitations of recurrent neural networks (RNNs) and Long Short-Term Memory (LSTM) networks.


The paper published in August 2017, a few months after the ‘Attention’ article, demonstrated the transformer outperforming RNNs and CNNs on translation benchmarks. Source: Google Research


However, this revolutionary approach laid the groundwork for many practical applications across multiple domains.

Early Applications and Expansion

The initial success of transformers in neural machine translation quickly led to their adoption in other natural language processing (NLP) tasks. Early applications included text summarization, sentiment analysis, and question-answering systems. The introduction of the Tensor2Tensor library in 2017 provided efficient implementations of the transformer model, facilitating its further adoption.

In 2018, Google researchers released BERT (Bidirectional Encoder Representations from Transformers), which achieved groundbreaking results in question answering and language inference. GPT (Generative Pre-trained Transformer), released in 2018 by OpenAI, showcased transformers' text generation and language modeling capabilities, getting further with each update.

Before 2020, other notable examples of transformers' practical use included projects in translation, text summarization, and other complex language tasks. For instance, XLNet, released in 2019 by Google AI, addressed limitations related to the unidirectional context in BERT.

Broadening Horizons: Different Use Cases

Transformers have since expanded beyond NLP into various fields, demonstrating versatility and robustness. Here are just a few areas where transformers are making a significant impact:

Healthcare


Major healthcare data source modalities and corresponding tasks. Source: Transformers in Healthcare: A Survey


Medical Record Analysis: Transformers can extract valuable insights from electronic health records, aiding in patient diagnosis and treatment plans. For instance, BioBERT is a pre-trained model designed explicitly for biomedical text mining.


Overview of the pre-training and fine-tuning of BioBERT. Source: Bioinformatics, 2020


Drug Discovery: Transformers have shown significant promise in drug discovery by leveraging their ability to handle sequential data efficiently. For example, the UK-based pharmaceutical company Exscientia works on a transformer neural network for automating retrosynthesis and guiding the synthesis of new drug molecules, while Insilico Medicine's Chemistry42 platform integrates transformers with generative approaches to design novel compounds.

AstraZeneca's MolBART model, trained on a large chemical compound database using NVIDIA's Megatron framework, aims to understand molecular relationships like language models understand word relationships. Academic researchers are also exploring transformer architectures for predicting different drug interactions, cancer drug sensitivity, and protein-ligand affinity.


The structure of DeepTTA, a transformer-based prediction model for anti-cancer drug responses. Source: Briefings in Bioinformatics, 2022


Genomic Data Analysis: Transformers are increasingly utilized in genomic data analysis to identify disease markers and potential genetic disorders. Thanks to their advanced attention mechanisms, they excel in producing sequences, classifying data, and performing quantitative assays.

These models can be applied as standalone systems or following initial compression layers. Standalone transformers handle tasks without additional context, while the latter approach compresses data to manage computational costs. A notable example is DNABERT, a modified BERT model tailored for DNA sequence analysis.

DNABERT pre-trains on specific genomic tasks, achieving state-of-the-art results in predicting promoter regions and transcription-factor binding sites.

Finance

Forecasting and Algorithmic Trading: Transformers also revolutionize algorithmic trading by analyzing large volumes of financial data to predict market trends and inform trading strategies. FinBERT, a financial sentiment analysis model, helps understand market sentiments from news articles and social media.

Another innovative transformer model is the HFformer, designed for high-frequency Bitcoin-USDT log-return forecasting. HFformer explores various high-frequency trading strategies, including trade sizing, trading signal aggregation, and minimal trading thresholds, outperforming traditional Long Short-Term Memory (LSTM) models.

According to the research published in China Finance Review International, second-generation transformer models (Informer, Autoformer, and PatchTST) prove to be highly effective in financial forecasting. This is especially remarkable in cases with limited historical data and high market volatility.

Fraud Detection: Transformers can be highly effective in fraud detection, providing a nuanced understanding of transactional behavior through contextual analysis.

In December 2023, a group of researchers from WeChat Pay suggested an innovative autoregressive model leveraging GPT architectures tailored for identifying fraudulent activities in payment systems. By reconstructing behavioral sequences and utilizing unsupervised pretraining, this model excels in analyzing users’ transactions without the need for labeled data.

Manufacturing

Predictive Maintenance: Transformers analyze sensor data from machinery to predict failures and schedule timely maintenance, reducing downtime and costs. Models like BERT are adapted to analyze machine logs and operational data.

In 2022, a group of researchers from Brazil suggested the T4PdM transformer-based model to identify various types of faults in rotating machinery. Their experimental results proved that the model significantly enhances diagnostic processes.


Comparison of the T4PdM and some other models' performance in the experiments for the MaFaulDa vibration dataset.


Popular Models

The original transformer architecture has evolved into three different variations based on specific needs.

Encoder Pretraining

These models focus on understanding complete textual fragments and excel in text classification and question answering.

BERT: Bidirectional Encoder Representations from Transformers is a language model introduced in October 2018 that has been used to improve Google search queries since 2019. BERT has numerous variations, including ALBERT, RoBERTa, ELECTRA, DistilBERT, SpanBERT, and TinyBERT.

ERNIE: ERNIE is a series of models by Baidu that show high efficiency, especially with tasks in Chinese.

AlphaFold: AlphaFold software is a deep learning system developed by DeepMind, a subsidiary of Alphabet, for protein structure predictions.

Decoder Pretraining

These transformers, also known as auto-regressive language models, specialize in text generation.

GPT: Generative pre-trained transformers are a type of large language model by OpenAI and the most famous generative AI framework.

Alpaca 7B: Alpaca is an advanced natural language processing model from Stanford University, showing efficiency similar to that of ChatGPT in generating coherent conversations.


Stanford researchers said their data generation process cost less than $500 using the OpenAI API. Source: Stanford University


Minerva: Minerva is a model created by Google for solving quantitative problems and mathematical reasoning.

Encoder-Decoder Pretraining

These combined variations are perfect for handling translation or summarization.

BART: Bidirectional and Auto-Regressive Transformer by Facebook, often called a generalization of BERT, GPT, and several other pretraining approaches.

HTLM: Hyper-text language model by Facebook AI intended for structured HTML prompting.

DQ-BART: Sequence-to-sequence model by Amazon for language generation tasks.


Transformers family tree. Source: Transformer models: an introduction and catalog


Most businesses prefer fine-tuning pre-trained models to building their own from scratch. This cost-effective approach leverages extensive training already done on vast datasets. However, some companies may develop custom transformer neural networks, especially those with unique data requirements or highly specialized tasks.

Final Thoughts

The practical advantages of transformer networks have extended their impact far beyond their initial NLP application. As transformers evolve, their applications across healthcare, finance, manufacturing, agriculture, transportation, energy, and other domains will only expand, driving further efficiency gains.

Businesses evaluating these models should consider their performance, scalability, and adaptability to specific tasks. They often opt for fine-tuning pre-trained models to suit their needs.


Transformer models have grown dramatically in size. Source: NVIDIA Blog


Despite their many strengths, transformers can be computationally intensive and require substantial resources for training and deployment.

However, with ongoing research and development, the challenges associated with transformers will likely be addressed, making them even more integral to future advancements in artificial intelligence and machine learning.

Article written by:

Toloka Team

Updated:

Jul 6, 2024
