Toloka Team
Transformer Architecture: Redefining Machine Learning Across NLP and Beyond
Transformer models represent a notable shift in machine learning, particularly in natural language processing (NLP) and computer vision. The transformer neural network architecture introduced a novel approach to capturing dependencies across input sequences. This innovation enables models to process data in parallel, significantly enhancing computational efficiency.
The core of the transformer model is the self-attention mechanism, which allows it to weigh the importance of different elements of the input sequence. As a result, transformers demonstrate a more nuanced and context-aware understanding of data than their predecessors, such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs).
Timeline for the development of transformer-based large language models. Source: Entropy (2023)
In NLP, large language models (LLMs) like BERT and GPT leverage self-attention to capture long-range dependencies and context, leading to superior performance in machine translation, sentiment analysis, and text generation. Vision Transformers (ViTs) apply self-attention to learn patterns within an image, advancing image classification and object detection.
The scalability of transformers enables the creation of large-scale pre-trained models that can be later fine-tuned for diverse applications across various domains.
Vision Transformer can offer a better localization of the target lesion than convolutional neural network (CNN)-based models. Source: Nature
This article reviews the impact of transformer neural networks, their core components, and their diverse applications.
What Is Transformer Architecture?
A transformer is a deep neural network that analyzes sequential data, tracks its context, and generates corresponding new data. Transformers, trained on large amounts of unlabeled data in a self-supervised manner, can extract patterns in raw text to understand and produce human-like text.
The model was originally described in the paper Attention Is All You Need, published in the spring of 2017 by eight researchers from Google. Their idea was to improve existing approaches to machine translation by having a neural network process entire sentences at once instead of the traditional method of reading and translating them word by word in sequence.
In May 2024, Steven Levy published an article in WIRED about the legendary “Attention” paper, its eight authors, and the birth of the original transformer model.
Before the advent of transformers, sequence modeling tasks were predominantly handled by recurrent neural networks, which were effective but suffered from poor parallelization and vanishing gradient problems. Convolutional neural networks were also adapted for NLP tasks, but having been designed for image processing, CNNs were not naturally suited to sequential data.
Transformer models retain the encoder-decoder framework, but unlike traditional RNNs and CNNs, they do not use recurrence or convolutions to process sequential data. Instead, transformers rely on a mathematical method known as self-attention to understand the relationships between different elements and thus learn the data's meaning and context.
An illustration of the main components of the transformer model from the original paper. Source: NIPS'17
When a transformer interprets a word in a text, it evaluates all other words in the same sentence. The attention mechanism allows the model to identify words and entire fragments that directly affect the meaning of a particular word, yielding a much more precise understanding of context.
The impact of the newly emerged approach was first notable in machine translation, but transformers were quickly adapted for various NLP tasks. Since 2017, transformer neural networks have become foundational for many of the most remarkable AI projects, powering ChatGPT, BERT, and the Google Gemini series.
In December 2023, two of the transformer architecture's co-inventors, Ashish Vaswani and Niki Parmar, raised $56 million for their startup Essential AI. They aim to make it possible to carry out analytics using natural language commands. Photo credit: Andrew Gessler on behalf of March Capital. Source: Financial Post
General pretrained transformers can undergo a process known as transfer learning: a version fine-tuned with the help of human annotators can then perform far more effectively on a particular task.
How Does Transformer Architecture Work?
Key Elements
Here are the key components of the transformer neural network architecture.
Encoder-Decoder
The transformer architecture consists of an encoder and a decoder, each composed of multiple identical layers. The encoder processes the input, creating its matrix representation, while the decoder iteratively generates the output sequence.
All encoders share an identical structure, processing the input sequentially and passing it along to the next layer in the chain. Similarly, all decoders follow the same structure, receiving input from the final encoder and the preceding decoder. The number of these layers can be scaled depending on the project's requirements.
Global structure of Encoder-Decoder. Image by Josep Ferrer. Source: DataCamp
Multi-head Attention Mechanism
Both the encoder and decoder layers contain self-attention mechanisms to estimate the importance of different words in a sentence relative to each other, regardless of their positions. This mechanism enables transformers to capture long-range dependencies effectively.
An illustration from the original paper Attention Is All You Need. Source: NIPS'17
Multi-head attention enhances the model's ability to focus on different parts of the input sequence simultaneously. Using multiple attention heads, a transformer can capture various contextual relationships, improving its overall understanding of the sequence.
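As an illustration, PyTorch ships a ready-made multi-head attention layer; a minimal self-attention call might look like this (the dimensions here are illustrative, not prescribed by any particular model):

```python
import torch
import torch.nn as nn

# Eight heads over a 512-dimensional model, as in the original paper.
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(1, 10, 512)   # (batch, seq_len, d_model)
out, weights = mha(x, x, x)   # self-attention: query = key = value = x
# `out` has shape (1, 10, 512); `weights` averages the 8 heads' attention maps.
```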
Feed Forward Neural Networks
Feed-forward neural networks add non-linearity and complexity to the representations created by the encoder. Each position's representation is transformed through a sequence of fully connected layers, enabling the model to learn complex relationship patterns.
Each encoder and decoder layer includes a position-wise feed-forward network, which applies two linear transformations with a ReLU activation in between. This network transforms the attention outputs into a more helpful representation for subsequent layers.
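A minimal sketch of such a position-wise network in PyTorch, using the dimensions from the original paper (d_model = 512, d_ff = 2048); the class name is our own:

```python
import torch
import torch.nn as nn

class PositionWiseFeedForward(nn.Module):
    """Two linear transformations with a ReLU in between,
    applied independently at every position."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # expand
        self.linear2 = nn.Linear(d_ff, d_model)   # project back
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); the same weights are applied
        # to each position, so positions are processed in parallel.
        return self.linear2(self.relu(self.linear1(x)))
```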
Positional Encodings
Positional encodings are added to provide information about the position of each word in the sequence. They ensure the model can differentiate between sequences where word order matters.
By incorporating positional information, the transformer can understand the relationships between words and maintain context, which is essential for text generation or translation tasks.
A diagram of a sinusoidal positional encoding with parameters 𝑁 = 10000, 𝑑 = 100. Source: Cosmia Nebula
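For reference, a sinusoidal encoding like the one in the diagram can be computed as follows (a NumPy sketch; the function name is ours, and d is assumed to be even):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d: int, n: float = 10000.0) -> np.ndarray:
    """Sinusoidal positional encodings as in the original paper."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d, 2)[None, :]             # (1, d/2)
    angles = positions / np.power(n, dims / d)     # (seq_len, d/2)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions use cosine
    return pe

# These encodings are added to the token embeddings
# before the sequence enters the first encoder layer.
```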
Residual Connections and Layer Normalization
Residual, or skip, connections prevent the model from losing input features as information passes through the network during training. They help mitigate the vanishing gradient problem common in other kinds of deep neural networks.
Residual learning frameworks were known long before transformers. Source: Deep Residual Learning for Image Recognition
Layer normalization standardizes the inputs to each layer. This technique stabilizes training and significantly enhances the model's ability to learn complex patterns and relationships in the data.
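Together, the two techniques form the "Add & Norm" step from the original paper's diagram; a minimal sketch in PyTorch (the function name is ours):

```python
import torch
import torch.nn as nn

def add_and_norm(x: torch.Tensor, sublayer, norm: nn.LayerNorm) -> torch.Tensor:
    """Post-norm residual pattern: LayerNorm(x + Sublayer(x)).
    The skip connection lets gradients flow past the sublayer."""
    return norm(x + sublayer(x))

# Usage around, e.g., the feed-forward sublayer sketched earlier:
# norm = nn.LayerNorm(512)
# x = add_and_norm(x, PositionWiseFeedForward(512, 2048), norm)
```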
Cross-Attention Mechanism
The cross-attention layer is a special feature of the decoder, enabling the model to refer to different parts of the input sequence while generating the output.
Cross-attention in the transformer decoder, as shown in the Attention is All You Need paper. Source: Vaclav Kosar’s blog
This mechanism allows the model to consider the relevant context while generating a particular word. Such interaction between the encoder and decoder helps produce coherent text.
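A rough sketch of cross-attention using PyTorch's built-in attention layer: the Queries come from the decoder, while the Keys and Values come from the encoder's output (all sizes here are illustrative):

```python
import torch
import torch.nn as nn

cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

decoder_states = torch.randn(1, 7, 512)    # queries: the decoder's states so far
encoder_output = torch.randn(1, 10, 512)   # keys and values: the encoded input
out, weights = cross_attn(decoder_states, encoder_output, encoder_output)
# Each of the 7 output positions is a weighted mixture of the 10 encoder positions.
```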
Softmax Layer
The softmax layer is crucial for converting raw output scores into probabilities. In the context of language models, it allows the model to estimate the likelihood of each possible next token in a sequence, with the highest-probability token chosen as the next word in the generated text.
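A small illustration: raw scores of [2.0, 1.0, 0.1] become probabilities of roughly [0.66, 0.24, 0.10]:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.1])    # raw scores from the preceding layer
probs = torch.softmax(logits, dim=-1)     # ≈ [0.66, 0.24, 0.10], summing to 1
next_token = torch.argmax(probs).item()   # index 0: the highest-probability token
```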
Transformer Model Workflow
Tokenization and Embedding
The first step in the transformer's workflow is splitting the input text into smaller units called tokens. These can be words, punctuation marks, prefixes, suffixes, characters, etc. Tokenization simplifies the original data and maps each element to a known token from the vocabulary.
Then, the model converts each token into a dense vector representation. This embedding maps each token to a high-dimensional space, capturing semantic information and relationships between tokens.
In the case of transformers, these embeddings are supplemented by the positional encodings mentioned earlier. The information about the order of tokens in the sequence allows the model to turn that sequence, e.g., a piece of text, into a meaningful vector that differs from a simple sum of the vectors representing each individual token.
Positional encoding keeps track of the positions of the words in a sentence. Otherwise, the model would see each piece of text as a random collection of words and signs far from the original context. Source: Transformer Architecture explained
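A toy sketch of tokenization and embedding in PyTorch (the whitespace tokenizer and five-word vocabulary are deliberately simplistic; real models use learned subword vocabularies such as BPE or WordPiece):

```python
import torch
import torch.nn as nn

# A toy vocabulary; real models map tens of thousands of subword tokens.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, ".": 4}

def tokenize(text: str) -> list[int]:
    """Naive whitespace tokenization into known token ids."""
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

token_ids = torch.tensor([tokenize("the cat sat")])   # (1, 3)
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=512)
dense_vectors = embedding(token_ids)                  # (1, 3, 512)
# Positional encodings (see above) are then added to these vectors.
```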
Encoding
Once words are converted into tokens and then vectors, the network can begin predicting the next word in a sentence. The embedded tokens enter the encoder, and each token, for example, a word, follows its own processing path.
Each encoder layer consists of a multi-head attention mechanism and a position-wise feed-forward neural network. Source: Jay Alammar on GitHub
The self-attention layer allows the encoder to consider all words in an input sentence while encoding a particular word, making each word's path dependent on the others. In contrast, the feed-forward layer operates independently for each word, allowing parallel processing.
Each token's representation is processed through fully connected layers with ReLU activations, enabling the model to learn intricate patterns.
Self-Attention
Self-attention calculates the importance of each token in relation to others in the sequence. The mechanism generates three vectors for each token: Query, Key, and Value. The dot product of the Query and Key vectors, scaled by the square root of their dimension, determines the attention scores, which are then normalized using the softmax function to obtain the attention weights.
These weights are used to compute a weighted sum of the Value vectors, producing the self-attention output. This process allows the model to capture dependencies regardless of the tokens' positions in the sequence.
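A minimal single-head sketch in PyTorch, assuming d_model = 512 and d_k = 64 as in the original paper (the function and variable names are our own):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention for a single head."""
    q, k, v = w_q(x), w_k(x), w_v(x)                   # Query, Key, Value vectors
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))  # scaled dot products
    weights = F.softmax(scores, dim=-1)                # attention weights per token
    return weights @ v                                 # weighted sum of Values

x = torch.randn(1, 10, 512)                            # (batch, seq_len, d_model)
w_q, w_k, w_v = (nn.Linear(512, 64) for _ in range(3)) # per-head projections, d_k = 64
out = self_attention(x, w_q, w_k, w_v)                 # (1, 10, 64)
```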
As we encode the word "it" in encoder #5 (the top encoder in the stack), part of the attention mechanism focuses on "The Animal" and bakes a part of its representation into the encoding of "it." Source: Jay Alammar on GitHub
Decoding
The decoder also consists of multiple identical layers with three main components:
A masked multi-head self-attention mechanism
A multi-head cross-attention mechanism
A position-wise feed-forward neural network
The masked self-attention ensures that the decoder predicts the next word using only data from the past, preventing it from looking ahead. Thanks to the look-ahead mask, the model can process the entire target sequence in a single pass while still learning to predict words one at a time, which significantly speeds up training.
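A minimal sketch of such a look-ahead mask in PyTorch (names are our own):

```python
import torch

def look_ahead_mask(seq_len: int) -> torch.Tensor:
    """Upper-triangular mask: position i may attend only to positions <= i.
    Masked scores are set to -inf so that softmax gives them zero weight."""
    return torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()

scores = torch.randn(4, 4)                                    # toy attention scores
scores = scores.masked_fill(look_ahead_mask(4), float("-inf"))
weights = torch.softmax(scores, dim=-1)                       # no weight on future tokens
```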
The cross-attention mechanism enables the decoder to focus on relevant parts of the encoder's output, while the position-wise feed-forward network processes each token independently through fully connected layers. Together, they enhance the generation of coherent and contextually accurate text.
Output Generation
The decoder stack outputs vectors, which still need to be converted into words or something else a human can understand. This is the job of the linear layer, followed by the softmax layer.
The linear layer is a fully connected neural network that maps the representation produced by the stack of decoders into vocabulary space. It generates a logit vector with thousands of elements, each corresponding to the score of a unique word in the model's vocabulary. Through this mapping, the numerical representation can be interpreted as words.
The softmax layer then converts the scores computed by the linear layer into probabilities, with the token associated with the highest probability chosen as the next word or symbol in a sentence.
The softmax layer chooses the first word for a story. Source: Transformer Architecture Explained
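Putting the linear and softmax layers together, greedy next-token selection might look like this (the vocabulary size here is illustrative):

```python
import torch
import torch.nn as nn

d_model, vocab_size = 512, 50000                 # illustrative sizes
to_vocab = nn.Linear(d_model, vocab_size)        # the final linear layer

decoder_output = torch.randn(1, d_model)         # last decoder position's vector
logits = to_vocab(decoder_output)                # one score per vocabulary word
probs = torch.softmax(logits, dim=-1)            # scores become probabilities
next_token_id = probs.argmax(dim=-1).item()      # greedy pick of the next token
```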
Summary
So, the usual workflow for a transformer model looks like this (a minimal end-to-end code sketch follows below):
Words get turned into tokens (tokenization).
Tokenized words are turned into numbers (embeddings).
The order in which they appear is taken into account (positional encoding).
Every token gets a vector that is input into the model.
The encoder processes these vectors using self-attention mechanisms and feed-forward neural networks to create contextual representations.
The decoder takes these contextual representations, applies cross-attention to focus on relevant parts of the encoder's output, and generates the output sequence step-by-step.
The linear layer maps the decoder's output to a logit vector, and the softmax layer converts this into probabilities to predict the next word or symbol in the sequence.
The diagram depicts the primary components of the model as presented in the original Attention Is All You Need paper with additional comments. Source: Cross Validated
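To tie the workflow together, here is a minimal end-to-end sketch built on PyTorch's nn.Transformer module; it assumes the embedding and positional-encoding steps have already produced the input vectors:

```python
import torch
import torch.nn as nn

# Encoder-decoder stack with the layer counts from the original paper.
model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.randn(10, 1, 512)  # (source_len, batch, d_model): embedded input tokens
tgt = torch.randn(7, 1, 512)   # (target_len, batch, d_model): embedded output so far
out = model(src, tgt)          # (7, 1, 512): contextual decoder representations
# A linear + softmax layer (as sketched above) then maps `out`
# to next-token probabilities over the vocabulary.
```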
Transformer Models Use Cases
Initially, the transformer's inventors aimed to address the challenges in sequence-to-sequence tasks, particularly in translation. They wanted a model that could handle long-range dependencies more effectively and process data in parallel, overcoming the limitations of Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks.
The paper published in August 2017, a few months after the ‘Attention’ article, demonstrated the transformer outperforming RNNs and CNNs on translation benchmarks. Source: Google Research
However, this revolutionary approach laid the groundwork for many practical applications across multiple domains.
Early Applications and Expansion
The initial success of transformers in neural machine translation quickly led to their adoption in other natural language processing (NLP) tasks. Early applications included text summarization, sentiment analysis, and question-answering systems. The introduction of the Tensor2Tensor library in 2017 provided efficient implementations of the transformer model, facilitating its further expansion.
In 2018, Google researchers released BERT (Bidirectional Encoder Representations from Transformers), which achieved groundbreaking results in question answering and language inference. GPT (Generative Pre-trained Transformer), released in 2018 by OpenAI, showcased transformers' text generation and language modeling capabilities, getting further with each update.
Before 2020, other notable examples of transformers' practical use included projects in translation, text summarization, and other complex language tasks. For instance, XLNet, released in 2019 by Google AI, addressed limitations related to the unidirectional context in BERT.
Broadening Horizons: Different Use Cases
Transformers have since expanded beyond NLP into various fields, demonstrating versatility and robustness. Here are just a few areas where transformers are making a significant impact:
Healthcare
Major healthcare data source modalities and corresponding tasks. Source: Transformers in Healthcare: A Survey
Medical Record Analysis: Transformers can extract valuable insights from electronic health records, aiding in patient diagnosis and treatment plans. For instance, BioBERT is a pre-trained model designed explicitly for biomedical text mining.
Overview of the pre-training and fine-tuning of BioBERT. Source: Bioinformatics, 2020
Drug Discovery: Transformers have shown significant promise in drug discovery by leveraging their ability to handle sequential data efficiently. For example, the UK-based pharmaceutical company Exscientia works on a transformer neural network for automating retrosynthesis and guiding the synthesis of new drug molecules, while Insilico Medicine's Chemistry42 platform integrates transformers with generative approaches to design novel compounds.
AstraZeneca's MolBART model, trained on a large chemical compound database using NVIDIA's Megatron framework, aims to understand molecular relationships like language models understand word relationships. Academic researchers are also exploring transformer architectures for predicting different drug interactions, cancer drug sensitivity, and protein-ligand affinity.
The structure of DeepTTA, a transformer-based prediction model for anti-cancer drug responses. Source: Briefings in Bioinformatics, 2022
Genomic Data Analysis: Transformers are increasingly utilized in genomic data analysis to identify disease markers and potential genetic disorders. Thanks to their advanced attention mechanisms, they excel in producing sequences, classifying data, and performing quantitative assays.
These models can be applied as standalone systems or following initial compression layers. Standalone transformers handle tasks without additional context, while the latter approach compresses data to manage computational costs. A notable example is DNABERT, a modified BERT model tailored for DNA sequence analysis.
DNABERT pre-trains on specific genomic tasks, achieving state-of-the-art results in predicting promoter regions and transcription-factor binding sites.
Finance
Forecasting and Algorithmic Trading: Transformers are also revolutionizing algorithmic trading by analyzing large volumes of financial data to predict market trends and inform trading strategies. FinBERT, a financial sentiment analysis model, helps understand market sentiment from news articles and social media.
Another innovative transformer model is the HFformer, designed for high-frequency Bitcoin-USDT log-return forecasting. HFformer explores various high-frequency trading strategies, including trade sizing, trading signal aggregation, and minimal trading thresholds, outperforming traditional Long Short-Term Memory (LSTM) models.
According to the research published in China Finance Review International, second-generation transformer models (Informer, Autoformer, and PatchTST) prove to be highly effective in financial forecasting. This is especially remarkable in cases with limited historical data and high market volatility.
Fraud Detection: Transformers can be highly effective in fraud detection, providing a nuanced understanding of transactional behavior through contextual analysis.
In December 2023, a group of researchers from WeChat Pay suggested an innovative autoregressive model leveraging GPT architectures tailored for identifying fraudulent activities in payment systems. By reconstructing behavioral sequences and utilizing unsupervised pretraining, this model excels in analyzing users’ transactions without the need for labeled data.
Manufacturing
Predictive Maintenance: Transformers analyze sensor data from machinery to predict failures and schedule timely maintenance, reducing downtime and costs. Models like BERT are adapted to analyze machine logs and operational data.
In 2022, a group of researchers from Brazil proposed T4PdM, a transformer-based model that identifies various types of faults in rotating machinery. Their experimental results showed that the model significantly enhances diagnostic processes.
Comparison of the T4PdM and some other models' performance in the experiments for the MaFaulDa vibration dataset.
Popular Models
The original transformer architecture has evolved into three different variations based on specific needs.
Encoder Pretraining
These models focus on understanding complete textual fragments and excel in text classification and question answering.
BERT: Bidirectional Encoder Representations from Transformers is a language model introduced in October 2018 that has been used to improve Google search queries since 2019. BERT has numerous variations, including ALBERT, RoBERTa, ELECTRA, DistilBERT, SpanBERT, and TinyBERT.
ERNIE: ERNIE is a series of models by Baidu that show high efficiency, especially with tasks in Chinese.
AlphaFold: AlphaFold software is a deep learning system developed by DeepMind, a subsidiary of Alphabet, for protein structure predictions.
Decoder Pretraining
These transformers, also known as auto-regressive language models, specialize in text generation.
GPT: Generative pre-trained transformers are a type of large language model by OpenAI and the most famous generative AI framework.
Alpaca 7B: Alpaca is an advanced natural language processing model from Stanford University, showing efficiency similar to that of ChatGPT in generating coherent conversations.
Stanford researchers said their data generation process cost less than $500 using the OpenAI API. Source: Stanford University
Minerva: Minerva is a model created by Google for solving quantitative problems and mathematical reasoning.
Encoder-Decoder Pretraining
These combined variations are perfect for handling translation or summarization.
BART: Bidirectional and Auto-Regressive Transformer by Facebook, often called a generalization of BERT, GPT, and several other pretraining approaches.
HTLM: Hyper-text language model by Facebook AI intended for structured HTML prompting.
DQ-BART: Sequence-to-sequence model by Amazon for language generation tasks.
Transformers family tree. Source: Transformer models: an introduction and catalog
Most businesses prefer fine-tuning pre-trained models to building their own from scratch. This cost-effective approach leverages extensive training already done on vast datasets. However, some companies may develop custom transformer neural networks, especially those with unique data requirements or highly specialized tasks.
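For instance, a minimal fine-tuning step with the Hugging Face transformers library might look like this (the checkpoint, labels, and hyperparameters are illustrative):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load a pre-trained checkpoint and attach a fresh two-class head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

batch = tokenizer(["great product", "terrible support"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])  # toy sentiment labels

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss  # cross-entropy on the new head
loss.backward()
optimizer.step()
```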
Final Thoughts
The practical advantages of transformer networks have extended their impact far beyond their initial NLP application. As transformers evolve, their applications across healthcare, finance, manufacturing, agriculture, transportation, energy, and other domains will only expand, driving efficiency.
Businesses evaluating these models should consider their performance, scalability, and adaptability to specific tasks. They often opt for fine-tuning pre-trained models to suit their needs.
Transformer models have grown dramatically in size. Source: NVIDIA Blog
Despite their many strengths, transformers can be computationally intensive and require substantial resources for training and deployment.
However, with ongoing research and development, the challenges associated with transformers will likely be addressed, making them even more integral to future advancements in artificial intelligence and machine learning.
Article written by:
Toloka Team
Updated:
Jul 6, 2024