A Comprehensive Introduction to Transformer Models

Understanding the Encoder-Decoder Architecture and Why Transformers Excel in Natural Language Processing Tasks

Samuel Ozechi
9 min read · Jul 23, 2024


Introduction

In 2017, researchers at Google published the paper “Attention Is All You Need”, which proposed a novel neural network architecture for sequence modelling that is now popularly known as the Transformer. Since then, we have seen the rise of several transformer model variants that removed the need to train task-specific architectures from scratch and broke almost every benchmark in NLP by a significant margin. These models include the Generative Pretrained Transformer (GPT), Bidirectional Encoder Representations from Transformers (BERT) and the Large Language Model from Meta AI (LLaMA), to mention a few. They are now the go-to for text classification, summarization, and question-answering tasks. In this guide, we will take a comprehensive look at the Transformer model, explaining its architecture and components and showing why it excels at various tasks.

A Review of Sequence Modelling: The Encoder-Decoder Network

Before the advent of Transformers, Recurrent Neural Networks (RNNs) were favoured for sequence modelling because they can recognize patterns in sequential data such as time series, speech, or text. They are useful for tasks where the order of the data matters.

The defining feature of RNNs is their ability to “remember” information from previous steps (t-1) in the sequence and use that information to influence the current step (t). This is achieved through a feedback loop structure in the network, where the output from one step is fed back into the network as input for the next step.

Figure: A Recurrent Neural Network

Consider the sentence “The cat sat on the mat”. Here are the steps an RNN would typically follow to process the sentence:

Step 1:
Input: "The"
Hidden state: ℎ0
Output: The network processes "The" and updates the hidden state to ℎ1

Step 2:
Input: "cat"
Hidden state: ℎ1
Output: The network processes "cat" and updates the hidden state to ℎ2

Step 3:
Input: "sat"
Hidden state: ℎ2
Output: The network processes "sat" and updates the hidden state to ℎ3

.
.
.

Step 6:
Input: "mat"
Hidden state: ℎ5
Output: The network processes "mat" and updates the hidden state to ℎ6

Notice that at each step, the hidden state captures the context of the previous words, allowing the RNN to keep track of information across steps and use it for its output predictions.
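To make the recurrence concrete, here is a minimal numpy sketch of the update loop described above. The embeddings and weight matrices are random placeholders rather than trained values, and the hidden-state size is an arbitrary choice.

import numpy as np

rng = np.random.default_rng(0)
tokens = ["The", "cat", "sat", "on", "the", "mat"]
embeddings = {tok: rng.normal(size=4) for tok in tokens}  # toy word vectors

W_x = rng.normal(size=(8, 4))  # input-to-hidden weights
W_h = rng.normal(size=(8, 8))  # hidden-to-hidden (feedback) weights
h = np.zeros(8)                # h0

for t, tok in enumerate(tokens, start=1):
    # h_t depends on the current input x_t and the previous hidden state h_(t-1)
    h = np.tanh(W_x @ embeddings[tok] + W_h @ h)
    print(f"Step {t}: processed {tok!r}, hidden state updated to h{t}")

Each iteration must wait for the previous one to finish, which is exactly the sequential bottleneck that self-attention later removes.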

The process is different in sequence-to-sequence modelling, however. In this case, the inputs and outputs are sequences of arbitrary length, as in machine translation. Here, a pair of sequence models (such as RNNs) is used: one to process the input text into numerical representations (known as the encoder) and another to turn those representations into the output sequence (known as the decoder).

Figure: The Encoder-Decoder Network for Sequence-to-Sequence Modelling

The major challenge of this architecture is that the final hidden state of the encoder (h3 in the figure above) must carry information about all the previous hidden states of the encoder to the decoder. However, because the hidden state at each step contains only a compressed summary of the previous states, the encoder’s final hidden state carries limited information for the decoder, especially when the input text is long.

Fortunately, there is a well-documented solution to this problem: allow the decoder access to all of the encoder’s hidden states. This mechanism is generally known as attention.

Attention Mechanism

The idea behind the attention mechanism is that, instead of producing only a single final hidden state for the input sequence (as in plain encoder-decoder RNNs), the encoder exposes the hidden state of every step to the decoder. However, since attending to all the hidden states equally would create a huge input for the decoder, the approach prioritizes the most relevant hidden states by assigning them different weights (or attention).

Figure: Encoder-Decoder Networks with Attention Mechanism

By identifying and focusing on the hidden states (input sequence) that are most relevant at each (output) timestep, attention-based models can properly align the inputs to the outputs and provide better context for text prediction. This is achieved by computing a set of attention weights that indicate the relevance of each input token to the current output token. The scores are typically computed using a similarity measure between the current (output) hidden state (ht) and the hidden states of the input sequence (hi). The attention scores are then converted into attention weights (αt) using a softmax function, which ensures the weights are positive and sum to 1, making them interpretable as probabilities. A context vector (ct) for the output time step (t) is then computed as a weighted sum of the input hidden states (hi). Finally, the context vector (ct) is used, along with the output hidden state (ht), to generate the output (yt).
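The following is a minimal numpy sketch of that computation, using a plain dot product as the similarity measure (real attention variants use different, learned scoring functions); the hidden-state values are made up for illustration.

import numpy as np

encoder_states = np.array([[0.1, 0.2, 0.3],   # h1
                           [0.4, 0.5, 0.6],   # h2
                           [0.7, 0.8, 0.9]])  # h3
decoder_state = np.array([0.5, 0.1, 0.4])     # ht for the current output step

scores = encoder_states @ decoder_state            # similarity of ht with each hi
weights = np.exp(scores) / np.exp(scores).sum()    # softmax -> attention weights (αt)
context = weights @ encoder_states                 # context vector ct = weighted sum of hi

print(weights.sum())  # 1.0 -- the weights behave like probabilities
print(context)        # used, together with ht, to generate the output yt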

Self-Attention: A Faster Attention Mechanism

While the attention mechanism provides better context for the model by giving it access to the hidden states of the inputs, it still poses certain problems. Since the hidden state for each input and the attention weights for each output are produced step by step, the sequential nature of the computation makes training and inference slow: each step depends on the completion of the previous one. Also, the computational cost of RNN-based models grows significantly with the length of the sequence, making it challenging to handle very long sequences efficiently.

An efficient solution is the self-attention mechanism, which improves computational speed and captures long-term dependencies in the data.

Let’s see a practical example of how the self-attention mechanism works.

Consider a simple sequence: ["The", "cat", "sat"].

Step 1: Input Embedding
Each word in the sequence is converted into a fixed-size vector (embedding).
Let's assume each word is represented as a 3-dimensional vector:

"The" -> [0.1, 0.2, 0.3]
"cat" -> [0.4, 0.5, 0.6]
"sat" -> [0.7, 0.8, 0.9]

Step 2: Creating Queries, Keys, and Values
For each word, we create three vectors: Query (Q), Key (K), and Value (V).
These are obtained by multiplying the input embedding by the weight matrices.
Let's assume the weight matrices are identity matrices.
Thus, Q, K and V for each word are the same as the input embeddings.

Q, K, V for "The" -> [0.1, 0.2, 0.3]
Q, K, V for "cat" -> [0.4, 0.5, 0.6]
Q, K, V for "sat" -> [0.7, 0.8, 0.9]

Step 3: Compute Attention Scores
The attention score for a pair of words is computed by taking the dot product
of the Q of the first word with the K of the second word.

For instance, for the first word "The":
Attention score with itself: Q("The") · K("The") = 0.1*0.1 + 0.2*0.2 + 0.3*0.3 = 0.14
Attention score with "cat": Q("The") · K("cat") = 0.1*0.4 + 0.2*0.5 + 0.3*0.6 = 0.32
Attention score with "sat": Q("The") · K("sat") = 0.1*0.7 + 0.2*0.8 + 0.3*0.9 = 0.50

Similarly, compute for "cat" and "sat".

Step 4: Scale Scores and Apply Softmax
To make the scores more stable, we scale them by dividing by the square root
of the dimension of the Key vectors (3 in this case).
Then, we apply the softmax function to convert these scores into probabilities (attention weights).

For "The":
Attention weights: softmax([0.14, 0.32, 0.50]) = [0.26, 0.32, 0.42]

Step 5: Compute the Context Vector
The context vector for each word is a weighted sum of the Value vectors of all words, using the attention weights.

For "The":
Context vector = 0.26 * V("The") + 0.32 * V("cat") + 0.42 * V("sat")
Context vector = 0.26 * [0.1, 0.2, 0.3] + 0.32 * [0.4, 0.5, 0.6] + 0.42 * [0.7, 0.8, 0.9]
Context vector = [0.342, 0.442, 0.542]

Step 6: Generate Output
The context vectors are used in place of the original embeddings for further processing.

By following these steps, each token’s representation is computed in parallel, without waiting for the previous token, and every token has direct access to every other token, which helps capture long-range dependencies better than RNNs.
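For readers who want to verify the arithmetic, here is a small numpy sketch that reproduces the worked example above (with identity weight matrices, Q, K and V equal the input embeddings):

import numpy as np

X = np.array([[0.1, 0.2, 0.3],   # "The"
              [0.4, 0.5, 0.6],   # "cat"
              [0.7, 0.8, 0.9]])  # "sat"
Q = K = V = X                    # identity projections, as assumed in the example

d_k = K.shape[1]                                  # dimension of the Key vectors (3)
scores = Q @ K.T / np.sqrt(d_k)                   # scaled dot-product attention scores
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # row-wise softmax
context = weights @ V                             # one context vector per token, computed in parallel

print(weights[0].round(2))   # ~[0.30, 0.33, 0.37] for "The"
print(context[0].round(2))   # ~[0.42, 0.52, 0.62]

In a real Transformer, Q, K and V are produced by learned projection matrices rather than the identity.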

Now that we have an efficient mechanism to provide better context for our model, there is one more missing piece to remove the need to train task-specific architectures from scratch.

Transfer Learning

Transfer learning has already gained prominence in computer vision, where features learned on previous tasks are utilized to improve performance on new tasks, even with limited data. This is usually achieved by initially training the model on a large dataset, which is known as pre-training.

By pre-training Transformers on a large corpus and utilizing the learned features, improved performance was achieved on new tasks. For example, GPT was pre-trained on the BookCorpus, which consists of 7,000 unpublished books from a variety of genres, and the resulting model can be adapted to most text generation tasks.

Once the language model is pre-trained on a large-scale corpus, the next step is to adapt it to the in-domain corpus. Finally, the pre-trained model can be fine-tuned on a downstream task with a relatively small amount of labelled data.
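As an illustration of this pre-train-then-fine-tune workflow, here is a minimal sketch assuming the Hugging Face Transformers and Datasets libraries; the checkpoint (distilbert-base-uncased), the dataset (imdb) and the training settings are illustrative choices, not ones taken from this article.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")  # downstream task: sentiment classification
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # pre-trained encoder + new classification head

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)))
trainer.train()  # fine-tune on a few thousand labelled examples

The pre-trained weights carry the general language knowledge; only a small labelled set is needed to adapt them to the downstream task.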

Applications of Transformers

Transformers excel in tasks that involve understanding and generating sequences of data. Their ability to model complex dependencies and relationships in data makes them highly effective across a wide range of domains, from NLP and computer vision to speech processing, reinforcement learning, and beyond. Some of their major applications include:

Text Generation: Models like GPT-3 and GPT-4 can generate coherent and contextually relevant text, making them useful for chatbots, content creation, and more.

Machine Translation: Encoder-decoder Transformer models have set new benchmarks in translating text between languages by attending to the context of the entire input sentence in both directions.

Text Classification: Transformers can classify text based on sentiment (positive, negative, neutral) with high accuracy and can distinguish between spam and legitimate messages (see the short sketch after this list).

Named Entity Recognition (NER): Transformers are effective at identifying and classifying named entities in text, such as names of people, organizations, locations, dates, etc.

Question Answering: Models like BERT and RoBERTa have achieved state-of-the-art results in answering questions based on given passages.

Summarization: Transformers can generate concise summaries of long texts while retaining the essential information.

Image Classification: Vision Transformers (ViTs) have demonstrated competitive performance with traditional Convolutional Neural Networks (CNNs) in image classification tasks.

Object Detection: DETR (Detection Transformer) uses transformers for object detection in images, providing a new approach compared to traditional region-based methods.

Image Generation: DALL-E, a model by OpenAI, generates images from textual descriptions, showing the versatility of transformers in visual creativity.

Speech Recognition: Transformers are used in Automatic Speech Recognition (ASR) systems to transcribe spoken language into written text accurately; related models can also convert text into natural-sounding speech, improving text-to-speech (TTS) systems.

Code Generation: Codex is a transformer model that can generate code based on natural language descriptions, aiding software development.
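To ground the text classification item above in code, here is a short sketch using the Hugging Face Transformers pipeline API (an assumed library choice; the article does not prescribe a specific toolkit):

from transformers import pipeline

# Downloads a default pre-trained sentiment model on first use.
classifier = pipeline("sentiment-analysis")
print(classifier("Transformers make sequence modelling remarkably simple."))
# e.g. [{'label': 'POSITIVE', 'score': ...}]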

Main Challenges to Transformers

Some of the main challenges for transformers include the dominance of the English language in NLP research, which makes it harder to find pre-trained models for rare or low-resource languages. Also, while self-attention works extremely well on paragraph-long texts, it becomes expensive when we move to longer texts such as whole documents. Finally, Transformer models are predominantly pre-trained on text data from the internet, which imprints all the biases in that data into the models.

Conclusion

Transformers have revolutionized machine learning, excelling in NLP, computer vision, and speech processing through self-attention mechanisms and transfer learning. Their ability to model complex dependencies and process data in parallel has set new benchmarks across various applications, from text generation and translation to image generation and object detection.

Looking forward, transformers will continue to drive advancements in AI, with potential frontiers including enhanced multimodal learning, more efficient model architectures, and broader applications in real-time systems and autonomous agents. Their versatility and efficiency make them indispensable for the future of intelligent systems.
