Transformers & Attention Mechanism
From Bahdanau attention to GPT and BERT. A comprehensive, visually interactive deep dive into the architecture that revolutionized natural language processing and beyond.
How the attention mechanism evolved from a simple alignment trick into the foundation of modern deep learning.
The story of Transformers begins with a fundamental limitation of sequence-to-sequence models. In 2014, the standard approach to machine translation was an encoder-decoder architecture built from Recurrent Neural Networks (RNNs). The encoder would read an entire input sentence and compress it into a single fixed-length context vector. The decoder would then generate the translation from this single vector.
The problem was obvious: squeezing an entire sentence into one fixed-size vector creates an information bottleneck. Long sentences lost critical details. Dzmitry Bahdanau, along with Cho and Bengio, proposed a revolutionary solution in their 2014 paper: instead of using a single context vector, allow the decoder to look back at all encoder hidden states and focus on the most relevant ones at each decoding step.
This mechanism -- called attention -- computes a weighted sum of encoder hidden states, where the weights (attention scores) indicate how relevant each input token is to the current output token being generated. The attention score between decoder state \( s_t \) and encoder state \( h_j \) is:

\[ e_{tj} = a(s_t, h_j), \qquad \alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k} \exp(e_{tk})}, \qquad c_t = \sum_{j} \alpha_{tj} h_j \]
Where \( a \) is a learned alignment function, \( \alpha_{tj} \) are the normalized attention weights, and \( c_t \) is the context vector for decoding step \( t \). This simple idea improved translation quality dramatically, especially for long sentences.
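As a concrete illustration, here is a minimal NumPy sketch of Bahdanau-style additive attention for a single decoding step. The alignment function \( a \) is a small feed-forward network; the parameter shapes and names here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def additive_attention(s_t, H, W_s, W_h, v):
    """One decoding step of Bahdanau-style attention.
    s_t: (d_dec,) current decoder state
    H:   (n, d_enc) all encoder hidden states
    W_s, W_h, v: learned parameters of the alignment MLP
    """
    # Alignment scores: e_tj = v . tanh(W_s s_t + W_h h_j)
    scores = np.tanh(s_t @ W_s + H @ W_h) @ v   # (n,)
    alpha = softmax(scores)                     # attention weights, sum to 1
    c_t = alpha @ H                             # context vector, (d_enc,)
    return c_t, alpha

rng = np.random.default_rng(0)
n, d_enc, d_dec, d_att = 7, 8, 6, 5
H = rng.standard_normal((n, d_enc))
s_t = rng.standard_normal(d_dec)
W_s = rng.standard_normal((d_dec, d_att))
W_h = rng.standard_normal((d_enc, d_att))
v = rng.standard_normal(d_att)

c_t, alpha = additive_attention(s_t, H, W_s, W_h, v)
print("context:", c_t.shape, " weights sum:", alpha.sum())
```

The decoder would concatenate \( c_t \) with its own state before predicting the next token; that step is omitted here.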
The critical leap came in 2017 when Vaswani et al. at Google published their landmark paper "Attention Is All You Need". Their insight was radical: remove the recurrence entirely and build the entire model using only attention mechanisms. The result was the Transformer architecture.
The key innovation was self-attention (also called intra-attention), where a sequence attends to itself. Instead of computing attention between encoder and decoder, each token in a sequence computes attention weights with every other token in the same sequence. This allows the model to capture dependencies regardless of distance, with a constant-length path between any two positions.
The Transformer achieved state-of-the-art results on English-to-German and English-to-French translation while being significantly faster to train than RNN-based models. More importantly, it became the foundation for virtually every major advance in NLP that followed.
From a translation model to the backbone of modern AI, the Transformer's impact has been extraordinary:
Why letting every token attend to every other token is so powerful -- and why it replaced recurrence.
Recurrent Neural Networks process sequences one token at a time. To understand how the first word in a sentence relates to the last word, information must flow through every intermediate hidden state. For a sequence of length \( n \), the maximum path length between any two tokens is \( O(n) \).
This creates two critical problems. First, long-range dependencies are hard to learn: gradients must flow through \( n \) steps during backpropagation, suffering from vanishing or exploding gradients. Even LSTMs and GRUs, designed to mitigate this, struggle with very long sequences. Second, sequential processing is inherently slow: each time step depends on the previous one, so RNNs cannot be parallelized along the sequence dimension.
The Transformer's self-attention mechanism eliminates both problems with an elegant solution: every token directly attends to every other token in a single operation. The maximum path length between any two positions drops from \( O(n) \) in RNNs to \( O(1) \) in Transformers.
This means a token at position 1 can directly interact with a token at position 1000 without any information needing to pass through intermediate positions. The attention weights are computed in parallel for all positions simultaneously, making the operation highly efficient on modern GPUs.
The computational trade-off is that self-attention has \( O(n^2) \) time and memory complexity (every pair of tokens interacts), compared to \( O(n) \) for RNNs. However, the constant factors are much smaller and the parallelism more than compensates for typical sequence lengths.
RNN: \( O(n) \) path length, \( O(n) \) sequential ops, \( O(n \cdot d^2) \) complexity per layer. Self-Attention: \( O(1) \) path length, \( O(1) \) sequential ops, \( O(n^2 \cdot d) \) complexity per layer. The constant path length is what makes Transformers so effective at capturing long-range dependencies.
All attention computations are independent and can run simultaneously on GPUs, enabling massive speedups during training compared to sequential RNNs
Any two tokens can interact directly in a single layer, eliminating the vanishing gradient problem for long-range dependencies
Attention weights provide a natural way to visualize which tokens the model focuses on, offering insights into model behavior
The heart of the Transformer -- computing query, key, and value projections to determine how tokens relate.
Self-attention can be understood through a database retrieval analogy. Imagine you have a question (query) and a database of entries, each with a label (key) and content (value). You compare your query against every key, find the most relevant matches, and retrieve a weighted combination of their values.
In self-attention, every token plays all three roles simultaneously. Given an input sequence of embeddings \( X \in \mathbb{R}^{n \times d} \), we project them into three separate spaces using learned weight matrices:

\[ Q = XW^Q, \qquad K = XW^K, \qquad V = XW^V \]
Where \( W^Q, W^K \in \mathbb{R}^{d \times d_k} \) and \( W^V \in \mathbb{R}^{d \times d_v} \). The query represents "what am I looking for?", the key represents "what do I contain?", and the value represents "what information do I provide?"
The attention output is computed by taking the dot product of queries with keys, scaling, applying softmax, and multiplying by values:

\[ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]
The scaling factor \( \sqrt{d_k} \) is crucial. Without it, for large \( d_k \), the dot products \( QK^T \) grow in magnitude, pushing the softmax into regions with extremely small gradients. Specifically, if the entries of \( Q \) and \( K \) have zero mean and unit variance, their dot product has variance \( d_k \). Dividing by \( \sqrt{d_k} \) normalizes this back to unit variance:

\[ \text{Var}(q \cdot k) = \sum_{i=1}^{d_k} \text{Var}(q_i k_i) = d_k \quad \Rightarrow \quad \text{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1 \]
The softmax converts raw scores into a probability distribution: each row of the attention matrix sums to 1, representing how much each token attends to every other token (including itself).
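The variance argument behind the scaling factor is easy to check numerically. This sketch (not from the original text) samples random unit-variance vectors and compares the variance of raw versus scaled dot products:

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64
num_samples = 100_000

# Queries and keys with zero mean and unit variance per entry
q = rng.standard_normal((num_samples, d_k))
k = rng.standard_normal((num_samples, d_k))

raw = (q * k).sum(axis=1)       # unscaled dot products
scaled = raw / np.sqrt(d_k)     # scaled as in the attention formula

print(f"Var(q.k)            ~ {raw.var():.1f}  (theory: d_k = {d_k})")
print(f"Var(q.k / sqrt d_k) ~ {scaled.var():.2f} (theory: 1)")
```

Without the scaling, softmax inputs with standard deviation \( \sqrt{64} = 8 \) would saturate the distribution toward one-hot, killing gradients.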
Type a sentence and visualize the self-attention weights. Each cell shows how strongly token \( i \) (row) attends to token \( j \) (column). The diagonal is typically brightest because tokens attend strongly to themselves.
Enter a sentence below and click "Compute Attention" to see the attention heatmap. Darker blue indicates lower weight; brighter orange indicates higher weight. Arrows show top connections.
Running multiple attention operations in parallel to capture different types of relationships simultaneously.
A single attention head can only focus on one type of relationship at a time. For example, it might learn to attend to syntactic dependencies (subject-verb agreement) or semantic relationships (coreference), but it is difficult for one set of \( Q, K, V \) projections to capture all types of relationships simultaneously.
Multi-head attention solves this by running \( h \) attention operations in parallel, each with its own learned projection matrices. Each head can specialize in a different type of relationship:

\[ \text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V), \qquad \text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)W^O \]
Each head operates on a reduced dimensionality: \( d_k = d_v = d_{\text{model}} / h \). For the original Transformer with \( d_{\text{model}} = 512 \) and \( h = 8 \), each head works with \( d_k = 64 \) dimensions. The total computation is roughly the same as single-head attention with full dimensionality.
Research has shown that different attention heads learn to specialize in different linguistic phenomena. In BERT, for example, some heads attend primarily to the previous or next token, some track syntactic relations such as verbs attending to their objects, and others focus heavily on delimiter tokens like [SEP].
The final output projection \( W^O \in \mathbb{R}^{hd_v \times d_{\text{model}}} \) combines these diverse perspectives into a single representation that benefits from all of them.
Visualize how multiple attention heads capture different patterns. Toggle between viewing individual heads and the merged (averaged) result.
Adjust the number of heads. Each small heatmap represents one head's attention pattern. Click "Show Merged" to see the averaged attention across all heads.
How Transformers understand word order without any recurrence -- injecting position information through sinusoidal functions.
Self-attention is permutation-equivariant: if you shuffle the input tokens, the attention mechanism will produce the same outputs (just shuffled). This means the Transformer has no inherent notion of token order. The sentence "dog bites man" and "man bites dog" would produce identical representations without positional information.
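This permutation equivariance is easy to verify empirically. The sketch below (random weights, a minimal single-head attention with no positional encoding) shuffles the input rows and checks that the output rows are shuffled identically:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Minimal scaled dot-product self-attention (no positional encoding)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.standard_normal((n, d))
W_q, W_k, W_v = [rng.standard_normal((d, d)) for _ in range(3)]

perm = rng.permutation(n)
out = self_attention(X, W_q, W_k, W_v)
out_perm = self_attention(X[perm], W_q, W_k, W_v)

# Shuffling the inputs just shuffles the outputs: no notion of order
print(np.allclose(out[perm], out_perm))  # True
```

Adding a position-dependent term to each row of `X` breaks this symmetry, which is exactly what positional encodings do.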
To solve this, the original Transformer adds positional encodings to the input embeddings. These encodings inject information about each token's position in the sequence. The authors chose sinusoidal functions of different frequencies:

\[ PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right) \]
Where \( pos \) is the position in the sequence and \( i \) is the dimension index. Each dimension of the positional encoding corresponds to a sinusoid with a different wavelength, forming a geometric progression from \( 2\pi \) to \( 10000 \cdot 2\pi \).
The sinusoidal encoding has several elegant properties:
The final input to the Transformer is the sum of the token embedding and the positional encoding:

\[ \text{input}_i = \text{embedding}(x_i) + PE_i \]
BERT and GPT use learned positional embeddings instead of sinusoidal ones. A trainable embedding matrix \( P \in \mathbb{R}^{L_{\max} \times d} \) is used, where \( L_{\max} \) is the maximum sequence length. Both approaches perform similarly, but learned embeddings cannot extrapolate beyond \( L_{\max} \).
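The sinusoidal encoding is a direct transcription of the formulas above; here is a short NumPy sketch (the function name is mine, and `d_model` is assumed even):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Build the (seq_len, d_model) sinusoidal positional encoding matrix."""
    assert d_model % 2 == 0, "assumes an even model dimension"
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]              # even dimension indices
    angles = positions / np.power(10000, dims / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
print(pe.shape)    # (50, 64)
print(pe[0, :4])   # position 0: [sin 0, cos 0, sin 0, cos 0] = [0, 1, 0, 1]
```

Every value lies in \( [-1, 1] \), so the encoding can be added to embeddings without rescaling.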
Visualize the sinusoidal positional encoding matrix. X-axis represents embedding dimensions, Y-axis represents sequence positions. Blue = -1, white = 0, orange = +1. Notice how low dimensions oscillate rapidly while high dimensions change slowly.
Adjust the model dimension and sequence length to explore how the encoding patterns change. Even dimensions use sine; odd dimensions use cosine.
The full encoder-decoder stack with layer normalization, residual connections, and feed-forward networks.
Each encoder layer consists of two sub-layers, each wrapped with a residual connection and layer normalization:
The input passes through multi-head self-attention. Every token attends to every other token in the sequence. The output has the same shape as the input: \( \mathbb{R}^{n \times d_{\text{model}}} \).
The attention output is added to the input (residual connection) and then layer-normalized: \( \text{LayerNorm}(x + \text{MultiHead}(x, x, x)) \). This stabilizes training and allows gradients to flow directly through skip connections.
A two-layer MLP applied independently to each position: \( \text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2 \). The inner dimension \( d_{ff} \) is typically 4x the model dimension (e.g., 2048 for \( d_{\text{model}} = 512 \)).
Another residual connection and layer normalization: \( \text{LayerNorm}(x + \text{FFN}(x)) \). The output feeds into the next encoder layer.
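The four steps above can be sketched end-to-end in NumPy. This is a minimal, untrained single layer: single-head attention stands in for multi-head, and all weights are random assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's features to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(x, p):
    # 1. Self-attention (single head here for brevity)
    Q, K, V = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn = (e / e.sum(axis=-1, keepdims=True)) @ V
    # 2. Add & norm (residual connection around attention)
    x = layer_norm(x + attn)
    # 3. Position-wise feed-forward: ReLU(x W1 + b1) W2 + b2
    ffn = np.maximum(0, x @ p["W1"] + p["b1"]) @ p["W2"] + p["b2"]
    # 4. Add & norm (residual connection around FFN)
    return layer_norm(x + ffn)

rng = np.random.default_rng(0)
n, d, d_ff = 8, 16, 64   # d_ff = 4 * d_model, as in the original design
params = {
    "W_q": rng.standard_normal((d, d)) * 0.1,
    "W_k": rng.standard_normal((d, d)) * 0.1,
    "W_v": rng.standard_normal((d, d)) * 0.1,
    "W1": rng.standard_normal((d, d_ff)) * 0.1, "b1": np.zeros(d_ff),
    "W2": rng.standard_normal((d_ff, d)) * 0.1, "b2": np.zeros(d),
}
x = rng.standard_normal((n, d))
out = encoder_layer(x, params)
print(out.shape)  # same shape as the input: (8, 16)
```

Because input and output shapes match, layers of this form can be stacked arbitrarily deep.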
The decoder is similar to the encoder but with an additional sub-layer for cross-attention and a crucial modification to self-attention:
Explore the full Transformer architecture layer by layer. Use the slider to highlight individual layers and see the data flow through the encoder-decoder stack.
Adjust the layer depth slider to highlight specific layers. The diagram shows embeddings at the bottom, encoder stack on the left, and decoder stack on the right.
How bidirectional pre-training with masked language modeling revolutionized natural language understanding.
BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin et al. at Google in 2018, uses only the encoder stack of the Transformer. Its key innovation is bidirectional pre-training: unlike GPT which reads left-to-right, BERT can attend to both left and right context simultaneously.
To enable bidirectional training without the model trivially predicting the target token (since it can see it in the input), BERT uses Masked Language Modeling: 15% of input tokens are randomly selected, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. The model must predict the original token:

\[ \mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log p(x_i \mid x_{\backslash \mathcal{M}}) \]
Where \( \mathcal{M} \) is the set of masked positions and \( x_{\backslash \mathcal{M}} \) represents the corrupted input.
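The 80/10/10 corruption procedure can be sketched as follows. The token IDs and the mask-token ID here are illustrative; real implementations also avoid masking special tokens like [CLS] and [SEP].

```python
import numpy as np

def mlm_corrupt(token_ids, mask_token_id, vocab_size, rng, mask_prob=0.15):
    """Apply BERT-style masked-language-model corruption.
    Returns (corrupted ids, boolean array of positions to predict)."""
    ids = token_ids.copy()
    selected = rng.random(len(ids)) < mask_prob   # ~15% of positions
    roll = rng.random(len(ids))
    for i in np.where(selected)[0]:
        if roll[i] < 0.8:                         # 80%: replace with [MASK]
            ids[i] = mask_token_id
        elif roll[i] < 0.9:                       # 10%: replace with random token
            ids[i] = rng.integers(vocab_size)
        # remaining 10%: leave the token unchanged
    return ids, selected

rng = np.random.default_rng(0)
tokens = rng.integers(1000, size=20)
corrupted, targets = mlm_corrupt(tokens, mask_token_id=103,
                                 vocab_size=1000, rng=rng)
print(targets.sum(), "positions to predict")
```

Note that the loss is computed only at the selected positions, even for the 10% left unchanged.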
BERT's second pre-training objective is Next Sentence Prediction. Given two sentences A and B, the model predicts whether B actually follows A in the original text (label: IsNext) or is a random sentence (label: NotNext). This helps the model learn inter-sentence relationships.
The input format is: [CLS] Sentence A [SEP] Sentence B [SEP]. The [CLS] token's final hidden state is used for the NSP classification. While NSP was part of the original BERT, later work (RoBERTa) showed that removing NSP and training with longer sequences often improves performance.
BERT's power comes from its pre-train then fine-tune paradigm. The same pre-trained model can be adapted to dozens of different tasks by adding a simple task-specific head:
Add a linear layer on top of [CLS] for sentiment analysis, spam detection, or topic classification
Add a per-token classifier for Named Entity Recognition (NER) or Part-of-Speech tagging
Predict start and end positions in a passage that answer a question (extractive QA)
RoBERTa removes NSP, trains longer with more data. ALBERT shares parameters across layers to reduce model size. DistilBERT is a 6-layer distilled version that retains 97% of BERT's performance at 60% of the size. DeBERTa uses disentangled attention for improved performance.
Autoregressive language modeling and the remarkable power of scaling Transformers.
The GPT (Generative Pre-trained Transformer) family uses only the decoder stack with causal masking. Each token can only attend to previous tokens and itself, never to future tokens. This makes GPT a left-to-right language model that predicts the next token given all previous tokens:

\[ \mathcal{L}_{\text{LM}} = -\sum_{t=1}^{n} \log p(x_t \mid x_{<t}) \]
The training objective is simple: maximize the likelihood of the next token at every position. This next-token prediction objective, despite its simplicity, turns out to be extraordinarily powerful when combined with enough data and model capacity.
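Causal masking is implemented by adding \( -\infty \) to the attention scores at all future positions before the softmax, so those weights become exactly zero. A minimal sketch:

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions (upper triangle) and softmax row-wise."""
    n = scores.shape[0]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    masked = np.where(mask, -np.inf, scores)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 4))
w = causal_attention_weights(scores)
print(np.round(w, 2))
# Row t has nonzero weight only on positions <= t:
print(np.allclose(np.triu(w, k=1), 0))  # True
```

The first row attends only to itself; each later row sees one more position, which is what makes next-token prediction well-defined during parallel training.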
One of the most important discoveries in modern AI is that Transformer performance follows predictable scaling laws. Kaplan et al. (2020) showed that the cross-entropy loss \( L \) decreases as a power law with model size \( N \), dataset size \( D \), and compute budget \( C \):

\[ L(N) \propto N^{-\alpha_N}, \qquad L(D) \propto D^{-\alpha_D}, \qquad L(C) \propto C^{-\alpha_C} \]

with fitted exponents of roughly \( \alpha_N \approx 0.076 \), \( \alpha_D \approx 0.095 \), and \( \alpha_C \approx 0.05 \).
This means that making models bigger, training on more data, and using more compute all reliably improve performance. GPT-3 demonstrated this dramatically with 175 billion parameters, showing emergent abilities like few-shot learning that smaller models lacked entirely.
The chart below shows how perplexity (lower is better) decreases as GPT model size increases. Note the log-log scale revealing the power-law relationship.
Demonstrated that unsupervised pre-training on a large corpus followed by supervised fine-tuning outperformed task-specific architectures. Used 12 Transformer layers.
Showed that scaling up enables zero-shot task transfer. A single language model could perform translation, summarization, and QA without any fine-tuning.
Demonstrated few-shot learning: by providing a few examples in the prompt, GPT-3 could perform new tasks without any gradient updates. Introduced "in-context learning."
Multimodal (text + images), dramatically improved reasoning, and achieved human-level performance on many professional benchmarks including the bar exam.
The key architectural and training hyperparameters that define a Transformer model.
The dimensionality of all token representations throughout the model. All sub-layers (attention, FFN) produce outputs of this dimension.
The number of parallel attention operations. Must divide \( d_{\text{model}} \) evenly, giving each head a dimension of \( d_k = d_{\text{model}} / h \).
The number of stacked encoder or decoder blocks. Deeper models can learn more complex representations but are harder to train.
The hidden size of the position-wise feed-forward network. Typically set to \( 4 \times d_{\text{model}} \) (e.g., 2048 for \( d_{\text{model}} = 512 \)).
Transformers are notoriously sensitive to learning rate. The original paper introduced a warmup schedule that linearly increases the learning rate for the first \( T_{\text{warmup}} \) steps, then decays proportionally to the inverse square root of the step number:

\[ lr(\text{step}) = d_{\text{model}}^{-0.5} \cdot \min\left(\text{step}^{-0.5},\ \text{step} \cdot T_{\text{warmup}}^{-1.5}\right) \]
Modern practice typically uses cosine decay with warmup: linear warmup for 1-5% of training, then cosine annealing to zero. AdamW is the standard optimizer with \( \beta_1 = 0.9 \), \( \beta_2 = 0.999 \), and weight decay of 0.01-0.1.
At initialization, the model's attention patterns are essentially random. A large learning rate at this stage causes unstable updates to the attention weights, which can permanently damage training. The warmup period allows attention patterns to stabilize before applying larger learning rates.
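The original warmup-then-decay schedule is a one-liner to implement (a direct transcription of the inverse-square-root rule, using the base Transformer's \( d_{\text{model}} = 512 \) and 4000 warmup steps):

```python
import numpy as np

def noam_lr(step, d_model=512, warmup=4000):
    """Learning rate from 'Attention Is All You Need':
    linear warmup, then inverse-square-root decay. step is 1-indexed."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

steps = np.arange(1, 20001)
lrs = np.array([noam_lr(s) for s in steps])
peak_step = int(steps[lrs.argmax()])
print(f"peak lr {lrs.max():.2e} at step {peak_step}")  # peak at step = warmup
```

The two branches of the `min` intersect exactly at `step == warmup`, which is where the schedule peaks before decaying.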
From natural language processing to computer vision and beyond -- Transformers have become the universal architecture.
Machine translation, text generation, summarization, question answering, sentiment analysis. Transformers dominate every NLP benchmark.
Images are split into patches (e.g., 16x16), flattened, and treated as tokens. ViT matches or exceeds CNN performance on ImageNet when trained with sufficient data.
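Splitting an image into flattened patch tokens, as ViT does, is a pure reshape. A sketch with 16x16 patches on a dummy image (the learned linear projection that follows is omitted):

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into (num_patches, patch*patch*C) token rows."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    img = img.reshape(H // patch, patch, W // patch, patch, C)
    img = img.transpose(0, 2, 1, 3, 4)         # (H/p, W/p, p, p, C)
    return img.reshape(-1, patch * patch * C)  # one row per patch

img = np.random.rand(224, 224, 3)              # ImageNet-sized dummy image
patches = patchify(img)
print(patches.shape)  # (196, 768): 14x14 patches, each a 768-dim token
```

From here on, the 196 patch rows are treated exactly like word embeddings: add positional encodings and feed them to a standard Transformer encoder.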
Whisper (OpenAI) uses a Transformer encoder-decoder for speech recognition. Music generation models like MusicLM also use Transformer backbones.
GPT-4V, Gemini, and CLIP process both text and images. Transformers serve as the unified backbone for combining multiple modalities.
The chart below compares popular Transformer models across standard NLP benchmarks. BERT, RoBERTa, GPT-2, and T5 are evaluated on GLUE, SQuAD, MNLI, and SST-2 tasks.
O(1) path length means any two tokens can directly interact, solving the fundamental limitation of RNNs.
All positions are processed simultaneously, enabling efficient training on modern GPU/TPU hardware.
Pre-trained Transformers can be fine-tuned on small datasets, democratizing access to powerful NLP capabilities.
Performance improves predictably with scale. The same architecture works from millions to trillions of parameters.
Self-attention requires O(n^2) memory for sequence length n, limiting context windows without specialized techniques like FlashAttention.
Training large Transformers requires enormous GPU clusters. GPT-3 training cost an estimated $4.6M in compute alone.
Position must be explicitly injected. Different positional encoding schemes (sinusoidal, learned, RoPE, ALiBi) have different trade-offs.
Transformers generally need large datasets to outperform simpler architectures. Small-data regimes may favor CNNs or RNNs.
From self-attention in NumPy to using pre-trained Transformers with HuggingFace.
A minimal implementation of scaled dot-product self-attention using only NumPy.
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e_x / e_x.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """
    Scaled dot-product self-attention.
    X: (seq_len, d_model) input embeddings
    W_q, W_k: (d_model, d_k) projection matrices
    W_v: (d_model, d_v) projection matrix
    Returns: (seq_len, d_v) attention output
    """
    Q = X @ W_q  # (seq_len, d_k)
    K = X @ W_k  # (seq_len, d_k)
    V = X @ W_v  # (seq_len, d_v)

    d_k = Q.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)   # (seq_len, seq_len)
    weights = softmax(scores, axis=-1)  # (seq_len, seq_len)
    output = weights @ V                # (seq_len, d_v)
    return output, weights

# Example usage
np.random.seed(42)
seq_len, d_model, d_k, d_v = 5, 16, 8, 8

X = np.random.randn(seq_len, d_model)
W_q = np.random.randn(d_model, d_k) * 0.1
W_k = np.random.randn(d_model, d_k) * 0.1
W_v = np.random.randn(d_model, d_v) * 0.1

output, attn_weights = self_attention(X, W_q, W_k, W_v)
print(f"Input shape: {X.shape}")
print(f"Output shape: {output.shape}")
print(f"Attn weights:\n{np.round(attn_weights, 3)}")
import numpy as np

def multi_head_attention(X, num_heads, d_model):
    """
    Multi-head self-attention from scratch.
    X: (seq_len, d_model)
    """
    assert d_model % num_heads == 0
    d_k = d_model // num_heads

    # Initialize projections for each head
    np.random.seed(0)
    heads_output = []
    all_weights = []
    for h in range(num_heads):
        W_q = np.random.randn(d_model, d_k) * 0.1
        W_k = np.random.randn(d_model, d_k) * 0.1
        W_v = np.random.randn(d_model, d_k) * 0.1

        Q = X @ W_q
        K = X @ W_k
        V = X @ W_v

        scores = (Q @ K.T) / np.sqrt(d_k)
        # Softmax
        e = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = e / e.sum(axis=-1, keepdims=True)

        head_out = weights @ V
        heads_output.append(head_out)
        all_weights.append(weights)

    # Concatenate all heads
    concat = np.concatenate(heads_output, axis=-1)  # (seq_len, d_model)

    # Output projection
    W_o = np.random.randn(d_model, d_model) * 0.1
    output = concat @ W_o
    return output, all_weights

# Example
seq_len, d_model, num_heads = 6, 16, 4
X = np.random.randn(seq_len, d_model)
out, weights = multi_head_attention(X, num_heads, d_model)
print(f"Output shape: {out.shape}")
for i, w in enumerate(weights):
    print(f"Head {i+1} attn shape: {w.shape}")
from transformers import pipeline

# Zero-shot sentiment analysis with a pre-trained model
classifier = pipeline("sentiment-analysis")

texts = [
    "Transformers have completely revolutionized NLP!",
    "The quadratic complexity is a major limitation.",
    "BERT is still great for classification tasks.",
]

results = classifier(texts)
for text, result in zip(texts, results):
    print(f"Text: {text}")
    print(f"  Label: {result['label']}, Score: {result['score']:.4f}\n")
from transformers import pipeline

# Text generation with GPT-2
generator = pipeline("text-generation", model="gpt2")

prompt = "The Transformer architecture revolutionized AI because"
outputs = generator(
    prompt,
    max_length=100,
    num_return_sequences=2,
    temperature=0.8,
    top_p=0.9,
    do_sample=True,
)

for i, output in enumerate(outputs):
    print(f"--- Generation {i+1} ---")
    print(output["generated_text"])
    print()
from transformers import (
    BertTokenizer, BertForSequenceClassification,
    Trainer, TrainingArguments
)
from datasets import load_dataset
import numpy as np

# Load dataset and tokenizer
dataset = load_dataset("imdb")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(
        batch["text"], padding="max_length",
        truncation=True, max_length=256
    )

tokenized = dataset.map(tokenize, batched=True)
tokenized.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# Load pre-trained BERT with classification head
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Training arguments
args = TrainingArguments(
    output_dir="./bert-imdb",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_steps=500,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

def compute_metrics(eval_pred):
    preds = np.argmax(eval_pred.predictions, axis=-1)
    acc = (preds == eval_pred.label_ids).mean()
    return {"accuracy": acc}

trainer = Trainer(
    model=model, args=args,
    train_dataset=tokenized["train"].select(range(5000)),
    eval_dataset=tokenized["test"].select(range(1000)),
    compute_metrics=compute_metrics,
)

trainer.train()
print(trainer.evaluate())
Use a small learning rate (2e-5 to 5e-5) for fine-tuning pre-trained models. Always use warmup (5-10% of total steps). Monitor validation loss to detect overfitting early. For small datasets, consider freezing lower layers and only fine-tuning the top few layers.
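For the layer-freezing tip, parameters can be frozen by name against the `transformers` BERT module layout. This sketch uses a small randomly initialized BERT config so it runs offline; with `bert-base-uncased` (12 layers) you would unfreeze `encoder.layer.10` and `encoder.layer.11` instead.

```python
from transformers import BertConfig, BertForSequenceClassification

# Small random-weight BERT so this sketch needs no download;
# the freezing pattern is identical for pre-trained checkpoints.
config = BertConfig(hidden_size=64, num_hidden_layers=4,
                    num_attention_heads=2, intermediate_size=128)
model = BertForSequenceClassification(config)

# Freeze everything, then unfreeze the top two encoder layers and the head
for param in model.parameters():
    param.requires_grad = False
for name, param in model.named_parameters():
    if any(key in name for key in ("encoder.layer.2", "encoder.layer.3",
                                   "classifier", "pooler")):
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} parameters")
```

Freezing the embeddings and lower layers both regularizes small-data fine-tuning and cuts memory, since no gradients are stored for frozen parameters.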