
Transformer Cheat Sheet

Your quick reference for Transformers -- from self-attention and multi-head attention to positional encoding, encoder-decoder architecture, and BERT vs GPT.

Key Formulas

Scaled Dot-Product Attention:
$$\text{Attention}(Q,K,V) = \text{softmax}\!\Big(\frac{QK^T}{\sqrt{d_k}}\Big)V$$
Multi-Head Attention:
$$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W^O$$
Where each head:
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
Positional Encoding:
$$PE_{(pos,2i)} = \sin\!\Big(\frac{pos}{10000^{2i/d}}\Big)$$
$$PE_{(pos,2i+1)} = \cos\!\Big(\frac{pos}{10000^{2i/d}}\Big)$$
Complexity:
$$O(n^2 \cdot d)$$ per self-attention layer, where $n$ = sequence length, $d$ = model dimension.
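The attention formula above can be sketched in a few lines of NumPy (a minimal illustration, not a production implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise compatibilities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (n, d_v) weighted sum of values

n, d_k = 4, 8
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Note the output has one row per query position: each row is a convex combination of the value vectors.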

Self-Attention

Q, K, V Computation:
$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$
Where:
$X \in \mathbb{R}^{n \times d_{\text{model}}}$, $W^Q, W^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$. Queries ask "what am I looking for?", Keys answer "what do I contain?", Values carry "what information do I provide?".
Attention Weights:
$$\alpha = \text{softmax}\!\Big(\frac{QK^T}{\sqrt{d_k}}\Big) \in \mathbb{R}^{n \times n}$$
Scaling Purpose:
Dividing by $\sqrt{d_k}$ prevents the dot products from growing large as $d_k$ increases. Large dot products push softmax into saturated regions where gradients become vanishingly small.
Masking:
In the decoder, a causal mask sets future positions to $-\infty$ before softmax, preventing the model from attending to tokens that have not yet been generated. Padding masks ignore pad tokens.

Self-attention allows every token to attend to every other token in the sequence, capturing long-range dependencies in a single layer -- unlike RNNs which must propagate information step-by-step through the sequence.
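The causal mask described above can be demonstrated directly: adding $-\infty$ above the diagonal before softmax zeroes out attention to future positions (a toy sketch with uniform scores):

```python
import numpy as np

n = 4
# Causal mask: position i may attend only to positions j <= i.
# Future positions get -inf so softmax assigns them zero weight.
mask = np.triu(np.full((n, n), -np.inf), k=1)

scores = np.zeros((n, n)) + mask          # uniform scores, masked
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
# Row i attends uniformly over the first i+1 positions, 0 elsewhere
```

A padding mask works the same way, except the masked columns are the pad positions rather than the future ones.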

Multi-Head Attention

Linear Projections:
Each head $i$ projects $Q$, $K$, $V$ through learned matrices $W_i^Q \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{\text{model}} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{\text{model}} \times d_v}$, where $d_k = d_v = d_{\text{model}} / h$.
Parallel Computation:
All $h$ heads compute attention in parallel on different subspaces. With $d_{\text{model}} = 512$ and $h = 8$, each head operates on $d_k = 64$ dimensions. Total cost equals single-head attention on the full dimension.
Concatenation:
$$\text{Concat}(\text{head}_1, \dots, \text{head}_h) \in \mathbb{R}^{n \times d_{\text{model}}}$$
Output Projection:
The concatenated output is multiplied by $W^O \in \mathbb{R}^{d_{\text{model}} \times d_{\text{model}}}$ to produce the final output. This mixes information from all heads.

Multiple heads allow the model to attend to information from different representation subspaces simultaneously -- one head might capture syntactic relationships while another captures semantic similarity, and a third captures positional patterns.
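The project-split-attend-concat pipeline above can be sketched with reshapes (a simplified NumPy version; real implementations also batch and mask):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, h):
    # Project once, then split d_model into h heads of size d_k = d_model // h
    n, d_model = X.shape
    d_k = d_model // h
    def split(M):  # (n, d_model) -> (h, n, d_k)
        return M.reshape(n, h, d_k).transpose(1, 0, 2)
    Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    heads = softmax(scores) @ V                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ Wo                                 # W^O mixes across heads

rng = np.random.default_rng(0)
n, d_model, h = 4, 512, 8                              # d_k = 64 per head
X = rng.standard_normal((n, d_model))
Wq, Wk, Wv, Wo = (rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(4))
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
print(out.shape)  # (4, 512)
```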

Positional Encoding

Even Dimensions (sin):
$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)$$
Odd Dimensions (cos):
$$PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)$$
Why Sinusoidal:
For any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$, allowing the model to learn relative positions. Wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$.
Learned Alternatives:
BERT and GPT use learned positional embeddings -- a trainable embedding table indexed by position. RoPE (Rotary Position Embeddings) encodes positions via rotation matrices, enabling extrapolation. ALiBi adds a linear bias to attention scores based on distance.

Positional encodings are essential because self-attention is permutation-invariant -- without them, the model cannot distinguish "the cat sat on the mat" from "the mat sat on the cat." They are added (not concatenated) to the input embeddings.
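The sinusoidal table defined by the two formulas above can be built in vectorized form (a minimal sketch):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(50, 64)
print(pe.shape)   # (50, 64)
print(pe[0, :4])  # position 0: sin(0)=0, cos(0)=1 alternating
```

The table is then *added* to the embedding matrix: `X = embeddings + pe[:n]`.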

Architecture

Encoder Block:
Sub-layer 1: Multi-Head Self-Attention + Residual + LayerNorm. Sub-layer 2: Position-wise Feed-Forward Network (FFN) + Residual + LayerNorm. $\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$. Inner dim is typically $4 \times d_{\text{model}}$.
Decoder Block:
Sub-layer 1: Masked Multi-Head Self-Attention + Residual + LayerNorm (causal mask prevents attending to future tokens). Sub-layer 2: Multi-Head Cross-Attention over encoder output + Residual + LayerNorm. Sub-layer 3: FFN + Residual + LayerNorm.
Residual Connection:
$$\text{LayerNorm}(x + \text{SubLayer}(x))$$
Layer Normalization:
$$\text{LayerNorm}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

The original Transformer uses 6 encoder blocks and 6 decoder blocks. Pre-norm (LayerNorm before sub-layer) is now preferred over post-norm (LayerNorm after) for training stability at scale. Each block maintains the same dimensionality throughout.
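An encoder block's forward pass (post-norm ordering, as in the original paper) can be sketched as follows. The self-attention sub-layer is stubbed out with an identity function to keep the example focused on the residual + LayerNorm + FFN wiring:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature dimension (gamma=1, beta=0 for simplicity)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # Position-wise feed-forward: max(0, xW1 + b1)W2 + b2
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

def encoder_block(x, attn_fn, W1, b1, W2, b2):
    # Post-norm: LayerNorm(x + SubLayer(x)) after each sub-layer
    x = layer_norm(x + attn_fn(x))
    x = layer_norm(x + ffn(x, W1, b1, W2, b2))
    return x

rng = np.random.default_rng(0)
n, d_model, d_ff = 4, 64, 256               # inner dim = 4 * d_model
x = rng.standard_normal((n, d_model))
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)
identity_attn = lambda t: t                 # stand-in for multi-head self-attention
out = encoder_block(x, identity_attn, W1, b1, W2, b2)
print(out.shape)  # (4, 64) -- dimensionality preserved through the block
```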

BERT vs GPT

BERT:
Architecture: Encoder-only (bidirectional). Pretraining: Masked Language Model (MLM) -- randomly mask 15% of tokens, predict them using both left and right context. Next Sentence Prediction (NSP) -- classify if sentence B follows sentence A. Use: Classification, NER, QA, sentence similarity. Fine-tune with task-specific head.
GPT:
Architecture: Decoder-only (autoregressive, left-to-right). Pretraining: Causal Language Modeling (CLM) -- predict the next token given all previous tokens. Uses a causal mask so each position can only attend to earlier positions. Use: Text generation, summarization, translation, code generation, few-shot/zero-shot tasks.
T5 (Text-to-Text):
Architecture: Full encoder-decoder. Pretraining: Span corruption -- mask contiguous spans and predict them. Paradigm: Every task is framed as text-to-text: "translate English to German: ..." or "summarize: ...". Unifies classification, generation, and regression under one framework.
Key Difference:
BERT sees all tokens simultaneously (bidirectional) -- better for understanding tasks. GPT generates one token at a time (autoregressive) -- better for generation tasks. T5 combines both: encoder reads bidirectionally, decoder generates autoregressively.

Pros vs Cons

Pros:

  • Fully parallelizable -- all positions processed simultaneously, unlike sequential RNNs, enabling efficient GPU utilization
  • Captures long-range dependencies in a single layer -- every token directly attends to every other token regardless of distance
  • State-of-the-art across NLP -- machine translation, text generation, question answering, summarization, and virtually all language tasks
  • Scalable -- performance improves predictably with more data, parameters, and compute (scaling laws)
  • Transfer learning -- pretrained models (BERT, GPT) fine-tune effectively on downstream tasks with minimal labeled data
  • Versatile architecture -- adapted to vision (ViT), audio (Whisper), protein folding (AlphaFold), and multimodal tasks

Cons:

  • Quadratic complexity -- self-attention is $O(n^2 \cdot d)$ in sequence length, making long sequences expensive in memory and compute
  • Massive data requirements -- large Transformers need billions of tokens to train effectively; small data leads to overfitting
  • Expensive training -- GPT-3's training compute was estimated at around $4.6M; later frontier models cost far more. Requires large GPU clusters
  • No inherent positional awareness -- requires explicit positional encodings; struggles with length generalization beyond training context
  • Black-box reasoning -- attention weights provide limited interpretability; understanding what the model "knows" remains challenging

Interview Quick-Fire

Q: What is self-attention?

A: Self-attention computes a weighted sum of all positions in a sequence for each position, where the weights are determined by the compatibility (dot product) between the query of the current position and the keys of all positions. It allows every token to directly attend to every other token, capturing dependencies regardless of distance in a single computation step -- unlike RNNs which must propagate information sequentially.

Q: Why scale by $\sqrt{d_k}$?

A: The dot product $QK^T$ grows in magnitude with the dimension $d_k$. Specifically, if the components of $Q$ and $K$ are independent random variables with mean 0 and variance 1, their dot product has mean 0 and variance $d_k$. Without scaling, large values push softmax into regions with extremely small gradients, making learning difficult. Dividing by $\sqrt{d_k}$ normalizes the variance to 1, keeping softmax in a well-behaved regime.
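The variance argument above can be checked empirically: for unit-variance components the dot-product variance grows like $d_k$, and dividing by $\sqrt{d_k}$ restores it to 1 (a quick simulation):

```python
import numpy as np

rng = np.random.default_rng(0)
for d_k in (16, 64, 256):
    # 100k random query/key pairs with mean-0, variance-1 components
    q = rng.standard_normal((100_000, d_k))
    k = rng.standard_normal((100_000, d_k))
    dots = (q * k).sum(axis=-1)
    print(d_k, round(dots.var(), 1), round((dots / np.sqrt(d_k)).var(), 2))
    # unscaled variance is ~d_k; scaled variance stays ~1
```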

Q: What is the purpose of multi-head attention?

A: Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. A single attention head averages over all subspaces, which can wash out important patterns. Multiple heads let the model capture diverse relationships simultaneously -- for example, one head might focus on syntactic dependencies, another on coreference, and another on semantic similarity.

Q: Why are positional encodings needed?

A: Self-attention is permutation-invariant -- it computes the same output regardless of the order of input tokens. Without positional encodings, the model cannot distinguish "dog bites man" from "man bites dog." Positional encodings inject sequence order information by adding position-dependent vectors to the input embeddings, allowing the model to reason about token positions and relative distances.

Q: How does BERT differ from GPT?

A: BERT is encoder-only and bidirectional -- it sees all tokens simultaneously using masked language modeling (predict masked tokens from full context). GPT is decoder-only and autoregressive -- it generates tokens left-to-right using causal masking, each position attending only to previous positions. BERT excels at understanding tasks (classification, NER, QA), while GPT excels at generation tasks (text completion, dialogue, code).

Q: What is Layer Normalization and why is it used?

A: Layer Normalization normalizes across the feature dimension for each individual example: $\text{LN}(x) = \gamma \cdot (x - \mu) / \sqrt{\sigma^2 + \epsilon} + \beta$. Unlike Batch Normalization (which normalizes across the batch), LayerNorm is independent of batch size, making it suitable for variable-length sequences. It stabilizes training, enables higher learning rates, and is applied around each sub-layer in the Transformer (after the sub-layer in the original post-norm design, before it in modern pre-norm variants).
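The axis difference between the two normalizations is easy to see on a tiny batch (illustrative only; learnable $\gamma$, $\beta$ omitted):

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 6.0, 8.0]])   # batch of 2 examples, 3 features

# LayerNorm: normalize each EXAMPLE over its features (axis=-1)
ln = (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + 1e-5)

# BatchNorm (for contrast): normalize each FEATURE over the batch (axis=0)
bn = (x - x.mean(axis=0, keepdims=True)) / np.sqrt(x.var(axis=0, keepdims=True) + 1e-5)

print(np.round(ln, 3))  # each row has mean 0
print(np.round(bn, 3))  # each column has mean 0
```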

Q: What is the complexity of self-attention?

A: Self-attention has $O(n^2 \cdot d)$ time and $O(n^2)$ memory complexity, where $n$ is sequence length and $d$ is model dimension. The $n^2$ comes from computing attention scores between every pair of tokens. For a 4096-token sequence, this means ~16.7 million pairwise interactions per layer. This quadratic scaling is the primary bottleneck for processing long sequences and has motivated efficient variants like Linformer, Performer, and Flash Attention.

Q: What are Vision Transformers (ViT)?

A: Vision Transformers (ViT) apply the Transformer architecture directly to images by splitting an image into fixed-size patches (e.g., 16x16), flattening each patch into a vector, projecting it linearly, and treating the resulting sequence of patch embeddings like tokens in NLP. A learnable [CLS] token is prepended for classification. ViT achieves competitive or superior accuracy to CNNs when pretrained on large datasets, demonstrating that the Transformer's attention mechanism can replace convolutions for vision.
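The patchify step described above is just a reshape; flattening 16x16 patches of a 224x224 RGB image yields the token sequence a ViT consumes (the linear projection and [CLS] token are omitted in this sketch):

```python
import numpy as np

def patchify(img, p):
    # Split an (H, W, C) image into non-overlapping p x p patches,
    # each flattened to a vector of length p*p*C (one "token" per patch)
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    patches = img.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4)   # (H/p, W/p, p, p, C)
    return patches.reshape(-1, p * p * C)        # (num_patches, patch_dim)

img = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(img, 16)
print(tokens.shape)  # (196, 768) -- 14x14 patches, each 16*16*3 = 768 values
```

These 196 patch vectors play exactly the role of word embeddings in the NLP Transformer.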
