Transformer Cheat Sheet
Your quick reference for Transformers -- from self-attention and multi-head attention to positional encoding, encoder-decoder architecture, and BERT vs GPT.
Self-attention allows every token to attend to every other token in the sequence, capturing long-range dependencies in a single layer -- unlike RNNs which must propagate information step-by-step through the sequence.
Multiple heads allow the model to attend to information from different representation subspaces simultaneously -- one head might capture syntactic relationships while another captures semantic similarity, and a third captures positional patterns.
Positional encodings are essential because self-attention is permutation-invariant -- without them, the model cannot distinguish "the cat sat on the mat" from "the mat sat on the cat." They are added (not concatenated) to the input embeddings.
The original Transformer uses 6 encoder blocks and 6 decoder blocks. Pre-norm (LayerNorm before sub-layer) is now preferred over post-norm (LayerNorm after) for training stability at scale. Each block maintains the same dimensionality throughout.
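The pre-norm vs post-norm distinction can be sketched in a few lines of NumPy (a minimal illustration; the stand-in linear sub-layer replaces a real attention or feed-forward block):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def post_norm_block(x, sublayer):
    """Original Transformer: LayerNorm after the residual add."""
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    """Modern variant: LayerNorm on the sub-layer input, leaving a
    clean residual path x + f(LN(x)) -- more stable at scale."""
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 8)) * 0.1
f = lambda h: h @ W                 # stand-in sub-layer (same dimensionality)
y_post = post_norm_block(x, f)
y_pre = pre_norm_block(x, f)
```

Both variants keep the (4, 8) shape throughout, matching the constant-dimensionality property noted above; only the position of the normalization differs.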
Q: What is self-attention, and what advantage does it have over recurrence?
A: Self-attention computes a weighted sum of all positions in a sequence for each position, where the weights are determined by the compatibility (dot product) between the query of the current position and the keys of all positions. It allows every token to directly attend to every other token, capturing dependencies regardless of distance in a single computation step -- unlike RNNs which must propagate information sequentially.
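The mechanism above can be sketched in a few lines of NumPy (a minimal single-head illustration; the shapes and random inputs are arbitrary choices, not from the original):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each output row is a weighted sum of V's rows, with weights
    softmax(q_i . k_j / sqrt(d_k)) -- the compatibility scores."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # (n, n) score matrix
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                  # row-wise softmax
    return w @ V, w                                     # weighted sum of values

rng = np.random.default_rng(0)
n, d_k = 4, 8
Q, K, V = rng.normal(size=(3, n, d_k))                  # toy queries/keys/values
out, weights = scaled_dot_product_attention(Q, K, V)
```

Every token's output depends on all `n` positions at once, which is the single-step long-range access the answer describes.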
Q: Why is the dot product scaled by $\sqrt{d_k}$?
A: The dot product $QK^T$ grows in magnitude with the dimension $d_k$. Specifically, if the components of $Q$ and $K$ are independent random variables with mean 0 and variance 1, their dot product has mean 0 and variance $d_k$. Without scaling, large values push softmax into regions with extremely small gradients, making learning difficult. Dividing by $\sqrt{d_k}$ normalizes the variance to 1, keeping softmax in a well-behaved regime.
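The variance claim is easy to verify empirically (a NumPy sketch; $d_k = 512$ and the trial count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d_k, trials = 512, 20000

# Components i.i.d. with mean 0, variance 1 -> dot product has variance d_k.
q = rng.normal(size=(trials, d_k))
k = rng.normal(size=(trials, d_k))
dots = (q * k).sum(axis=1)        # raw dot products: variance ~ d_k
scaled = dots / np.sqrt(d_k)      # after scaling:     variance ~ 1
```

Without the division, softmax inputs with standard deviation $\sqrt{512} \approx 22.6$ would saturate the softmax almost immediately.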
Q: Why use multiple attention heads instead of a single one?
A: Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions. A single attention head averages over all subspaces, which can wash out important patterns. Multiple heads let the model capture diverse relationships simultaneously -- for example, one head might focus on syntactic dependencies, another on coreference, and another on semantic similarity.
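The subspace splitting can be sketched in NumPy (a minimal illustration; the dimensions, initializations, and the absence of masking are simplifying assumptions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Split d_model into num_heads subspaces of size d_model // num_heads,
    attend in each independently, then concatenate and project with Wo."""
    n, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                       # (n, d_model) each
    split = lambda M: M.reshape(n, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)              # (h, n, d_head)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, n, n)
    heads = softmax(scores) @ Vh                           # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # rejoin subspaces
    return concat @ Wo

rng = np.random.default_rng(1)
n, d_model, h = 5, 16, 4
X = rng.normal(size=(n, d_model))
Wq, Wk, Wv, Wo = rng.normal(size=(4, d_model, d_model)) * 0.1
out = multi_head_attention(X, Wq, Wk, Wv, Wo, h)
```

Each head sees only its own `d_head`-dimensional slice, which is what lets different heads specialize on different relationships.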
Q: Why do Transformers need positional encodings?
A: Self-attention is permutation-invariant -- it computes the same output regardless of the order of input tokens. Without positional encodings, the model cannot distinguish "dog bites man" from "man bites dog." Positional encodings inject sequence order information by adding position-dependent vectors to the input embeddings, allowing the model to reason about token positions and relative distances.
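The sinusoidal scheme from the original paper can be written compactly (a NumPy sketch; `max_len` and `d_model` here are arbitrary):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                       # even dims: sine
    pe[:, 1::2] = np.cos(angle)                       # odd dims: cosine
    return pe

pe = sinusoidal_positional_encoding(50, 32)
# Added (not concatenated) to embeddings: X = token_embeddings + pe[:n]
```

The different frequencies per dimension let the model recover both absolute position and relative offsets from the encoding.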
Q: What are the key differences between BERT and GPT?
A: BERT is encoder-only and bidirectional -- it sees all tokens simultaneously using masked language modeling (predict masked tokens from full context). GPT is decoder-only and autoregressive -- it generates tokens left-to-right using causal masking, each position attending only to previous positions. BERT excels at understanding tasks (classification, NER, QA), while GPT excels at generation tasks (text completion, dialogue, code).
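Mechanically, the attention-pattern difference comes down to the mask. A small NumPy illustration (the uniform raw scores are an artificial choice to make the weight patterns obvious):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_mask(n):
    """GPT-style mask: position i may attend only to positions <= i.
    Disallowed entries get -inf so softmax gives them zero weight."""
    return np.triu(np.full((n, n), -np.inf), k=1)

scores = np.zeros((4, 4))                  # uniform raw scores, for illustration
w_gpt = softmax(scores + causal_mask(4))   # lower-triangular weights
w_bert = softmax(scores)                   # full bidirectional weights
```

With uniform scores, BERT-style attention spreads weight over all 4 positions (0.25 each), while the causal variant at position 0 puts all weight on itself.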
Q: What is Layer Normalization, and how does it differ from Batch Normalization?
A: Layer Normalization normalizes across the feature dimension for each individual example: $\text{LN}(x) = \gamma \cdot (x - \mu) / (\sigma + \epsilon) + \beta$. Unlike Batch Normalization (which normalizes across the batch), LayerNorm is independent of batch size, making it suitable for variable-length sequences. It stabilizes training, enables higher learning rates, and is applied around each sub-layer in the Transformer (after the sub-layer in the original post-norm design, before it in the now-common pre-norm variant).
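A minimal NumPy implementation matching the formula above (the epsilon placement follows the formula as written; deep-learning frameworks typically put it inside the square root instead):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row (one token's features) to zero mean and unit
    variance, then apply learned scale gamma and shift beta."""
    mu = x.mean(axis=-1, keepdims=True)       # per-example statistics,
    sigma = x.std(axis=-1, keepdims=True)     # independent of the batch
    return gamma * (x - mu) / (sigma + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 0.0, -10.0, 0.0]])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

Note that the statistics are taken over `axis=-1` (features), not over the batch dimension, which is exactly the property that makes it batch-size independent.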
Q: What is the computational complexity of self-attention, and why does it matter?
A: Self-attention has $O(n^2 \cdot d)$ time and $O(n^2)$ memory complexity, where $n$ is sequence length and $d$ is model dimension. The $n^2$ comes from computing attention scores between every pair of tokens. For a 4096-token sequence, this means ~16.8 million pairwise interactions per layer. This quadratic scaling is the primary bottleneck for processing long sequences and has motivated efficient variants like Linformer, Performer, and FlashAttention.
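The quadratic cost is plain arithmetic (a back-of-envelope sketch; fp32 scores and a single head per layer are simplifying assumptions):

```python
def attention_score_cost(n, bytes_per_el=4):
    """Size of one (n x n) attention-score matrix: pair count and,
    assuming fp32 (4-byte) scores, its memory footprint in bytes."""
    pairs = n * n
    return pairs, pairs * bytes_per_el

pairs_4k, bytes_4k = attention_score_cost(4096)
# 4096^2 = 16,777,216 pairs -> 64 MiB of fp32 scores, per head, per layer
```

Doubling the sequence length to 8192 quadruples both numbers, which is why long-context work targets this term specifically.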
Q: How do Vision Transformers (ViT) apply the Transformer to images?
A: Vision Transformers (ViT) apply the Transformer architecture directly to images by splitting an image into fixed-size patches (e.g., 16x16), flattening each patch into a vector, projecting it linearly, and treating the resulting sequence of patch embeddings like tokens in NLP. A learnable [CLS] token is prepended for classification. ViT achieves competitive or superior accuracy to CNNs when pretrained on large datasets, demonstrating that the Transformer's attention mechanism can replace convolutions for vision.
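Patch extraction and embedding can be sketched in NumPy (the 224x224 input, 16x16 patches, and the d = 192 projection width are illustrative assumptions; the projection and the [CLS] token would be learned parameters in practice):

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping flattened patches:
    a (num_patches, patch_size * patch_size * C) token sequence."""
    H, W, C = image.shape
    P = patch_size
    patches = image.reshape(H // P, P, W // P, P, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)

rng = np.random.default_rng(2)
img = rng.normal(size=(224, 224, 3))
patches = patchify(img, 16)                    # (196, 768): 14*14 patches
W_proj = rng.normal(size=(768, 192)) * 0.02    # linear patch projection
cls = np.zeros((1, 192))                       # [CLS] token (learned in practice)
tokens = np.vstack([cls, patches @ W_proj])    # (197, 192) Transformer input
```

From here the 197-token sequence is processed by a standard Transformer encoder; the [CLS] output row feeds the classification head.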