Transformer
15 interview questions from basic to advanced
The Transformer is a deep learning architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. It was designed for sequence-to-sequence tasks like machine translation, replacing the recurrent (RNN/LSTM) and convolutional approaches that previously dominated NLP. The key innovation is the self-attention mechanism, which allows every position in the sequence to directly attend to every other position in a single computation step.
The Transformer was introduced to address two fundamental limitations of RNNs: (1) Sequential computation -- RNNs process tokens one at a time, making them slow to train because GPU parallelism cannot be exploited across time steps. (2) Long-range dependency degradation -- information must propagate through many hidden states to travel from one end of a sequence to the other, causing vanishing gradients and information loss. Self-attention solves both: it is fully parallelizable (all positions computed simultaneously) and captures dependencies between any two positions in O(1) path length. The Transformer has since become the foundation for virtually all modern NLP models (BERT, GPT, T5) and has been adapted to vision (ViT), audio (Whisper), and multimodal tasks.
Traditional attention (as used in Bahdanau or Luong attention for seq2seq models) computes attention between two different sequences: the decoder hidden states attend to the encoder hidden states. The queries come from the decoder and the keys/values come from the encoder. This is also called cross-attention -- it relates positions in one sequence to positions in another sequence.
Self-attention computes attention within a single sequence: the queries, keys, and values all come from the same input. Each token in the sequence attends to every other token in the same sequence, computing how much each position should contribute to the representation of every other position. This allows the model to build contextualized representations where the meaning of each token depends on the full sequence context. For example, the word "bank" in "river bank" vs. "bank account" receives different representations because it attends to different surrounding words. Self-attention is the core mechanism in Transformer encoders; Transformer decoders use both self-attention (within the output sequence) and cross-attention (to the encoder output).
Query (Q), Key (K), and Value (V) are three different linear projections of the input, computed as Q = XW^Q, K = XW^K, V = XW^V, where X is the input matrix and W^Q, W^K, W^V are learned weight matrices. The terminology is inspired by information retrieval: imagine a database where you search with a query, match against keys, and retrieve values.
The Query represents "what is this token looking for?" -- it encodes the current position's information need. The Key represents "what does this token contain?" -- it encodes what each position offers. The Value represents "what information does this token provide?" -- it carries the actual content to be aggregated. The attention mechanism computes compatibility scores by taking the dot product of queries and keys (QK^T), applies softmax to get attention weights, then uses these weights to compute a weighted sum of values. Having separate Q, K, V projections (rather than using the raw input) gives the model flexibility to learn different representations for matching (Q, K) versus for content retrieval (V).
Self-attention is fundamentally permutation-equivariant -- it computes pairwise interactions between all tokens with no notion of order. If you shuffle the input sequence, the outputs are simply shuffled the same way: the rows and columns of the attention matrix are permuted accordingly, and each token receives exactly the representation it would have received in the original order. Without positional information, the model literally cannot tell "the cat sat on the mat" from "the mat sat on the cat" -- both produce equivalent representations.
Positional encodings inject sequence order information by adding position-dependent vectors to the input embeddings before they enter the Transformer. The original paper uses sinusoidal functions: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) for even dimensions and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) for odd dimensions. This encoding has the property that the encoding of position pos + k can be expressed as a linear transformation of the encoding of pos, allowing the model to learn relative position relationships. Alternatives include learned positional embeddings (BERT, GPT), which train a separate embedding for each position; Rotary Position Embeddings (RoPE), used in LLaMA, which encode positions via rotation matrices; and ALiBi, which adds a linear bias to attention scores proportional to distance. The choice of positional encoding significantly affects the model's ability to generalize to sequence lengths not seen during training.
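The sinusoidal scheme above can be sketched in a few lines of NumPy (the function name and the example dimensions are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal encodings from "Attention Is All You Need" (d_model assumed even)."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))   # shared angle per sin/cos pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cos
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
# Added (not concatenated) to the token embeddings: x = embeddings + pe[:seq_len]
```

Note that each (sin, cos) pair shares the same frequency; this pairing is what makes a fixed-offset shift expressible as a rotation, i.e. a linear transformation.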
BERT (Bidirectional Encoder Representations from Transformers) uses only the Transformer encoder with bidirectional self-attention. During pretraining, it randomly masks 15% of tokens and predicts them using context from both left and right (Masked Language Model). It also uses Next Sentence Prediction to understand sentence relationships. BERT is designed for understanding tasks: text classification, named entity recognition, question answering, and sentence similarity. It is fine-tuned by adding a task-specific head on top of the pretrained encoder.
GPT (Generative Pre-trained Transformer) uses only the Transformer decoder with causal (left-to-right) masking. Each position can only attend to previous positions, making it autoregressive -- it generates one token at a time, each conditioned on all preceding tokens. GPT is pretrained with Causal Language Modeling (predict the next token). It excels at generation tasks: text completion, dialogue, summarization, translation, and code generation. GPT-3 and later models demonstrated remarkable few-shot and zero-shot capabilities through in-context learning. The key tradeoff: BERT builds richer representations by seeing full context (better for understanding), while GPT can generate text naturally (better for generation). T5 bridges both with a full encoder-decoder that reads bidirectionally and generates autoregressively.
Given an input sequence X of shape (n, d_model) where n is the sequence length: Step 1 -- Compute Q, K, V by multiplying X with learned weight matrices: Q = XW^Q (n, d_k), K = XW^K (n, d_k), V = XW^V (n, d_v). Step 2 -- Compute attention scores by taking the dot product of queries and keys: scores = QK^T, producing an (n, n) matrix where entry (i, j) measures how much position i should attend to position j. Step 3 -- Scale the scores by dividing by sqrt(d_k) to prevent softmax saturation: scaled_scores = scores / sqrt(d_k).
Step 4 -- Optionally apply a mask (in the decoder, set future positions to negative infinity to prevent attending to not-yet-generated tokens). Step 5 -- Apply softmax row-wise to convert scores into probability distributions: attention_weights = softmax(scaled_scores), where each row sums to 1. Step 6 -- Compute the output as a weighted sum of values: output = attention_weights * V, shape (n, d_v). Each output position is a weighted combination of all value vectors, where the weights reflect the relevance of each position. The total computation is O(n^2 * d_k) for the QK^T multiplication and O(n^2 * d_v) for the attention-value multiplication, giving overall O(n^2 * d) complexity.
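The six steps above can be written directly in NumPy; this is a minimal sketch with illustrative shapes and names, not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Steps 2-6: scores, scaling, optional mask, softmax, weighted sum of values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n) scaled compatibility scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights                # output (n, d_v), weights (n, n)

# Step 1: project the input into Q, K, V with learned matrices (random stand-ins here)
rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Each row of `weights` is the probability distribution from Step 5, and `out` row i is the Step 6 weighted combination of all value vectors for position i.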
A single attention head computes one set of attention weights across the full d_model dimension. This means it must compress all types of relationships (syntactic, semantic, positional, coreference) into a single attention distribution, which tends to average out distinct relationship types. Multi-head attention addresses this by splitting the model dimension into h parallel heads, each operating on a d_k = d_model/h dimensional subspace with its own learned projections.
Each head can specialize in capturing different types of relationships. Research has shown that different heads in trained models indeed learn distinct patterns: some heads track syntactic dependencies (subject-verb agreement), others handle coreference resolution (pronoun-antecedent), some capture positional patterns (attending to adjacent tokens), and others learn semantic similarity. The total computational cost is the same as single-head attention on the full dimension because each head operates on a proportionally smaller subspace: h heads of dimension d_k = d_model/h cost h * O(n^2 * d_k) = O(n^2 * d_model). The concatenated head outputs are projected through W^O to produce the final output, allowing information mixing across heads. This parallel specialization with equivalent cost makes multi-head attention strictly more expressive than single-head attention.
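The split-attend-concat-project structure can be sketched as follows (a simplified NumPy version; real implementations add masking, dropout, and batching):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Split d_model into h subspaces, attend per head, concatenate, project via W_o."""
    n, d_model = X.shape
    d_k = d_model // h
    # Project, then reshape to (h, n, d_k): each head gets its own d_k-dim subspace
    Q = (X @ W_q).reshape(n, h, d_k).transpose(1, 0, 2)
    K = (X @ W_k).reshape(n, h, d_k).transpose(1, 0, 2)
    V = (X @ W_v).reshape(n, h, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n): one matrix per head
    heads = softmax(scores) @ V                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate head outputs
    return concat @ W_o                                # W_o mixes info across heads

rng = np.random.default_rng(0)
n, d_model, h = 6, 32, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
```

The cost argument is visible in the shapes: the per-head `scores` tensor is (h, n, n) over d_k-dim subspaces, totaling the same O(n² · d_model) work as one full-dimension head.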
Masked self-attention applies a causal mask to the attention scores before softmax, setting all entries where the key position is later than the query position to negative infinity. This means that after softmax, these positions receive zero attention weight. For a sequence of length n, the attention matrix becomes lower-triangular: position 1 can only attend to position 1, position 2 can attend to positions 1-2, position t can attend to positions 1 through t.
This is essential for autoregressive generation in the decoder. During training, the decoder receives the full target sequence at once (for parallelism), but each position must predict the next token without seeing future tokens -- otherwise the model would simply copy the answer. The causal mask ensures that position t's prediction of token t+1 only has access to tokens 1 through t, preserving the autoregressive property during parallel training. During inference (generation), the mask is implicitly satisfied because the model generates one token at a time and only past tokens exist. Without masking, the decoder would "cheat" during training by looking ahead at future tokens, and the model would fail to generate coherent text at inference time because it was never trained to predict without future context.
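The lower-triangular mask described above is a one-liner; this sketch shows that after softmax every row puts zero weight on future positions:

```python
import numpy as np

n = 4
# Entry (i, j) is True iff query position i may attend to key position j (j <= i)
causal_mask = np.tril(np.ones((n, n), dtype=bool))

scores = np.random.default_rng(0).normal(size=(n, n))
masked = np.where(causal_mask, scores, -np.inf)  # future positions -> -inf
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
# Every row t now has zero weight on positions > t, so position t's
# output (and its prediction of token t+1) sees only tokens 1..t.
```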
Residual connections add the input of each sub-layer directly to its output: output = x + SubLayer(x). This creates identity shortcut paths through the network, ensuring that gradients can flow directly from the output back to earlier layers without being attenuated by the transformations in between. Without residual connections, training deep Transformers (12+ layers) becomes extremely difficult due to vanishing gradients. The sub-layer only needs to learn the residual -- the difference between the desired output and the input -- which is typically easier than learning the full transformation.
Layer Normalization normalizes each example independently across the feature dimension: LN(x) = gamma * (x - mean) / (std + epsilon) + beta, where gamma and beta are learned parameters. Unlike Batch Normalization (which normalizes across the batch dimension), LayerNorm is independent of batch size, making it suitable for variable-length sequences and small batches. It stabilizes training by keeping activations in a well-conditioned range, enables higher learning rates, and reduces sensitivity to initialization. The original Transformer uses post-norm (LayerNorm after the residual connection): LN(x + SubLayer(x)). Modern large models use pre-norm (LayerNorm before the sub-layer): x + SubLayer(LN(x)), which provides better training stability at scale because the residual path carries unnormalized values, maintaining gradient magnitude through deep networks.
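The post-norm vs. pre-norm distinction is easiest to see in code; this is a minimal sketch with a random stand-in sub-layer:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's features independently (not across the batch)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

def post_norm_block(x, sublayer, gamma, beta):
    return layer_norm(x + sublayer(x), gamma, beta)   # original Transformer: LN(x + SubLayer(x))

def pre_norm_block(x, sublayer, gamma, beta):
    return x + sublayer(layer_norm(x, gamma, beta))   # modern LLMs: x + SubLayer(LN(x))

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(4, d))
gamma, beta = np.ones(d), np.zeros(d)
ffn = lambda h: np.maximum(h @ rng.normal(size=(d, d)), 0)  # stand-in sub-layer (ReLU FFN)
y = pre_norm_block(x, ffn, gamma, beta)
```

In the pre-norm version the residual path (`x + ...`) never passes through LayerNorm, which is exactly the unnormalized identity path that keeps gradients flowing in deep stacks.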
BERT fine-tuning adapts the pretrained encoder to specific tasks by adding a lightweight task-specific head on top and training the entire model end-to-end with labeled data. For single-sentence classification (e.g., sentiment analysis), the [CLS] token's final hidden state is passed through a linear layer + softmax to predict the class. For token-level tasks (e.g., NER), each token's final hidden state is passed through a linear classifier. For sentence-pair tasks (e.g., NLI, QA), both sentences are concatenated with a [SEP] token, and the [CLS] representation classifies the relationship.
Fine-tuning typically uses a much lower learning rate than pretraining (2e-5 to 5e-5 vs. 1e-4) to avoid destroying pretrained features. Training runs for only 2-4 epochs on the downstream dataset. The entire model (all encoder layers + task head) is updated, not just the head -- this is what makes fine-tuning so effective, as the pretrained representations adapt to the specific task. Practical considerations: use warmup (linear increase over 6-10% of steps), weight decay for regularization, and evaluate on a dev set to prevent overfitting. For very small datasets (under 1,000 examples), freezing lower layers and only fine-tuning the top layers + head can prevent overfitting. The key insight is that BERT's pretrained representations are already rich enough that even simple linear classifiers on top achieve strong results, and fine-tuning further adapts these representations to the specific task distribution.
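The classification head itself is tiny; this sketch shows the linear-layer-plus-softmax on the [CLS] hidden state (the shapes follow BERT-base; the weights and the 3-class setup are hypothetical stand-ins, and in practice the head is trained jointly with the encoder):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

hidden_size, num_classes = 768, 3          # BERT-base hidden size; 3-way sentiment (assumed)
rng = np.random.default_rng(0)
W_head = rng.normal(scale=0.02, size=(hidden_size, num_classes))  # task head, trained from scratch
b_head = np.zeros(num_classes)

cls_hidden = rng.normal(size=(1, hidden_size))  # stand-in for the [CLS] final hidden state
probs = softmax(cls_hidden @ W_head + b_head)   # predicted class distribution
```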
Self-attention computes QK^T, a matrix multiplication between Q (n × d_k) and K^T (d_k × n), producing an n × n attention matrix. This requires O(n² · d_k) operations and O(n²) memory to store the full attention matrix. For a 4,096-token sequence, this means ~16.7 million entries per head per layer. For a 100,000-token document, it becomes ~10 billion entries -- far exceeding GPU memory limits. This quadratic scaling is the primary bottleneck for processing long sequences with standard Transformers.
Approaches to reduce this complexity: (1) Sparse attention patterns -- Longformer and BigBird use a combination of local sliding window attention (each token attends to k nearest neighbors, O(n·k)) plus global attention on selected tokens. (2) Linear attention -- Performer and Linear Transformers approximate softmax attention using random feature maps or kernel methods, reducing to O(n·d) by computing KV first (d×d matrix) then multiplying with Q. (3) Low-rank projection -- Linformer projects keys and values to a lower dimension k, reducing the attention matrix to n×k instead of n×n. (4) Hardware-aware optimization -- Flash Attention does not change the mathematical computation but reorganizes memory access patterns to avoid materializing the full n×n attention matrix in GPU HBM, using tiling and recomputation in fast SRAM. This achieves 2-4x wall-clock speedup and reduces memory from O(n²) to O(n). (5) State-space models -- Mamba and similar architectures replace attention entirely with recurrence-like mechanisms that process sequences in O(n) time. The tradeoff: exact attention variants (Flash Attention) maintain full quality, approximate methods sacrifice some accuracy for speed, and SSMs change the modeling paradigm entirely.
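The reordering behind linear attention (approach 2) is just associativity of matrix multiplication. Real linear attention replaces softmax with feature maps φ(Q), φ(K); this sketch drops softmax entirely to isolate the complexity argument:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 32
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# Quadratic order: form the (n, n) matrix first -- O(n^2 * d) time, O(n^2) memory
out_quadratic = (Q @ K.T) @ V

# Linear order: form K^T V, a (d, d) matrix, first -- O(n * d^2) time, O(d^2) memory
out_linear = Q @ (K.T @ V)

assert np.allclose(out_quadratic, out_linear)  # identical results by associativity
```

Softmax breaks this associativity (it is applied to the full (n, n) matrix), which is why kernel approximations are needed to recover the linear-time ordering.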
Scaling laws (Kaplan et al., 2020) empirically demonstrate that Transformer performance follows smooth power-law relationships with three factors: model parameters (N), dataset size (D), and compute budget (C). Specifically, the cross-entropy loss L decreases predictably as: L(N) proportional to N^(-0.076), L(D) proportional to D^(-0.095), and L(C) proportional to C^(-0.050), when each factor is scaled independently with the others non-bottlenecked. These relationships hold across many orders of magnitude and are remarkably consistent across different model architectures and datasets.
Practical implications: (1) Predictability -- you can estimate the performance of a 10x larger model by training smaller models and extrapolating. This allows organizations to decide whether to invest in training before committing millions of dollars. (2) Compute allocation -- the original Kaplan scaling laws suggested training larger models on less data was optimal, but Chinchilla scaling (Hoffmann et al., 2022) revised this, showing that parameters and data should scale roughly equally for compute-optimal training. Chinchilla (70B parameters, 1.4T tokens) outperformed Gopher (280B parameters, 300B tokens) despite 4x fewer parameters, because it used 4.7x more training data. (3) Emergent abilities -- certain capabilities (chain-of-thought reasoning, few-shot learning, code generation) appear to emerge suddenly at specific scale thresholds rather than improving gradually. (4) Architecture is secondary -- within the Transformer family, scaling laws are largely independent of specific architectural choices (number of layers vs. width, attention heads). This suggests that scale of data and compute matters more than architectural innovations for pushing the performance frontier.
Transformers use self-attention with O(n²) complexity but allow every token to directly attend to every other token, providing maximum flexibility in modeling dependencies. State Space Models (SSMs) like Mamba process sequences through a recurrence with O(n) complexity -- each new token updates a fixed-size hidden state that summarizes all past context, similar to RNNs but with carefully designed parameterizations (structured state matrices, selective gating) that enable stable long-range memory and efficient parallel training via the parallel scan algorithm.
Key tradeoffs: (1) Complexity -- SSMs are O(n) in both time and memory vs. O(n²) for attention, making them dramatically faster and more memory-efficient for long sequences (100K+ tokens). (2) Expressiveness -- Transformers can represent arbitrary pairwise interactions; SSMs compress all past context into a fixed-size state, which may lose information. However, Mamba's selective state space mechanism (input-dependent gating of state transitions) significantly narrows this gap. (3) Inference efficiency -- SSMs are naturally autoregressive with O(1) per-step generation (update state, emit token), while Transformers require recomputing attention over all past tokens at each step (mitigated by KV caching). (4) Training parallelism -- both are parallelizable, but attention has a simpler parallelization story. (5) In-context learning -- Transformers excel at in-context learning (few-shot prompting), which relies on attention's ability to dynamically route information; SSMs' compressed state makes this harder. Current trend: hybrid architectures (Jamba, Zamba) combine attention layers with Mamba layers, using attention for tasks requiring precise retrieval and SSMs for long-range processing efficiency.
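The fixed-size-state recurrence can be sketched minimally as below. This is the non-selective linear SSM skeleton only; Mamba additionally makes the transition parameters input-dependent (selective) and computes the scan in parallel:

```python
import numpy as np

def linear_ssm(x, A, B, C):
    """h_t = A h_{t-1} + B x_t; y_t = C h_t.
    The fixed-size state h summarizes all past context: O(1) memory per step."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:              # O(n) sequential scan over the sequence
        h = A @ h + B * x_t    # update the state with the new input
        ys.append(C @ h)       # emit the output for this step
    return np.array(ys)

rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)            # stable (decaying) state transition
B = rng.normal(size=4)
C = rng.normal(size=4)
y = linear_ssm(rng.normal(size=100), A, B, C)
```

The O(1)-per-step generation claim is visible here: emitting the next token only touches `h`, never the full history, whereas attention must revisit every cached key and value.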
Vision Transformer (ViT) applies the standard Transformer encoder directly to images by converting 2D images into 1D sequences of patch embeddings. Given an image of size H × W × C (e.g., 224 × 224 × 3), it is divided into a grid of non-overlapping patches of size P × P (e.g., 16 × 16), producing N = (H/P) × (W/P) = 196 patches. Each patch is flattened into a vector of dimension P² · C = 768, then projected through a learned linear layer (the patch embedding) to produce a d_model-dimensional embedding. A learnable [CLS] token is prepended to the sequence, and learned positional embeddings (one per patch position + [CLS]) are added. The resulting sequence of N+1 = 197 embeddings is fed into a standard Transformer encoder.
Key design decisions and tradeoffs: (1) Patch size determines the resolution-computation tradeoff -- halving the patch size (16 × 16 to 8 × 8) produces 4x more tokens (a 4x longer sequence) but captures finer details. (2) ViT requires large-scale pretraining (ImageNet-21K or JFT-300M) to outperform CNNs; on ImageNet alone, it underperforms due to lack of the inductive biases (locality, translation equivariance) that CNNs provide through convolutions. (3) The [CLS] token aggregates global information through attention, and its final representation is used for classification. (4) Positional embeddings capture spatial layout -- interpolating the learned positional embeddings enables using ViT at resolutions different from the training resolution. (5) Variants: DeiT uses knowledge distillation from CNNs to train ViT efficiently on ImageNet alone. Swin Transformer introduces shifted window attention for hierarchical features and linear complexity. MAE pretrains by masking 75% of patches and reconstructing them, achieving strong results with less labeled data.
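The patchify-and-embed step can be written as a pair of reshapes; this sketch reproduces the 224/16 numbers above (the embedding dimension 512 and weight matrix are illustrative):

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping P x P patches."""
    H, W, C = image.shape
    P = patch_size
    patches = image.reshape(H // P, P, W // P, P, C)  # grid of P x P blocks
    patches = patches.transpose(0, 2, 1, 3, 4)        # (H/P, W/P, P, P, C)
    return patches.reshape(-1, P * P * C)             # (N, P^2 * C)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
tokens = patchify(image, 16)          # (196, 768): N = 14 x 14, P^2 * C = 768
d_model = 512                          # illustrative embedding size
W_embed = rng.normal(size=(768, d_model))
embeddings = tokens @ W_embed          # patch embeddings; prepend [CLS], add positions
```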
Flash Attention (Dao et al., 2022) is a hardware-aware exact attention algorithm that computes the same mathematical result as standard self-attention but reorganizes the computation to minimize GPU memory reads/writes. The key insight is that standard attention is bottlenecked by memory bandwidth, not compute. The standard approach materializes the full n × n attention matrix in GPU HBM (high bandwidth memory, slow), reads it back for softmax, stores the result, then reads again for the value multiplication. This requires O(n²) HBM reads/writes, which dominates wall-clock time.
Flash Attention uses a tiling strategy: it divides Q, K, V into blocks that fit in GPU SRAM (fast on-chip memory, ~20 MB on an A100 vs. 40-80 GB HBM). For each block of queries, it iterates over blocks of keys and values, computing partial attention scores in SRAM, updating a running softmax (using the online softmax trick to maintain numerical stability without materializing the full row), and accumulating the output. The full n × n attention matrix is never materialized in HBM. Results: (1) Memory: O(n) instead of O(n²) -- enables processing sequences that would otherwise OOM. (2) Speed: 2-4x faster wall-clock time by reducing HBM accesses from O(n²) to O(n²d²/M), where M is the SRAM size -- far fewer in practice. (3) Exact: no approximation -- it produces the same result as standard attention, up to floating-point rounding from the reordered summation. (4) Backward pass: uses recomputation instead of storing the attention matrix for the backward pass, trading compute for memory. Flash Attention 2 further optimized parallelism across sequence blocks and heads. Flash Attention has become the default attention implementation in most modern Transformer training frameworks.
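The online softmax trick at the heart of the tiling can be demonstrated on a single attention row; this sketch processes the row in blocks, keeping only a running max, denominator, and output, and matches the full-row computation (it illustrates the math only -- the real kernel operates on tiles in SRAM):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, block_size):
    """softmax(scores) @ values, computed blockwise without the full row in memory."""
    m = -np.inf                    # running max (for numerical stability)
    denom = 0.0                    # running softmax denominator
    out = np.zeros(values.shape[-1])
    for start in range(0, len(scores), block_size):
        s = scores[start:start + block_size]
        v = values[start:start + block_size]
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)          # rescale earlier partial results
        p = np.exp(s - m_new)              # unnormalized weights for this block
        denom = denom * scale + p.sum()
        out = out * scale + p @ v
        m = m_new
    return out / denom

rng = np.random.default_rng(0)
scores, V = rng.normal(size=256), rng.normal(size=(256, 8))
blocked = online_softmax_weighted_sum(scores, V, block_size=32)
full = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ V
```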