Transformer
15 interview questions from basic to advanced
The Transformer is a deep learning architecture introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. It was designed for sequence-to-sequence tasks like machine translation, replacing the recurrent (RNN/LSTM) and convolutional approaches that previously dominated NLP. The key innovation is the self-attention mechanism, which allows every position in the sequence to directly attend to every other position in a single computation step.
The Transformer was introduced to address two fundamental limitations of RNNs: (1) Sequential computation -- RNNs process tokens one at a time, making them slow to train because GPU parallelism cannot be exploited across time steps. (2) Long-range dependency degradation -- information must propagate through many hidden states to travel from one end of a sequence to the other, causing vanishing gradients and information loss. Self-attention solves both: it is fully parallelizable (all positions computed simultaneously) and captures dependencies between any two positions in O(1) path length. The Transformer has since become the foundation for virtually all modern NLP models (BERT, GPT, T5) and has been adapted to vision (ViT), audio (Whisper), and multimodal tasks.
Traditional attention (as used in Bahdanau or Luong attention for seq2seq models) computes attention between two different sequences: the decoder hidden states attend to the encoder hidden states. The queries come from the decoder and the keys/values come from the encoder. This is also called cross-attention -- it relates positions in one sequence to positions in another sequence.
Self-attention computes attention within a single sequence: the queries, keys, and values all come from the same input. Each token in the sequence attends to every other token in the same sequence, computing how much each position should contribute to the representation of every other position. This allows the model to build contextualized representations where the meaning of each token depends on the full sequence context. For example, the word "bank" in "river bank" vs. "bank account" receives different representations because it attends to different surrounding words. Self-attention is the core mechanism in Transformer encoders; Transformer decoders use both self-attention (within the output sequence) and cross-attention (to the encoder output).
Query (Q), Key (K), and Value (V) are three different linear projections of the input, computed as Q = XW^Q, K = XW^K, V = XW^V, where X is the input matrix and W^Q, W^K, W^V are learned weight matrices. The terminology is inspired by information retrieval: imagine a database where you search with a query, match against keys, and retrieve values.
The Query represents "what is this token looking for?" -- it encodes the current position's information need. The Key represents "what does this token contain?" -- it encodes what each position offers. The Value represents "what information does this token provide?" -- it carries the actual content to be aggregated. The attention mechanism computes compatibility scores by taking the dot product of queries and keys (QK^T), applies softmax to get attention weights, then uses these weights to compute a weighted sum of values. Having separate Q, K, V projections (rather than using the raw input) gives the model flexibility to learn different representations for matching (Q, K) versus for content retrieval (V).
Self-attention is fundamentally permutation-equivariant -- it computes pairwise interactions between all tokens with no notion of order. If you shuffle the input sequence, the outputs are simply shuffled the same way: the rows and columns of the attention matrix are permuted accordingly, and each token receives exactly the representation it would have received in the original order. Without positional information, the model literally cannot tell "the cat sat on the mat" from "the mat sat on the cat" -- both produce equivalent representations.
Positional encodings inject sequence order information by adding position-dependent vectors to the input embeddings before they enter the Transformer. The original paper uses sinusoidal functions: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) for even dimensions and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)) for odd dimensions. This encoding has the property that the encoding of position pos + k can be expressed as a linear transformation of the encoding of pos, allowing the model to learn relative position relationships. Alternatives include learned positional embeddings (BERT, GPT), which train a separate embedding for each position; Rotary Position Embeddings (RoPE), used in LLaMA, which encode positions via rotation matrices; and ALiBi, which adds a linear bias to attention scores proportional to distance. The choice of positional encoding significantly affects the model's ability to generalize to sequence lengths not seen during training.
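The sinusoidal scheme above can be sketched in a few lines of NumPy (the function name and the example dimensions are illustrative):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal encodings from "Attention Is All You Need" (d_model assumed even)."""
    positions = np.arange(max_len)[:, None]            # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model/2)
    angles = positions / (10000 ** (dims / d_model))   # shared angle per sin/cos pair
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cos
    return pe

pe = sinusoidal_positional_encoding(max_len=128, d_model=64)
# Added (not concatenated) to the token embeddings: x = embeddings + pe[:seq_len]
```

Note that each (sin, cos) pair shares the same frequency; this pairing is what makes a fixed-offset shift expressible as a rotation, i.e. a linear transformation.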
BERT (Bidirectional Encoder Representations from Transformers) uses only the Transformer encoder with bidirectional self-attention. During pretraining, it randomly masks 15% of tokens and predicts them using context from both left and right (Masked Language Model). It also uses Next Sentence Prediction to understand sentence relationships. BERT is designed for understanding tasks: text classification, named entity recognition, question answering, and sentence similarity. It is fine-tuned by adding a task-specific head on top of the pretrained encoder.
GPT (Generative Pre-trained Transformer) uses only the Transformer decoder with causal (left-to-right) masking. Each position can only attend to previous positions, making it autoregressive -- it generates one token at a time, each conditioned on all preceding tokens. GPT is pretrained with Causal Language Modeling (predict the next token). It excels at generation tasks: text completion, dialogue, summarization, translation, and code generation. GPT-3 and later models demonstrated remarkable few-shot and zero-shot capabilities through in-context learning. The key tradeoff: BERT builds richer representations by seeing full context (better for understanding), while GPT can generate text naturally (better for generation). T5 bridges both with a full encoder-decoder that reads bidirectionally and generates autoregressively.
Given an input sequence X of shape (n, d_model) where n is the sequence length: Step 1 -- Compute Q, K, V by multiplying X with learned weight matrices: Q = XW^Q (n, d_k), K = XW^K (n, d_k), V = XW^V (n, d_v). Step 2 -- Compute attention scores by taking the dot product of queries and keys: scores = QK^T, producing an (n, n) matrix where entry (i, j) measures how much position i should attend to position j. Step 3 -- Scale the scores by dividing by sqrt(d_k) to prevent softmax saturation: scaled_scores = scores / sqrt(d_k).
Step 4 -- Optionally apply a mask (in the decoder, set future positions to negative infinity to prevent attending to not-yet-generated tokens). Step 5 -- Apply softmax row-wise to convert scores into probability distributions: attention_weights = softmax(scaled_scores), where each row sums to 1. Step 6 -- Compute the output as a weighted sum of values: output = attention_weights * V, shape (n, d_v). Each output position is a weighted combination of all value vectors, where the weights reflect the relevance of each position. The total computation is O(n^2 * d_k) for the QK^T multiplication and O(n^2 * d_v) for the attention-value multiplication, giving overall O(n^2 * d) complexity.
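The six steps above can be written directly in NumPy; this is a minimal sketch with illustrative shapes and names, not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Steps 2-6: scores, scaling, optional mask, softmax, weighted sum of values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n) scaled compatibility scores
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # masked positions get ~zero weight
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights                # output (n, d_v), weights (n, n)

# Step 1: project the input into Q, K, V with learned matrices (random stand-ins here)
rng = np.random.default_rng(0)
n, d_model, d_k = 5, 16, 8
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
```

Each row of `weights` is the probability distribution from Step 5, and `out` row i is the Step 6 weighted combination of all value vectors for position i.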
A single attention head computes one set of attention weights across the full d_model dimension. This means it must compress all types of relationships (syntactic, semantic, positional, coreference) into a single attention distribution, which tends to average out distinct relationship types. Multi-head attention addresses this by splitting the model dimension into h parallel heads, each operating on a d_k = d_model/h dimensional subspace with its own learned projections.
Each head can specialize in capturing different types of relationships. Research has shown that different heads in trained models indeed learn distinct patterns: some heads track syntactic dependencies (subject-verb agreement), others handle coreference resolution (pronoun-antecedent), some capture positional patterns (attending to adjacent tokens), and others learn semantic similarity. The total computational cost is the same as single-head attention on the full dimension because each head operates on a proportionally smaller subspace: h heads of dimension d_k = d_model/h cost h * O(n^2 * d_k) = O(n^2 * d_model). The concatenated head outputs are projected through W^O to produce the final output, allowing information mixing across heads. This parallel specialization with equivalent cost makes multi-head attention strictly more expressive than single-head attention.
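The split-attend-concat-project structure can be sketched as follows (a simplified NumPy version; real implementations add masking, dropout, and batching):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """Split d_model into h subspaces, attend per head, concatenate, project via W_o."""
    n, d_model = X.shape
    d_k = d_model // h
    # Project, then reshape to (h, n, d_k): each head gets its own d_k-dim subspace
    Q = (X @ W_q).reshape(n, h, d_k).transpose(1, 0, 2)
    K = (X @ W_k).reshape(n, h, d_k).transpose(1, 0, 2)
    V = (X @ W_v).reshape(n, h, d_k).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n): one matrix per head
    heads = softmax(scores) @ V                        # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)  # concatenate head outputs
    return concat @ W_o                                # W_o mixes info across heads

rng = np.random.default_rng(0)
n, d_model, h = 6, 32, 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
```

The cost argument is visible in the shapes: the per-head `scores` tensor is (h, n, n) over d_k-dim subspaces, totaling the same O(n² · d_model) work as one full-dimension head.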
Masked self-attention applies a causal mask to the attention scores before softmax, setting all entries where the key position is later than the query position to negative infinity. This means that after softmax, these positions receive zero attention weight. For a sequence of length n, the attention matrix becomes lower-triangular: position 1 can only attend to position 1, position 2 can attend to positions 1-2, position t can attend to positions 1 through t.
This is essential for autoregressive generation in the decoder. During training, the decoder receives the full target sequence at once (for parallelism), but each position must predict the next token without seeing future tokens -- otherwise the model would simply copy the answer. The causal mask ensures that position t's prediction of token t+1 only has access to tokens 1 through t, preserving the autoregressive property during parallel training. During inference (generation), the mask is implicitly satisfied because the model generates one token at a time and only past tokens exist. Without masking, the decoder would "cheat" during training by looking ahead at future tokens, and the model would fail to generate coherent text at inference time because it was never trained to predict without future context.
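The lower-triangular mask described above is a one-liner; this sketch shows that after softmax every row puts zero weight on future positions:

```python
import numpy as np

n = 4
# Entry (i, j) is True iff query position i may attend to key position j (j <= i)
causal_mask = np.tril(np.ones((n, n), dtype=bool))

scores = np.random.default_rng(0).normal(size=(n, n))
masked = np.where(causal_mask, scores, -np.inf)  # future positions -> -inf
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
# Every row t now has zero weight on positions > t, so position t's
# output (and its prediction of token t+1) sees only tokens 1..t.
```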
Residual connections add the input of each sub-layer directly to its output: output = x + SubLayer(x). This creates identity shortcut paths through the network, ensuring that gradients can flow directly from the output back to earlier layers without being attenuated by the transformations in between. Without residual connections, training deep Transformers (12+ layers) becomes extremely difficult due to vanishing gradients. The sub-layer only needs to learn the residual -- the difference between the desired output and the input -- which is typically easier than learning the full transformation.
Layer Normalization normalizes each example independently across the feature dimension: LN(x) = gamma * (x - mean) / (std + epsilon) + beta, where gamma and beta are learned parameters. Unlike Batch Normalization (which normalizes across the batch dimension), LayerNorm is independent of batch size, making it suitable for variable-length sequences and small batches. It stabilizes training by keeping activations in a well-conditioned range, enables higher learning rates, and reduces sensitivity to initialization. The original Transformer uses post-norm (LayerNorm after the residual connection): LN(x + SubLayer(x)). Modern large models use pre-norm (LayerNorm before the sub-layer): x + SubLayer(LN(x)), which provides better training stability at scale because the residual path carries unnormalized values, maintaining gradient magnitude through deep networks.
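The post-norm vs. pre-norm distinction is easiest to see in code; this is a minimal sketch with a random stand-in sub-layer:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's features independently (not across the batch)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

def post_norm_block(x, sublayer, gamma, beta):
    return layer_norm(x + sublayer(x), gamma, beta)   # original Transformer: LN(x + SubLayer(x))

def pre_norm_block(x, sublayer, gamma, beta):
    return x + sublayer(layer_norm(x, gamma, beta))   # modern LLMs: x + SubLayer(LN(x))

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(4, d))
gamma, beta = np.ones(d), np.zeros(d)
ffn = lambda h: np.maximum(h @ rng.normal(size=(d, d)), 0)  # stand-in sub-layer (ReLU FFN)
y = pre_norm_block(x, ffn, gamma, beta)
```

In the pre-norm version the residual path (`x + ...`) never passes through LayerNorm, which is exactly the unnormalized identity path that keeps gradients flowing in deep stacks.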
BERT fine-tuning adapts the pretrained encoder to specific tasks by adding a lightweight task-specific head on top and training the entire model end-to-end with labeled data. For single-sentence classification (e.g., sentiment analysis), the [CLS] token's final hidden state is passed through a linear layer + softmax to predict the class. For token-level tasks (e.g., NER), each token's final hidden state is passed through a linear classifier. For sentence-pair tasks (e.g., NLI, QA), both sentences are concatenated with a [SEP] token, and the [CLS] representation classifies the relationship.
Fine-tuning typically uses a much lower learning rate than pretraining (2e-5 to 5e-5 vs. 1e-4) to avoid destroying pretrained features. Training runs for only 2-4 epochs on the downstream dataset. The entire model (all encoder layers + task head) is updated, not just the head -- this is what makes fine-tuning so effective, as the pretrained representations adapt to the specific task. Practical considerations: use warmup (linear increase over 6-10% of steps), weight decay for regularization, and evaluate on a dev set to prevent overfitting. For very small datasets (under 1,000 examples), freezing lower layers and only fine-tuning the top layers + head can prevent overfitting. The key insight is that BERT's pretrained representations are already rich enough that even simple linear classifiers on top achieve strong results, and fine-tuning further adapts these representations to the specific task distribution.
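The classification head itself is tiny; this sketch shows the linear-layer-plus-softmax on the [CLS] hidden state (the shapes follow BERT-base; the weights and the 3-class setup are hypothetical stand-ins, and in practice the head is trained jointly with the encoder):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

hidden_size, num_classes = 768, 3          # BERT-base hidden size; 3-way sentiment (assumed)
rng = np.random.default_rng(0)
W_head = rng.normal(scale=0.02, size=(hidden_size, num_classes))  # task head, trained from scratch
b_head = np.zeros(num_classes)

cls_hidden = rng.normal(size=(1, hidden_size))  # stand-in for the [CLS] final hidden state
probs = softmax(cls_hidden @ W_head + b_head)   # predicted class distribution
```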
Self-attention computes QK^T, a matrix multiplication between Q (n × d_k) and K^T (d_k × n), producing an n × n attention matrix. This requires O(n² · d_k) operations and O(n²) memory to store the full attention matrix. For a 4,096-token sequence, this means ~16.7 million entries per head per layer. For a 100,000-token document, it becomes ~10 billion entries -- far exceeding GPU memory limits. This quadratic scaling is the primary bottleneck for processing long sequences with standard Transformers.
Approaches to reduce this complexity: (1) Sparse attention patterns -- Longformer and BigBird use a combination of local sliding window attention (each token attends to k nearest neighbors, O(n·k)) plus global attention on selected tokens. (2) Linear attention -- Performer and Linear Transformers approximate softmax attention using random feature maps or kernel methods, reducing to O(n·d) by computing KV first (d×d matrix) then multiplying with Q. (3) Low-rank projection -- Linformer projects keys and values to a lower dimension k, reducing the attention matrix to n×k instead of n×n. (4) Hardware-aware optimization -- Flash Attention does not change the mathematical computation but reorganizes memory access patterns to avoid materializing the full n×n attention matrix in GPU HBM, using tiling and recomputation in fast SRAM. This achieves 2-4x wall-clock speedup and reduces memory from O(n²) to O(n). (5) State-space models -- Mamba and similar architectures replace attention entirely with recurrence-like mechanisms that process sequences in O(n) time. The tradeoff: exact attention variants (Flash Attention) maintain full quality, approximate methods sacrifice some accuracy for speed, and SSMs change the modeling paradigm entirely.
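The reordering behind linear attention (approach 2) is just associativity of matrix multiplication. Real linear attention replaces softmax with feature maps φ(Q), φ(K); this sketch drops softmax entirely to isolate the complexity argument:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 32
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))

# Quadratic order: form the (n, n) matrix first -- O(n^2 * d) time, O(n^2) memory
out_quadratic = (Q @ K.T) @ V

# Linear order: form K^T V, a (d, d) matrix, first -- O(n * d^2) time, O(d^2) memory
out_linear = Q @ (K.T @ V)

assert np.allclose(out_quadratic, out_linear)  # identical results by associativity
```

Softmax breaks this associativity (it is applied to the full (n, n) matrix), which is why kernel approximations are needed to recover the linear-time ordering.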
Scaling laws (Kaplan et al., 2020) empirically demonstrate that Transformer performance follows smooth power-law relationships with three factors: model parameters (N), dataset size (D), and compute budget (C). Specifically, the cross-entropy loss L decreases predictably as: L(N) proportional to N^(-0.076), L(D) proportional to D^(-0.095), and L(C) proportional to C^(-0.050), when each factor is scaled independently with the others non-bottlenecked. These relationships hold across many orders of magnitude and are remarkably consistent across different model architectures and datasets.
Practical implications: (1) Predictability -- you can estimate the performance of a 10x larger model by training smaller models and extrapolating. This allows organizations to decide whether to invest in training before committing millions of dollars. (2) Compute allocation -- the original Kaplan scaling laws suggested training larger models on less data was optimal, but Chinchilla scaling (Hoffmann et al., 2022) revised this, showing that parameters and data should scale roughly equally for compute-optimal training. Chinchilla (70B parameters, 1.4T tokens) outperformed Gopher (280B parameters, 300B tokens) despite 4x fewer parameters, because it used 4.7x more training data. (3) Emergent abilities -- certain capabilities (chain-of-thought reasoning, few-shot learning, code generation) appear to emerge suddenly at specific scale thresholds rather than improving gradually. (4) Architecture is secondary -- within the Transformer family, scaling laws are largely independent of specific architectural choices (number of layers vs. width, attention heads). This suggests that scale of data and compute matters more than architectural innovations for pushing the performance frontier.
Transformers use self-attention with O(n²) complexity but allow every token to directly attend to every other token, providing maximum flexibility in modeling dependencies. State Space Models (SSMs) like Mamba process sequences through a recurrence with O(n) complexity -- each new token updates a fixed-size hidden state that summarizes all past context, similar to RNNs but with carefully designed parameterizations (structured state matrices, selective gating) that enable stable long-range memory and efficient parallel training via the parallel scan algorithm.
Key tradeoffs: (1) Complexity -- SSMs are O(n) in both time and memory vs. O(n²) for attention, making them dramatically faster and more memory-efficient for long sequences (100K+ tokens). (2) Expressiveness -- Transformers can represent arbitrary pairwise interactions; SSMs compress all past context into a fixed-size state, which may lose information. However, Mamba's selective state space mechanism (input-dependent gating of state transitions) significantly narrows this gap. (3) Inference efficiency -- SSMs are naturally autoregressive with O(1) per-step generation (update state, emit token), while Transformers require recomputing attention over all past tokens at each step (mitigated by KV caching). (4) Training parallelism -- both are parallelizable, but attention has a simpler parallelization story. (5) In-context learning -- Transformers excel at in-context learning (few-shot prompting), which relies on attention's ability to dynamically route information; SSMs' compressed state makes this harder. Current trend: hybrid architectures (Jamba, Zamba) combine attention layers with Mamba layers, using attention for tasks requiring precise retrieval and SSMs for long-range processing efficiency.
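The fixed-size-state recurrence can be sketched minimally as below. This is the non-selective linear SSM skeleton only; Mamba additionally makes the transition parameters input-dependent (selective) and computes the scan in parallel:

```python
import numpy as np

def linear_ssm(x, A, B, C):
    """h_t = A h_{t-1} + B x_t; y_t = C h_t.
    The fixed-size state h summarizes all past context: O(1) memory per step."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:              # O(n) sequential scan over the sequence
        h = A @ h + B * x_t    # update the state with the new input
        ys.append(C @ h)       # emit the output for this step
    return np.array(ys)

rng = np.random.default_rng(0)
A = 0.9 * np.eye(4)            # stable (decaying) state transition
B = rng.normal(size=4)
C = rng.normal(size=4)
y = linear_ssm(rng.normal(size=100), A, B, C)
```

The O(1)-per-step generation claim is visible here: emitting the next token only touches `h`, never the full history, whereas attention must revisit every cached key and value.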
Vision Transformer (ViT) applies the standard Transformer encoder directly to images by converting 2D images into 1D sequences of patch embeddings. Given an image of size H × W × C (e.g., 224 × 224 × 3), it is divided into a grid of non-overlapping patches of size P × P (e.g., 16 × 16), producing N = (H/P) × (W/P) = 196 patches. Each patch is flattened into a vector of dimension P² · C = 768, then projected through a learned linear layer (the patch embedding) to produce a d_model-dimensional embedding. A learnable [CLS] token is prepended to the sequence, and learned positional embeddings (one per patch position + [CLS]) are added. The resulting sequence of N+1 = 197 embeddings is fed into a standard Transformer encoder.
Key design decisions and tradeoffs: (1) Patch size determines the resolution-computation tradeoff -- halving the patch size (16 × 16 to 8 × 8) produces 4x more tokens (a 4x longer sequence) but captures finer details. (2) ViT requires large-scale pretraining (ImageNet-21K or JFT-300M) to outperform CNNs; on ImageNet alone, it underperforms due to lack of the inductive biases (locality, translation equivariance) that CNNs provide through convolutions. (3) The [CLS] token aggregates global information through attention, and its final representation is used for classification. (4) Positional embeddings capture spatial layout -- interpolating the learned positional embeddings enables using ViT at resolutions different from the training resolution. (5) Variants: DeiT uses knowledge distillation from CNNs to train ViT efficiently on ImageNet alone. Swin Transformer introduces shifted window attention for hierarchical features and linear complexity. MAE pretrains by masking 75% of patches and reconstructing them, achieving strong results with less labeled data.
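The patchify-and-embed step can be written as a pair of reshapes; this sketch reproduces the 224/16 numbers above (the embedding dimension 512 and weight matrix are illustrative):

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping P x P patches."""
    H, W, C = image.shape
    P = patch_size
    patches = image.reshape(H // P, P, W // P, P, C)  # grid of P x P blocks
    patches = patches.transpose(0, 2, 1, 3, 4)        # (H/P, W/P, P, P, C)
    return patches.reshape(-1, P * P * C)             # (N, P^2 * C)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))
tokens = patchify(image, 16)          # (196, 768): N = 14 x 14, P^2 * C = 768
d_model = 512                          # illustrative embedding size
W_embed = rng.normal(size=(768, d_model))
embeddings = tokens @ W_embed          # patch embeddings; prepend [CLS], add positions
```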
Flash Attention (Dao et al., 2022) is a hardware-aware exact attention algorithm that computes the same mathematical result as standard self-attention but reorganizes the computation to minimize GPU memory reads/writes. The key insight is that standard attention is bottlenecked by memory bandwidth, not compute. The standard approach materializes the full n × n attention matrix in GPU HBM (high bandwidth memory, slow), reads it back for softmax, stores the result, then reads again for the value multiplication. This requires O(n²) HBM reads/writes, which dominates wall-clock time.
Flash Attention uses a tiling strategy: it divides Q, K, V into blocks that fit in GPU SRAM (fast on-chip memory, ~20 MB on an A100 vs. 40-80 GB HBM). For each block of queries, it iterates over blocks of keys and values, computing partial attention scores in SRAM, updating a running softmax (using the online softmax trick to maintain numerical stability without materializing the full row), and accumulating the output. The full n × n attention matrix is never materialized in HBM. Results: (1) Memory: O(n) instead of O(n²) -- enables processing sequences that would otherwise OOM. (2) Speed: 2-4x faster wall-clock time by reducing HBM accesses from O(n²) to O(n²d²/M), where M is the SRAM size -- far fewer in practice. (3) Exact: no approximation -- it produces the same result as standard attention, up to floating-point rounding from the reordered summation. (4) Backward pass: uses recomputation instead of storing the attention matrix for the backward pass, trading compute for memory. Flash Attention 2 further optimized parallelism across sequence blocks and heads. Flash Attention has become the default attention implementation in most modern Transformer training frameworks.
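The online softmax trick at the heart of the tiling can be demonstrated on a single attention row; this sketch processes the row in blocks, keeping only a running max, denominator, and output, and matches the full-row computation (it illustrates the math only -- the real kernel operates on tiles in SRAM):

```python
import numpy as np

def online_softmax_weighted_sum(scores, values, block_size):
    """softmax(scores) @ values, computed blockwise without the full row in memory."""
    m = -np.inf                    # running max (for numerical stability)
    denom = 0.0                    # running softmax denominator
    out = np.zeros(values.shape[-1])
    for start in range(0, len(scores), block_size):
        s = scores[start:start + block_size]
        v = values[start:start + block_size]
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)          # rescale earlier partial results
        p = np.exp(s - m_new)              # unnormalized weights for this block
        denom = denom * scale + p.sum()
        out = out * scale + p @ v
        m = m_new
    return out / denom

rng = np.random.default_rng(0)
scores, V = rng.normal(size=256), rng.normal(size=(256, 8))
blocked = online_softmax_weighted_sum(scores, V, block_size=32)
full = (np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()) @ V
```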