RNN / LSTM
15 interview questions from basic to advanced
A Recurrent Neural Network (RNN) is a neural network architecture designed to process sequential data by maintaining a hidden state that carries information across time steps. At each time step t, the hidden state h_t is computed from the current input x_t and the previous hidden state h_{t-1}: h_t = tanh(W_h * h_{t-1} + W_x * x_t + b). This recurrence creates a form of memory that allows the network to model temporal dependencies.
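The recurrence above can be sketched in a few lines of plain Python. This is a toy illustration with made-up 2-dimensional weights (W_h, W_x, b are arbitrary values, not learned parameters), intended only to show how the same weights are reused at every step:

```python
import math

def rnn_step(h_prev, x_t, W_h, W_x, b):
    """One vanilla RNN step: h_t = tanh(W_h @ h_prev + W_x @ x_t + b)."""
    h_t = []
    for i in range(len(b)):
        s = b[i]
        s += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        s += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        h_t.append(math.tanh(s))
    return h_t

# Toy 2-dim hidden state, 1-dim input (illustrative weights only)
W_h = [[0.5, 0.1], [0.0, 0.5]]
W_x = [[1.0], [0.5]]
b = [0.0, 0.0]

h = [0.0, 0.0]                    # initial hidden state
for x in ([1.0], [0.0], [1.0]):   # a 3-step input sequence
    h = rnn_step(h, x, W_h, W_x, b)  # same weights applied at every step
```

Note that the loop carries `h` forward: after the last step it summarizes the whole sequence, which is the "memory" described above.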
RNNs are used for sequential data because they respect the ordering of inputs and can handle variable-length sequences. Unlike feedforward networks that treat each input independently, RNNs process sequences one element at a time while maintaining context through the hidden state. This makes them suitable for tasks like language modeling, speech recognition, time series forecasting, and machine translation where the order of inputs matters and outputs depend on the full history of prior inputs.
The hidden state is a fixed-size vector that acts as the network's memory, encoding a compressed representation of all information the network has processed up to the current time step. At each step t, the hidden state h_t is updated to incorporate the new input x_t while retaining relevant information from the past. It serves as both the network's internal representation and the vehicle for passing information between time steps.
The hidden state performs multiple functions: (1) It summarizes the sequence history into a fixed-dimensional vector, regardless of how long the sequence is. (2) It is used to make predictions at each time step (or only at the final step, depending on the task). (3) It transfers context from one step to the next through the recurrent connection. The dimensionality of the hidden state (hidden_size) controls the network's capacity -- larger hidden states can store more information but require more computation and risk overfitting. In practice, vanilla RNN hidden states struggle to retain information from distant time steps, which is why LSTM and GRU were developed.
The vanishing gradient problem occurs when gradients become exponentially small during backpropagation through time (BPTT), preventing the network from learning long-range dependencies. In a vanilla RNN, the gradient of the loss with respect to the hidden state at an early time step involves multiplying the recurrent weight matrix W_h and the tanh derivative repeatedly for each intervening step. If the spectral radius of W_h is less than 1, these repeated multiplications shrink the gradient exponentially toward zero.
This problem is particularly severe in RNNs because sequences can be very long (hundreds or thousands of steps), and the same weight matrix is multiplied at every step. As a result, the network effectively cannot learn that an event at step 5 should influence the output at step 100 -- the gradient signal from step 100 has vanished to near-zero by the time it reaches step 5. The complementary problem is exploding gradients (when the spectral radius is greater than 1), which is addressed by gradient clipping. The vanishing gradient problem is the primary motivation for gated architectures like LSTM and GRU, which provide additive gradient paths that bypass the multiplicative chain.
Long Short-Term Memory (LSTM) is a gated RNN architecture designed to learn long-range dependencies by solving the vanishing gradient problem. While a vanilla RNN has a single hidden state updated by a tanh transformation, an LSTM maintains two states: a cell state C_t (long-term memory) and a hidden state h_t (short-term memory). The cell state is updated through three gating mechanisms that control what to forget, what to add, and what to output.
The key innovation is the cell state, which is updated additively: C_t = f_t * C_{t-1} + i_t * C_tilde_t. This additive update creates a linear gradient path -- when the forget gate f_t is close to 1, gradients flow through the cell state with minimal decay, similar to skip connections in ResNets. The three gates (forget, input, output) use sigmoid activations to make soft binary decisions about what information to erase, write, and read. This gives LSTMs the ability to selectively remember information for hundreds of time steps, which vanilla RNNs fundamentally cannot do.
1. Machine Translation (seq2seq): An encoder LSTM reads the source sentence and compresses it into a context vector, then a decoder LSTM generates the target sentence word by word. With attention mechanisms, the decoder can focus on relevant source words at each step. This was the dominant approach before Transformers (Google Translate used LSTM-based NMT from 2016 to ~2019).
2. Speech Recognition: Bidirectional LSTMs process audio spectrograms frame by frame, converting acoustic features into text. The recurrent structure naturally handles the variable-length, temporal nature of speech. CTC (Connectionist Temporal Classification) loss allows training without exact frame-level alignments. Deep Speech 2 and many production systems used stacked bidirectional LSTMs.
3. Time Series Forecasting: LSTMs model temporal patterns in financial data, weather, energy demand, and sensor readings. They process sequences of past observations to predict future values, capturing seasonality, trends, and complex nonlinear relationships. For multivariate time series, multiple features are fed as a vector at each step. LSTMs remain competitive with Transformers for many time series tasks, especially with limited data.
Forget Gate: f_t = sigma(W_f[h_{t-1}, x_t] + b_f). The forget gate decides what to erase from the cell state. It takes the previous hidden state and current input, passes them through a sigmoid that outputs values in [0,1] for each cell state dimension. A value near 0 erases that component; near 1 retains it. For example, when processing language and encountering a new subject, the forget gate might erase the old subject information from the cell state.
Input Gate + Candidate: i_t = sigma(W_i[h_{t-1}, x_t] + b_i) and C_tilde_t = tanh(W_C[h_{t-1}, x_t] + b_C). The input gate controls how much of the new candidate values to write to the cell state. The candidate is generated by tanh (values in [-1,1]), and the sigmoid gate selects which components to actually update. This two-step process separates "what could be written" from "what should be written."
Cell State Update: C_t = f_t * C_{t-1} + i_t * C_tilde_t. This is the core of the LSTM. The old cell state is selectively erased (forget gate), then new information is selectively added (input gate). The additive nature of this update is what prevents vanishing gradients -- gradients flow through the cell state via multiplication by f_t rather than through a nonlinearity.
Output Gate: o_t = sigma(W_o[h_{t-1}, x_t] + b_o) and h_t = o_t * tanh(C_t). The output gate controls what parts of the cell state are exposed as the hidden state (which is also the output). The cell state is squashed through tanh, and the sigmoid gate selects which components to output. Not all stored information needs to be relevant to the current output.
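The four equations above compose into a single cell update. The sketch below uses scalar (1-dimensional) states and arbitrary toy weights so each gate is one line; a real LSTM applies the same formulas with matrices:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """One LSTM step with 1-dim states for clarity.
    W maps gate name -> (w_h, w_x); b maps gate name -> bias."""
    f = sigmoid(W['f'][0] * h_prev + W['f'][1] * x_t + b['f'])        # forget gate
    i = sigmoid(W['i'][0] * h_prev + W['i'][1] * x_t + b['i'])        # input gate
    c_tilde = math.tanh(W['c'][0] * h_prev + W['c'][1] * x_t + b['c'])  # candidate
    o = sigmoid(W['o'][0] * h_prev + W['o'][1] * x_t + b['o'])        # output gate
    c_t = f * c_prev + i * c_tilde   # additive cell-state update
    h_t = o * math.tanh(c_t)         # expose a gated view of the cell state
    return h_t, c_t

# Arbitrary illustrative weights shared across the forget/input/candidate/output blocks
W = {g: (0.5, 1.0) for g in 'fico'}
b = {g: 0.0 for g in 'fico'}

h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):
    h, c = lstm_step(h, c, x, W, b)
```

The key structural point is visible in `c_t = f * c_prev + i * c_tilde`: the old cell state is scaled, not pushed through a nonlinearity, which is the additive gradient path discussed above.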
Architecture: LSTM has three gates (forget, input, output) and maintains a separate cell state alongside the hidden state. GRU has two gates (reset, update) and uses only a single hidden state -- there is no separate cell state. The GRU update gate combines the roles of the LSTM forget and input gates through linear interpolation: h_t = (1-z_t)*h_{t-1} + z_t*h_tilde_t. The reset gate controls how much of the previous state enters the candidate computation.
Parameters: For input size d_x and hidden size d_h, LSTM has 4*(d_h*d_h + d_h*d_x + d_h) parameters (four weight/bias blocks: one for each of the three gates plus the candidate). GRU has 3*(d_h*d_h + d_h*d_x + d_h) parameters (three blocks: the two gates plus the candidate). GRU therefore has approximately 25% fewer parameters than LSTM.
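The parameter counts follow directly from the formulas; here is the arithmetic for an example size (d_x = 128, d_h = 256 are arbitrary illustrative dimensions):

```python
def lstm_params(d_x, d_h):
    # Four blocks (forget, input, output gates + candidate), each with
    # W_h (d_h x d_h), W_x (d_h x d_x), and a bias (d_h)
    return 4 * (d_h * d_h + d_h * d_x + d_h)

def gru_params(d_x, d_h):
    # Three blocks (reset gate, update gate, candidate)
    return 3 * (d_h * d_h + d_h * d_x + d_h)

n_lstm = lstm_params(128, 256)  # 394240
n_gru = gru_params(128, 256)    # 295680
ratio = n_gru / n_lstm          # 0.75, i.e. 25% fewer parameters
```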
When to use each: GRU is preferred when: (1) training speed matters -- fewer parameters mean faster iteration, (2) the dataset is smaller -- fewer parameters reduce overfitting risk, (3) the task does not require very fine-grained memory control. LSTM is preferred when: (1) the task requires precise control over what to remember and forget (the separate cell state and output gate provide this), (2) the dataset is large enough to support more parameters, (3) proven LSTM architectures exist for the specific task. In practice, both perform similarly on most benchmarks; GRU often wins on smaller datasets and LSTM on larger ones.
BPTT vs. standard backpropagation: Standard backpropagation computes gradients through a static computational graph where each layer has its own parameters. BPTT first unrolls the RNN across all T time steps, creating a deep graph where the same weight matrices are shared at every "layer" (time step). The gradient of the loss with respect to the shared weights is the sum of gradients from all time steps: dL/dW = sum_t(dL_t/dW). Because the same weights appear at every step, the gradient involves products of Jacobians dh_t/dh_{t-1} that chain multiplicatively.
Truncated BPTT: Full BPTT over very long sequences is impractical because: (1) memory grows linearly with sequence length (all intermediate activations must be stored), and (2) gradients from distant steps contribute negligibly due to vanishing. Truncated BPTT splits the sequence into fixed-length chunks (e.g., 35--100 steps) and only backpropagates within each chunk. The hidden state is detached from the computational graph at chunk boundaries but carried forward as input to the next chunk. This approximates the true gradient while keeping memory constant. The chunk length is a hyperparameter that trades off between gradient quality and resource usage.
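A sketch of the chunking logic: the splitting itself is plain Python, while the commented training loop is a hypothetical PyTorch-style skeleton (`init_hidden`, `run_rnn`, and `step` are placeholder names, and `detach` is the framework call that cuts the graph at a chunk boundary):

```python
def chunks(seq, k):
    """Split a sequence into fixed-length chunks for truncated BPTT."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]

# Hypothetical training loop skeleton (framework calls elided):
# h = init_hidden()
# for chunk in chunks(tokens, k=35):
#     h = h.detach()              # cut the graph at the chunk boundary,
#                                 # but keep the hidden-state values
#     loss, h = run_rnn(chunk, h)
#     loss.backward(); step()

seq = list(range(100))
parts = chunks(seq, 35)   # lengths 35, 35, 30
```

Gradients flow only within each 35-step chunk, but the hidden state values still carry context across the whole 100-step sequence.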
Unidirectional RNN: Processes the sequence from left to right (or past to future). At each step t, the hidden state h_t depends only on inputs x_1 through x_t -- it has no knowledge of future inputs. This is appropriate for tasks where causality matters or future context is unavailable: language generation (you predict the next word from past words), real-time speech processing (you must output before hearing the full utterance), and online time series forecasting (future data does not exist yet).
Bidirectional RNN: Runs two separate RNNs -- one forward (left to right) and one backward (right to left) -- then concatenates their hidden states at each position: h_t = [h_forward_t ; h_backward_t]. This gives each position access to both past and future context, doubling the hidden representation size. This is essential for tasks where the full input is available and context from both directions improves prediction: named entity recognition (whether "Washington" is a person or a place depends on surrounding words), sentiment analysis, part-of-speech tagging, and any classification or labeling task on complete sequences.
Key tradeoff: Bidirectional models have 2x the parameters (two separate RNNs) and require the full sequence to be available before processing. They cannot be used for autoregressive generation or any streaming/real-time application where you must produce output before seeing the entire input.
Padding and masking: The most common approach is to pad all sequences in a batch to the same length (the maximum length in that batch) using a special padding token. A binary mask tensor indicates which positions are real data vs. padding. During the forward pass, the RNN processes all positions, but the loss function and any aggregation operations (e.g., taking the last hidden state) use the mask to ignore padding positions. This allows efficient batched computation on GPUs.
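A minimal sketch of padding and mask-aware aggregation, using plain Python lists and a pad id of 0 (both are illustrative choices):

```python
PAD = 0  # illustrative padding token id

def pad_batch(seqs):
    """Pad variable-length sequences to the batch max and build a binary mask."""
    max_len = max(len(s) for s in seqs)
    padded = [s + [PAD] * (max_len - len(s)) for s in seqs]
    mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in seqs]
    return padded, mask

def masked_mean(values, mask):
    """Average per-position values for one sequence, ignoring padded positions."""
    total = sum(v for v, m in zip(values, mask) if m)
    return total / sum(mask)

batch, mask = pad_batch([[5, 3], [7, 8, 9, 2]])
# batch = [[5, 3, 0, 0], [7, 8, 9, 2]]
# mask  = [[1, 1, 0, 0], [1, 1, 1, 1]]
```

The same mask is reused at loss time: per-position losses at masked-out positions are simply zeroed before averaging, so padding never contributes gradient.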
Packing (PyTorch): Frameworks like PyTorch provide pack_padded_sequence and pad_packed_sequence utilities. Packing sorts sequences by length and groups non-padding elements together, so the RNN only computes on real tokens. This is more efficient than padding because it avoids unnecessary computation on pad tokens, and it ensures the hidden state at the "last" position corresponds to the actual last token, not a padding position.
Bucketing: Group sequences of similar lengths into the same batch to minimize padding waste. For example, sequences of length 10--15 go in one batch, 15--20 in another. This reduces wasted computation from excess padding while maintaining batch processing efficiency. Combined with dynamic batching (choosing batch size based on total tokens rather than number of sequences), this is the standard approach in production systems.
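The bucketing idea reduces to grouping by a length key; a minimal sketch (the bucket width of 5 below is an arbitrary illustrative choice):

```python
def bucket_by_length(seqs, width):
    """Group sequences whose lengths fall in the same width-sized range."""
    buckets = {}
    for s in seqs:
        key = len(s) // width      # e.g. width=5: lengths 10-14 share key 2
        buckets.setdefault(key, []).append(s)
    return buckets

seqs = [[0] * n for n in (11, 14, 17, 19, 3)]
buckets = bucket_by_length(seqs, width=5)
# lengths 11 and 14 share key 2; 17 and 19 share key 3; 3 is alone in key 0
```

Batches are then drawn from within a single bucket, so padding waste is bounded by the bucket width rather than by the full length spread of the dataset.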
Vanilla RNN gradient: For the vanilla RNN h_t = tanh(W_h * h_{t-1} + W_x * x_t + b), the gradient of the loss at time T with respect to the hidden state at time k is: dL_T/dh_k = dL_T/dh_T * product_{t=k+1}^{T} (dh_t/dh_{t-1}). Each Jacobian dh_t/dh_{t-1} = diag(1 - h_t^2) * W_h^T, where diag(1 - h_t^2) comes from the tanh derivative and is element-wise in (0, 1]. The product of T-k such matrices has norm bounded by ||W_h||^{T-k} * max(1-h^2)^{T-k}. When ||W_h|| < 1/max(1-h^2), this product vanishes exponentially. When ||W_h|| > 1, it explodes.
LSTM gradient: For the LSTM cell state C_t = f_t * C_{t-1} + i_t * C_tilde_t, the gradient is: dC_t/dC_{t-1} = diag(f_t) + (terms depending on how gates depend on C_{t-1}). The dominant term is diag(f_t). When the forget gate f_t is close to 1 (which the network learns for information it wants to remember), dC_t/dC_{t-1} is approximately the identity matrix. Over T-k steps: dC_T/dC_k = product_{t=k+1}^{T} diag(f_t) approximately equals diag(product f_t), which stays close to 1 when all forget gates are near 1. This additive structure means gradients flow through the cell state path without the exponential decay caused by multiplicative weight matrix chains in vanilla RNNs.
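The contrast between the two gradient paths can be made concrete with a scalar stand-in for each chain: repeated multiplication by tanh'-times-weight for the vanilla RNN versus repeated multiplication by the forget gate for the LSTM (the specific values w=1.0, h=0.5, f=0.99 are illustrative):

```python
def vanilla_grad_factor(w, steps, h=0.5):
    """Scalar stand-in for the product of (tanh' * W_h) terms.
    With activation h, tanh' = 1 - h**2, so each step multiplies by w*(1-h**2)."""
    return (w * (1 - h ** 2)) ** steps

def lstm_grad_factor(f, steps):
    """Scalar stand-in for the product of forget gates along the cell-state path."""
    return f ** steps

v = vanilla_grad_factor(w=1.0, steps=100)   # 0.75**100: vanishes to ~3e-13
l = lstm_grad_factor(f=0.99, steps=100)     # 0.99**100: still about 0.37
```

Even with a weight of exactly 1, the tanh derivative alone drives the vanilla chain to zero over 100 steps, while a learned forget gate near 1 keeps a usable fraction of the gradient alive.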
LSTM complexity: Per-layer computation is O(n * d^2) where n is sequence length and d is hidden dimension. This comes from n time steps, each performing matrix multiplications of size d x d (for the four gate computations). Memory for activations is O(n * d). The critical limitation is that computation is inherently sequential -- step t depends on step t-1, so parallelism across time steps is impossible. Training time scales linearly with sequence length, and there is no way to accelerate this with more hardware.
Transformer complexity: Self-attention has O(n^2 * d) computation and O(n^2) memory for the attention matrix. The feedforward layers contribute O(n * d^2). For short-to-medium sequences (n < d), the feedforward term dominates and total complexity is similar to LSTM. For long sequences (n > d), the quadratic attention cost dominates. However, the critical advantage is that all positions are computed in parallel -- training is massively parallelizable on GPUs/TPUs, making wall-clock training time much faster despite similar FLOP counts.
Practical implications: For training, Transformers are 5-10x faster in wall-clock time due to parallelism, even though FLOP counts may be comparable. For inference, LSTMs have an advantage for streaming: generating each new token requires O(d^2) computation (just one step), while Transformers must attend to all previous tokens, costing O(n * d) per new token (or using KV-cache which requires O(n * d) memory). This is why some edge/mobile applications still use LSTMs. Linear attention variants (e.g., Mamba, RWKV) attempt to combine the best of both: parallelizable training with O(1) per-step inference.
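The crossover between the two cost regimes can be checked with rough FLOP formulas (constant factors are approximations that ignore layer counts, heads, and feedforward width):

```python
def lstm_flops(n, d):
    """Rough per-layer LSTM cost: n sequential steps, four d x d gate matmuls each."""
    return 4 * n * d * d

def attention_flops(n, d):
    """Rough self-attention cost: QK^T and attention-times-V, each about n*n*d."""
    return 2 * n * n * d

# At n = 512, d = 512 the costs are the same order of magnitude;
# at n = 8192, d = 512 the quadratic attention term dominates.
short = (lstm_flops(512, 512), attention_flops(512, 512))
long = (lstm_flops(8192, 512), attention_flops(8192, 512))
```

The ratio attention/LSTM is roughly n/(2d), which is why "n versus d" is the right mental shorthand for when the quadratic term starts to matter.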
The bottleneck problem: In a vanilla seq2seq model, the encoder processes the entire source sequence and compresses it into a single fixed-length vector (the final hidden state). The decoder then generates the output sequence conditioned only on this vector. For long sequences, this creates a severe information bottleneck -- a 100-word sentence must be fully represented in a single vector of, say, 512 dimensions. Performance degrades significantly as source length increases.
Attention mechanism (Bahdanau, 2014): Instead of using only the final encoder state, attention allows the decoder to look at all encoder hidden states [h_1, ..., h_n] at each decoding step. At decoder step t: (1) Compute alignment scores: e_{t,i} = score(s_t, h_i) where s_t is the decoder state and h_i is encoder state at position i. The score function can be additive (v^T * tanh(W_1*s_t + W_2*h_i)) or multiplicative (s_t^T * W * h_i). (2) Normalize to get attention weights: alpha_{t,i} = softmax(e_{t,i}). (3) Compute context vector: c_t = sum_i(alpha_{t,i} * h_i). (4) Use c_t alongside s_t to generate the output.
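Steps (1)-(4) can be sketched directly. This toy version uses a plain dot product as the score function (one of the multiplicative variants, with W taken as the identity for brevity) and 2-dimensional states:

```python
import math

def softmax(scores):
    m = max(scores)                         # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attend(s_t, enc_states):
    """Dot-product attention: score each encoder state against the decoder
    state s_t, normalize to weights, and return the weighted context vector."""
    scores = [sum(a * b for a, b in zip(s_t, h)) for h in enc_states]   # step 1
    alphas = softmax(scores)                                            # step 2
    dim = len(enc_states[0])
    context = [sum(alphas[i] * enc_states[i][k]                         # step 3
                   for i in range(len(enc_states)))
               for k in range(dim)]
    return context, alphas   # step 4: context is consumed alongside s_t

enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy encoder states h_1..h_3
ctx, alphas = attend([1.0, 0.0], enc)        # decoder state aligned with h_1/h_3
```

The weights always sum to 1, and encoder states most similar to the decoder state receive the largest share, which is exactly the "soft lookup" behavior that removes the fixed-vector bottleneck.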
Why it was a breakthrough: Attention removes the fixed-length bottleneck, allowing the model to dynamically select relevant source information at each step. This dramatically improved translation quality, especially for long sentences. It also provided interpretability -- attention weights show which source words the model focuses on. Most importantly, attention was the conceptual precursor to self-attention in Transformers, where the same mechanism is applied within a single sequence to relate all positions to each other.
The fundamental challenge: Even LSTMs struggle with sequences beyond ~500--1000 steps because: (1) the forget gate product over many steps eventually decays (product of values slightly less than 1 still vanishes), (2) memory and compute scale linearly with sequence length, and (3) truncated BPTT limits the effective gradient horizon. At 10K+ steps, vanilla LSTMs are impractical.
Hierarchical approaches: Split the long sequence into segments and process at multiple time scales. A lower-level LSTM processes short segments (e.g., 100 steps), and a higher-level LSTM processes the sequence of segment summaries. This creates an O(sqrt(n)) effective path length. Hierarchical Multiscale LSTM and Clockwork RNN formalize this with different update frequencies for different hidden state groups.
Attention-augmented approaches: Add attention over a sliding window or sparse set of past hidden states. The LSTM processes locally while attention provides long-range skip connections. Transformer-XL extends this by caching previous segment hidden states and attending to them (segment-level recurrence + attention). This achieves practical context lengths of 3K--16K tokens.
Modern alternatives: For truly long sequences, consider: (1) Sparse Transformers with O(n*sqrt(n)) attention, (2) Longformer/BigBird with local + global attention patterns, (3) State-space models (S4, Mamba) that achieve O(n*log(n)) or O(n) complexity with recurrent-like structure but parallelizable training, (4) Compressive memory that summarizes and discards old states to maintain a fixed-size memory. The choice depends on whether you need exact long-range attention (sparse Transformer) or approximate modeling is sufficient (hierarchical RNN, state-space models).
RNNs (LSTM/GRU): Process sequences step by step with O(n*d^2) complexity. Strengths: natural fit for sequential/streaming data, constant memory per step at inference, handle variable lengths natively. Weaknesses: sequential training cannot be parallelized, limited effective memory (~500 steps for LSTM), slower training than alternatives. Best for: streaming applications, edge deployment, small datasets, and tasks where recurrent inductive bias helps.
Temporal Convolutional Networks (TCN): Apply 1D dilated causal convolutions with exponentially increasing dilation rates to achieve large receptive fields. Complexity: O(n*d^2*k) where k is kernel size. Strengths: fully parallelizable (like Transformers), stable gradients (no multiplicative chain), flexible receptive field via dilation, and backpropagation does not suffer from vanishing gradients. Weaknesses: fixed receptive field (must be designed to cover the maximum dependency length), less flexible than attention for variable-length dependencies, and requires careful architecture design (dilation schedule). Best for: audio generation (WaveNet), fixed-length time series, and tasks where the maximum dependency length is known.
Transformers: Use self-attention for O(n^2*d) computation with full parallelization. Strengths: direct connection between any two positions (O(1) path length), massive parallelism, excellent scaling with data and compute, and the foundation of modern NLP. Weaknesses: O(n^2) memory/compute for attention (problematic for very long sequences), no inherent sequential inductive bias (must learn position from positional encodings), and requires large datasets to outperform alternatives. Best for: NLP (dominant architecture), large-scale pretraining, tasks requiring global context, and any setting with abundant data and compute.
Decision framework: Small data or streaming needed? Use LSTM/GRU. Known fixed dependency length with parallel training? Consider TCN. Large data, need global context, have GPU resources? Use Transformers. Very long sequences? Consider state-space models (Mamba) that combine RNN-like inference with parallelizable training.