
RNN / LSTM

15 interview questions from basic to advanced

EASY What is a Recurrent Neural Network and why is it used for sequential data?

A Recurrent Neural Network (RNN) is a neural network architecture designed to process sequential data by maintaining a hidden state that carries information across time steps. At each time step t, the hidden state h_t is computed from the current input x_t and the previous hidden state h_{t-1}: h_t = tanh(W_h * h_{t-1} + W_x * x_t + b). This recurrence creates a form of memory that allows the network to model temporal dependencies.

RNNs are used for sequential data because they respect the ordering of inputs and can handle variable-length sequences. Unlike feedforward networks that treat each input independently, RNNs process sequences one element at a time while maintaining context through the hidden state. This makes them suitable for tasks like language modeling, speech recognition, time series forecasting, and machine translation where the order of inputs matters and outputs depend on the full history of prior inputs.

Key Points
  • Maintains a hidden state updated at each time step, creating temporal memory
  • Processes sequences one step at a time, respecting input ordering
  • Handles variable-length inputs through parameter sharing across time steps
  • Suitable for text, speech, time series, and any data with temporal structure
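The recurrence above can be sketched in a few lines of NumPy; the sizes d_x = 3 and d_h = 4 and the random weights are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 3, 4                       # input and hidden sizes (arbitrary for the demo)
W_h = rng.normal(0, 0.1, (d_h, d_h))  # recurrent weights, shared across all steps
W_x = rng.normal(0, 0.1, (d_h, d_x))  # input weights, shared across all steps
b = np.zeros(d_h)

def rnn_step(h_prev, x_t):
    """One recurrence: h_t = tanh(W_h h_{t-1} + W_x x_t + b)."""
    return np.tanh(W_h @ h_prev + W_x @ x_t + b)

# Process a variable-length sequence with the SAME weights at every step.
xs = rng.normal(size=(7, d_x))        # a 7-step sequence
h = np.zeros(d_h)                     # initial hidden state
for x_t in xs:
    h = rnn_step(h, x_t)

print(h.shape)  # (4,) -- fixed-size summary regardless of sequence length
```

The same W_h, W_x, and b are reused at every step; this parameter sharing is what lets a single network handle sequences of any length.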
EASY What role does the hidden state play in an RNN?

The hidden state is a fixed-size vector that acts as the network's memory, encoding a compressed representation of all information the network has processed up to the current time step. At each step t, the hidden state h_t is updated to incorporate the new input x_t while retaining relevant information from the past. It serves as both the network's internal representation and the vehicle for passing information between time steps.

The hidden state performs multiple functions: (1) It summarizes the sequence history into a fixed-dimensional vector, regardless of how long the sequence is. (2) It is used to make predictions at each time step (or only at the final step, depending on the task). (3) It transfers context from one step to the next through the recurrent connection. The dimensionality of the hidden state (hidden_size) controls the network's capacity -- larger hidden states can store more information but require more computation and risk overfitting. In practice, vanilla RNN hidden states struggle to retain information from distant time steps, which is why LSTM and GRU were developed.

Key Points
  • Fixed-size vector encoding a compressed summary of the entire sequence so far
  • Updated at each step to incorporate new input and retain past context
  • Used for predictions and as the bridge between consecutive time steps
  • Hidden size controls capacity; vanilla RNNs lose long-range information
EASY What is the vanishing gradient problem and why does it affect RNNs?

The vanishing gradient problem occurs when gradients become exponentially small during backpropagation through time (BPTT), preventing the network from learning long-range dependencies. In a vanilla RNN, the gradient of the loss with respect to the hidden state at an early time step involves multiplying the recurrent weight matrix W_h and the tanh derivative repeatedly for each intervening step. If the spectral radius of W_h is less than 1, these repeated multiplications shrink the gradient exponentially toward zero.

This problem is particularly severe in RNNs because sequences can be very long (hundreds or thousands of steps), and the same weight matrix is multiplied at every step. As a result, the network effectively cannot learn that an event at step 5 should influence the output at step 100 -- the gradient signal from step 100 has vanished to near-zero by the time it reaches step 5. The complementary problem is exploding gradients (when the spectral radius is greater than 1), which is addressed by gradient clipping. The vanishing gradient problem is the primary motivation for gated architectures like LSTM and GRU, which provide additive gradient paths that bypass the multiplicative chain.

Key Points
  • Gradients shrink exponentially when multiplied through many time steps
  • Caused by repeated multiplication by the recurrent weight matrix and tanh derivative
  • Prevents learning of long-range dependencies in vanilla RNNs
  • Motivates LSTM/GRU (additive paths) and gradient clipping (for explosions)
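A quick numeric sketch of the effect (the 8-dimensional size and the 0.9 spectral norm are arbitrary; the tanh derivative, which is at most 1, is omitted here and would only shrink the product further):

```python
import numpy as np

d_h = 8
rng = np.random.default_rng(0)
W = rng.normal(size=(d_h, d_h))
# Rescale so the largest singular value (spectral norm) is 0.9 < 1.
W *= 0.9 / np.linalg.svd(W, compute_uv=False)[0]

g = np.ones(d_h)                 # stand-in for dL/dh_T
norms = []
for _ in range(100):             # backpropagate through 100 time steps
    g = W.T @ g                  # repeated multiplication by the same matrix
    norms.append(np.linalg.norm(g))

print(norms[9], norms[99])       # shrinks roughly like 0.9**t toward zero
```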
EASY What is an LSTM and how does it differ from a vanilla RNN?

Long Short-Term Memory (LSTM) is a gated RNN architecture designed to learn long-range dependencies by solving the vanishing gradient problem. While a vanilla RNN has a single hidden state updated by a tanh transformation, an LSTM maintains two states: a cell state C_t (long-term memory) and a hidden state h_t (short-term memory). The cell state is updated through three gating mechanisms that control what to forget, what to add, and what to output.

The key innovation is the cell state, which is updated additively: C_t = f_t * C_{t-1} + i_t * C_tilde_t. This additive update creates a linear gradient path -- when the forget gate f_t is close to 1, gradients flow through the cell state with minimal decay, similar to skip connections in ResNets. The three gates (forget, input, output) use sigmoid activations to make soft binary decisions about what information to erase, write, and read. This gives LSTMs the ability to selectively remember information for hundreds of time steps, which vanilla RNNs fundamentally cannot do.

Key Points
  • Maintains two states: cell state (long-term) and hidden state (short-term)
  • Three gates (forget, input, output) control information flow with sigmoid activations
  • Additive cell state update creates a gradient highway, solving vanishing gradients
  • Can learn dependencies over hundreds of time steps vs. ~10-20 for vanilla RNNs
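The two-state structure is visible directly in PyTorch's API: nn.RNN returns a single hidden state, while nn.LSTM returns a (hidden, cell) pair. The sizes below are arbitrary:

```python
import torch
import torch.nn as nn

x = torch.randn(5, 1, 10)            # (seq_len, batch, input_size)

rnn = nn.RNN(input_size=10, hidden_size=20)
lstm = nn.LSTM(input_size=10, hidden_size=20)

_, h_n = rnn(x)                      # vanilla RNN: a single hidden state
_, (h, c) = lstm(x)                  # LSTM: hidden state AND cell state

print(h_n.shape)         # torch.Size([1, 1, 20])
print(h.shape, c.shape)  # both torch.Size([1, 1, 20])
```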
EASY Name three real-world applications of RNNs/LSTMs.

1. Machine Translation (seq2seq): An encoder LSTM reads the source sentence and compresses it into a context vector, then a decoder LSTM generates the target sentence word by word. With attention mechanisms, the decoder can focus on relevant source words at each step. This was the dominant approach before Transformers (Google Translate used LSTM-based NMT from 2016 to ~2019).

2. Speech Recognition: Bidirectional LSTMs process audio spectrograms frame by frame, converting acoustic features into text. The recurrent structure naturally handles the variable-length, temporal nature of speech. CTC (Connectionist Temporal Classification) loss allows training without exact frame-level alignments. Deep Speech 2 and many production systems used stacked bidirectional LSTMs.

3. Time Series Forecasting: LSTMs model temporal patterns in financial data, weather, energy demand, and sensor readings. They process sequences of past observations to predict future values, capturing seasonality, trends, and complex nonlinear relationships. For multivariate time series, multiple features are fed as a vector at each step. LSTMs remain competitive with Transformers for many time series tasks, especially with limited data.

Key Points
  • Machine translation: encoder-decoder LSTM with attention for seq2seq tasks
  • Speech recognition: bidirectional LSTMs with CTC loss for audio-to-text
  • Time series forecasting: capturing temporal patterns in financial, weather, sensor data
  • Other applications: text generation, sentiment analysis, music composition, handwriting recognition
MEDIUM Explain all four components of an LSTM cell in detail.

Forget Gate: f_t = sigma(W_f[h_{t-1}, x_t] + b_f). The forget gate decides what to erase from the cell state. It takes the previous hidden state and current input, passes them through a sigmoid that outputs values in [0,1] for each cell state dimension. A value near 0 erases that component; near 1 retains it. For example, when processing language and encountering a new subject, the forget gate might erase the old subject information from the cell state.

Input Gate + Candidate: i_t = sigma(W_i[h_{t-1}, x_t] + b_i) and C_tilde_t = tanh(W_C[h_{t-1}, x_t] + b_C). The input gate controls how much of the new candidate values to write to the cell state. The candidate is generated by tanh (values in [-1,1]), and the sigmoid gate selects which components to actually update. This two-step process separates "what could be written" from "what should be written."

Cell State Update: C_t = f_t * C_{t-1} + i_t * C_tilde_t. This is the core of the LSTM. The old cell state is selectively erased (forget gate), then new information is selectively added (input gate). The additive nature of this update is what prevents vanishing gradients -- gradients flow through the cell state via multiplication by f_t rather than through a nonlinearity.

Output Gate: o_t = sigma(W_o[h_{t-1}, x_t] + b_o) and h_t = o_t * tanh(C_t). The output gate controls what parts of the cell state are exposed as the hidden state (which is also the output). The cell state is squashed through tanh, and the sigmoid gate selects which components to output. Not all stored information needs to be relevant to the current output.

Key Points
  • Forget gate (sigma): selectively erases old cell state information
  • Input gate (sigma) + candidate (tanh): selectively writes new information
  • Cell state: additive update provides gradient highway for long-range learning
  • Output gate (sigma): selectively reads from cell state to produce hidden state
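All four components can be written out in a short NumPy sketch. Stacking the four weight matrices into one W (rows ordered forget, input, output, candidate here) is a common implementation trick; the sizes are arbitrary:

```python
import numpy as np

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step; W maps [h_{t-1}; x_t] to the four pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f_t, i_t, o_t, g = np.split(z, 4)
    f_t = 1 / (1 + np.exp(-f_t))        # forget gate (sigmoid): what to erase
    i_t = 1 / (1 + np.exp(-i_t))        # input gate (sigmoid): what to write
    o_t = 1 / (1 + np.exp(-o_t))        # output gate (sigmoid): what to expose
    C_tilde = np.tanh(g)                # candidate values in [-1, 1]
    C_t = f_t * C_prev + i_t * C_tilde  # additive cell state update
    h_t = o_t * np.tanh(C_t)            # hidden state read from the cell state
    return h_t, C_t

rng = np.random.default_rng(0)
d_x, d_h = 3, 5
W = rng.normal(0, 0.1, (4 * d_h, d_h + d_x))
b = np.zeros(4 * d_h)
h, C = np.zeros(d_h), np.zeros(d_h)
h, C = lstm_step(rng.normal(size=d_x), h, C, W, b)
print(h.shape, C.shape)  # (5,) (5,)
```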
MEDIUM Compare LSTM and GRU: architecture, parameters, and when to use each.

Architecture: LSTM has three gates (forget, input, output) and maintains a separate cell state alongside the hidden state. GRU has two gates (reset, update) and uses only a single hidden state -- there is no separate cell state. The GRU update gate combines the roles of the LSTM forget and input gates through linear interpolation: h_t = (1-z_t)*h_{t-1} + z_t*h_tilde_t. The reset gate controls how much of the previous state enters the candidate computation.

Parameters: For input size d_x and hidden size d_h, LSTM has 4*(d_h*d_h + d_h*d_x + d_h) parameters (four weight matrices + biases for the three gates + candidate). GRU has 3*(d_h*d_h + d_h*d_x + d_h) parameters (three weight matrices for the two gates + candidate). Under these formulas, GRU has exactly 25% fewer parameters than LSTM.

When to use each: GRU is preferred when: (1) training speed matters -- fewer parameters mean faster iteration, (2) the dataset is smaller -- fewer parameters reduce overfitting risk, (3) the task does not require very fine-grained memory control. LSTM is preferred when: (1) the task requires precise control over what to remember and forget (the separate cell state and output gate provide this), (2) the dataset is large enough to support more parameters, (3) proven LSTM architectures exist for the specific task. In practice, both perform similarly on most benchmarks; GRU often wins on smaller datasets and LSTM on larger ones.

Key Points
  • LSTM: 3 gates + cell state (more capacity, more parameters)
  • GRU: 2 gates, no cell state, ~25% fewer parameters, faster training
  • Performance is similar on most tasks; GRU often wins on smaller datasets
  • LSTM when fine-grained memory control matters; GRU for faster iteration
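The 4x-vs-3x parameter counts are easy to verify in PyTorch. Note that PyTorch stores two bias vectors per gate stack, so its LSTM count is 4*(d_h*d_h + d_h*d_x + 2*d_h), but the GRU/LSTM ratio is still exactly 3/4:

```python
import torch.nn as nn

d_x, d_h = 32, 64
lstm = nn.LSTM(d_x, d_h)
gru = nn.GRU(d_x, d_h)

n_lstm = sum(p.numel() for p in lstm.parameters())
n_gru = sum(p.numel() for p in gru.parameters())

print(n_lstm)          # 4 * (64*64 + 64*32 + 2*64) = 25088
print(n_gru / n_lstm)  # 0.75
```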
MEDIUM How does BPTT differ from standard backpropagation, and what is truncated BPTT?

BPTT vs. standard backpropagation: Standard backpropagation computes gradients through a static computational graph where each layer has its own parameters. BPTT first unrolls the RNN across all T time steps, creating a deep graph where the same weight matrices are shared at every "layer" (time step). The gradient of the loss with respect to the shared weights is the sum of gradients from all time steps: dL/dW = sum_t(dL_t/dW). Because the same weights appear at every step, the gradient involves products of Jacobians dh_t/dh_{t-1} that chain multiplicatively.

Truncated BPTT: Full BPTT over very long sequences is impractical because: (1) memory grows linearly with sequence length (all intermediate activations must be stored), and (2) gradients from distant steps contribute negligibly due to vanishing. Truncated BPTT splits the sequence into fixed-length chunks (e.g., 35--100 steps) and only backpropagates within each chunk. The hidden state is detached from the computational graph at chunk boundaries but carried forward as input to the next chunk. This approximates the true gradient while keeping memory constant. The chunk length is a hyperparameter that trades off between gradient quality and resource usage.

Key Points
  • BPTT unrolls the RNN and sums gradients from all time steps (shared weights)
  • Memory and compute grow linearly with sequence length in full BPTT
  • Truncated BPTT limits backpropagation to fixed-length chunks (e.g., 35--100 steps)
  • Hidden state carries forward across chunks but gradients do not flow back across boundaries
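In PyTorch, truncated BPTT amounts to detaching the hidden state at chunk boundaries. Everything below (the toy next-step-prediction task, chunk length 50, hidden size 16) is an illustrative choice:

```python
import torch
import torch.nn as nn

# Toy setup: predict the next value of a long 1-D sequence.
seq = torch.randn(1000, 1, 1)            # (time, batch, features)
rnn = nn.LSTM(1, 16)
head = nn.Linear(16, 1)
opt = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()))

chunk = 50                               # truncation length
state = None
for start in range(0, 950, chunk):
    x = seq[start : start + chunk]
    y = seq[start + 1 : start + chunk + 1]
    out, state = rnn(x, state)
    loss = ((head(out) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()                      # gradients flow only within this chunk
    opt.step()
    # Carry the state forward but cut the gradient graph at the boundary.
    state = tuple(s.detach() for s in state)
```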
MEDIUM What is the difference between bidirectional and unidirectional RNNs, and when should you use each?

Unidirectional RNN: Processes the sequence from left to right (or past to future). At each step t, the hidden state h_t depends only on inputs x_1 through x_t -- it has no knowledge of future inputs. This is appropriate for tasks where causality matters or future context is unavailable: language generation (you predict the next word from past words), real-time speech processing (you must output before hearing the full utterance), and online time series forecasting (future data does not exist yet).

Bidirectional RNN: Runs two separate RNNs -- one forward (left to right) and one backward (right to left) -- then concatenates their hidden states at each position: h_t = [h_forward_t ; h_backward_t]. This gives each position access to both past and future context, doubling the hidden representation size. This is essential for tasks where the full input is available and context from both directions improves prediction: named entity recognition (whether "Washington" is a person or a place depends on surrounding words), sentiment analysis, part-of-speech tagging, and any classification or labeling task on complete sequences.

Key tradeoff: Bidirectional models have 2x the parameters (two separate RNNs) and require the full sequence to be available before processing. They cannot be used for autoregressive generation or any streaming/real-time application where you must produce output before seeing the entire input.

Key Points
  • Unidirectional: only past context, suitable for generation and streaming tasks
  • Bidirectional: past + future context, concatenated hidden states, 2x parameters
  • Use bidirectional for classification/labeling where full input is available (NER, sentiment)
  • Use unidirectional for autoregressive generation and real-time processing
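In PyTorch this is a single flag; the shapes below show both the concatenated per-position states and the per-direction final states (sizes arbitrary):

```python
import torch
import torch.nn as nn

x = torch.randn(7, 2, 10)  # (seq_len, batch, input_size)
bi = nn.LSTM(input_size=10, hidden_size=20, bidirectional=True)
out, (h_n, c_n) = bi(x)

print(out.shape)  # torch.Size([7, 2, 40]) -- forward/backward states concatenated
print(h_n.shape)  # torch.Size([2, 2, 20]) -- one final state per direction
```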
MEDIUM How do RNNs handle variable-length sequences in practice?

Padding and masking: The most common approach is to pad all sequences in a batch to the same length (the maximum length in that batch) using a special padding token. A binary mask tensor indicates which positions are real data vs. padding. During the forward pass, the RNN processes all positions, but the loss function and any aggregation operations (e.g., taking the last hidden state) use the mask to ignore padding positions. This allows efficient batched computation on GPUs.

Packing (PyTorch): Frameworks like PyTorch provide pack_padded_sequence and pad_packed_sequence utilities. Packing sorts sequences by length and groups non-padding elements together, so the RNN only computes on real tokens. This is more efficient than padding because it avoids unnecessary computation on pad tokens, and it ensures the hidden state at the "last" position corresponds to the actual last token, not a padding position.

Bucketing: Group sequences of similar lengths into the same batch to minimize padding waste. For example, sequences of length 10--15 go in one batch, 15--20 in another. This reduces wasted computation from excess padding while maintaining batch processing efficiency. Combined with dynamic batching (choosing batch size based on total tokens rather than number of sequences), this is the standard approach in production systems.

Key Points
  • Padding + masking: pad to max length, use masks to ignore padding in loss/aggregation
  • Packing (PyTorch): sorts by length, avoids computation on padding tokens
  • Bucketing: group similar-length sequences to minimize padding waste
  • The hidden state at the true last position (not padding) is used for sequence-level tasks
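A minimal packing example in PyTorch, with two toy sequences of true lengths 5 and 3 (enforce_sorted=True requires lengths in descending order):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two sequences of true lengths 5 and 3, padded to length 5.
batch = torch.randn(5, 2, 10)        # (max_len, batch, features)
batch[3:, 1] = 0.0                   # zero out the padding of the short sequence
lengths = torch.tensor([5, 3])

lstm = nn.LSTM(10, 16)
packed = pack_padded_sequence(batch, lengths, enforce_sorted=True)
packed_out, (h_n, _) = lstm(packed)
out, out_lens = pad_packed_sequence(packed_out)

# h_n holds the state at each sequence's TRUE last step, not at a pad position.
print(out.shape)   # torch.Size([5, 2, 16])
print(out_lens)    # tensor([5, 3])
```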
HARD Derive mathematically why gradients vanish in vanilla RNNs but not in LSTMs.

Vanilla RNN gradient: For the vanilla RNN h_t = tanh(W_h * h_{t-1} + W_x * x_t + b), the gradient of the loss at time T with respect to the hidden state at time k is: dL_T/dh_k = dL_T/dh_T * product_{t=k+1}^{T} (dh_t/dh_{t-1}). Each Jacobian dh_t/dh_{t-1} = diag(1 - h_t^2) * W_h^T, where diag(1 - h_t^2) comes from the tanh derivative and is element-wise in (0, 1]. The product of T-k such matrices has norm bounded by ||W_h||^{T-k} * max(1-h^2)^{T-k}. When ||W_h|| < 1/max(1-h^2), this product vanishes exponentially. When ||W_h|| > 1, it explodes.

LSTM gradient: For the LSTM cell state C_t = f_t * C_{t-1} + i_t * C_tilde_t, the gradient is: dC_t/dC_{t-1} = diag(f_t) + (terms depending on how gates depend on C_{t-1}). The dominant term is diag(f_t). When the forget gate f_t is close to 1 (which the network learns for information it wants to remember), dC_t/dC_{t-1} is approximately the identity matrix. Over T-k steps: dC_T/dC_k = product_{t=k+1}^{T} diag(f_t) approximately equals diag(product f_t), which stays close to 1 when all forget gates are near 1. This additive structure means gradients flow through the cell state path without the exponential decay caused by multiplicative weight matrix chains in vanilla RNNs.

Key Points
  • Vanilla RNN: gradient involves product of (W_h^T * tanh-derivative) matrices -- exponential decay/growth
  • LSTM cell state gradient: dC_t/dC_{t-1} = diag(f_t) -- close to identity when f_t near 1
  • Additive cell state update creates a linear path; multiplicative hidden state causes exponential behavior
  • LSTM learns to set f_t near 1 for long-term information, maintaining gradient flow
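A scalar caricature of the two gradient paths makes the contrast concrete. Here 0.9 stands in for the per-step RNN Jacobian norm and 0.999 for a learned forget-gate value; both numbers are illustrative:

```python
T = 200

# Vanilla RNN path: product of per-step "Jacobians" ~0.9 each.
rnn_grad = 0.9 ** T

# LSTM cell-state path: product of forget-gate values the network keeps near 1.
lstm_grad = 0.999 ** T

print(rnn_grad)   # ~7e-10: the signal is gone
print(lstm_grad)  # ~0.82: the signal survives
```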
HARD Compare the computational complexity of LSTM vs. Transformer for sequence modeling.

LSTM complexity: Per-layer computation is O(n * d^2) where n is sequence length and d is hidden dimension. This comes from n time steps, each performing matrix multiplications of size d x d (for the four gate computations). Memory for activations is O(n * d). The critical limitation is that computation is inherently sequential -- step t depends on step t-1, so parallelism across time steps is impossible. Training time scales linearly with sequence length, and there is no way to accelerate this with more hardware.

Transformer complexity: Self-attention has O(n^2 * d) computation and O(n^2) memory for the attention matrix. The feedforward layers contribute O(n * d^2). For short-to-medium sequences (n < d), the feedforward term dominates and total complexity is similar to LSTM. For long sequences (n > d), the quadratic attention cost dominates. However, the critical advantage is that all positions are computed in parallel -- training is massively parallelizable on GPUs/TPUs, making wall-clock training time much faster despite similar FLOP counts.

Practical implications: For training, Transformers are 5-10x faster in wall-clock time due to parallelism, even though FLOP counts may be comparable. For inference, LSTMs have an advantage for streaming: generating each new token requires O(d^2) computation (just one step), while a Transformer must attend to all previous tokens, costing O(n * d) per new token even with a KV-cache (which itself requires O(n * d) memory). This is why some edge/mobile applications still use LSTMs. Linear-attention and state-space hybrids (e.g., Mamba, RWKV) attempt to combine the best of both: parallelizable training with O(1) per-step inference.

Key Points
  • LSTM: O(n*d^2) compute, sequential -- cannot parallelize across time steps
  • Transformer: O(n^2*d) attention + O(n*d^2) FFN, fully parallelizable
  • Transformers train 5-10x faster in wall-clock time due to GPU parallelism
  • LSTMs have O(d^2) per-step inference advantage for streaming applications
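The dominant terms can be compared with back-of-the-envelope arithmetic. Constants and QKV projection terms are dropped, and the 8*n*d^2 FFN figure assumes the standard 4d inner dimension, so treat these as order-of-magnitude estimates:

```python
# Rough per-layer FLOP counts (constants and projection terms dropped):
#   LSTM: n sequential steps x 4 gate matmuls of size d x d -> 4 n d^2
#   Transformer: attention scores n^2 d + position-wise FFN ~8 n d^2 (parallel in n)
def lstm_flops(n, d):
    return 4 * n * d * d

def attn_flops(n, d):
    return n * n * d

def ffn_flops(n, d):
    return 8 * n * d * d

d = 512
for n in (256, 4096, 65536):
    # Attention overtakes the FFN term once n > 8d (here, n > 4096).
    print(n, attn_flops(n, d) / ffn_flops(n, d))
```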
HARD How does attention work in a seq2seq model, and why was it a breakthrough?

The bottleneck problem: In a vanilla seq2seq model, the encoder processes the entire source sequence and compresses it into a single fixed-length vector (the final hidden state). The decoder then generates the output sequence conditioned only on this vector. For long sequences, this creates a severe information bottleneck -- a 100-word sentence must be fully represented in a single vector of, say, 512 dimensions. Performance degrades significantly as source length increases.

Attention mechanism (Bahdanau, 2014): Instead of using only the final encoder state, attention allows the decoder to look at all encoder hidden states [h_1, ..., h_n] at each decoding step. At decoder step t: (1) Compute alignment scores: e_{t,i} = score(s_t, h_i) where s_t is the decoder state and h_i is encoder state at position i. The score function can be additive (v^T * tanh(W_1*s_t + W_2*h_i)) or multiplicative (s_t^T * W * h_i). (2) Normalize over source positions i to get attention weights: alpha_{t,i} = softmax_i(e_{t,i}). (3) Compute context vector: c_t = sum_i(alpha_{t,i} * h_i). (4) Use c_t alongside s_t to generate the output.

Why it was a breakthrough: Attention removes the fixed-length bottleneck, allowing the model to dynamically select relevant source information at each step. This dramatically improved translation quality, especially for long sentences. It also provided interpretability -- attention weights show which source words the model focuses on. Most importantly, attention was the conceptual precursor to self-attention in Transformers, where the same mechanism is applied within a single sequence to relate all positions to each other.

Key Points
  • Vanilla seq2seq bottleneck: entire source compressed into one vector
  • Attention computes weighted sum of all encoder states at each decoder step
  • Alignment scores + softmax + context vector: dynamic, differentiable information retrieval
  • Precursor to self-attention in Transformers; provided interpretability through attention weights
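Steps (1)-(4) fit in a few lines of NumPy, here using the multiplicative (dot-product) score for brevity and arbitrary sizes:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n, d = 6, 8
H = rng.normal(size=(n, d))          # encoder hidden states h_1..h_n
s_t = rng.normal(size=d)             # current decoder state

e_t = H @ s_t                        # (1) alignment scores e_{t,i}
alpha_t = softmax(e_t)               # (2) attention weights, sum to 1
c_t = alpha_t @ H                    # (3) context vector: weighted sum of states

print(alpha_t.sum())  # 1.0
print(c_t.shape)      # (8,)
```

The whole retrieval is differentiable, so the alignment is learned end to end with the rest of the model.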
HARD How would you handle very long sequences (10,000+ time steps) that exceed LSTM capacity?

The fundamental challenge: Even LSTMs struggle with sequences beyond ~500--1000 steps because: (1) the forget gate product over many steps eventually decays (product of values slightly less than 1 still vanishes), (2) memory and compute scale linearly with sequence length, and (3) truncated BPTT limits the effective gradient horizon. At 10K+ steps, vanilla LSTMs are impractical.

Hierarchical approaches: Split the long sequence into segments and process at multiple time scales. A lower-level LSTM processes short segments (e.g., 100 steps), and a higher-level LSTM processes the sequence of segment summaries. This creates an O(sqrt(n)) effective path length. Hierarchical Multiscale LSTM and Clockwork RNN formalize this with different update frequencies for different hidden state groups.

Attention-augmented approaches: Add attention over a sliding window or sparse set of past hidden states. The LSTM processes locally while attention provides long-range skip connections. Transformer-XL extends this by caching previous segment hidden states and attending to them (segment-level recurrence + attention). This achieves practical context lengths of 3K--16K tokens.

Modern alternatives: For truly long sequences, consider: (1) Sparse Transformers with O(n*sqrt(n)) attention, (2) Longformer/BigBird with local + global attention patterns, (3) State-space models (S4, Mamba) that achieve O(n*log(n)) or O(n) complexity with recurrent-like structure but parallelizable training, (4) Compressive memory that summarizes and discards old states to maintain a fixed-size memory. The choice depends on whether you need exact long-range attention (sparse Transformer) or approximate modeling is sufficient (hierarchical RNN, state-space models).

Key Points
  • LSTMs struggle beyond ~500--1000 steps due to forget gate decay and memory constraints
  • Hierarchical RNNs process at multiple time scales, reducing effective path length
  • Transformer-XL: segment-level recurrence + attention for 3K--16K context
  • State-space models (S4, Mamba): O(n) complexity, parallelizable, strong on long sequences
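A minimal PyTorch sketch of the hierarchical idea, with hypothetical sizes (100-step segments over a 10,000-step sequence). For simplicity each segment is summarized independently here; a real hierarchical RNN would also carry low-level state across segments:

```python
import torch
import torch.nn as nn

seq = torch.randn(10_000, 1, 8)              # 10k steps, batch 1, 8 features
low = nn.LSTM(8, 32)                         # summarizes 100-step segments
high = nn.LSTM(32, 32)                       # processes segment summaries

summaries = []
for start in range(0, 10_000, 100):          # 100 segments of length 100
    _, (h_n, _) = low(seq[start : start + 100])
    summaries.append(h_n[0])                 # last hidden state = segment summary
_, (h_top, _) = high(torch.stack(summaries)) # 100-step sequence at the top level

print(h_top.shape)  # torch.Size([1, 1, 32])
```

The longest gradient path is now 100 low-level steps plus 100 high-level steps, instead of 10,000 consecutive steps.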
HARD Compare RNNs, Temporal Convolutional Networks (TCN), and Transformers for sequence modeling.

RNNs (LSTM/GRU): Process sequences step by step with O(n*d^2) complexity. Strengths: natural fit for sequential/streaming data, constant memory per step at inference, handle variable lengths natively. Weaknesses: sequential training cannot be parallelized, limited effective memory (~500 steps for LSTM), slower training than alternatives. Best for: streaming applications, edge deployment, small datasets, and tasks where recurrent inductive bias helps.

Temporal Convolutional Networks (TCN): Apply 1D dilated causal convolutions with exponentially increasing dilation rates to achieve large receptive fields. Complexity: O(n*d^2*k) where k is kernel size. Strengths: fully parallelizable (like Transformers), stable gradients (no multiplicative chain through time, so no vanishing-gradient problem), and a flexible receptive field via dilation. Weaknesses: fixed receptive field (must be designed to cover the maximum dependency length), less flexible than attention for variable-length dependencies, and requires careful architecture design (dilation schedule). Best for: audio generation (WaveNet), fixed-length time series, and tasks where the maximum dependency length is known.

Transformers: Use self-attention for O(n^2*d) computation with full parallelization. Strengths: direct connection between any two positions (O(1) path length), massive parallelism, excellent scaling with data and compute, and the foundation of modern NLP. Weaknesses: O(n^2) memory/compute for attention (problematic for very long sequences), no inherent sequential inductive bias (must learn position from positional encodings), and requires large datasets to outperform alternatives. Best for: NLP (dominant architecture), large-scale pretraining, tasks requiring global context, and any setting with abundant data and compute.

Decision framework: Small data or streaming needed? Use LSTM/GRU. Known fixed dependency length with parallel training? Consider TCN. Large data, need global context, have GPU resources? Use Transformers. Very long sequences? Consider state-space models (Mamba) that combine RNN-like inference with parallelizable training.

Key Points
  • RNN: sequential, constant inference memory, best for streaming and small data
  • TCN: parallel, fixed receptive field via dilated convolutions, stable gradients
  • Transformer: parallel, global attention, scales best but O(n^2) cost
  • State-space models (Mamba): emerging hybrid combining parallel training with O(1) inference
