PCA Interview Questions
15 commonly asked interview questions with detailed answers.
Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms a dataset with many correlated features into a smaller set of uncorrelated variables called principal components. It works by finding the directions (eigenvectors of the covariance matrix) along which the data varies the most and projecting the data onto those directions.
PCA mitigates the curse of dimensionality -- the collection of problems that arise when datasets have too many features. High-dimensional data brings increased computational cost, a higher risk of overfitting, and the breakdown of distance-based methods. By reducing dimensionality while retaining the most informative variation, PCA makes data more manageable, can improve model performance, and enables visualization of complex datasets in 2D or 3D.
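As a concrete illustration, the fit-and-project step can be sketched with scikit-learn (the toy dataset below is invented for the example -- 10 correlated features driven by 3 latent factors):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: 200 samples, 10 correlated features generated from 3 latent factors
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                 # project onto the top-2 directions

print(X_2d.shape)                           # (200, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of total variance retained
```

Because the data is intrinsically 3-dimensional, even two components retain most of the variance here; the `explained_variance_ratio_` attribute reports exactly how much.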
PCA maximizes the variance of the projected data along each principal component direction. The first principal component is the direction in feature space along which the data has the greatest spread (variance). The second component is the direction of maximum remaining variance, constrained to be orthogonal to the first, and so on.
The reasoning behind maximizing variance is that variance serves as a proxy for information content. If a feature has high variance, it helps distinguish between different data points. If a feature is nearly constant (low variance), it provides almost no discriminative information. By projecting onto high-variance directions, PCA retains the most informative aspects of the data while discarding redundant or noisy dimensions.
Feature scaling is critical before PCA because PCA is based on variance, and variance is scale-dependent. If features are measured in different units or have vastly different numerical ranges, the features with larger magnitudes will dominate the principal components simply because they have larger numerical variance, not because they are inherently more important.
For example, if one feature is income measured in dollars (range: 30,000-200,000) and another is age in years (range: 20-65), income will completely dominate the first principal component due to its much larger numerical range. Standardizing features to zero mean and unit variance ensures that each feature contributes equally to the analysis, allowing PCA to find the true underlying structure rather than being biased by scale differences.
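The dominance effect is easy to reproduce. In this sketch (invented numbers, loosely matching the income/age example), the unscaled first component is essentially the income axis, while standardization lets both features contribute:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
age = rng.uniform(20, 65, size=500)                          # years
income = 30_000 + 2_000 * (age - 20) + rng.normal(0, 10_000, size=500)  # dollars
X = np.column_stack([income, age])

# Without scaling: income's huge numeric variance dominates PC1
raw_pc1 = PCA(n_components=1).fit(X).components_[0]
print(np.abs(raw_pc1))       # income loading near 1, age loading near 0

# After standardization: both (correlated) features load on PC1
X_std = StandardScaler().fit_transform(X)
std_pc1 = PCA(n_components=1).fit(X_std).components_[0]
print(np.abs(std_pc1))       # loadings roughly equal
```

The unscaled loadings say nothing about importance -- only about units. After standardization, the first component reflects the genuine income-age correlation.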
A scree plot is a graph that displays the eigenvalues (or explained variance ratios) of the principal components in descending order. It is used to determine how many principal components to retain. The name comes from geology -- "scree" refers to the loose rocks at the base of a cliff, which the shape of the plot resembles.
To use a scree plot, you look for the "elbow" -- the point where eigenvalues transition from a steep decline to a gradual leveling off. Components before the elbow capture meaningful structure in the data, while components after the elbow primarily capture noise. You retain the components up to (and including) the elbow. In practice, this is often complemented by the cumulative variance threshold approach, where you keep enough components to explain 90-95% of the total variance.
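The same information can be read off numerically without plotting. In this sketch, 5 strong directions are planted in 30 noisy features, so the explained-variance ratios drop sharply after the fifth component (sizes and the 95% threshold are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 5 planted latent directions embedded in 30 noisy features
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(300, 30))

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_
print(np.round(ratios[:7], 3))          # sharp drop after component 5 (the "elbow")

# Complementary criterion: components needed for >=95% cumulative variance
cumulative = np.cumsum(ratios)
k95 = np.searchsorted(cumulative, 0.95) + 1
print(k95)
```

Plotting `ratios` against the component index produces the scree plot itself; the elbow sits where the printed values collapse toward zero.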
There are three main criteria for choosing the number of components. First, the cumulative variance threshold: retain the minimum number of components needed to explain a desired percentage of total variance, typically 95% or 99%. In sklearn, you can set PCA(n_components=0.95) to auto-select this number.
Second, the scree plot elbow method: plot eigenvalues in descending order and look for the point where the curve bends sharply, indicating a transition from meaningful variation to noise. Third, the Kaiser criterion: when using the correlation matrix, retain only components with eigenvalues greater than 1. The reasoning is that a component should explain at least as much variance as a single standardized variable. In practice, the cumulative variance method is the most commonly used and is easiest to automate.
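The first and third criteria can be sketched in a few lines of scikit-learn (the synthetic data here is only for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))   # correlated features

# Criterion 1: auto-select the number of components for 95% variance
pca95 = PCA(n_components=0.95).fit(X)
print(pca95.n_components_)

# Criterion 3 (Kaiser): on standardized data, keep eigenvalues > 1
X_std = StandardScaler().fit_transform(X)
eigvals = PCA().fit(X_std).explained_variance_
print((eigvals > 1).sum())
```

Passing a float between 0 and 1 as `n_components` is what makes sklearn pick the component count automatically; the fitted `n_components_` attribute reports the chosen number.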
SVD (Singular Value Decomposition) is a more general matrix factorization that decomposes any matrix X into three matrices: X = UΣV^T. The connection to PCA is direct and elegant. When X is a centered data matrix, the right singular vectors (columns of V) are exactly the principal component directions -- the eigenvectors of the covariance matrix X^TX/n. The eigenvalues of the covariance matrix relate to singular values by λ_i = σ_i^2/n.
SVD is the preferred computational method for PCA because computing X^TX explicitly squares the condition number of the data matrix, amplifying numerical errors. SVD works directly on X, avoiding this issue. Additionally, truncated SVD algorithms can find only the top k principal components without computing the full decomposition, which is much more efficient for large datasets. Scikit-learn's PCA implementation uses SVD internally rather than explicit eigendecomposition.
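The eigendecomposition/SVD equivalence is easy to verify numerically with NumPy (random data; nothing here is specific to a real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
Xc = X - X.mean(axis=0)               # PCA assumes centered data
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix C = X^T X / n
C = Xc.T @ Xc / n
eigvals = np.linalg.eigh(C)[0][::-1]  # eigh returns ascending; flip to descending

# Route 2: SVD of the centered data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
lam_from_svd = s**2 / n               # lambda_i = sigma_i^2 / n

print(np.allclose(eigvals, lam_from_svd))   # True
```

The rows of `Vt` are likewise the principal directions (up to sign), which is why sklearn never forms the covariance matrix explicitly.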
PCA makes several important assumptions. First, linearity: PCA assumes that the principal components are linear combinations of the original features. If the data has nonlinear structure (e.g., a Swiss roll or concentric circles), standard PCA will fail to capture it. Kernel PCA addresses this limitation. Second, variance equals importance: PCA assumes that directions of maximum variance are the most informative. This is not always true -- in classification tasks, the most discriminative direction may not coincide with the direction of maximum variance.
Third, orthogonality: PCA constrains all principal components to be mutually orthogonal. While this ensures uncorrelated components, the true underlying factors in the data may not be orthogonal. Fourth, PCA uses only first and second-order statistics (mean and covariance), so it cannot capture higher-order dependencies. For non-Gaussian data where higher-order statistics matter, ICA (Independent Component Analysis) may be more appropriate. Finally, PCA assumes the data is centered (mean-subtracted) before analysis.
Standard PCA finds linear principal components -- directions that are linear combinations of the original features. This works well when the data structure is approximately linear, but fails when the underlying relationships are nonlinear. For example, if data lies on concentric circles, linear PCA cannot separate the classes because no linear projection preserves the structure.
Kernel PCA addresses this by using the kernel trick to implicitly map data into a higher-dimensional feature space where linear PCA can capture the nonlinear structure. Instead of computing the covariance matrix, kernel PCA works with the kernel matrix (Gram matrix) K_ij = K(x_i, x_j). Common kernels include RBF (Gaussian) and polynomial. The key trade-off is that kernel PCA is computationally more expensive (O(n^3) for eigendecomposition of the n x n kernel matrix) and does not have a straightforward inverse transform for reconstruction.
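A common demonstration, loosely following scikit-learn's concentric-circles example (the `gamma` value and classifier check are illustrative choices, not from the source):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA is just a rotation here; the rings stay concentric
lin = PCA(n_components=2).fit_transform(X)
# RBF kernel PCA unrolls the radial structure
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Proxy for "did the embedding linearize the structure": linear classifier accuracy
acc_lin = LogisticRegression().fit(lin, y).score(lin, y)
acc_kpca = LogisticRegression().fit(kpca, y).score(kpca, y)
print(acc_lin, acc_kpca)   # near chance vs. near perfect
```

No linear boundary separates concentric circles, so the classifier fails on the linear-PCA projection but succeeds after the RBF kernel map.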
PCA is unsupervised and maximizes total variance in the data, ignoring class labels entirely. It finds the directions where the data spreads the most. LDA (Linear Discriminant Analysis) is supervised and uses class labels to find directions that maximize the ratio of between-class variance to within-class variance. In other words, LDA finds projections that best separate different classes.
Use PCA when you want general-purpose dimensionality reduction, visualization, or preprocessing without class information. Use LDA when you have labeled data and the goal is classification -- LDA will find a projection that maximizes class separation, which may not coincide with the direction of maximum total variance. A key limitation of LDA is that it can produce at most (k-1) components where k is the number of classes, so for binary classification it gives only one discriminant direction. PCA has no such limitation.
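The contrast can be made concrete with a toy dataset where the class signal lies along a low-variance axis (all numbers below are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 300
# Feature 0: large shared spread; feature 1: small spread but carries the labels
X0 = np.column_stack([rng.normal(0, 5, n), rng.normal(-1, 0.5, n)])
X1 = np.column_stack([rng.normal(0, 5, n), rng.normal(+1, 0.5, n)])
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

pc1 = PCA(n_components=1).fit(X).components_[0]       # follows max variance
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
w_unit = lda.coef_[0] / np.linalg.norm(lda.coef_[0])  # follows class separation

print(np.abs(pc1))     # dominated by feature 0 (high variance, no labels)
print(np.abs(w_unit))  # dominated by feature 1 (the discriminative axis)
```

PCA picks the noisy high-variance axis because it never sees `y`; LDA, using the labels, picks the nearly-orthogonal discriminative axis.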
Multicollinearity occurs when two or more features are highly correlated, causing instability in regression models -- small changes in data lead to large changes in coefficient estimates. PCA directly addresses this because principal components are orthogonal and uncorrelated by construction. When you replace the original correlated features with principal components, the multicollinearity problem vanishes entirely.
This approach is called Principal Component Regression (PCR): apply PCA to the feature matrix, select the top k components, and use those as inputs to the regression model instead of the original features. The trade-off is loss of interpretability -- you can no longer say "a one-unit increase in feature X leads to a Y-unit change in the response," because the components are abstract linear combinations of all original features. Ridge regression is an alternative that handles multicollinearity without losing interpretability.
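A minimal PCR sketch (synthetic collinear data; the pipeline structure is the point, not the numbers):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Severe multicollinearity: x2 is almost an exact copy of x1
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=200)

# Principal Component Regression: scale -> PCA -> ordinary least squares
pcr = make_pipeline(StandardScaler(), PCA(n_components=1), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))   # R^2 near 1 despite the near-singular design matrix
```

A plain `LinearRegression` on `X` would produce wildly unstable coefficients on `x1` and `x2`; the single orthogonal component sidesteps that instability at the cost of interpretability.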
Given centered data X, the variance of the data projected onto a unit vector w is Var(Xw) = w^T C w, where C = (1/n) X^T X is the covariance matrix. We want to maximize w^T C w subject to the constraint w^T w = 1. Using a Lagrange multiplier λ, the Lagrangian is L(w, λ) = w^T C w - λ(w^T w - 1). Taking the derivative with respect to w and setting it to zero gives 2Cw - 2λw = 0, which simplifies to Cw = λw.
This is the eigenvalue equation. The optimal w is an eigenvector of C, and the projected variance equals w^T C w = w^T λw = λ. Therefore, to maximize the projected variance, we choose the eigenvector corresponding to the largest eigenvalue. For the second component, we maximize w^T C w subject to both w^T w = 1 and orthogonality to the first eigenvector. By a similar Lagrange argument, the second component is the eigenvector with the second-largest eigenvalue. By induction, the top k principal components are the k eigenvectors of C with the k largest eigenvalues.
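The conclusions Cw = λw and projected variance = λ can be spot-checked numerically (random covariance; a sanity check rather than a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(Xc)            # covariance matrix of centered data

eigvals, eigvecs = np.linalg.eigh(C)
w = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue
lam = eigvals[-1]

# Projected variance along w equals its eigenvalue: w^T C w = lambda
print(np.isclose(w @ C @ w, lam))  # True

# And no random unit vector achieves higher projected variance
for _ in range(1000):
    v = rng.normal(size=4)
    v /= np.linalg.norm(v)
    assert v @ C @ v <= lam + 1e-9
```

The loop is the Lagrangian argument in empirical form: every competing unit direction yields projected variance at most the top eigenvalue.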
PCA-based anomaly detection works on the principle that normal data points lie within the subspace spanned by the top principal components, while anomalies deviate significantly from this subspace. The key metric is the reconstruction error: for each data point, project it onto the top k components and reconstruct it. The reconstruction error (the squared distance between the original and reconstructed point) measures how well the point fits the learned normal pattern.
Points with high reconstruction error are likely anomalies because they have significant variation along the low-variance (discarded) components, which is unusual. Deviation from normal behavior can be decomposed into two complementary statistics: Hotelling's T-squared statistic (measuring deviation within the principal subspace) and the Q-statistic or SPE (Squared Prediction Error, measuring deviation in the residual subspace). PCA-based anomaly detection is widely used in manufacturing process monitoring, network intrusion detection, and financial fraud detection because it does not require labeled anomaly examples.
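A reconstruction-error detector can be sketched as follows (synthetic "normal" data lying near a 2-D subspace of 10-D space; sizes and noise levels are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Normal points: a 2-D plane inside 10-D space, plus small noise
normal = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))
normal += 0.05 * rng.normal(size=normal.shape)
# Anomalies: points that do not respect the learned subspace
anomaly = 3 * rng.normal(size=(5, 10))

pca = PCA(n_components=2).fit(normal)

def reconstruction_error(pca, X):
    # Project onto the top components, reconstruct, measure squared residual (SPE/Q)
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.sum((X - X_hat) ** 2, axis=1)

err_normal = reconstruction_error(pca, normal)
err_anomaly = reconstruction_error(pca, anomaly)
print(err_normal.mean(), err_anomaly.mean())   # anomalies score orders of magnitude higher
```

In practice an alarm threshold on this error is calibrated from normal data alone (e.g., a high percentile of `err_normal`), which is why no labeled anomalies are needed.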
Standard PCA can only capture linear relationships because it finds linear projections (eigenvectors of the covariance matrix). If the data lies on a nonlinear manifold -- for example, a Swiss roll, an S-curve, or concentric circles -- PCA will project the data along linear directions that cut across the manifold, failing to preserve the intrinsic nonlinear structure. The resulting projection mixes together points that are far apart on the manifold but close in Euclidean space.
Several alternatives handle nonlinear data. Kernel PCA uses the kernel trick to perform PCA in a high-dimensional feature space, capturing nonlinear structure. t-SNE (t-distributed Stochastic Neighbor Embedding) preserves local neighborhood structure and is excellent for visualization but does not provide a general-purpose transformation. UMAP (Uniform Manifold Approximation and Projection) offers similar benefits to t-SNE with better preservation of global structure and faster computation. Autoencoders (neural network-based) learn a nonlinear encoder-decoder mapping and can capture complex nonlinear relationships. Isomap and Locally Linear Embedding (LLE) are manifold learning methods that preserve geodesic distances or local linear structure.
Standard PCA requires loading the entire dataset into memory to compute the covariance matrix or perform SVD, which is infeasible for very large datasets that exceed available RAM. Incremental PCA (IPCA) solves this by processing data in mini-batches. It updates the principal component estimates incrementally as each batch is processed, without ever needing the full dataset in memory simultaneously.
IPCA works by maintaining a running estimate of the covariance structure. When a new batch arrives, it combines the batch statistics with the current estimate using a rank-augmented SVD update. The final components converge to approximately the same result as full PCA. Sklearn provides IncrementalPCA, which accepts data in chunks via the partial_fit method. IPCA is essential for datasets too large for memory, streaming data scenarios, and online learning settings where data arrives continuously. The trade-off is a slight approximation error compared to full PCA, and the batch size affects the quality of approximation.
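A minimal IncrementalPCA sketch (synthetic data split into chunks to mimic an out-of-core setting; the batch count is an illustrative choice):

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)) @ rng.normal(size=(20, 20))

# Feed the data in mini-batches, as if the full matrix did not fit in memory
ipca = IncrementalPCA(n_components=5)
for batch in np.array_split(X, 10):   # 10 batches of 100 rows each
    ipca.partial_fit(batch)

# Full-batch PCA for comparison: explained variance nearly matches
full = PCA(n_components=5).fit(X)
print(np.round(ipca.explained_variance_ratio_, 3))
print(np.round(full.explained_variance_ratio_, 3))
```

Each `partial_fit` call folds one batch into the running estimate; the batch size must be at least `n_components`, and larger batches generally give a closer approximation to full PCA.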
PCA and autoencoders are both dimensionality reduction techniques, and there is a deep mathematical connection between them. A linear autoencoder -- a neural network with one hidden layer, no activation function, trained with MSE loss -- learns a representation that spans the same subspace as PCA. The weight matrices of a trained linear autoencoder are related to the principal components by a rotation. Specifically, the encoder weights span the same subspace as the top k eigenvectors of the covariance matrix.
The key difference emerges when the autoencoder uses nonlinear activation functions. A nonlinear autoencoder can learn complex, nonlinear mappings between the original space and the compressed representation, capturing relationships that PCA cannot. This makes deep autoencoders strictly more powerful than PCA -- they subsume PCA as a special linear case. However, autoencoders have significant disadvantages: they require more data, have hyperparameters to tune (architecture, learning rate, activation), are prone to overfitting, and provide non-unique solutions (different initializations give different representations). PCA is deterministic, parameter-free, and always gives the globally optimal linear solution.