PCA Interview Questions
15 commonly asked interview questions with detailed answers.
Principal Component Analysis (PCA) is an unsupervised dimensionality reduction technique that transforms a dataset with many correlated features into a smaller set of uncorrelated variables called principal components. It works by finding the directions (eigenvectors of the covariance matrix) along which the data varies the most and projecting the data onto those directions.
PCA mitigates the curse of dimensionality -- the collection of problems that arise when datasets have too many features. High-dimensional data brings increased computational cost, a higher risk of overfitting, and the breakdown of distance-based methods. By reducing dimensionality while retaining the most informative variation, PCA makes data more manageable, can improve model performance, and enables visualization of complex datasets in 2D or 3D.
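As a concrete illustration, the fit-and-project step can be sketched with scikit-learn (the toy dataset below is invented for the example -- 10 correlated features driven by 3 latent factors):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: 200 samples, 10 correlated features generated from 3 latent factors
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)                 # project onto the top-2 directions

print(X_2d.shape)                           # (200, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of total variance retained
```

Because the data is intrinsically 3-dimensional, even two components retain most of the variance here; the `explained_variance_ratio_` attribute reports exactly how much.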
PCA maximizes the variance of the projected data along each principal component direction. The first principal component is the direction in feature space along which the data has the greatest spread (variance). The second component is the direction of maximum remaining variance, constrained to be orthogonal to the first, and so on.
The reasoning behind maximizing variance is that variance serves as a proxy for information content. If a feature has high variance, it helps distinguish between different data points. If a feature is nearly constant (low variance), it provides almost no discriminative information. By projecting onto high-variance directions, PCA retains the most informative aspects of the data while discarding redundant or noisy dimensions.
Feature scaling is critical before PCA because PCA is based on variance, and variance is scale-dependent. If features are measured in different units or have vastly different numerical ranges, the features with larger magnitudes will dominate the principal components simply because they have larger numerical variance, not because they are inherently more important.
For example, if one feature is income measured in dollars (range: 30,000-200,000) and another is age in years (range: 20-65), income will completely dominate the first principal component due to its much larger numerical range. Standardizing features to zero mean and unit variance ensures that each feature contributes equally to the analysis, allowing PCA to find the true underlying structure rather than being biased by scale differences.
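The dominance effect is easy to reproduce. In this sketch (invented numbers, loosely matching the income/age example), the unscaled first component is essentially the income axis, while standardization lets both features contribute:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
age = rng.uniform(20, 65, size=500)                          # years
income = 30_000 + 2_000 * (age - 20) + rng.normal(0, 10_000, size=500)  # dollars
X = np.column_stack([income, age])

# Without scaling: income's huge numeric variance dominates PC1
raw_pc1 = PCA(n_components=1).fit(X).components_[0]
print(np.abs(raw_pc1))       # income loading near 1, age loading near 0

# After standardization: both (correlated) features load on PC1
X_std = StandardScaler().fit_transform(X)
std_pc1 = PCA(n_components=1).fit(X_std).components_[0]
print(np.abs(std_pc1))       # loadings roughly equal
```

The unscaled loadings say nothing about importance -- only about units. After standardization, the first component reflects the genuine income-age correlation.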
A scree plot is a graph that displays the eigenvalues (or explained variance ratios) of the principal components in descending order. It is used to determine how many principal components to retain. The name comes from geology -- "scree" refers to the loose rocks at the base of a cliff, which the shape of the plot resembles.
To use a scree plot, you look for the "elbow" -- the point where eigenvalues transition from a steep decline to a gradual leveling off. Components before the elbow capture meaningful structure in the data, while components after the elbow primarily capture noise. You retain the components up to (and including) the elbow. In practice, this is often complemented by the cumulative variance threshold approach, where you keep enough components to explain 90-95% of the total variance.
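The same information can be read off numerically without plotting. In this sketch, 5 strong directions are planted in 30 noisy features, so the explained-variance ratios drop sharply after the fifth component (sizes and the 95% threshold are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 5 planted latent directions embedded in 30 noisy features
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 30)) + 0.1 * rng.normal(size=(300, 30))

pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_
print(np.round(ratios[:7], 3))          # sharp drop after component 5 (the "elbow")

# Complementary criterion: components needed for >=95% cumulative variance
cumulative = np.cumsum(ratios)
k95 = np.searchsorted(cumulative, 0.95) + 1
print(k95)
```

Plotting `ratios` against the component index produces the scree plot itself; the elbow sits where the printed values collapse toward zero.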
There are three main criteria for choosing the number of components. First, the cumulative variance threshold: retain the minimum number of components needed to explain a desired percentage of total variance, typically 95% or 99%. In sklearn, you can set PCA(n_components=0.95) to auto-select this number.
Second, the scree plot elbow method: plot eigenvalues in descending order and look for the point where the curve bends sharply, indicating a transition from meaningful variation to noise. Third, the Kaiser criterion: when using the correlation matrix, retain only components with eigenvalues greater than 1. The reasoning is that a component should explain at least as much variance as a single standardized variable. In practice, the cumulative variance method is the most commonly used and is easiest to automate.
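The first and third criteria can be sketched in a few lines of scikit-learn (the synthetic data here is only for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))   # correlated features

# Criterion 1: auto-select the number of components for 95% variance
pca95 = PCA(n_components=0.95).fit(X)
print(pca95.n_components_)

# Criterion 3 (Kaiser): on standardized data, keep eigenvalues > 1
X_std = StandardScaler().fit_transform(X)
eigvals = PCA().fit(X_std).explained_variance_
print((eigvals > 1).sum())
```

Passing a float between 0 and 1 as `n_components` is what makes sklearn pick the component count automatically; the fitted `n_components_` attribute reports the chosen number.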
SVD (Singular Value Decomposition) is a more general matrix factorization that decomposes any matrix X into three matrices: X = UΣV^T. The connection to PCA is direct and elegant. When X is a centered data matrix, the right singular vectors (columns of V) are exactly the principal component directions -- the eigenvectors of the covariance matrix X^TX/n. The eigenvalues of the covariance matrix relate to singular values by λ_i = σ_i^2/n.
SVD is the preferred computational method for PCA because computing X^TX explicitly squares the condition number of the data matrix, amplifying numerical errors. SVD works directly on X, avoiding this issue. Additionally, truncated SVD algorithms can find only the top k principal components without computing the full decomposition, which is much more efficient for large datasets. Scikit-learn's PCA implementation uses SVD internally rather than explicit eigendecomposition.
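The eigendecomposition/SVD equivalence is easy to verify numerically with NumPy (random data; nothing here is specific to a real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
Xc = X - X.mean(axis=0)               # PCA assumes centered data
n = Xc.shape[0]

# Route 1: eigendecomposition of the covariance matrix C = X^T X / n
C = Xc.T @ Xc / n
eigvals = np.linalg.eigh(C)[0][::-1]  # eigh returns ascending; flip to descending

# Route 2: SVD of the centered data matrix
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
lam_from_svd = s**2 / n               # lambda_i = sigma_i^2 / n

print(np.allclose(eigvals, lam_from_svd))   # True
```

The rows of `Vt` are likewise the principal directions (up to sign), which is why sklearn never forms the covariance matrix explicitly.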
PCA makes several important assumptions. First, linearity: PCA assumes that the principal components are linear combinations of the original features. If the data has nonlinear structure (e.g., a Swiss roll or concentric circles), standard PCA will fail to capture it. Kernel PCA addresses this limitation. Second, variance equals importance: PCA assumes that directions of maximum variance are the most informative. This is not always true -- in classification tasks, the most discriminative direction may not coincide with the direction of maximum variance.
Third, orthogonality: PCA constrains all principal components to be mutually orthogonal. While this ensures uncorrelated components, the true underlying factors in the data may not be orthogonal. Fourth, PCA uses only first and second-order statistics (mean and covariance), so it cannot capture higher-order dependencies. For non-Gaussian data where higher-order statistics matter, ICA (Independent Component Analysis) may be more appropriate. Finally, PCA assumes the data is centered (mean-subtracted) before analysis.
Standard PCA finds linear principal components -- directions that are linear combinations of the original features. This works well when the data structure is approximately linear, but fails when the underlying relationships are nonlinear. For example, if data lies on concentric circles, linear PCA cannot separate the classes because no linear projection preserves the structure.
Kernel PCA addresses this by using the kernel trick to implicitly map data into a higher-dimensional feature space where linear PCA can capture the nonlinear structure. Instead of computing the covariance matrix, kernel PCA works with the kernel matrix (Gram matrix) K_ij = K(x_i, x_j). Common kernels include RBF (Gaussian) and polynomial. The key trade-off is that kernel PCA is computationally more expensive (O(n^3) for eigendecomposition of the n x n kernel matrix) and does not have a straightforward inverse transform for reconstruction.
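A common demonstration, loosely following scikit-learn's concentric-circles example (the `gamma` value and classifier check are illustrative choices, not from the source):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA is just a rotation here; the rings stay concentric
lin = PCA(n_components=2).fit_transform(X)
# RBF kernel PCA unrolls the radial structure
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Proxy for "did the embedding linearize the structure": linear classifier accuracy
acc_lin = LogisticRegression().fit(lin, y).score(lin, y)
acc_kpca = LogisticRegression().fit(kpca, y).score(kpca, y)
print(acc_lin, acc_kpca)   # near chance vs. near perfect
```

No linear boundary separates concentric circles, so the classifier fails on the linear-PCA projection but succeeds after the RBF kernel map.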
PCA is unsupervised and maximizes total variance in the data, ignoring class labels entirely. It finds the directions where the data spreads the most. LDA (Linear Discriminant Analysis) is supervised and uses class labels to find directions that maximize the ratio of between-class variance to within-class variance. In other words, LDA finds projections that best separate different classes.
Use PCA when you want general-purpose dimensionality reduction, visualization, or preprocessing without class information. Use LDA when you have labeled data and the goal is classification -- LDA will find a projection that maximizes class separation, which may not coincide with the direction of maximum total variance. A key limitation of LDA is that it can produce at most (k-1) components where k is the number of classes, so for binary classification it gives only one discriminant direction. PCA has no such limitation.
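The contrast can be made concrete with a toy dataset where the class signal lies along a low-variance axis (all numbers below are invented for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
n = 300
# Feature 0: large shared spread; feature 1: small spread but carries the labels
X0 = np.column_stack([rng.normal(0, 5, n), rng.normal(-1, 0.5, n)])
X1 = np.column_stack([rng.normal(0, 5, n), rng.normal(+1, 0.5, n)])
X = np.vstack([X0, X1])
y = np.array([0] * n + [1] * n)

pc1 = PCA(n_components=1).fit(X).components_[0]       # follows max variance
lda = LinearDiscriminantAnalysis(n_components=1).fit(X, y)
w_unit = lda.coef_[0] / np.linalg.norm(lda.coef_[0])  # follows class separation

print(np.abs(pc1))     # dominated by feature 0 (high variance, no labels)
print(np.abs(w_unit))  # dominated by feature 1 (the discriminative axis)
```

PCA picks the noisy high-variance axis because it never sees `y`; LDA, using the labels, picks the nearly-orthogonal discriminative axis.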
Multicollinearity occurs when two or more features are highly correlated, causing instability in regression models -- small changes in data lead to large changes in coefficient estimates. PCA directly addresses this because principal components are orthogonal and uncorrelated by construction. When you replace the original correlated features with principal components, the multicollinearity problem vanishes entirely.
This approach is called Principal Component Regression (PCR): apply PCA to the feature matrix, select the top k components, and use those as inputs to the regression model instead of the original features. The trade-off is loss of interpretability -- you can no longer say "a one-unit increase in feature X leads to a Y-unit change in the response," because the components are abstract linear combinations of all original features. Ridge regression is an alternative that handles multicollinearity without losing interpretability.
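A minimal PCR sketch (synthetic collinear data; the pipeline structure is the point, not the numbers):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Severe multicollinearity: x2 is almost an exact copy of x1
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=200)

# Principal Component Regression: scale -> PCA -> ordinary least squares
pcr = make_pipeline(StandardScaler(), PCA(n_components=1), LinearRegression())
pcr.fit(X, y)
print(pcr.score(X, y))   # R^2 near 1 despite the near-singular design matrix
```

A plain `LinearRegression` on `X` would produce wildly unstable coefficients on `x1` and `x2`; the single orthogonal component sidesteps that instability at the cost of interpretability.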
Given centered data X, the variance of the data projected onto a unit vector w is Var(Xw) = w^T C w, where C = (1/n) X^T X is the covariance matrix. We want to maximize w^T C w subject to the constraint w^T w = 1. Using a Lagrange multiplier λ, the Lagrangian is L(w, λ) = w^T C w - λ(w^T w - 1). Taking the derivative with respect to w and setting it to zero gives 2Cw - 2λw = 0, which simplifies to Cw = λw.
This is the eigenvalue equation. The optimal w is an eigenvector of C, and the projected variance equals w^T C w = w^T λw = λ. Therefore, to maximize the projected variance, we choose the eigenvector corresponding to the largest eigenvalue. For the second component, we maximize w^T C w subject to both w^T w = 1 and orthogonality to the first eigenvector. By a similar Lagrange argument, the second component is the eigenvector with the second-largest eigenvalue. By induction, the top k principal components are the k eigenvectors of C with the k largest eigenvalues.
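The conclusions Cw = λw and projected variance = λ can be spot-checked numerically (random covariance; a sanity check rather than a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(Xc)            # covariance matrix of centered data

eigvals, eigvecs = np.linalg.eigh(C)
w = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue
lam = eigvals[-1]

# Projected variance along w equals its eigenvalue: w^T C w = lambda
print(np.isclose(w @ C @ w, lam))  # True

# And no random unit vector achieves higher projected variance
for _ in range(1000):
    v = rng.normal(size=4)
    v /= np.linalg.norm(v)
    assert v @ C @ v <= lam + 1e-9
```

The loop is the Lagrangian argument in empirical form: every competing unit direction yields projected variance at most the top eigenvalue.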
PCA-based anomaly detection works on the principle that normal data points lie within the subspace spanned by the top principal components, while anomalies deviate significantly from this subspace. The key metric is the reconstruction error: for each data point, project it onto the top k components and reconstruct it. The reconstruction error (the squared distance between the original and reconstructed point) measures how well the point fits the learned normal pattern.
Points with high reconstruction error are likely anomalies because they have significant variation along the low-variance (discarded) components, which is unusual. Deviation from normal behavior can be decomposed into two complementary statistics: Hotelling's T-squared statistic (measuring deviation within the principal subspace) and the Q-statistic or SPE (Squared Prediction Error, measuring deviation in the residual subspace). PCA-based anomaly detection is widely used in manufacturing process monitoring, network intrusion detection, and financial fraud detection because it does not require labeled anomaly examples.
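A reconstruction-error detector can be sketched as follows (synthetic "normal" data lying near a 2-D subspace of 10-D space; sizes and noise levels are illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Normal points: a 2-D plane inside 10-D space, plus small noise
normal = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))
normal += 0.05 * rng.normal(size=normal.shape)
# Anomalies: points that do not respect the learned subspace
anomaly = 3 * rng.normal(size=(5, 10))

pca = PCA(n_components=2).fit(normal)

def reconstruction_error(pca, X):
    # Project onto the top components, reconstruct, measure squared residual (SPE/Q)
    X_hat = pca.inverse_transform(pca.transform(X))
    return np.sum((X - X_hat) ** 2, axis=1)

err_normal = reconstruction_error(pca, normal)
err_anomaly = reconstruction_error(pca, anomaly)
print(err_normal.mean(), err_anomaly.mean())   # anomalies score orders of magnitude higher
```

In practice an alarm threshold on this error is calibrated from normal data alone (e.g., a high percentile of `err_normal`), which is why no labeled anomalies are needed.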
Standard PCA can only capture linear relationships because it finds linear projections (eigenvectors of the covariance matrix). If the data lies on a nonlinear manifold -- for example, a Swiss roll, an S-curve, or concentric circles -- PCA will project the data along linear directions that cut across the manifold, failing to preserve the intrinsic nonlinear structure. The resulting projection mixes together points that are far apart on the manifold but close in Euclidean space.
Several alternatives handle nonlinear data. Kernel PCA uses the kernel trick to perform PCA in a high-dimensional feature space, capturing nonlinear structure. t-SNE (t-distributed Stochastic Neighbor Embedding) preserves local neighborhood structure and is excellent for visualization but does not provide a general-purpose transformation. UMAP (Uniform Manifold Approximation and Projection) offers similar benefits to t-SNE with better preservation of global structure and faster computation. Autoencoders (neural network-based) learn a nonlinear encoder-decoder mapping and can capture complex nonlinear relationships. Isomap and Locally Linear Embedding (LLE) are manifold learning methods that preserve geodesic distances or local linear structure.
Standard PCA requires loading the entire dataset into memory to compute the covariance matrix or perform SVD, which is infeasible for very large datasets that exceed available RAM. Incremental PCA (IPCA) solves this by processing data in mini-batches. It updates the principal component estimates incrementally as each batch is processed, without ever needing the full dataset in memory simultaneously.
IPCA works by maintaining a running estimate of the covariance structure. When a new batch arrives, it combines the batch statistics with the current estimate using a rank-augmented SVD update. The final components converge to approximately the same result as full PCA. Sklearn provides IncrementalPCA, which accepts data in chunks via the partial_fit method. IPCA is essential for datasets too large for memory, streaming data scenarios, and online learning settings where data arrives continuously. The trade-off is a slight approximation error compared to full PCA, and the batch size affects the quality of approximation.
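A minimal IncrementalPCA sketch (synthetic data split into chunks to mimic an out-of-core setting; the batch count is an illustrative choice):

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20)) @ rng.normal(size=(20, 20))

# Feed the data in mini-batches, as if the full matrix did not fit in memory
ipca = IncrementalPCA(n_components=5)
for batch in np.array_split(X, 10):   # 10 batches of 100 rows each
    ipca.partial_fit(batch)

# Full-batch PCA for comparison: explained variance nearly matches
full = PCA(n_components=5).fit(X)
print(np.round(ipca.explained_variance_ratio_, 3))
print(np.round(full.explained_variance_ratio_, 3))
```

Each `partial_fit` call folds one batch into the running estimate; the batch size must be at least `n_components`, and larger batches generally give a closer approximation to full PCA.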
PCA and autoencoders are both dimensionality reduction techniques, and there is a deep mathematical connection between them. A linear autoencoder -- a neural network with one hidden layer, no activation function, trained with MSE loss -- learns a representation that spans the same subspace as PCA. The weight matrices of a trained linear autoencoder are related to the principal components by a rotation. Specifically, the encoder weights span the same subspace as the top k eigenvectors of the covariance matrix.
The key difference emerges when the autoencoder uses nonlinear activation functions. A nonlinear autoencoder can learn complex, nonlinear mappings between the original space and the compressed representation, capturing relationships that PCA cannot. This makes deep autoencoders strictly more powerful than PCA -- they subsume PCA as a special linear case. However, autoencoders have significant disadvantages: they require more data, have hyperparameters to tune (architecture, learning rate, activation), are prone to overfitting, and provide non-unique solutions (different initializations give different representations). PCA is deterministic, parameter-free, and always gives the globally optimal linear solution.