
PCA Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Key Formulas

Covariance Matrix (of centered data $X_c$):
$$C = \frac{1}{n} X_c^T X_c$$
Eigenvalue Equation:
$$C\mathbf{v} = \lambda \mathbf{v}$$
Projection:
$$Z = X_c W_k$$
Variance Maximization:
$$\max_{\mathbf{w}} \mathbf{w}^T C \mathbf{w} \quad \text{s.t.} \quad \mathbf{w}^T \mathbf{w} = 1$$
Explained Variance Ratio:
$$\frac{\lambda_i}{\sum_{j=1}^{d} \lambda_j}$$
Reconstruction:
$$\hat{X} = Z W_k^T + \bar{X}$$
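The formulas above can be traced end-to-end in a few lines of NumPy; the toy data and the choice $k = 2$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))  # toy correlated data
n = X.shape[0]

X_bar = X.mean(axis=0)
Xc = X - X_bar                        # center the data
C = (Xc.T @ Xc) / n                   # covariance matrix C = (1/n) Xc^T Xc

eigvals, eigvecs = np.linalg.eigh(C)  # solves C v = lambda v (C is symmetric)
order = np.argsort(eigvals)[::-1]     # sort by decreasing eigenvalue
eigvals, W = eigvals[order], eigvecs[:, order]

k = 2
Wk = W[:, :k]                         # top-k eigenvectors
Z = Xc @ Wk                           # projection Z = Xc W_k
ratio = eigvals / eigvals.sum()       # explained variance ratio
X_hat = Z @ Wk.T + X_bar              # reconstruction from k components
```

Dropping the smallest-eigenvalue direction keeps most of the variance, so `X_hat` differs from `X` only along the discarded component.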

Algorithm Steps

  1. Center the data: Subtract the mean of each feature. Standardize if features are on different scales.
  2. Compute covariance matrix: $C = \frac{1}{n} X_c^T X_c$ where $X_c$ is centered data.
  3. Eigendecomposition: Solve $C\mathbf{v} = \lambda \mathbf{v}$ for all eigenvalues and eigenvectors.
  4. Sort by eigenvalue: Order eigenvectors by decreasing eigenvalue: $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$.
  5. Select top $k$ components: Choose $k$ based on scree plot or cumulative variance threshold.
  6. Project: $Z = X_c \cdot W_k$ where $W_k$ contains the top $k$ eigenvectors.

In practice, sklearn uses SVD instead of explicit eigendecomposition for better numerical stability.
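The steps above reduce to a few sklearn calls. A minimal sketch (the random data and scale factors are made up for illustration):

```python
# Minimal sklearn sketch of the algorithm steps; toy data is illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4)) * np.array([1.0, 5.0, 0.5, 10.0])  # mixed scales

X_std = StandardScaler().fit_transform(X)  # step 1: center and scale
pca = PCA(n_components=2)                  # steps 2-5 happen inside fit (via SVD)
Z = pca.fit_transform(X_std)               # step 6: project onto top 2 components

print(Z.shape)                             # (150, 2)
print(pca.explained_variance_ratio_)       # per-component variance share
```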

Choosing Components

Cumulative Variance (95%):
$$k = \min\left\{k : \frac{\sum_{i=1}^{k}\lambda_i}{\sum_{i=1}^{d}\lambda_i} \geq 0.95\right\}$$
Kaiser Criterion:
Keep components with $\lambda_i > 1$ (valid when PCA is computed on the correlation matrix, i.e. on standardized features)
Scree Plot:
Look for the "elbow" where eigenvalues transition from steep to flat

In sklearn: PCA(n_components=0.95) auto-selects $k$ for 95% variance.
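Both the automatic and the manual selection can be checked against each other; this sketch uses the sklearn digits dataset purely for illustration:

```python
# Component selection sketch on sklearn's digits data (dataset is illustrative).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)

# Auto-select the smallest k reaching 95% cumulative variance
pca = PCA(n_components=0.95).fit(X)

# The same k, computed by hand from the full eigenvalue spectrum
cum = np.cumsum(PCA().fit(X).explained_variance_ratio_)
k = int(np.searchsorted(cum, 0.95)) + 1
```

`pca.n_components_` holds the selected $k$; `cum` is also exactly what a cumulative-variance plot would display.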

SVD Connection

SVD Decomposition (of the centered data $X_c$):
$$X_c = U \Sigma V^T$$
Principal Directions:
Columns of $V$ = eigenvectors of $C = \frac{1}{n}X_c^T X_c$
Eigenvalue Relation:
$$\lambda_i = \frac{\sigma_i^2}{n}$$
Projected Data:
$$Z = X_c V = U\Sigma$$

SVD avoids forming $X^TX$ explicitly, which makes it more numerically stable and efficient; sklearn uses it internally.
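The correspondence is easy to verify numerically on random centered data:

```python
# Numerical check of the SVD-PCA correspondence on random centered data.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
Xc = X - X.mean(axis=0)
n = Xc.shape[0]

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # Xc = U Sigma V^T

lam = s**2 / n                                     # lambda_i = sigma_i^2 / n
C = Xc.T @ Xc / n
eigvals = np.sort(np.linalg.eigvalsh(C))[::-1]     # eigenvalues of C, descending

print(np.allclose(lam, eigvals))                   # True
print(np.allclose(Xc @ Vt.T, U * s))               # True: Z = Xc V = U Sigma
```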

Kernel PCA

Kernel Matrix:
$$K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$$
RBF Kernel:
$$K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma\|\mathbf{x} - \mathbf{x}'\|^2)$$
Polynomial Kernel:
$$K(\mathbf{x}, \mathbf{x}') = (\mathbf{x} \cdot \mathbf{x}' + r)^d$$

Kernel PCA handles nonlinear relationships by implicitly mapping to higher-dimensional spaces. Use when linear PCA fails to capture structure (e.g., concentric circles).
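A hedged sketch of the concentric-circles case; `gamma=10` and the dataset parameters are illustrative choices, not tuned values:

```python
# RBF kernel PCA on concentric circles, where linear PCA (just a rotation
# of 2-D data) cannot separate the rings. gamma=10 is illustrative.
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

Z_lin = PCA(n_components=2).fit_transform(X)
Z_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)
# In Z_rbf the two rings become (near-)linearly separable; in Z_lin they do not.
```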

Assumptions & Limitations

Assumptions:

  • Principal components are linear combinations of original features
  • Directions of maximum variance are the most important
  • Components are mutually orthogonal (uncorrelated)
  • Data should be centered (mean-subtracted) before applying PCA

Limitations:

  • Cannot capture nonlinear relationships (use kernel PCA, t-SNE, UMAP)
  • Highly sensitive to feature scaling -- always standardize first
  • Components lose interpretability (mixtures of all original features)
  • Variance may not equal importance for classification tasks (consider LDA)
  • Sensitive to outliers that can distort the covariance matrix

Common Pitfalls

  • Forgetting to scale/standardize features before PCA (most common mistake)
  • Treating principal components as if they have interpretable meaning
  • Fitting PCA on the entire dataset before train/test split (data leakage)
  • Using PCA when features are already independent and uncorrelated
  • Keeping too few components and losing critical information
  • Using PCA for classification without checking if it preserves class separation
  • Applying standard PCA to nonlinear data structures (use kernel PCA)
  • Forgetting to apply the same PCA transformation to test data
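The scaling, leakage, and test-transform pitfalls are all avoided by one pattern: keep the scaler and PCA inside a Pipeline so they are refit on training data only. A sketch (dataset and model choices are illustrative):

```python
# Leakage-safe pattern: scaling and PCA live inside the Pipeline, so they
# are fit per training fold, never on held-out data.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=0.95),
                     LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)  # scaler+PCA refit per fold
```

Calling `pipe.fit(X_train)` then `pipe.predict(X_test)` likewise guarantees the test data is transformed with the training-set mean, scale, and components.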

Interview Quick-Fire

Q: What is PCA?

A: PCA is an unsupervised dimensionality reduction technique that projects data onto orthogonal directions (principal components) that capture maximum variance, using the eigenvectors of the covariance matrix.

Q: What does PCA maximize?

A: PCA maximizes the variance of the projected data. Equivalently, it minimizes the reconstruction error (the information lost when projecting onto fewer dimensions).

Q: Why is scaling important for PCA?

A: PCA is based on variance. Without scaling, features with larger numerical ranges dominate the principal components, regardless of their actual importance. Standardization ensures equal contribution from all features.

Q: How do you choose the number of components?

A: Use the scree plot (look for the elbow), cumulative variance threshold (typically 95%), or Kaiser criterion (eigenvalue > 1 with correlation matrix). In sklearn, use PCA(n_components=0.95).

Q: PCA vs LDA -- what's the difference?

A: PCA is unsupervised and maximizes total variance. LDA is supervised and maximizes class separability. PCA ignores class labels; LDA uses them. Use PCA for general reduction; LDA when classification performance matters.

Q: What is the relationship between PCA and SVD?

A: SVD decomposes $X = U\Sigma V^T$. The columns of $V$ are the principal directions, and $\lambda_i = \sigma_i^2/n$. SVD computes PCA without forming $X^TX$, which is more stable and efficient.

Q: When should you use kernel PCA?

A: When the data has nonlinear structure that standard PCA cannot capture (e.g., concentric circles, Swiss roll). Kernel PCA maps data to a higher-dimensional space via a kernel function before extracting principal components.

Q: Can PCA help with multicollinearity?

A: Yes. Principal components are orthogonal and uncorrelated by construction. Using PCA-transformed features in regression eliminates multicollinearity problems. However, you lose interpretability of the original features.
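This is quick to confirm on made-up, nearly collinear features: the covariance matrix of the PCA scores comes out diagonal.

```python
# The covariance of PCA scores is diagonal: transformed features are
# uncorrelated even when the original features are nearly collinear.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
base = rng.normal(size=(300, 2))
X = np.column_stack([base[:, 0],
                     0.99 * base[:, 0] + 0.01 * base[:, 1],  # near-duplicate
                     base[:, 1]])

Z = PCA().fit_transform(X)
cov = np.cov(Z, rowvar=False)
print(np.allclose(cov, np.diag(np.diag(cov)), atol=1e-10))  # True
```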
