Principal Component Analysis (PCA)
From covariance matrices to eigendecomposition. A comprehensive, visually interactive deep dive into the most widely used dimensionality reduction technique in machine learning.
How the quest to understand variation in multivariate data gave rise to one of the most fundamental techniques in data science.
The story of Principal Component Analysis begins in 1901 with Karl Pearson, the English mathematician and biostatistician. Pearson was interested in finding the "line of best fit" to a system of points in space -- not in the least-squares regression sense (minimizing vertical distances), but in the sense of minimizing the orthogonal distances from points to the line.
Pearson framed the problem as finding the direction through a cloud of data points that captures the most variation. This geometric insight -- that the most informative direction is the one with maximum spread -- became the foundation of PCA. He showed that the optimal lines and planes of fit could be determined from the principal axes of the data's ellipsoid of concentration.
In 1933, Harold Hotelling independently developed PCA as a formal statistical method, casting it in terms of eigenvalue decomposition of the covariance matrix. Hotelling was motivated by the practical problem of reducing a large number of correlated variables to a smaller set of uncorrelated components that preserve most of the information.
Hotelling introduced the term "principal components" and provided the mathematical framework that is still used today. His key insight was that the principal components are the eigenvectors of the covariance matrix, ordered by their corresponding eigenvalues. This elegant connection between linear algebra and statistics transformed PCA from a geometric curiosity into a powerful analytical tool.
This relationship is captured by the eigenvalue equation \[ C\mathbf{v} = \lambda \mathbf{v} \] where \( C \) is the covariance matrix, \( \mathbf{v} \) is an eigenvector (principal direction), and \( \lambda \) is the corresponding eigenvalue (variance explained along that direction).
From a theoretical tool in statistics to a cornerstone of modern machine learning, PCA has had a remarkable journey.
Why reducing dimensions matters, and how variance equals information.
As the number of features in a dataset increases, several problems emerge. Data becomes increasingly sparse in high-dimensional spaces -- the volume of the space increases exponentially, but the amount of data remains fixed. This means that algorithms need exponentially more data to maintain the same level of statistical significance.
High-dimensional data also suffers from increased computational cost, overfitting risk, and the breakdown of distance-based methods (all pairwise distances converge to similar values). PCA addresses these issues by projecting data onto a lower-dimensional subspace while preserving as much of the original information as possible.
The central philosophy of PCA is that the directions with the most variance carry the most information. If a feature varies widely across observations, it is helping to distinguish between data points. If a feature is nearly constant, it carries almost no discriminative information and can be safely discarded.
PCA finds new axes (principal components) that are linear combinations of the original features, ordered by the amount of variance they capture. The first principal component captures the maximum possible variance, the second captures the maximum remaining variance while being orthogonal to the first, and so on.
Each principal component captures the maximum possible variance from the remaining variation, ensuring the most informative projection
All principal components are mutually orthogonal (perpendicular), ensuring zero correlation between them -- no redundant information
By keeping only the top k components that capture most of the variance, we can dramatically reduce the number of dimensions while losing minimal information
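These three properties can be checked directly in code. The following is a quick sketch on synthetic data (the data itself is made up for illustration), confirming that scikit-learn's components come out variance-ordered and mutually orthogonal:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with correlated features
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

pca = PCA().fit(X)

# Components are ordered by explained variance, descending
assert np.all(np.diff(pca.explained_variance_) <= 1e-9)

# Components are mutually orthogonal: V V^T = I
V = pca.components_
assert np.allclose(V @ V.T, np.eye(5), atol=1e-8)

print(pca.explained_variance_ratio_)  # descending, sums to 1
```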
See how PCA projects 2D data onto the direction of maximum variance. Drag to move data points and observe how the principal component axis adjusts.
The red line shows the first principal component -- the direction of maximum variance. Blue dots are original data; orange dots are the projected points on the principal axis.
The elegant optimization problem behind finding directions of maximum variance.
Given a centered data matrix \( X \in \mathbb{R}^{n \times d} \) (each column has zero mean), the covariance matrix captures how features vary together: \[ C = \frac{1}{n} X^T X \]
The covariance matrix \( C \in \mathbb{R}^{d \times d} \) is symmetric and positive semi-definite. The diagonal entries \( C_{ii} \) represent the variance of each feature, while off-diagonal entries \( C_{ij} \) represent the covariance (linear relationship) between features \( i \) and \( j \).
A large positive covariance between two features means they tend to increase together. A large negative covariance means they tend to move in opposite directions. Zero covariance means no linear relationship.
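A small NumPy illustration of these three cases, using synthetic data constructed to show each behavior:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=500)
pos = 2 * x + rng.normal(scale=0.1, size=500)    # tends to increase with x
neg = -x + rng.normal(scale=0.1, size=500)       # moves opposite to x
unrelated = rng.normal(size=500)                 # no linear relationship

C = np.cov(np.stack([x, pos, neg, unrelated]))   # 4 x 4 covariance matrix
print(np.round(C, 2))
# C[0, 1] is large and positive, C[0, 2] is negative, C[0, 3] is near zero
```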
PCA seeks the direction \( \mathbf{w} \) that maximizes the variance of the projected data. The variance of data projected onto a unit vector \( \mathbf{w} \) is: \[ \text{Var}(X\mathbf{w}) = \frac{1}{n} \|X\mathbf{w}\|^2 = \mathbf{w}^T C \mathbf{w} \]
We want to maximize this subject to the constraint that \( \mathbf{w} \) is a unit vector: \[ \max_{\mathbf{w}} \; \mathbf{w}^T C \mathbf{w} \quad \text{subject to} \quad \mathbf{w}^T \mathbf{w} = 1 \]
Using a Lagrange multiplier \( \lambda \), we maximize \( \mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^T C \mathbf{w} - \lambda (\mathbf{w}^T \mathbf{w} - 1) \); setting the gradient with respect to \( \mathbf{w} \) to zero gives the optimality condition: \[ C\mathbf{w} = \lambda \mathbf{w} \]
This is precisely the eigenvalue equation! The optimal projection direction \( \mathbf{w} \) is an eigenvector of the covariance matrix \( C \), and the corresponding eigenvalue \( \lambda \) equals the variance of the data projected along that direction.
The beauty of PCA lies in this connection: finding directions of maximum variance in data is equivalent to finding the eigenvectors of the covariance matrix. The eigenvector with the largest eigenvalue points in the direction of greatest variance. The second-largest eigenvalue gives the direction of second-greatest variance (orthogonal to the first), and so on. This is not an approximation -- it is the exact, mathematically optimal solution.
The second principal component maximizes variance subject to being orthogonal to the first. By induction, the \( k \)-th component maximizes variance subject to being orthogonal to all previous components. The result is that the first \( k \) principal components are the \( k \) eigenvectors of \( C \) corresponding to the \( k \) largest eigenvalues.
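This optimality can be sanity-checked numerically: for a randomly generated covariance-like matrix (synthetic, for illustration), no random unit vector achieves more projected variance than the top eigenvector, and the maximum equals the largest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
C = A @ A.T  # symmetric positive semi-definite, like a covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
w_best = eigvecs[:, -1]                # eigenvector of the largest eigenvalue

# No random unit vector beats the top eigenvector
for _ in range(1000):
    w = rng.normal(size=6)
    w /= np.linalg.norm(w)
    assert w @ C @ w <= w_best @ C @ w_best + 1e-9

# The maximum projected variance equals the largest eigenvalue
assert np.isclose(w_best @ C @ w_best, eigvals[-1])
```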
Collecting all directions at once gives the eigendecomposition of the covariance matrix: \[ C = V \Lambda V^T \] Where \( V = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_d] \) is the matrix of eigenvectors (principal directions) and \( \Lambda = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d) \) with \( \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0 \) are the eigenvalues (variance explained by each direction).
The total variance in the data equals the sum of all eigenvalues: \( \text{Total Variance} = \sum_{i=1}^{d} \lambda_i = \text{tr}(C) \). The fraction of variance explained by the first \( k \) components is: \[ \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \]
The chart below illustrates how the variance is distributed across principal components for a typical dataset. The first few components capture most of the variance, enabling significant dimensionality reduction with minimal information loss.
A clear, systematic walkthrough of the PCA procedure from raw data to reduced representation.
Subtract the mean of each feature so that the data is centered at the origin: \( X_{\text{centered}} = X - \bar{X} \). If features are on different scales, standardize to zero mean and unit variance. This is critical because PCA is sensitive to scale -- a feature measured in kilometers will dominate one measured in centimeters simply due to its larger numerical range.
Calculate the \( d \times d \) covariance matrix \( C = \frac{1}{n} X_{\text{centered}}^T X_{\text{centered}} \). This matrix encodes all pairwise linear relationships between features. For a dataset with 100 features, the covariance matrix will be \( 100 \times 100 \).
Solve the eigenvalue equation \( C\mathbf{v} = \lambda \mathbf{v} \) to find all \( d \) eigenvectors and their corresponding eigenvalues. The eigenvectors are the principal directions; the eigenvalues quantify the variance along each direction.
Sort eigenvalues in descending order: \( \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \). Select the top \( k \) eigenvectors corresponding to the \( k \) largest eigenvalues. The choice of \( k \) depends on the desired explained variance threshold (typically 95%).
Form the projection matrix \( W = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k] \in \mathbb{R}^{d \times k} \) and transform the data: \( Z = X_{\text{centered}} \cdot W \). The result \( Z \in \mathbb{R}^{n \times k} \) is the reduced-dimensional representation of the data.
The complete PCA transformation can be written compactly as: \[ Z = (X - \bar{X}) W_k \] Where \( \bar{X} \) is the mean vector, \( W_k \) is the matrix of the top \( k \) eigenvectors, and \( Z \) is the projected data in the reduced space. To reconstruct the data back to the original space (with some information loss): \[ \hat{X} = Z W_k^T + \bar{X} \]
The reconstruction error (information lost) is determined by the eigenvalues of the discarded components. The smaller these eigenvalues, the less information is lost during dimensionality reduction.
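The five steps, and the claim about reconstruction error, can be sketched from scratch in NumPy (the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))  # toy data, n=300, d=4

# Step 1: center the data
X_centered = X - X.mean(axis=0)

# Step 2: covariance matrix
C = X_centered.T @ X_centered / len(X)

# Step 3: eigendecomposition (eigh is for symmetric matrices, ascending order)
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # Step 4: sort descending

# Step 5: project onto the top k = 2 components
k = 2
W = eigvecs[:, :k]
Z = X_centered @ W

# Reconstruction; the error is governed by the discarded eigenvalues
X_hat = Z @ W.T + X.mean(axis=0)
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(np.isclose(mse, eigvals[k:].sum()))  # True: error = sum of dropped eigenvalues
```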
If you skip centering, the first principal component will point toward the mean of the data rather than the direction of maximum variance. If features are on vastly different scales (e.g., age in years vs. salary in dollars), the feature with the largest numerical range will dominate the principal components. Always standardize when features have different units or scales.
The geometric and algebraic meaning behind the building blocks of PCA.
An eigenvector of a matrix \( A \) is a non-zero vector \( \mathbf{v} \) that, when multiplied by \( A \), only gets scaled -- it does not change direction: \[ A\mathbf{v} = \lambda \mathbf{v} \]
The scalar \( \lambda \) is called the eigenvalue. Geometrically, eigenvectors are the directions along which a linear transformation acts by simply stretching (or compressing). In the context of PCA, the matrix is the covariance matrix \( C \), its eigenvectors are the principal directions of the data, and its eigenvalues are the variances along those directions.
Think of your data as a cloud of points in high-dimensional space. The covariance matrix describes the shape of this cloud. The eigenvectors of the covariance matrix are the natural axes of this cloud -- the directions along which it stretches the most. The eigenvalues tell you how much it stretches in each direction.
If you project the data onto these eigenvector axes, you get uncorrelated components ordered by importance. The first few axes capture the dominant structure; the remaining axes capture noise or irrelevant variation.
The covariance matrix \( C \) is real and symmetric, so the spectral theorem guarantees that it has \( d \) real eigenvalues and \( d \) orthonormal eigenvectors. The eigendecomposition \( C = V\Lambda V^T \) always exists and is unique up to sign and ordering (when the eigenvalues are distinct). This is why PCA always produces a valid, well-defined result -- there is no convergence issue or local minimum problem.
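These guarantees are easy to verify numerically; a quick NumPy check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3))
X -= X.mean(axis=0)
C = X.T @ X / len(X)  # real, symmetric covariance matrix

eigvals, V = np.linalg.eigh(C)

# All eigenvalues are real and non-negative (positive semi-definite)
assert np.all(eigvals >= -1e-12)

# Eigenvectors are orthonormal: V^T V = I
assert np.allclose(V.T @ V, np.eye(3))

# The decomposition reconstructs C exactly: C = V diag(lambda) V^T
assert np.allclose(V @ np.diag(eigvals) @ V.T, C)
```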
Rotate the data cloud and see how the principal component directions (eigenvectors) align with the spread of the data.
The red and green arrows show the two principal component directions. Adjust the rotation to see how the eigenvectors track the data's spread.
How to decide how many dimensions to keep -- balancing compression with information retention.
A scree plot displays eigenvalues (or explained variance ratios) in descending order. The name comes from the geological term for loose rocks at the base of a cliff. In a scree plot, you look for the "elbow" -- the point where the eigenvalues transition from steeply declining to leveling off.
Components before the elbow capture meaningful structure in the data. Components after the elbow capture mostly noise. The elbow indicates the optimal number of components to retain.
The most popular approach: keep the minimum number of components needed to explain at least 95% (or 90%, or 99%) of the total variance, so that at most 5% of the variance is discarded. Compute: \( k = \min\left\{ k : \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \geq 0.95 \right\} \).
Keep only components with eigenvalues greater than 1 (when using the correlation matrix). The reasoning is that a component should explain at least as much variance as a single standardized variable. This rule works well for moderately sized datasets but can over- or under-retain components in some cases.
Visually inspect the scree plot for an "elbow" where eigenvalues transition from steep decline to gradual leveling. Keep all components before the elbow. This is subjective but often aligns well with the other criteria.
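The first two criteria are mechanical enough to compute directly. A short sketch, using a hypothetical set of eigenvalues (made up for illustration) from a correlation matrix:

```python
import numpy as np

# Hypothetical eigenvalues of an 8-feature correlation matrix, sorted descending
eigvals = np.array([4.1, 2.3, 1.4, 0.9, 0.6, 0.4, 0.2, 0.1])

# Cumulative explained variance rule: smallest k reaching 95%
ratio = np.cumsum(eigvals) / eigvals.sum()
k_95 = int(np.argmax(ratio >= 0.95)) + 1

# Kaiser criterion: keep components with eigenvalue > 1
k_kaiser = int(np.sum(eigvals > 1))

print(k_95, k_kaiser)  # 6 3 -- the two rules can disagree substantially
```

As the output shows, different criteria can suggest very different values of \( k \); the scree-plot elbow is often used to adjudicate.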
Explore how eigenvalues distribute across components. The left axis shows individual eigenvalues (bar chart), while the right axis shows cumulative explained variance (line). The dashed line marks the 95% threshold.
Bar chart shows individual component variance. The orange line shows cumulative variance. The dashed line marks the 95% threshold for component selection.
The computationally superior approach to PCA using Singular Value Decomposition.
Any matrix \( X \in \mathbb{R}^{n \times d} \) can be decomposed as: \[ X = U \Sigma V^T \]
Where \( U \in \mathbb{R}^{n \times n} \) contains the left singular vectors, \( \Sigma \in \mathbb{R}^{n \times d} \) is diagonal with the singular values \( \sigma_1 \geq \sigma_2 \geq \cdots \geq 0 \), and \( V \in \mathbb{R}^{d \times d} \) contains the right singular vectors.
The connection between SVD of the data matrix and eigendecomposition of the covariance matrix is elegant. If \( X \) is centered, then: \[ C = \frac{1}{n} X^T X = \frac{1}{n} V \Sigma^T \Sigma V^T = V \left( \frac{\Sigma^T \Sigma}{n} \right) V^T \]
This means the right singular vectors \( V \) are exactly the eigenvectors of \( C \) (the principal directions), and the eigenvalues are \( \lambda_i = \sigma_i^2 / n \).
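The equivalence of the two routes is easy to demonstrate on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
X -= X.mean(axis=0)  # the SVD route still requires centered data
n = len(X)

# Route 1: eigendecomposition of the covariance matrix
C = X.T @ X / n
eigvals = np.linalg.eigvalsh(C)[::-1]  # descending

# Route 2: SVD of the data matrix itself (no covariance matrix formed)
sigma = np.linalg.svd(X, compute_uv=False)  # singular values, descending

# Same variances: lambda_i = sigma_i^2 / n
assert np.allclose(eigvals, sigma**2 / n)
```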
Computing \( X^T X \) squares the condition number of the data matrix, amplifying numerical errors. SVD works directly on \( X \), avoiding this issue
When \( n \gg d \), computing the \( d \times d \) covariance matrix is wasteful. Truncated SVD can find only the top \( k \) components without computing the full decomposition
Randomized SVD algorithms (used by sklearn) can handle very large, sparse matrices that would be impossible to process with explicit covariance computation
When you call sklearn.decomposition.PCA, it does not compute the covariance matrix explicitly. Instead, it uses a truncated SVD of the centered data matrix, which is both faster and more numerically stable. The TruncatedSVD class goes further by working with sparse matrices and not requiring centering (useful for text data in TF-IDF form).
Extending PCA to capture nonlinear structure using the kernel trick.
Standard PCA finds linear projections of the data. If the underlying structure of the data is nonlinear -- for example, data distributed on a curved manifold like a Swiss roll or concentric circles -- linear PCA will fail to capture this structure. The principal components will cut across the manifold rather than following it.
Kernel PCA addresses this by first implicitly mapping the data into a high-dimensional feature space using a kernel function, and then performing PCA in that space. Just as kernel SVM finds nonlinear decision boundaries, kernel PCA finds nonlinear principal components.
Instead of computing the covariance matrix in the original space, kernel PCA works with the kernel matrix (Gram matrix): \[ K_{ij} = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = K(\mathbf{x}_i, \mathbf{x}_j) \]
Where \( \phi \) is the (implicit) mapping to the feature space and \( K \) is the kernel function. Because the data cannot be centered explicitly in feature space, the kernel matrix is centered as \( \tilde{K} = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n \), where \( \mathbf{1}_n \) is the \( n \times n \) matrix with every entry \( 1/n \). The eigenvalue problem becomes: \[ \tilde{K} \mathbf{a} = \lambda \mathbf{a} \]
Where \( \mathbf{a} \) are the eigenvectors of the centered kernel matrix. The projection of a new point \( \mathbf{x} \) onto the \( k \)-th principal component in feature space is: \[ z_k(\mathbf{x}) = \sum_{i=1}^{n} a_i^{(k)} K(\mathbf{x}_i, \mathbf{x}) \] (with the kernel evaluations centered consistently with the training data).
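These equations can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic circles, assuming an RBF kernel; it mirrors what scikit-learn's KernelPCA computes for the training set, but omits new-point projection:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

# Toy nonlinear data: two concentric circles
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
n = len(X)

# Kernel (Gram) matrix with an RBF kernel
K = rbf_kernel(X, gamma=15)

# Center the kernel matrix in feature space
one_n = np.ones((n, n)) / n
K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Eigendecomposition of the centered kernel matrix, sorted descending
eigvals, eigvecs = np.linalg.eigh(K_centered)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Normalize eigenvectors and project the training points onto the top 2
# nonlinear components: z = K_centered @ (a / sqrt(lambda))
alphas = eigvecs[:, :2] / np.sqrt(eigvals[:2])
Z = K_centered @ alphas
print(Z.shape)  # (200, 2)
```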
\( K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2) \). The most popular kernel for nonlinear PCA. It maps data to an infinite-dimensional space and can capture complex curved structures. The parameter \( \gamma \) controls the kernel width -- larger values create more localized, detailed mappings.
\( K(\mathbf{x}, \mathbf{x}') = (\mathbf{x} \cdot \mathbf{x}' + r)^d \). Creates polynomial feature interactions. Degree \( d \) controls the complexity of the mapping. Useful when the nonlinearity has a polynomial nature.
Unlike standard PCA, kernel PCA does not have a straightforward inverse transform (pre-image problem). The kernel matrix \( K \) is \( n \times n \), making kernel PCA computationally expensive for large datasets -- \( O(n^3) \) for eigendecomposition. The kernel and its parameters (e.g., \( \gamma \) for RBF) must be chosen carefully, often via cross-validation on a downstream task.
Understanding when PCA works well and when it may lead you astray.
PCA assumes that the principal components are linear combinations of the original features. If the underlying structure is nonlinear (e.g., data lies on a curved manifold), standard PCA will fail to capture it. Kernel PCA or nonlinear methods like t-SNE and UMAP should be used instead.
PCA equates high variance with high importance. This is not always true -- in some problems, the most informative features may have low variance, while high-variance features may be noise. For example, in classification tasks, the direction that best separates classes may not align with the direction of maximum variance. LDA (Linear Discriminant Analysis) is a supervised alternative that accounts for class labels.
PCA is highly sensitive to the scale of features. Features with larger numerical ranges will dominate the principal components, regardless of their actual importance. This is why standardization (zero mean, unit variance) is almost always required before applying PCA.
PCA forces all principal components to be mutually orthogonal. While this ensures uncorrelated components, the true underlying factors in the data may not be orthogonal. Methods like Independent Component Analysis (ICA) or Factor Analysis relax this constraint.
The chart below compares the reconstruction quality of different dimensionality reduction approaches across varying numbers of retained components.
Real-world domains where PCA delivers powerful dimensionality reduction and feature extraction.
Images can be represented by a small number of principal components that capture the dominant patterns. Eigenfaces for face recognition use PCA to reduce face images from thousands of pixels to tens of components.
By keeping only the top principal components and discarding the rest, PCA filters out noise (which typically lives in the low-variance components). Reconstruction from the top components gives a denoised version of the data.
PCA reduces high-dimensional data to 2 or 3 dimensions for plotting, revealing clusters, trends, and outliers that are invisible in the original space. It is the standard first step in exploratory data analysis.
PCA creates new features (principal components) that are uncorrelated and ordered by importance. These can serve as inputs to machine learning models, often improving performance by removing multicollinearity and reducing overfitting.
One of the most famous applications of PCA. Each face image is projected onto a low-dimensional "face space" defined by principal components. Recognition compares the projections of test faces to stored projections of known faces.
Points with high reconstruction error (large distance between original and PCA-reconstructed versions) are likely anomalies. PCA-based anomaly detection is used in manufacturing quality control, network intrusion detection, and financial fraud detection.
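A minimal sketch of this idea, on synthetic data constructed so that normal points lie near a low-dimensional subspace and one point does not:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Normal data lives near a 2D plane inside 10D space
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(500, 10))

# One anomaly that ignores that low-dimensional structure
anomaly = 3 * rng.normal(size=(1, 10))
X_all = np.vstack([X, anomaly])

pca = PCA(n_components=2).fit(X)  # fit on normal data only
X_hat = pca.inverse_transform(pca.transform(X_all))
errors = np.sum((X_all - X_hat) ** 2, axis=1)

# The anomaly (index 500) should have by far the largest reconstruction error
print(int(np.argmax(errors)))
```

In practice a threshold on the reconstruction error (e.g., a high percentile of errors on known-normal data) flags anomalies.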
Adjust the number of principal components to see how image reconstruction quality changes. More components preserve more detail, but fewer components achieve greater compression.
Slide to change the number of retained principal components. Watch how the reconstruction quality improves as more components are added.
PCA can reduce hundreds or thousands of features to a handful of components while retaining 95%+ of the variance, dramatically speeding up downstream algorithms.
Principal components are orthogonal and uncorrelated by construction, eliminating multicollinearity problems that plague many regression and classification methods.
By discarding low-variance components, PCA naturally filters out noise, often improving the signal-to-noise ratio and the performance of downstream models.
Standard PCA has no hyperparameters to tune (aside from the number of components). The solution is deterministic and guaranteed to be optimal in the variance-maximizing sense.
Using SVD, PCA can be computed in O(min(n,d)^2 * max(n,d)) time. Randomized SVD makes it feasible even for very large, sparse datasets.
PCA can only capture linear relationships between features. Nonlinear structure requires kernel PCA, t-SNE, UMAP, or autoencoders.
Principal components are linear combinations of all original features, making them difficult to interpret. You lose the ability to say "this feature is important."
Without standardization, features with larger numerical ranges dominate the principal components regardless of their actual importance.
In classification tasks, the most informative directions for separating classes may not align with the directions of maximum variance. LDA may be more appropriate.
From basic PCA to kernel PCA with scikit-learn and visualization.
The simplest way to apply PCA using scikit-learn. We use the Iris dataset and reduce from 4 features to 2 components.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load data
X, y = load_iris(return_X_y=True)
# IMPORTANT: Always standardize before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA - reduce from 4 to 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Check explained variance
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {sum(pca.explained_variance_ratio_):.4f}")
print(f"Components shape: {pca.components_.shape}")
# Visualize
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA of Iris Dataset')
plt.colorbar(label='Species')
plt.show()
Use the explained variance ratio and scree plot to determine the optimal number of components.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
import numpy as np
# Load handwritten digits (64 features)
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
# Fit PCA with all components to analyze variance
pca_full = PCA()
pca_full.fit(X_scaled)
# Cumulative explained variance
cumulative_var = np.cumsum(pca_full.explained_variance_ratio_)
# Find number of components for 95% variance
n_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"Components for 95% variance: {n_95} out of {X.shape[1]}")
# Apply PCA with optimal components
pca = PCA(n_components=0.95) # auto-select for 95% variance
X_reduced = pca.fit_transform(X_scaled)
print(f"Reduced shape: {X_reduced.shape}")
Reconstruct images with different numbers of components to see the effect of compression.
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_olivetti_faces
import matplotlib.pyplot as plt
import numpy as np
# Load face images (400 images, 4096 pixels each)
faces = fetch_olivetti_faces()
X = faces.data
# Reconstruct with different numbers of components
n_components_list = [10, 50, 100, 200]
fig, axes = plt.subplots(1, len(n_components_list) + 1,
                         figsize=(15, 3))
# Original face
axes[0].imshow(X[0].reshape(64, 64), cmap='gray')
axes[0].set_title('Original')
for i, n in enumerate(n_components_list):
    pca = PCA(n_components=n)
    X_pca = pca.fit_transform(X)
    X_reconstructed = pca.inverse_transform(X_pca)
    axes[i+1].imshow(X_reconstructed[0].reshape(64, 64),
                     cmap='gray')
    var_explained = sum(pca.explained_variance_ratio_) * 100
    axes[i+1].set_title(f'{n} comps\n({var_explained:.1f}%)')
plt.suptitle('Face Reconstruction with PCA')
plt.tight_layout()
plt.show()
Using kernel PCA to handle nonlinear data structures like concentric circles.
from sklearn.decomposition import KernelPCA, PCA
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Generate nonlinear data (concentric circles)
X, y = make_circles(n_samples=500, factor=0.3,
                    noise=0.05, random_state=42)
# Standard PCA (fails for nonlinear data)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Kernel PCA with RBF kernel
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)
# Compare results
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm')
axes[0].set_title('Original Data')
axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='coolwarm')
axes[1].set_title('Standard PCA')
axes[2].scatter(X_kpca[:, 0], X_kpca[:, 1], c=y, cmap='coolwarm')
axes[2].set_title('Kernel PCA (RBF)')
plt.tight_layout()
plt.show()
Combining PCA with a classifier in a scikit-learn pipeline for clean, reproducible workflows.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits
# Load data
X, y = load_digits(return_X_y=True)
# Build pipeline: scale -> PCA -> classify
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('svm', SVC(kernel='rbf'))
])
# Tune PCA components and SVM parameters together
param_grid = {
    'pca__n_components': [10, 20, 30, 40],
    'svm__C': [1, 10, 100],
    'svm__gamma': ['scale', 0.01, 0.001],
}
grid_search = GridSearchCV(
    pipeline, param_grid, cv=5,
    scoring='accuracy', n_jobs=-1
)
grid_search.fit(X, y)
print(f"Best params: {grid_search.best_params_}")
print(f"Best accuracy: {grid_search.best_score_:.4f}")
Always put PCA inside a pipeline with the scaler. Fitting the scaler and PCA on the full dataset before cross-validation causes data leakage. Inside a pipeline, the scaler and PCA are fit only on the training fold, ensuring valid evaluation. Use n_components as a float (e.g., 0.95) to automatically select enough components for a given variance threshold.