Principal Component Analysis (PCA)
From covariance matrices to eigendecomposition. A comprehensive, visually interactive deep dive into the most widely used dimensionality reduction technique in machine learning.
How the quest to understand variation in multivariate data gave rise to one of the most fundamental techniques in data science.
The story of Principal Component Analysis begins in 1901 with Karl Pearson, the English mathematician and biostatistician. Pearson was interested in finding the "line of best fit" to a system of points in space -- not in the least-squares regression sense (minimizing vertical distances), but in the sense of minimizing the orthogonal distances from points to the line.
Pearson framed the problem as finding the direction through a cloud of data points that captures the most variation. This geometric insight -- that the most informative direction is the one with maximum spread -- became the foundation of PCA. He showed that the optimal lines and planes of fit could be determined from the principal axes of the data's ellipsoid of concentration.
In 1933, Harold Hotelling independently developed PCA as a formal statistical method, casting it in terms of eigenvalue decomposition of the covariance matrix. Hotelling was motivated by the practical problem of reducing a large number of correlated variables to a smaller set of uncorrelated components that preserve most of the information.
Hotelling introduced the term "principal components" and provided the mathematical framework that is still used today. His key insight was that the principal components are the eigenvectors of the covariance matrix, ordered by their corresponding eigenvalues. This elegant connection between linear algebra and statistics transformed PCA from a geometric curiosity into a powerful analytical tool.
This relationship is captured by the eigenvalue equation \[ C\mathbf{v} = \lambda \mathbf{v} \] where \( C \) is the covariance matrix, \( \mathbf{v} \) is an eigenvector (principal direction), and \( \lambda \) is the corresponding eigenvalue (variance explained along that direction).
From a theoretical tool in statistics to a cornerstone of modern machine learning, PCA has had a remarkable journey.
Why reducing dimensions matters, and how variance equals information.
As the number of features in a dataset increases, several problems emerge. Data becomes increasingly sparse in high-dimensional spaces -- the volume of the space increases exponentially, but the amount of data remains fixed. This means that algorithms need exponentially more data to maintain the same level of statistical significance.
High-dimensional data also suffers from increased computational cost, overfitting risk, and the breakdown of distance-based methods (all pairwise distances converge to similar values). PCA addresses these issues by projecting data onto a lower-dimensional subspace while preserving as much of the original information as possible.
The central philosophy of PCA is that the directions with the most variance carry the most information. If a feature varies widely across observations, it is helping to distinguish between data points. If a feature is nearly constant, it carries almost no discriminative information and can be safely discarded.
PCA finds new axes (principal components) that are linear combinations of the original features, ordered by the amount of variance they capture. The first principal component captures the maximum possible variance, the second captures the maximum remaining variance while being orthogonal to the first, and so on.
Each principal component captures the maximum possible variance from the remaining variation, ensuring the most informative projection
All principal components are mutually orthogonal (perpendicular), ensuring zero correlation between them -- no redundant information
By keeping only the top k components that capture most of the variance, we can dramatically reduce the number of dimensions while losing minimal information
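These three properties can be checked directly in code. The following is a quick sketch on synthetic data (the data itself is made up for illustration), confirming that scikit-learn's components come out variance-ordered and mutually orthogonal:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with correlated features
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

pca = PCA().fit(X)

# Components are ordered by explained variance, descending
assert np.all(np.diff(pca.explained_variance_) <= 1e-9)

# Components are mutually orthogonal: V V^T = I
V = pca.components_
assert np.allclose(V @ V.T, np.eye(5), atol=1e-8)

print(pca.explained_variance_ratio_)  # descending, sums to 1
```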
See how PCA projects 2D data onto the direction of maximum variance. Drag to move data points and observe how the principal component axis adjusts.
The red line shows the first principal component -- the direction of maximum variance. Blue dots are original data; orange dots are the projected points on the principal axis.
The elegant optimization problem behind finding directions of maximum variance.
Given a centered data matrix \( X \in \mathbb{R}^{n \times d} \) (each column has zero mean), the covariance matrix captures how features vary together: \[ C = \frac{1}{n} X^T X \]
The covariance matrix \( C \in \mathbb{R}^{d \times d} \) is symmetric and positive semi-definite. The diagonal entries \( C_{ii} \) represent the variance of each feature, while off-diagonal entries \( C_{ij} \) represent the covariance (linear relationship) between features \( i \) and \( j \).
A large positive covariance between two features means they tend to increase together. A large negative covariance means they tend to move in opposite directions. Zero covariance means no linear relationship.
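A small NumPy illustration of these three cases, using synthetic data constructed to show each behavior:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=500)
pos = 2 * x + rng.normal(scale=0.1, size=500)    # tends to increase with x
neg = -x + rng.normal(scale=0.1, size=500)       # moves opposite to x
unrelated = rng.normal(size=500)                 # no linear relationship

C = np.cov(np.stack([x, pos, neg, unrelated]))   # 4 x 4 covariance matrix
print(np.round(C, 2))
# C[0, 1] is large and positive, C[0, 2] is negative, C[0, 3] is near zero
```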
PCA seeks the direction \( \mathbf{w} \) that maximizes the variance of the projected data. The variance of data projected onto a unit vector \( \mathbf{w} \) is: \[ \text{Var}(X\mathbf{w}) = \frac{1}{n} \|X\mathbf{w}\|^2 = \mathbf{w}^T C \mathbf{w} \]
We want to maximize this subject to the constraint that \( \mathbf{w} \) is a unit vector: \[ \max_{\mathbf{w}} \; \mathbf{w}^T C \mathbf{w} \quad \text{subject to} \quad \mathbf{w}^T \mathbf{w} = 1 \]
Using a Lagrange multiplier \( \lambda \), we maximize \( \mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^T C \mathbf{w} - \lambda (\mathbf{w}^T \mathbf{w} - 1) \); setting the gradient with respect to \( \mathbf{w} \) to zero gives the optimality condition: \[ C\mathbf{w} = \lambda \mathbf{w} \]
This is precisely the eigenvalue equation! The optimal projection direction \( \mathbf{w} \) is an eigenvector of the covariance matrix \( C \), and the corresponding eigenvalue \( \lambda \) equals the variance of the data projected along that direction.
The beauty of PCA lies in this connection: finding directions of maximum variance in data is equivalent to finding the eigenvectors of the covariance matrix. The eigenvector with the largest eigenvalue points in the direction of greatest variance. The second-largest eigenvalue gives the direction of second-greatest variance (orthogonal to the first), and so on. This is not an approximation -- it is the exact, mathematically optimal solution.
The second principal component maximizes variance subject to being orthogonal to the first. By induction, the \( k \)-th component maximizes variance subject to being orthogonal to all previous components. The result is that the first \( k \) principal components are the \( k \) eigenvectors of \( C \) corresponding to the \( k \) largest eigenvalues.
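This optimality can be sanity-checked numerically: for a randomly generated covariance-like matrix (synthetic, for illustration), no random unit vector achieves more projected variance than the top eigenvector, and the maximum equals the largest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
C = A @ A.T  # symmetric positive semi-definite, like a covariance matrix

eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
w_best = eigvecs[:, -1]                # eigenvector of the largest eigenvalue

# No random unit vector beats the top eigenvector
for _ in range(1000):
    w = rng.normal(size=6)
    w /= np.linalg.norm(w)
    assert w @ C @ w <= w_best @ C @ w_best + 1e-9

# The maximum projected variance equals the largest eigenvalue
assert np.isclose(w_best @ C @ w_best, eigvals[-1])
```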
Collecting all directions at once gives the eigendecomposition of the covariance matrix: \[ C = V \Lambda V^T \] Where \( V = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_d] \) is the matrix of eigenvectors (principal directions) and \( \Lambda = \text{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d) \) with \( \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0 \) are the eigenvalues (variance explained by each direction).
The total variance in the data equals the sum of all eigenvalues: \( \text{Total Variance} = \sum_{i=1}^{d} \lambda_i = \text{tr}(C) \). The fraction of variance explained by the first \( k \) components is: \[ \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \]
The chart below illustrates how the variance is distributed across principal components for a typical dataset. The first few components capture most of the variance, enabling significant dimensionality reduction with minimal information loss.
A clear, systematic walkthrough of the PCA procedure from raw data to reduced representation.
Subtract the mean of each feature so that the data is centered at the origin: \( X_{\text{centered}} = X - \bar{X} \). If features are on different scales, standardize to zero mean and unit variance. This is critical because PCA is sensitive to scale -- a feature measured in kilometers will dominate one measured in centimeters simply due to its larger numerical range.
Calculate the \( d \times d \) covariance matrix \( C = \frac{1}{n} X_{\text{centered}}^T X_{\text{centered}} \). This matrix encodes all pairwise linear relationships between features. For a dataset with 100 features, the covariance matrix will be \( 100 \times 100 \).
Solve the eigenvalue equation \( C\mathbf{v} = \lambda \mathbf{v} \) to find all \( d \) eigenvectors and their corresponding eigenvalues. The eigenvectors are the principal directions; the eigenvalues quantify the variance along each direction.
Sort eigenvalues in descending order: \( \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \). Select the top \( k \) eigenvectors corresponding to the \( k \) largest eigenvalues. The choice of \( k \) depends on the desired explained variance threshold (typically 95%).
Form the projection matrix \( W = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k] \in \mathbb{R}^{d \times k} \) and transform the data: \( Z = X_{\text{centered}} \cdot W \). The result \( Z \in \mathbb{R}^{n \times k} \) is the reduced-dimensional representation of the data.
The complete PCA transformation can be written compactly as: \[ Z = (X - \bar{X}) W_k \] Where \( \bar{X} \) is the mean vector, \( W_k \) is the matrix of the top \( k \) eigenvectors, and \( Z \) is the projected data in the reduced space. To reconstruct the data back to the original space (with some information loss): \[ \hat{X} = Z W_k^T + \bar{X} \]
The reconstruction error (information lost) is determined by the eigenvalues of the discarded components. The smaller these eigenvalues, the less information is lost during dimensionality reduction.
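The five steps, and the claim about reconstruction error, can be sketched from scratch in NumPy (the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))  # toy data, n=300, d=4

# Step 1: center the data
X_centered = X - X.mean(axis=0)

# Step 2: covariance matrix
C = X_centered.T @ X_centered / len(X)

# Step 3: eigendecomposition (eigh is for symmetric matrices, ascending order)
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # Step 4: sort descending

# Step 5: project onto the top k = 2 components
k = 2
W = eigvecs[:, :k]
Z = X_centered @ W

# Reconstruction; the error is governed by the discarded eigenvalues
X_hat = Z @ W.T + X.mean(axis=0)
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(np.isclose(mse, eigvals[k:].sum()))  # True: error = sum of dropped eigenvalues
```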
If you skip centering, the first principal component will point toward the mean of the data rather than the direction of maximum variance. If features are on vastly different scales (e.g., age in years vs. salary in dollars), the feature with the largest numerical range will dominate the principal components. Always standardize when features have different units or scales.
The geometric and algebraic meaning behind the building blocks of PCA.
An eigenvector of a matrix \( A \) is a non-zero vector \( \mathbf{v} \) that, when multiplied by \( A \), only gets scaled -- it does not change direction: \[ A\mathbf{v} = \lambda \mathbf{v} \]
The scalar \( \lambda \) is called the eigenvalue. Geometrically, eigenvectors are the directions along which a linear transformation acts by simply stretching (or compressing). In the context of PCA, the matrix is the covariance matrix \( C \), its eigenvectors are the principal directions of the data, and its eigenvalues are the variances along those directions.
Think of your data as a cloud of points in high-dimensional space. The covariance matrix describes the shape of this cloud. The eigenvectors of the covariance matrix are the natural axes of this cloud -- the directions along which it stretches the most. The eigenvalues tell you how much it stretches in each direction.
If you project the data onto these eigenvector axes, you get uncorrelated components ordered by importance. The first few axes capture the dominant structure; the remaining axes capture noise or irrelevant variation.
The covariance matrix \( C \) is real and symmetric, so the spectral theorem guarantees that it has \( d \) real eigenvalues and \( d \) orthonormal eigenvectors. The eigendecomposition \( C = V\Lambda V^T \) always exists and is unique up to sign and ordering (when the eigenvalues are distinct). This is why PCA always produces a valid, well-defined result -- there is no convergence issue or local minimum problem.
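These guarantees are easy to verify numerically; a quick NumPy check on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(500, 3))
X -= X.mean(axis=0)
C = X.T @ X / len(X)  # real, symmetric covariance matrix

eigvals, V = np.linalg.eigh(C)

# All eigenvalues are real and non-negative (positive semi-definite)
assert np.all(eigvals >= -1e-12)

# Eigenvectors are orthonormal: V^T V = I
assert np.allclose(V.T @ V, np.eye(3))

# The decomposition reconstructs C exactly: C = V diag(lambda) V^T
assert np.allclose(V @ np.diag(eigvals) @ V.T, C)
```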
Rotate the data cloud and see how the principal component directions (eigenvectors) align with the spread of the data.
The red and green arrows show the two principal component directions. Adjust the rotation to see how the eigenvectors track the data's spread.
How to decide how many dimensions to keep -- balancing compression with information retention.
A scree plot displays eigenvalues (or explained variance ratios) in descending order. The name comes from the geological term for loose rocks at the base of a cliff. In a scree plot, you look for the "elbow" -- the point where the eigenvalues transition from steeply declining to leveling off.
Components before the elbow capture meaningful structure in the data. Components after the elbow capture mostly noise. The elbow indicates the optimal number of components to retain.
The most popular approach: keep the minimum number of components needed to explain at least 95% (or 90%, or 99%) of the total variance, so that at most 5% of the variance is discarded. Compute: \( k = \min\left\{ k : \frac{\sum_{i=1}^{k} \lambda_i}{\sum_{i=1}^{d} \lambda_i} \geq 0.95 \right\} \).
Keep only components with eigenvalues greater than 1 (when using the correlation matrix). The reasoning is that a component should explain at least as much variance as a single standardized variable. This rule works well for moderately sized datasets but can over- or under-retain components in some cases.
Visually inspect the scree plot for an "elbow" where eigenvalues transition from steep decline to gradual leveling. Keep all components before the elbow. This is subjective but often aligns well with the other criteria.
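The first two criteria are mechanical enough to compute directly. A short sketch, using a hypothetical set of eigenvalues (made up for illustration) from a correlation matrix:

```python
import numpy as np

# Hypothetical eigenvalues of an 8-feature correlation matrix, sorted descending
eigvals = np.array([4.1, 2.3, 1.4, 0.9, 0.6, 0.4, 0.2, 0.1])

# Cumulative explained variance rule: smallest k reaching 95%
ratio = np.cumsum(eigvals) / eigvals.sum()
k_95 = int(np.argmax(ratio >= 0.95)) + 1

# Kaiser criterion: keep components with eigenvalue > 1
k_kaiser = int(np.sum(eigvals > 1))

print(k_95, k_kaiser)  # 6 3 -- the two rules can disagree substantially
```

As the output shows, different criteria can suggest very different values of \( k \); the scree-plot elbow is often used to adjudicate.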
Explore how eigenvalues distribute across components. The left axis shows individual eigenvalues (bar chart), while the right axis shows cumulative explained variance (line). The dashed line marks the 95% threshold.
Bar chart shows individual component variance. The orange line shows cumulative variance. The dashed line marks the 95% threshold for component selection.
The computationally superior approach to PCA using Singular Value Decomposition.
Any matrix \( X \in \mathbb{R}^{n \times d} \) can be decomposed as: \[ X = U \Sigma V^T \]
Where \( U \in \mathbb{R}^{n \times n} \) contains the left singular vectors, \( \Sigma \in \mathbb{R}^{n \times d} \) is diagonal with the singular values \( \sigma_1 \geq \sigma_2 \geq \cdots \geq 0 \), and \( V \in \mathbb{R}^{d \times d} \) contains the right singular vectors.
The connection between SVD of the data matrix and eigendecomposition of the covariance matrix is elegant. If \( X \) is centered, then: \[ C = \frac{1}{n} X^T X = \frac{1}{n} V \Sigma^T \Sigma V^T = V \left( \frac{\Sigma^T \Sigma}{n} \right) V^T \]
This means the right singular vectors \( V \) are exactly the eigenvectors of \( C \) (the principal directions), and the eigenvalues are \( \lambda_i = \sigma_i^2 / n \).
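The equivalence of the two routes is easy to demonstrate on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
X -= X.mean(axis=0)  # the SVD route still requires centered data
n = len(X)

# Route 1: eigendecomposition of the covariance matrix
C = X.T @ X / n
eigvals = np.linalg.eigvalsh(C)[::-1]  # descending

# Route 2: SVD of the data matrix itself (no covariance matrix formed)
sigma = np.linalg.svd(X, compute_uv=False)  # singular values, descending

# Same variances: lambda_i = sigma_i^2 / n
assert np.allclose(eigvals, sigma**2 / n)
```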
Computing \( X^T X \) squares the condition number of the data matrix, amplifying numerical errors. SVD works directly on \( X \), avoiding this issue
When \( n \gg d \), computing the \( d \times d \) covariance matrix is wasteful. Truncated SVD can find only the top \( k \) components without computing the full decomposition
Randomized SVD algorithms (used by sklearn) can handle very large, sparse matrices that would be impossible to process with explicit covariance computation
When you call sklearn.decomposition.PCA, it does not compute the covariance matrix explicitly. Instead, it uses a truncated SVD of the centered data matrix, which is both faster and more numerically stable. The TruncatedSVD class goes further by working with sparse matrices and not requiring centering (useful for text data in TF-IDF form).
Extending PCA to capture nonlinear structure using the kernel trick.
Standard PCA finds linear projections of the data. If the underlying structure of the data is nonlinear -- for example, data distributed on a curved manifold like a Swiss roll or concentric circles -- linear PCA will fail to capture this structure. The principal components will cut across the manifold rather than following it.
Kernel PCA addresses this by first implicitly mapping the data into a high-dimensional feature space using a kernel function, and then performing PCA in that space. Just as kernel SVM finds nonlinear decision boundaries, kernel PCA finds nonlinear principal components.
Instead of computing the covariance matrix in the original space, kernel PCA works with the kernel matrix (Gram matrix): \[ K_{ij} = \langle \phi(\mathbf{x}_i), \phi(\mathbf{x}_j) \rangle = K(\mathbf{x}_i, \mathbf{x}_j) \]
Where \( \phi \) is the (implicit) mapping to the feature space and \( K \) is the kernel function. Because the data cannot be centered explicitly in feature space, the kernel matrix is centered as \( \tilde{K} = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n \), where \( \mathbf{1}_n \) is the \( n \times n \) matrix with every entry \( 1/n \). The eigenvalue problem becomes: \[ \tilde{K} \mathbf{a} = \lambda \mathbf{a} \]
Where \( \mathbf{a} \) are the eigenvectors of the centered kernel matrix. The projection of a new point \( \mathbf{x} \) onto the \( k \)-th principal component in feature space is: \[ z_k(\mathbf{x}) = \sum_{i=1}^{n} a_i^{(k)} K(\mathbf{x}_i, \mathbf{x}) \] (with the kernel evaluations centered consistently with the training data).
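These equations can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic circles, assuming an RBF kernel; it mirrors what scikit-learn's KernelPCA computes for the training set, but omits new-point projection:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

# Toy nonlinear data: two concentric circles
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
n = len(X)

# Kernel (Gram) matrix with an RBF kernel
K = rbf_kernel(X, gamma=15)

# Center the kernel matrix in feature space
one_n = np.ones((n, n)) / n
K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

# Eigendecomposition of the centered kernel matrix, sorted descending
eigvals, eigvecs = np.linalg.eigh(K_centered)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Normalize eigenvectors and project the training points onto the top 2
# nonlinear components: z = K_centered @ (a / sqrt(lambda))
alphas = eigvecs[:, :2] / np.sqrt(eigvals[:2])
Z = K_centered @ alphas
print(Z.shape)  # (200, 2)
```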
\( K(\mathbf{x}, \mathbf{x}') = \exp(-\gamma \|\mathbf{x} - \mathbf{x}'\|^2) \). The most popular kernel for nonlinear PCA. It maps data to an infinite-dimensional space and can capture complex curved structures. The parameter \( \gamma \) controls the kernel width -- larger values create more localized, detailed mappings.
\( K(\mathbf{x}, \mathbf{x}') = (\mathbf{x} \cdot \mathbf{x}' + r)^d \). Creates polynomial feature interactions. Degree \( d \) controls the complexity of the mapping. Useful when the nonlinearity has a polynomial nature.
Unlike standard PCA, kernel PCA does not have a straightforward inverse transform (pre-image problem). The kernel matrix \( K \) is \( n \times n \), making kernel PCA computationally expensive for large datasets -- \( O(n^3) \) for eigendecomposition. The kernel and its parameters (e.g., \( \gamma \) for RBF) must be chosen carefully, often via cross-validation on a downstream task.
Understanding when PCA works well and when it may lead you astray.
PCA assumes that the principal components are linear combinations of the original features. If the underlying structure is nonlinear (e.g., data lies on a curved manifold), standard PCA will fail to capture it. Kernel PCA or nonlinear methods like t-SNE and UMAP should be used instead.
PCA equates high variance with high importance. This is not always true -- in some problems, the most informative features may have low variance, while high-variance features may be noise. For example, in classification tasks, the direction that best separates classes may not align with the direction of maximum variance. LDA (Linear Discriminant Analysis) is a supervised alternative that accounts for class labels.
PCA is highly sensitive to the scale of features. Features with larger numerical ranges will dominate the principal components, regardless of their actual importance. This is why standardization (zero mean, unit variance) is almost always required before applying PCA.
PCA forces all principal components to be mutually orthogonal. While this ensures uncorrelated components, the true underlying factors in the data may not be orthogonal. Methods like Independent Component Analysis (ICA) or Factor Analysis relax this constraint.
The chart below compares the reconstruction quality of different dimensionality reduction approaches across varying numbers of retained components.
Real-world domains where PCA delivers powerful dimensionality reduction and feature extraction.
Images can be represented by a small number of principal components that capture the dominant patterns. Eigenfaces for face recognition use PCA to reduce face images from thousands of pixels to tens of components.
By keeping only the top principal components and discarding the rest, PCA filters out noise (which typically lives in the low-variance components). Reconstruction from the top components gives a denoised version of the data.
PCA reduces high-dimensional data to 2 or 3 dimensions for plotting, revealing clusters, trends, and outliers that are invisible in the original space. It is the standard first step in exploratory data analysis.
PCA creates new features (principal components) that are uncorrelated and ordered by importance. These can serve as inputs to machine learning models, often improving performance by removing multicollinearity and reducing overfitting.
One of the most famous applications of PCA. Each face image is projected onto a low-dimensional "face space" defined by principal components. Recognition compares the projections of test faces to stored projections of known faces.
Points with high reconstruction error (large distance between original and PCA-reconstructed versions) are likely anomalies. PCA-based anomaly detection is used in manufacturing quality control, network intrusion detection, and financial fraud detection.
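A minimal sketch of this idea, on synthetic data constructed so that normal points lie near a low-dimensional subspace and one point does not:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Normal data lives near a 2D plane inside 10D space
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 10)) + 0.05 * rng.normal(size=(500, 10))

# One anomaly that ignores that low-dimensional structure
anomaly = 3 * rng.normal(size=(1, 10))
X_all = np.vstack([X, anomaly])

pca = PCA(n_components=2).fit(X)  # fit on normal data only
X_hat = pca.inverse_transform(pca.transform(X_all))
errors = np.sum((X_all - X_hat) ** 2, axis=1)

# The anomaly (index 500) should have by far the largest reconstruction error
print(int(np.argmax(errors)))
```

In practice a threshold on the reconstruction error (e.g., a high percentile of errors on known-normal data) flags anomalies.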
Adjust the number of principal components to see how image reconstruction quality changes. More components preserve more detail, but fewer components achieve greater compression.
Slide to change the number of retained principal components. Watch how the reconstruction quality improves as more components are added.
PCA can reduce hundreds or thousands of features to a handful of components while retaining 95%+ of the variance, dramatically speeding up downstream algorithms.
Principal components are orthogonal and uncorrelated by construction, eliminating multicollinearity problems that plague many regression and classification methods.
By discarding low-variance components, PCA naturally filters out noise, often improving the signal-to-noise ratio and the performance of downstream models.
Standard PCA has no hyperparameters to tune (aside from the number of components). The solution is deterministic and guaranteed to be optimal in the variance-maximizing sense.
Using SVD, PCA can be computed in O(min(n,d)^2 * max(n,d)) time. Randomized SVD makes it feasible even for very large, sparse datasets.
PCA can only capture linear relationships between features. Nonlinear structure requires kernel PCA, t-SNE, UMAP, or autoencoders.
Principal components are linear combinations of all original features, making them difficult to interpret. You lose the ability to say "this feature is important."
Without standardization, features with larger numerical ranges dominate the principal components regardless of their actual importance.
In classification tasks, the most informative directions for separating classes may not align with the directions of maximum variance. LDA may be more appropriate.
From basic PCA to kernel PCA with scikit-learn and visualization.
The simplest way to apply PCA using scikit-learn. We use the Iris dataset and reduce from 4 features to 2 components.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load data
X, y = load_iris(return_X_y=True)
# IMPORTANT: Always standardize before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Apply PCA - reduce from 4 to 2 dimensions
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Check explained variance
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {sum(pca.explained_variance_ratio_):.4f}")
print(f"Components shape: {pca.components_.shape}")
# Visualize
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.title('PCA of Iris Dataset')
plt.colorbar(label='Species')
plt.show()
Use the explained variance ratio and scree plot to determine the optimal number of components.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_digits
import numpy as np
# Load handwritten digits (64 features)
X, y = load_digits(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
# Fit PCA with all components to analyze variance
pca_full = PCA()
pca_full.fit(X_scaled)
# Cumulative explained variance
cumulative_var = np.cumsum(pca_full.explained_variance_ratio_)
# Find number of components for 95% variance
n_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"Components for 95% variance: {n_95} out of {X.shape[1]}")
# Apply PCA with optimal components
pca = PCA(n_components=0.95) # auto-select for 95% variance
X_reduced = pca.fit_transform(X_scaled)
print(f"Reduced shape: {X_reduced.shape}")
Reconstruct images with different numbers of components to see the effect of compression.
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_olivetti_faces
import matplotlib.pyplot as plt
import numpy as np
# Load face images (400 images, 4096 pixels each)
faces = fetch_olivetti_faces()
X = faces.data
# Reconstruct with different numbers of components
n_components_list = [10, 50, 100, 200]
fig, axes = plt.subplots(1, len(n_components_list) + 1,
                         figsize=(15, 3))
# Original face
axes[0].imshow(X[0].reshape(64, 64), cmap='gray')
axes[0].set_title('Original')
for i, n in enumerate(n_components_list):
    pca = PCA(n_components=n)
    X_pca = pca.fit_transform(X)
    X_reconstructed = pca.inverse_transform(X_pca)
    axes[i+1].imshow(X_reconstructed[0].reshape(64, 64),
                     cmap='gray')
    var_explained = sum(pca.explained_variance_ratio_) * 100
    axes[i+1].set_title(f'{n} comps\n({var_explained:.1f}%)')
plt.suptitle('Face Reconstruction with PCA')
plt.tight_layout()
plt.show()
Using kernel PCA to handle nonlinear data structures like concentric circles.
from sklearn.decomposition import KernelPCA, PCA
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Generate nonlinear data (concentric circles)
X, y = make_circles(n_samples=500, factor=0.3,
                    noise=0.05, random_state=42)
# Standard PCA (fails for nonlinear data)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
# Kernel PCA with RBF kernel
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)
# Compare results
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
axes[0].scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm')
axes[0].set_title('Original Data')
axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='coolwarm')
axes[1].set_title('Standard PCA')
axes[2].scatter(X_kpca[:, 0], X_kpca[:, 1], c=y, cmap='coolwarm')
axes[2].set_title('Kernel PCA (RBF)')
plt.tight_layout()
plt.show()
Combining PCA with a classifier in a scikit-learn pipeline for clean, reproducible workflows.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import load_digits
# Load data
X, y = load_digits(return_X_y=True)
# Build pipeline: scale -> PCA -> classify
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('svm', SVC(kernel='rbf'))
])
# Tune PCA components and SVM parameters together
param_grid = {
    'pca__n_components': [10, 20, 30, 40],
    'svm__C': [1, 10, 100],
    'svm__gamma': ['scale', 0.01, 0.001],
}
grid_search = GridSearchCV(
    pipeline, param_grid, cv=5,
    scoring='accuracy', n_jobs=-1
)
grid_search.fit(X, y)
print(f"Best params: {grid_search.best_params_}")
print(f"Best accuracy: {grid_search.best_score_:.4f}")
Always put PCA inside a pipeline with the scaler. Fitting the scaler and PCA on the full dataset before cross-validation causes data leakage. Inside a pipeline, the scaler and PCA are fit only on the training fold, ensuring valid evaluation. Use n_components as a float (e.g., 0.95) to automatically select enough components for a given variance threshold.