K-Means Clustering
From centroid initialization to convergence. A beginner-friendly, visually interactive deep dive into the most widely used unsupervised learning algorithm.
How a signal processing technique from the 1950s became the world's most popular clustering algorithm.
The algorithm we now call K-Means was first conceived by Stuart Lloyd in 1957 at Bell Labs. Lloyd was not working on machine learning -- he was working on signal processing, specifically the problem of quantizing continuous signals into discrete levels for efficient transmission. His goal was to find the optimal set of quantization levels that would minimize the mean squared error between the original signal and its quantized version.
Lloyd's work was circulated internally at Bell Labs as a technical report but was not formally published until 1982, a full 25 years later. By then, the algorithm had already been rediscovered multiple times by different researchers working on entirely different problems.
The name "K-Means" was coined by James MacQueen in a 1967 paper titled "Some methods for classification and analysis of multivariate observations." MacQueen formalized the algorithm in a statistical context, proving convergence properties and establishing it as a method for partitioning observations into $K$ clusters where each observation belongs to the cluster with the nearest mean.
MacQueen's contribution was crucial: he transformed an engineering trick into a rigorous statistical method with formal guarantees, making it accessible to a much wider audience of scientists and analysts.
The journey of K-Means mirrors the evolution of data science itself -- from engineering signal processing to general-purpose pattern recognition:
Group similar data points together through iterative refinement of cluster centers.
K-Means is an unsupervised learning algorithm that partitions $n$ data points into $K$ clusters. Unlike supervised algorithms that learn from labeled examples, K-Means discovers structure in unlabeled data by grouping points that are "similar" to each other.
The notion of similarity is based on distance. Points that are close together in feature space are more likely to belong to the same cluster. Each cluster is represented by its centroid -- the mean of all points assigned to that cluster.
Imagine you have a table covered with mixed fruit -- apples, oranges, and bananas -- and you need to sort them into three groups without knowing their names. You might:
Start by randomly selecting three fruits as initial representatives for each group
Place each fruit next to the representative it looks most like (color, shape, size)
Find the most "average" fruit in each group and make it the new representative. Repeat until stable.
Despite being over six decades old, K-Means remains the most widely used clustering algorithm. Its enduring popularity stems from several key properties: it is simple to implement, fast, scalable to massive datasets, and easy to interpret.
In supervised learning (classification, regression), the algorithm learns from labeled data -- it knows the "right answer" during training. In unsupervised learning, there are no labels. K-Means must discover the structure of the data on its own. This makes evaluation harder, but also makes clustering applicable to exploration tasks where labels do not exist.
Initialize, assign, update, repeat -- four steps that solve clustering.
K-Means follows a simple iterative procedure that alternates between two operations: assigning points to clusters and updating cluster centers. The algorithm is sometimes called Lloyd's algorithm after its inventor.
Select $K$ initial centroid positions. This can be done randomly, using K-Means++, or with other strategies. The choice of initialization significantly affects the final result.
For every data point $x_i$, compute the distance to each centroid $\mu_k$ and assign the point to the cluster with the closest centroid:
$$c_i = \arg\min_{k \in \{1, \dots, K\}} \|x_i - \mu_k\|^2$$
For each cluster, recompute the centroid as the mean of all points currently assigned to that cluster:
$$\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
Repeat steps 2 and 3 until the centroids no longer change (or change by less than a threshold $\epsilon$), or a maximum number of iterations is reached.
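The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name and the optional `init` parameter (for supplying explicit starting centroids) are our own:

```python
import numpy as np

def lloyd_kmeans(X, K, init=None, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-Means (Lloyd's algorithm): initialize, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids (K random data points unless given explicitly)
    if init is None:
        centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    else:
        centroids = np.asarray(init, dtype=float)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its old centroid)
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Step 4: stop once the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```

For real work, prefer `sklearn.cluster.KMeans`, which adds K-Means++ seeding and multiple restarts.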
K-Means is guaranteed to converge in a finite number of iterations. Here is the intuition: the assignment step can never increase the objective, because each point moves to its nearest centroid; the update step can never increase it either, because the mean minimizes the sum of squared distances within a cluster; and since there are only finitely many ways to partition $n$ points into $K$ clusters, the objective cannot keep decreasing forever.
Formally, the objective function $J$ is monotonically non-increasing at every step:
$$J^{(t+1)} \leq J^{(t)} \quad \text{for every iteration } t$$
Convergence is guaranteed, but only to a local optimum. K-Means is an NP-hard problem in general, meaning finding the globally optimal clustering is computationally intractable. Different initializations can lead to different local optima. This is why running K-Means multiple times with different seeds and keeping the best result is standard practice.
Watch how the objective function (inertia) decreases with each iteration. The algorithm rapidly reduces the cost in the first few iterations and then converges to a stable value.
Watch the K-Means algorithm iterate. Click "Step" to advance one iteration — see centroids (★) move toward cluster centers and points reassign colors.
Each step: (1) assign points to nearest centroid, (2) move centroids to cluster means.
The objective function, centroid derivation, and why the mean is optimal.
K-Means seeks to minimize the Within-Cluster Sum of Squares (WCSS), also known as inertia. This measures the total squared distance between every data point and its assigned centroid:
$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$
Expanding the squared Euclidean distance for $d$-dimensional data:
$$\|x_i - \mu_k\|^2 = \sum_{j=1}^{d} (x_{ij} - \mu_{kj})^2$$
The objective can equivalently be written in terms of the cluster assignment indicators $r_{ik} \in \{0, 1\}$:
$$J = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik} \, \|x_i - \mu_k\|^2$$
Where $r_{ik} = 1$ if point $x_i$ is assigned to cluster $k$, and $r_{ik} = 0$ otherwise.
Why is the centroid the mean of its cluster members? We can derive this by taking the derivative of $J$ with respect to $\mu_k$ and setting it to zero:
$$\frac{\partial J}{\partial \mu_k} = -2 \sum_{i=1}^{n} r_{ik} (x_i - \mu_k) = 0 \quad\Longrightarrow\quad \mu_k = \frac{\sum_{i=1}^{n} r_{ik} \, x_i}{\sum_{i=1}^{n} r_{ik}}$$
This proves that the centroid that minimizes the sum of squared distances to all points in a cluster is the arithmetic mean of those points. This is why the algorithm is called K-Means.
The objective can also be expressed in terms of within-cluster variance. For cluster $C_k$, the variance is:
$$\sigma_k^2 = \frac{1}{|C_k|} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$
So the total inertia is the sum of cluster sizes times their variances:
$$J = \sum_{k=1}^{K} |C_k| \, \sigma_k^2$$
An equivalent formulation uses pairwise distances within each cluster:
$$J = \sum_{k=1}^{K} \frac{1}{2\,|C_k|} \sum_{x_i, x_j \in C_k} \|x_i - x_j\|^2$$
Adding more clusters will always reduce inertia (at worst, you can split one cluster into two without increasing the objective). When $K = n$, inertia is zero -- each point is its own cluster. This is why simply minimizing inertia is not enough to choose $K$; we need methods like the elbow method or silhouette analysis to balance fit against complexity.
How you start determines where you end -- choosing initial centroids wisely.
The simplest approach: randomly select $K$ data points from the dataset as the initial centroids. While easy to implement, this method has serious drawbacks: two or more initial centroids can land in the same true cluster, leaving another cluster without a representative; unlucky draws frequently lead to poor local optima; and results vary from run to run, making the output unstable.
A common mitigation is to run K-Means multiple times (e.g., 10 or 100 runs) with different random seeds and keep the result with the lowest inertia. This is the n_init parameter in scikit-learn.
Introduced by Arthur and Vassilvitskii in 2007, K-Means++ is a smarter initialization strategy that spreads the initial centroids out. The key insight: initial centroids should be far apart from each other.
Select one data point $\mu_1$ uniformly at random from the dataset $X$.
For each remaining data point $x_i$, compute its distance $D(x_i)$ to the nearest already-chosen centroid:
$$D(x_i) = \min_{1 \leq j \leq c} \|x_i - \mu_j\|$$
Choose the next centroid $\mu_{c+1}$ from the remaining points with probability proportional to $D(x_i)^2$:
$$P(\mu_{c+1} = x_i) = \frac{D(x_i)^2}{\sum_{j=1}^{n} D(x_j)^2}$$
Continue steps 2-3 until all $K$ centroids have been selected, then proceed with standard K-Means iterations.
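The seeding procedure can be sketched in NumPy. This is an illustrative sketch (the helper name is ours); in practice, scikit-learn's built-in `init='k-means++'` should be preferred:

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-Means++ seeding: spread the K initial centroids far apart."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: pick the first centroid uniformly at random
    centroids = [X[rng.integers(n)]]
    for _ in range(K - 1):
        # Step 2: D(x)^2 = squared distance to the nearest chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :],
                           axis=2) ** 2,
            axis=1,
        )
        # Step 3: sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)
```

Because an already-chosen point has $D(x) = 0$, it can never be sampled twice, and far-away points are strongly favored.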
K-Means++ provides an approximation guarantee: the expected value of the inertia is at most $O(\log K)$ times the optimal inertia. In practice, it typically produces results within a factor of 2-3 of optimal and converges in fewer iterations than random initialization. It is the default initialization method in scikit-learn.
The Forgy method is the simplest initialization: randomly choose $K$ observations from the dataset and use them as the initial centroids. This is essentially the "random" initialization but specifically selects actual data points rather than random positions in feature space.
The difference from random partition initialization is subtle but important:
Forgy tends to spread initial centroids across the data space, while Random Partition tends to place all initial centroids near the center of the data.
The most important hyperparameter and the methods to find its optimal value.
The elbow method is the most popular heuristic for choosing $K$. Run K-Means for a range of $K$ values (e.g., 1 to 10), plot the inertia for each, and look for the "elbow" -- the point where adding more clusters yields diminishing returns.
The elbow is where the rate of decrease sharply changes. Before the elbow, adding clusters significantly reduces inertia. After the elbow, additional clusters provide marginal improvement.
In practice, the inertia curve often decreases smoothly without a sharp bend, making it difficult to identify a clear elbow. When the data does not have well-separated clusters, or when clusters have very different sizes, the elbow method can be misleading. Always use it in combination with other methods.
Click on the chart to select your optimal K. The elbow point is where adding more clusters stops significantly reducing inertia.
The orange marker shows your selection. The green circle highlights the true elbow.
The silhouette score measures how similar a point is to its own cluster compared to the nearest neighboring cluster. For each data point $i$, let $a(i)$ be the mean distance from $i$ to the other points in its own cluster, and $b(i)$ the mean distance from $i$ to the points of the nearest other cluster. Then:
$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$
The silhouette score ranges from $-1$ to $+1$: values near $+1$ mean the point sits well inside its cluster, values near $0$ mean it lies on the boundary between two clusters, and negative values suggest it has likely been assigned to the wrong cluster.
The average silhouette score across all points gives an overall quality measure. Choose the $K$ that maximizes this average.
The gap statistic, proposed by Tibshirani, Walther, and Hastie (2001), compares the inertia of the clustering to what would be expected under a null reference distribution (uniform random data):
$$\operatorname{Gap}(K) = E^{*}\!\left[\log W_K\right] - \log W_K$$
Where $W_K$ is the within-cluster dispersion for $K$ clusters and $E^*$ denotes the expectation under the null reference. The optimal $K$ is the smallest value whose gap is within one standard error of the next gap:
$$K^{*} = \min \left\{ K : \operatorname{Gap}(K) \geq \operatorname{Gap}(K+1) - s_{K+1} \right\}$$
The gap statistic is more principled than the elbow method because it provides a formal null hypothesis test, but it is computationally more expensive due to the reference distribution simulations.
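Since the gap statistic is rarely shown in code, here is a sketch under stated assumptions: we use scikit-learn's `KMeans` inertia as $W_K$, draw reference datasets uniformly over the bounding box of $X$, and the function name and defaults are our own:

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, K_max=6, B=10, seed=42):
    """Gap statistic sketch: compare log-inertia on X to uniform reference data."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, errs = [], []
    for k in range(1, K_max + 1):
        log_wk = np.log(
            KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
        # B reference datasets drawn uniformly over the bounding box of X
        ref = [np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                      .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
               for _ in range(B)]
        gaps.append(np.mean(ref) - log_wk)
        # Standard error, corrected for the B simulations
        errs.append(np.std(ref) * np.sqrt(1 + 1 / B))
    return np.array(gaps), np.array(errs)
```

On well-separated data the gap rises sharply up to the true $K$, which is exactly what the selection rule exploits.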
Adaptations of K-Means for different data types, scales, and robustness requirements.
Standard K-Means computes distances between all points and all centroids at every iteration. For very large datasets (millions of points), this becomes prohibitively slow. Mini-Batch K-Means solves this by using a random subset (mini-batch) of the data at each iteration.
At each iteration, a mini-batch of size $b$ is sampled, points are assigned to their nearest centroid, and centroids are updated using a streaming average:
$$\mu_k \leftarrow (1 - \eta_t)\, \mu_k + \eta_t \, x_i$$
Where $\eta_t = \frac{1}{n_k^{(t)}}$ is a learning rate based on the count of points assigned to cluster $k$. This produces results within 1-2% of standard K-Means but can be orders of magnitude faster.
K-Medoids (Partitioning Around Medoids) uses actual data points as cluster centers instead of means. The medoid of a cluster is the point that minimizes the sum of distances to all other points in the cluster:
$$m_k = \arg\min_{x \in C_k} \sum_{x' \in C_k} d(x, x')$$
Key advantages over K-Means: the medoid is robust to outliers, since it is never dragged outside the data; any distance or dissimilarity metric can be used, not just Euclidean; and each cluster center is an actual, interpretable data point.
The main drawback is computational cost: $O(n^2 \cdot K \cdot t)$, which is much slower than K-Means for large datasets.
K-Medians replaces the mean with the component-wise median when updating cluster centers. Instead of minimizing squared Euclidean distance, it minimizes the sum of absolute distances (L1 norm / Manhattan distance):
$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|_1 = \sum_{k=1}^{K} \sum_{x_i \in C_k} \sum_{j=1}^{d} |x_{ij} - \mu_{kj}|$$
The median is the optimal center under L1 distance, just as the mean is optimal under L2. K-Medians is more robust to outliers than K-Means because the median is less sensitive to extreme values.
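A quick NumPy check of this robustness claim, with illustrative one-dimensional values:

```python
import numpy as np

# A tight cluster around 10 plus a single extreme outlier
cluster = np.array([9.0, 10.0, 10.0, 11.0, 10.0, 1000.0])

mean_center = cluster.mean()        # dragged far toward the outlier
median_center = np.median(cluster)  # stays with the bulk of the data

print(mean_center)    # 175.0
print(median_center)  # 10.0
```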
Bisecting K-Means is a divisive (top-down) hierarchical approach:
Begin with $K=1$ -- the entire dataset is one cluster.
Select the cluster with the highest inertia and split it into 2 using standard K-Means (with $K=2$).
Continue splitting until the desired number of clusters is reached.
This approach is less sensitive to initialization because each split only involves 2 clusters, and it naturally produces a hierarchical structure that can be useful for exploration.
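The three splitting steps can be sketched on top of scikit-learn's `KMeans` (the function name is ours; recent scikit-learn versions also ship a built-in `BisectingKMeans`):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, K, seed=42):
    """Bisecting K-Means sketch: repeatedly split the highest-inertia cluster."""
    labels = np.zeros(len(X), dtype=int)  # start with K=1: one big cluster
    for new_label in range(1, K):
        # Pick the existing cluster with the highest inertia
        inertias = [((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                    for c in range(new_label)]
        target = int(np.argmax(inertias))
        # Split it in two with standard K-Means (K=2)
        mask = labels == target
        sub = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[mask])
        labels[mask] = np.where(sub == 0, target, new_label)
    return labels
```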
Fastest variant. Near-identical results. Use for large-scale datasets with millions of points.
Robust to outliers. Works with any distance metric. Centers are real data points.
Minimizes L1 distance. More robust to outliers than K-Means. Uses component-wise median.
Top-down hierarchical. Less sensitive to initialization. Produces interpretable cluster hierarchy.
Understanding where K-Means breaks down and why it happens.
K-Means uses Euclidean distance to assign points, which implicitly assumes that clusters are spherical (isotropic) and of similar size. The Voronoi partition created by K-Means always produces convex regions.
This means K-Means will fail on: non-convex shapes such as concentric rings or interleaved half-moons; elongated or elliptical clusters; and clusters with very different sizes or densities.
Mathematically, K-Means is equivalent to fitting $K$ isotropic Gaussians with equal covariance. When the true data distribution deviates from this, the clustering will be suboptimal.
Because the centroid is the mean of all points in a cluster, a single extreme outlier can significantly shift the centroid away from the true center of the cluster. The squared distance in the objective function amplifies the effect of outliers: a point at distance $d$ from its centroid contributes $d^2$ to $J$, so doubling the distance quadruples the penalty.
For example, in a cluster of 100 points centered at the origin, a single outlier at $(1000, 1000)$ will shift the centroid by approximately $(10, 10)$ -- a significant displacement that distorts the entire cluster.
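This arithmetic is easy to verify with NumPy (placing the 100 points exactly at the origin for simplicity):

```python
import numpy as np

# 100 points at the origin plus one outlier at (1000, 1000)
points = np.vstack([np.zeros((100, 2)), [[1000.0, 1000.0]]])

centroid = points.mean(axis=0)  # 1000 / 101 ≈ 9.9 in each coordinate
print(centroid)
```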
Before running K-Means, remove or clip outliers, and standardize your features (zero mean, unit variance). Without standardization, features with larger scales will dominate the distance computation. A feature measured in thousands (salary) will overwhelm one measured in single digits (age), regardless of their actual importance.
K-Means requires the number of clusters $K$ to be specified before the algorithm runs. In many real-world scenarios, the true number of clusters is unknown. While methods like the elbow method and silhouette analysis provide guidance, they are heuristics and can be ambiguous.
This contrasts with algorithms like DBSCAN and HDBSCAN, which automatically determine the number of clusters from the data structure.
Linear time complexity per iteration. Scales to millions of data points with minimal memory overhead.
Always converges to a local optimum. Typically converges within 10-50 iterations for most datasets.
Available in every major ML library. Extensive documentation, tutorials, and community support.
Cluster centers are easy to understand. Each point belongs to exactly one cluster with a clear reason (nearest centroid).
Cannot capture non-convex shapes, elongated clusters, or clusters of varying density and size.
Different starting centroids can yield vastly different results. Must run multiple times for reliability.
Squared distances amplify the influence of outliers. A single extreme point can distort an entire cluster.
Requires the user to choose the number of clusters before running. No automatic detection of natural groupings.
How K-Means stacks up against DBSCAN, hierarchical clustering, and Gaussian mixtures.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters based on density rather than distance to centroids.
Use DBSCAN when clusters are non-spherical, data contains noise, or $K$ is unknown. Use K-Means when clusters are roughly spherical, data is clean, and scalability matters.
Hierarchical clustering builds a tree (dendrogram) of nested clusters. It comes in two flavors: agglomerative (bottom-up) and divisive (top-down).
Use hierarchical when you need the full cluster hierarchy, dataset is small to medium sized, or you want to explore multiple granularity levels. Use K-Means when you need a flat partition and scalability is important.
GMM is the probabilistic generalization of K-Means. It models the data as a mixture of $K$ Gaussian distributions, each with its own mean, covariance, and mixing weight:
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x \mid \mu_k, \Sigma_k\right)$$
Where $\pi_k$ is the mixing weight, $\mu_k$ is the mean, and $\Sigma_k$ is the covariance matrix of the $k$-th Gaussian.
Use GMM when you need soft cluster assignments, clusters are elliptical, or you need a probabilistic model. Use K-Means when you need hard assignments, speed, and simplicity.
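A minimal scikit-learn sketch of the soft assignments GMM provides (the synthetic data and parameters are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Unlike K-Means, GMM returns a probability for each component, not just a label
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)

probs = gmm.predict_proba(X)  # shape (300, 3); each row sums to 1
hard = gmm.predict(X)         # hard labels = the most probable component
```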
Real-world domains where K-Means delivers production-grade results.
Group customers by purchasing behavior, demographics, or engagement patterns. E-commerce companies use this to personalize marketing and recommendations.
Reduce image file size by clustering pixel colors into K representative colors. A 16-million-color image can be reduced to 16 or 256 colors with minimal visual loss.
Group similar documents by content using TF-IDF vectors. Powers topic discovery, search result grouping, and content recommendation systems.
Points far from all centroids are flagged as anomalies. Useful in fraud detection, network intrusion detection, and quality control.
Cluster assignments can be used as new features for downstream supervised learning tasks. Distance-to-centroid features often improve model performance.
The original use case from Lloyd's work. Compress continuous signals into discrete codebook entries for efficient storage and transmission.
Image compression is one of the most intuitive applications of K-Means. Here is how it works:
An image of size $H \times W$ with 3 color channels (RGB) is reshaped into a matrix of $n = H \times W$ rows and $d = 3$ columns. Each row is a pixel with its RGB values.
Cluster the pixels into $K$ clusters. Each centroid represents a "representative" color.
Every pixel is replaced by the color of its nearest centroid. The image now uses only $K$ colors instead of up to $256^3 \approx 16.7$ million.
With $K = 16$, each pixel needs only 4 bits (instead of 24), achieving a compression ratio of $6:1$ while retaining most visual quality.
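The arithmetic behind that ratio, ignoring the small fixed cost of storing the $K$-color palette itself:

```python
import math

K = 16                                    # palette size
bits_per_pixel = math.ceil(math.log2(K))  # 4-bit index into the palette
ratio = 24 / bits_per_pixel               # vs. 24-bit true-color RGB

print(bits_per_pixel, ratio)  # 4 6.0
```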
Click to add data points, choose K, and watch the algorithm cluster them in real-time!
From scikit-learn basics to evaluation metrics and real-world pipelines.
The simplest way to run K-Means in Python. scikit-learn uses K-Means++ initialization and multiple restarts by default.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data with 3 clusters
X, y_true = make_blobs(
    n_samples=300, centers=3,
    cluster_std=0.60, random_state=42
)

# Fit K-Means
kmeans = KMeans(
    n_clusters=3,       # K = 3
    init='k-means++',   # Smart initialization
    n_init=10,          # Run 10 times, keep best
    max_iter=300,       # Max iterations per run
    random_state=42
)
kmeans.fit(X)

# Results
print("Cluster centers:\n", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_[:10])
print("Inertia:", kmeans.inertia_)
print("Iterations:", kmeans.n_iter_)
A complete pipeline for choosing the optimal number of clusters using both the elbow method and silhouette scores.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Generate data
X, _ = make_blobs(n_samples=500, centers=4,
                  cluster_std=1.0, random_state=42)

# Test K from 2 to 10
K_range = range(2, 11)
inertias = []
silhouette_scores = []

for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)
    silhouette_scores.append(silhouette_score(X, km.labels_))

# Print results
for k, inertia, sil in zip(K_range, inertias, silhouette_scores):
    print(f"K={k}: Inertia={inertia:.1f}, "
          f"Silhouette={sil:.4f}")
A practical example of using K-Means to compress an image by reducing the number of colors.
from sklearn.cluster import KMeans
from sklearn.utils import shuffle
import numpy as np
from PIL import Image

# Load image and convert to numpy array
img = Image.open('photo.jpg')
img_array = np.array(img, dtype=np.float64) / 255

# Reshape: (H, W, 3) -> (H*W, 3)
h, w, d = img_array.shape
pixels = img_array.reshape((h * w, d))

# Sample 1000 pixels for faster fitting
sample = shuffle(pixels, random_state=42)[:1000]

# Fit K-Means with K=16 colors
kmeans = KMeans(n_clusters=16, n_init=4,
                random_state=42)
kmeans.fit(sample)

# Assign each pixel to nearest centroid
labels = kmeans.predict(pixels)

# Reconstruct compressed image
compressed = kmeans.cluster_centers_[labels]
compressed_img = compressed.reshape((h, w, d))

# Save as PNG (lossless) so the 16-color palette is preserved exactly
result = Image.fromarray(
    (compressed_img * 255).astype(np.uint8)
)
result.save('compressed_16colors.png')
A production-ready example with standardization, clustering, and comprehensive evaluation metrics.
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score
)
from sklearn.datasets import make_blobs

# Generate data with known ground truth
X, y_true = make_blobs(
    n_samples=1000, centers=5,
    cluster_std=1.2, random_state=42
)

# Step 1: Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Fit K-Means
kmeans = KMeans(n_clusters=5, n_init=10,
                random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Step 3: Evaluate clustering quality
print("=== Clustering Evaluation ===")
print(f"Inertia (WCSS): {kmeans.inertia_:.2f}")
print(f"Silhouette Score: "
      f"{silhouette_score(X_scaled, labels):.4f}")
print(f"Calinski-Harabasz Index: "
      f"{calinski_harabasz_score(X_scaled, labels):.2f}")
print(f"Davies-Bouldin Index: "
      f"{davies_bouldin_score(X_scaled, labels):.4f}")
print(f"Iterations to converge: {kmeans.n_iter_}")

# Step 4: Compare with Mini-Batch K-Means
mb_kmeans = MiniBatchKMeans(
    n_clusters=5, batch_size=100,
    n_init=10, random_state=42
)
mb_labels = mb_kmeans.fit_predict(X_scaled)
print(f"\nMini-Batch Silhouette: "
      f"{silhouette_score(X_scaled, mb_labels):.4f}")
Silhouette Score (higher is better, range [-1, 1]): measures cluster cohesion and separation. Calinski-Harabasz Index (higher is better): ratio of between-cluster to within-cluster dispersion. Davies-Bouldin Index (lower is better): average similarity between each cluster and its most similar neighbor. Use all three for a comprehensive evaluation.