K-Means Clustering
From centroid initialization to convergence. A beginner-friendly, visually interactive deep dive into the most widely used unsupervised learning algorithm.
How a signal processing technique from the 1950s became the world's most popular clustering algorithm.
The algorithm we now call K-Means was first conceived by Stuart Lloyd in 1957 at Bell Labs. Lloyd was not working on machine learning -- he was working on signal processing, specifically the problem of quantizing continuous signals into discrete levels for efficient transmission. His goal was to find the optimal set of quantization levels that would minimize the mean squared error between the original signal and its quantized version.
Lloyd's work was circulated internally at Bell Labs as a technical report but was not formally published until 1982, a full 25 years later. By then, the algorithm had already been rediscovered multiple times by different researchers working on entirely different problems.
The name "K-Means" was coined by James MacQueen in a 1967 paper titled "Some methods for classification and analysis of multivariate observations." MacQueen formalized the algorithm in a statistical context, proving convergence properties and establishing it as a method for partitioning observations into $K$ clusters where each observation belongs to the cluster with the nearest mean.
MacQueen's contribution was crucial: he transformed an engineering trick into a rigorous statistical method with formal guarantees, making it accessible to a much wider audience of scientists and analysts.
The journey of K-Means mirrors the evolution of data science itself -- from engineering signal processing to general-purpose pattern recognition:
Group similar data points together through iterative refinement of cluster centers.
K-Means is an unsupervised learning algorithm that partitions $n$ data points into $K$ clusters. Unlike supervised algorithms that learn from labeled examples, K-Means discovers structure in unlabeled data by grouping points that are "similar" to each other.
The notion of similarity is based on distance. Points that are close together in feature space are more likely to belong to the same cluster. Each cluster is represented by its centroid -- the mean of all points assigned to that cluster.
Imagine you have a table covered with mixed fruit -- apples, oranges, and bananas -- and you need to sort them into three groups without knowing their names. You might:
Start by randomly selecting three fruits as initial representatives for each group
Place each fruit next to the representative it looks most like (color, shape, size)
Find the most "average" fruit in each group and make it the new representative. Repeat until stable.
Despite being over six decades old, K-Means remains the most widely used clustering algorithm. Its enduring popularity stems from several key properties: it is simple to implement, fast, scalable to massive datasets, and easy to interpret.
In supervised learning (classification, regression), the algorithm learns from labeled data -- it knows the "right answer" during training. In unsupervised learning, there are no labels. K-Means must discover the structure of the data on its own. This makes evaluation harder, but also makes clustering applicable to exploration tasks where labels do not exist.
Initialize, assign, update, repeat -- four steps that solve clustering.
K-Means follows a simple iterative procedure that alternates between two operations: assigning points to clusters and updating cluster centers. The algorithm is sometimes called Lloyd's algorithm after its inventor.
Select $K$ initial centroid positions. This can be done randomly, using K-Means++, or with other strategies. The choice of initialization significantly affects the final result.
For every data point $x_i$, compute the distance to each centroid $\mu_k$ and assign the point to the cluster with the closest centroid:
$$c_i = \arg\min_{k \in \{1, \dots, K\}} \|x_i - \mu_k\|^2$$
For each cluster, recompute the centroid as the mean of all points currently assigned to that cluster:
$$\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
Repeat steps 2 and 3 until the centroids no longer change (or change by less than a threshold $\epsilon$), or a maximum number of iterations is reached.
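The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the function name and the optional `init` parameter (for supplying explicit starting centroids) are our own:

```python
import numpy as np

def lloyd_kmeans(X, K, init=None, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-Means (Lloyd's algorithm): initialize, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids (K random data points unless given explicitly)
    if init is None:
        centroids = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    else:
        centroids = np.asarray(init, dtype=float)
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its old centroid)
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Step 4: stop once the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```

For real work, prefer `sklearn.cluster.KMeans`, which adds K-Means++ seeding and multiple restarts.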
K-Means is guaranteed to converge in a finite number of iterations. Here is the intuition: the assignment step can never increase the objective, because each point moves to its nearest centroid; the update step can never increase it either, because the mean minimizes the sum of squared distances within a cluster; and since there are only finitely many ways to partition $n$ points into $K$ clusters, the objective cannot keep decreasing forever.
Formally, the objective function $J$ is monotonically non-increasing at every step:
$$J^{(t+1)} \leq J^{(t)} \quad \text{for every iteration } t$$
Convergence is guaranteed, but only to a local optimum. K-Means is an NP-hard problem in general, meaning finding the globally optimal clustering is computationally intractable. Different initializations can lead to different local optima. This is why running K-Means multiple times with different seeds and keeping the best result is standard practice.
Watch how the objective function (inertia) decreases with each iteration. The algorithm rapidly reduces the cost in the first few iterations and then converges to a stable value.
Watch the K-Means algorithm iterate. Click "Step" to advance one iteration — see centroids (★) move toward cluster centers and points reassign colors.
Each step: (1) assign points to nearest centroid, (2) move centroids to cluster means.
The objective function, centroid derivation, and why the mean is optimal.
K-Means seeks to minimize the Within-Cluster Sum of Squares (WCSS), also known as inertia. This measures the total squared distance between every data point and its assigned centroid:
$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$
Expanding the squared Euclidean distance for $d$-dimensional data:
$$\|x_i - \mu_k\|^2 = \sum_{j=1}^{d} (x_{ij} - \mu_{kj})^2$$
The objective can equivalently be written in terms of the cluster assignment indicators $r_{ik} \in \{0, 1\}$:
$$J = \sum_{i=1}^{n} \sum_{k=1}^{K} r_{ik} \, \|x_i - \mu_k\|^2$$
Where $r_{ik} = 1$ if point $x_i$ is assigned to cluster $k$, and $r_{ik} = 0$ otherwise.
Why is the centroid the mean of its cluster members? We can derive this by taking the derivative of $J$ with respect to $\mu_k$ and setting it to zero:
$$\frac{\partial J}{\partial \mu_k} = -2 \sum_{i=1}^{n} r_{ik} (x_i - \mu_k) = 0 \quad\Longrightarrow\quad \mu_k = \frac{\sum_{i=1}^{n} r_{ik} \, x_i}{\sum_{i=1}^{n} r_{ik}}$$
This proves that the centroid that minimizes the sum of squared distances to all points in a cluster is the arithmetic mean of those points. This is why the algorithm is called K-Means.
The objective can also be expressed in terms of within-cluster variance. For cluster $C_k$, the variance is:
$$\sigma_k^2 = \frac{1}{|C_k|} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$
So the total inertia is the sum of cluster sizes times their variances:
$$J = \sum_{k=1}^{K} |C_k| \, \sigma_k^2$$
An equivalent formulation uses pairwise distances within each cluster:
$$J = \sum_{k=1}^{K} \frac{1}{2\,|C_k|} \sum_{x_i, x_j \in C_k} \|x_i - x_j\|^2$$
Adding more clusters will always reduce inertia (at worst, you can split one cluster into two without increasing the objective). When $K = n$, inertia is zero -- each point is its own cluster. This is why simply minimizing inertia is not enough to choose $K$; we need methods like the elbow method or silhouette analysis to balance fit against complexity.
How you start determines where you end -- choosing initial centroids wisely.
The simplest approach: randomly select $K$ data points from the dataset as the initial centroids. While easy to implement, this method has serious drawbacks: two or more initial centroids can land in the same true cluster, leaving another cluster without a representative; unlucky draws frequently lead to poor local optima; and results vary from run to run, making the output unstable.
A common mitigation is to run K-Means multiple times (e.g., 10 or 100 runs) with different random seeds and keep the result with the lowest inertia. This is the n_init parameter in scikit-learn.
Introduced by Arthur and Vassilvitskii in 2007, K-Means++ is a smarter initialization strategy that spreads the initial centroids out. The key insight: initial centroids should be far apart from each other.
Select one data point $\mu_1$ uniformly at random from the dataset $X$.
For each remaining data point $x_i$, compute its distance $D(x_i)$ to the nearest already-chosen centroid:
$$D(x_i) = \min_{1 \leq j \leq c} \|x_i - \mu_j\|$$
Choose the next centroid $\mu_{c+1}$ from the remaining points with probability proportional to $D(x_i)^2$:
$$P(\mu_{c+1} = x_i) = \frac{D(x_i)^2}{\sum_{j=1}^{n} D(x_j)^2}$$
Continue steps 2-3 until all $K$ centroids have been selected, then proceed with standard K-Means iterations.
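The seeding procedure can be sketched in NumPy. This is an illustrative sketch (the helper name is ours); in practice, scikit-learn's built-in `init='k-means++'` should be preferred:

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-Means++ seeding: spread the K initial centroids far apart."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: pick the first centroid uniformly at random
    centroids = [X[rng.integers(n)]]
    for _ in range(K - 1):
        # Step 2: D(x)^2 = squared distance to the nearest chosen centroid
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :],
                           axis=2) ** 2,
            axis=1,
        )
        # Step 3: sample the next centroid with probability proportional to D(x)^2
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)
```

Because an already-chosen point has $D(x) = 0$, it can never be sampled twice, and far-away points are strongly favored.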
K-Means++ provides an approximation guarantee: the expected value of the inertia is at most $O(\log K)$ times the optimal inertia. In practice, it typically produces results within a factor of 2-3 of optimal and converges in fewer iterations than random initialization. It is the default initialization method in scikit-learn.
The Forgy method is the simplest initialization: randomly choose $K$ observations from the dataset and use them as the initial centroids. This is essentially the "random" initialization but specifically selects actual data points rather than random positions in feature space.
The difference from random partition initialization is subtle but important:
Forgy tends to spread initial centroids across the data space, while Random Partition tends to place all initial centroids near the center of the data.
The most important hyperparameter and the methods to find its optimal value.
The elbow method is the most popular heuristic for choosing $K$. Run K-Means for a range of $K$ values (e.g., 1 to 10), plot the inertia for each, and look for the "elbow" -- the point where adding more clusters yields diminishing returns.
The elbow is where the rate of decrease sharply changes. Before the elbow, adding clusters significantly reduces inertia. After the elbow, additional clusters provide marginal improvement.
In practice, the inertia curve often decreases smoothly without a sharp bend, making it difficult to identify a clear elbow. When the data does not have well-separated clusters, or when clusters have very different sizes, the elbow method can be misleading. Always use it in combination with other methods.
Click on the chart to select your optimal K. The elbow point is where adding more clusters stops significantly reducing inertia.
The orange marker shows your selection. The green circle highlights the true elbow.
The silhouette score measures how similar a point is to its own cluster compared to the nearest neighboring cluster. For each data point $i$, let $a(i)$ be the mean distance from $i$ to the other points in its own cluster, and $b(i)$ the mean distance from $i$ to the points of the nearest other cluster. Then:
$$s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}$$
The silhouette score ranges from $-1$ to $+1$: values near $+1$ mean the point sits well inside its cluster, values near $0$ mean it lies on the boundary between two clusters, and negative values suggest it has likely been assigned to the wrong cluster.
The average silhouette score across all points gives an overall quality measure. Choose the $K$ that maximizes this average.
The gap statistic, proposed by Tibshirani, Walther, and Hastie (2001), compares the inertia of the clustering to what would be expected under a null reference distribution (uniform random data):
$$\operatorname{Gap}(K) = E^{*}\!\left[\log W_K\right] - \log W_K$$
Where $W_K$ is the within-cluster dispersion for $K$ clusters and $E^*$ denotes the expectation under the null reference. The optimal $K$ is the smallest value whose gap is within one standard error of the next gap:
$$K^{*} = \min \left\{ K : \operatorname{Gap}(K) \geq \operatorname{Gap}(K+1) - s_{K+1} \right\}$$
The gap statistic is more principled than the elbow method because it provides a formal null hypothesis test, but it is computationally more expensive due to the reference distribution simulations.
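Since the gap statistic is rarely shown in code, here is a sketch under stated assumptions: we use scikit-learn's `KMeans` inertia as $W_K$, draw reference datasets uniformly over the bounding box of $X$, and the function name and defaults are our own:

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, K_max=6, B=10, seed=42):
    """Gap statistic sketch: compare log-inertia on X to uniform reference data."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps, errs = [], []
    for k in range(1, K_max + 1):
        log_wk = np.log(
            KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_)
        # B reference datasets drawn uniformly over the bounding box of X
        ref = [np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                      .fit(rng.uniform(lo, hi, size=X.shape)).inertia_)
               for _ in range(B)]
        gaps.append(np.mean(ref) - log_wk)
        # Standard error, corrected for the B simulations
        errs.append(np.std(ref) * np.sqrt(1 + 1 / B))
    return np.array(gaps), np.array(errs)
```

On well-separated data the gap rises sharply up to the true $K$, which is exactly what the selection rule exploits.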
Adaptations of K-Means for different data types, scales, and robustness requirements.
Standard K-Means computes distances between all points and all centroids at every iteration. For very large datasets (millions of points), this becomes prohibitively slow. Mini-Batch K-Means solves this by using a random subset (mini-batch) of the data at each iteration.
At each iteration, a mini-batch of size $b$ is sampled, points are assigned to their nearest centroid, and centroids are updated using a streaming average:
$$\mu_k \leftarrow (1 - \eta_t)\, \mu_k + \eta_t \, x_i$$
Where $\eta_t = \frac{1}{n_k^{(t)}}$ is a learning rate based on the count of points assigned to cluster $k$. This produces results within 1-2% of standard K-Means but can be orders of magnitude faster.
K-Medoids (Partitioning Around Medoids) uses actual data points as cluster centers instead of means. The medoid of a cluster is the point that minimizes the sum of distances to all other points in the cluster:
$$m_k = \arg\min_{x \in C_k} \sum_{x' \in C_k} d(x, x')$$
Key advantages over K-Means: the medoid is robust to outliers, since it is never dragged outside the data; any distance or dissimilarity metric can be used, not just Euclidean; and each cluster center is an actual, interpretable data point.
The main drawback is computational cost: $O(n^2 \cdot K \cdot t)$, which is much slower than K-Means for large datasets.
K-Medians replaces the mean with the component-wise median when updating cluster centers. Instead of minimizing squared Euclidean distance, it minimizes the sum of absolute distances (L1 norm / Manhattan distance):
$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|_1 = \sum_{k=1}^{K} \sum_{x_i \in C_k} \sum_{j=1}^{d} |x_{ij} - \mu_{kj}|$$
The median is the optimal center under L1 distance, just as the mean is optimal under L2. K-Medians is more robust to outliers than K-Means because the median is less sensitive to extreme values.
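A quick NumPy check of this robustness claim, with illustrative one-dimensional values:

```python
import numpy as np

# A tight cluster around 10 plus a single extreme outlier
cluster = np.array([9.0, 10.0, 10.0, 11.0, 10.0, 1000.0])

mean_center = cluster.mean()        # dragged far toward the outlier
median_center = np.median(cluster)  # stays with the bulk of the data

print(mean_center)    # 175.0
print(median_center)  # 10.0
```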
Bisecting K-Means is a divisive (top-down) hierarchical approach:
Begin with $K=1$ -- the entire dataset is one cluster.
Select the cluster with the highest inertia and split it into 2 using standard K-Means (with $K=2$).
Continue splitting until the desired number of clusters is reached.
This approach is less sensitive to initialization because each split only involves 2 clusters, and it naturally produces a hierarchical structure that can be useful for exploration.
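The three splitting steps can be sketched on top of scikit-learn's `KMeans` (the function name is ours; recent scikit-learn versions also ship a built-in `BisectingKMeans`):

```python
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(X, K, seed=42):
    """Bisecting K-Means sketch: repeatedly split the highest-inertia cluster."""
    labels = np.zeros(len(X), dtype=int)  # start with K=1: one big cluster
    for new_label in range(1, K):
        # Pick the existing cluster with the highest inertia
        inertias = [((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                    for c in range(new_label)]
        target = int(np.argmax(inertias))
        # Split it in two with standard K-Means (K=2)
        mask = labels == target
        sub = KMeans(n_clusters=2, n_init=10, random_state=seed).fit_predict(X[mask])
        labels[mask] = np.where(sub == 0, target, new_label)
    return labels
```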
Fastest variant. Near-identical results. Use for large-scale datasets with millions of points.
Robust to outliers. Works with any distance metric. Centers are real data points.
Minimizes L1 distance. More robust to outliers than K-Means. Uses component-wise median.
Top-down hierarchical. Less sensitive to initialization. Produces interpretable cluster hierarchy.
Understanding where K-Means breaks down and why it happens.
K-Means uses Euclidean distance to assign points, which implicitly assumes that clusters are spherical (isotropic) and of similar size. The Voronoi partition created by K-Means always produces convex regions.
This means K-Means will fail on: non-convex shapes such as concentric rings or interleaved half-moons; elongated or elliptical clusters; and clusters with very different sizes or densities.
Mathematically, K-Means is equivalent to fitting $K$ isotropic Gaussians with equal covariance. When the true data distribution deviates from this, the clustering will be suboptimal.
Because the centroid is the mean of all points in a cluster, a single extreme outlier can significantly shift the centroid away from the true center of the cluster. The squared distance in the objective function amplifies the effect of outliers: a point at distance $d$ from its centroid contributes $d^2$ to $J$, so doubling the distance quadruples the penalty.
For example, in a cluster of 100 points centered at the origin, a single outlier at $(1000, 1000)$ will shift the centroid by approximately $(10, 10)$ -- a significant displacement that distorts the entire cluster.
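This arithmetic is easy to verify with NumPy (placing the 100 points exactly at the origin for simplicity):

```python
import numpy as np

# 100 points at the origin plus one outlier at (1000, 1000)
points = np.vstack([np.zeros((100, 2)), [[1000.0, 1000.0]]])

centroid = points.mean(axis=0)  # 1000 / 101 ≈ 9.9 in each coordinate
print(centroid)
```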
Before running K-Means, remove or clip outliers, and standardize your features (zero mean, unit variance). Without standardization, features with larger scales will dominate the distance computation. A feature measured in thousands (salary) will overwhelm one measured in single digits (age), regardless of their actual importance.
K-Means requires the number of clusters $K$ to be specified before the algorithm runs. In many real-world scenarios, the true number of clusters is unknown. While methods like the elbow method and silhouette analysis provide guidance, they are heuristics and can be ambiguous.
This contrasts with algorithms like DBSCAN and HDBSCAN, which automatically determine the number of clusters from the data structure.
Linear time complexity per iteration. Scales to millions of data points with minimal memory overhead.
Always converges to a local optimum. Typically converges within 10-50 iterations for most datasets.
Available in every major ML library. Extensive documentation, tutorials, and community support.
Cluster centers are easy to understand. Each point belongs to exactly one cluster with a clear reason (nearest centroid).
Cannot capture non-convex shapes, elongated clusters, or clusters of varying density and size.
Different starting centroids can yield vastly different results. Must run multiple times for reliability.
Squared distances amplify the influence of outliers. A single extreme point can distort an entire cluster.
Requires the user to choose the number of clusters before running. No automatic detection of natural groupings.
How K-Means stacks up against DBSCAN, hierarchical clustering, and Gaussian mixtures.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters based on density rather than distance to centroids.
Use DBSCAN when clusters are non-spherical, data contains noise, or $K$ is unknown. Use K-Means when clusters are roughly spherical, data is clean, and scalability matters.
Hierarchical clustering builds a tree (dendrogram) of nested clusters. It comes in two flavors: agglomerative (bottom-up) and divisive (top-down).
Use hierarchical when you need the full cluster hierarchy, dataset is small to medium sized, or you want to explore multiple granularity levels. Use K-Means when you need a flat partition and scalability is important.
GMM is the probabilistic generalization of K-Means. It models the data as a mixture of $K$ Gaussian distributions, each with its own mean, covariance, and mixing weight:
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}\!\left(x \mid \mu_k, \Sigma_k\right)$$
Where $\pi_k$ is the mixing weight, $\mu_k$ is the mean, and $\Sigma_k$ is the covariance matrix of the $k$-th Gaussian.
Use GMM when you need soft cluster assignments, clusters are elliptical, or you need a probabilistic model. Use K-Means when you need hard assignments, speed, and simplicity.
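A minimal scikit-learn sketch of the soft assignments GMM provides (the synthetic data and parameters are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

# Unlike K-Means, GMM returns a probability for each component, not just a label
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)

probs = gmm.predict_proba(X)  # shape (300, 3); each row sums to 1
hard = gmm.predict(X)         # hard labels = the most probable component
```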
Real-world domains where K-Means delivers production-grade results.
Group customers by purchasing behavior, demographics, or engagement patterns. E-commerce companies use this to personalize marketing and recommendations.
Reduce image file size by clustering pixel colors into K representative colors. A 16-million-color image can be reduced to 16 or 256 colors with minimal visual loss.
Group similar documents by content using TF-IDF vectors. Powers topic discovery, search result grouping, and content recommendation systems.
Points far from all centroids are flagged as anomalies. Useful in fraud detection, network intrusion detection, and quality control.
Cluster assignments can be used as new features for downstream supervised learning tasks. Distance-to-centroid features often improve model performance.
The original use case from Lloyd's work. Compress continuous signals into discrete codebook entries for efficient storage and transmission.
Image compression is one of the most intuitive applications of K-Means. Here is how it works:
An image of size $H \times W$ with 3 color channels (RGB) is reshaped into a matrix of $n = H \times W$ rows and $d = 3$ columns. Each row is a pixel with its RGB values.
Cluster the pixels into $K$ clusters. Each centroid represents a "representative" color.
Every pixel is replaced by the color of its nearest centroid. The image now uses only $K$ colors instead of up to $256^3 \approx 16.7$ million.
With $K = 16$, each pixel needs only 4 bits (instead of 24), achieving a compression ratio of $6:1$ while retaining most visual quality.
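The arithmetic behind that ratio, ignoring the small fixed cost of storing the $K$-color palette itself:

```python
import math

K = 16                                    # palette size
bits_per_pixel = math.ceil(math.log2(K))  # 4-bit index into the palette
ratio = 24 / bits_per_pixel               # vs. 24-bit true-color RGB

print(bits_per_pixel, ratio)  # 4 6.0
```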
Click to add data points, choose K, and watch the algorithm cluster them in real-time!
From scikit-learn basics to evaluation metrics and real-world pipelines.
The simplest way to run K-Means in Python. scikit-learn uses K-Means++ initialization and multiple restarts by default.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate synthetic data with 3 clusters
X, y_true = make_blobs(
    n_samples=300, centers=3,
    cluster_std=0.60, random_state=42
)

# Fit K-Means
kmeans = KMeans(
    n_clusters=3,       # K = 3
    init='k-means++',   # Smart initialization
    n_init=10,          # Run 10 times, keep best
    max_iter=300,       # Max iterations per run
    random_state=42
)
kmeans.fit(X)

# Results
print("Cluster centers:\n", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_[:10])
print("Inertia:", kmeans.inertia_)
print("Iterations:", kmeans.n_iter_)
A complete pipeline for choosing the optimal number of clusters using both the elbow method and silhouette scores.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Generate data
X, _ = make_blobs(n_samples=500, centers=4,
                  cluster_std=1.0, random_state=42)

# Test K from 2 to 10
K_range = range(2, 11)
inertias = []
silhouette_scores = []

for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X)
    inertias.append(km.inertia_)
    silhouette_scores.append(silhouette_score(X, km.labels_))

# Print results
for k, inertia, sil in zip(K_range, inertias, silhouette_scores):
    print(f"K={k}: Inertia={inertia:.1f}, "
          f"Silhouette={sil:.4f}")
A practical example of using K-Means to compress an image by reducing the number of colors.
from sklearn.cluster import KMeans
from sklearn.utils import shuffle
import numpy as np
from PIL import Image

# Load image and convert to numpy array
img = Image.open('photo.jpg')
img_array = np.array(img, dtype=np.float64) / 255

# Reshape: (H, W, 3) -> (H*W, 3)
h, w, d = img_array.shape
pixels = img_array.reshape((h * w, d))

# Sample 1000 pixels for faster fitting
sample = shuffle(pixels, random_state=42)[:1000]

# Fit K-Means with K=16 colors
kmeans = KMeans(n_clusters=16, n_init=4,
                random_state=42)
kmeans.fit(sample)

# Assign each pixel to nearest centroid
labels = kmeans.predict(pixels)

# Reconstruct compressed image
compressed = kmeans.cluster_centers_[labels]
compressed_img = compressed.reshape((h, w, d))

# Save as PNG (lossless) so the 16-color palette is preserved exactly
result = Image.fromarray(
    (compressed_img * 255).astype(np.uint8)
)
result.save('compressed_16colors.png')
A production-ready example with standardization, clustering, and comprehensive evaluation metrics.
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score
)
from sklearn.datasets import make_blobs

# Generate data with known ground truth
X, y_true = make_blobs(
    n_samples=1000, centers=5,
    cluster_std=1.2, random_state=42
)

# Step 1: Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Fit K-Means
kmeans = KMeans(n_clusters=5, n_init=10,
                random_state=42)
labels = kmeans.fit_predict(X_scaled)

# Step 3: Evaluate clustering quality
print("=== Clustering Evaluation ===")
print(f"Inertia (WCSS): {kmeans.inertia_:.2f}")
print(f"Silhouette Score: "
      f"{silhouette_score(X_scaled, labels):.4f}")
print(f"Calinski-Harabasz Index: "
      f"{calinski_harabasz_score(X_scaled, labels):.2f}")
print(f"Davies-Bouldin Index: "
      f"{davies_bouldin_score(X_scaled, labels):.4f}")
print(f"Iterations to converge: {kmeans.n_iter_}")

# Step 4: Compare with Mini-Batch K-Means
mb_kmeans = MiniBatchKMeans(
    n_clusters=5, batch_size=100,
    n_init=10, random_state=42
)
mb_labels = mb_kmeans.fit_predict(X_scaled)
print(f"\nMini-Batch Silhouette: "
      f"{silhouette_score(X_scaled, mb_labels):.4f}")
Silhouette Score (higher is better, range [-1, 1]): measures cluster cohesion and separation. Calinski-Harabasz Index (higher is better): ratio of between-cluster to within-cluster dispersion. Davies-Bouldin Index (lower is better): average similarity between each cluster and its most similar neighbor. Use all three for a comprehensive evaluation.