
K-Means Clustering Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Algorithm Steps

  1. Initialize: Choose $K$ centroids $\mu_1, \mu_2, \ldots, \mu_K$ (randomly or via K-Means++)
  2. Assign: Assign each point $x_i$ to the nearest centroid: $$c_i = \arg\min_k \| x_i - \mu_k \|^2$$
  3. Update: Recompute each centroid as the mean of its assigned points: $$\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
  4. Repeat: Iterate steps 2-3 until centroids converge (no change) or maximum iterations reached

Time complexity: $O(n \cdot K \cdot d \cdot t)$ where $n$ = samples, $d$ = dimensions, $t$ = iterations.
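The four steps above translate almost line-for-line into NumPy. Below is a minimal sketch of Lloyd's algorithm, using random initialization for brevity; it assumes no cluster ends up empty, a corner case that production implementations handle explicitly:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Minimal K-Means (Lloyd's algorithm) sketch."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick K distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # 2. Assign: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: mean of the points assigned to each centroid
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # 4. Repeat: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

The pairwise-distance broadcast makes each iteration exactly the $O(n \cdot K \cdot d)$ cost quoted above.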

Key Formulas

Objective (WCSS):
$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2$$
Centroid Update:
$$\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
Euclidean Distance:
$$d(x, y) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}$$
Manhattan Distance:
$$d(x, y) = \sum_{j=1}^{d} |x_j - y_j|$$
Cosine Distance:
$$d(x, y) = 1 - \frac{x \cdot y}{\|x\| \|y\|}$$
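As a quick sanity check, the three distances can be computed directly in NumPy. The vectors here are arbitrary examples; note that $y = 2x$, so their cosine distance is zero even though their Euclidean distance is not:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # y = 2x, i.e. same direction as x

euclidean = np.sqrt(np.sum((x - y) ** 2))   # sqrt(1 + 4 + 9) = sqrt(14)
manhattan = np.sum(np.abs(x - y))           # 1 + 2 + 3 = 6
cosine = 1 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # 0: parallel vectors
```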

K-Means++ Initialization

  1. Choose the first centroid $\mu_1$ uniformly at random from the data
  2. For each point $x_i$, compute the distance to the nearest already-chosen centroid ($c$ = number of centroids chosen so far): $$D(x_i) = \min_{j \in \{1,\ldots,c\}} \| x_i - \mu_j \|$$
  3. Select next centroid with probability proportional to $D(x_i)^2$: $$P(x_i) = \frac{D(x_i)^2}{\sum_{x \in X} D(x)^2}$$
  4. Repeat steps 2-3 until $K$ centroids are chosen

K-Means++ guarantees that the expected final objective is within an $O(\log K)$ factor of the optimal clustering cost (Arthur & Vassilvitskii, 2007).
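The seeding procedure above can be sketched in a few lines of NumPy (a simplified version; library implementations such as scikit-learn's add refinements like greedy candidate sampling):

```python
import numpy as np

def kmeanspp_init(X, K, seed=0):
    """Sketch of K-Means++ seeding: spread centroids via D(x)^2 weighting."""
    rng = np.random.default_rng(seed)
    # 1. First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # 2. D(x_i): distance from each point to its nearest chosen centroid
        d = np.min(np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :],
                                  axis=2), axis=1)
        # 3. Sample the next centroid with probability proportional to D(x_i)^2
        probs = d ** 2 / np.sum(d ** 2)
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

Because already-chosen points have $D(x) = 0$, they can never be re-selected, and far-away regions are strongly favored.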

Choosing K

Elbow Method:
Plot $J(K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$ vs. $K$, pick the "elbow" point
Silhouette Score:
$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$
Where:
$a(i)$ = mean intra-cluster distance, $b(i)$ = mean nearest-cluster distance
Gap Statistic:
$$\text{Gap}(K) = E[\log W_K^*] - \log W_K$$
Rule:
Choose smallest $K$ such that $\text{Gap}(K) \geq \text{Gap}(K+1) - s_{K+1}$

Silhouette score ranges from $-1$ (wrong cluster) to $+1$ (well-matched). Values near $0$ indicate overlapping clusters.
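Both methods are straightforward to run with scikit-learn. A sketch comparing candidate values of $K$ on synthetic blobs (the four cluster centers are illustrative choices, not part of any standard dataset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [0, 10], [10, 0], [10, 10]],
                  cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    scores[k] = silhouette_score(X, km.labels_)
    print(f"K={k}: inertia={km.inertia_:.1f}, silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)  # the silhouette peak suggests K
```

Inertia always decreases as $K$ grows, which is why the elbow is read visually, whereas the silhouette score peaks at the best-matching $K$.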

Evaluation Metrics

Inertia (WCSS):
$$\text{Inertia} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$
Silhouette Avg:
$$\bar{s} = \frac{1}{n} \sum_{i=1}^{n} s(i), \quad s(i) \in [-1, 1]$$
Calinski-Harabasz:
$$\text{CH} = \frac{\text{tr}(B_K)}{\text{tr}(W_K)} \cdot \frac{n - K}{K - 1}$$
Davies-Bouldin:
$$\text{DB} = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} \frac{S_k + S_l}{d(\mu_k, \mu_l)}$$

Higher CH = better. Lower DB = better. $B_K$ = between-cluster dispersion, $W_K$ = within-cluster dispersion, $S_k$ = avg distance in cluster $k$.
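All four metrics are available in scikit-learn. A sketch on well-separated synthetic data (the two cluster centers are illustrative choices), where we expect a silhouette near $+1$, a large CH, and a small DB:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Two well-separated synthetic clusters
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [8, 8]],
                  cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # close to +1 here
ch = calinski_harabasz_score(X, labels)  # higher = better
db = davies_bouldin_score(X, labels)     # lower = better
print(f"silhouette={sil:.3f}, CH={ch:.1f}, DB={db:.3f}")
```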

Variants

Mini-Batch K-Means:
Uses random subsets (mini-batches) of size $b$ per iteration. Faster for large datasets: $O(b \cdot K \cdot d \cdot t)$ with $b \ll n$
K-Medoids (PAM):
$$\mu_k = \arg\min_{x_i \in C_k} \sum_{x_j \in C_k} d(x_i, x_j)$$
K-Medians:
Uses median instead of mean; robust to outliers. Uses $L_1$ (Manhattan) distance.
Bisecting K-Means:
Hierarchical: start with 1 cluster, repeatedly split the cluster with highest SSE using 2-means.

K-Medoids uses actual data points as centers, making it more robust to outliers than standard K-Means.
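Mini-Batch K-Means is a drop-in replacement in scikit-learn. A sketch comparing it against the full algorithm on synthetic data (batch size and blob parameters are illustrative):

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=1)

full = KMeans(n_clusters=5, n_init=10, random_state=1).fit(X)
mini = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=10,
                       random_state=1).fit(X)

# Mini-batch trades a small amount of inertia for a large speedup
print(f"full: {full.inertia_:.0f}  mini-batch: {mini.inertia_:.0f}")
```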

Assumptions & Limitations

Assumptions:

  • Clusters are spherical (isotropic) and roughly equal in size
  • Each cluster has similar variance (equal spread)
  • Number of clusters $K$ is known a priori
  • Data points belong to exactly one cluster (hard assignment)

Limitations:

  • Sensitive to initialization - can converge to local minima
  • Cannot detect non-convex or irregularly shaped clusters
  • Sensitive to outliers (they pull centroids away)
  • Assumes clusters of similar density and size
  • Does not work well with high-dimensional sparse data

vs Other Clustering

| Feature | K-Means | DBSCAN | Hierarchical | GMM |
| --- | --- | --- | --- | --- |
| Cluster shape | Spherical | Arbitrary | Arbitrary | Elliptical |
| Requires $K$ | Yes | No | Optional | Yes |
| Handles outliers | Poor | Good (noise label) | Poor | Moderate |
| Assignment | Hard | Hard + noise | Hard | Soft (probabilistic) |
| Scalability | $O(nKd)$ | $O(n \log n)$ | $O(n^2 \log n)$ | $O(nKd^2)$ |

Common Mistakes

  • Not scaling features: K-Means uses distance, so features on different scales will dominate. Always standardize first: $z = \frac{x - \mu}{\sigma}$
  • Wrong K: Blindly picking $K$ without elbow/silhouette analysis leads to poor clusters
  • Single run: K-Means is non-deterministic; always run multiple times (e.g., n_init=10) and pick the best $J$
  • Ignoring convergence: Not checking if the algorithm actually converged before using results
  • Using on categorical data: K-Means requires numerical features; use K-Modes for categorical data
  • Assuming equal importance: Not weighting features appropriately before clustering
  • Ignoring outliers: Outliers heavily distort centroids; consider removing or using K-Medoids
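The first and third mistakes above are the easiest to fix in code. A sketch of the standard preprocessing pattern, on synthetic data where one feature's scale would otherwise swamp the other:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 0 spans [0, 1]; feature 1 spans [0, 10000] and would
# dominate raw Euclidean distances without scaling
X = np.c_[rng.random(300), rng.random(300) * 10_000]

X_scaled = StandardScaler().fit_transform(X)   # z = (x - mu) / sigma
# n_init=10 reruns the algorithm and keeps the result with the best J
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
```

After standardization, both features have zero mean and unit variance, so each contributes equally to the distance computation.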

Interview Quick-Fire

Q: What is the objective function of K-Means?

A: Minimize the Within-Cluster Sum of Squares (WCSS): $J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$. It measures total squared distance from each point to its assigned centroid.

Q: Is K-Means guaranteed to converge?

A: Yes. Each assignment and update step can only decrease (or leave unchanged) $J$, and there are only finitely many possible partitions, so K-Means converges in a finite number of iterations. However, it converges to a local minimum, not necessarily the global optimum.

Q: Why is feature scaling important for K-Means?

A: K-Means uses Euclidean distance. Features on larger scales will dominate the distance calculation, biasing cluster assignments. Standardization ensures all features contribute equally.

Q: How does K-Means++ improve over random initialization?

A: K-Means++ spreads initial centroids apart by choosing each subsequent centroid with probability proportional to $D(x)^2$, reducing the chance of poor local minima and improving convergence speed.

Q: What is the difference between K-Means and GMM?

A: K-Means performs hard assignment (each point belongs to one cluster), while GMM provides soft (probabilistic) assignment. GMM can model elliptical clusters; K-Means assumes spherical clusters.

Q: When would you choose DBSCAN over K-Means?

A: When clusters have arbitrary shapes, you don't know $K$ in advance, or your data contains noise/outliers. DBSCAN automatically detects the number of clusters and labels outliers as noise.

Q: How do you handle categorical features with K-Means?

A: K-Means cannot directly handle categorical data. Use K-Modes (uses mode instead of mean) for categorical, or K-Prototypes for mixed data. One-hot encoding is an alternative but can be problematic.

Q: What happens if you set K equal to the number of data points?

A: Each data point becomes its own cluster with inertia = 0. This is trivially optimal for $J$ but provides no useful grouping. The elbow method helps find the balance between low $J$ and meaningful $K$.
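This degenerate case is easy to verify: with as many clusters as distinct points, every point becomes its own centroid and the inertia collapses to zero.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.arange(10, dtype=float).reshape(-1, 1)   # 10 distinct points
km = KMeans(n_clusters=10, n_init=1, random_state=0).fit(X)
print(km.inertia_)  # 0.0 -- each point sits exactly on its own centroid
```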
