
K-Means Clustering Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Algorithm Steps

  1. Initialize: Choose $K$ centroids $\mu_1, \mu_2, \ldots, \mu_K$ (randomly or via K-Means++)
  2. Assign: Assign each point $x_i$ to the nearest centroid: $$c_i = \arg\min_k \| x_i - \mu_k \|^2$$
  3. Update: Recompute each centroid as the mean of its assigned points: $$\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
  4. Repeat: Iterate steps 2-3 until centroids converge (no change) or maximum iterations reached

Time complexity: $O(n \cdot K \cdot d \cdot t)$ where $n$ = samples, $d$ = dimensions, $t$ = iterations.
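The four steps above translate almost line-for-line into NumPy. Below is a minimal sketch of Lloyd's algorithm, using random initialization for brevity; it assumes no cluster ends up empty, a corner case that production implementations handle explicitly:

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Minimal K-Means (Lloyd's algorithm) sketch."""
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick K distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # 2. Assign: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: mean of the points assigned to each centroid
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # 4. Repeat: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

The pairwise-distance broadcast makes each iteration exactly the $O(n \cdot K \cdot d)$ cost quoted above.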

Key Formulas

Objective (WCSS):
$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2$$
Centroid Update:
$$\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$$
Euclidean Distance:
$$d(x, y) = \sqrt{\sum_{j=1}^{d} (x_j - y_j)^2}$$
Manhattan Distance:
$$d(x, y) = \sum_{j=1}^{d} |x_j - y_j|$$
Cosine Distance:
$$d(x, y) = 1 - \frac{x \cdot y}{\|x\| \|y\|}$$
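As a quick sanity check, the three distances can be computed directly in NumPy. The vectors here are arbitrary examples; note that $y = 2x$, so their cosine distance is zero even though their Euclidean distance is not:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # y = 2x, i.e. same direction as x

euclidean = np.sqrt(np.sum((x - y) ** 2))   # sqrt(1 + 4 + 9) = sqrt(14)
manhattan = np.sum(np.abs(x - y))           # 1 + 2 + 3 = 6
cosine = 1 - x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # 0: parallel vectors
```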

K-Means++ Initialization

  1. Choose the first centroid $\mu_1$ uniformly at random from the data
  2. For each point $x_i$, compute the distance to the nearest already-chosen centroid ($c$ = number of centroids chosen so far): $$D(x_i) = \min_{j \in \{1,\ldots,c\}} \| x_i - \mu_j \|$$
  3. Select next centroid with probability proportional to $D(x_i)^2$: $$P(x_i) = \frac{D(x_i)^2}{\sum_{x \in X} D(x)^2}$$
  4. Repeat steps 2-3 until $K$ centroids are chosen

K-Means++ guarantees that the expected final objective is within an $O(\log K)$ factor of the optimal clustering cost (Arthur & Vassilvitskii, 2007).
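The seeding procedure above can be sketched in a few lines of NumPy (a simplified version; library implementations such as scikit-learn's add refinements like greedy candidate sampling):

```python
import numpy as np

def kmeanspp_init(X, K, seed=0):
    """Sketch of K-Means++ seeding: spread centroids via D(x)^2 weighting."""
    rng = np.random.default_rng(seed)
    # 1. First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # 2. D(x_i): distance from each point to its nearest chosen centroid
        d = np.min(np.linalg.norm(X[:, None, :] - np.array(centroids)[None, :, :],
                                  axis=2), axis=1)
        # 3. Sample the next centroid with probability proportional to D(x_i)^2
        probs = d ** 2 / np.sum(d ** 2)
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

Because already-chosen points have $D(x) = 0$, they can never be re-selected, and far-away regions are strongly favored.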

Choosing K

Elbow Method:
Plot $J(K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$ vs. $K$, pick the "elbow" point
Silhouette Score:
$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$
Where:
$a(i)$ = mean intra-cluster distance, $b(i)$ = mean nearest-cluster distance
Gap Statistic:
$$\text{Gap}(K) = E[\log W_K^*] - \log W_K$$
Rule:
Choose smallest $K$ such that $\text{Gap}(K) \geq \text{Gap}(K+1) - s_{K+1}$

Silhouette score ranges from $-1$ (wrong cluster) to $+1$ (well-matched). Values near $0$ indicate overlapping clusters.
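Both methods are straightforward to run with scikit-learn. A sketch comparing candidate values of $K$ on synthetic blobs (the four cluster centers are illustrative choices, not part of any standard dataset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic clusters
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [0, 10], [10, 0], [10, 10]],
                  cluster_std=0.8, random_state=42)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    scores[k] = silhouette_score(X, km.labels_)
    print(f"K={k}: inertia={km.inertia_:.1f}, silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)  # the silhouette peak suggests K
```

Inertia always decreases as $K$ grows, which is why the elbow is read visually, whereas the silhouette score peaks at the best-matching $K$.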

Evaluation Metrics

Inertia (WCSS):
$$\text{Inertia} = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$$
Silhouette Avg:
$$\bar{s} = \frac{1}{n} \sum_{i=1}^{n} s(i), \quad s(i) \in [-1, 1]$$
Calinski-Harabasz:
$$\text{CH} = \frac{\text{tr}(B_K)}{\text{tr}(W_K)} \cdot \frac{n - K}{K - 1}$$
Davies-Bouldin:
$$\text{DB} = \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} \frac{S_k + S_l}{d(\mu_k, \mu_l)}$$

Higher CH = better. Lower DB = better. $B_K$ = between-cluster dispersion, $W_K$ = within-cluster dispersion, $S_k$ = avg distance in cluster $k$.
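All four metrics are available in scikit-learn. A sketch on well-separated synthetic data (the two cluster centers are illustrative choices), where we expect a silhouette near $+1$, a large CH, and a small DB:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Two well-separated synthetic clusters
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [8, 8]],
                  cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)        # close to +1 here
ch = calinski_harabasz_score(X, labels)  # higher = better
db = davies_bouldin_score(X, labels)     # lower = better
print(f"silhouette={sil:.3f}, CH={ch:.1f}, DB={db:.3f}")
```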

Variants

Mini-Batch K-Means:
Uses random subsets (mini-batches) of size $b$ per iteration. Faster for large datasets: $O(b \cdot K \cdot d \cdot t)$ with $b \ll n$
K-Medoids (PAM):
$$\mu_k = \arg\min_{x_i \in C_k} \sum_{x_j \in C_k} d(x_i, x_j)$$
K-Medians:
Uses median instead of mean; robust to outliers. Uses $L_1$ (Manhattan) distance.
Bisecting K-Means:
Hierarchical: start with 1 cluster, repeatedly split the cluster with highest SSE using 2-means.

K-Medoids uses actual data points as centers, making it more robust to outliers than standard K-Means.
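Mini-Batch K-Means is a drop-in replacement in scikit-learn. A sketch comparing it against the full algorithm on synthetic data (batch size and blob parameters are illustrative):

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=5, random_state=1)

full = KMeans(n_clusters=5, n_init=10, random_state=1).fit(X)
mini = MiniBatchKMeans(n_clusters=5, batch_size=256, n_init=10,
                       random_state=1).fit(X)

# Mini-batch trades a small amount of inertia for a large speedup
print(f"full: {full.inertia_:.0f}  mini-batch: {mini.inertia_:.0f}")
```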

Assumptions & Limitations

Assumptions:

  • Clusters are spherical (isotropic) and roughly equal in size
  • Each cluster has similar variance (equal spread)
  • Number of clusters $K$ is known a priori
  • Data points belong to exactly one cluster (hard assignment)

Limitations:

  • Sensitive to initialization - can converge to local minima
  • Cannot detect non-convex or irregularly shaped clusters
  • Sensitive to outliers (they pull centroids away)
  • Assumes clusters of similar density and size
  • Does not work well with high-dimensional sparse data

vs Other Clustering

| Feature | K-Means | DBSCAN | Hierarchical | GMM |
| --- | --- | --- | --- | --- |
| Cluster shape | Spherical | Arbitrary | Arbitrary | Elliptical |
| Requires $K$ | Yes | No | Optional | Yes |
| Handles outliers | Poor | Good (noise label) | Poor | Moderate |
| Assignment | Hard | Hard + noise | Hard | Soft (probabilistic) |
| Scalability | $O(nKd)$ | $O(n \log n)$ | $O(n^2 \log n)$ | $O(nKd^2)$ |

Common Mistakes

  • Not scaling features: K-Means uses distance, so features on different scales will dominate. Always standardize first: $z = \frac{x - \mu}{\sigma}$
  • Wrong K: Blindly picking $K$ without elbow/silhouette analysis leads to poor clusters
  • Single run: K-Means is non-deterministic; always run multiple times (e.g., n_init=10) and pick the best $J$
  • Ignoring convergence: Not checking if the algorithm actually converged before using results
  • Using on categorical data: K-Means requires numerical features; use K-Modes for categorical data
  • Assuming equal importance: Not weighting features appropriately before clustering
  • Ignoring outliers: Outliers heavily distort centroids; consider removing or using K-Medoids
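The first and third mistakes above are the easiest to fix in code. A sketch of the standard preprocessing pattern, on synthetic data where one feature's scale would otherwise swamp the other:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 0 spans [0, 1]; feature 1 spans [0, 10000] and would
# dominate raw Euclidean distances without scaling
X = np.c_[rng.random(300), rng.random(300) * 10_000]

X_scaled = StandardScaler().fit_transform(X)   # z = (x - mu) / sigma
# n_init=10 reruns the algorithm and keeps the result with the best J
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
```

After standardization, both features have zero mean and unit variance, so each contributes equally to the distance computation.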

Interview Quick-Fire

Q: What is the objective function of K-Means?

A: Minimize the Within-Cluster Sum of Squares (WCSS): $J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2$. It measures total squared distance from each point to its assigned centroid.

Q: Is K-Means guaranteed to converge?

A: Yes. Each assignment and update step can only decrease (or leave unchanged) $J$, and there are only finitely many possible partitions, so K-Means converges in a finite number of iterations. However, it converges to a local minimum, not necessarily the global optimum.

Q: Why is feature scaling important for K-Means?

A: K-Means uses Euclidean distance. Features on larger scales will dominate the distance calculation, biasing cluster assignments. Standardization ensures all features contribute equally.

Q: How does K-Means++ improve over random initialization?

A: K-Means++ spreads initial centroids apart by choosing each subsequent centroid with probability proportional to $D(x)^2$, reducing the chance of poor local minima and improving convergence speed.

Q: What is the difference between K-Means and GMM?

A: K-Means performs hard assignment (each point belongs to one cluster), while GMM provides soft (probabilistic) assignment. GMM can model elliptical clusters; K-Means assumes spherical clusters.

Q: When would you choose DBSCAN over K-Means?

A: When clusters have arbitrary shapes, you don't know $K$ in advance, or your data contains noise/outliers. DBSCAN automatically detects the number of clusters and labels outliers as noise.

Q: How do you handle categorical features with K-Means?

A: K-Means cannot directly handle categorical data. Use K-Modes (uses mode instead of mean) for categorical, or K-Prototypes for mixed data. One-hot encoding is an alternative but can be problematic.

Q: What happens if you set K equal to the number of data points?

A: Each data point becomes its own cluster with inertia = 0. This is trivially optimal for $J$ but provides no useful grouping. The elbow method helps find the balance between low $J$ and meaningful $K$.
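This degenerate case is easy to verify: with as many clusters as distinct points, every point becomes its own centroid and the inertia collapses to zero.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.arange(10, dtype=float).reshape(-1, 1)   # 10 distinct points
km = KMeans(n_clusters=10, n_init=1, random_state=0).fit(X)
print(km.inertia_)  # 0.0 -- each point sits exactly on its own centroid
```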
