K-Means Clustering
Interview Questions
15 commonly asked interview questions with detailed answers.
Q1. What is K-Means clustering and how does it work?
K-Means is an unsupervised learning algorithm that partitions a dataset into K distinct, non-overlapping clusters. Each data point belongs to the cluster whose centroid (mean) is closest to it. The algorithm aims to minimize the within-cluster sum of squares (WCSS), also known as inertia: J = sum over k of sum over all x_i in cluster k of ||x_i - mu_k||^2.
The algorithm works in an iterative two-step process. First, in the assignment step, each data point is assigned to the nearest centroid based on Euclidean distance. Second, in the update step, each centroid is recalculated as the mean of all data points currently assigned to that cluster. These two steps repeat until the centroids no longer change significantly or a maximum number of iterations is reached. The result is K clusters that tend to be compact and roughly spherical.
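The two-step loop above can be sketched from scratch in a few lines of NumPy. This is a minimal illustration of Lloyd's algorithm with random initialization, not a production implementation (use sklearn.cluster.KMeans in practice); the toy data with two well-separated blobs is invented for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Random init: pick k distinct data points as starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated blobs around (0, 0) and (10, 10)
X = np.vstack([np.zeros((20, 2)), np.full((20, 2), 10.0)]) + \
    np.random.default_rng(1).normal(scale=0.5, size=(40, 2))
labels, centroids = kmeans(X, k=2)
```

With data this well separated, the loop typically converges in a handful of iterations and assigns each blob its own label.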
Q2. What is the difference between supervised and unsupervised learning?
Supervised learning trains models on labeled data where each input has a corresponding target output. The goal is to learn a mapping from inputs to outputs so the model can predict labels for new, unseen data. Examples include classification (predicting categories) and regression (predicting continuous values). Algorithms like logistic regression, decision trees, and neural networks fall into this category.
Unsupervised learning, on the other hand, works with unlabeled data and seeks to discover hidden structure or patterns without any guidance from target labels. K-Means clustering is a classic unsupervised algorithm -- it groups similar data points together based solely on feature similarity without ever being told what the groups should be. Other unsupervised tasks include dimensionality reduction (PCA), anomaly detection, and association rule mining.
Q3. How do you choose the number of clusters K?
Choosing K is one of the most important and challenging decisions in K-Means clustering because the algorithm requires K to be specified in advance. There is no single definitive method, so practitioners typically use a combination of quantitative techniques and domain knowledge. The most common approaches are the elbow method, silhouette score analysis, and the gap statistic.
The elbow method plots the within-cluster sum of squares (WCSS) against different values of K and looks for a bend or elbow where adding more clusters yields diminishing returns. The silhouette score measures how similar each point is to its own cluster versus neighboring clusters, with values ranging from -1 to 1 where higher is better. Domain knowledge also plays a crucial role -- for example, if you are segmenting customers into marketing tiers, the business context might dictate three to five meaningful groups.
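One way to operationalize silhouette-based selection is to fit K-Means for a range of K values and keep the K with the highest mean silhouette score. A hedged sketch using scikit-learn on synthetic blob data (the silhouette score is undefined for K = 1, so the scan starts at 2):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 4 planted clusters, purely for illustration
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points

best_k = max(scores, key=scores.get)  # K with the best-defined clusters
```

In practice this quantitative pick should still be sanity-checked against domain knowledge, as the answer above notes.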
Q4. Explain the elbow method.
The elbow method is a heuristic technique for selecting the optimal number of clusters K. You run K-Means for a range of K values (for example, K = 1 to 10) and compute the within-cluster sum of squares (WCSS or inertia) for each. Then you plot K on the x-axis and WCSS on the y-axis. As K increases, WCSS always decreases because more clusters means each point is closer to some centroid.
The key insight is to look for the point where the rate of decrease sharply changes -- this forms an elbow shape in the curve. Before the elbow, adding clusters significantly reduces WCSS. After the elbow, additional clusters provide only marginal improvement, suggesting the data does not naturally support more groups. The K value at the elbow is typically chosen as the optimal number of clusters. However, in practice the elbow is not always clearly defined, which is why this method is often supplemented with other techniques like silhouette analysis.
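The WCSS-versus-K curve described above can be computed with scikit-learn's `inertia_` attribute; plotting is omitted so the sketch stays dependency-light, and the blob dataset is invented for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # WCSS for this K

# WCSS is non-increasing in K; the "elbow" is where the drop flattens out.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
```

Plotting `range(1, 11)` against `inertias` would show the characteristic elbow near the planted number of clusters.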
Q5. What are some real-world applications of K-Means?
Customer segmentation is one of the most widespread applications of K-Means. Businesses cluster customers by purchasing behavior, demographics, or engagement patterns to create targeted marketing campaigns. For example, an e-commerce company might identify high-value loyal customers, bargain hunters, and occasional browsers as distinct segments, allowing them to tailor promotions for each group.
Image compression uses K-Means to reduce the number of distinct colors in an image by clustering pixel colors and replacing each pixel with its nearest centroid color. Document clustering groups similar articles, emails, or web pages for topic discovery and information retrieval. Anomaly detection identifies outliers as points far from any centroid. Other applications include geographic analysis for placing facilities optimally, gene expression analysis in bioinformatics, and network traffic analysis for detecting intrusion patterns.
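The image-compression application can be sketched directly: cluster the pixel colors and replace each pixel with its centroid color. A random array stands in for a real image here; in practice you would load one from disk.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3)).astype(float)  # stand-in RGB image
pixels = image.reshape(-1, 3)                                 # one row per pixel

# Quantize to 8 representative colors
km = KMeans(n_clusters=8, n_init=4, random_state=0).fit(pixels)
compressed = km.cluster_centers_[km.labels_].reshape(image.shape)

# The compressed image now uses at most 8 distinct colors
n_colors = len(np.unique(compressed.reshape(-1, 3), axis=0))
```

Storing 8 centroid colors plus a 3-bit label per pixel is what yields the compression in a real pipeline.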
Q6. What is K-Means++ and why is it used?
Standard K-Means initializes centroids randomly from the dataset, which can lead to poor convergence if centroids start close together or in suboptimal regions. K-Means++, introduced by Arthur and Vassilvitskii in 2007, provides a smarter initialization that spreads the initial centroids apart. The first centroid is chosen uniformly at random from the data points. Each subsequent centroid is chosen from the remaining points with probability proportional to D(x)^2, the squared distance from each point to its nearest already-chosen centroid.
This distance-weighted probability ensures that points far from existing centroids are more likely to be selected, producing a well-spread initial configuration. K-Means++ guarantees an expected approximation ratio of O(log K) to the optimal clustering, a significant theoretical improvement over random initialization which has no such guarantee. In practice, K-Means++ leads to faster convergence, fewer iterations, and more consistent results across multiple runs. It is the default initialization in scikit-learn and most modern implementations.
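The D(x)^2 seeding rule described above can be sketched from scratch; this is an illustrative implementation, not scikit-learn's (where `init="k-means++"` is the default):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()              # D(x)^2 weighting
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Toy data: two tight groups at (0, 0) and (5, 5)
X = np.vstack([np.zeros((50, 2)), np.full((50, 2), 5.0)])
init = kmeans_pp_init(X, k=2)
```

On this data the second centroid lands in the opposite group almost surely, since points coinciding with the first centroid have D(x)^2 = 0.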
Q7. What are the limitations of K-Means?
K-Means assumes clusters are spherical and roughly equal in size because it uses Euclidean distance and assigns each point to the single nearest centroid. It cannot naturally discover elongated, irregular, or non-convex cluster shapes like crescents or rings. The algorithm also requires specifying K in advance, which is a non-trivial decision when the true number of clusters is unknown. Additionally, K-Means is sensitive to initialization -- random starting centroids can lead to different local optima across runs.
K-Means is sensitive to outliers because extreme values pull centroids away from the true cluster centers, distorting the entire partition. It also treats all features equally, so features with larger scales dominate the distance calculation unless proper normalization is applied beforehand. The algorithm does not handle categorical data directly since it requires computing means, and it scales poorly to very high-dimensional data due to the curse of dimensionality where Euclidean distances become less meaningful.
Q8. How does K-Means differ from hierarchical clustering?
K-Means is a partitional algorithm that produces exactly K flat clusters through iterative refinement, while hierarchical clustering builds a nested tree of clusters called a dendrogram. In agglomerative hierarchical clustering, each point starts as its own cluster, and the two closest clusters are repeatedly merged until only one remains. You then cut the dendrogram at a chosen level to obtain any number of clusters without rerunning the algorithm.
K-Means has O(n * K * d * i) time complexity where i is the number of iterations, making it efficient for large datasets. Hierarchical clustering has O(n^2 * log n) or O(n^3) complexity depending on the linkage method, limiting it to smaller datasets. However, hierarchical clustering does not require specifying K in advance, reveals the nested cluster structure through the dendrogram, and can capture non-spherical shapes depending on the linkage criterion (single linkage can detect elongated clusters). K-Means is preferred for large-scale applications where speed matters, while hierarchical clustering is preferred for exploratory analysis where understanding the full cluster hierarchy is valuable.
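The contrast can be shown on one dataset: K-Means needs K up front, while a single agglomerative run can be cut at different levels afterward. A sketch using scipy's `linkage`/`fcluster` for the tree side (Ward linkage is one choice among several):

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=60, centers=3, random_state=0)

# Partitional: K is fixed before fitting
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: build the full dendrogram once, then cut it at any level
Z = linkage(X, method="ward")
labels_3 = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
labels_5 = fcluster(Z, t=5, criterion="maxclust")  # re-cut into 5, no refit
```

Changing K for K-Means would require a full refit, whereas the dendrogram cut is essentially free once `Z` exists.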
Q9. How do you evaluate the quality of a clustering?
Clustering evaluation metrics fall into two categories: internal metrics that use only the data and cluster assignments, and external metrics that compare against known ground-truth labels. The silhouette score is the most popular internal metric -- for each point, it computes (b - a) / max(a, b), where a is the mean distance to other points in the same cluster and b is the mean distance to points in the nearest neighboring cluster. Values range from -1 to 1, with higher values indicating better-defined clusters.
Other important internal metrics include the Davies-Bouldin Index (lower is better), which measures the average similarity ratio between each cluster and its most similar cluster, and the Calinski-Harabasz Index (higher is better), which is the ratio of between-cluster dispersion to within-cluster dispersion. When ground-truth labels are available, external metrics like the Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and V-measure quantify how well the clustering agrees with the true labels. Inertia (WCSS) itself is also commonly reported, though it always improves with more clusters.
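All four kinds of metric mentioned above are available in scikit-learn; a hedged sketch on a labeled toy dataset (labels exist here only so the external metric can be demonstrated):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,      # external: agreement with ground truth
    calinski_harabasz_score,  # internal: higher is better
    davies_bouldin_score,     # internal: lower is better
    silhouette_score,         # internal: in [-1, 1], higher is better
)

X, y_true = make_blobs(n_samples=300, centers=4, random_state=7)
labels = KMeans(n_clusters=4, n_init=10, random_state=7).fit_predict(X)

sil = silhouette_score(X, labels)
dbi = davies_bouldin_score(X, labels)
chi = calinski_harabasz_score(X, labels)
ari = adjusted_rand_score(y_true, labels)
```

In a real unlabeled problem only the internal metrics are available; ARI and its relatives require the ground-truth `y_true`.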
Q10. Why is feature scaling important for K-Means?
K-Means relies on Euclidean distance to assign points to centroids, so features with larger numerical ranges dominate the distance calculation and effectively receive more weight in determining cluster assignments. For example, if one feature ranges from 0 to 100,000 (like income) and another ranges from 0 to 5 (like a rating), the income feature will almost entirely determine the clustering while the rating feature is effectively ignored.
Feature scaling ensures all features contribute equally to the distance computation. StandardScaler (z-score normalization) transforms each feature to have zero mean and unit variance, which is the most common choice for K-Means. MinMaxScaler scales features to a fixed range like [0, 1], which preserves the shape of the original distribution. In some cases, domain knowledge might suggest that certain features should naturally carry more weight, in which case deliberate feature weighting rather than uniform scaling is appropriate. Always scale before clustering, not after.
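A sketch of the income-versus-rating example above, scaling with StandardScaler before clustering (the synthetic features are invented to mirror the ranges in the text):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
income = rng.uniform(0, 100_000, size=200)  # range ~1e5: would dominate raw distances
rating = rng.uniform(0, 5, size=200)        # range ~5: would be effectively ignored
X = np.column_stack([income, rating])

# Scale first: each feature gets zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```

After scaling, both columns contribute comparably to every Euclidean distance, so neither dominates the assignment step.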
Q11. Why is K-Means guaranteed to converge?
K-Means minimizes the objective function J = sum over k of sum over all x_i in cluster k of ||x_i - mu_k||^2. Convergence is proven by showing that both the assignment step and the update step can only decrease (or leave unchanged) this objective, and since J is bounded below by zero, the algorithm must converge. In the assignment step, each point is reassigned to its nearest centroid. If a point moves from cluster a to cluster b, it means ||x_i - mu_b||^2 < ||x_i - mu_a||^2, so J decreases. If no point changes assignment, J stays the same.
In the update step, the centroid mu_k is recalculated as the mean of all points assigned to cluster k. The mean minimizes the sum of squared distances to a set of points -- this is a well-known property from calculus (setting the derivative to zero yields the mean). Therefore, updating the centroid to the mean guarantees that J does not increase. Since each step monotonically decreases a non-negative objective over a finite number of possible partitions (there are at most K^n distinct assignments for n data points), the algorithm must terminate in a finite number of steps. However, it is only guaranteed to converge to a local minimum, not the global optimum.
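The monotonicity argument above can be checked numerically: run the two steps by hand and record J after every assignment step; the sequence should never increase. A small sketch (the empty-cluster guard keeps the toy loop well-defined; the random data is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
centroids = X[rng.choice(100, size=3, replace=False)]  # init at data points

history = []
for _ in range(15):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)                          # assignment step
    history.append(float((d.min(axis=1) ** 2).sum()))  # J after assignment
    centroids = np.array([                             # update step: cluster means
        X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(3)
    ])
```

Each recorded value of J is at most the previous one, matching the proof: the update step cannot increase J, and the next assignment step can only lower it further.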
Q12. How does K-Means relate to Gaussian Mixture Models?
K-Means performs hard assignment -- each point belongs to exactly one cluster. Gaussian Mixture Models (GMMs) perform soft assignment, computing the probability that each point belongs to each cluster. GMMs model the data as a mixture of K Gaussian distributions, each with its own mean, covariance matrix, and mixing weight. The Expectation-Maximization (EM) algorithm alternates between computing posterior responsibilities (E-step) and updating the Gaussian parameters (M-step), analogous to K-Means' assignment and update steps.
In fact, K-Means can be derived as a limiting case of GMMs in which every component shares the same isotropic covariance (the identity matrix scaled by a common variance) and that variance approaches zero. In this limit the soft responsibilities collapse to hard assignments and EM reduces to the K-Means algorithm. GMMs are more flexible because they can model elliptical clusters of varying sizes and orientations through their full covariance matrices, while K-Means is restricted to spherical clusters of similar size. However, GMMs are computationally more expensive, have more parameters to estimate (means, covariances, and mixing weights), and are more prone to overfitting with limited data.
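The hard-versus-soft distinction is visible directly in the scikit-learn APIs: `KMeans` returns one label per point, while `GaussianMixture.predict_proba` returns per-component posterior responsibilities. A hedged sketch on toy data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=200, centers=2, random_state=1)

# Hard assignment: exactly one cluster index per point
hard = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)

# Soft assignment: a probability for each component, rows summing to 1
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      random_state=1).fit(X)
resp = gmm.predict_proba(X)
```

Points deep inside a cluster get responsibilities near (1, 0); points near the boundary get intermediate probabilities, information the hard labels discard.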
Q13. What would you use when clusters are non-spherical?
When data has non-spherical cluster shapes such as crescents, rings, or elongated structures, standard K-Means fails because it relies on Euclidean distance, which produces Voronoi partitions -- always convex regions. The most direct alternative is DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which defines clusters as dense regions separated by areas of low density. DBSCAN can discover clusters of arbitrary shape and automatically identifies noise points, requiring no specification of K.
Spectral clustering is another powerful approach: it constructs a similarity graph from the data, computes the eigenvectors of the graph Laplacian matrix, and then applies K-Means in the reduced eigenspace where non-convex clusters become linearly separable. The kernel K-Means approach implicitly maps data to a higher-dimensional feature space using a kernel function (like RBF) where the clusters become spherical, then runs standard K-Means. GMMs with full covariance matrices can handle elliptical shapes. HDBSCAN extends DBSCAN to handle varying density clusters. The choice depends on the specific data geometry, noise levels, and computational constraints.
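The failure mode and the fix can both be seen on the classic two-moons dataset: K-Means splits the crescents with a straight Voronoi boundary, while DBSCAN recovers them from density alone. A sketch (the `eps` value is tuned by eye for this data scale, an assumption worth revisiting on other data):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # -1 marks noise points

ari_km = adjusted_rand_score(y, km)  # poor: convex boundary cuts both moons
ari_db = adjusted_rand_score(y, db)  # high: density tracks each crescent
```

Spectral clustering (`sklearn.cluster.SpectralClustering`) would also recover the moons here, via the eigenspace trick described above.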
Q14. What is the gap statistic?
The gap statistic, introduced by Tibshirani, Walther, and Hastie in 2001, formalizes the intuition behind the elbow method by comparing the observed within-cluster dispersion to what would be expected under a null reference distribution with no clustering structure. For each candidate K, it computes Gap(K) = E*[log(W_K)] - log(W_K), where W_K is the within-cluster sum of squares for the observed data and E*[log(W_K)] is the expected log(W_K) under the reference distribution, typically generated by sampling uniformly within the bounding box of the data.
The reference distribution is generated by creating B reference samples (commonly B = 50 to 500) of uniformly distributed data with the same feature ranges. Each sample is clustered for each K, and the average log(W_K) provides the expected baseline. The optimal K is the smallest K such that Gap(K) >= Gap(K+1) - s_{K+1}, where s_{K+1} is the standard error of the reference estimates: the standard deviation of the B reference log(W_{K+1}) values scaled by sqrt(1 + 1/B). This criterion selects the most parsimonious model that explains the clustering structure significantly better than random. The gap statistic is more principled than the elbow method and can correctly identify when K = 1 (no meaningful clusters exist).
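A compact sketch of the computation defined above, assuming a uniform-in-bounding-box reference; B is kept small here purely for speed, and the blob data is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def gap_statistic(X, k, B=10, seed=0):
    # Observed log(W_K): inertia of K-Means on the real data
    log_wk = np.log(KMeans(n_clusters=k, n_init=5, random_state=seed)
                    .fit(X).inertia_)
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    ref = []
    for _ in range(B):
        Xb = rng.uniform(lo, hi, size=X.shape)  # uniform reference sample
        ref.append(np.log(KMeans(n_clusters=k, n_init=5, random_state=seed)
                          .fit(Xb).inertia_))
    gap = np.mean(ref) - log_wk            # E*[log(W_K)] - log(W_K)
    s_k = np.std(ref) * np.sqrt(1 + 1 / B)  # standard error of the reference
    return gap, s_k

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
gap3, s3 = gap_statistic(X, k=3)
```

On clustered data the observed W_K sits far below the uniform baseline, so Gap(K) is positive at the true K; applying the Gap(K) >= Gap(K+1) - s_{K+1} rule just means running this over a range of K.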
Q15. How would you design a customer segmentation system using K-Means?
Start with feature engineering: collect behavioral features (purchase frequency, average order value, recency of last purchase, session duration), demographic features (age, location, account tenure), and engagement features (email open rates, app usage, support tickets). Apply RFM analysis (Recency, Frequency, Monetary value) as a proven feature framework for customer segmentation. Handle missing values through imputation, remove or cap outliers using IQR-based methods, and apply StandardScaler to normalize all features to zero mean and unit variance.
Use PCA or UMAP for dimensionality reduction if the feature space is large, retaining enough components to explain 90-95% of variance. Run K-Means++ with multiple K values, evaluate using silhouette scores and the gap statistic, and validate with business stakeholders to ensure segments are actionable. Implement the pipeline: batch scoring for the full customer base nightly, real-time scoring for new customers via a low-latency API serving precomputed centroids. Monitor cluster drift over time by tracking centroid movements and segment size distributions. Build a dashboard showing segment profiles with key metrics, and feed segments into downstream systems for personalized marketing, dynamic pricing, and churn prediction models.
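The scale-then-cluster core of such a system can be packaged as a scikit-learn Pipeline, so batch and real-time scoring share one fitted object. A sketch with invented RFM-style features (the names, distributions, and K = 4 are illustrative stand-ins, not a real schema):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical RFM features: recency (days), frequency (orders), monetary value
X = np.column_stack([
    rng.uniform(0, 365, 500),
    rng.poisson(5, 500).astype(float),
    rng.exponential(100, 500),
])

pipe = Pipeline([
    ("scale", StandardScaler()),              # fit once on the training batch
    ("cluster", KMeans(n_clusters=4, n_init=10, random_state=0)),
])
segments = pipe.fit_predict(X)                # nightly batch scoring

# Real-time scoring of a new customer reuses the fitted scaler and centroids
new_customer = np.array([[30.0, 8.0, 250.0]])
segment = int(pipe.predict(new_customer)[0])
```

Persisting `pipe` (e.g. with joblib) is one way to serve the precomputed centroids behind the low-latency API described above; drift monitoring would then compare new centroid fits against the stored ones.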