KNN
Interview Questions
15 commonly asked interview questions with detailed answers.
K-Nearest Neighbors (KNN) is a non-parametric, instance-based (lazy) learning algorithm used for both classification and regression. It works by storing all training examples and making predictions based on the K closest data points to a new query point. The algorithm does not learn an explicit model during training; instead, all computation is deferred to prediction time.
For classification, the algorithm finds the K nearest neighbors of the query point using a distance metric (typically Euclidean distance), then assigns the class that appears most frequently among those K neighbors (majority vote). For regression, it returns the average (or weighted average) of the target values of the K nearest neighbors. The simplicity of this approach makes KNN one of the most intuitive algorithms in machine learning.
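The mechanics above can be sketched in a few lines of NumPy; the toy data and the `knn_predict` helper here are illustrative, not from any library:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest neighbors."""
    # Euclidean distance from the query to every stored training point
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote over the neighbors' labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["A", "A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))  # A
```

Note that "training" here is nothing more than storing `X_train` and `y_train`; all the work happens inside `knn_predict`, which is exactly what "lazy learning" means.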
In KNN classification, the algorithm identifies the K nearest neighbors of a query point and assigns the class label that has the highest frequency among those neighbors. This is known as majority voting. For instance, if K=5 and three of the five nearest neighbors belong to class A while two belong to class B, the query point is classified as class A. Ties can be broken randomly or by considering the distances to neighbors.
In KNN regression, instead of voting on a class label, the algorithm computes the mean (or sometimes the median) of the target values of the K nearest neighbors. For example, if the five nearest neighbors have target values of 10, 12, 11, 13, and 14, the predicted value would be 12.0 (the average). Weighted variants assign higher influence to closer neighbors in both classification and regression settings.
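The regression example above (neighbors with targets 10, 12, 11, 13, 14 averaging to 12.0) can be reproduced with a small sketch; the `knn_regress` helper is illustrative:

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5, weighted=False):
    """Predict by averaging the targets of the k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    if weighted:
        # Inverse-distance weights: closer neighbors count more
        w = 1.0 / (dists[nearest] + 1e-8)
        return np.average(y_train[nearest], weights=w)
    return y_train[nearest].mean()

X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([10.0, 12.0, 11.0, 13.0, 14.0])
print(knn_regress(X_train, y_train, np.array([3.0])))                 # 12.0
print(knn_regress(X_train, y_train, np.array([3.0]), weighted=True))  # ~11.0
```

The weighted variant pulls the prediction toward 11.0 because the query sits exactly on the training point with target 11, which receives an enormous inverse-distance weight.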
Feature scaling is critical for KNN because the algorithm relies entirely on distance calculations to determine which training points are closest to a query. When features have vastly different scales -- for example, one feature ranges from 0 to 1 and another ranges from 0 to 100,000 -- the feature with the larger range will dominate the distance computation. The smaller-scale feature would have virtually no influence on which neighbors are selected, regardless of its actual predictive importance.
Common scaling techniques include Min-Max normalization (scaling all features to the range [0, 1]) and Z-score standardization (subtracting the mean and dividing by the standard deviation so each feature has zero mean and unit variance). After scaling, all features contribute proportionally to the distance calculation. Failing to scale features is one of the most common mistakes when applying KNN and can lead to severely degraded performance.
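A quick numeric illustration of why scaling matters, using made-up features (a rate in [0, 1] and a salary):

```python
import numpy as np

# Two features on wildly different scales: a rate in [0, 1] and a salary
X = np.array([[0.20, 30000.0],
              [0.80, 31000.0],
              [0.25, 90000.0]])

# Unscaled: the salary column dominates every distance computation
d_unscaled = np.linalg.norm(X[0] - X[1])  # ~1000, driven entirely by salary

# Z-score standardization: zero mean, unit variance per feature
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
d_scaled = np.linalg.norm(X_std[0] - X_std[1])  # both features now contribute
print(round(d_unscaled, 2), round(d_scaled, 2))
```

Before scaling, the 0.6 difference in the rate feature is invisible next to the 1,000-unit salary difference; after standardization both features contribute on comparable terms.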
When K=1, the algorithm assigns each query point the label of its single nearest neighbor. This produces highly flexible and complex decision boundaries that can perfectly fit the training data (zero training error). However, the model becomes extremely sensitive to noise and outliers -- a single mislabeled training point will directly cause misclassification of any query point closest to it. This leads to high variance and overfitting.
As K increases, the decision boundary becomes smoother and more stable because the prediction is based on a consensus of many neighbors. Very large values of K reduce variance but increase bias, because the prediction becomes more like a global average and loses sensitivity to local structure. In the extreme case where K equals the total number of training points, every query point receives the same prediction (the majority class for classification, or the global mean for regression). The optimal K balances the bias-variance tradeoff and is typically found through cross-validation.
Euclidean distance is the most commonly used metric and computes the straight-line distance between two points in the feature space. It is defined as the square root of the sum of squared differences across all dimensions. Manhattan distance (also called L1 or city-block distance) sums the absolute differences across dimensions and is more robust to outliers than Euclidean distance since it does not square the differences.
Minkowski distance is a generalization that includes both Euclidean (p=2) and Manhattan (p=1) as special cases by using a parameter p. For high-dimensional or sparse data such as text, cosine similarity (or cosine distance) is often preferred because it measures the angle between vectors rather than their magnitude. Hamming distance is used for categorical or binary features and counts the number of positions where the values differ. The choice of distance metric should reflect the nature of the data and the problem at hand.
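These metrics are a few lines each; the `minkowski` helper below is illustrative:

```python
import numpy as np

def minkowski(a, b, p):
    """Minkowski distance: p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(a, b, p=1))  # 7.0  (Manhattan: |3| + |4|)
print(minkowski(a, b, p=2))  # 5.0  (Euclidean: sqrt(9 + 16))

# Hamming distance on categorical vectors: count of mismatched positions
u = np.array(["red", "small", "round"])
v = np.array(["blue", "small", "round"])
print(np.sum(u != v))  # 1
```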
The curse of dimensionality refers to the phenomenon where the performance of distance-based algorithms like KNN degrades as the number of features increases. In high-dimensional spaces, data points become increasingly equidistant from one another. The ratio between the nearest and farthest neighbor distances approaches 1, which means the concept of "nearest neighbor" loses its meaning because all points appear almost equally far from the query.
This happens because the volume of the feature space grows exponentially with the number of dimensions, and the fixed number of training points becomes increasingly sparse. To maintain the same density of data points, the required training set size grows exponentially with dimensionality. In practice, this means KNN requires dimensionality reduction techniques (PCA, feature selection, or autoencoders) to work effectively in high-dimensional settings. Without reducing dimensions, the distance calculations become unreliable and the algorithm's accuracy drops significantly.
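The distance-concentration effect is easy to demonstrate empirically; this sketch compares the nearest-to-farthest distance ratio for random uniform data in 2 versus 1,000 dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(d, n=500):
    """Ratio of nearest to farthest distance from a random query point
    to n uniform points in [0, 1]^d; it approaches 1 as d grows."""
    X = rng.random((n, d))
    q = rng.random(d)
    dists = np.linalg.norm(X - q, axis=1)
    return dists.min() / dists.max()

low_d = nearest_to_farthest_ratio(2)
high_d = nearest_to_farthest_ratio(1000)
print(round(low_d, 3), round(high_d, 3))  # small in 2-D, near 1 in 1000-D
```

In 2 dimensions the nearest neighbor is dramatically closer than the farthest point; in 1,000 dimensions nearly all points sit at almost the same distance from the query, which is precisely why "nearest" stops being informative.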
The most reliable method for choosing K is cross-validation. You evaluate the model's performance across a range of K values (e.g., K=1 to K=30 or more) using k-fold cross-validation, then select the K that yields the best validation metric (accuracy, F1-score, or RMSE depending on the task). Plotting validation error versus K typically shows a U-shaped curve: error is high for very small K (overfitting), decreases as K increases, reaches a minimum, and then increases again for very large K (underfitting).
Some practical guidelines also apply. For binary classification, choosing an odd K avoids ties in majority voting. A common heuristic starting point is K = sqrt(N), where N is the number of training samples, though this should always be validated empirically. The optimal K depends on the dataset's noise level, class distribution, and feature characteristics, so there is no one-size-fits-all answer. Grid search or randomized search combined with cross-validation is the standard approach in practice.
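The standard grid-search-plus-cross-validation workflow looks like this in scikit-learn (Iris is used here purely as a convenient illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale inside the pipeline so each CV fold is scaled on its own training split
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Search odd K values (avoids voting ties); 5-fold CV picks the best
grid = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 31, 2))}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

Putting the scaler inside the pipeline is not just tidiness: it prevents information from the validation fold leaking into the scaling statistics.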
KNN is a non-parametric model, meaning it does not assume any fixed functional form for the decision boundary. It can capture arbitrarily complex, non-linear relationships in the data because its decision boundary is shaped entirely by the local distribution of training points. Logistic regression, by contrast, is a parametric model that assumes a linear decision boundary (or log-linear relationship between features and log-odds). It learns a fixed set of weights during training and discards the training data afterward.
KNN's flexibility comes at a cost: it requires storing the entire training set, has O(N*d) prediction time (compared to O(d) for logistic regression), and is sensitive to irrelevant features and the curse of dimensionality. Logistic regression is faster at prediction, more interpretable (each weight indicates feature importance), and handles high-dimensional data more gracefully. Choose KNN when decision boundaries are highly non-linear and the dataset is moderate-sized with relevant features. Choose logistic regression when you need speed, interpretability, or when the relationship is approximately linear.
In standard KNN, all K neighbors contribute equally to the prediction regardless of their distance to the query point. Weighted KNN assigns a weight to each neighbor that is inversely proportional to its distance, so closer neighbors have a stronger influence on the prediction. The most common weighting scheme is w = 1/d or w = 1/d^2, where d is the distance from the query point to the neighbor. This way, a neighbor that is very close to the query has a much larger say than one that is far away.
Weighted KNN is particularly useful when the data has varying densities or when the decision boundary is irregular. It reduces the sensitivity to the choice of K because even with a larger K value, distant neighbors receive negligible weight and do not distort the prediction. It also makes the algorithm more robust to noise since noisy points that happen to be among the K nearest neighbors but are relatively far away will have minimal impact. In scikit-learn, this is activated by setting the weights parameter to "distance" instead of the default "uniform".
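A small contrived example shows the difference between the two `weights` settings; the data is constructed so that uniform and distance weighting disagree:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [0.1], [0.2], [3.0], [3.1]])
y = np.array([0, 0, 0, 1, 1])

# With k=5, uniform voting always sees three 0s vs two 1s -> predicts 0
uniform = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X, y)
# Distance weighting lets the two very close class-1 points outvote the
# three distant class-0 points for queries near x=3
weighted = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)

print(uniform.predict([[3.05]]))   # [0]
print(weighted.predict([[3.05]]))  # [1]
```

Near x = 3.05 the two class-1 neighbors are about 0.05 away (weight ≈ 20 each) while the class-0 points are about 3 away (weight ≈ 0.33 each), so distance weighting flips the prediction.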
The brute-force approach to KNN computes the distance from the query point to every training example, giving O(N*d) time per query. KD-Trees (K-Dimensional Trees) improve this by recursively partitioning the feature space along alternating axes, allowing the algorithm to prune large portions of the search space. For low-dimensional data (d < 20), KD-Trees achieve O(d * log N) average query time, which is a dramatic speedup over brute force.
Ball Trees organize data points into nested hyperspheres rather than axis-aligned partitions. They perform better than KD-Trees in moderate-to-high dimensions because they adapt to the data's intrinsic geometry rather than being constrained to axis-aligned splits. However, both KD-Trees and Ball Trees degrade toward brute-force performance in very high dimensions due to the curse of dimensionality. For extremely large or high-dimensional datasets, approximate methods using hashing-based structures (like Locality-Sensitive Hashing) are preferred, trading exact results for significantly faster queries.
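In scikit-learn the search structure is just the `algorithm` parameter; since brute force, KD-Tree, and Ball Tree are all exact methods, they return identical neighbors and differ only in query cost:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.random.default_rng(1).random((1000, 3))  # low-dimensional random data
query = np.array([[0.5, 0.5, 0.5]])

results = {}
for algo in ("brute", "kd_tree", "ball_tree"):
    nn = NearestNeighbors(n_neighbors=5, algorithm=algo).fit(X)
    # kneighbors returns (distances, indices); keep the neighbor indices
    results[algo] = nn.kneighbors(query)[1][0]
print(results["brute"])  # same indices for all three algorithms
```

The default `algorithm="auto"` setting lets scikit-learn pick a structure based on the data's size and dimensionality.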
KNN provides one of the clearest illustrations of the bias-variance tradeoff in machine learning. When K=1, the model achieves zero training error (it perfectly memorizes every training point) but has very high variance -- small changes in the training data or the presence of noise can drastically alter predictions. The decision boundary is jagged and overfits to the idiosyncrasies of the training set. As K increases, the model averages over more neighbors, which smooths the decision boundary and reduces variance, but introduces bias because the prediction becomes less sensitive to local structure.
Mathematically, for KNN regression, the expected prediction error decomposes as: Error = Bias^2 + Variance + Irreducible Noise. As K increases from 1, variance decreases monotonically (more averaging), while bias increases monotonically (less local sensitivity). The total error follows a U-shaped curve with the minimum at some intermediate K. This optimal K depends on the signal-to-noise ratio and the complexity of the true decision boundary. In high-noise settings, larger K values are better because smoothing reduces the impact of noise. In low-noise settings with complex boundaries, smaller K preserves the fine-grained structure of the data.
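The U-shaped error curve can be observed directly on synthetic data; this sketch fits KNN regression to a noisy sine wave (all data and constants here are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, (300, 1))
y = np.sin(X.ravel()) + rng.normal(0, 0.4, 300)  # noisy sine signal

errors = {}
for k in (1, 15, 150):
    mse = -cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y,
                           scoring="neg_mean_squared_error", cv=5).mean()
    errors[k] = mse
# Expected U-shape: k=1 overfits (high variance), k=150 underfits (high bias)
print({k: round(v, 3) for k, v in errors.items()})
```

An intermediate K should beat both extremes: K=1 chases the noise, while K=150 averages over most of the domain and flattens the sine structure away.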
Scaling KNN to very large datasets requires addressing both computational and memory bottlenecks. First, use spatial indexing structures like KD-Trees or Ball Trees to reduce per-query time from O(N*d) to approximately O(d * log N). For datasets too large to fit in memory, use disk-based or distributed KD-Tree implementations. Second, apply dimensionality reduction (PCA, random projections, or feature selection) to lower d, which reduces both distance computation cost and mitigates the curse of dimensionality.
Third, consider data reduction techniques such as prototype selection (condensed nearest neighbors) or prototype generation (k-means clustering to replace N points with a smaller set of representative centroids). This dramatically reduces N without losing much accuracy. Fourth, switch to approximate nearest neighbor (ANN) methods like Locality-Sensitive Hashing, ANNOY, or FAISS, which provide sub-linear query time with controlled accuracy loss. Finally, distributed computing frameworks (MapReduce, Spark) can partition the dataset across nodes, with each node computing local neighbors, followed by a merge step to find global K nearest neighbors.
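The prototype-generation idea can be sketched with k-means per class; the blob dataset and cluster counts below are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

# 5,000 points in 3 well-separated classes (a stand-in for a large dataset)
X, y = make_blobs(n_samples=5000, centers=[[0, 0], [8, 8], [-8, 8]],
                  cluster_std=1.5, random_state=0)

# Prototype generation: replace each class's points with 10 k-means centroids
protos, proto_labels = [], []
for c in np.unique(y):
    km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X[y == c])
    protos.append(km.cluster_centers_)
    proto_labels.append(np.full(10, c))
X_small, y_small = np.vstack(protos), np.concatenate(proto_labels)

# 3-NN over 30 prototypes instead of 5,000 points, with little accuracy lost
knn = KNeighborsClassifier(n_neighbors=3).fit(X_small, y_small)
print(X_small.shape[0], round(knn.score(X, y), 3))
```

Here N shrinks from 5,000 to 30, so every query is roughly 170x cheaper, yet accuracy on this well-separated data remains essentially unchanged.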
Locality-Sensitive Hashing (LSH) uses hash functions that are designed so that similar points are more likely to be hashed to the same bucket than dissimilar points. By hashing the query point and only comparing it to other points in the same (or nearby) buckets, LSH reduces the search space from the entire dataset to a small subset. Multiple hash tables with different hash functions are used to increase recall (the probability of finding the true nearest neighbors). LSH is particularly effective for high-dimensional data where exact methods like KD-Trees degrade to brute-force performance.
ANNOY (Approximate Nearest Neighbors Oh Yeah), developed at Spotify, builds a forest of random binary trees. Each tree recursively splits the space using random hyperplanes, creating a hierarchical partition. At query time, ANNOY traverses all trees to collect candidate neighbors from multiple leaf nodes, then performs exact distance computation only on these candidates. More trees yield higher accuracy but use more memory. FAISS (from Meta) is another popular library that combines inverted file indices with product quantization for billion-scale datasets. All ANN methods trade a small, controllable amount of accuracy for orders-of-magnitude speedup in query time.
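A minimal sketch of one LSH family, sign random projections for cosine similarity (this is the general idea, not the ANNOY or FAISS implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_hasher(dim, n_bits=16):
    """Sign-random-projection LSH: each random hyperplane contributes one
    bit recording which side of the plane a vector falls on, so vectors
    pointing in similar directions tend to share hash bits."""
    planes = rng.normal(size=(n_bits, dim))
    return lambda v: tuple((planes @ v > 0).astype(int))

h = make_hasher(dim=50)
v = rng.normal(size=50)
near = v + 0.01 * rng.normal(size=50)   # nearly identical direction
far = rng.normal(size=50)               # unrelated random direction

bits_shared_near = sum(a == b for a, b in zip(h(v), h(near)))
bits_shared_far = sum(a == b for a, b in zip(h(v), h(far)))
print(bits_shared_near, bits_shared_far)  # near vector shares far more bits
```

A real LSH index would bucket vectors by these bit signatures across multiple hash tables and only compute exact distances within matching buckets.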
Standard distance metrics like Euclidean distance are designed for numerical features and do not naturally handle categorical variables. The simplest approach is to one-hot encode categorical features, converting each category into a binary vector, and then apply standard distance metrics on the combined numerical and binary features. However, one-hot encoding can inflate dimensionality and distort distances, especially when categorical features have many unique values (high cardinality).
A more principled approach is to use the Gower distance, which computes a normalized distance for each feature based on its type -- absolute difference divided by range for numerical features, and a simple match/mismatch indicator (0 or 1) for categorical features -- then takes the weighted average across all features. This ensures that each feature contributes proportionally regardless of its type. For ordinal categorical features, you can map categories to integers reflecting their natural order and treat them as numerical. Alternatively, you can use heterogeneous distance functions like HEOM (Heterogeneous Euclidean-Overlap Metric) or HVDM (Heterogeneous Value Difference Metric), which apply different sub-metrics to different feature types within a single unified distance calculation.
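A simplified, unweighted Gower distance can be sketched as follows; the records and feature ranges are hypothetical:

```python
def gower_distance(a, b, num_idx, cat_idx, ranges):
    """Gower distance over mixed features: normalized absolute difference
    for numerical columns, 0/1 match-mismatch for categorical columns,
    averaged across all features."""
    parts = []
    for j, r in zip(num_idx, ranges):
        parts.append(abs(a[j] - b[j]) / r)       # numeric: |diff| / range
    for j in cat_idx:
        parts.append(0.0 if a[j] == b[j] else 1.0)  # categorical: mismatch
    return sum(parts) / len(parts)

# Hypothetical records: [age, income, city]
x1 = [30, 50000, "paris"]
x2 = [40, 60000, "paris"]
d = gower_distance(x1, x2, num_idx=[0, 1], cat_idx=[2], ranges=[50, 100000])
print(d)  # (10/50 + 10000/100000 + 0) / 3 = 0.1
```

Because each term lies in [0, 1], no single feature type can dominate the total, which is exactly the property that makes Gower distance suitable for mixed data.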
A KNN-based recommendation system can follow either user-based or item-based collaborative filtering. In user-based KNN, each user is represented as a vector of their ratings across all items. To recommend items to a target user, the algorithm finds the K users most similar to the target (using cosine similarity or Pearson correlation on rating vectors), then recommends items that these similar users rated highly but the target user has not yet consumed. The predicted rating for an unrated item is computed as a weighted average of the K neighbors' ratings for that item.
Item-based KNN reverses the perspective: each item is represented by the vector of ratings it received from all users. To predict whether a target user will like a specific item, the algorithm finds the K items most similar to the target item that the user has already rated, then predicts the rating as a weighted average of the user's ratings on those similar items. Item-based approaches tend to be more stable because item similarity changes less frequently than user similarity. For scalability, precompute and cache the K nearest neighbors for all items, use sparse matrix representations to handle the large user-item matrix efficiently, and apply approximate nearest neighbor methods when the item or user count reaches millions.
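Item-based prediction can be sketched on a toy rating matrix; the matrix, the `cosine` helper, and `predict_item_based` are all illustrative:

```python
import numpy as np

# Toy user-item rating matrix (rows = users, columns = items, 0 = unrated)
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def predict_item_based(R, user, item, k=2):
    """Predict R[user, item] as a similarity-weighted average of the user's
    ratings on the k items most similar to the target item."""
    rated = [j for j in range(R.shape[1]) if j != item and R[user, j] > 0]
    # Rank the user's rated items by column-wise cosine similarity to `item`
    top = sorted(((cosine(R[:, item], R[:, j]), j) for j in rated),
                 reverse=True)[:k]
    return sum(s * R[user, j] for s, j in top) / sum(s for s, _ in top)

pred = predict_item_based(R, user=0, item=2)
print(round(pred, 2))  # low prediction: user 0 disliked the most similar item
```

User 0 has not rated item 2; the item most similar to it (item 3, cosine ≈ 0.95) received a rating of 1 from that user, so the weighted average comes out low, which is the behavior we would want from the recommender.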