K-Nearest Neighbors (KNN)
From distance metrics to decision boundaries. A beginner-friendly, visually interactive deep dive into the most intuitive machine learning algorithm -- you are the average of your neighbors.
How a simple idea about proximity became one of the most enduring algorithms in machine learning.
The K-Nearest Neighbors algorithm traces its formal origins to a landmark 1967 paper by Thomas Cover and Peter Hart, titled "Nearest Neighbor Pattern Classification". In this paper, they proved a remarkable result: the error rate of the 1-nearest neighbor rule is bounded above by twice the Bayes error rate (the theoretical minimum error achievable by any classifier).
This means a classifier that simply copies the label of the closest training example is, in the worst case, only twice as bad as the theoretically optimal classifier. For many practical problems, the gap is much smaller.
While Cover and Hart gave KNN its theoretical foundation, the idea of classifying by proximity is far older. The concept draws from a simple human intuition: similar things are likely to belong to the same category. If an unknown fruit looks, smells, and tastes like an apple, it is probably an apple.
Before the term "machine learning" existed, statisticians were already exploring non-parametric methods -- techniques that make minimal assumptions about the underlying data distribution. KNN is the purest embodiment of this philosophy: it makes zero assumptions about data distribution, relying entirely on the structure of the training data itself.
"You are the average of your neighbors" -- the simplest yet most powerful idea in pattern recognition.
K-Nearest Neighbors is an instance-based learning algorithm, also called lazy learning. Unlike eager learners (logistic regression, decision trees, neural networks) that build an explicit model during training, KNN does nothing during training. It simply stores the entire training dataset in memory.
All the computation happens at prediction time. When a new data point arrives, KNN:
1. Measures the distance from the new point to every point in the training set using a distance metric (Euclidean, Manhattan, etc.).
2. Sorts all training points by distance and selects the $K$ closest ones.
3. For classification, takes a majority vote among the $K$ neighbors; for regression, takes the average (or weighted average) of their values.
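These steps can be sketched in a few lines of NumPy; a minimal illustration, not the scikit-learn implementation:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 2: indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among their labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two toy clusters: class 0 near the origin, class 1 near (5, 5)
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))  # → 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3))  # → 1
```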
The distinction between lazy and eager learning is fundamental to understanding KNN:
Lazy learning (KNN): No training phase. Stores all data. All work happens at prediction time. Adapts instantly to new data without retraining.
Eager learning: Expensive training phase to build a model. Fast prediction. Must retrain to incorporate new data points.
KNN trades fast training for slow prediction. Eager learners trade slow training for fast prediction. Choose based on your use case.
KNN is a non-parametric algorithm. It does not assume any specific form for the decision boundary -- no straight lines, no polynomial curves, no probability distributions. The decision boundary emerges organically from the spatial arrangement of training points.
This gives KNN extraordinary flexibility. It can learn arbitrarily complex decision boundaries, including non-convex regions, disconnected regions, and boundaries of any shape. The flip side is that it needs enough training data to define these boundaries accurately.
KNN is sometimes called a memory-based learner because it literally memorizes the entire training set. There are no learned parameters, no weights, no coefficients. The "model" is the data. This is both its greatest strength (maximum flexibility) and its greatest weakness (storage and computational cost scale with dataset size).
Click anywhere on the canvas to place a new query point (green star). See which K neighbors are closest and how the majority vote decides the class.
Adjust K and click to classify. Dashed lines connect to the K nearest neighbors.
The choice of distance function fundamentally shapes how KNN perceives similarity.
The most commonly used distance metric. It measures the straight-line distance between two points in Euclidean space -- the "as the crow flies" distance:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
For two points $\mathbf{x} = (x_1, x_2)$ and $\mathbf{y} = (y_1, y_2)$ in two dimensions, this is simply the Pythagorean theorem: $d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$.
Properties: Rotation-invariant, sensitive to feature scale, heavily influenced by large differences in any single dimension. Works best when all features are continuous and on a similar scale.
Also known as the taxicab distance or city block distance. It measures the sum of absolute differences along each dimension -- like navigating a grid of city blocks:

$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$$
Properties: Less sensitive to outliers than Euclidean distance because it does not square the differences. Often preferred when features represent fundamentally different quantities, or when the data lies on a grid-like structure.
Both Euclidean and Manhattan distances are special cases of the Minkowski distance, parameterized by $p$:

$$d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

Setting $p = 1$ recovers Manhattan distance; $p = 2$ recovers Euclidean distance.
The parameter $p$ controls how much emphasis is placed on large vs small differences. Higher $p$ values give more weight to the dimension with the largest discrepancy.
Unlike the previous metrics which measure magnitude, cosine similarity measures the angle between two vectors. It is especially useful for text data (TF-IDF vectors) where the direction matters more than the magnitude:

$$\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}$$
The cosine distance is defined as $d = 1 - \cos(\theta)$. Two vectors pointing in the same direction have cosine similarity of 1 (distance 0), while orthogonal vectors have similarity 0 (distance 1).
Use Euclidean distance when the magnitude of the features matters (e.g., height and weight). Use cosine similarity when only the direction matters (e.g., document vectors, where a longer document is not necessarily more similar). In NLP and recommendation systems, cosine similarity is almost always the better choice.
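The metrics above can be computed directly in NumPy (illustrative vectors):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(((a - b) ** 2).sum())           # straight-line: 5.0
manhattan = np.abs(a - b).sum()                     # city-block: 7.0
minkowski3 = (np.abs(a - b) ** 3).sum() ** (1 / 3)  # p = 3: ~4.498
cos_sim = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_dist = 1 - cos_sim                           # 0 means same direction

print(euclidean, manhattan, round(minkowski3, 3), round(cosine_dist, 3))
```

Note the ordering: for the same pair of points, Manhattan ≥ Euclidean ≥ Minkowski with $p = 3$, since higher $p$ leans more on the single largest coordinate gap.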
The chart below compares how different metrics measure distance between two points in a 2D space. Notice how Manhattan distance is always greater than or equal to Euclidean distance.
See how different distance metrics create different "neighborhoods". Euclidean creates circular boundaries while Manhattan creates diamond-shaped ones.
Toggle between metrics to see how the neighborhood shape changes around the center point.
The single most important hyperparameter in KNN -- how many neighbors should vote?
The value of $K$ controls the bias-variance trade-off in KNN: a small $K$ gives low bias but high variance (jagged, noise-sensitive boundaries), while a large $K$ gives low variance but high bias (overly smooth boundaries).
When classifying into two classes, always use an odd value of $K$ to avoid ties. With $K = 4$, you could get a 2-2 split with no clear winner. With $K = 5$, there is always a majority.
A common rule of thumb is to set $K = \sqrt{N}$ where $N$ is the number of training samples. For 10,000 samples, start with $K \approx 100$. Then tune up or down.
The most reliable approach: evaluate KNN with different K values using cross-validation and plot the error rate vs K. Choose the K at the "elbow" where the error stabilizes.
$K = 1$ memorizes the training data perfectly (including noise) and is extremely sensitive to outliers. A single mislabeled example can corrupt the prediction for its entire neighborhood.
The chart below illustrates the typical relationship between K and model accuracy. Low K values overfit (noisy, high variance), while very high K values underfit (too smooth, high bias). The sweet spot is usually somewhere in the middle.
Majority voting, weighted voting, and how KNN draws decision boundaries.
The simplest classification rule: each of the $K$ nearest neighbors casts an equal vote for its class. The class with the most votes wins:

$$\hat{y} = \arg\max_{c} \sum_{i=1}^{K} \mathbb{1}(y_i = c)$$
Where $\mathbb{1}(y_i = c)$ is the indicator function that equals 1 when neighbor $i$ belongs to class $c$, and 0 otherwise.
Problem: All neighbors have equal influence, regardless of how close or far they are. A neighbor at distance 0.01 counts the same as one at distance 10.0.
A smarter approach gives closer neighbors more influence than distant ones. The most common weighting scheme uses the inverse of the distance:

$$w_i = \frac{1}{d(\mathbf{x}, \mathbf{x}_i)}$$
The weighted classification rule becomes:

$$\hat{y} = \arg\max_{c} \sum_{i=1}^{K} w_i \, \mathbb{1}(y_i = c)$$
Now a neighbor at distance 1 has weight $w = 1$, but a neighbor at distance 5 has weight $w = 0.2$. Closer neighbors dominate the vote.
Weighted voting is especially beneficial when K is large. With a large K, you inevitably include some distant points that may belong to the wrong class. Weighting ensures that these distant "intruders" have minimal impact on the final prediction. In scikit-learn, set weights='distance' to enable weighted voting.
Suppose we have $K = 5$ and the five nearest neighbors have the following classes and distances (illustrative values):

- Neighbor 1: class A, distance 1.0, weight $1/1.0 = 1.00$
- Neighbor 2: class A, distance 2.0, weight $1/2.0 = 0.50$
- Neighbor 3: class B, distance 8.0, weight $1/8.0 \approx 0.13$
- Neighbor 4: class B, distance 9.0, weight $1/9.0 \approx 0.11$
- Neighbor 5: class B, distance 10.0, weight $1/10.0 = 0.10$

Unweighted voting picks B (three votes to two), but weighted voting scores A at $1.50$ against B's $\approx 0.34$. Notice how weighting completely reverses the prediction: the two close A-neighbors outweigh the three distant B-neighbors.
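The reversal is easy to verify numerically. The values below are assumed for illustration: two class-A neighbors at distances 1 and 2, and three class-B neighbors at distances 8, 9, and 10:

```python
from collections import defaultdict

# Illustrative (class, distance) pairs -- assumed values for demonstration
neighbors = [("A", 1.0), ("A", 2.0), ("B", 8.0), ("B", 9.0), ("B", 10.0)]

# Unweighted majority vote: each neighbor counts once
counts = defaultdict(int)
for cls, _ in neighbors:
    counts[cls] += 1
unweighted = max(counts, key=counts.get)  # B wins 3-2

# Distance-weighted vote: each neighbor counts 1/distance
scores = defaultdict(float)
for cls, dist in neighbors:
    scores[cls] += 1.0 / dist
weighted = max(scores, key=scores.get)    # A wins 1.50 to ~0.34

print(unweighted, weighted)  # B A
```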
KNN can also output class probabilities, not just hard predictions. The probability of class $c$ is simply the proportion of neighbors belonging to that class:

$$P(y = c \mid \mathbf{x}) = \frac{1}{K} \sum_{i=1}^{K} \mathbb{1}(y_i = c)$$
For weighted KNN, the probability becomes:

$$P(y = c \mid \mathbf{x}) = \frac{\sum_{i=1}^{K} w_i \, \mathbb{1}(y_i = c)}{\sum_{i=1}^{K} w_i}$$
In scikit-learn, call model.predict_proba(X) to get these probability estimates. They are useful for ranking predictions by confidence or for cost-sensitive classification.
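A minimal sketch of predict_proba on one-dimensional toy data, where the probabilities are just neighbor counts divided by $K$:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0], [2], [4], [6], [8], [10]])
y = np.array([0, 0, 0, 1, 1, 1])  # class 0 on the left, class 1 on the right

model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Near the class boundary the 3 neighbors split 1-to-2:
# the query 5.1 is closest to 6 (class 1), 4 (class 0), 8 (class 1)
proba = model.predict_proba([[5.1]])
print(proba)  # roughly [[0.33, 0.67]]
```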
Predicting continuous values by averaging the outputs of nearest neighbors.
For regression tasks, instead of voting on a class, we average the target values of the $K$ nearest neighbors:

$$\hat{y} = \frac{1}{K} \sum_{i=1}^{K} y_i$$
This produces a piecewise-constant prediction surface. Each region of the feature space is assigned the average value of the K nearest training points.
Just as in classification, we can weight each neighbor's contribution by the inverse of its distance:

$$\hat{y} = \frac{\sum_{i=1}^{K} w_i y_i}{\sum_{i=1}^{K} w_i}$$
Weighted averaging produces smoother predictions because nearby points contribute more, creating a more nuanced interpolation between training values.
A concrete example: predicting the price of a new house from the prices of its $K = 3$ nearest neighbors.
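With assumed prices and distances (three comparable houses at \$300,000, \$320,000, and \$400,000, at distances 1, 2, and 4 from the query), the two averaging rules give:

```python
# Illustrative (price, distance) pairs -- assumed values for demonstration
neighbors = [(300_000, 1.0), (320_000, 2.0), (400_000, 4.0)]

# Plain KNN regression: the unweighted mean of the K target values
plain = sum(p for p, _ in neighbors) / len(neighbors)

# Distance-weighted regression: closer houses pull the estimate harder
weights = [1.0 / d for _, d in neighbors]
weighted = sum(w * p for w, (p, _) in zip(weights, neighbors)) / sum(weights)

print(round(plain))     # 340000
print(round(weighted))  # 320000 -- pulled toward the two nearer, cheaper houses
```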
Classification: Discrete output. Uses majority vote (or weighted vote). Outputs a class label and optionally class probabilities.
Regression: Continuous output. Uses the average (or weighted average). Outputs a real-valued prediction for the target variable.
Why KNN struggles when the number of features grows -- and what to do about it.
As the number of dimensions increases, the volume of the feature space grows exponentially. Data points become increasingly sparse, and the concept of "nearest neighbor" becomes less meaningful. This phenomenon is the curse of dimensionality, a term coined by Richard Bellman.
Consider a unit hypercube in $d$ dimensions. To capture just 10% of the data volume, you need a subcube with side length:

$$s = 0.1^{1/d}$$
In 100 dimensions, you need to span 98% of each feature's range just to capture 10% of the data. "Nearest" neighbors are not very near at all.
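The 98% figure can be checked directly:

```python
# Side length of a subcube capturing 10% of a unit hypercube: s = 0.1 ** (1/d)
for d in (1, 2, 10, 100):
    print(d, round(0.1 ** (1 / d), 3))
# 1 -> 0.1, 2 -> 0.316, 10 -> 0.794, 100 -> 0.977
```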
The curse of dimensionality affects KNN more severely than most other algorithms because KNN relies entirely on distances: as the dimension grows, the distances from a query to its nearest and farthest neighbors become nearly equal, so the "nearest" neighbors are barely more similar than randomly chosen points.
KNN works well when the number of features $d$ is small (typically $d < 20$). Beyond that, performance degrades rapidly unless you apply dimensionality reduction (PCA, feature selection) first. For very high-dimensional data (text, images), use KNN only after projecting to a lower-dimensional space.
Remove irrelevant features that add noise to the distance computation. Use correlation analysis, mutual information, or recursive feature elimination.
Apply PCA, t-SNE, or UMAP to project data into a lower-dimensional space that preserves the most important structure. KNN on PCA-transformed data often dramatically outperforms KNN on raw high-dimensional data.
If dimensionality cannot be reduced, consider algorithms that handle high dimensions better: random forests, gradient boosting, or SVMs with appropriate kernels.
Why normalization and standardization are not optional for KNN -- they are essential.
KNN computes distances between data points. If features have different scales, features with larger ranges will dominate the distance calculation, making features with smaller ranges effectively invisible.
Consider predicting whether a person buys a product based on age (20-70) and income ($30,000-$150,000). Without scaling:
A difference of $10,000 in income will overwhelm a difference of 10 years in age, even though both might be equally important. The income feature effectively becomes the only feature KNN considers.
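The effect is easy to demonstrate with two hypothetical customers (values assumed for illustration):

```python
import numpy as np

# Two customers: similar income, very different ages -- [age, income]
a = np.array([25, 50_000])
b = np.array([65, 51_000])

# Unscaled Euclidean distance: the income gap swamps the age gap
raw = np.sqrt(((a - b) ** 2).sum())
print(round(raw, 1))  # ~1000.8 -- the 40-year age gap barely registers

# Min-max scale each feature to [0, 1] using the stated ranges
ranges = np.array([[20, 70], [30_000, 150_000]])  # [min, max] per feature
a_s = (a - ranges[:, 0]) / (ranges[:, 1] - ranges[:, 0])
b_s = (b - ranges[:, 0]) / (ranges[:, 1] - ranges[:, 0])
scaled = np.sqrt(((a_s - b_s) ** 2).sum())
print(round(scaled, 3))  # ~0.8 -- now the age difference dominates, as it should
```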
Scales all features to a fixed range, typically [0, 1]:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Pros: Preserves the shape of the original distribution. All features contribute equally to the distance. Cons: Sensitive to outliers. A single extreme value stretches the range, compressing all other values.
Centers each feature at zero with unit variance:

$$z = \frac{x - \mu}{\sigma}$$
Where $\mu$ is the feature mean and $\sigma$ is the standard deviation. After standardization, each feature has mean 0 and standard deviation 1.
Pros: More robust to outliers than min-max scaling. Works well when features have different units or distributions. Cons: Does not bound features to a specific range.
Forgetting to scale features is the most common mistake when using KNN. It can cause accuracy to drop by 20-40% or more. Always scale your features before applying KNN. In scikit-learn, use StandardScaler or MinMaxScaler in a pipeline to avoid data leakage.
The chart below demonstrates how feature scaling dramatically improves KNN accuracy. Without scaling, features with larger ranges dominate; with scaling, all features contribute fairly.
From brute force to KD-Trees and Ball Trees -- making KNN fast enough for production.
The naive approach: compute the distance from the query point to every training point, then sort and pick the top K. The time complexity per query is:

$$O(N \cdot d)$$
Where $N$ is the number of training samples and $d$ is the number of features. For $N = 1{,}000{,}000$ and $d = 100$, this means a million distance computations per prediction, each over 100 dimensions, on the order of $10^8$ operations. Clearly impractical for real-time applications.
When to use: Small datasets ($N < 10{,}000$) or when dimensionality is very high (where tree-based methods degrade to brute force anyway).
A KD-Tree is a binary tree that recursively partitions the feature space along one dimension at a time. At each node, the space is split along the median of one feature, cycling through all features.
KD-Trees work beautifully for low-dimensional data ($d < 20$) but their performance degrades rapidly as dimensionality increases, eventually becoming worse than brute force due to the overhead of tree traversal.
Ball Trees partition the data into nested hyperspheres (balls) instead of axis-aligned rectangles. This makes them more effective than KD-Trees when the data does not align well with coordinate axes.
In scikit-learn, set algorithm='ball_tree' to use Ball Trees, or algorithm='auto' to let the library choose based on your data characteristics.
Brute force: Small N or very high d. Simple, exact, no build overhead. O(Nd) per query.
KD-Tree: Large N, low d (under 20). O(log N) query in the best case. Default choice for tabular data.
Ball Tree: Large N, moderate d (under 40). Handles non-axis-aligned clusters well. Robust fallback.
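A sketch of building an explicit neighbor index with scikit-learn's NearestNeighbors on toy data; the algorithm argument selects the underlying structure:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((1000, 5))  # largish N, low d: KD-Tree territory

# Build the index once, then answer queries via tree traversal
index = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(X)
distances, indices = index.kneighbors(X[:1])

print(indices.shape)    # (1, 5): the 5 nearest training points per query
print(distances[0][0])  # 0.0 -- a training point is its own nearest neighbor
```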
For truly large-scale applications (millions of points), exact nearest neighbor search is too slow regardless of the data structure. Libraries like FAISS (Facebook), Annoy (Spotify), and ScaNN (Google) provide approximate nearest neighbor search that trades a small amount of accuracy for orders-of-magnitude speedup. These are the backbone of production recommendation systems and semantic search engines.
Real-world domains where KNN delivers production-grade results.
Finding users or items similar to a target. "Users who bought X also bought Y" is fundamentally a nearest neighbor problem in user-item space.
Points whose K nearest neighbors are unusually far away are likely anomalies. The average distance to K neighbors serves as an anomaly score.
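A sketch of that anomaly score on synthetic data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X = rng.normal(0, 1, size=(200, 2))   # dense cluster of normal points
X = np.vstack([X, [[8.0, 8.0]]])      # one far-away anomaly at index 200

# Score each point by its mean distance to its 5 nearest neighbors
nn = NearestNeighbors(n_neighbors=6).fit(X)   # 6 = self + 5 true neighbors
dist, _ = nn.kneighbors(X)
scores = dist[:, 1:].mean(axis=1)             # drop the self-distance column

print(int(np.argmax(scores)))  # 200 -- the appended outlier scores highest
```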
Fill in missing values by averaging the values of K nearest complete-case neighbors. Scikit-learn provides KNNImputer for this purpose.
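A minimal KNNImputer sketch on a toy matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [1.1, 2.1],
    [1.2, np.nan],  # missing value to fill
    [9.0, 8.0],
])

# Each NaN is replaced by the mean of that feature over the 2 nearest
# neighbors with the value present (here: the two rows near [1, 2])
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[2, 1])  # 2.05 -- the mean of 2.0 and 2.1
```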
Classifying handwritten digits (MNIST) by comparing pixel patterns. KNN achieves over 96% accuracy on MNIST without any feature engineering.
Diagnosing diseases based on patient features (lab results, symptoms). KNN's interpretability makes it suitable for clinical decision support systems.
Comparing image feature vectors extracted by CNNs. Use KNN on learned embeddings for few-shot learning or content-based image retrieval.
KNN stores the entire training set. For 1 million samples with 100 features (float64), that is 800 MB of RAM. Consider data compression or approximate methods for large datasets.
Even with KD-Trees, prediction is slower than model-based approaches. For sub-millisecond latency requirements, precompute neighbor indices or use ANN libraries.
One advantage of KNN: adding new training data requires no retraining. Simply append the new points to the dataset. However, tree structures may need rebuilding for optimal performance.
Click to add training points, then click "Show Decision Regions" to see the KNN classification regions across the space.
From scikit-learn basics to complete classification and regression pipelines.
The classic introduction to KNN. We load the Iris dataset, scale features, fit a KNeighborsClassifier, and evaluate performance.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Feature scaling (critical for KNN!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(X_train_scaled, y_train)
# Evaluate
y_pred = knn.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
Using KNeighborsRegressor with distance-weighted averaging for a regression task.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Load data
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train KNN regressor with distance weighting
knn_reg = KNeighborsRegressor(
    n_neighbors=7,
    weights='distance',
    algorithm='auto'
)
knn_reg.fit(X_train_scaled, y_train)
# Evaluate
y_pred = knn_reg.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")
print(f"R2 Score: {r2:.4f}")
A systematic approach to choosing the best K value using the elbow method.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
import numpy as np
X, y = load_iris(return_X_y=True)
# Test K values from 1 to 30
k_range = range(1, 31)
cv_scores = []
for k in k_range:
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k))
    ])
    scores = cross_val_score(pipeline, X, y, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
# Find the best K
best_k = k_range[np.argmax(cv_scores)]
print(f"Best K: {best_k}")
print(f"Best CV Accuracy: {max(cv_scores):.4f}")
# Plot would show the elbow curve
# (accuracy peaks around K=5-13 for Iris)
A production-ready example with hyperparameter tuning over K, weights, distance metric, and the search algorithm.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Build pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
# Define hyperparameter grid
param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9, 11, 15, 21],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['euclidean', 'manhattan', 'minkowski'],
    'knn__algorithm': ['auto', 'ball_tree', 'kd_tree']
}
# Grid search with cross-validation
grid_search = GridSearchCV(
    pipeline, param_grid,
    cv=5, scoring='accuracy', n_jobs=-1
)
grid_search.fit(X_train, y_train)
# Best parameters and score
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
# Evaluate on test set
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
Never scale the training and test data together -- this causes data leakage. The scaler should learn statistics (mean, std) only from the training set. Using a Pipeline ensures that fit_transform is called only on the training fold during cross-validation, and transform is applied to the validation fold. This gives honest performance estimates.