K-Nearest Neighbors (KNN)
From distance metrics to decision boundaries. A beginner-friendly, visually interactive deep dive into the most intuitive machine learning algorithm -- you are the average of your neighbors.
How a simple idea about proximity became one of the most enduring algorithms in machine learning.
The K-Nearest Neighbors algorithm traces its formal origins to a landmark 1967 paper by Thomas Cover and Peter Hart, titled "Nearest Neighbor Pattern Classification". In this paper, they proved a remarkable result: the error rate of the 1-nearest neighbor rule is bounded above by twice the Bayes error rate (the theoretical minimum error achievable by any classifier).
This means a classifier that simply copies the label of the closest training example is, in the worst case, only twice as bad as the theoretically optimal classifier. For many practical problems, the gap is much smaller.
While Cover and Hart gave KNN its theoretical foundation, the idea of classifying by proximity is far older. The concept draws from a simple human intuition: similar things are likely to belong to the same category. If an unknown fruit looks, smells, and tastes like an apple, it is probably an apple.
Before the term "machine learning" existed, statisticians were already exploring non-parametric methods -- techniques that make minimal assumptions about the underlying data distribution. KNN is the purest embodiment of this philosophy: it makes zero assumptions about data distribution, relying entirely on the structure of the training data itself.
"You are the average of your neighbors" -- the simplest yet most powerful idea in pattern recognition.
K-Nearest Neighbors is an instance-based learning algorithm, also called lazy learning. Unlike eager learners (logistic regression, decision trees, neural networks) that build an explicit model during training, KNN does nothing during training. It simply stores the entire training dataset in memory.
All the computation happens at prediction time. When a new data point arrives, KNN:
1. Measures the distance from the new point to every point in the training set using a distance metric (Euclidean, Manhattan, etc.).
2. Sorts all training points by distance and selects the $K$ closest ones.
3. For classification, takes a majority vote among the $K$ neighbors; for regression, takes the average (or weighted average) of their values.
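These steps can be sketched in a few lines of NumPy; a minimal illustration, not the scikit-learn implementation:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Step 1: Euclidean distance from the query to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Step 2: indices of the k closest points
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote among their labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two toy clusters: class 0 near the origin, class 1 near (5, 5)
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.5]), k=3))  # → 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5]), k=3))  # → 1
```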
The distinction between lazy and eager learning is fundamental to understanding KNN:
Lazy learning (KNN): No training phase. Stores all data. All work happens at prediction time. Adapts instantly to new data without retraining.
Eager learning: Expensive training phase to build a model. Fast prediction. Must retrain to incorporate new data points.
KNN trades fast training for slow prediction. Eager learners trade slow training for fast prediction. Choose based on your use case.
KNN is a non-parametric algorithm. It does not assume any specific form for the decision boundary -- no straight lines, no polynomial curves, no probability distributions. The decision boundary emerges organically from the spatial arrangement of training points.
This gives KNN extraordinary flexibility. It can learn arbitrarily complex decision boundaries, including non-convex regions, disconnected regions, and boundaries of any shape. The flip side is that it needs enough training data to define these boundaries accurately.
KNN is sometimes called a memory-based learner because it literally memorizes the entire training set. There are no learned parameters, no weights, no coefficients. The "model" is the data. This is both its greatest strength (maximum flexibility) and its greatest weakness (storage and computational cost scale with dataset size).
Click anywhere on the canvas to place a new query point (green star). See which K neighbors are closest and how the majority vote decides the class.
Adjust K and click to classify. Dashed lines connect to the K nearest neighbors.
The choice of distance function fundamentally shapes how KNN perceives similarity.
The most commonly used distance metric. It measures the straight-line distance between two points in Euclidean space -- the "as the crow flies" distance:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
For two points $\mathbf{x} = (x_1, x_2)$ and $\mathbf{y} = (y_1, y_2)$ in two dimensions, this is simply the Pythagorean theorem: $d = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$.
Properties: Rotation-invariant, sensitive to feature scale, heavily influenced by large differences in any single dimension. Works best when all features are continuous and on a similar scale.
Also known as the taxicab distance or city block distance. It measures the sum of absolute differences along each dimension -- like navigating a grid of city blocks:

$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$$
Properties: Less sensitive to outliers than Euclidean distance because it does not square the differences. Often preferred when features represent fundamentally different quantities, or when the data lies on a grid-like structure.
Both Euclidean and Manhattan distances are special cases of the Minkowski distance, parameterized by $p$:

$$d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

Setting $p = 1$ recovers Manhattan distance; $p = 2$ recovers Euclidean distance.
The parameter $p$ controls how much emphasis is placed on large vs small differences. Higher $p$ values give more weight to the dimension with the largest discrepancy.
Unlike the previous metrics which measure magnitude, cosine similarity measures the angle between two vectors. It is especially useful for text data (TF-IDF vectors) where the direction matters more than the magnitude:

$$\cos(\theta) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \, \|\mathbf{y}\|}$$
The cosine distance is defined as $d = 1 - \cos(\theta)$. Two vectors pointing in the same direction have cosine similarity of 1 (distance 0), while orthogonal vectors have similarity 0 (distance 1).
Use Euclidean distance when the magnitude of the features matters (e.g., height and weight). Use cosine similarity when only the direction matters (e.g., document vectors, where a longer document is not necessarily more similar). In NLP and recommendation systems, cosine similarity is almost always the better choice.
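The metrics above can be computed directly in NumPy (illustrative vectors):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(((a - b) ** 2).sum())           # straight-line: 5.0
manhattan = np.abs(a - b).sum()                     # city-block: 7.0
minkowski3 = (np.abs(a - b) ** 3).sum() ** (1 / 3)  # p = 3: ~4.498
cos_sim = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_dist = 1 - cos_sim                           # 0 means same direction

print(euclidean, manhattan, round(minkowski3, 3), round(cosine_dist, 3))
```

Note the ordering: for the same pair of points, Manhattan ≥ Euclidean ≥ Minkowski with $p = 3$, since higher $p$ leans more on the single largest coordinate gap.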
The chart below compares how different metrics measure distance between two points in a 2D space. Notice how Manhattan distance is always greater than or equal to Euclidean distance.
See how different distance metrics create different "neighborhoods". Euclidean creates circular boundaries while Manhattan creates diamond-shaped ones.
Toggle between metrics to see how the neighborhood shape changes around the center point.
The single most important hyperparameter in KNN -- how many neighbors should vote?
The value of $K$ controls the bias-variance trade-off in KNN: a small $K$ gives low bias but high variance (jagged, noise-sensitive boundaries), while a large $K$ gives low variance but high bias (overly smooth boundaries).
When classifying into two classes, always use an odd value of $K$ to avoid ties. With $K = 4$, you could get a 2-2 split with no clear winner. With $K = 5$, there is always a majority.
A common rule of thumb is to set $K = \sqrt{N}$ where $N$ is the number of training samples. For 10,000 samples, start with $K \approx 100$. Then tune up or down.
The most reliable approach: evaluate KNN with different K values using cross-validation and plot the error rate vs K. Choose the K at the "elbow" where the error stabilizes.
$K = 1$ memorizes the training data perfectly (including noise) and is extremely sensitive to outliers. A single mislabeled example can corrupt the prediction for its entire neighborhood.
The chart below illustrates the typical relationship between K and model accuracy. Low K values overfit (noisy, high variance), while very high K values underfit (too smooth, high bias). The sweet spot is usually somewhere in the middle.
Majority voting, weighted voting, and how KNN draws decision boundaries.
The simplest classification rule: each of the $K$ nearest neighbors casts an equal vote for its class. The class with the most votes wins:

$$\hat{y} = \arg\max_{c} \sum_{i=1}^{K} \mathbb{1}(y_i = c)$$
Where $\mathbb{1}(y_i = c)$ is the indicator function that equals 1 when neighbor $i$ belongs to class $c$, and 0 otherwise.
Problem: All neighbors have equal influence, regardless of how close or far they are. A neighbor at distance 0.01 counts the same as one at distance 10.0.
A smarter approach gives closer neighbors more influence than distant ones. The most common weighting scheme uses the inverse of the distance:

$$w_i = \frac{1}{d(\mathbf{x}, \mathbf{x}_i)}$$
The weighted classification rule becomes:

$$\hat{y} = \arg\max_{c} \sum_{i=1}^{K} w_i \, \mathbb{1}(y_i = c)$$
Now a neighbor at distance 1 has weight $w = 1$, but a neighbor at distance 5 has weight $w = 0.2$. Closer neighbors dominate the vote.
Weighted voting is especially beneficial when K is large. With a large K, you inevitably include some distant points that may belong to the wrong class. Weighting ensures that these distant "intruders" have minimal impact on the final prediction. In scikit-learn, set weights='distance' to enable weighted voting.
Suppose we have $K = 5$ and the five nearest neighbors have the following classes and distances (illustrative values):

- Neighbor 1: class A, distance 1.0, weight $1/1.0 = 1.00$
- Neighbor 2: class A, distance 2.0, weight $1/2.0 = 0.50$
- Neighbor 3: class B, distance 8.0, weight $1/8.0 \approx 0.13$
- Neighbor 4: class B, distance 9.0, weight $1/9.0 \approx 0.11$
- Neighbor 5: class B, distance 10.0, weight $1/10.0 = 0.10$

Unweighted voting picks B (three votes to two), but weighted voting scores A at $1.50$ against B's $\approx 0.34$. Notice how weighting completely reverses the prediction: the two close A-neighbors outweigh the three distant B-neighbors.
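The reversal is easy to verify numerically. The values below are assumed for illustration: two class-A neighbors at distances 1 and 2, and three class-B neighbors at distances 8, 9, and 10:

```python
from collections import defaultdict

# Illustrative (class, distance) pairs -- assumed values for demonstration
neighbors = [("A", 1.0), ("A", 2.0), ("B", 8.0), ("B", 9.0), ("B", 10.0)]

# Unweighted majority vote: each neighbor counts once
counts = defaultdict(int)
for cls, _ in neighbors:
    counts[cls] += 1
unweighted = max(counts, key=counts.get)  # B wins 3-2

# Distance-weighted vote: each neighbor counts 1/distance
scores = defaultdict(float)
for cls, dist in neighbors:
    scores[cls] += 1.0 / dist
weighted = max(scores, key=scores.get)    # A wins 1.50 to ~0.34

print(unweighted, weighted)  # B A
```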
KNN can also output class probabilities, not just hard predictions. The probability of class $c$ is simply the proportion of neighbors belonging to that class:

$$P(y = c \mid \mathbf{x}) = \frac{1}{K} \sum_{i=1}^{K} \mathbb{1}(y_i = c)$$
For weighted KNN, the probability becomes:

$$P(y = c \mid \mathbf{x}) = \frac{\sum_{i=1}^{K} w_i \, \mathbb{1}(y_i = c)}{\sum_{i=1}^{K} w_i}$$
In scikit-learn, call model.predict_proba(X) to get these probability estimates. They are useful for ranking predictions by confidence or for cost-sensitive classification.
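A minimal sketch of predict_proba on one-dimensional toy data, where the probabilities are just neighbor counts divided by $K$:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0], [2], [4], [6], [8], [10]])
y = np.array([0, 0, 0, 1, 1, 1])  # class 0 on the left, class 1 on the right

model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Near the class boundary the 3 neighbors split 1-to-2:
# the query 5.1 is closest to 6 (class 1), 4 (class 0), 8 (class 1)
proba = model.predict_proba([[5.1]])
print(proba)  # roughly [[0.33, 0.67]]
```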
Predicting continuous values by averaging the outputs of nearest neighbors.
For regression tasks, instead of voting on a class, we average the target values of the $K$ nearest neighbors:

$$\hat{y} = \frac{1}{K} \sum_{i=1}^{K} y_i$$
This produces a piecewise-constant prediction surface. Each region of the feature space is assigned the average value of the K nearest training points.
Just as in classification, we can weight each neighbor's contribution by the inverse of its distance:

$$\hat{y} = \frac{\sum_{i=1}^{K} w_i y_i}{\sum_{i=1}^{K} w_i}$$
Weighted averaging produces smoother predictions because nearby points contribute more, creating a more nuanced interpolation between training values.
A concrete example: predicting the price of a new house from the prices of its $K = 3$ nearest neighbors.
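With assumed prices and distances (three comparable houses at \$300,000, \$320,000, and \$400,000, at distances 1, 2, and 4 from the query), the two averaging rules give:

```python
# Illustrative (price, distance) pairs -- assumed values for demonstration
neighbors = [(300_000, 1.0), (320_000, 2.0), (400_000, 4.0)]

# Plain KNN regression: the unweighted mean of the K target values
plain = sum(p for p, _ in neighbors) / len(neighbors)

# Distance-weighted regression: closer houses pull the estimate harder
weights = [1.0 / d for _, d in neighbors]
weighted = sum(w * p for w, (p, _) in zip(weights, neighbors)) / sum(weights)

print(round(plain))     # 340000
print(round(weighted))  # 320000 -- pulled toward the two nearer, cheaper houses
```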
Classification: Discrete output. Uses majority vote (or weighted vote). Outputs a class label and optionally class probabilities.
Regression: Continuous output. Uses the average (or weighted average). Outputs a real-valued prediction for the target variable.
Why KNN struggles when the number of features grows -- and what to do about it.
As the number of dimensions increases, the volume of the feature space grows exponentially. Data points become increasingly sparse, and the concept of "nearest neighbor" becomes less meaningful. This phenomenon is the curse of dimensionality, a term coined by Richard Bellman.
Consider a unit hypercube in $d$ dimensions. To capture just 10% of the data volume, you need a subcube with side length:

$$s = 0.1^{1/d}$$
In 100 dimensions, you need to span 98% of each feature's range just to capture 10% of the data. "Nearest" neighbors are not very near at all.
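The 98% figure can be checked directly:

```python
# Side length of a subcube capturing 10% of a unit hypercube: s = 0.1 ** (1/d)
for d in (1, 2, 10, 100):
    print(d, round(0.1 ** (1 / d), 3))
# 1 -> 0.1, 2 -> 0.316, 10 -> 0.794, 100 -> 0.977
```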
The curse of dimensionality affects KNN more severely than most other algorithms because KNN relies entirely on distances: as the dimension grows, the distances from a query to its nearest and farthest neighbors become nearly equal, so the "nearest" neighbors are barely more similar than randomly chosen points.
KNN works well when the number of features $d$ is small (typically $d < 20$). Beyond that, performance degrades rapidly unless you apply dimensionality reduction (PCA, feature selection) first. For very high-dimensional data (text, images), use KNN only after projecting to a lower-dimensional space.
Remove irrelevant features that add noise to the distance computation. Use correlation analysis, mutual information, or recursive feature elimination.
Apply PCA, t-SNE, or UMAP to project data into a lower-dimensional space that preserves the most important structure. KNN on PCA-transformed data often dramatically outperforms KNN on raw high-dimensional data.
If dimensionality cannot be reduced, consider algorithms that handle high dimensions better: random forests, gradient boosting, or SVMs with appropriate kernels.
Why normalization and standardization are not optional for KNN -- they are essential.
KNN computes distances between data points. If features have different scales, features with larger ranges will dominate the distance calculation, making features with smaller ranges effectively invisible.
Consider predicting whether a person buys a product based on age (20-70) and income ($30,000-$150,000). Without scaling:
A difference of $10,000 in income will overwhelm a difference of 10 years in age, even though both might be equally important. The income feature effectively becomes the only feature KNN considers.
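The effect is easy to demonstrate with two hypothetical customers (values assumed for illustration):

```python
import numpy as np

# Two customers: similar income, very different ages -- [age, income]
a = np.array([25, 50_000])
b = np.array([65, 51_000])

# Unscaled Euclidean distance: the income gap swamps the age gap
raw = np.sqrt(((a - b) ** 2).sum())
print(round(raw, 1))  # ~1000.8 -- the 40-year age gap barely registers

# Min-max scale each feature to [0, 1] using the stated ranges
ranges = np.array([[20, 70], [30_000, 150_000]])  # [min, max] per feature
a_s = (a - ranges[:, 0]) / (ranges[:, 1] - ranges[:, 0])
b_s = (b - ranges[:, 0]) / (ranges[:, 1] - ranges[:, 0])
scaled = np.sqrt(((a_s - b_s) ** 2).sum())
print(round(scaled, 3))  # ~0.8 -- now the age difference dominates, as it should
```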
Scales all features to a fixed range, typically [0, 1]:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Pros: Preserves the shape of the original distribution. All features contribute equally to the distance. Cons: Sensitive to outliers. A single extreme value stretches the range, compressing all other values.
Centers each feature at zero with unit variance:

$$z = \frac{x - \mu}{\sigma}$$
Where $\mu$ is the feature mean and $\sigma$ is the standard deviation. After standardization, each feature has mean 0 and standard deviation 1.
Pros: More robust to outliers than min-max scaling. Works well when features have different units or distributions. Cons: Does not bound features to a specific range.
Forgetting to scale features is the most common mistake when using KNN. It can cause accuracy to drop by 20-40% or more. Always scale your features before applying KNN. In scikit-learn, use StandardScaler or MinMaxScaler in a pipeline to avoid data leakage.
The chart below demonstrates how feature scaling dramatically improves KNN accuracy. Without scaling, features with larger ranges dominate; with scaling, all features contribute fairly.
From brute force to KD-Trees and Ball Trees -- making KNN fast enough for production.
The naive approach: compute the distance from the query point to every training point, then sort and pick the top K. The time complexity per query is:

$$O(N \cdot d)$$
Where $N$ is the number of training samples and $d$ is the number of features. For $N = 1{,}000{,}000$ and $d = 100$, this means a million distance computations per prediction, each over 100 dimensions, on the order of $10^8$ operations. Clearly impractical for real-time applications.
When to use: Small datasets ($N < 10{,}000$) or when dimensionality is very high (where tree-based methods degrade to brute force anyway).
A KD-Tree is a binary tree that recursively partitions the feature space along one dimension at a time. At each node, the space is split along the median of one feature, cycling through all features.
KD-Trees work beautifully for low-dimensional data ($d < 20$) but their performance degrades rapidly as dimensionality increases, eventually becoming worse than brute force due to the overhead of tree traversal.
Ball Trees partition the data into nested hyperspheres (balls) instead of axis-aligned rectangles. This makes them more effective than KD-Trees when the data does not align well with coordinate axes.
In scikit-learn, set algorithm='ball_tree' to use Ball Trees, or algorithm='auto' to let the library choose based on your data characteristics.
Brute force: Small N or very high d. Simple, exact, no build overhead. O(Nd) per query.
KD-Tree: Large N, low d (under 20). O(log N) query in the best case. Default choice for tabular data.
Ball Tree: Large N, moderate d (under 40). Handles non-axis-aligned clusters well. Robust fallback.
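A sketch of building an explicit neighbor index with scikit-learn's NearestNeighbors on toy data; the algorithm argument selects the underlying structure:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((1000, 5))  # largish N, low d: KD-Tree territory

# Build the index once, then answer queries via tree traversal
index = NearestNeighbors(n_neighbors=5, algorithm='kd_tree').fit(X)
distances, indices = index.kneighbors(X[:1])

print(indices.shape)    # (1, 5): the 5 nearest training points per query
print(distances[0][0])  # 0.0 -- a training point is its own nearest neighbor
```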
For truly large-scale applications (millions of points), exact nearest neighbor search is too slow regardless of the data structure. Libraries like FAISS (Facebook), Annoy (Spotify), and ScaNN (Google) provide approximate nearest neighbor search that trades a small amount of accuracy for orders-of-magnitude speedup. These are the backbone of production recommendation systems and semantic search engines.
Real-world domains where KNN delivers production-grade results.
Finding users or items similar to a target. "Users who bought X also bought Y" is fundamentally a nearest neighbor problem in user-item space.
Points whose K nearest neighbors are unusually far away are likely anomalies. The average distance to K neighbors serves as an anomaly score.
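A sketch of that anomaly score on synthetic data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X = rng.normal(0, 1, size=(200, 2))   # dense cluster of normal points
X = np.vstack([X, [[8.0, 8.0]]])      # one far-away anomaly at index 200

# Score each point by its mean distance to its 5 nearest neighbors
nn = NearestNeighbors(n_neighbors=6).fit(X)   # 6 = self + 5 true neighbors
dist, _ = nn.kneighbors(X)
scores = dist[:, 1:].mean(axis=1)             # drop the self-distance column

print(int(np.argmax(scores)))  # 200 -- the appended outlier scores highest
```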
Fill in missing values by averaging the values of K nearest complete-case neighbors. Scikit-learn provides KNNImputer for this purpose.
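A minimal KNNImputer sketch on a toy matrix:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [1.1, 2.1],
    [1.2, np.nan],  # missing value to fill
    [9.0, 8.0],
])

# Each NaN is replaced by the mean of that feature over the 2 nearest
# neighbors with the value present (here: the two rows near [1, 2])
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[2, 1])  # 2.05 -- the mean of 2.0 and 2.1
```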
Classifying handwritten digits (MNIST) by comparing pixel patterns. KNN achieves over 96% accuracy on MNIST without any feature engineering.
Diagnosing diseases based on patient features (lab results, symptoms). KNN's interpretability makes it suitable for clinical decision support systems.
Comparing image feature vectors extracted by CNNs. Use KNN on learned embeddings for few-shot learning or content-based image retrieval.
KNN stores the entire training set. For 1 million samples with 100 features (float64), that is 800 MB of RAM. Consider data compression or approximate methods for large datasets.
Even with KD-Trees, prediction is slower than model-based approaches. For sub-millisecond latency requirements, precompute neighbor indices or use ANN libraries.
One advantage of KNN: adding new training data requires no retraining. Simply append the new points to the dataset. However, tree structures may need rebuilding for optimal performance.
Click to add training points, then click "Show Decision Regions" to see the KNN classification regions across the space.
From scikit-learn basics to complete classification and regression pipelines.
The classic introduction to KNN. We load the Iris dataset, scale features, fit a KNeighborsClassifier, and evaluate performance.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
# Feature scaling (critical for KNN!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train KNN classifier
knn = KNeighborsClassifier(n_neighbors=5, weights='uniform')
knn.fit(X_train_scaled, y_train)
# Evaluate
y_pred = knn.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
Using KNeighborsRegressor with distance-weighted averaging for a regression task.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
# Load data
X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train KNN regressor with distance weighting
knn_reg = KNeighborsRegressor(
    n_neighbors=7,
    weights='distance',
    algorithm='auto'
)
knn_reg.fit(X_train_scaled, y_train)
# Evaluate
y_pred = knn_reg.predict(X_test_scaled)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")
print(f"R2 Score: {r2:.4f}")
A systematic approach to choosing the best K value using the elbow method.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
import numpy as np
X, y = load_iris(return_X_y=True)
# Test K values from 1 to 30
k_range = range(1, 31)
cv_scores = []
for k in k_range:
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k))
    ])
    scores = cross_val_score(pipeline, X, y, cv=10, scoring='accuracy')
    cv_scores.append(scores.mean())
# Find the best K
best_k = k_range[np.argmax(cv_scores)]
print(f"Best K: {best_k}")
print(f"Best CV Accuracy: {max(cv_scores):.4f}")
# Plot would show the elbow curve
# (accuracy peaks around K=5-13 for Iris)
A production-ready example with hyperparameter tuning over K, weights, distance metric, and the search algorithm.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
# Load data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Build pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
# Define hyperparameter grid
param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9, 11, 15, 21],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['euclidean', 'manhattan', 'minkowski'],
    'knn__algorithm': ['auto', 'ball_tree', 'kd_tree']
}
# Grid search with cross-validation
grid_search = GridSearchCV(
    pipeline, param_grid,
    cv=5, scoring='accuracy', n_jobs=-1
)
grid_search.fit(X_train, y_train)
# Best parameters and score
print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
# Evaluate on test set
y_pred = grid_search.predict(X_test)
print(classification_report(y_test, y_pred))
Never scale the training and test data together -- this causes data leakage. The scaler should learn statistics (mean, std) only from the training set. Using a Pipeline ensures that fit_transform is called only on the training fold during cross-validation, and transform is applied to the validation fold. This gives honest performance estimates.