
KNN Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Distance Formulas

Euclidean (L2):
$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$$
Manhattan (L1):
$$d(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{n} |x_i - y_i|$$
Minkowski (Lp):
$$d(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{n} |x_i - y_i|^p\right)^{1/p}$$
Cosine Distance:
$$d(\mathbf{x}, \mathbf{y}) = 1 - \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\| \cdot \|\mathbf{y}\|}$$

Euclidean is the usual default. Manhattan is more robust to outliers than Euclidean. Minkowski generalizes both ($p=1$ Manhattan, $p=2$ Euclidean). Cosine ignores magnitude and compares direction only.
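The four metrics can be sketched in a few lines of NumPy (function names here are illustrative, not from any particular library):

```python
import numpy as np

def euclidean(x, y):
    # L2: square root of the sum of squared differences
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    # L1: sum of absolute differences
    return np.sum(np.abs(x - y))

def minkowski(x, y, p):
    # Lp: generalizes L1 (p=1) and L2 (p=2)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)

def cosine_distance(x, y):
    # 1 minus cosine similarity; ignores vector magnitude
    return 1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

x = np.array([1.0, 2.0])
y = np.array([4.0, 6.0])
print(euclidean(x, y))       # 5.0
print(manhattan(x, y))       # 7.0
print(minkowski(x, y, 2))    # 5.0 (same as Euclidean)
```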

Choosing K

Rule of Thumb:
$$K \approx \sqrt{n}$$
  • Use odd K for binary classification to avoid ties
  • Small $K$: low bias, high variance (overfitting)
  • Large $K$: high bias, low variance (underfitting)
  • $K = 1$: Voronoi tessellation decision boundary
Tuning procedure:

  1. Start with $K = \sqrt{n}$ where $n$ is the training set size
  2. Use cross-validation to test a range of $K$ values
  3. Plot accuracy vs. $K$ and look for the elbow point
  4. Prefer the simplest model (larger $K$) at similar performance
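The tuning steps above can be sketched with leave-one-out cross-validation in pure NumPy (the toy data and function names are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    # Majority vote among the k nearest training points (Euclidean distance)
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = y_train[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]

def loo_accuracy(X, y, k):
    # Leave-one-out CV: predict each point from all the others
    hits = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        hits += knn_predict(X[mask], y[mask], X[i], k) == y[i]
    return hits / len(X)

# Toy data: two well-separated 2-D clusters (hypothetical example)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

# Scan odd k values around sqrt(n) and compare accuracies
for k in [1, 3, 5, 7]:
    print(k, loo_accuracy(X, y, k))
```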

Classification

Majority Voting:
$$\hat{y} = \arg\max_c \sum_{i=1}^{K} \mathbb{1}(y_i = c)$$
Weighted Voting:
$$\hat{y} = \arg\max_c \sum_{i=1}^{K} w_i \cdot \mathbb{1}(y_i = c)$$
Weight (inverse dist.):
$$w_i = \frac{1}{d(\mathbf{x}_q, \mathbf{x}_i) + \epsilon}$$

Majority voting treats all $K$ neighbors equally. Weighted voting gives closer neighbors more influence. Add $\epsilon$ to avoid division by zero.
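A minimal sketch of inverse-distance weighted voting, assuming the $K$ nearest distances and labels have already been found (function name is illustrative):

```python
import numpy as np

def weighted_vote(distances, labels, eps=1e-8):
    # w_i = 1 / (d_i + eps); the class with the largest total weight wins
    weights = 1.0 / (np.asarray(distances, dtype=float) + eps)
    classes = np.unique(labels)
    totals = [weights[np.asarray(labels) == c].sum() for c in classes]
    return classes[int(np.argmax(totals))]

# One very close class-1 neighbor outweighs two distant class-0 neighbors
print(weighted_vote([0.1, 2.0, 2.5], [1, 0, 0]))  # 1
```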

Regression

Mean of Neighbors:
$$\hat{y} = \frac{1}{K} \sum_{i=1}^{K} y_i$$
Weighted Mean:
$$\hat{y} = \frac{\sum_{i=1}^{K} w_i \cdot y_i}{\sum_{i=1}^{K} w_i}$$
Inverse Distance Weight:
$$w_i = \frac{1}{d(\mathbf{x}_q, \mathbf{x}_i)^2}$$

For regression, KNN averages the target values of the $K$ closest neighbors. Inverse-distance weighting often improves accuracy by reducing the effect of distant neighbors.
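The weighted-mean formula above can be sketched directly, again assuming the neighbor distances and targets are already known (function name is illustrative):

```python
import numpy as np

def knn_regress(distances, targets):
    # Inverse-squared-distance weighted mean of the neighbor targets
    d = np.asarray(distances, dtype=float)
    t = np.asarray(targets, dtype=float)
    w = 1.0 / d ** 2
    return np.sum(w * t) / np.sum(w)

# Closer neighbor (d=1) pulls the estimate toward its target
print(knn_regress([1.0, 2.0], [10.0, 20.0]))  # 12.0
```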

Feature Scaling

Min-Max Normalization:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
Scales to:
$x' \in [0, 1]$
Z-Score Standardization:
$$z = \frac{x - \mu}{\sigma}$$
Result:
$\mu = 0, \quad \sigma = 1$

Feature scaling is critical for KNN. Without it, features with larger ranges dominate the distance calculation. Always fit on training data, then transform test data using the same parameters.
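The fit-on-train, transform-on-test discipline for min-max scaling looks like this (a sketch mirroring what a library scaler does; function names are illustrative):

```python
import numpy as np

def fit_minmax(X_train):
    # Learn per-feature min and max on the training data only
    return X_train.min(axis=0), X_train.max(axis=0)

def transform_minmax(X, lo, hi):
    # Apply the training-set parameters to any split (train or test)
    return (X - lo) / (hi - lo)

X_train = np.array([[0.0, 100.0], [10.0, 200.0]])
X_test = np.array([[5.0, 150.0]])
lo, hi = fit_minmax(X_train)
print(transform_minmax(X_test, lo, hi))  # [[0.5 0.5]]
```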

Curse of Dimensionality

Volume of Hypersphere:
$$V_d(r) = \frac{\pi^{d/2}}{\Gamma(d/2 + 1)} r^d$$
Key Ratio:
$$\frac{d_{\max} - d_{\min}}{d_{\min}} \to 0 \quad \text{as } d \to \infty$$
  • In high dimensions, all points become nearly equidistant
  • Distance metrics lose discriminative power
  • Need exponentially more data: $n \propto e^d$
  • Most volume of a hypersphere concentrates near the surface
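The distance-concentration effect (the key ratio above shrinking toward zero) can be observed directly with a small simulation on random points in the unit cube:

```python
import numpy as np

# As d grows, (d_max - d_min) / d_min shrinks: all points look equidistant
rng = np.random.default_rng(1)
for d in [2, 100, 10000]:
    X = rng.random((500, d))   # 500 random points in the unit hypercube
    q = rng.random(d)          # one random query point
    dists = np.linalg.norm(X - q, axis=1)
    ratio = (dists.max() - dists.min()) / dists.min()
    print(d, round(ratio, 3))
```

The printed ratio drops sharply as $d$ increases, which is exactly why the nearest neighbor stops being meaningfully "nearest" in high dimensions.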

Solutions:

  • PCA or t-SNE for dimensionality reduction
  • Feature selection to remove irrelevant features
  • Use Manhattan distance (more robust in high-$d$)

Efficient Search

Brute Force:
$O(n \cdot d)$ per query
KD-Tree:
$O(d \cdot \log n)$ average per query
KD-Tree Build:
$O(n \cdot d \cdot \log n)$
Ball Tree:
$O(d \cdot \log n)$ average per query
  • Brute Force: best for small $n$ or high $d$ (above ~20)
  • KD-Tree: partitions space by axis-aligned splits; degrades in high dimensions
  • Ball Tree: uses hypersphere partitions; better than KD-Tree for $d > 20$
  • LSH (Locality-Sensitive Hashing): approximate NN with sublinear query time
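A compact KD-tree nearest-neighbor sketch, showing the axis-aligned splits and branch pruning (simplified for illustration; real implementations add leaf buckets and $k > 1$ support):

```python
import numpy as np

def build_kdtree(points, depth=0):
    # Split on alternating axes at the median (axis-aligned partitions)
    if len(points) == 0:
        return None
    axis = depth % points.shape[1]
    points = points[points[:, axis].argsort()]
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def kdtree_nearest(node, query, best=None):
    # Depth-first search; prune branches farther than the current best
    if node is None:
        return best
    if best is None or (np.linalg.norm(node["point"] - query)
                        < np.linalg.norm(best - query)):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ("left", "right") if diff < 0 else ("right", "left")
    best = kdtree_nearest(node[near], query, best)
    if abs(diff) < np.linalg.norm(best - query):  # far side may hold a closer point
        best = kdtree_nearest(node[far], query, best)
    return best

rng = np.random.default_rng(2)
X = rng.random((200, 2))
q = np.array([0.5, 0.5])
tree = build_kdtree(X)
nearest = kdtree_nearest(tree, q)
brute = X[np.argmin(np.linalg.norm(X - q, axis=1))]
print(np.allclose(nearest, brute))  # tree search agrees with brute force
```

In low dimensions most branches are pruned, giving the $O(\log n)$ average query; in high dimensions almost nothing can be pruned, which is why the tree degrades toward brute force.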

Advantages & Limitations

Advantages:

  • No training phase (lazy learner) - $O(1)$ training
  • Naturally handles multi-class classification
  • Simple, intuitive, and easy to interpret
  • Non-parametric: makes no assumptions on data distribution
  • Adapts to any decision boundary shape

Limitations:

  • Slow prediction: $O(n \cdot d)$ per query (brute force)
  • High memory usage - stores entire training set
  • Sensitive to irrelevant and redundant features
  • Struggles with imbalanced datasets
  • Performance degrades in high dimensions

Common Mistakes

  • Not scaling features before computing distances
  • Choosing $K$ without cross-validation
  • Using even $K$ for binary classification (risks voting ties)
  • Ignoring the curse of dimensionality with many features
  • Not removing irrelevant or noisy features
  • Using Euclidean distance on categorical variables
  • Applying KNN to very large datasets without approximate NN
  • Forgetting to use the same scaler on train and test data

Interview Quick-Fire

Q: Why is KNN called a "lazy learner"?

A: Because it performs no training. It simply stores the dataset and defers all computation to prediction time, unlike eager learners (e.g., SVM, logistic regression) that build a model during training.

Q: How does KNN handle ties in classification?

A: Common strategies include using an odd $K$, applying distance-weighted voting so closer neighbors break the tie, or randomly selecting among the tied classes.

Q: What is the time complexity of KNN prediction?

A: Brute force is $O(n \cdot d)$ per query. KD-Trees and Ball Trees reduce this to $O(d \cdot \log n)$ on average. LSH provides approximate nearest neighbors in sublinear query time.

Q: Why is feature scaling essential for KNN?

A: KNN relies on distance. If features have different scales, those with larger ranges dominate the distance. Scaling ensures all features contribute equally.

Q: KNN vs. Naive Bayes - when to choose which?

A: Choose Naive Bayes for high-dimensional sparse data (text), fast training, and when features are roughly independent. Choose KNN for low-dimensional data with complex, non-linear boundaries.

Q: How does the curse of dimensionality affect KNN?

A: In high dimensions, distances between all points converge, making nearest neighbors no more meaningful than random points. This degrades KNN's performance significantly.

Q: Is KNN parametric or non-parametric?

A: Non-parametric. It makes no assumptions about the underlying data distribution and the model complexity grows with the training set size.

Q: How do you handle imbalanced classes in KNN?

A: Use distance-weighted voting so closer neighbors have more influence, oversample the minority class, undersample the majority class, or use algorithms like SMOTE before applying KNN.
