
DBSCAN Cheat Sheet

Everything you need on one page. Perfect for revision, interviews, and quick reference.

Key Formulas

Eps-Neighborhood:
$$N_\varepsilon(p) = \{ q \in D \mid d(p,q) \leq \varepsilon \}$$
Core Point Condition:
$$|N_\varepsilon(p)| \geq \text{MinPts}$$
Directly Density-Reachable:
$$q \in N_\varepsilon(p) \;\text{and}\; |N_\varepsilon(p)| \geq \text{MinPts}$$
Density-Reachable:
$$p_1 \rightarrow p_2 \rightarrow \cdots \rightarrow p_n \;\text{(chain of core points)}$$
Density-Connected:
$$\exists \, o : p \xleftarrow{\text{reachable}} o \xrightarrow{\text{reachable}} q$$
Distance Concentration:
$$\lim_{d \to \infty} \frac{d_{\max} - d_{\min}}{d_{\min}} = 0$$
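The first two definitions translate directly into code. A minimal NumPy sketch (the dataset `X` and the `eps`/`min_pts` values below are illustrative, not from any particular library):

```python
import numpy as np

def eps_neighborhood(X, i, eps):
    """Indices of points within distance eps of X[i] (includes i itself)."""
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.where(dists <= eps)[0]

def is_core(X, i, eps, min_pts):
    """Core point condition: |N_eps(p)| >= MinPts."""
    return len(eps_neighborhood(X, i, eps)) >= min_pts

# Tiny illustrative dataset: a dense triple plus one isolated point
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(is_core(X, 0, eps=0.2, min_pts=3))  # True  (dense region)
print(is_core(X, 3, eps=0.2, min_pts=3))  # False (isolated point)
```

Note that a point counts itself in its own neighborhood, which matters when comparing against MinPts.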

Algorithm Steps

  1. Pick an unvisited point $p$ from the dataset
  2. Compute its $\varepsilon$-neighborhood $N_\varepsilon(p)$
  3. If $|N_\varepsilon(p)| \geq \text{MinPts}$: create a new cluster and expand it via a seed queue
  4. For each neighbor $q$ in the queue: if $q$ is unvisited, visit it and check whether it is a core point too
  5. If $q$ is also core, add its neighbors to the seed queue
  6. If $|N_\varepsilon(p)| < \text{MinPts}$: label $p$ as noise (it may become a border point later)
  7. Repeat until all points are visited

Time complexity: $O(n^2)$ without spatial index, $O(n \log n)$ with KD-tree or ball tree.
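The steps above can be sketched as a short brute-force implementation (no spatial index, so this is the $O(n^2)$ case; it is for illustration only, not a substitute for a library implementation such as scikit-learn's `DBSCAN`):

```python
import numpy as np

NOISE, UNVISITED = -1, None

def region_query(X, i, eps):
    """Brute-force eps-neighborhood: O(n) per query."""
    dists = np.linalg.norm(X - X[i], axis=1)
    return np.where(dists <= eps)[0].tolist()

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN: returns one label per point, -1 for noise."""
    n = len(X)
    labels = [UNVISITED] * n
    cluster = -1
    for p in range(n):
        if labels[p] is not UNVISITED:
            continue
        neighbors = region_query(X, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE              # may be relabeled as border later
            continue
        cluster += 1                       # p is core: start a new cluster
        labels[p] = cluster
        queue = list(neighbors)            # seed queue for expansion
        while queue:
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster        # former noise becomes a border point
            if labels[q] is not UNVISITED:
                continue
            labels[q] = cluster
            q_neighbors = region_query(X, q, eps)
            if len(q_neighbors) >= min_pts:    # q is core too: keep growing
                queue.extend(q_neighbors)
    return labels

# One dense triple plus one isolated point
labels = dbscan(np.array([[0, 0], [0.1, 0], [0, 0.1], [3.0, 3.0]]),
                eps=0.2, min_pts=3)
print(labels)  # [0, 0, 0, -1]
```

The seed queue is what makes expansion stop at border points: their neighbors are never enqueued, so the cluster grows only through core points.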

Point Types

Core Point:
$|N_\varepsilon(p)| \geq \text{MinPts}$ -- interior of dense region, backbone of cluster
Border Point:
$|N_\varepsilon(p)| < \text{MinPts}$ but $p \in N_\varepsilon(q)$ where $q$ is core -- periphery of cluster
Noise Point:
Neither core nor border -- isolated in sparse region, labeled $-1$

A cluster is a maximal set of density-connected points, and every cluster contains at least one core point. A border point that is reachable from several clusters is assigned to whichever cluster discovers it first.
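With scikit-learn (assumed available here), all three point types can be recovered from a fitted model: `core_sample_indices_` lists the core points, label `-1` marks noise, and any remaining clustered point is a border point. The toy coordinates below are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Dense blob (core), one edge point (border), one distant outlier (noise)
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [0.25, 0.1],    # too few neighbors, but within eps of a core point
              [5.0, 5.0]])    # isolated

model = DBSCAN(eps=0.2, min_samples=4).fit(X)
core = set(model.core_sample_indices_)
for i, label in enumerate(model.labels_):
    kind = "core" if i in core else ("noise" if label == -1 else "border")
    print(i, kind)
```

Note that scikit-learn calls MinPts `min_samples`, and it also counts the point itself.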

Parameter Selection

Choosing $\varepsilon$:
K-distance graph: plot sorted $k$-th NN distances, pick the elbow value as $\varepsilon$.
Choosing MinPts:
Rule of thumb: $\text{MinPts} \geq d + 1$ where $d$ = number of dimensions. Default for 2D: $\text{MinPts} = 4$.
Large $\varepsilon$:
Merges clusters, fewer noise points. Risk: distinct clusters join into one.
Small $\varepsilon$:
Splits clusters, more noise points. Risk: valid clusters fragmented.

Always scale features before running DBSCAN: $\varepsilon$ is an absolute distance threshold, so features with larger ranges would otherwise dominate the distance computation and bias the result.
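The k-distance heuristic is easy to compute with scikit-learn's `NearestNeighbors` (assumed available; in practice you would plot the result and read $\varepsilon$ off the elbow):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distances(X, k):
    """Sorted distance from each point to its k-th nearest neighbor."""
    # k + 1 because each point is its own nearest neighbor at distance 0
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    return np.sort(dists[:, -1])

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (50, 2)),
               rng.normal(3, 0.1, (50, 2))])
d = k_distances(X, k=4)   # MinPts = 4 for 2-D data, so use 4-NN distances
# plt.plot(d) and pick eps where the curve bends sharply upward
```

Points inside clusters have small k-distances; noise points sit on the steep tail, and the elbow between the two regimes is the suggested $\varepsilon$.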

DBSCAN vs K-Means

Number of clusters:
DBSCAN: automatic. K-Means: must specify $K$ upfront.
Cluster shapes:
DBSCAN: arbitrary shapes. K-Means: spherical/convex only.
Noise handling:
DBSCAN: labels outliers as noise ($-1$). K-Means: forces all points into clusters.
Determinism:
DBSCAN: deterministic (except border assignment). K-Means: depends on initialization.
Scalability:
DBSCAN: $O(n \log n)$ with a spatial index. K-Means: $O(nKt)$ over $t$ iterations, usually faster for large $n$.

Use DBSCAN when clusters have arbitrary shapes or noise is present. Use K-Means when clusters are spherical, $K$ is known, and you need maximum speed.
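The shape difference shows up clearly on the classic two-moons dataset. A sketch with scikit-learn (the `eps` value is a hand-picked assumption that works for this noise level):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN, KMeans

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# DBSCAN recovers the two crescents; K-Means cuts them with a straight
# boundary between the centroids, splitting each crescent in half.
print("DBSCAN clusters:", len(set(db.labels_) - {-1}))
print("K-Means clusters:", len(set(km.labels_)))
```

Plotting both labelings side by side makes the point immediately: K-Means always partitions the plane into convex cells, no matter the data.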

Variants

HDBSCAN:
Hierarchical extension. Handles varying densities. Only needs min_cluster_size. Best default choice.
OPTICS:
Produces reachability ordering. Visualize clusters at multiple density levels. No fixed $\varepsilon$ needed.
ST-DBSCAN:
Spatio-temporal. Uses separate $\varepsilon$ for space and time. For GPS, trajectory, event data.

When in doubt, use HDBSCAN. It eliminates the need to choose $\varepsilon$ and adapts to clusters of different densities automatically.

Advantages & Limitations

Advantages:

  • No need to specify number of clusters ($K$)
  • Discovers clusters of arbitrary shape
  • Built-in outlier/noise detection
  • Robust to outliers -- they do not distort clusters
  • Deterministic results (core + noise assignment)

Limitations:

  • Struggles with clusters of varying density (single global $\varepsilon$)
  • Sensitive to $\varepsilon$ and MinPts parameter choices
  • Poor performance in high-dimensional spaces (distance concentration)
  • $O(n^2)$ worst case without spatial index
  • Border points may be assigned non-deterministically

Interview Quick-Fire

Q: What is DBSCAN?

A: A density-based clustering algorithm that groups together points in dense regions and marks isolated points as noise. It does not require specifying the number of clusters and can discover clusters of arbitrary shape.

Q: What are core, border, and noise points?

A: Core points have at least MinPts neighbors within radius $\varepsilon$. Border points are within $\varepsilon$ of a core point but have fewer than MinPts neighbors. Noise points are neither -- they are isolated outliers.

Q: How do you choose $\varepsilon$?

A: Use the k-distance graph method: compute each point's distance to its k-th nearest neighbor, sort them, and plot. The elbow in the curve suggests a good $\varepsilon$ value.

Q: DBSCAN vs K-Means -- when to pick DBSCAN?

A: Choose DBSCAN when clusters have arbitrary shapes, when the number of clusters is unknown, when noise/outliers are present, or when you need built-in anomaly detection.

Q: What is HDBSCAN and when to use it?

A: HDBSCAN is a hierarchical extension that handles clusters of varying density. Use it when different clusters have different densities, or when you want to avoid choosing $\varepsilon$. Only min_cluster_size is needed.

Q: Why does DBSCAN struggle in high dimensions?

A: Due to the curse of dimensionality, distances between all pairs of points converge. The $\varepsilon$-neighborhood concept breaks down because there is no clear density distinction between cluster regions and sparse regions.

Q: What is the time complexity of DBSCAN?

A: $O(n^2)$ in the worst case (without spatial indexing). With a KD-tree or ball tree, it improves to $O(n \log n)$ on average, though this degrades in high dimensions.

Q: Why does DBSCAN not require specifying $K$?

A: Because clusters emerge from the density structure of the data. DBSCAN defines clusters as maximal sets of density-connected points. The number and shape of clusters are determined by the data's density distribution, not by a user-specified parameter.
