DBSCAN Cheat Sheet
Everything you need on one page. Perfect for revision, interviews, and quick reference.
Time complexity: $O(n^2)$ without spatial index, $O(n \log n)$ with KD-tree or ball tree.
A cluster is a maximal set of density-connected points. Every cluster contains at least one core point. A border point reachable from more than one cluster is assigned to whichever cluster is discovered first.
Always scale features before running DBSCAN. $\varepsilon$ is an absolute distance threshold, so if features have different ranges, the widest-ranged feature will dominate the distance and bias the results.
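A minimal sketch of the scale-then-cluster workflow using scikit-learn's standard `StandardScaler` and `DBSCAN` APIs (the data here is illustrative); without scaling, the second feature's range in the thousands would swamp the Euclidean distance:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Two tight groups; feature 2 is in the thousands, feature 1 near single digits.
X = np.array([[1.0, 1000.0], [1.1, 1005.0], [0.9, 995.0],
              [10.0, 9000.0], [10.1, 9005.0], [9.9, 8995.0]])

# Standardize each feature to zero mean and unit variance, then cluster.
X_scaled = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X_scaled)
# After scaling, both groups are recovered: labels are [0, 0, 0, 1, 1, 1]
```

Running the same `eps=0.5` on the raw `X` would find no structure at all, since all raw pairwise distances exceed the threshold.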
Use DBSCAN when clusters have arbitrary shapes or noise is present. Use K-Means when clusters are spherical, $K$ is known, and you need maximum speed.
When in doubt, use HDBSCAN. It eliminates the need to choose $\varepsilon$ and adapts to clusters of different densities automatically.
Q: What is DBSCAN?
A: A density-based clustering algorithm that groups together points in dense regions and marks isolated points as noise. It does not require specifying the number of clusters and can discover clusters of arbitrary shape.
Q: What are core, border, and noise points?
A: Core points have at least MinPts neighbors within radius $\varepsilon$. Border points are within $\varepsilon$ of a core point but have fewer than MinPts neighbors. Noise points are neither; they are isolated outliers.
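The three-way classification can be sketched in plain Python (a point is counted in its own $\varepsilon$-neighborhood, the common convention, so MinPts includes the point itself):

```python
import math

def classify(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' per the DBSCAN definitions."""
    # eps-neighborhood of each point, including the point itself.
    neigh = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, n in enumerate(neigh) if len(n) >= min_pts}
    labels = []
    for i in range(len(points)):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neigh[i]):
            labels.append("border")  # near a core point, but not dense itself
        else:
            labels.append("noise")
    return labels

labels = classify([(0, 0), (0, 1), (1, 0), (1, 1), (2.4, 1), (5, 5)],
                  eps=1.5, min_pts=3)
# → ['core', 'core', 'core', 'core', 'border', 'noise']
```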
Q: How do you choose $\varepsilon$?
A: Use the k-distance graph method: compute each point's distance to its k-th nearest neighbor, sort them, and plot. The elbow in the curve suggests a good $\varepsilon$ value.
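The k-distance curve can be computed with nothing but the standard library (`math.dist`, Python 3.8+); a typical choice is to set k from MinPts, though conventions vary on whether the point itself is counted:

```python
import math

def k_distances(points, k):
    """For each point, the distance to its k-th nearest neighbor,
    sorted in descending order: the curve whose elbow suggests eps."""
    dists = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        dists.append(d[k - 1])
    return sorted(dists, reverse=True)

# Two tight clusters plus one outlier: the outlier produces the sharp
# jump at the left of the sorted curve, and the elbow sits just below it.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
curve = k_distances(pts, k=2)
```

In practice this is plotted (e.g. with matplotlib) and the $\varepsilon$ read off at the elbow by eye.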
Q: When should you choose DBSCAN over other clustering algorithms?
A: Choose DBSCAN when clusters have arbitrary shapes, when the number of clusters is unknown, when noise/outliers are present, or when you need built-in anomaly detection.
Q: What is HDBSCAN and when should you use it?
A: HDBSCAN is a hierarchical extension that handles clusters of varying density. Use it when different clusters have different densities, or when you want to avoid choosing $\varepsilon$. Only min_cluster_size is needed.
Q: Why does DBSCAN perform poorly in high-dimensional data?
A: Due to the curse of dimensionality, pairwise distances become nearly uniform. The $\varepsilon$-neighborhood concept breaks down because there is no clear density distinction between cluster regions and sparse regions.
Q: What is the time complexity of DBSCAN?
A: $O(n^2)$ in the worst case (without spatial indexing). With a KD-tree or ball tree, it improves to $O(n \log n)$ on average, though this degrades in high dimensions.
Q: Why does DBSCAN not require specifying the number of clusters?
A: Because clusters emerge from the density structure of the data. DBSCAN defines clusters as maximal sets of density-connected points. The number and shape of clusters are determined by the data's density distribution, not by a user-specified parameter.