
DBSCAN (Density-Based Clustering)

From epsilon-neighborhoods to density-reachable clusters. A comprehensive deep dive into the most influential density-based clustering algorithm in machine learning.

Contents

1. Historical Intuition
2. Core Intuition
3. Mathematical Foundation
4. Core, Border & Noise
5. The DBSCAN Algorithm
6. Choosing Eps & MinPts
7. High-Dimensional Data
8. DBSCAN Variants
9. Advantages & Limitations
10. Applications
11. Python Code
01

Historical Intuition

How a groundbreaking KDD paper redefined clustering by thinking in terms of density rather than distance to centroids.

Ester, Kriegel, Sander & Xu (KDD 1996)

The story of DBSCAN begins in 1996 at the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). Martin Ester, Hans-Peter Kriegel, Joerg Sander, and Xiaowei Xu published a landmark paper titled "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise".

At the time, clustering was dominated by partitioning methods like K-Means and hierarchical methods like agglomerative clustering. These algorithms assumed that clusters were spherical (or at least convex) and required the user to specify the number of clusters in advance. The authors recognized that real-world data -- geographic locations, customer behavior, molecular structures -- often contains clusters of arbitrary shape embedded in noise.

The Density Insight

The key insight behind DBSCAN was deceptively simple: clusters are dense regions of points separated by sparse regions. Instead of defining a cluster by its center (as K-Means does) or by a hierarchical merge tree, DBSCAN defines clusters by the local density of data points. If a region of the feature space has enough points packed closely together, it forms a cluster. Points in sparse regions are classified as noise.

This shift in perspective -- from centroid-based to density-based thinking -- opened the door to discovering clusters of any shape, from crescents and rings to complex intertwined structures that K-Means could never capture.

The Evolution of Density-Based Clustering

From the original DBSCAN paper to modern extensions, density-based methods have had a transformative impact on clustering:

1996
Ester et al. publish DBSCAN at KDD -- the first practical density-based clustering algorithm
1998
Hinneburg & Keim propose DENCLUE -- density estimation using kernel functions for smoother cluster boundaries
1999
Ankerst et al. introduce OPTICS -- an ordering-based extension that handles varying densities
2013
Campello et al. publish HDBSCAN -- a hierarchical extension that automatically selects cluster densities
2014
DBSCAN receives the KDD Test of Time Award, recognizing its lasting influence on the field
Today
DBSCAN and HDBSCAN remain go-to algorithms for spatial data, anomaly detection, and exploratory analysis
02

Core Intuition

Why defining clusters by local density rather than centroids is a game-changer for real-world data.

Density Reachability -- The Key Concept

Imagine scattering a handful of rice grains on a table. Where the grains pile up densely, you can identify distinct clusters. Between those clusters, the table is mostly bare -- that is the noise. DBSCAN formalizes this intuition: a cluster is a maximal set of points that are density-connected to each other.

Two points belong to the same cluster if you can "walk" from one to the other through a chain of dense neighborhoods, never stepping into a sparse void. This is the concept of density-reachability, and it allows DBSCAN to discover clusters of any shape -- crescents, spirals, rings, or any arbitrary geometry.

Why K-Means Fails on Non-Spherical Data

Centroid Bias

K-Means assigns each point to the nearest centroid, inherently assuming convex, roughly spherical clusters. It cannot discover crescents, rings, or interlocked shapes.

Fixed K Problem

K-Means requires you to specify the number of clusters upfront. If you guess wrong, you get poor results. DBSCAN discovers the number of clusters automatically from the data.

No Noise Handling

K-Means assigns every point to a cluster, even outliers and noise. DBSCAN explicitly identifies noise points and excludes them from all clusters.

The Two Parameters of DBSCAN

DBSCAN requires only two parameters, both with clear physical interpretations:


Epsilon (ε) -- The Neighborhood Radius

Epsilon defines the radius of the neighborhood around each point. Two points are considered neighbors if the distance between them is at most ε. Think of it as drawing a circle (or hypersphere in higher dimensions) of radius ε around each data point.


MinPts -- The Minimum Density Threshold

MinPts specifies the minimum number of points required within the ε-neighborhood for a point to be considered a core point. This parameter controls the minimum density that constitutes a cluster. Higher MinPts values require denser regions to form clusters.

Fewer Parameters Than K-Means

While K-Means requires choosing K (which directly determines the output structure), DBSCAN's two parameters (ε and MinPts) describe the physical density of clusters. This makes DBSCAN more robust -- you are describing what a cluster "looks like" rather than dictating how many clusters exist.

03

Mathematical Foundation

The formal definitions of epsilon-neighborhoods, density-reachability, and density-connectivity.

The Epsilon-Neighborhood

The fundamental building block of DBSCAN is the epsilon-neighborhood of a point. For a point \( p \) in a dataset \( D \), the epsilon-neighborhood is defined as:

\[ N_\varepsilon(p) = \{ q \in D \mid d(p, q) \leq \varepsilon \} \]

Where \( d(p, q) \) is the distance function (typically Euclidean distance). The epsilon-neighborhood contains all points in the dataset that lie within a distance of \( \varepsilon \) from point \( p \), including \( p \) itself. The size of this neighborhood, \( |N_\varepsilon(p)| \), is the local density measure used by DBSCAN.
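This definition can be computed directly. A minimal sketch (the toy dataset is ours, chosen only to illustrate the count):

```python
import numpy as np

def eps_neighborhood(X, p_idx, eps):
    """Indices of all points within distance eps of point p (including p itself)."""
    dists = np.linalg.norm(X - X[p_idx], axis=1)  # Euclidean distance to every point
    return np.where(dists <= eps)[0]

# Three points huddled near the origin, one far away
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [2.0, 2.0]])
neighbors = eps_neighborhood(X, 0, eps=0.5)
print(len(neighbors))  # |N_eps(p)|, the local density measure -> 3
```

The size of the returned index set is exactly the \( |N_\varepsilon(p)| \) used everywhere below.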

[Interactive visualization in the original page: adjusting an ε slider grows or shrinks the neighborhood radius around each point and changes how many points count as neighbors.]

Directly Density-Reachable

A point \( q \) is directly density-reachable from a point \( p \) with respect to \( \varepsilon \) and MinPts if:

\[ q \in N_\varepsilon(p) \quad \text{and} \quad |N_\varepsilon(p)| \geq \text{MinPts} \]

In other words, \( q \) is in the ε-neighborhood of \( p \), and \( p \) is a core point (it has at least MinPts neighbors). Note that this relationship is not symmetric in general: if \( p \) is a core point and \( q \) is in its neighborhood, \( q \) is directly density-reachable from \( p \). But \( p \) may not be directly density-reachable from \( q \) (if \( q \) is not a core point).
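The asymmetry is easy to verify in code. A small sketch (the toy data and the helper name are ours):

```python
import numpy as np

def directly_density_reachable(X, p_idx, q_idx, eps, min_pts):
    """q is directly density-reachable from p iff q is in N_eps(p) AND p is a core point."""
    dists = np.linalg.norm(X - X[p_idx], axis=1)
    neighborhood = np.where(dists <= eps)[0]
    is_core = len(neighborhood) >= min_pts
    return is_core and q_idx in neighborhood

# Four points in a dense patch, plus one fringe point at (0.9, 0)
X = np.array([[0.0, 0.0], [0.3, 0.0], [-0.3, 0.0], [0.0, 0.3], [0.9, 0.0]])
print(directly_density_reachable(X, 1, 4, eps=0.65, min_pts=4))  # True: point 1 is core
print(directly_density_reachable(X, 4, 1, eps=0.65, min_pts=4))  # False: point 4 is not core
```

Point 4 sits inside the neighborhood of core point 1, but its own neighborhood is too sparse for it to be core, so the relation holds in one direction only.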

Density-Reachable

A point \( q \) is density-reachable from a point \( p \) if there exists a chain of points \( p_1, p_2, \ldots, p_n \) where \( p_1 = p \) and \( p_n = q \), such that each \( p_{i+1} \) is directly density-reachable from \( p_i \):

\[ p = p_1 \rightarrow p_2 \rightarrow \cdots \rightarrow p_n = q \]

Where each \( p_i \) (except possibly \( p_n \)) is a core point. This allows DBSCAN to "chain" through dense regions, connecting points that may be far apart in absolute distance but are linked through a sequence of dense neighborhoods.

Density-Connected

Two points \( p \) and \( q \) are density-connected if there exists a point \( o \) such that both \( p \) and \( q \) are density-reachable from \( o \):

\[ \exists \, o \in D : p \xleftarrow{\text{density-reachable}} o \xrightarrow{\text{density-reachable}} q \]

Unlike density-reachability, density-connectivity is symmetric. A DBSCAN cluster is formally defined as a maximal set of density-connected points. This symmetry ensures that if \( p \) is in a cluster, and \( q \) is density-connected to \( p \), then \( q \) is also in the same cluster.

Why Density-Connectivity Matters

Density-connectivity is the formal property that defines cluster membership. Two border points at opposite ends of a cluster may not be density-reachable from each other (since neither is a core point), but they are density-connected through the chain of core points between them. This is what allows DBSCAN to discover elongated and irregularly shaped clusters.

K-Distance Graph for Parameter Selection

The K-distance graph is a key tool for choosing the epsilon parameter. For each point, compute the distance to its k-th nearest neighbor, sort these distances in ascending order, and plot them. The "elbow" in this curve suggests a good value for epsilon.

04

Core, Border & Noise Points

The three categories of points that DBSCAN classifies -- the foundation of density-based cluster structure.

Core Points

A point \( p \) is a core point if its ε-neighborhood contains at least MinPts points (including itself):

\[ |N_\varepsilon(p)| \geq \text{MinPts} \implies p \text{ is a core point} \]

Core points are the backbone of DBSCAN clusters. They sit in the interior of dense regions and are responsible for "growing" the cluster outward. Every cluster must contain at least one core point, and the set of core points within a cluster forms a connected subgraph through mutual direct density-reachability.

Border Points

A point \( p \) is a border point if it is not a core point itself, but falls within the ε-neighborhood of at least one core point:

\[ |N_\varepsilon(p)| < \text{MinPts} \quad \text{and} \quad \exists \, q : p \in N_\varepsilon(q), \; q \text{ is core} \]

Border points are on the periphery of a cluster. They are dense enough to be within reach of a core point but not dense enough to be core points themselves. A border point is assigned to the cluster of the first core point that discovers it during the algorithm's traversal.

Noise Points (Outliers)

A point \( p \) is a noise point (or outlier) if it is neither a core point nor a border point:

\[ |N_\varepsilon(p)| < \text{MinPts} \quad \text{and} \quad \nexists \, q : p \in N_\varepsilon(q), \; q \text{ is core} \]

Noise points live in sparse regions of the feature space, isolated from any dense cluster. They do not belong to any cluster and are labeled as -1 in most implementations. This built-in outlier detection is one of DBSCAN's greatest strengths -- unlike K-Means, which forces every point into a cluster, DBSCAN acknowledges that some points simply do not belong.

Noise Points Can Become Border Points

During the algorithm's execution, a point initially classified as noise may later be discovered to lie within the ε-neighborhood of a core point that is processed subsequently. In this case, the point's classification is updated from noise to border, and it is added to the corresponding cluster. This is why DBSCAN processes all points and re-evaluates classifications as clusters grow.
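The three definitions above can be implemented in a few lines with scikit-learn's NearestNeighbors (a sketch; the dataset, eps, and MinPts values are illustrative):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

def classify_points(X, eps, min_pts):
    """Label each point 'core', 'border', or 'noise' per the DBSCAN definitions."""
    nn = NearestNeighbors(radius=eps).fit(X)
    # radius_neighbors on the training data includes each query point itself
    neighborhoods = nn.radius_neighbors(X, return_distance=False)
    is_core = np.array([len(nb) >= min_pts for nb in neighborhoods])
    labels = np.full(len(X), 'noise', dtype=object)
    labels[is_core] = 'core'
    # Border: not core, but inside the eps-neighborhood of at least one core point
    for i in np.where(~is_core)[0]:
        if is_core[neighborhoods[i]].any():
            labels[i] = 'border'
    return labels

X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)
labels = classify_points(X, eps=0.15, min_pts=5)
for kind in ('core', 'border', 'noise'):
    print(kind, (labels == kind).sum())
```

Note that this classification depends only on ε and MinPts, not on the order in which points are processed -- only the cluster assignment of border points is order-dependent.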

[Interactive visualization in the original page: adjusting a MinPts slider reclassifies points as core (solid fill), border (half-filled), or noise (hollow). Larger MinPts requires higher density to form cores.]

05

The DBSCAN Algorithm

A step-by-step walkthrough of how DBSCAN discovers clusters through depth-first density exploration.

Algorithm Overview

DBSCAN works by iterating over every unvisited point in the dataset and attempting to grow clusters outward from core points. The algorithm is elegantly simple:

1

Pick an Unvisited Point

Select any unvisited point \( p \) from the dataset. Mark it as visited. Compute its ε-neighborhood \( N_\varepsilon(p) \).

2

Check Core Point Status

If \( |N_\varepsilon(p)| \geq \text{MinPts} \), then \( p \) is a core point. Create a new cluster and add \( p \) to it. Proceed to expand the cluster.

3

Expand the Cluster (DFS)

For each point \( q \) in \( N_\varepsilon(p) \): if \( q \) is unvisited, mark it visited and compute \( N_\varepsilon(q) \). If \( q \) is also a core point (\( |N_\varepsilon(q)| \geq \text{MinPts} \)), add all of \( q \)'s neighbors to the expansion queue. Add \( q \) to the current cluster (if not already assigned to one).

4

Handle Non-Core Points

If \( |N_\varepsilon(p)| < \text{MinPts} \), label \( p \) as noise (for now). It may later be reclassified as a border point if it falls within the ε-neighborhood of a core point discovered later.

5

Repeat Until All Points Are Visited

Continue picking unvisited points and expanding clusters until every point in the dataset has been visited. The result: a set of clusters and a set of noise points.

Pseudocode

DBSCAN(D, eps, MinPts):
    C = 0                          # Cluster counter
    for each point p in D:
        if p is visited:
            continue
        mark p as visited
        N = regionQuery(p, eps)    # Find eps-neighborhood
        if |N| < MinPts:
            mark p as NOISE
        else:
            C = C + 1              # Start new cluster
            expandCluster(p, N, C, eps, MinPts)

expandCluster(p, N, C, eps, MinPts):
    add p to cluster C
    for each point q in N:
        if q is not visited:
            mark q as visited
            N' = regionQuery(q, eps)
            if |N'| >= MinPts:
                N = N union N'     # Add new neighbors
        if q is not yet a member of any cluster:
            add q to cluster C
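The pseudocode above translates almost line-for-line into Python. A minimal O(n²) sketch without a spatial index (variable names and the toy data are ours):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN; returns labels 0, 1, ... for clusters and -1 for noise."""
    n = len(X)
    labels = np.full(n, -2)  # -2 = unvisited, -1 = noise
    def region_query(i):
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]
    cluster = -1
    for p in range(n):
        if labels[p] != -2:          # already visited
            continue
        neighbors = region_query(p)
        if len(neighbors) < min_pts:
            labels[p] = -1           # noise for now (may become border later)
            continue
        cluster += 1                 # p is core: start a new cluster
        labels[p] = cluster
        seeds = list(neighbors)
        while seeds:                 # expand the cluster
            q = seeds.pop()
            if labels[q] == -1:
                labels[q] = cluster  # reclassify noise as border
            if labels[q] != -2:
                continue             # already assigned; do not expand from border
            labels[q] = cluster
            q_neighbors = region_query(q)
            if len(q_neighbors) >= min_pts:   # q is also core: keep growing
                seeds.extend(q_neighbors)
    return labels

# Two well-separated Gaussian blobs
X = np.vstack([np.random.RandomState(0).normal(0, 0.2, (50, 2)),
               np.random.RandomState(1).normal(3, 0.2, (50, 2))])
labels = dbscan(X, eps=0.5, min_pts=5)
print(sorted(set(labels)))
```

The `labels[q] == -1` branch is the noise-to-border reclassification described in the previous section.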

Time Complexity

The regionQuery function is called once for each point. Without indexing, each call takes \( O(n) \), giving \( O(n^2) \) total. With a spatial index (KD-tree, R-tree, or ball tree), each regionQuery runs in \( O(\log n) \) on average, reducing the overall complexity to \( O(n \log n) \). For high-dimensional data, the spatial index degrades and the algorithm approaches \( O(n^2) \) again.

[Interactive visualization in the original page: a step-by-step DBSCAN execution, processing one point at a time. Colors mark different clusters; hollow points mark noise.]

06

Choosing Epsilon & MinPts

Practical strategies for selecting the two most critical parameters in DBSCAN.

The K-Distance Graph Method

The most widely used method for choosing \( \varepsilon \) is the k-distance graph (also called the elbow method for DBSCAN). The procedure is:

1

Choose k = MinPts

Set \( k \) equal to the MinPts value you plan to use (e.g., \( k = 4 \)).

2

Compute k-th Nearest Neighbor Distances

For each point in the dataset, compute the distance to its \( k \)-th nearest neighbor. This gives you an array of \( n \) distance values.

3

Sort and Plot

Sort these distances in ascending order and plot them. The x-axis is the point index (sorted) and the y-axis is the k-distance.

4

Find the Elbow

Look for a sharp bend (elbow) in the curve. The y-value at this elbow is a good choice for \( \varepsilon \). Points below the elbow are in dense regions (cluster members); points above are in sparse regions (noise).

Choosing MinPts

The MinPts parameter controls the minimum density required to form a cluster. Here are practical guidelines:

  • Rule of thumb: \( \text{MinPts} \geq d + 1 \), where \( d \) is the dimensionality of the data. This ensures that the density criterion is meaningful in the feature space.
  • For 2D data: \( \text{MinPts} = 4 \) is a common default and works well for most datasets.
  • For higher dimensions: Increase MinPts proportionally. For \( d = 10 \), try \( \text{MinPts} = 2 \times d = 20 \).
  • Noisy data: Increase MinPts to require higher density for cluster formation, which makes the algorithm more robust to noise.
  • Small datasets: Use smaller MinPts (3-5) to avoid marking too many points as noise.

The Epsilon-MinPts Interaction

Epsilon and MinPts interact strongly. A large ε with a small MinPts can merge distinct clusters into one giant cluster. A small ε with a large MinPts can classify most points as noise. Always tune them together and validate the results visually (for 2D/3D data) or with clustering metrics like the silhouette score.
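A small grid-search sketch illustrating this interaction (the parameter ranges are illustrative; in practice you would center them on the k-distance elbow):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_moons(n_samples=400, noise=0.08, random_state=42)
X = StandardScaler().fit_transform(X)

# Tune eps and min_samples together; score only the non-noise points
results = {}
for eps in (0.1, 0.2, 0.3):
    for min_pts in (4, 8, 16):
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
        mask = labels != -1
        n_clusters = len(set(labels[mask]))
        score = (silhouette_score(X[mask], labels[mask])
                 if n_clusters > 1 and mask.sum() > n_clusters else float('nan'))
        results[(eps, min_pts)] = (n_clusters, int((~mask).sum()), score)
        print(f"eps={eps:.1f} min_pts={min_pts:2d} clusters={n_clusters} "
              f"noise={(~mask).sum():3d} silhouette={score:.3f}")
```

The table this prints makes the trade-off visible at a glance: small ε with large MinPts pushes most points into noise, while large ε with small MinPts collapses the two moons into fewer clusters.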

07

Handling High-Dimensional Data

Why DBSCAN struggles in high dimensions and what alternatives exist.

The Curse of Dimensionality

As the number of dimensions increases, the concept of "density" becomes increasingly meaningless. In high-dimensional spaces, distances between all pairs of points tend to converge -- the difference between the nearest neighbor and the farthest neighbor shrinks relative to the mean distance. This phenomenon, known as distance concentration, undermines the core assumption of DBSCAN.

\[ \lim_{d \to \infty} \frac{\text{dist}_{\max} - \text{dist}_{\min}}{\text{dist}_{\min}} \to 0 \]

When all pairwise distances are approximately equal, the ε-neighborhood either contains almost all points (too large) or almost none (too small). There is no "sweet spot" that distinguishes dense regions from sparse ones.
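This concentration effect is easy to observe empirically. A small sketch on uniform random data (the function name is ours):

```python
import numpy as np

def relative_contrast(d, n=2000, seed=0):
    """(d_max - d_min) / d_min for distances from one random query to n uniform points."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, d))        # uniform points in the d-dimensional unit hypercube
    q = rng.random(d)
    dists = np.linalg.norm(X - q, axis=1)
    return (dists.max() - dists.min()) / dists.min()

for d in (2, 10, 100, 1000):
    print(f"d={d:5d}  relative contrast={relative_contrast(d):.3f}")
```

As d grows, the printed contrast collapses toward zero -- exactly the limit above -- which is why a single ε cannot separate dense from sparse regions in high dimensions.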

Strategies for High-Dimensional Data

Dimensionality Reduction

Apply PCA, t-SNE, or UMAP to reduce the data to 2-10 dimensions before running DBSCAN. This restores meaningful distance relationships and makes density-based clustering feasible.
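As a sketch of this workflow (the digits dataset and the eps value are illustrative; eps would normally be tuned with a k-distance plot on the reduced data):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 64-dimensional digits data: reduce to 10 dimensions before clustering
X, _ = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)
X_low = PCA(n_components=10, random_state=42).fit_transform(X)

labels = DBSCAN(eps=3.0, min_samples=10).fit_predict(X_low)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Clusters: {n_clusters}, noise: {(labels == -1).sum()}")
```

Running DBSCAN directly on the 64-dimensional input would suffer from the distance concentration described above; after PCA the ε-neighborhood becomes meaningful again.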

Subspace Clustering

Methods like SubClu discover clusters in subspaces of the full feature space, where density differences remain meaningful even when the full space is too high-dimensional.

OPTICS as an Alternative

OPTICS (Ordering Points To Identify the Clustering Structure) is more robust to parameter sensitivity than DBSCAN. It produces a reachability plot that can reveal hierarchical cluster structure even when a single ε value fails.

Feature Scaling and Distance Metrics

DBSCAN is sensitive to feature scaling. Since the ε parameter defines an absolute distance threshold, features with larger scales will dominate the distance calculation. Always standardize or normalize features before applying DBSCAN:

\[ x_{\text{scaled}} = \frac{x - \mu}{\sigma} \]

For non-Euclidean data (e.g., categorical features, time series, text), consider using appropriate distance metrics:

  • Manhattan distance: For grid-like or city-block data
  • Cosine distance: For text and sparse high-dimensional data
  • Haversine distance: For geographic (latitude/longitude) data
  • Precomputed distance matrix: For custom similarity measures
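For example, geographic clustering with the haversine metric (the coordinates and the 1 km threshold are illustrative; scikit-learn's haversine metric expects latitude/longitude in radians, with eps expressed as distance divided by the Earth's radius):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical (lat, lon) points in degrees: two tight groups plus one isolated point
coords_deg = np.array([
    [52.5200, 13.4050], [52.5205, 13.4049], [52.5190, 13.4070],   # Berlin area
    [48.8566, 2.3522], [48.8570, 2.3530], [48.8560, 2.3510],      # Paris area
    [0.0, 0.0],                                                   # isolated point
])
coords_rad = np.radians(coords_deg)   # haversine expects radians

earth_radius_km = 6371.0
eps_km = 1.0                          # cluster points within ~1 km of each other
db = DBSCAN(eps=eps_km / earth_radius_km, min_samples=3,
            metric='haversine', algorithm='ball_tree')
labels = db.fit_predict(coords_rad)
print(labels)  # two clusters plus one noise point
```

Because ε is specified in the metric's own units (radians of arc here), scaling issues disappear as long as you convert the threshold consistently.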
08

DBSCAN Variants

Modern extensions that address DBSCAN's limitations -- varying density, hierarchical structure, and spatio-temporal data.

HDBSCAN -- Hierarchical DBSCAN

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is the most significant extension of DBSCAN. Published by Campello, Moulavi, and Sander in 2013, it addresses DBSCAN's biggest weakness: the inability to handle clusters of varying density.

HDBSCAN works by running DBSCAN at all possible ε values simultaneously, building a cluster hierarchy (dendrogram), and then extracting the most stable clusters using a persistence-based method. This means:

  • No need to choose ε -- only min_cluster_size is required
  • Discovers clusters of different densities in the same dataset
  • Provides a cluster persistence score indicating how "real" each cluster is
  • Still identifies noise points that do not belong to any cluster

OPTICS -- Ordering-Based Clustering

OPTICS (Ordering Points To Identify the Clustering Structure) was proposed by Ankerst et al. in 1999. Instead of producing a flat clustering like DBSCAN, OPTICS produces an ordering of points along with their reachability distances, which can be visualized as a reachability plot.

The reachability plot is a powerful visualization: valleys in the plot correspond to clusters, and the depth of a valley indicates the cluster's density. By choosing different horizontal thresholds on the reachability plot, you can extract clusterings at different density levels -- effectively getting multiple DBSCAN runs from a single OPTICS execution.
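A minimal sketch with scikit-learn's OPTICS (the dataset is illustrative; the reachability plot itself would normally be drawn with matplotlib from the arrays computed here):

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Two blobs of different density -- a case where a single DBSCAN eps struggles
X, _ = make_blobs(n_samples=[200, 200], centers=[[0, 0], [5, 5]],
                  cluster_std=[0.3, 1.0], random_state=42)

optics = OPTICS(min_samples=10)
optics.fit(X)

# Reachability distances in cluster order: valleys correspond to clusters
reach = optics.reachability_[optics.ordering_]
labels = optics.labels_
print(f"Clusters: {len(set(labels)) - (1 if -1 in labels else 0)}")
```

Plotting `reach` against its index produces the reachability plot described above: the tight blob appears as a deep valley, the loose blob as a shallower one.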

ST-DBSCAN -- Spatio-Temporal Extension

ST-DBSCAN extends DBSCAN for data with both spatial and temporal dimensions. It uses two epsilon values -- one for spatial distance and one for temporal distance -- allowing it to discover clusters that are close in both space and time. This is particularly useful for:

  • Trajectory analysis (e.g., GPS tracking, vehicle movement)
  • Event detection in sensor networks
  • Crime pattern analysis (where and when crimes cluster)
  • Social media event detection (geospatial + temporal posts)
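The core change ST-DBSCAN makes is the neighborhood test itself. A sketch of the dual-threshold neighborhood (function and variable names are ours, not from the ST-DBSCAN paper; the rest of the algorithm proceeds as in standard DBSCAN):

```python
import numpy as np

def st_neighborhood(points, times, i, eps_spatial, eps_temporal):
    """Indices of points within eps_spatial in space AND eps_temporal in time of point i."""
    spatial_ok = np.linalg.norm(points - points[i], axis=1) <= eps_spatial
    temporal_ok = np.abs(times - times[i]) <= eps_temporal
    return np.where(spatial_ok & temporal_ok)[0]

points = np.array([[0.0, 0.0], [0.1, 0.1], [0.1, 0.0], [0.05, 0.05]])
times = np.array([0.0, 1.0, 50.0, 2.0])   # e.g. hours since some reference event
# Point 2 is spatially close but temporally distant, so it is excluded
print(st_neighborhood(points, times, 0, eps_spatial=0.5, eps_temporal=5.0))  # [0 1 3]
```

Two events at the same location but days apart thus never end up in the same neighborhood, which is exactly the behavior trajectory and event-detection applications need.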

When to Use Each Variant


DBSCAN

Use when clusters have roughly uniform density, you have a good estimate of ε, and you need a simple, fast algorithm. Best for spatial data with clear density separation.


HDBSCAN

Use when clusters have varying densities, you want to avoid tuning ε, or you need cluster stability scores. Generally the best default choice for exploratory clustering.


OPTICS

Use when you want to visualize the full cluster hierarchy through reachability plots, or when you need to extract clusters at multiple density levels from a single run.

09

Advantages & Limitations

A balanced assessment of where DBSCAN shines and where it struggles.

Advantages

No K Required

Unlike K-Means, DBSCAN does not require you to specify the number of clusters. The number of clusters emerges naturally from the data's density structure.

Arbitrary Cluster Shapes

DBSCAN can discover clusters of any shape -- crescents, rings, spirals, elongated blobs -- because it defines clusters by density connectivity, not by distance to a centroid.

Built-in Noise Detection

Points in sparse regions are automatically classified as noise (outliers). This is invaluable for real-world data that always contains anomalies and irrelevant observations.

Robust to Outliers

Since outliers are explicitly labeled as noise, they do not distort the cluster boundaries or pull centroids away from the true cluster center (as happens with K-Means).

Deterministic Results

Given fixed parameters, DBSCAN produces the same core points and the same number of clusters every time; only border points reachable from more than one cluster depend on the processing order. K-Means results, by contrast, vary with random initialization.

Limitations

Varying Density Problem

DBSCAN uses a single global ε, so it struggles when clusters have different densities. A loose cluster may be merged with noise, while a tight cluster may be split. Use HDBSCAN for varying densities.

Parameter Sensitivity

The choice of ε and MinPts significantly affects results. A small change in ε can dramatically alter the number and size of discovered clusters.

High-Dimensional Struggle

In high-dimensional spaces, distance concentration makes the ε-neighborhood concept unreliable. Dimensionality reduction is typically needed before applying DBSCAN.

Scaling to Large Datasets

Without a spatial index, DBSCAN requires O(n^2) time, and implementations that precompute the full pairwise distance matrix also need O(n^2) memory. Even with indexing, very large datasets (millions of points) can be challenging. Consider HDBSCAN or sampling-based approximations.

DBSCAN vs K-Means Comparison

[Chart in the original page: a comparison of DBSCAN and K-Means across multiple evaluation criteria, where higher scores indicate better performance on each criterion.]

10

Practical Applications

Real-world domains where DBSCAN's unique strengths make it the algorithm of choice.

Where DBSCAN Excels

Geographic Clustering

Clustering GPS coordinates, store locations, crime hotspots, or earthquake epicenters. DBSCAN naturally handles the irregular spatial distribution of geographic data and ignores isolated outlier locations.

Anomaly Detection

Since DBSCAN labels low-density points as noise, it naturally performs anomaly detection. Used in fraud detection, network intrusion detection, and manufacturing defect identification.

Image Segmentation

Grouping pixels by color and spatial proximity to segment images into meaningful regions. DBSCAN can identify objects of arbitrary shape without pre-specifying the number of segments.

Customer Behavior

Segmenting customers based on purchasing patterns, browsing behavior, or engagement metrics. DBSCAN discovers natural customer segments without forcing every customer into a group.

Astronomy

Identifying galaxy clusters, star-forming regions, and cosmic structures from astronomical surveys. The irregular shapes and noise in astronomical data make DBSCAN ideal.

Sensor Data Analysis

Clustering sensor readings to detect events, group similar operational states, or identify fault patterns in IoT and industrial monitoring systems.

DBSCAN vs K-Means: Visual Comparison

[Interactive visualization in the original page: a side-by-side comparison of DBSCAN and K-Means on non-spherical data, with adjustable ε and K.]

11

Python Implementation

From basic clustering to complete pipelines with parameter selection and evaluation.

Basic DBSCAN Clustering

The simplest way to get started with DBSCAN in Python using scikit-learn. We use the make_moons dataset to demonstrate non-spherical cluster discovery.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
import numpy as np

# Generate non-spherical data
X, y_true = make_moons(n_samples=500, noise=0.1, random_state=42)

# IMPORTANT: Always scale features for DBSCAN
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Run DBSCAN
db = DBSCAN(eps=0.3, min_samples=5)
labels = db.fit_predict(X_scaled)

# Analyze results
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)

print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")
print(f"Core samples: {len(db.core_sample_indices_)}")

# Evaluate (exclude noise for silhouette)
mask = labels != -1
if len(set(labels[mask])) > 1:
    score = silhouette_score(X_scaled[mask], labels[mask])
    print(f"Silhouette score: {score:.4f}")

Choosing Epsilon with the K-Distance Graph

Use the nearest neighbor distances to find the optimal epsilon value by identifying the elbow in the k-distance plot.

from sklearn.neighbors import NearestNeighbors
import matplotlib.pyplot as plt
import numpy as np

# Compute k-nearest neighbor distances
k = 5  # Typically set to MinPts
# n_neighbors = k + 1 because kneighbors() on the training data
# returns each query point itself as its own first neighbor
nn = NearestNeighbors(n_neighbors=k + 1)
nn.fit(X_scaled)
distances, indices = nn.kneighbors(X_scaled)

# Sort the distances to the k-th nearest neighbor (excluding self)
k_distances = np.sort(distances[:, k])

# Plot the k-distance graph
plt.figure(figsize=(10, 6))
plt.plot(k_distances)
plt.xlabel('Points (sorted by distance)')
plt.ylabel(f'{k}-th Nearest Neighbor Distance')
plt.title('K-Distance Graph for Epsilon Selection')
plt.axhline(y=0.3, color='r', linestyle='--',
            label='Suggested eps = 0.3')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Comparing DBSCAN with K-Means

A side-by-side comparison showing how DBSCAN handles non-spherical clusters that K-Means cannot.

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons, make_circles
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score
import numpy as np

# Generate challenging datasets
datasets = {
    'Moons': make_moons(n_samples=500, noise=0.05),
    'Circles': make_circles(n_samples=500, noise=0.05,
                             factor=0.5),
}

for name, (X, y_true) in datasets.items():
    X_scaled = StandardScaler().fit_transform(X)

    # DBSCAN
    db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(
        X_scaled
    )
    db_ari = adjusted_rand_score(y_true, db_labels)

    # K-Means
    km_labels = KMeans(n_clusters=2, random_state=42,
                        n_init=10).fit_predict(X_scaled)
    km_ari = adjusted_rand_score(y_true, km_labels)

    print(f"{name:10s} | DBSCAN ARI: {db_ari:.4f} "
          f"| K-Means ARI: {km_ari:.4f}")

HDBSCAN for Variable Density

When clusters have different densities, HDBSCAN automatically adapts without requiring an epsilon value.

import hdbscan
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import numpy as np

# Create clusters with varying densities
X1, _ = make_blobs(n_samples=300, centers=[[0, 0]],
                    cluster_std=0.3)
X2, _ = make_blobs(n_samples=100, centers=[[3, 3]],
                    cluster_std=1.0)
X3, _ = make_blobs(n_samples=50, centers=[[-3, 3]],
                    cluster_std=0.1)
noise = np.random.uniform(-5, 5, (30, 2))
X = np.vstack([X1, X2, X3, noise])
X_scaled = StandardScaler().fit_transform(X)

# HDBSCAN -- no epsilon needed!
clusterer = hdbscan.HDBSCAN(
    min_cluster_size=15,
    min_samples=5,
    cluster_selection_method='eom'
)
labels = clusterer.fit_predict(X_scaled)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {list(labels).count(-1)}")
print(f"Cluster persistence: "
      f"{clusterer.cluster_persistence_}")

DBSCAN vs HDBSCAN

For most practical applications, HDBSCAN is the recommended choice. It requires fewer parameters (only min_cluster_size), handles varying densities, and provides cluster stability scores. Use standard DBSCAN when you have a good physical intuition for the epsilon value (e.g., geographic data where distances have clear meaning) or when you need maximum speed on large datasets.
