Note
Go to the end to download the full example code or to run this example in your browser via JupyterLite or Binder.
Demo of HDBSCAN clustering algorithm#
In this demo we will take a look at cluster.HDBSCAN from the
perspective of generalizing the cluster.DBSCAN algorithm.
We’ll compare both algorithms on specific datasets. Finally we’ll evaluate
HDBSCAN’s sensitivity to certain hyperparameters.
We first define a couple utility functions for convenience.
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN, HDBSCAN
from sklearn.datasets import make_blobs
def plot(X, labels, probabilities=None, parameters=None, ground_truth=False, ax=None):
if ax is None:
_, ax = plt.subplots(figsize=(10, 4))
labels = labels if labels is not None else np.ones(X.shape[0])
probabilities = probabilities if probabilities is not None else np.ones(X.shape[0])
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
# The probability of a point belonging to its labeled cluster determines
# the size of its marker
proba_map = {idx: probabilities[idx] for idx in range(len(labels))}
for k, col in zip(unique_labels, colors):
if k == -1:
# Black used for noise.
col = [0, 0, 0, 1]
class_index = (labels == k).nonzero()[0]
for ci in class_index:
ax.plot(
X[ci, 0],
X[ci, 1],
"x" if k == -1 else "o",
markerfacecolor=tuple(col),
markeredgecolor="k",
markersize=4 if k == -1 else 1 + 5 * proba_map[ci],
)
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
preamble = "True" if ground_truth else "Estimated"
title = f"{preamble} number of clusters: {n_clusters_}"
if parameters is not None:
parameters_str = ", ".join(f"{k}={v}" for k, v in parameters.items())
title += f" | {parameters_str}"
ax.set_title(title)
plt.tight_layout()
Generate sample data#
One of the greatest advantages of HDBSCAN over DBSCAN is its out-of-the-box
robustness. It’s especially remarkable on heterogeneous mixtures of data.
Like DBSCAN, it can model arbitrary shapes and distributions, however unlike
DBSCAN it does not require specification of an arbitrary and sensitive
eps hyperparameter.
For example, below we generate a dataset from a mixture of three bi-dimensional and isotropic Gaussian distributions.
centers = [[1, 1], [-1, -1], [1.5, -1.5]]
X, labels_true = make_blobs(
n_samples=750, centers=centers, cluster_std=[0.4, 0.1, 0.75], random_state=0
)
plot(X, labels=labels_true, ground_truth=True)