
Sahir_Maharaj

Clustering with KMeans, DBSCAN, and UMAP for Data Science in Microsoft Fabric

Every data professional eventually reaches that pivotal point... the one where descriptive analytics no longer satisfy your curiosity. You’ve explored averages, correlations, and distributions, but something inside you knows there’s more to the story. That’s when clustering becomes your key technique - a way to turn data into something that speaks. Clustering is fascinating because it’s both mathematical and intuitive. It groups similar data points together, revealing relationships that aren’t explicitly labeled. Whether it’s customer segmentation, image grouping, or anomaly detection, clustering shows how data naturally organizes itself (without needing you to tell it what to look for).

 

When I first learned clustering, it felt almost magical - feeding in raw, messy data and watching distinct patterns emerge. But as I used it more in real projects, I realized that its true power isn’t in the algorithm - it’s in the interpretation. Understanding why clusters form and what they represent is where insight begins.

 

What you will learn

In this edition, we will explore the art of uncovering hidden patterns in your data using KMeans and DBSCAN. By the time you finish reading, you’ll have a clear sense of how each algorithm thinks, how to decide which one fits your data, and how to interpret the clusters they create. You’ll observe how KMeans brings structure and precision, while DBSCAN adds flexibility and adaptability for messier, real-world data. We’ll also bring in UMAP, a powerful tool that turns complex, high-dimensional data into something you can actually understand.

 

Read Time: 10 minutes

 

Source: Sahir Maharaj (https://sahirmaharaj.com)

 

If you’ve ever tried to make sense of customer data (spending patterns, purchasing behavior, or demographics), you’ve probably wanted a way to group similar people together. That’s where KMeans shines. It’s a simple yet powerful algorithm that partitions data into k groups, or clusters, where each point belongs to the nearest cluster center. Think of it like organizing a party. At first, guests mingle randomly. Then, people gradually form groups based on proximity or shared interests. Each group’s “center” shifts as new people join, until everyone settles into the circle where they belong. That’s essentially what KMeans does - repeatedly adjusting until each cluster is stable. As a data scientist, I’ve found KMeans to be the perfect starting point for exploratory analysis. It’s fast, interpretable, and gives a tangible structure to your data.

 

You can easily extract centroids, assign labels, and calculate distances between points - all of which can become powerful features in downstream analysis. For instance, cluster centroids often represent “typical” behaviors or profiles that help teams design better strategies. However, it’s important to understand KMeans’ assumptions. It works best when clusters are spherical, evenly sized, and similar in density. If one cluster is much larger or denser than another, KMeans tends to split or merge them incorrectly. It also requires you to define k, the number of clusters, beforehand... something that often requires experimentation using the elbow method or silhouette scores. Despite these limitations, KMeans remains a go-to for many professionals (myself included) when starting exploratory segmentation. It gives you a clear first pass that helps you understand the terrain before moving to more adaptive methods like DBSCAN.

 

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate synthetic 2D data: four well-separated blobs we will try to recover
X, y_true = make_blobs(
    n_samples=600, 
    centers=4, 
    cluster_std=0.70, 
    random_state=42
)

plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], s=40, alpha=0.7)
plt.title("Raw Data — Unlabeled Points")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

# Fit KMeans with the known number of blobs and assign each point to a cluster
kmeans = KMeans(n_clusters=4, random_state=42, n_init='auto')
kmeans_labels = kmeans.fit_predict(X)

score = silhouette_score(X, kmeans_labels)
print(f"Silhouette Score for KMeans: {score:.3f}")

plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', s=40)
plt.scatter(
    kmeans.cluster_centers_[:, 0], 
    kmeans.cluster_centers_[:, 1], 
    c='red', 
    s=200, 
    marker='X', 
    label='Centroids'
)
plt.title("KMeans Clustering — Groups Formed Around Centroids")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.legend()
plt.show()

# Elbow method: refit KMeans over a range of k and record inertia for each
inertias = []
k_values = range(1, 10)

for k in k_values:
    km = KMeans(n_clusters=k, random_state=42, n_init='auto')
    km.fit(X)
    inertias.append(km.inertia_)

plt.figure(figsize=(6, 4))
plt.plot(k_values, inertias, marker='o', linewidth=2)
plt.title("Elbow Method — Choosing Optimal Number of Clusters (k)")
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia (Within-Cluster Sum of Squares)")
plt.show()
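 

The elbow plot above relies on inertia, but the silhouette scores mentioned earlier are worth checking alongside it, and the fitted model’s transform method gives each point’s distance to every centroid - exactly the kind of downstream feature described above. Here’s a minimal sketch along those lines (it reuses the same synthetic blobs; the loop range is just illustrative):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Same synthetic blobs as above, regenerated so this cell runs on its own
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.70, random_state=42)

# Silhouette needs at least two clusters, so the sweep starts at k=2
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42, n_init='auto')
    labels = km.fit_predict(X)
    print(f"k={k}: silhouette = {silhouette_score(X, labels):.3f}")

# Distance from every point to every centroid - usable as downstream features
km = KMeans(n_clusters=4, random_state=42, n_init='auto')
distance_features = km.fit_transform(X)  # shape: (n_samples, n_clusters)
print("Distance-to-centroid feature matrix:", distance_features.shape)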

 

While KMeans builds order through geometry, DBSCAN builds it through density. It doesn’t ask how many clusters you want - it lets the data decide. That’s why I often describe DBSCAN as the “artist” of clustering: it observes the data’s natural contours and creates groupings where the density of points suggests a relationship. DBSCAN, which stands for Density-Based Spatial Clustering of Applications with Noise, operates differently. It starts with a point and looks around it within a defined radius (eps). If there are enough nearby points (defined by min_samples), it considers that point part of a dense region and expands outward, connecting neighboring points recursively.

 

Points that don’t meet the density threshold are labeled as noise (-1). This makes DBSCAN perfect for discovering irregular or non-spherical clusters. When I work on datasets like geospatial coordinates, social interactions, or IoT sensor logs, DBSCAN consistently outperforms KMeans. It handles natural shapes (winding patterns, curved boundaries, uneven densities) all without needing you to guess how many clusters exist. But DBSCAN also requires intuition. The parameters eps and min_samples can drastically change the outcome.

 

Source: Sahir Maharaj (https://sahirmaharaj.com)

 

A small eps leads to tightly packed clusters but may label too many points as noise. A large eps merges everything together. Finding the sweet spot takes experimentation... but once you do, DBSCAN reveals clusters that KMeans would completely miss. It adapts, listens, and adjusts. It acknowledges that not all data fits neat mathematical boxes. And when combined with visualization tools like UMAP, DBSCAN becomes one of the most revealing exploratory methods in any data scientist’s toolkit.

 

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons, make_circles
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Build an irregular dataset: two interleaving moons plus two concentric circles
X1, _ = make_moons(n_samples=400, noise=0.08, random_state=42)
X2, _ = make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=24)
X = np.vstack((X1, X2))

plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], s=40, alpha=0.7)
plt.title("Irregular Data — No Natural Circular Boundaries")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

# DBSCAN: eps is the neighborhood radius, min_samples the density threshold
dbscan = DBSCAN(eps=0.15, min_samples=5)
labels = dbscan.fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = np.sum(labels == -1)

print(f"DBSCAN found {n_clusters} clusters and {n_noise} noise points.")

mask = labels != -1
if len(set(labels[mask])) > 1:
    dbscan_score = silhouette_score(X[mask], labels[mask])
    print(f"Silhouette Score (excluding noise): {dbscan_score:.3f}")

plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Spectral', s=40)
plt.title("DBSCAN Clustering — Natural Shape Detection")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.show()

from sklearn.cluster import KMeans

# For comparison, force KMeans onto the same irregular shapes
kmeans = KMeans(n_clusters=4, random_state=42, n_init='auto')
km_labels = kmeans.fit_predict(X)

plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=km_labels, cmap='plasma', s=40)
plt.title("KMeans — Forced Circular Boundaries")
plt.subplot(1, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='Spectral', s=40)
plt.title("DBSCAN — Flexible Natural Boundaries")
plt.show()
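 

Since eps is the parameter that takes the most experimentation, one heuristic I find helpful is the k-distance plot: compute each point’s distance to its min_samples-th nearest neighbour, sort those distances, and look for the “knee” where they start rising sharply - that region is a sensible starting value for eps. A rough sketch on the same data (reading the knee is still a judgment call):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons, make_circles
from sklearn.neighbors import NearestNeighbors

# Same irregular data as above, regenerated so this cell is self-contained
X1, _ = make_moons(n_samples=400, noise=0.08, random_state=42)
X2, _ = make_circles(n_samples=400, factor=0.5, noise=0.05, random_state=24)
X = np.vstack((X1, X2))

min_samples = 5

# Distance from each point to its min_samples-th nearest neighbour
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])

plt.figure(figsize=(6, 4))
plt.plot(k_distances, linewidth=2)
plt.title("k-Distance Plot — Knee Suggests a Starting eps")
plt.xlabel("Points (sorted by distance)")
plt.ylabel(f"Distance to {min_samples}th Nearest Neighbor")
plt.show()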

 

Clustering only becomes powerful when you can see it. Visualizing clusters allows you to confirm what your algorithms suggest and to judge whether the separation is real, overlapping, or even meaningful. This is where UMAP (Uniform Manifold Approximation and Projection) becomes invaluable. UMAP helps reduce high-dimensional data to 2D or 3D while preserving its structure as faithfully as possible. It’s faster than t-SNE, handles large datasets well, and provides a clearer sense of how points relate to each other. For clustering, UMAP acts like a microscope: it doesn’t change your clusters but makes them visible in a space your eyes can interpret.

 

Source: Sahir Maharaj (https://sahirmaharaj.com)

 

From my experience, pairing UMAP with DBSCAN or KMeans is like adding color to black-and-white data. Suddenly, you can see how clusters form, merge, or blur. Tight clusters appear as compact clouds, while ambiguous ones stretch out or overlap. This visual feedback helps refine your parameters and deepens your understanding of what the algorithm is doing. UMAP also introduces its own parameters - n_neighbors controls how local or global the structure should be, while min_dist adjusts how tightly points are packed in low-dimensional space. Modifying these can completely change your visualization. One thing to keep in mind, though... UMAP isn’t just for pretty visuals. It’s a bridge between analysis and intuition that helps you interpret clustering results in a way that feels human, visual, and immediately actionable.

 

%pip install umap-learn

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN
import umap.umap_ as umap

# Higher-dimensional data: five blobs in 10 features (too many to plot directly)
X, y_true = make_blobs(
    n_samples=1000,
    centers=5,
    n_features=10,
    cluster_std=1.2,
    random_state=42
)

# Cluster in the original 10-dimensional space with both algorithms
kmeans = KMeans(n_clusters=5, random_state=42, n_init='auto')
kmeans_labels = kmeans.fit_predict(X)

dbscan = DBSCAN(eps=2.2, min_samples=10)
dbscan_labels = dbscan.fit_predict(X)

# Project the 10-dimensional data down to 2D for visualization only
reducer = umap.UMAP(n_neighbors=20, min_dist=0.2, random_state=42)
X_umap = reducer.fit_transform(X)

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

axes[0].scatter(X_umap[:, 0], X_umap[:, 1], c=y_true, cmap='viridis', s=25)
axes[0].set_title("True Clusters (Ground Truth)")
axes[0].set_xlabel("UMAP Dim 1")
axes[0].set_ylabel("UMAP Dim 2")

axes[1].scatter(X_umap[:, 0], X_umap[:, 1], c=kmeans_labels, cmap='plasma', s=25)
axes[1].set_title("KMeans Clustering in UMAP Space")
axes[1].set_xlabel("UMAP Dim 1")
axes[1].set_ylabel("UMAP Dim 2")

axes[2].scatter(X_umap[:, 0], X_umap[:, 1], c=dbscan_labels, cmap='Spectral', s=25)
axes[2].set_title("DBSCAN Clustering in UMAP Space")
axes[2].set_xlabel("UMAP Dim 1")
axes[2].set_ylabel("UMAP Dim 2")

plt.tight_layout()
plt.show()

# Vary n_neighbors to see how local vs. global structure changes the projection
for n in [5, 15, 50]:
    reducer = umap.UMAP(n_neighbors=n, min_dist=0.2, random_state=42)
    X_proj = reducer.fit_transform(X)

    plt.figure(figsize=(6, 5))
    plt.scatter(X_proj[:, 0], X_proj[:, 1], c=y_true, cmap='viridis', s=25)
    plt.title(f"UMAP Projection with n_neighbors={n}")
    plt.xlabel("UMAP Dim 1")
    plt.ylabel("UMAP Dim 2")
    plt.show()
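 

The loop above varies n_neighbors; min_dist deserves the same treatment, since it controls how tightly the embedded points are allowed to pack together. A small companion sketch (the three values are arbitrary, chosen only to show the spread):

import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
import umap.umap_ as umap

# Same high-dimensional blobs as above, regenerated for a self-contained cell
X, y_true = make_blobs(n_samples=1000, centers=5, n_features=10,
                       cluster_std=1.2, random_state=42)

# Small min_dist packs clusters into tight clumps; larger values spread them out
for d in [0.0, 0.3, 0.8]:
    reducer = umap.UMAP(n_neighbors=20, min_dist=d, random_state=42)
    X_proj = reducer.fit_transform(X)

    plt.figure(figsize=(6, 5))
    plt.scatter(X_proj[:, 0], X_proj[:, 1], c=y_true, cmap='viridis', s=25)
    plt.title(f"UMAP Projection with min_dist={d}")
    plt.xlabel("UMAP Dim 1")
    plt.ylabel("UMAP Dim 2")
    plt.show()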

 

As a data scientist, I’ve found that these three techniques often serve different purposes in my workflow. KMeans is my go-to for customer profiling, inventory grouping, or understanding behavioral clusters in web analytics. DBSCAN becomes invaluable when I need to detect outliers, identify fraud patterns, or explore geographic or sensor data that naturally forms uneven shapes. And UMAP links everything together, helping me visualize embeddings from deep learning models or find hidden structure in large-scale datasets before deciding which approach to pursue.
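 

To make the outlier-detection point concrete: the -1 label DBSCAN assigns to low-density points can double as a first-pass anomaly flag. Here’s a minimal sketch on synthetic data (the parameters and the injected outliers are purely illustrative, not a production fraud model):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)

# Dense "normal" activity plus a handful of scattered points standing in for anomalies
X_normal, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.60, random_state=42)
X_outliers = rng.uniform(low=-12, high=12, size=(15, 2))
X = np.vstack((X_normal, X_outliers))

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Points labeled -1 never reached the density threshold - treat them as outlier candidates
outlier_candidates = X[labels == -1]
print(f"Flagged {len(outlier_candidates)} of {len(X)} points as potential outliers.")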

 

Source: Sahir Maharaj (https://sahirmaharaj.com)

 

And the best part is you can start experimenting right now in Microsoft Fabric! Open a notebook, install scikit-learn, scikit-learn-extra, and umap-learn, and paste in the examples we explored above. You don’t need a massive dataset or advanced hardware - just curiosity and a few lines of code. Because once you start using clustering in your everyday analysis, you’ll never look at data the same way again. Every dataset has a pattern, a relationship, or a story waiting to be told - and the tools are now in your hands.
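 

In a Fabric notebook, that setup is typically a single inline-install cell, something like:

# Inline install of the libraries used in this post
# (scikit-learn-extra is optional here, but adds variants such as KMedoids)
%pip install scikit-learn scikit-learn-extra umap-learn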

 

Thanks for taking the time to read my post! I’d love to hear what you think and connect with you 🙂