Clustering Algorithms: K-Means, DBSCAN, and Hierarchical Clustering — Unsupervised Learning, Segmentation, and When to Use Each
Every ML model we have covered so far — linear regression, decision trees, XGBoost — learned from labeled data. You gave the model inputs AND the correct answers (supervised learning). But what if you do not have labels? What if you have 10 million customers and the business asks: “find natural groups in our customer base”? Nobody has pre-labeled customers as “high value” or “at risk.” You need the algorithm to discover the groups on its own.
That is unsupervised learning, and clustering is its most widely used technique. Clustering finds natural groupings in data without being told what the groups are.
Think of it like sorting a bag of mixed coins from different countries. Nobody tells you which country each coin belongs to. But you naturally group them by size, color, weight, and markings — and you end up with neat piles of US quarters, Canadian loonies, British pounds, and Indian rupees. The algorithm does the same thing with data points: groups them by similarity without being told the categories in advance.
This post covers the three most important clustering algorithms — K-Means (fast, simple, spherical clusters), DBSCAN (finds irregular shapes, handles noise), and Hierarchical Clustering (no need to pre-specify K, produces a visual dendrogram) — with Python code, comparison tables, real-world use cases, and when to use each.
Table of Contents
- Supervised vs Unsupervised Learning
- What Is Clustering?
- K-Means Clustering
- How K-Means Works (Step by Step)
- K-Means in Python
- Choosing K: The Elbow Method
- K-Means Limitations
- DBSCAN (Density-Based Clustering)
- How DBSCAN Works
- DBSCAN in Python
- Choosing eps and min_samples
- DBSCAN Strengths and Limitations
- Hierarchical Clustering
- How Hierarchical Clustering Works
- Hierarchical Clustering in Python
- Reading a Dendrogram
- Comparison: K-Means vs DBSCAN vs Hierarchical
- Evaluation Metrics for Clustering
- Silhouette Score
- Inertia and Davies-Bouldin Index
- Data Preprocessing for Clustering
- Real-World Use Cases
- Common Mistakes
- Interview Questions
- Wrapping Up
Supervised vs Unsupervised Learning
| Aspect | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Data has labels? | Yes — you provide correct answers | No — algorithm discovers patterns |
| Goal | Predict an outcome (y = f(X)) | Find structure in data (groups, patterns) |
| Examples | Predict house price, classify spam | Customer segmentation, anomaly detection |
| Algorithms | Linear Regression, Decision Trees, XGBoost | K-Means, DBSCAN, Hierarchical, PCA |
| Evaluation | Accuracy, RMSE, F1 (against known labels) | Silhouette score, inertia (no “correct” answer) |
Real-life analogy: Supervised learning is like a teacher grading exams with an answer key. Unsupervised learning is like an archaeologist sorting artifacts from a new dig site — no labels, no answer key, just patterns in the objects themselves.
What Is Clustering?
Clustering divides data points into groups (clusters) where points in the same cluster are more similar to each other than to points in other clusters.
Before clustering: After clustering:
• • Cluster A: • • •
• • • Cluster B: • • • •
• • • Cluster C: • •
• • •
• •
• • •
No structure visible. Three natural groups discovered.
The algorithm does not know what the groups mean — it just finds them. YOU interpret them: “Cluster A is high-spending young customers,” “Cluster B is budget-conscious families,” etc.
K-Means Clustering
K-Means is the most popular clustering algorithm. It is fast, simple, and works well for spherical (round) clusters of similar size.
Real-life analogy: Imagine you are a pizza delivery manager with 3 delivery drivers. You need to assign each delivery address to the closest driver. K-Means does exactly this: it places K “centroids” (drivers) and assigns each data point (delivery address) to the nearest centroid. Then it moves the centroids to the center of their assigned points and repeats until nobody switches groups.
How K-Means Works (Step by Step)
Step 1: Choose K (number of clusters) — you decide this upfront
Step 2: Randomly place K centroids in the data space
Step 3: ASSIGN — each data point goes to its nearest centroid
Step 4: UPDATE — move each centroid to the mean of its assigned points
Step 5: REPEAT Steps 3-4 until centroids stop moving (convergence)
Example with K=3 and 9 data points:
Iteration 1: Iteration 2 (centroids moved):
C1* Points reassigned based on new
• • centroid positions.
• C2*
• • C1 and C2 shift to cluster centers.
• C3*
• • After 5-10 iterations: stable.
* = centroid position Final: 3 tight clusters.
K-Means in Python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Sample customer data
data = pd.DataFrame({
'annual_income': [15, 16, 17, 18, 19, 39, 40, 41, 42, 43, 70, 71, 72, 73, 74],
'spending_score': [39, 81, 6, 77, 40, 76, 94, 3, 72, 14, 99, 15, 73, 10, 88]
})
# Step 1: Scale the data (CRITICAL for clustering!)
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)
# Step 2: Fit K-Means with K=3
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
data['cluster'] = kmeans.fit_predict(scaled_data)
# Step 3: Examine results
print(data.groupby('cluster').mean())
# cluster annual_income spending_score
# 0 17.0 48.6 ← Low income group
# 1 41.0 51.8 ← Mid income group
# 2 72.0 57.0 ← High income group
# Step 4: Visualize
plt.scatter(data['annual_income'], data['spending_score'],
c=data['cluster'], cmap='viridis', s=100)
plt.xlabel('Annual Income (K$)')
plt.ylabel('Spending Score')
plt.title('Customer Segments (K-Means, K=3)')
plt.colorbar(label='Cluster')
plt.savefig('kmeans_clusters.png')
plt.show()
# Access centroids (cluster centers)
print("Centroids (scaled):", kmeans.cluster_centers_)
print("Inertia (lower = tighter clusters):", kmeans.inertia_)
Choosing K: The Elbow Method
The hardest part of K-Means is choosing K — how many clusters? The Elbow Method runs K-Means for K=1 through K=10 and plots the inertia (sum of squared distances to centroids). The “elbow” in the plot is where adding more clusters stops significantly improving the fit.
inertias = []
K_range = range(1, 11)
for k in K_range:
km = KMeans(n_clusters=k, random_state=42, n_init=10)
km.fit(scaled_data)
inertias.append(km.inertia_)
# Plot the elbow
plt.figure(figsize=(8, 4))
plt.plot(K_range, inertias, 'bo-')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method — Choosing K')
plt.xticks(K_range)
plt.grid(True)
plt.savefig('elbow_method.png')
plt.show()
# The "elbow" is typically at K=3 or K=4
# After the elbow, each additional cluster adds minimal improvement
Real-life analogy: The Elbow Method is like adding staff to a restaurant. Going from 1 to 2 chefs doubles output. 2 to 3 adds significant capacity. 3 to 4 helps a bit. 4 to 5 barely matters — the kitchen is already efficient. The “elbow” is where adding more stops making a meaningful difference.
K-Means Limitations
- Must specify K in advance — if you guess wrong, clusters are meaningless
- Assumes spherical clusters — struggles with elongated, crescent, or irregular shapes
- Sensitive to outliers — one extreme point pulls the centroid away from the true center
- Sensitive to initialization — random starting centroids can produce different results (use
n_init=10to mitigate) - Equal-size assumption — tends to create clusters of similar size even if natural groups differ
DBSCAN (Density-Based Clustering)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds clusters based on density — regions where points are packed closely together. Unlike K-Means, it does not require you to specify K, it can find clusters of any shape, and it identifies outliers as noise.
Real-life analogy: Imagine you are at a party in a large hall. DBSCAN is like looking at the room from above. You see groups of people standing close together (clusters) and a few people standing alone near the walls (noise/outliers). The groups can be any shape — a circle around the buffet, a line at the bar, an irregular blob on the dance floor. DBSCAN finds them all without you telling it how many groups to look for.
How DBSCAN Works
DBSCAN uses two parameters:
- eps (epsilon) — the radius of a neighborhood around each point
- min_samples — minimum number of points required within eps to form a dense region
DBSCAN classifies every point into one of three types:
1. CORE POINT: Has >= min_samples points within eps radius
(In the thick of the crowd)
2. BORDER POINT: Within eps of a core point but has < min_samples neighbors
(Standing at the edge of a group)
3. NOISE POINT: Not within eps of any core point
(Standing alone — outlier)
Algorithm:
Step 1: Pick any unvisited point
Step 2: Find all points within eps radius
Step 3: If count >= min_samples → start a cluster
Expand: check each neighbor's neighbors (recursive)
Step 4: If count < min_samples → mark as noise (for now)
Step 5: Repeat until all points visited
Noise points near a later-discovered cluster become border points.
DBSCAN in Python
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler
import numpy as np
# Generate sample data with two dense clusters and noise
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=300, noise=0.1, random_state=42)
# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit DBSCAN
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X_scaled)
# Results
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)
print(f"Clusters found: {n_clusters}") # 2
print(f"Noise points: {n_noise}") # ~5 outliers
print(f"Core samples: {len(dbscan.core_sample_indices_)}")
# Visualize
import matplotlib.pyplot as plt
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=30)
plt.title(f'DBSCAN: {n_clusters} clusters, {n_noise} noise points')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(label='Cluster (-1 = noise)')
plt.savefig('dbscan_clusters.png')
plt.show()
Notice: DBSCAN found the two crescent-shaped clusters that K-Means would split incorrectly. And it automatically identified the scattered noise points — no manual outlier removal needed.
Choosing eps and min_samples
The k-distance plot helps you choose eps. For each point, compute the distance to its k-th nearest neighbor (where k = min_samples). Sort these distances and plot them. The “elbow” in the curve is a good eps value.
from sklearn.neighbors import NearestNeighbors
# k-distance plot to find optimal eps
k = 5 # same as min_samples
nn = NearestNeighbors(n_neighbors=k)
nn.fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
# Sort the k-th neighbor distances
k_distances = np.sort(distances[:, k-1])
plt.figure(figsize=(8, 4))
plt.plot(k_distances)
plt.xlabel('Points (sorted)')
plt.ylabel(f'{k}-th Nearest Neighbor Distance')
plt.title('k-Distance Plot — Choose eps at the Elbow')
plt.grid(True)
plt.savefig('k_distance_plot.png')
plt.show()
# Look for the "knee" — the sharp increase indicates the boundary
# between dense regions (low distance) and sparse regions (high distance)
Rules of thumb:
- min_samples: Start with
2 × number_of_features. For 2D data, min_samples=5 is common. Higher values = stricter (fewer clusters, more noise) - eps: Use the k-distance plot elbow. Too small = everything is noise. Too large = everything is one cluster
DBSCAN Strengths and Limitations
| Strengths | Limitations |
|---|---|
| Does not require K in advance | Sensitive to eps and min_samples choice |
| Finds clusters of any shape | Struggles with clusters of varying density |
| Automatically identifies outliers as noise | Does not work well in high-dimensional data (curse of dimensionality) |
| Robust to outliers | Non-deterministic for border points |
| No equal-size assumption | Slower than K-Means on very large datasets |
Hierarchical Clustering
Hierarchical clustering builds a tree of clusters (dendrogram) by either merging small clusters into larger ones (agglomerative — bottom-up) or splitting large clusters into smaller ones (divisive — top-down). You choose the number of clusters AFTER seeing the tree, not before.
Real-life analogy: Think of a company organizational chart built from the bottom up. Start with individual employees (each is their own cluster). The two most similar employees merge into a team. Then the two most similar teams merge into a department. Departments merge into divisions. Eventually everyone is in one company. The dendrogram shows this entire merging history, and you can “cut” the tree at any level to get the number of groups you want.
How Hierarchical Clustering Works
Agglomerative (bottom-up) — the most common approach:
Step 1: Start with N clusters (each point is its own cluster)
Step 2: Find the two closest clusters
Step 3: Merge them into one cluster
Step 4: Repeat Steps 2-3 until only one cluster remains
Step 5: Cut the dendrogram at the desired level
Example with 5 points: A, B, C, D, E
Start: {A} {B} {C} {D} {E} — 5 clusters
Merge: {A,B} {C} {D} {E} — 4 clusters (A and B were closest)
Merge: {A,B} {C} {D,E} — 3 clusters (D and E merged)
Merge: {A,B} {C,D,E} — 2 clusters
Merge: {A,B,C,D,E} — 1 cluster
Dendrogram captures every merge with the distance at which it happened.
Linkage methods determine how “distance between clusters” is measured:
| Linkage | Distance Between Clusters | Best For |
|---|---|---|
| Single | Minimum distance between any two points | Elongated/chain-like clusters |
| Complete | Maximum distance between any two points | Compact, spherical clusters |
| Average | Average distance between all point pairs | Balanced approach |
| Ward | Minimizes total within-cluster variance | Equal-sized, compact clusters (most common) |
Hierarchical Clustering in Python
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.preprocessing import StandardScaler
# Sample data
data = pd.DataFrame({
'income': [15, 16, 45, 46, 75, 76, 77],
'spending': [39, 41, 55, 58, 90, 88, 92]
})
scaled = StandardScaler().fit_transform(data)
# Method 1: Fit with a specific number of clusters
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
data['cluster'] = agg.fit_predict(scaled)
print(data.groupby('cluster').mean())
# Method 2: Build dendrogram FIRST, then decide number of clusters
linked = linkage(scaled, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(linked, labels=data.index.tolist(), leaf_rotation=0)
plt.title('Dendrogram — Ward Linkage')
plt.xlabel('Data Point Index')
plt.ylabel('Distance')
plt.axhline(y=4, color='r', linestyle='--', label='Cut at 3 clusters')
plt.legend()
plt.savefig('dendrogram.png')
plt.show()
# The red line cuts the dendrogram into 3 groups
Reading a Dendrogram
┌───────┐
┌────┤ ├────┐ ← Cut here = 2 clusters
┌────┤ └───────┘ ├────┐
│ │ │ │ ← Cut here = 3 clusters
┌─┤ ┌─┤ ┌─┤ ┌─┤
A B C D E F G H ← Cut here = 8 clusters (every point)
Reading the dendrogram:
- Height of each merge = distance at which clusters joined
- Tall vertical lines = clusters are far apart (natural separation)
- Short vertical lines = clusters are close (maybe should not be separated)
- Cut horizontally at a height = choose number of clusters
- The longest vertical line without a horizontal crossing = best number of clusters
Comparison: K-Means vs DBSCAN vs Hierarchical
| Feature | K-Means | DBSCAN | Hierarchical |
|---|---|---|---|
| Specify K? | Yes (required) | No (finds automatically) | Optional (cut dendrogram) |
| Cluster shapes | Spherical only | Any shape | Depends on linkage |
| Handles noise? | No (assigns all points) | Yes (labels noise as -1) | No (assigns all points) |
| Speed | Fast (O(n·K·i)) | Medium (O(n²) or O(n log n)) | Slow (O(n³) or O(n² log n)) |
| Large datasets? | Excellent (millions of rows) | Good (up to ~500K rows) | Poor (struggles above ~10K rows) |
| Outlier handling | Poor (outliers pull centroids) | Excellent (outliers = noise) | Moderate |
| Interpretability | Easy (centroids = cluster profiles) | Moderate (density-based) | Visual (dendrogram) |
| Parameters | K | eps, min_samples | linkage, n_clusters or distance |
| Best for | Customer segmentation, quantization | Anomaly detection, geographic data | Gene expression, small datasets |
Decision guide:
Do you know how many clusters?
├── YES → K-Means (fast, simple)
└── NO
├── Do you have noise/outliers?
│ ├── YES → DBSCAN (finds clusters + marks noise)
│ └── NO → Hierarchical (build dendrogram, decide K visually)
└── Is the dataset large (>50K rows)?
├── YES → K-Means or DBSCAN (Hierarchical is too slow)
└── NO → Hierarchical (dendrogram gives the best visual insight)
Evaluation Metrics for Clustering
Unlike supervised learning (where you compare predictions to known labels), clustering evaluation is harder because there are no “correct” answers. These metrics measure cluster quality internally.
Silhouette Score
The most widely used metric. For each point, it measures how similar the point is to its own cluster versus the nearest other cluster.
from sklearn.metrics import silhouette_score, silhouette_samples
# Compute silhouette score for different K values
for k in range(2, 8):
km = KMeans(n_clusters=k, random_state=42, n_init=10)
labels = km.fit_predict(scaled_data)
score = silhouette_score(scaled_data, labels)
print(f"K={k}: Silhouette Score = {score:.3f}")
# Output:
# K=2: Silhouette Score = 0.581
# K=3: Silhouette Score = 0.672 ← Best (highest)
# K=4: Silhouette Score = 0.534
# K=5: Silhouette Score = 0.489
# Interpretation:
# 1.0 = perfect separation (each point far from other clusters)
# 0.0 = on the boundary between clusters
# -1.0 = wrong cluster (closer to a different cluster)
# > 0.5 = good clustering
# > 0.7 = strong clustering
Inertia and Davies-Bouldin Index
# Inertia (K-Means only): sum of squared distances to centroids
# Lower = tighter clusters, but always decreases with more K
print(f"Inertia: {kmeans.inertia_:.2f}")
# Davies-Bouldin Index: lower = better cluster separation
from sklearn.metrics import davies_bouldin_score
db_score = davies_bouldin_score(scaled_data, labels)
print(f"Davies-Bouldin: {db_score:.3f}") # < 1.0 is good
Data Preprocessing for Clustering
Clustering is distance-based. If one feature is measured in thousands (income: 50,000) and another in single digits (age: 35), income dominates the distance calculation and age is ignored. Scaling is mandatory.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# StandardScaler: zero mean, unit variance (most common)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# MinMaxScaler: scales to [0, 1] range
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# Additional preprocessing:
# 1. Handle missing values BEFORE clustering (impute or drop)
# 2. Remove or cap extreme outliers (especially for K-Means)
# 3. Consider dimensionality reduction (PCA) if >10 features
# 4. One-hot encode categorical features
# 5. Use log transform for heavily skewed features
Real-life analogy: Scaling is like converting all measurements to the same unit before comparing. You would not compare a building’s height in meters with its width in millimeters — the height would seem enormously more important. Scaling puts every feature on equal footing.
Real-World Use Cases
| Use Case | Algorithm | What It Does |
|---|---|---|
| Customer segmentation | K-Means | Group customers by spending behavior for targeted marketing |
| Fraud detection | DBSCAN | Normal transactions form dense clusters; fraud = noise points |
| Geographic analysis | DBSCAN | Find hotspots (crime clusters, earthquake zones, delivery areas) |
| Gene expression analysis | Hierarchical | Group similar genes; dendrogram shows evolutionary relationships |
| Image compression | K-Means | Reduce millions of colors to K representative colors |
| Document grouping | K-Means/DBSCAN | Cluster similar news articles or support tickets by topic |
| Anomaly detection in logs | DBSCAN | Normal log patterns = clusters; unusual patterns = noise |
| Market basket analysis | Hierarchical | Group products frequently purchased together |
Common Mistakes
-
Not scaling features before clustering — the most common mistake. Features with larger ranges dominate distance calculations. Always use
StandardScalerorMinMaxScalerbefore fitting any clustering algorithm. -
Choosing K randomly for K-Means — use the Elbow Method or Silhouette Score to find the optimal K. “The business wants 5 segments” is fine, but validate that 5 produces meaningful clusters.
-
Using K-Means on non-spherical data — K-Means assumes round clusters. For crescent shapes, rings, or irregular blobs, use DBSCAN instead.
-
Ignoring DBSCAN noise points — noise points (label = -1) are not errors. They are outliers that do not belong to any cluster. Investigate them — they might be fraud, data quality issues, or genuinely unique records.
-
Using Hierarchical Clustering on large datasets — it has O(n³) time complexity. Above 10K-20K rows, it becomes impractically slow. Use K-Means or DBSCAN instead and reserve Hierarchical for small datasets where the dendrogram adds value.
-
Forgetting to interpret clusters — clustering finds groups, but YOU must interpret what they mean. After clustering, profile each group: “Cluster 0 has high income + low spending = potential upsell target.” Without interpretation, clusters are just numbers.
Interview Questions
Q: What is the difference between K-Means and DBSCAN? A: K-Means requires K (number of clusters) upfront, assumes spherical clusters, and assigns every point to a cluster. DBSCAN finds clusters automatically based on density, handles any cluster shape, and identifies outliers as noise (label = -1). Use K-Means for well-separated round clusters and DBSCAN for irregular shapes with noise.
Q: How do you choose the number of clusters in K-Means? A: Use the Elbow Method (plot inertia vs K, look for the elbow) and the Silhouette Score (higher is better, try K=2 to K=10). Combine both: the elbow suggests a range, the silhouette score confirms the best K within that range. Also consider domain knowledge — does the number of clusters make business sense?
Q: Why is feature scaling important for clustering? A: Clustering algorithms use distance (Euclidean) to measure similarity. If one feature ranges 0-100,000 (income) and another 0-100 (age), income dominates the distance calculation and age is effectively ignored. Scaling (StandardScaler or MinMaxScaler) puts all features on equal footing so each contributes equally to the distance.
Q: What is the Silhouette Score and what values indicate good clustering? A: The Silhouette Score measures how similar each point is to its own cluster versus the nearest other cluster. Ranges from -1 to 1. Above 0.5 = good clustering. Above 0.7 = strong clustering. Near 0 = points are on cluster boundaries. Negative = points are likely in the wrong cluster.
Q: When would you use Hierarchical Clustering over K-Means? A: When you do not know K and want to visualize the cluster hierarchy via a dendrogram. When the dataset is small (under 10K rows). When you need to explore different numbers of clusters without re-running the algorithm — just cut the dendrogram at different heights. K-Means is better for large datasets where speed matters.
Q: How does DBSCAN handle outliers? A: DBSCAN labels points that are not within eps distance of any core point as noise (cluster label = -1). These are points that do not belong to any dense region. This makes DBSCAN inherently robust to outliers — they are identified automatically rather than distorting cluster boundaries like in K-Means.
Wrapping Up
Clustering is your go-to tool when the data has no labels. K-Means is the workhorse — fast, simple, and effective for round clusters. DBSCAN excels at finding irregular shapes and identifying noise. Hierarchical Clustering gives you the best visual tool (dendrogram) for understanding cluster structure.
The key takeaway: there is no single best clustering algorithm. The right choice depends on your data shape, dataset size, whether you know K, and whether outliers matter. Start with K-Means (fastest to try), validate with silhouette scores, and switch to DBSCAN or Hierarchical when K-Means assumptions do not hold.
And never forget: clustering finds the groups, but YOU give them meaning. The algorithm says “these 500 customers are similar.” You say “they are our high-value, low-engagement segment — let us send them a re-engagement campaign.”
Related posts: – AI/ML Introduction – Feature Engineering – Model Evaluation Deep Dive – Decision Trees & Random Forests
Naveen Vuppula is a Senior Data Engineering Consultant and app developer based in Ontario, Canada. He writes about Python, SQL, AWS, Azure, and everything data engineering at DriveDataScience.com.