Clustering Algorithms and Evaluation Methods: A Practical Guide
Published:
TL;DR: Density-based clustering algorithms like DBSCAN and HDBSCAN can discover arbitrarily shaped clusters and gracefully handle noise—two properties that traditional methods like K-Means struggle with. This guide walks through how these algorithms work, when to choose one over the other, and how to rigorously evaluate clustering quality using internal metrics, external metrics, stability analysis, and practical validation techniques.
Introduction: Why Clustering Matters
Clustering is one of the most fundamental tasks in unsupervised machine learning. At its core, clustering asks a deceptively simple question: given a collection of data points, which ones naturally belong together?
Unlike classification, where we have labeled examples to learn from, clustering must discover structure purely from the data itself. This makes it both powerful and challenging—powerful because it can reveal hidden patterns without human annotation, and challenging because there is often no single “correct” answer.
Clustering appears everywhere in practice: customer segmentation, document organization, anomaly detection, image compression, gene expression analysis, and increasingly in modern AI systems that need to semantically organize large volumes of information. As AI agents rely more heavily on long-context summarization and retrieval-augmented generation, the ability to group related content coherently becomes critical infrastructure.
But discovering clusters is only half the battle. Evaluating whether your clusters are meaningful is equally important—and considerably harder. Without ground-truth labels, how do you know if your clustering is capturing real structure or just partitioning noise? This article addresses both sides of the problem.
The Challenge of Unsupervised Evaluation
In supervised learning, evaluation is straightforward: compare predictions to known labels and compute accuracy, precision, or recall. Clustering lacks this luxury. We face several fundamental challenges:
- No ground truth: In most real-world scenarios, we don’t have labeled data to compare against. If we did, we might just use a classifier.
- Subjectivity of “correct” clusters: The same dataset can be validly clustered in multiple ways depending on the resolution, the features emphasized, or the downstream task.
- Shape and density assumptions: Many evaluation metrics implicitly assume clusters are convex and similarly sized—assumptions that real data frequently violates.
- The curse of dimensionality: As feature dimensions grow, distance metrics become less meaningful, making both clustering and evaluation harder.
These challenges motivate a multi-faceted evaluation strategy: no single metric tells the whole story. Instead, practitioners combine quantitative scores with visual inspection, stability analysis, and domain-specific validation.
Overview of Clustering Algorithm Families
Before diving into density-based methods, it helps to understand where they sit in the broader landscape:
| Family | Representative Algorithms | Key Idea | Strengths | Limitations |
|---|---|---|---|---|
| Centroid-based | K-Means, K-Medoids | Minimize within-cluster variance around centroids | Fast, scalable, intuitive | Assumes spherical clusters; requires specifying k |
| Hierarchical | Agglomerative, Divisive | Build a tree of nested cluster merges/splits | No need to prespecify k; produces dendrogram | Computationally expensive; sensitive to linkage choice |
| Density-based | DBSCAN, HDBSCAN, OPTICS | Clusters are dense regions separated by sparse regions | Finds arbitrary shapes; handles noise natively | Sensitive to density parameters; struggles with very high dimensions |
| Distribution-based | Gaussian Mixture Models | Each cluster is a probability distribution | Soft assignments; models cluster shape | Requires specifying number of components; can overfit |
| Graph-based | Spectral Clustering | Clusters as connected components in a similarity graph | Handles complex shapes; grounded in graph theory | Expensive for large datasets; requires similarity matrix |
This article focuses on density-based clustering—specifically DBSCAN and HDBSCAN—because they address two common pain points that centroid-based methods cannot: discovering clusters of arbitrary shape and explicitly identifying noise points rather than forcing every observation into a cluster.
DBSCAN: Density-Based Spatial Clustering
Core Intuition
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) operates on a beautifully simple principle: a cluster is a contiguous region of high density, separated from other clusters by regions of low density.
Rather than asking “how many clusters exist?” (as K-Means does), DBSCAN asks “where are the dense regions?” This shift in perspective is what allows it to find clusters of any shape—crescents, spirals, blobs—as long as the cluster forms a connected dense region.
How It Works
DBSCAN classifies every point in the dataset into one of three roles:
- Core points: Points that have at least
min_samplesneighbors within a radius ofeps. These are the “interior” of dense regions. - Border points: Points that fall within
epsof a core point but don’t themselves have enough neighbors to be core points. They sit on the edge of clusters. - Noise points: Points that are neither core nor border points. They exist in sparse regions and are labeled as outliers (assigned label
-1).
The algorithm then forms clusters by connecting core points that are within eps of each other (density-reachability), and assigns border points to the cluster of their nearest core point.
Parameters
DBSCAN has two key hyperparameters:
eps(ε): The radius that defines a point’s neighborhood. Points within this distance are considered neighbors.min_samples: The minimum number of points required within theeps-neighborhood for a point to qualify as a core point.
Together, these parameters define what “dense” means. A region is dense if you can find at least min_samples points packed within a ball of radius eps.
Choosing Parameters
Selecting good values for eps and min_samples requires some care:
min_samples: A common heuristic is to setmin_samples ≥ D + 1, where D is the number of features (dimensions). A more conservative choice is2 × D. Higher values produce more conservative clustering (fewer, denser clusters; more noise points).eps: The k-distance plot is the standard technique. Compute the distance from each point to its k-th nearest neighbor (where k =min_samples), sort these distances in ascending order, and plot them. Look for an “elbow”—the point where the curve transitions from a gradual rise (intra-cluster distances) to a steep rise (inter-cluster or noise distances). The y-value at this elbow is a good candidate foreps.
Limitations
- Single density scale: DBSCAN uses a global
eps, so it struggles when clusters have significantly different densities. Ifepsis too small, sparse clusters may be fragmented or missed entirely; ifepsis too large, distinct clusters may be incorrectly merged together. - Parameter sensitivity: Small changes in
epscan dramatically change results—a cluster might appear or disappear entirely. - High dimensionality: In high-dimensional spaces, the concept of a fixed-radius neighborhood becomes less meaningful as distances concentrate.
HDBSCAN: Hierarchical Density-Based Clustering
Why HDBSCAN?
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) was designed to address DBSCAN’s primary weakness: the requirement for a single, global density threshold. In real datasets, clusters often exist at different density scales—imagine a city with both a packed downtown core and sparse suburban neighborhoods. HDBSCAN can discover both.
Core Intuition
Rather than committing to a single value of eps, HDBSCAN effectively runs DBSCAN at all possible density thresholds simultaneously and then selects the most stable clusters from this hierarchy. Think of it as watching clusters form and dissolve as you slowly increase the density threshold—the clusters that persist the longest are the most meaningful.
How It Works
- Build a mutual reachability graph: Transform pairwise distances to account for local density, making sparse regions appear “further apart.”
- Construct a minimum spanning tree: Connect all points using the mutual reachability distances.
- Build a cluster hierarchy: Progressively remove edges from longest to shortest, forming a dendrogram of merging clusters.
- Extract stable clusters: Use a stability metric to select the set of clusters that persist over the widest range of density thresholds, rather than simply cutting the dendrogram at a fixed level.
Parameters
min_cluster_size: The smallest grouping you want to consider as a cluster. This is the most important parameter and is often interpretable in domain terms (e.g., “I don’t care about groups smaller than 10 documents”).min_samples: Controls how conservative the algorithm is. Higher values make the algorithm more resistant to noise but may miss smaller or sparser clusters. Defaults tomin_cluster_sizeif not set.
Key Advantages Over DBSCAN
- Variable-density clusters: Discovers clusters at different density scales without manual tuning.
- Fewer hard decisions: Only
min_cluster_sizetruly needs tuning; noepsto guess. - Rich output: Provides cluster persistence (stability scores), soft membership probabilities, and outlier scores—much more information than hard cluster labels alone.
- Noise handling: Like DBSCAN, it naturally identifies outliers, but with more nuance via outlier scores.
Preprocessing: Setting Up for Success
Before applying any density-based algorithm, proper preprocessing is essential. Density estimation is sensitive to the scale and structure of your feature space.
Feature Scaling
Always scale your features before density-based clustering. If one feature ranges from 0–1 and another from 0–10,000, the distance metric will be dominated by the larger-scale feature. Use:
- StandardScaler: Centers features to zero mean and unit variance. Good default choice.
- RobustScaler: Uses median and IQR instead of mean and standard deviation. Better when outliers are present in the features themselves.
Dimensionality Reduction
Density estimation suffers badly from the curse of dimensionality. In high-dimensional spaces, all points become approximately equidistant, making it impossible to distinguish dense from sparse regions. If your data has many features:
- PCA: A linear reduction that preserves global variance structure. Fast and deterministic.
- UMAP: A nonlinear reduction that preserves local neighborhood structure. Often works better as a preprocessing step for density-based clustering because it maintains the local density relationships that DBSCAN/HDBSCAN rely on.
A common pipeline is: scale → reduce to 10–50 dimensions with PCA → further reduce with UMAP if needed → cluster.
Distance Metric Selection
The choice of distance metric should reflect the nature of your data:
- Euclidean: Default for continuous, scaled features in moderate dimensions.
- Cosine: Natural for text embeddings, TF-IDF vectors, or any data where direction matters more than magnitude.
- Manhattan (L1): More robust to outliers in individual features; better in moderately high dimensions.
- Custom metrics: Domain-specific distances (e.g., edit distance for strings, Haversine for geographic coordinates).
Computational Complexity
Both DBSCAN and HDBSCAN require efficient neighborhood queries. Their practical performance depends heavily on the data structure used for these lookups:
| Scenario | Complexity | Notes |
|---|---|---|
| Naive (no indexing) | O(n²) | Computes all pairwise distances |
| KD-tree | O(n log n) average | Effective for low-to-moderate dimensions (< ~20) |
| Ball tree | O(n log n) average | Better than KD-tree in moderate dimensions |
| Approximate nearest neighbors | ~O(n log n) | Libraries like FAISS or Annoy; trade exactness for speed |
For large datasets (millions of points), consider approximate nearest neighbor methods or subsampling strategies during parameter tuning.
Internal Evaluation Metrics
When ground-truth labels are unavailable—which is the common case in real clustering tasks—internal metrics evaluate cluster quality using only the data and the assigned labels. Each metric captures a different aspect of what makes a “good” clustering.
Silhouette Score
The silhouette score measures how similar each point is to its own cluster compared to the nearest neighboring cluster. It balances two competing goals: cohesion (points should be close to their cluster-mates) and separation (points should be far from other clusters).
For each sample i:
- \(a(i)\) = mean distance from i to all other points in the same cluster (intra-cluster distance)
- \(b(i)\) = the minimum, over all other clusters, of the mean distance from i to all points in that cluster (i.e., the mean distance to the nearest neighboring cluster)
The overall silhouette score is the mean across all samples: \(\bar{s} = \text{mean}_i\; s(i) \in [-1, 1]\)
Interpretation:
- Values near +1 indicate points are well-matched to their own cluster and poorly matched to neighboring clusters (ideal).
- Values near 0 indicate points sit on the boundary between two clusters.
- Values near -1 indicate points may have been assigned to the wrong cluster.
When to use: The silhouette score is versatile and widely applicable. It’s particularly useful for comparing different numbers of clusters or different parameter settings.
Caveat: It assumes clusters are convex and roughly equally sized. For elongated, irregularly shaped, or density-varying clusters (exactly the kind DBSCAN/HDBSCAN find), it can underestimate quality.
Davies–Bouldin Index
The Davies–Bouldin (DB) index measures the average “worst-case” similarity between each cluster and its most similar neighboring cluster. Lower values indicate better-separated, more compact clusters.
For k clusters, let:
- \(S_i\) = average distance of all points in cluster i to the cluster centroid \(\mu_i\) (a measure of cluster spread)
- \(M_{ij}\) = distance between centroids \(\mu_i\) and \(\mu_j\)
Interpretation: For each cluster, the index finds the neighboring cluster that is most “confusable” (large spread relative to centroid distance) and averages these worst cases. A DB index of 0 would mean perfectly compact, infinitely separated clusters.
When to use: Useful when you want a metric that penalizes clusters that are both spread out and close together. It’s computationally simpler than the silhouette score for large datasets since it only requires cluster-level statistics.
Caveat: Like silhouette, it is centroid-based and assumes convex clusters.
Calinski–Harabasz Index
The Calinski–Harabasz (CH) index, also called the Variance Ratio Criterion, measures the ratio of between-cluster dispersion to within-cluster dispersion. It rewards clusterings where clusters are tight internally but spread apart from each other.
Let n = total samples, k = number of clusters, \(\mu\) = global centroid, \(\mu_i\) = centroid of cluster i, and \(n_i\) = size of cluster i.
- Between-cluster dispersion: \(B = \sum_{i=1}^{k} n_i |\mu_i - \mu|^2\)
- Within-cluster dispersion: \(W = \sum_{i=1}^{k} \sum_{x \in C_i} |x - \mu_i|^2\)
Interpretation: Higher values indicate better-defined clusters. The normalization by degrees of freedom (dividing B by k−1 and W by n−k) makes it somewhat comparable across different numbers of clusters, though it still tends to favor more clusters.
When to use: Fast to compute and effective for comparing different values of k. Works best when clusters are roughly spherical and similar in size.
Important Caveat for Density-Based Clustering
All three metrics above—silhouette, Davies–Bouldin, and Calinski–Harabasz—were designed with compact, globular clusters in mind. When applied to density-based clustering results, they can be misleading:
- Irregularly shaped clusters (rings, filaments) may score poorly even when the clustering is correct.
- Noise points (labeled
-1) should typically be excluded from metric computation, as they would distort scores. - Clusters of vastly different sizes or densities may yield middling average scores despite being individually well-defined.
For these reasons, always supplement internal metrics with density-aware measures and visual inspection when using DBSCAN or HDBSCAN.
External Evaluation Metrics
When ground-truth labels are available (from expert annotation, known data-generating processes, or benchmark datasets), external metrics provide an objective measure of clustering quality by comparing predicted clusters to true groupings.
Adjusted Rand Index (ARI)
The Rand Index counts the fraction of point pairs where the clustering and ground truth agree (both in the same group, or both in different groups). The Adjusted Rand Index corrects for chance—a random clustering will score near 0 rather than some positive baseline.
Range: −1 to 1. A score of 1 means perfect agreement, 0 means no better than random, and negative values indicate worse-than-random agreement.
When to use: ARI is symmetric and doesn’t require clusters to have a one-to-one correspondence with labels. It’s robust to different numbers of clusters and cluster sizes.
Adjusted Mutual Information (AMI)
AMI measures the information shared between the clustering and the ground truth, normalized and adjusted for chance. It quantifies how much knowing the cluster assignment reduces uncertainty about the true label (and vice versa).
Range: 0 to 1 (after adjustment). A score of 1 means the clustering and labels carry identical information; 0 means they are independent.
When to use: AMI is particularly useful when clusters and labels have different granularity (e.g., more clusters than true classes). It’s information-theoretic and doesn’t assume any particular cluster shape.
Practical Guidance
- Use ARI when you want a general-purpose agreement measure that’s easy to interpret.
- Use AMI when cluster and label counts differ significantly or when you care about information content.
- Both metrics handle noise points naturally—noise can be treated as its own “cluster” or excluded depending on your evaluation goals.
- Neither metric requires clusters to be of similar size, making them suitable for evaluating density-based methods.
Density-Aware and Stability Measures
For density-based clustering, standard metrics may not capture what matters most. These additional evaluation approaches are specifically designed for the properties that DBSCAN and HDBSCAN produce.
Cluster Persistence (HDBSCAN)
HDBSCAN provides a cluster persistence score for each cluster, measuring how long (across what range of density thresholds) the cluster survived in the hierarchy before being merged into a parent cluster. Higher persistence means the cluster represents a robust, stable density feature rather than a transient artifact of a particular threshold.
Use persistence to:
- Rank clusters by reliability (high-persistence clusters are trustworthy; low-persistence clusters may be spurious).
- Identify when the algorithm is finding structure versus fitting noise.
Soft Membership and Outlier Scores
HDBSCAN assigns each point:
- A membership probability (0 to 1): how confidently the point belongs to its assigned cluster. Points near cluster boundaries have lower probabilities.
- An outlier score (0 to 1): how likely the point is to be an outlier. High outlier scores indicate the point doesn’t fit naturally into any dense region.
These soft scores enable downstream decisions:
- Filter out low-confidence assignments before using cluster labels.
- Treat high-outlier-score points as candidates for anomaly detection.
- Use membership probabilities as weights in subsequent analyses.
Bootstrap Stability
A clustering is only useful if it’s stable—if small perturbations to the data don’t dramatically change the results. Bootstrap stability testing works as follows:
- Repeatedly subsample or bootstrap your dataset (e.g., 80% random samples, 20+ iterations).
- Run clustering on each subsample.
- Measure agreement between runs using ARI or a similar metric.
High agreement across subsamples indicates robust clustering. Low agreement suggests the algorithm is fitting noise or that the data lacks clear cluster structure at the chosen parameters.
Practical and Visual Validation
Metrics alone never tell the complete story. Visual and domain-informed checks are essential for building confidence in your results.
Cluster Size Distribution
Examine the distribution of cluster sizes. Warning signs include:
- A single dominant cluster containing most points (the algorithm may not be separating structure).
- Many tiny clusters with just a few points (possible over-fragmentation or noise sensitivity).
- An excessive proportion of noise points (parameters may be too conservative).
There’s no universal “correct” distribution—it depends on your data and domain—but extreme imbalances warrant investigation.
Visualization
Project your data to 2D using UMAP or t-SNE and color points by:
- Cluster assignment (do clusters form visually coherent regions?)
- Outlier score (are noise points in sparse regions as expected?)
- Membership probability (do border regions show lower confidence?)
Important caveat: 2D projections can distort distances and densities. Use them for qualitative sanity checks, not as definitive evidence of cluster quality.
Domain-Specific Validation
The ultimate test of a clustering is whether it’s useful. Domain validation might include:
- Checking whether clusters correspond to known categories (even if labels weren’t used during clustering).
- Verifying temporal or spatial coherence within clusters.
- Presenting cluster summaries to domain experts for qualitative assessment.
- Measuring downstream task performance when clusters are used as features.
Putting It All Together: Example Workflow
The following Python example demonstrates a complete clustering pipeline from preprocessing through evaluation:
# pip install scikit-learn hdbscan umap-learn
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN
from sklearn.metrics import (
silhouette_score,
davies_bouldin_score,
calinski_harabasz_score,
adjusted_rand_score,
adjusted_mutual_info_score,
)
import hdbscan
# --- Preprocessing ---
# X: (n_samples, n_features) numpy array
X_scaled = StandardScaler().fit_transform(X)
# --- Parameter Selection for DBSCAN ---
# k-distance plot to choose eps (k = min_samples)
k = 5
nbrs = NearestNeighbors(n_neighbors=k).fit(X_scaled)
distances, _ = nbrs.kneighbors(X_scaled)
k_distances = np.sort(distances[:, -1])
# Plot k_distances; pick eps at the "elbow" of the curve
# --- DBSCAN Clustering ---
db = DBSCAN(eps=0.5, min_samples=k, metric="euclidean").fit(X_scaled)
labels_db = db.labels_ # -1 indicates noise points
# --- HDBSCAN Clustering ---
hdb = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=5, metric="euclidean")
hdb.fit(X_scaled)
labels_hdb = hdb.labels_
probabilities = hdb.probabilities_ # soft membership scores
outlier_scores = hdb.outlier_scores_ # per-point outlier scores
persistence = hdb.cluster_persistence_ # per-cluster stability
Computing Evaluation Metrics
# --- Internal metrics (exclude noise points for fair evaluation) ---
mask_db = labels_db != -1
if mask_db.sum() > 1 and len(set(labels_db[mask_db])) > 1:
sil_db = silhouette_score(X_scaled[mask_db], labels_db[mask_db])
db_index = davies_bouldin_score(X_scaled[mask_db], labels_db[mask_db])
ch_index = calinski_harabasz_score(X_scaled[mask_db], labels_db[mask_db])
print(f"DBSCAN — Silhouette: {sil_db:.3f}, DB: {db_index:.3f}, CH: {ch_index:.1f}")
# --- External metrics (when ground-truth labels y_true are available) ---
ari_hdb = adjusted_rand_score(y_true, labels_hdb)
ami_hdb = adjusted_mutual_info_score(y_true, labels_hdb)
print(f"HDBSCAN — ARI: {ari_hdb:.3f}, AMI: {ami_hdb:.3f}")
# --- Stability analysis ---
print(f"Cluster persistence scores: {persistence}")
print(f"Fraction of points labeled as noise: {(labels_hdb == -1).mean():.2%}")
Common Pitfalls and Best Practices
Pitfalls to Avoid
Skipping feature scaling. Unscaled features make distance-based algorithms behave unpredictably—the clustering will be dominated by whichever feature has the largest numeric range.
Using DBSCAN on variable-density data. If your data has clusters at different density scales, DBSCAN’s single
epswill either fragment dense clusters or merge sparse ones. Use HDBSCAN instead.Trusting a single metric. No internal metric perfectly captures cluster quality. A high silhouette score doesn’t guarantee meaningful clusters, especially for non-convex shapes. Always use multiple metrics and visual checks.
Ignoring noise points in evaluation. Including noise points (label
-1) in silhouette or other internal metrics will produce misleadingly low scores. Evaluate clustered points separately, then assess whether the noise classification itself is reasonable.Clustering in excessively high dimensions. Density estimation degrades rapidly above ~20 dimensions. Apply dimensionality reduction first.
Overfitting parameters to a single metric. Tuning
epsormin_cluster_sizeto maximize silhouette score can produce “optimal” metrics for meaningless clusters. Validate with domain knowledge.
Best Practices
Start with HDBSCAN unless you have strong reasons to prefer DBSCAN. HDBSCAN handles variable densities, has fewer sensitive parameters, and provides richer output.
Build an evaluation portfolio: combine internal metrics + visual inspection + stability testing + domain validation. No single signal is sufficient.
Use the k-distance plot for DBSCAN parameter selection rather than guessing or grid search.
Test stability via bootstrap subsampling. Unstable clusters are not trustworthy regardless of what metrics say.
Inspect outliers. In density-based clustering, noise points are features, not bugs. Understand why points are classified as noise—they may represent genuine anomalies, data quality issues, or an indication that parameters are too conservative.
Consider approximate nearest neighbors (FAISS, Annoy, PyNNDescent) for datasets exceeding ~100K points to keep runtime practical.
Document your choices. Record preprocessing steps, parameter values, and the reasoning behind them. Clustering is inherently subjective; reproducibility requires explicit documentation.
Conclusion and Key Takeaways
Density-based clustering offers a powerful alternative to centroid-based methods when your data contains clusters of varying shapes, sizes, and densities—which is to say, most real-world data. Here are the key points to take away:
- DBSCAN is conceptually simple and effective when clusters share a similar density. Its main challenge is choosing
eps, which the k-distance plot addresses. - HDBSCAN extends DBSCAN to handle multi-scale density and provides richer output (persistence, probabilities, outlier scores). It should be your default choice for most density-based clustering tasks.
- Evaluation requires a portfolio approach. Internal metrics (silhouette, Davies–Bouldin, Calinski–Harabasz) provide quantitative signals but have blind spots for non-convex clusters. External metrics (ARI, AMI) are ideal when labels exist. Stability testing and visual inspection fill the gaps.
- Preprocessing matters enormously. Scale your features, reduce dimensionality when needed, and choose distance metrics thoughtfully.
- No metric replaces domain judgment. The ultimate measure of a clustering’s quality is whether it produces actionable, interpretable, and stable groupings for your specific use case.
Clustering is as much art as science. The algorithms and metrics described here are tools—powerful ones—but they require thoughtful application, iterative refinement, and a healthy skepticism of any single quantitative score.
