Clustering¶
Clustering is not an evaluator in DataEval — it is the infrastructure that
several evaluators depend on. Understanding how clustering works, and what its
outputs mean, helps you interpret results from Duplicates (cluster
mode), Outliers (cluster mode), label_errors(), and all five
methods of Prioritize.
What is it¶
Clustering partitions a set of embeddings into groups (clusters) such that samples within a group are more similar to each other than to samples in other groups. In DataEval, clustering always operates on embeddings rather than raw images: pixel-space distance is a poor proxy for semantic similarity, and clustering in raw pixel space would identify images with similar colors or brightness as neighbors regardless of content.
The cluster() function is the entry point. It accepts embeddings and
returns a ClusterResult containing cluster assignments, the minimum
spanning tree (MST), linkage tree, membership strengths, and k-nearest neighbor
indices and distances. Two algorithms are available: HDBSCAN (default) and
KMeans.
The two algorithms¶
HDBSCAN¶
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) discovers clusters from density structure rather than requiring a predetermined number of clusters. The algorithm proceeds in stages:
MST construction: compute the k-nearest neighbor graph (up to \(k=25\) neighbors per point) and extract the minimum spanning tree over the full dataset. The MST captures the global connectivity structure of the embedding space in a compact form — cutting an edge in the MST corresponds to separating the dataset into two connected components.
Condensed tree: convert the MST into a hierarchical linkage tree via single-linkage clustering, then condense it by merging nodes with fewer than
min_cluster_sizesamples into their parent. The condensed tree represents the multi-scale cluster structure of the data.Cluster extraction: apply excess-of-mass (EOM) selection to extract the most persistent clusters from the condensed tree. Clusters that persist across a wide range of density thresholds are preferred over clusters that only appear briefly.
Outlier assignment: HDBSCAN assigns label -1 to samples that do not belong to any dense cluster. DataEval then reassigns these outliers to their nearest cluster center, so the final
clustersarray contains no -1 values. Themembership_strengthsarray retains the original HDBSCAN confidence scores — a low membership strength on a sample assigned to a cluster means HDBSCAN originally considered it borderline.
min_cluster_size is the key hyperparameter. DataEval sets it adaptively:
if
n_clustersis provided,min_cluster_size = max(5, n_samples // (n_clusters * 3))— smaller clusters are allowed when more clusters are expected;if
n_clustersis not provided, the default is 5% of the dataset, capped at 100 —min_cluster_size = min(100, max(5, (n_samples * 0.05)))
HDBSCAN is the default because it handles irregular cluster shapes, varying cluster densities, and datasets where the number of clusters is unknown. Its results are deterministic given the same input data in the same order.
KMeans¶
KMeans partitions the embeddings into exactly n_clusters spherical clusters
by iteratively assigning each point to its nearest centroid and recomputing
centroids. DataEval uses k-means++ initialization to reduce sensitivity to the
starting configuration.
When n_clusters is not specified, the default is \(\lfloor\sqrt{n}\rfloor\)
where \(n\) is the number of samples. KMeans always assigns every sample to a
cluster — there are no outliers. Membership strength is computed from normalized
distance to the cluster center: points close to their centroid have high
membership strength; points far from any centroid have low membership strength.
KMeans is appropriate when you have strong prior knowledge about the number of clusters (for example, when it equals the number of classes in a classification problem) or when cluster shape is expected to be roughly spherical.
KMeans results are not exactly reproducible without fixing the global random
seed via dataeval.config.set_seed(). The k-means++ initialization draws
random starting centroids; different runs will produce different cluster
assignments unless the seed is fixed. HDBSCAN does not have this property.
What the ClusterResult contains¶
Field |
Type |
Description |
|---|---|---|
|
|
Cluster assignment for each sample (0-indexed, no -1 values) |
|
|
MST edge list as (source, target, weight) triples |
|
|
Hierarchical linkage array from the MST |
|
|
Per-sample cluster membership confidence [0, 1] |
|
|
Indices of up to 25 nearest neighbors per sample |
|
|
Distances to those neighbors |
The membership_strengths field is the most useful for downstream use. Samples
with low membership strength are at the boundary of their cluster — they are
the candidates that Duplicates cluster mode considers semantically
ambiguous, and they are the samples that label_errors() is most likely to
flag as mislabeled if they are near a different class’s cluster.
Relationship to other evaluators¶
Duplicates cluster mode runs HDBSCAN on embeddings and groups
samples that land in the same cluster. Each cluster’s mean and standard
deviation are determined and used to tag samples whose nearest neighbor
is less than one deviation from the mean (sample_dist < mean - std).
Unlike hash-based deduplication (which detects visually similar images),
cluster-based deduplication detects semantically redundant images — images
that are different in appearance but represent the same concept in the embedding space.
Outliers cluster mode runs clustering and flags samples whose
distance to their nearest cluster center exceeds a threshold derived from that
cluster’s distance distribution (mean + n·std, default n=3). This identifies
samples that are semantically isolated — not just statistically unusual in image
statistics, but genuinely unlike the rest of the cluster they were assigned to.
label_errors() does not call cluster() directly, but uses the same
MST infrastructure (compute_neighbor_distances) to find intra-class and
extra-class nearest neighbors. The geometry it relies on — whether a sample’s
nearest neighbors come from its own class or a different one — is the same
geometric structure that clustering makes visible.
Prioritize with kmeans_distance, kmeans_complexity,
hdbscan_distance, or hdbscan_complexity methods runs the respective
clustering algorithm internally and uses cluster structure to score samples for
dataset pruning. See Prioritization
for details on how cluster distance and complexity scores translate to sample ranking.
Practical considerations¶
Embedding dimension should be below 500. Clustering in very high-dimensional spaces suffers from the curse of dimensionality — distances become increasingly uniform, making density estimation unreliable. If your embedding model produces high-dimensional representations, apply PCA or another dimensionality reduction step before clustering.
HDBSCAN’s min_cluster_size controls granularity. A large value produces
fewer, larger clusters; a small value produces many small clusters with more
samples classified as boundary points (low membership strength). For datasets
with many fine-grained categories, setting n_clusters to approximately the
number of classes is a reasonable starting hint.
KMeans requires fixing the seed for reproducibility. If reproducible
cluster assignments matter for your workflow (for example, if you are using
cluster assignments as input to a downstream analysis), call
dataeval.config.set_seed(your_seed) before running KMeans. See the
Configuring the seed
how-to for an example.
Clustering is embedding-dependent. The clusters discovered by either algorithm reflect the structure of the embedding space, not ground truth semantic categories. A weak embedding that fails to separate classes will produce clusters that do not align with class boundaries, making all downstream uses of cluster structure — outlier detection, duplicate detection, label error detection — less reliable.
When to use it directly¶
Most users will interact with clustering indirectly through Duplicates,
Outliers, label_errors(), or Prioritize. Calling
cluster() directly is appropriate when:
You want to visualize the cluster structure of your dataset (using the
mstorlinkage_treeoutputs with a dendrogram or network visualization)You want to understand how many natural groupings exist in your embeddings before configuring
n_clustersfor KMeans methods in PrioritizeYou want to examine
membership_strengthsto identify boundary samples before label review
Limitations¶
HDBSCAN is sensitive to the density structure of the embedding. Very
uniform density distributions (common in high-dimensional spaces) may cause
HDBSCAN to return a single large cluster. If HDBSCAN produces one cluster for
your entire dataset, consider reducing embedding dimensionality or switching to
KMeans with an explicit n_clusters.
KMeans assumes spherical clusters of roughly equal size. Datasets with elongated, nested, or strongly variable-density clusters will be partitioned suboptimally by KMeans. In these cases, HDBSCAN is more appropriate.
Non-repeatable KMeans results can complicate auditing and reproducibility requirements. Document the seed used for any cluster-based analysis that feeds into a test report.
Related concept pages¶
Data Integrity — how clustering feeds into Duplicates, Outliers, and label_errors
Embeddings — the representation clustering operates on
Acting on Results — how to interpret and respond to cluster-based findings
See this in practice¶
How-to guides¶
Tutorials¶
Assessing the data space tutorial — coverage gaps and embedding-space decisions