dataeval.core.rank_hdbscan_complexity

dataeval.core.rank_hdbscan_complexity(embeddings, c=None, max_cluster_size=None, reference=None)

Rank samples using HDBSCAN cluster complexity weighting.

Ranks samples with a weighted sampling strategy based on the intra-cluster and inter-cluster distances obtained from HDBSCAN clustering. Sample indices are returned in easy-first order.

Note: This method does not produce scores, so .stratified() cannot be used with results from this function.
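To make the two distance notions above concrete, here is a minimal standard-library sketch. It is not dataeval's implementation: the helper name cluster_distances, the use of cluster centroids, and the hand-written labels (standing in for HDBSCAN output) are all assumptions made for illustration, and the weighting the library applies to these quantities is not reproduced.

```python
import math
from collections import defaultdict


def cluster_distances(points, labels):
    """Hypothetical helper: per-point intra- and inter-cluster distances.

    intra[i] is the distance from point i to its own cluster centroid.
    inter[i] is the distance from point i to the nearest other centroid.
    Centroid-based distances are an assumption of this sketch.
    """
    groups = defaultdict(list)
    for point, label in zip(points, labels):
        groups[label].append(point)

    # Centroid = per-dimension mean of the cluster's members.
    centroids = {
        label: [sum(dim) / len(members) for dim in zip(*members)]
        for label, members in groups.items()
    }

    intra, inter = [], []
    for point, label in zip(points, labels):
        intra.append(math.dist(point, centroids[label]))
        others = [
            math.dist(point, centroid)
            for other, centroid in centroids.items()
            if other != label
        ]
        # With a single cluster there is no "other" centroid.
        inter.append(min(others) if others else math.inf)
    return intra, inter


# Toy data: two well-separated clusters with hand-assigned labels.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
intra, inter = cluster_distances(points, labels)
```

In this toy case every point sits close to its own centroid and far from the other one, which is the kind of sample an easy-first ordering would plausibly place early.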

Parameters:

embeddings : NDArray[np.floating]: Embedding vectors to rank, shape (n_samples, n_features).
c : int | None, default None: Expected number of clusters, used as a hint for min_cluster_size. If None, sqrt(n_samples) is used.
max_cluster_size : int | None, default None: Maximum size allowed for any identified cluster.
reference : NDArray[np.floating] | None, default None: Reference embeddings for comparative ranking. If provided, samples are ranked relative to the reference set rather than relative to the input embeddings themselves.
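The default for c can be illustrated with a small sketch. The helper default_cluster_hint is hypothetical, and truncating the square root to an integer is an assumption; the library may round differently or post-process the value before deriving min_cluster_size.

```python
import math


def default_cluster_hint(n_samples, c=None):
    """Hypothetical helper: resolve the expected-cluster-count hint.

    Assumes sqrt(n_samples) is truncated to an integer when c is None.
    """
    if c is not None:
        return c
    # math.isqrt returns the integer floor of the square root.
    return max(1, math.isqrt(n_samples))


hint_small = default_cluster_hint(100)
hint_large = default_cluster_hint(10_000)
hint_explicit = default_cluster_hint(100, c=4)
```

Under this assumption the hint grows slowly with dataset size, so passing c explicitly is mainly useful when the expected number of clusters is known in advance.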

Returns:

Result with:

indices: NDArray[np.intp] - Indices sorted in easy-first order
scores: None (this method does not produce scores)
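A short sketch of how the returned fields might be consumed. RankResultStandIn and easiest_and_hardest are hypothetical names introduced here so the example runs without dataeval; only the two documented fields (indices and scores) are mirrored.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RankResultStandIn:
    """Hypothetical stand-in mirroring the two documented fields."""

    indices: Sequence[int]
    scores: Optional[Sequence[float]] = None


def easiest_and_hardest(result, k):
    """Return the k easiest and k hardest sample indices.

    Relies only on the documented easy-first ordering of result.indices.
    The hardest indices are returned hardest-first.
    """
    ordered = list(result.indices)
    return ordered[:k], ordered[::-1][:k]


# Made-up ranking over five samples, easy-first.
result = RankResultStandIn(indices=[3, 0, 2, 4, 1])
easy, hard = easiest_and_hardest(result, 2)

# scores is None for this ranking method, so check before any
# score-dependent step.
has_scores = result.scores is not None
```

Because the ordering alone carries the ranking, slicing the index sequence is enough to build easy or hard subsets; the has_scores check guards any downstream step that expects per-sample scores.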

Return type:

RankResult

Examples

>>> from dataeval.core import rank_hdbscan_complexity
>>> import numpy as np
>>> embeddings = np.random.rand(100, 64).astype(np.float32)
>>> result = rank_hdbscan_complexity(embeddings, c=10)