dataeval.core.rank_hdbscan_distance
dataeval.core.rank_hdbscan_distance(embeddings, c=None, max_cluster_size=None, reference=None)
Rank samples using distance to HDBSCAN cluster centers.
Clusters embeddings using HDBSCAN and ranks by distance to assigned cluster centers. Returns samples in easy-first order (low distance = prototypical).
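The ranking step described above can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the helper name `rank_by_centroid_distance` is hypothetical, the cluster labels are supplied by hand instead of coming from HDBSCAN, every sample is assumed to have a non-negative label (no noise points), and the cluster "center" is assumed to be the mean of the cluster members.

```python
import numpy as np


def rank_by_centroid_distance(embeddings, labels):
    """Rank samples by Euclidean distance to the mean of their cluster.

    Hypothetical helper for illustration only. Assumes one non-negative
    cluster id per sample and uses the arithmetic mean as the center.
    """
    embeddings = np.asarray(embeddings, dtype=np.float32)
    labels = np.asarray(labels)
    distances = np.empty(len(embeddings), dtype=np.float32)
    for cluster_id in np.unique(labels):
        members = labels == cluster_id
        center = embeddings[members].mean(axis=0)
        distances[members] = np.linalg.norm(embeddings[members] - center, axis=1)
    # Stable ascending sort: smallest distance (most prototypical) first.
    indices = np.argsort(distances, kind="stable")
    return indices, distances


# Two toy clusters in 2-D; sample 2 sits far from the rest of cluster 0.
points = np.array([[0.0, 0.0], [0.0, 2.0], [0.0, 10.0], [5.0, 5.0], [5.0, 7.0]])
cluster_ids = np.array([0, 0, 0, 1, 1])
order, dist = rank_by_centroid_distance(points, cluster_ids)
```

In this toy input, sample 2 is the farthest from its cluster mean, so it lands last in the easy-first order.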
- Parameters:
- embeddings : NDArray[np.floating]
Embedding vectors to rank, shape (n_samples, n_features).
- c : int | None, default None
Expected number of clusters (used as hint for min_cluster_size). If None, uses sqrt(n_samples).
- max_cluster_size : int | None, default None
Maximum size limit for identified clusters.
- reference : NDArray[np.floating] | None, default None
Reference embeddings for comparative ranking. If provided, samples are ranked relative to the reference set rather than themselves.
- Returns:
Dictionary containing:
indices: NDArray[np.intp] - Ranked indices in easy-first order
scores: NDArray[np.float32] | None - Distance to cluster center for each sample
method: str - "hdbscan_distance"
policy: str - "easy_first"
- Return type:
Examples
>>> from dataeval.core import rank_hdbscan_distance
>>> import numpy as np
>>> embeddings = np.random.rand(100, 64).astype(np.float32)
>>> result = rank_hdbscan_distance(embeddings, c=10)
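As a rough illustration of consuming the returned dictionary, the sketch below selects the most and least prototypical samples from the ranked indices. The `result` here is a hand-built stand-in that only mirrors the documented keys, so the snippet runs without calling the library; the index values and the choice of `k` are made up for the example.

```python
import numpy as np

# Hand-built stand-in with the documented keys (values are illustrative).
result = {
    "indices": np.array([3, 0, 4, 1, 2], dtype=np.intp),
    "scores": None,  # documented as possibly None, so it is not used here
    "method": "hdbscan_distance",
    "policy": "easy_first",
}
embeddings = np.arange(10, dtype=np.float32).reshape(5, 2)

k = 2
# Easy-first order: the front of the ranking holds the most prototypical samples.
easiest = result["indices"][:k]
# Reversing the ranking gives the least prototypical samples first.
hardest = result["indices"][::-1][:k]
# Ranked indices can be used directly to subset the original embeddings.
easy_subset = embeddings[easiest]
```

Only `indices` is relied on here, since the documentation above does not state whether `scores` follows the original sample order or the ranked order.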