dataeval.core.rank_kmeans_complexity
dataeval.core.rank_kmeans_complexity(embeddings, c=None, n_init='auto', reference=None)
Rank samples using cluster complexity weighting.
Uses a weighted sampling strategy based on intra-cluster and inter-cluster distances. Returns sample indices in easy-first order.
Note: This method does not produce scores, so rerank_stratified() cannot be used with results from this function.
- Parameters:
- embeddings : NDArray[np.floating]
Embedding vectors to rank, shape (n_samples, n_features).
- c : int | None, default None
Number of clusters. If None, uses sqrt(n_samples).
- n_init : int | "auto", default "auto"
Number of K-means initializations.
- reference : NDArray[np.floating] | None, default None
Reference embeddings for comparative ranking. If provided, samples are ranked relative to the reference set rather than themselves.
- Returns:
Dictionary containing:
indices: NDArray[np.intp] - Ranked indices in easy-first order
scores: None (this method does not produce scores)
method: str - "kmeans_complexity"
policy: str - "easy_first"
- Return type:
RankResult
- Raises:
ValueError – If c is invalid (>= dataset size or negative).
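The default and the validity rule for c can be sketched as a small standalone helper. This is only an illustration of the documented behavior, not the library's code: `resolve_num_clusters` is a hypothetical name, flooring the square root is an assumption (the rounding is not stated here), and the check mirrors only the two documented failure conditions.

```python
import math
from typing import Optional


def resolve_num_clusters(n_samples: int, c: Optional[int] = None) -> int:
    """Hypothetical helper: choose and validate a cluster count."""
    if c is None:
        # Assumption: floor of sqrt(n_samples); the library's rounding may differ.
        c = math.isqrt(n_samples)
    # Mirror the documented failure conditions: negative, or >= dataset size.
    if c < 0 or c >= n_samples:
        raise ValueError(
            f"c must be non-negative and smaller than {n_samples}, got {c}"
        )
    return c


print(resolve_num_clusters(100))        # 10
print(resolve_num_clusters(100, c=25))  # 25
```

With 100 samples, for example, `c=100` or `c=-1` would raise ValueError under this sketch.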
Examples
>>> from dataeval.core import rank_kmeans_complexity
>>> import numpy as np
>>> embeddings = np.random.rand(100, 64).astype(np.float32)
>>> result = rank_kmeans_complexity(embeddings, c=10)
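Because the result carries only an ordering (scores is None), downstream code typically slices the indices. The sketch below shows one way to do that without calling the library: `easiest` and `hardest` are hypothetical helpers, and `mock_result` is a hand-written stand-in that uses the documented keys, with a plain list in place of the NumPy index array.

```python
from typing import Any, List, Mapping


def easiest(result: Mapping[str, Any], k: int) -> List[int]:
    """Hypothetical helper: first k indices of an easy-first ranking."""
    return [int(i) for i in result["indices"][:k]]


def hardest(result: Mapping[str, Any], k: int) -> List[int]:
    """Hypothetical helper: last k indices, returned hardest first."""
    return [int(i) for i in result["indices"][::-1][:k]]


# Stand-in with the documented keys; a real call returns a NumPy index array.
mock_result = {
    "indices": [4, 0, 3, 1, 2],
    "scores": None,
    "method": "kmeans_complexity",
    "policy": "easy_first",
}

print(easiest(mock_result, 2))  # [4, 0]
print(hardest(mock_result, 2))  # [2, 1]
```

The same slicing works on a NumPy array, so `embeddings[result["indices"][:k]]` would select the k easiest embeddings from a real result.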