dataeval.core.rank_kmeans_complexity

dataeval.core.rank_kmeans_complexity(embeddings, c=None, n_init='auto', reference=None)

Rank samples using cluster complexity weighting.

Uses a weighted sampling strategy based on intra-cluster and inter-cluster distances. Returns samples in easy-first order.

Note: This method does not produce scores, so rerank_stratified() cannot be used with results from this function.
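To make the two quantities named above concrete, here is a minimal NumPy sketch of intra-cluster and inter-cluster distances on toy data. It is only an illustration under stated assumptions: the cluster labels are hand-assigned rather than produced by K-means, and how this function actually combines the distances into sampling weights is not described here and is not reproduced.

```python
import numpy as np

# Toy data: two well-separated 2-D clusters with hand-assigned labels.
# (Assumption: a real run would obtain labels from K-means; they are fixed here.)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

# Centroid of each cluster: the mean of its member points.
centroids = np.stack([points[labels == k].mean(axis=0) for k in range(2)])

# Intra-cluster distance: each sample's distance to its own centroid.
intra = np.linalg.norm(points - centroids[labels], axis=1)

# Inter-cluster distance: pairwise distances between cluster centroids.
inter = np.linalg.norm(centroids[:, None, :] - centroids[None, :, :], axis=-1)
```

Here `intra` holds one value per sample and `inter` is a symmetric (n_clusters, n_clusters) matrix with zeros on the diagonal.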

Parameters:
embeddings : NDArray[np.floating]

Embedding vectors to rank, shape (n_samples, n_features).

c : int | None, default None

Number of clusters. If None, uses sqrt(n_samples).

n_init : int | "auto", default "auto"

Number of K-means initializations.

reference : NDArray[np.floating] | None, default None

Reference embeddings for comparative ranking. If provided, samples are ranked relative to the reference set rather than relative to the input embeddings themselves.

Returns:

Dictionary containing:

  • indices: NDArray[np.intp] - Ranked indices in easy-first order

  • scores: None (this method does not produce scores)

  • method: str - "kmeans_complexity"

  • policy: str - "easy_first"

Return type:

RankResult
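As a sketch of how the returned fields might be consumed, the snippet below uses a hand-built dictionary as a stand-in for a real result. The index order is made up for illustration, and treating a simple reversal as a hard-first ordering is an assumption, not documented behavior.

```python
import numpy as np

embeddings = np.arange(12, dtype=np.float32).reshape(4, 3)

# Hypothetical stand-in for a returned result; the index order is invented.
result = {
    "indices": np.array([2, 0, 3, 1], dtype=np.intp),
    "scores": None,  # this method does not produce scores
    "method": "kmeans_complexity",
    "policy": "easy_first",
}

# Reorder the embeddings into the ranked (easy-first) order.
easy_first = embeddings[result["indices"]]

# Assumption: reversing the ranking yields a hard-first ordering.
hard_first_indices = result["indices"][::-1]

# Select the two easiest samples by slicing the ranked indices.
easiest_two = result["indices"][:2]
```

Because `scores` is None, any downstream step that needs per-sample scores should check for that before proceeding.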

Raises:

ValueError – If c is invalid (negative, or greater than or equal to the number of samples).
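The documented default and constraints on c can be mirrored in a pre-flight check before calling the function. The helper below, `resolve_cluster_count`, is hypothetical and not part of dataeval. Truncating the square root to an integer is an assumption, since the rounding rule is not stated here, and c = 0 is left unhandled because the documentation above does not address it.

```python
import math
from typing import Optional


def resolve_cluster_count(n_samples: int, c: Optional[int] = None) -> int:
    """Hypothetical helper mirroring the documented default and limits on c."""
    if c is None:
        # Documented default is sqrt(n_samples); flooring is an assumption.
        return math.isqrt(n_samples)
    # Documented invalid cases: negative, or >= dataset size.
    if c < 0 or c >= n_samples:
        raise ValueError(
            f"c must be non-negative and less than n_samples={n_samples}, got {c}"
        )
    return c
```

For example, `resolve_cluster_count(100)` gives 10, while `resolve_cluster_count(100, 100)` raises ValueError.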

Examples

>>> from dataeval.core import rank_kmeans_complexity
>>> import numpy as np
>>> embeddings = np.random.rand(100, 64).astype(np.float32)
>>> result = rank_kmeans_complexity(embeddings, c=10)