dataeval.core.rank_kmeans_distance

dataeval.core.rank_kmeans_distance(embeddings, c=None, n_init='auto', reference=None)

Rank samples using distance to cluster centers.

Clusters the embeddings with K-means and ranks each sample by its distance to its assigned cluster center. Samples are returned in easy-first order: the lower the distance, the more prototypical the sample.
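
The ranking idea can be sketched with NumPy alone. This is a minimal illustration, not the library's implementation: the helper name kmeans_distance_rank, the fixed iteration count, the random initialization, and the truncation used for sqrt(n_samples) are all assumptions made for the example.

```python
import numpy as np


def kmeans_distance_rank(embeddings, n_clusters, n_iter=20, seed=0):
    """Hypothetical helper: rank samples by distance to their K-means center."""
    rng = np.random.default_rng(seed)
    x = np.asarray(embeddings, dtype=np.float64)
    # Initialize centers from distinct, randomly chosen samples (assumption).
    centers = x[rng.choice(len(x), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Pairwise Euclidean distances, shape (n_samples, n_clusters).
        dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = x[labels == k]
            if len(members):  # keep the previous center if a cluster is empty
                centers[k] = members.mean(axis=0)
    # Distance to the nearest center, which is the assigned center.
    dists = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
    scores = dists.min(axis=1)
    # Easy-first order: smallest distance first.
    indices = np.argsort(scores, kind="stable")
    return indices, scores


rng = np.random.default_rng(42)
embeddings = rng.random((100, 64))
# Mirrors the documented default rule sqrt(n_samples); truncation is assumed.
n_clusters = int(np.sqrt(len(embeddings)))
indices, scores = kmeans_distance_rank(embeddings, n_clusters)
```

Here indices[0] is the sample closest to its cluster center and indices[-1] is the one farthest from it.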

Parameters:

embeddings : NDArray[np.floating]: Embedding vectors to rank, shape (n_samples, n_features).
c : int | None, default None: Number of clusters. If None, sqrt(n_samples) is used.
n_init : int | "auto", default "auto": Number of K-means initializations.
reference : NDArray[np.floating] | None, default None: Reference embeddings for comparative ranking. If provided, samples are ranked relative to the reference set rather than relative to the input embeddings themselves.
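
One plausible reading of the reference parameter is that cluster centers are fitted on the reference embeddings and the samples are then scored against those centers. The sketch below illustrates that reading only; the source does not spell out the mechanism, so the helper fit_centers, the explicit cluster count, and the scoring rule are assumptions.

```python
import numpy as np


def fit_centers(x, n_clusters, n_iter=20, seed=0):
    """Hypothetical helper: plain Lloyd iterations standing in for K-means."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        labels = np.linalg.norm(x[:, None] - centers[None], axis=2).argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centers[k] = x[labels == k].mean(axis=0)
    return centers


rng = np.random.default_rng(0)
reference = rng.random((400, 16))  # made-up reference embeddings
samples = rng.random((50, 16))     # made-up embeddings to rank

# Assumption: centers come from the reference set, not from the samples.
centers = fit_centers(reference, n_clusters=8)

# Score each sample by its distance to the nearest reference center.
scores = np.linalg.norm(samples[:, None] - centers[None], axis=2).min(axis=1)
indices = np.argsort(scores, kind="stable")  # easy-first order
```

Under this reading, a sample with a low score lies close to a dense region of the reference set, while a high score marks a sample the reference set represents poorly.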

Returns:

Dictionary containing:

indices: NDArray[np.intp] - Indices sorted in easy-first order
scores: NDArray[np.float32] | None - Distance to cluster center for each sample
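
A short sketch of consuming a result with these two keys. The dictionary below is a hand-built stand-in with made-up values, not output from the library, and it assumes that scores is indexed by original sample position; check that assumption against the actual return value before relying on it.

```python
import numpy as np

# Stand-in result with made-up distances for four samples.
scores = np.array([0.9, 0.1, 0.5, 0.3], dtype=np.float32)
result = {"indices": np.argsort(scores, kind="stable"), "scores": scores}

# The two most prototypical samples come first in easy-first order.
easiest = result["indices"][:2]
# The two hardest samples are at the end; reverse to list the hardest first.
hardest = result["indices"][-2:][::-1]

# scores may be None according to the documented type, so guard the lookup.
if result["scores"] is not None:
    hardest_scores = result["scores"][hardest]
```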

Return type:

RankResult

Raises:

ValueError – If c is invalid (negative, or greater than or equal to the dataset size).
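
The documented constraint on c can be checked before calling the function. The helper below is hypothetical and simply restates the rule above; the library's own check and error message may differ.

```python
def check_cluster_count(c, n_samples):
    """Hypothetical helper mirroring the documented constraint on c."""
    if c < 0 or c >= n_samples:
        raise ValueError(f"invalid c={c} for a dataset of {n_samples} samples")
    return c


message = None
try:
    # c equal to the dataset size violates the documented rule.
    check_cluster_count(100, n_samples=100)
except ValueError as err:
    message = str(err)
```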

Examples

>>> from dataeval.core import rank_kmeans_distance
>>> import numpy as np
>>> embeddings = np.random.rand(100, 64).astype(np.float32)
>>> result = rank_kmeans_distance(embeddings, c=10)