dataeval.core.rerank_stratified

dataeval.core.rerank_stratified(result, num_bins=50)

Rerank by stratified sampling across score bins.

Takes a RankResult (expected to be in easy_first order) and applies stratified sampling to balance selection across score bins. This encourages diversity by de-weighting samples with similar scores.

The output is returned in hard_first order, so priority is preserved while selection is balanced across bins.
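
As a rough illustration of the idea, and not DataEval's actual implementation, the sketch below bins scores and interleaves samples across bins in plain Python. The function name stratified_rerank_sketch, the equal-width binning, and the round-robin interleaving are all assumptions made for this example; the library may bin or sample differently.

```python
from itertools import zip_longest


def stratified_rerank_sketch(indices, scores, num_bins=50):
    """Hypothetical sketch: interleave samples across equal-width score bins.

    indices: sample indices in easy_first order.
    scores:  scores[i] is the score of sample i (original order).
    Returns a list of indices, hardest-ranked first, balanced across bins.
    """
    lo, hi = min(scores), max(scores)
    # Assumption: equal-width bins; fall back to width 1.0 if all scores match.
    width = (hi - lo) / num_bins or 1.0

    bins = {}  # bin id -> member indices; dict keeps first-seen (hardest) order
    for idx in reversed(indices):  # walk from hardest to easiest
        b = min(int((scores[idx] - lo) / width), num_bins - 1)
        bins.setdefault(b, []).append(idx)

    # Round-robin: each pass takes the next remaining sample from every bin.
    out = []
    for layer in zip_longest(*bins.values()):
        out.extend(i for i in layer if i is not None)
    return out


scores = [0.0, 1.0, 2.0, 7.0, 8.0, 9.0]
easy_first = [0, 1, 2, 3, 4, 5]
print(stratified_rerank_sketch(easy_first, scores, num_bins=2))
# [5, 2, 4, 1, 3, 0]
```

With two bins, the plain hard-first order [5, 4, 3, 2, 1, 0] would exhaust the high-score bin before touching the low-score one. The interleaved order alternates between the bins, so any prefix of the result draws from both score ranges.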

Parameters:
result : RankResult

Ranking result with scores (must be from rank_knn or rank_kmeans_distance).

num_bins : int, default 50

Number of bins for stratification.

Returns:

Dictionary containing:

  • indices: NDArray[np.intp] - Reranked indices in hard_first order

  • scores: NDArray[np.float32] | None - Scores in original order (unchanged)

  • method: str - Same as input

  • policy: str - "stratified"

Return type:

RankResult

Raises:

ValueError – If result does not contain scores (e.g., from rank_kmeans_complexity).

Examples

>>> import numpy as np
>>> from dataeval.core import rank_knn, rerank_stratified
>>> embeddings = np.random.rand(100, 64).astype(np.float32)
>>> # Rank with a method that produces scores, then rebalance across 20 score bins.
>>> result = rank_knn(embeddings, k=5)
>>> reranked = rerank_stratified(result, num_bins=20)