dataeval.quality.Prioritize

class dataeval.quality.Prioritize(encoder=None, method=None, k=None, c=None, n_init=None, policy=None, num_bins=None, class_labels=None, reference=None, config=None)

Prioritize dataset samples based on their position in the embedding space.

This class provides factory methods for common configurations and supports both direct instantiation and fluent policy configuration.

Parameters:
encoder : EmbeddingEncoder

Encoder to use for extracting embeddings from data.

method : {"knn", "kmeans_distance", "kmeans_complexity"}, default "knn"

Ranking method to use:

  • “knn”: K-nearest neighbors distance ranking

  • “kmeans_distance”: Distance to assigned cluster center

  • “kmeans_complexity”: Weighted sampling based on cluster structure

k : int or None, default None

Number of nearest neighbors for “knn” method. If None, uses sqrt(n_samples).

c : int or None, default None

Number of clusters for kmeans methods. If None, uses sqrt(n_samples).

n_init : int or "auto", default "auto"

Number of K-means initializations for kmeans methods.

policy : {"hard_first", "easy_first", "stratified", "class_balance"}, default "hard_first"

Selection policy:

  • “hard_first”: Challenging samples first (high distance)

  • “easy_first”: Prototypical samples first (low distance)

  • “stratified”: Balanced selection across difficulty bins

  • “class_balance”: Balanced selection across class labels

num_bins : int, default 50

Number of bins for “stratified” policy.

class_labels : NDArray[np.integer] or None, default None

Class labels for the “class_balance” policy. If None, labels are extracted from AnnotatedDataset metadata during evaluate().

reference : AnnotatedDataset or Embeddings or None, default None

Optional reference dataset or pre-computed embeddings. When provided, incoming datasets will be prioritized relative to this reference set. Useful for active learning (reference = labeled data) or quality filtering (reference = high-quality corpus).

config : Prioritize.Config or None, default None

Optional configuration object with default parameters. Parameters specified directly in __init__ will override config defaults.
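The override rule can be pictured with a small stand-alone sketch. It does not use dataeval: SketchConfig and resolve are hypothetical names, and treating None as "not specified" is an assumption based on the None defaults in the constructor signature.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for Prioritize.Config (not the real class)."""

    method: str = "knn"
    k: Optional[int] = None
    policy: str = "hard_first"
    num_bins: int = 50


def resolve(config: Optional[SketchConfig] = None, **explicit: Any) -> SketchConfig:
    """Start from the config (or built-in defaults), then apply explicit arguments."""
    base = config if config is not None else SketchConfig()
    merged = {f.name: getattr(base, f.name) for f in fields(base)}
    for name, value in explicit.items():
        if name not in merged:
            raise TypeError(f"unknown parameter: {name}")
        if value is not None:  # assumption: None means "not specified"
            merged[name] = value
    return SketchConfig(**merged)


cfg = SketchConfig(method="kmeans_distance", num_bins=20)
resolved = resolve(cfg, policy="stratified", k=None)
print(resolved)
```

Here policy comes from the explicit argument, while method and num_bins fall through from the config object.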

Examples

Using factory methods (recommended):

>>> from dataeval.quality import Prioritize
>>>
>>> # KNN with hard samples first (default policy)
>>> result = Prioritize.knn(encoder, k=10).evaluate(dataset)
>>>
>>> # KNN with easy samples first
>>> result = Prioritize.knn(encoder, k=10).easy_first().evaluate(dataset)
>>>
>>> # K-means distance with stratified sampling
>>> result = Prioritize.kmeans_distance(encoder, c=15).stratified(num_bins=20).evaluate(dataset)
>>>
>>> # K-means complexity with class-balanced selection
>>> result = Prioritize.kmeans_complexity(encoder, c=10).class_balanced().evaluate(labeled_data)

Direct instantiation:

>>> prioritizer = Prioritize(
...     encoder=encoder,
...     method="knn",
...     k=10,
...     policy="stratified",
...     num_bins=20,
... )
>>> result = prioritizer.evaluate(dataset)

Active learning with reference data:

>>> prioritizer = Prioritize.knn(encoder, k=10, reference=labeled_data)
>>> result = prioritizer.hard_first().evaluate(unlabeled_data)

Using configuration:

>>> config = Prioritize.Config(encoder=encoder, method="knn", k=10)
>>> prioritizer = Prioritize(config=config)

class_balanced(class_labels=None)

Return a new instance configured with the class_balance policy.

Ensures balanced representation across class labels while maintaining priority order within each class.
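One way to read this behavior is as a round-robin over classes applied to an existing priority order. The sketch below is stand-alone and illustrative only: class_balanced_order is a hypothetical helper, and round-robin interleaving is an assumption about how the balance is achieved, not a description of the library's internals.

```python
from collections import defaultdict
from itertools import zip_longest


def class_balanced_order(priority_order, class_labels):
    """Interleave classes round-robin, keeping priority order inside each class."""
    per_class = defaultdict(list)
    for idx in priority_order:
        per_class[class_labels[idx]].append(idx)

    missing = object()  # marks exhausted classes during interleaving
    balanced = []
    queues = [per_class[label] for label in sorted(per_class)]
    for group in zip_longest(*queues, fillvalue=missing):
        balanced.extend(idx for idx in group if idx is not missing)
    return balanced


priority_order = [4, 0, 3, 1, 2, 5]  # sample indices, highest priority first
class_labels = [0, 0, 0, 1, 1, 0]    # label of sample i at position i
balanced = class_balanced_order(priority_order, class_labels)
print(balanced)  # [0, 4, 1, 3, 2, 5]
```

Class 1 contributes samples 4 and 3 in their original priority order, alternating with class 0 until class 1 runs out.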

Parameters:
class_labels : NDArray[np.integer] or None, default None

Class labels for each sample. If None, will be extracted from AnnotatedDataset metadata during evaluate().

Returns:

New instance with policy set to “class_balance”.

Return type:

Prioritize

Examples

>>> result = Prioritize.knn(encoder, k=5).class_balanced(class_labels).evaluate(dataset)

With AnnotatedDataset (labels extracted automatically):

>>> result = Prioritize.knn(encoder, k=5).class_balanced().evaluate(labeled_data)

easy_first()

Return a new instance configured with the easy_first policy.

Selects easy/prototypical samples first (ascending order of difficulty).

Returns:

New instance with policy set to “easy_first”.

Return type:

Prioritize

Examples

>>> result = Prioritize.knn(encoder, k=5).easy_first().evaluate(dataset)

evaluate(dataset)

Evaluate the dataset and return prioritized indices.

Uses the configured method and policy to rank samples.

Parameters:
dataset : AnnotatedDataset[Any] | Embeddings

The incoming dataset to prioritize. Can be either:

  • AnnotatedDataset: Will compute embeddings using the encoder

  • Embeddings: Pre-computed embeddings

Returns:

Output containing prioritized indices, scores (if available), and configuration information.

Return type:

PrioritizeOutput

Raises:
  • ValueError – If class_labels is None when using the class_balance policy with Embeddings, or if the stratified policy is used with the kmeans_complexity method.

  • TypeError – If dataset is neither an AnnotatedDataset nor Embeddings.
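The two ValueError conditions can be checked before any embeddings are computed. The following is a stand-alone sketch of such a pre-flight check; check_policy and describe are hypothetical helpers, and the error messages are invented for illustration rather than copied from the library.

```python
def check_policy(method, policy, has_labels):
    """Hypothetical pre-flight check mirroring the documented ValueError cases."""
    if policy == "stratified" and method == "kmeans_complexity":
        # kmeans_complexity produces no scores, so there is nothing to bin.
        raise ValueError("stratified policy requires a method that produces scores")
    if policy == "class_balance" and not has_labels:
        # Bare embeddings carry no metadata to extract labels from.
        raise ValueError("class_balance policy requires class_labels")


def describe(method, policy, has_labels):
    """Return 'ok' or the error message for a given configuration."""
    try:
        check_policy(method, policy, has_labels)
    except ValueError as exc:
        return str(exc)
    return "ok"


print(describe("knn", "stratified", has_labels=False))
print(describe("kmeans_complexity", "stratified", has_labels=True))
print(describe("knn", "class_balance", has_labels=False))
```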

Examples

Using factory methods:

>>> result = Prioritize.knn(encoder, k=5).hard_first().evaluate(dataset)

Using direct instantiation:

>>> prioritizer = Prioritize(encoder=encoder, method="knn", k=5, policy="hard_first")
>>> result = prioritizer.evaluate(dataset)

hard_first()

Return a new instance configured with the hard_first policy.

Selects hard/challenging samples first (descending order of difficulty).
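Given per-sample difficulty scores, the ordering amounts to a sort. This stand-alone sketch assumes that a higher score means a harder sample, which matches the "high distance" wording of the policy description. order_by_difficulty is a hypothetical helper, not part of the library.

```python
def order_by_difficulty(scores, policy="hard_first"):
    """Return sample indices sorted by score: descending for hard_first, ascending for easy_first."""
    if policy not in ("hard_first", "easy_first"):
        raise ValueError(f"unsupported policy: {policy}")
    return sorted(
        range(len(scores)),
        key=lambda i: scores[i],
        reverse=(policy == "hard_first"),
    )


scores = [0.2, 1.5, 0.7, 0.1]
hard = order_by_difficulty(scores)                # [1, 2, 0, 3]
easy = order_by_difficulty(scores, "easy_first")  # [3, 0, 2, 1]
print(hard, easy)
```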

Returns:

New instance with policy set to “hard_first”.

Return type:

Prioritize

Examples

>>> result = Prioritize.knn(encoder, k=5).hard_first().evaluate(dataset)

classmethod kmeans_complexity(encoder, c=None, n_init='auto', reference=None)

Create a Prioritize instance using the K-means complexity method.

Uses weighted sampling based on intra/inter-cluster distances.

Note: This method does not produce scores, so the “stratified” policy is not available.

Parameters:
encoder : EmbeddingEncoder

Encoder to use for extracting embeddings from data.

c : int or None, default None

Number of clusters. If None, uses sqrt(n_samples).

n_init : int or "auto", default "auto"

Number of K-means initializations.

reference : AnnotatedDataset or Embeddings or None, default None

Optional reference dataset for relative prioritization.

Returns:

Configured instance ready for policy selection and evaluation.

Return type:

Prioritize

Examples

>>> result = Prioritize.kmeans_complexity(encoder, c=10).hard_first().evaluate(dataset)

classmethod kmeans_distance(encoder, c=None, n_init='auto', reference=None)

Create a Prioritize instance using the K-means distance method.

Ranks samples by distance to their assigned cluster centers.
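The score itself is simple once cluster centers exist. The stand-alone sketch below skips the K-means fit and takes the centers as given. It also assumes Euclidean distance and that each sample is assigned to its nearest center. distance_to_assigned_center is a hypothetical helper.

```python
import math


def distance_to_assigned_center(embeddings, centers):
    """Score each sample by the distance to its nearest (assigned) cluster center."""
    scores = []
    for point in embeddings:
        # The assigned cluster is the closest center, so the score is the minimum distance.
        scores.append(min(math.dist(point, center) for center in centers))
    return scores


embeddings = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (13.0, 14.0)]
centers = [(0.0, 0.5), (10.0, 10.0)]  # assumed to come from a prior K-means fit
scores = distance_to_assigned_center(embeddings, centers)
print(scores)
```

The last sample sits far from both centers, so it receives the largest score and would be ranked first under hard_first.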

Parameters:
encoder : EmbeddingEncoder

Encoder to use for extracting embeddings from data.

c : int or None, default None

Number of clusters. If None, uses sqrt(n_samples).

n_init : int or "auto", default "auto"

Number of K-means initializations.

reference : AnnotatedDataset or Embeddings or None, default None

Optional reference dataset for relative prioritization.

Returns:

Configured instance ready for policy selection and evaluation.

Return type:

Prioritize

Examples

>>> result = Prioritize.kmeans_distance(encoder, c=15).stratified().evaluate(dataset)

classmethod knn(encoder, k=None, reference=None)

Create a Prioritize instance using the k-nearest neighbors method.
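As a rough illustration of KNN distance scoring, the stand-alone sketch below scores each sample by its mean Euclidean distance to its k nearest neighbors, optionally measured against a reference set. Several details are assumptions made for the sketch: the mean as the aggregation, truncating sqrt(n_samples) to an integer, and taking n_samples from the neighbor pool. knn_scores is a hypothetical helper, not the library's implementation.

```python
import math


def knn_scores(embeddings, k=None, reference=None):
    """Score each sample by its mean distance to the k nearest neighbors."""
    pool = embeddings if reference is None else reference
    if k is None:
        k = max(1, int(math.sqrt(len(pool))))  # documented fallback: sqrt(n_samples)
    scores = []
    for i, point in enumerate(embeddings):
        dists = sorted(
            math.dist(point, other)
            for j, other in enumerate(pool)
            # Without a reference set, a sample must not count as its own neighbor.
            if reference is not None or j != i
        )
        nearest = dists[:k]
        if not nearest:
            raise ValueError("need at least one neighbor to score a sample")
        scores.append(sum(nearest) / len(nearest))
    return scores


embeddings = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
scores = knn_scores(embeddings, k=2)
hardest = max(range(len(scores)), key=scores.__getitem__)  # most isolated sample

# Scoring an incoming sample relative to a separate reference set.
relative = knn_scores([(0.0, 0.0)], k=1, reference=[(3.0, 4.0), (6.0, 8.0)])
print(scores, hardest, relative)
```

The isolated point at (10, 10) gets the highest score, so it would lead a hard_first ordering.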

Parameters:
encoder : EmbeddingEncoder

Encoder to use for extracting embeddings from data.

k : int or None, default None

Number of nearest neighbors. If None, uses sqrt(n_samples).

reference : AnnotatedDataset or Embeddings or None, default None

Optional reference dataset for relative prioritization.

Returns:

Configured instance ready for policy selection and evaluation.

Return type:

Prioritize

Examples

>>> result = Prioritize.knn(encoder, k=10).hard_first().evaluate(dataset)
>>> result = Prioritize.knn(encoder, k=5).easy_first().evaluate(dataset)

stratified(num_bins=DEFAULT_PRIORITIZE_NUM_BINS)

Return a new instance configured with the stratified policy.

Balances selection across different difficulty levels by binning scores and sampling uniformly from bins.

Note: Only available with methods that produce scores (“knn”, “kmeans_distance”).
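The binning idea can be sketched without the library. The version below uses equal-width bins over the score range and a deterministic round-robin across bins in place of random sampling. Both choices are assumptions made to keep the example reproducible, and stratified_order is a hypothetical helper.

```python
def stratified_order(scores, num_bins=50):
    """Bin samples by score, then take one sample per non-empty bin in turn."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 when all scores are equal
    bins = [[] for _ in range(num_bins)]
    for idx, score in enumerate(scores):
        # Clamp so the maximum score lands in the last bin.
        b = min(int((score - lo) / width), num_bins - 1)
        bins[b].append(idx)

    order = []
    while any(bins):
        for members in bins:
            if members:
                order.append(members.pop(0))
    return order


scores = [0.1, 0.2, 0.3, 5.0, 5.1, 9.9]
order = stratified_order(scores, num_bins=3)
print(order)  # [0, 3, 5, 1, 4, 2]
```

Each pass draws one sample from the low, middle, and high bins, so early selections cover the whole difficulty range instead of clustering at one end.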

Parameters:
num_bins : int, default 50

Number of bins for stratification.

Returns:

New instance with policy set to “stratified”.

Return type:

Prioritize

Examples

>>> result = Prioritize.knn(encoder, k=5).stratified(num_bins=20).evaluate(dataset)

Classes

Config

Configuration for the Prioritize evaluator.