dataeval.quality.Prioritize

class dataeval.quality.Prioritize(model, reference=None, batch_size=None, device=None)

Prioritize dataset samples based on their position in the embedding space.

This class uses a builder pattern: configure a ranking method and a selection policy, then call evaluate() on a dataset to produce a prioritized ordering of its samples.

Parameters:
model : EmbeddingModel

Model to use for encoding data.

reference : AnnotatedDataset[Any] | Embeddings | None, default None

Optional reference dataset or pre-computed embeddings. When provided, incoming datasets will be prioritized relative to this reference set. Useful for active learning (reference = labeled data) or quality filtering (reference = high-quality corpus).

batch_size : int | None, default None

Default batch size to use when encoding data. Can be overridden in evaluate().

device : DeviceLike | None, default None

Default device to use for encoding data. Can be overridden in evaluate().

Examples

Basic prioritization using builder pattern:

>>> from dataeval.quality import Prioritize
>>> prioritizer = Prioritize(model)
>>>
>>> # Configure method and policy, then evaluate
>>> result = prioritizer.with_knn(k=10).hard_first().evaluate(unlabeled_data)

Different policies:

>>> # Easy samples first
>>> result = prioritizer.with_knn(k=5).easy_first().evaluate(unlabeled_data)
>>>
>>> # Stratified sampling
>>> result = prioritizer.with_knn(k=5).stratified(num_bins=20).evaluate(unlabeled_data)
>>>
>>> # Class-balanced selection
>>> result = prioritizer.with_kmeans_distance(c=10).class_balanced(class_labels).evaluate(unlabeled_data)

Reconfigure and reuse:

>>> # Can reconfigure the same instance
>>> result = prioritizer.with_kmeans_complexity(c=15).easy_first().evaluate(unlabeled_data)

Active learning with reference:

>>> # Initialize with labeled data as reference
>>> prioritizer = Prioritize(model, reference=labeled_data)
>>> # Prioritize the incoming unlabeled data relative to the labeled reference
>>> result = prioritizer.with_knn(k=10).hard_first().evaluate(unlabeled_data)
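
As a rough illustration of what prioritizing relative to a reference set can mean, the sketch below scores each incoming embedding by its mean Euclidean distance to the k nearest reference embeddings. This is an assumption made for illustration, not a description of how Prioritize computes scores internally; `scores_against_reference` and the toy 2-D points are hypothetical.

```python
import math

def scores_against_reference(incoming, reference, k=1):
    """Mean distance from each incoming point to its k nearest reference points.

    Assumes `reference` is non-empty.
    """
    k = min(k, len(reference))
    scores = []
    for p in incoming:
        dists = sorted(math.dist(p, r) for r in reference)
        scores.append(sum(dists[:k]) / k)
    return scores

# Toy embeddings: the reference (e.g. labeled) set sits near the origin.
reference = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
incoming = [(0.0, 0.5), (3.0, 4.0)]

scores = scores_against_reference(incoming, reference, k=1)

# The second incoming point is far from every reference point,
# so a hard-first ordering places it first.
hard_first = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Under this assumption, samples far from the labeled reference are the ones the reference covers least, which is why a hard-first ordering is a natural fit for active learning.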

class_balanced(class_labels=None)

Configure class-balanced selection policy.

Ensures balanced representation across class labels while maintaining priority order within each class.

Parameters:
class_labels : NDArray[np.integer] | None, default None

Class labels for each sample in the dataset. If None, labels are extracted from the AnnotatedDataset metadata during evaluate(). Labels must be supplied explicitly when evaluating pre-computed Embeddings.

Returns:

Self for method chaining.

Return type:

Prioritize

Examples

>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_knn(k=5).class_balanced(class_labels).evaluate(unlabeled_data)

With AnnotatedDataset (labels extracted automatically):

>>> result = prioritizer.with_knn(k=5).class_balanced().evaluate(labeled_data)
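
One way to picture "balanced representation while maintaining priority order within each class" is a round-robin over per-class queues. The sketch below is a hypothetical illustration of that idea in plain Python, not the library's implementation; `class_balanced_order` is an assumed helper name.

```python
from collections import defaultdict

def class_balanced_order(priority_order, class_labels):
    """Interleave classes round-robin, keeping priority order inside each class."""
    queues = defaultdict(list)
    for idx in priority_order:
        queues[class_labels[idx]].append(idx)

    # Visit classes in sorted label order; every queue starts non-empty.
    pending = [queues[c] for c in sorted(queues)]
    order = []
    while pending:
        for q in pending:
            order.append(q.pop(0))
        pending = [q for q in pending if q]
    return order

# Indices already ranked by some method (best first), plus one label per sample.
priority_order = [4, 0, 3, 1, 2, 5]
class_labels = [0, 0, 0, 1, 1, 0]

balanced = class_balanced_order(priority_order, class_labels)
```

Once the smaller class is exhausted, the remaining samples of the larger class follow in their original priority order.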

easy_first()

Configure policy to select easy/prototypical samples first.

Returns samples in ascending order of difficulty (low distance = easy).

Returns:

Self for method chaining.

Return type:

Prioritize

Examples

>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_knn(k=5).easy_first().evaluate(unlabeled_data)
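
The two ordering policies differ only in sort direction. A minimal sketch, assuming each sample already has a distance-based score (the `rank_by_score` helper and the score values are hypothetical):

```python
def rank_by_score(scores, hard_first=False):
    """Return sample indices sorted by score; ties keep their original order."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=hard_first)

scores = [0.4, 2.5, 0.1, 1.3]  # low distance = easy, high distance = hard

easy_order = rank_by_score(scores)                   # ascending difficulty
hard_order = rank_by_score(scores, hard_first=True)  # descending difficulty
```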

evaluate(dataset, batch_size=None, device=None)

Evaluate the dataset and return prioritized indices.

Uses the configured method and policy to rank samples. Method and policy must be configured using the builder methods (with_*, easy_first, etc.) before calling evaluate().

Parameters:
dataset : AnnotatedDataset[Any] | Embeddings

The incoming dataset to prioritize. Can be either:

  • AnnotatedDataset: Will compute embeddings using the model

  • Embeddings: Pre-computed embeddings

batch_size : int | None, default None

Batch size for encoding the incoming dataset. If None, uses the value from __init__. Only used when dataset is an AnnotatedDataset.

device : DeviceLike | None, default None

Device for encoding the incoming dataset. If None, uses the value from __init__. Only used when dataset is an AnnotatedDataset.

Returns:

Output containing prioritized indices, scores (if available), and configuration information.

Return type:

PrioritizeOutput

Raises:
  • ValueError – If the method or policy has not been configured; if class_labels is None when the class_balanced policy is used with Embeddings; or if the stratified policy is used with the kmeans_complexity method.

  • TypeError – If dataset is neither an AnnotatedDataset nor Embeddings.

Examples

Basic usage:

>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_knn(k=5).hard_first().evaluate(labeled_data)

Override encoding parameters:

>>> result = prioritizer.with_knn(k=10).easy_first().evaluate(labeled_data, batch_size=64)

Reconfigure and evaluate a different dataset:

>>> result2 = prioritizer.with_kmeans_distance(c=15).stratified().evaluate(reference_data)
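
The ValueError conditions listed above can be summarized as a small validation routine. This is a hypothetical sketch of the documented rules only; the string names for methods, policies, and input kinds are assumptions and do not correspond to any internal DataEval API.

```python
SCORELESS_METHODS = {"kmeans_complexity"}  # documented as producing no scores

def check_configuration(method, policy, input_kind, class_labels=None):
    """Raise ValueError for the combinations the documentation rules out."""
    if method is None or policy is None:
        raise ValueError("Configure a method and a policy before evaluating.")
    if policy == "stratified" and method in SCORELESS_METHODS:
        raise ValueError("stratified needs a method that produces scores.")
    if policy == "class_balanced" and input_kind == "embeddings" and class_labels is None:
        raise ValueError("class_labels is required when evaluating Embeddings.")

def is_valid(*args, **kwargs):
    """Convenience wrapper: True if the configuration passes the checks."""
    try:
        check_configuration(*args, **kwargs)
    except ValueError:
        return False
    return True

ok = is_valid("knn", "hard_first", "dataset")
missing_method = is_valid(None, "hard_first", "dataset")
scoreless_stratified = is_valid("kmeans_complexity", "stratified", "dataset")
unlabeled_embeddings = is_valid("knn", "class_balanced", "embeddings")
labeled_embeddings = is_valid("knn", "class_balanced", "embeddings", class_labels=[0, 1])
```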

hard_first()

Configure policy to select hard/challenging samples first.

Returns samples in descending order of difficulty (high distance = hard).

Returns:

Self for method chaining.

Return type:

Prioritize

Examples

>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_knn(k=5).hard_first().evaluate(unlabeled_data)

stratified(num_bins=50)

Configure stratified sampling policy across score bins.

Balances selection across different difficulty levels by binning scores and sampling uniformly from bins. Encourages diversity.

Note: Only available with methods that produce scores (with_knn, with_kmeans_distance). Combining this policy with with_kmeans_complexity raises ValueError in evaluate().

Parameters:
num_bins : int, default 50

Number of bins for stratification.

Returns:

Self for method chaining.

Return type:

Prioritize

Examples

>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_knn(k=5).stratified(num_bins=20).evaluate(unlabeled_data)
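
To make the binning idea concrete, the sketch below splits scores into equal-width bins and then draws one sample per bin in turn. It is a deterministic stand-in under stated assumptions: the library may bin differently or sample randomly within bins, and `stratified_order` is a hypothetical helper.

```python
def stratified_order(scores, num_bins=50):
    """Bin scores into equal-width bins, then take one index per bin in turn."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / num_bins or 1.0  # avoid zero width when all scores match

    bins = [[] for _ in range(num_bins)]
    for i in sorted(range(len(scores)), key=lambda i: scores[i]):
        b = min(int((scores[i] - lo) / width), num_bins - 1)
        bins[b].append(i)

    # Round-robin across bins so every difficulty level appears early.
    order = []
    while any(bins):
        for b in bins:
            if b:
                order.append(b.pop(0))
    return order

scores = [0.1, 0.2, 0.3, 5.0, 5.1, 9.9]
order = stratified_order(scores, num_bins=3)
```

With three bins, the first three selected samples cover the low, middle, and high score ranges before any bin is visited a second time.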

with_kmeans_complexity(c=None, n_init='auto')

Configure K-means complexity ranking method.

Uses weighted sampling based on intra-cluster and inter-cluster distances. Note: This method does not produce scores, so the stratified() policy is not available with it.

Parameters:
c : int | None, default None

Number of clusters. If None, uses sqrt(n_samples).

n_init : int | "auto", default "auto"

Number of K-means initializations.

Returns:

Self for method chaining.

Return type:

Prioritize

Examples

>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_kmeans_complexity(c=15).hard_first().evaluate(unlabeled_data)

with_kmeans_distance(c=None, n_init='auto')

Configure K-means distance ranking method.

Ranks samples by distance to assigned cluster centers.

Parameters:
c : int | None, default None

Number of clusters. If None, uses sqrt(n_samples).

n_init : int | "auto", default "auto"

Number of K-means initializations.

Returns:

Self for method chaining.

Return type:

Prioritize

Examples

>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_kmeans_distance(c=10).easy_first().evaluate(unlabeled_data)
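
Ranking by distance to the assigned cluster center can be sketched as follows. The cluster centers are hard-coded here as if K-means had already been fitted; the fitting step is omitted, and `center_distance_scores` is a hypothetical helper rather than part of DataEval.

```python
import math

def center_distance_scores(points, centers):
    """Score each point by its distance to the nearest (assigned) center."""
    return [min(math.dist(p, c) for c in centers) for p in points]

# Assumed output of a prior K-means fit with c=2 clusters.
centers = [(0.0, 1.0), (10.0, 10.0)]
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (13.0, 14.0)]

scores = center_distance_scores(points, centers)

# easy_first: prototypical samples (closest to a center) come first.
easy_order = sorted(range(len(scores)), key=lambda i: scores[i])
```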

with_knn(k=None)

Configure k-nearest neighbors ranking method.

Parameters:
k : int | None, default None

Number of nearest neighbors. If None, uses sqrt(n_samples).

Returns:

Self for method chaining.

Return type:

Prioritize

Examples

>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_knn(k=5).hard_first().evaluate(unlabeled_data)
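
A minimal sketch of a k-nearest-neighbor score, assuming the score is the mean Euclidean distance to the k nearest other samples and that the default k is sqrt(n_samples) rounded down. Both choices are assumptions for illustration; the exact scoring and rounding used by the library are not stated here, and `knn_scores` is a hypothetical helper.

```python
import math

def knn_scores(points, k=None):
    """Mean distance from each point to its k nearest neighbors (needs >= 2 points)."""
    n = len(points)
    if n < 2:
        raise ValueError("At least two points are required.")
    if k is None:
        # Assumption: sqrt(n_samples) rounded down, with a floor of 1.
        k = max(1, int(math.sqrt(n)))
    k = min(k, n - 1)

    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Four points on a unit square plus one distant outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]

scores = knn_scores(points, k=2)
default_k_scores = knn_scores(points)  # k defaults to int(sqrt(5)) == 2
hardest = max(range(len(scores)), key=lambda i: scores[i])
```

The outlier has the largest score, so a hard-first policy would rank it first and an easy-first policy would rank it last.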