dataeval.quality.Prioritize
- class dataeval.quality.Prioritize(model, reference=None, batch_size=None, device=None)
Prioritize dataset samples based on their position in the embedding space.
This class uses a builder pattern to configure ranking method and policy, then evaluates datasets to produce prioritized sample orderings.
- Parameters:
- model : EmbeddingModel
Model to use for encoding data.
- reference : AnnotatedDataset[Any] | Embeddings | None, default None
Optional reference dataset or pre-computed embeddings. When provided, incoming datasets will be prioritized relative to this reference set. Useful for active learning (reference = labeled data) or quality filtering (reference = high-quality corpus).
- batch_size : int | None, default None
Default batch size to use when encoding data. Can be overridden in evaluate().
- device : DeviceLike | None, default None
Default device to use for encoding data. Can be overridden in evaluate().
See also
Outliers, Indices, dataeval.core.rank_knn(), dataeval.core.rank_kmeans_distance(), dataeval.core.rank_kmeans_complexity()
Examples
Basic prioritization using builder pattern:
>>> from dataeval.quality import Prioritize
>>> prioritizer = Prioritize(model)
>>>
>>> # Configure method and policy, then evaluate
>>> result = prioritizer.with_knn(k=10).hard_first().evaluate(unlabeled_data)
Different policies:
>>> # Easy samples first
>>> result = prioritizer.with_knn(k=5).easy_first().evaluate(unlabeled_data)
>>>
>>> # Stratified sampling
>>> result = prioritizer.with_knn(k=5).stratified(num_bins=20).evaluate(unlabeled_data)
>>>
>>> # Class-balanced selection
>>> result = prioritizer.with_kmeans_distance(c=10).class_balanced(class_labels).evaluate(unlabeled_data)
Reconfigure and reuse:
>>> # Can reconfigure the same instance
>>> result = prioritizer.with_kmeans_complexity(c=15).easy_first().evaluate(unlabeled_data)
Active learning with reference:
>>> # Initialize with labeled data as reference
>>> prioritizer = Prioritize(model, reference=labeled_data)
>>> result = prioritizer.with_knn(k=10).hard_first().evaluate(reference_data)
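To make "prioritized relative to this reference set" concrete, here is a minimal standard-library sketch. It is not DataEval's implementation: the helper names are hypothetical, and scoring each incoming embedding by its Euclidean distance to the closest reference embedding is an assumption chosen for illustration.

```python
import math

def nearest_reference_distance(sample, reference):
    """Euclidean distance from one embedding to its closest reference embedding."""
    return min(math.dist(sample, ref) for ref in reference)

def rank_against_reference(incoming, reference):
    """Return (order, scores); order lists incoming indices, farthest from the reference first."""
    scores = [nearest_reference_distance(x, reference) for x in incoming]
    order = sorted(range(len(incoming)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy 2-D embeddings standing in for encoded labeled / unlabeled data (assumed values).
reference = [(0.0, 0.0), (1.0, 0.0)]
incoming = [(0.1, 0.0), (5.0, 5.0), (1.0, 1.0)]

order, scores = rank_against_reference(incoming, reference)
print(order)  # indices of incoming samples, most novel first
```

In an active-learning loop, the samples at the front of such an ordering are the ones least like the data that is already labeled.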
- class_balanced(class_labels=None)
Configure class-balanced selection policy.
Ensures balanced representation across class labels while maintaining priority order within each class.
- Parameters:
- class_labels : NDArray[np.integer] | None, default None
Class labels for each sample in the dataset. If None, will be extracted from AnnotatedDataset metadata during evaluate().
- Returns:
Self for method chaining.
- Return type:
Examples
>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_knn(k=5).class_balanced(class_labels).evaluate(unlabeled_data)
With AnnotatedDataset (labels extracted automatically):
>>> result = prioritizer.with_knn(k=5).class_balanced().evaluate(labeled_data)
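One way to read "balanced representation while maintaining priority order within each class" is a round-robin over classes. The sketch below illustrates that reading only; `class_balanced_order` is a hypothetical helper, and the library may balance classes differently.

```python
from collections import defaultdict, deque

def class_balanced_order(priority_order, class_labels):
    """Interleave classes round-robin, keeping the incoming priority order inside each class."""
    queues = defaultdict(deque)
    for idx in priority_order:
        queues[class_labels[idx]].append(idx)
    classes = sorted(queues)
    result = []
    # Take one sample per class per pass until every queue is empty.
    while any(queues[c] for c in classes):
        for c in classes:
            if queues[c]:
                result.append(queues[c].popleft())
    return result

priority_order = [4, 0, 3, 1, 2]   # sample indices, highest priority first (assumed input)
class_labels = [0, 0, 0, 1, 1]     # class label of each sample, by index

balanced = class_balanced_order(priority_order, class_labels)
print(balanced)
```

Every prefix of the result contains a near-equal number of samples per class, so truncating the list at any budget still yields a roughly balanced selection.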
- easy_first()
Configure policy to select easy/prototypical samples first.
Returns samples in ascending order of difficulty (low distance = easy).
Examples
>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_knn(k=5).easy_first().evaluate(unlabeled_data)
- evaluate(dataset, batch_size=None, device=None)
Evaluate the dataset and return prioritized indices.
Uses the configured method and policy to rank samples. Method and policy must be configured using the builder methods (with_*, easy_first, etc.) before calling evaluate().
- Parameters:
- dataset : AnnotatedDataset[Any] | Embeddings
The incoming dataset to prioritize. Can be either:
AnnotatedDataset: Will compute embeddings using the model
Embeddings: Pre-computed embeddings
- batch_size : int | None, default None
Batch size for encoding the incoming dataset. If None, uses the value from __init__. Only used when dataset is an AnnotatedDataset.
- device : DeviceLike | None, default None
Device for encoding the incoming dataset. If None, uses the value from __init__. Only used when dataset is an AnnotatedDataset.
- Returns:
Output containing prioritized indices, scores (if available), and configuration information.
- Return type:
- Raises:
ValueError – If the method or policy is not configured; if class_labels is None when using the class_balanced policy with Embeddings; or if the stratified policy is used with the kmeans_complexity method.
TypeError – If dataset is neither an AnnotatedDataset nor Embeddings.
Examples
Basic usage:
>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_knn(k=5).hard_first().evaluate(labeled_data)
Override encoding parameters:
>>> result = prioritizer.with_knn(k=10).easy_first().evaluate(labeled_data, batch_size=64)
Reconfigure and evaluate a different dataset:
>>> result2 = prioritizer.with_kmeans_distance(c=15).stratified().evaluate(reference_data)
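The ValueError conditions listed for evaluate() amount to a small compatibility check between method, policy, and input type. The sketch below restates them as standalone code; `check_configuration`, its string arguments, and its error messages are hypothetical and are not part of the DataEval API.

```python
# Methods that, per the notes in this reference, do not produce per-sample scores.
SCORELESS_METHODS = {"kmeans_complexity"}

def check_configuration(method, policy, class_labels_given, dataset_is_embeddings):
    """Raise ValueError for the method/policy combinations the documentation rules out."""
    if method is None or policy is None:
        raise ValueError("both a method and a policy must be configured before evaluate()")
    if policy == "stratified" and method in SCORELESS_METHODS:
        raise ValueError("stratified needs scores, which kmeans_complexity does not produce")
    if policy == "class_balanced" and dataset_is_embeddings and not class_labels_given:
        raise ValueError("class_balanced on Embeddings requires explicit class_labels")

# A valid combination passes silently.
check_configuration("knn", "stratified", class_labels_given=False, dataset_is_embeddings=True)

# An invalid combination is rejected.
try:
    check_configuration("kmeans_complexity", "stratified", False, True)
except ValueError as err:
    print(f"rejected: {err}")
```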
- hard_first()
Configure policy to select hard/challenging samples first.
Returns samples in descending order of difficulty (high distance = hard).
Examples
>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_knn(k=5).hard_first().evaluate(unlabeled_data)
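Given per-sample distance scores, the easy_first and hard_first policies differ only in sort direction. A minimal sketch, assuming the scores are already computed; `order_by_score` is a hypothetical helper, not a library function.

```python
def order_by_score(scores, hard_first=False):
    """Return sample indices sorted by score: ascending (easy first) or descending (hard first)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=hard_first)

scores = [0.4, 2.5, 0.1, 1.3]  # assumed distance scores; low = prototypical, high = unusual

easy = order_by_score(scores)                   # lowest distance first
hard = order_by_score(scores, hard_first=True)  # highest distance first
print(easy, hard)
```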
- stratified(num_bins=50)
Configure stratified sampling policy across score bins.
Balances selection across different difficulty levels by binning scores and sampling uniformly from bins. Encourages diversity.
Note: Only available with methods that produce scores (knn, kmeans_distance).
- Parameters:
- num_bins : int, default 50
Number of bins for stratification.
- Returns:
Self for method chaining.
- Return type:
Examples
>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_knn(k=5).stratified(num_bins=20).evaluate(unlabeled_data)
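The binning idea can be sketched as follows. This is an illustration under stated assumptions, not the library's algorithm: it uses equal-width bins and a deterministic round-robin across bins in place of random uniform sampling, and `stratified_order` is a hypothetical name.

```python
def stratified_order(scores, num_bins=50):
    """Bin scores into equal-width bins, then draw one index per bin in turn."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 when all scores are equal
    bins = [[] for _ in range(num_bins)]
    for idx, score in enumerate(scores):
        # Clamp so the maximum score lands in the last bin.
        b = min(int((score - lo) / width), num_bins - 1)
        bins[b].append(idx)
    result = []
    depth = 0
    while len(result) < len(scores):
        for members in bins:
            if depth < len(members):
                result.append(members[depth])
        depth += 1
    return result

scores = [0.0, 1.0, 2.0, 6.0, 9.0, 10.0]  # assumed distance scores
order = stratified_order(scores, num_bins=2)
print(order)
```

Compared with a plain ascending or descending sort, the result alternates between low-score and high-score samples, which is the diversity effect the policy describes.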
- with_kmeans_complexity(c=None, n_init='auto')
Configure K-means complexity ranking method.
Uses weighted sampling based on intra/inter-cluster distances.
Note: This method does not produce scores, so stratified() policy is not available.
Examples
>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_kmeans_complexity(c=15).hard_first().evaluate(unlabeled_data)
- with_kmeans_distance(c=None, n_init='auto')
Configure K-means distance ranking method.
Ranks samples by distance to assigned cluster centers.
Examples
>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_kmeans_distance(c=10).easy_first().evaluate(unlabeled_data)
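"Distance to assigned cluster centers" can be illustrated without fitting K-means at all. The sketch below assumes the cluster centers have already been fitted by some K-means routine and only shows the assignment and scoring step; the helper name and the toy values are hypothetical.

```python
import math

def distance_to_assigned_center(embeddings, centers):
    """Assign each embedding to its nearest center; return (assignments, distances)."""
    assignments, distances = [], []
    for x in embeddings:
        per_center = [math.dist(x, c) for c in centers]
        nearest = min(range(len(centers)), key=per_center.__getitem__)
        assignments.append(nearest)
        distances.append(per_center[nearest])
    return assignments, distances

centers = [(0.0, 0.0), (10.0, 10.0)]  # assumed output of a prior K-means fit
embeddings = [(0.0, 1.0), (10.0, 10.0), (3.0, 4.0), (8.0, 10.0)]

assignments, distances = distance_to_assigned_center(embeddings, centers)
# Easy-first ordering: samples closest to their own cluster center come first.
easy_first = sorted(range(len(distances)), key=distances.__getitem__)
print(assignments, distances, easy_first)
```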
- with_knn(k=None)
Configure k-nearest neighbors ranking method.
- Parameters:
- k : int | None, default None
Number of nearest neighbors. If None, uses sqrt(n_samples).
- Returns:
Self for method chaining.
- Return type:
Examples
>>> prioritizer = Prioritize(model)
>>> result = prioritizer.with_knn(k=5).hard_first().evaluate(unlabeled_data)
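A brute-force sketch of k-nearest-neighbor scoring, including the sqrt(n_samples) default for k. Two details are assumptions made for illustration and are not confirmed by this reference: the score is taken to be the distance to the k-th nearest neighbor, and the square root is rounded down. `knn_scores` is a hypothetical helper.

```python
import math

def knn_scores(embeddings, k=None):
    """Score each embedding by the distance to its k-th nearest neighbor (brute force)."""
    n = len(embeddings)
    if k is None:
        k = int(math.sqrt(n))      # assumed rounding: floor of sqrt(n_samples)
    k = max(1, min(k, n - 1))      # keep k within the number of available neighbors
    scores = []
    for i, x in enumerate(embeddings):
        dists = sorted(math.dist(x, y) for j, y in enumerate(embeddings) if j != i)
        scores.append(dists[k - 1])
    return scores

# Three clustered points and one isolated point (assumed toy embeddings).
embeddings = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]

scores = knn_scores(embeddings)  # k defaults to int(sqrt(4)) == 2
hardest = max(range(len(scores)), key=scores.__getitem__)
print(scores, hardest)
```

The isolated point receives the largest score, so a hard-first policy would surface it first and an easy-first policy last.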