dataeval.quality.Prioritize
- class dataeval.quality.Prioritize(encoder=None, method=None, k=None, c=None, n_init=None, policy=None, num_bins=None, class_labels=None, reference=None, config=None)
Prioritize dataset samples based on their position in the embedding space.
This class provides factory methods for common configurations and supports both direct instantiation and fluent policy configuration.
- Parameters:
- encoder : EmbeddingEncoder
Encoder to use for extracting embeddings from data.
- method : {"knn", "kmeans_distance", "kmeans_complexity"}, default "knn"
Ranking method to use:
“knn”: K-nearest neighbors distance ranking
“kmeans_distance”: Distance to assigned cluster center
“kmeans_complexity”: Weighted sampling based on cluster structure
- k : int or None, default None
Number of nearest neighbors for “knn” method. If None, uses sqrt(n_samples).
- c : int or None, default None
Number of clusters for kmeans methods. If None, uses sqrt(n_samples).
- n_init : int or "auto", default "auto"
Number of K-means initializations for kmeans methods.
- policy : {"hard_first", "easy_first", "stratified", "class_balance"}, default "hard_first"
Selection policy:
“hard_first”: Challenging samples first (high distance)
“easy_first”: Prototypical samples first (low distance)
“stratified”: Balanced selection across difficulty bins
“class_balance”: Balanced selection across class labels
- num_bins : int, default 50
Number of bins for “stratified” policy.
- class_labels : NDArray[np.integer] or None, default None
Class labels for “class_balance” policy. If None, extracted from AnnotatedDataset metadata during evaluate().
- reference : AnnotatedDataset or Embeddings or None, default None
Optional reference dataset or pre-computed embeddings. When provided, incoming datasets will be prioritized relative to this reference set. Useful for active learning (reference = labeled data) or quality filtering (reference = high-quality corpus).
- config : Prioritize.Config or None, default None
Optional configuration object with default parameters. Parameters specified directly in __init__ will override config defaults.
See also
Outliers, Indices, dataeval.core.rank_knn(), dataeval.core.rank_kmeans_distance(), dataeval.core.rank_kmeans_complexity()
Examples
Using factory methods (recommended):
>>> from dataeval.quality import Prioritize
>>>
>>> # KNN with hard samples first (default policy)
>>> result = Prioritize.knn(encoder, k=10).evaluate(dataset)
>>>
>>> # KNN with easy samples first
>>> result = Prioritize.knn(encoder, k=10).easy_first().evaluate(dataset)
>>>
>>> # K-means distance with stratified sampling
>>> result = Prioritize.kmeans_distance(encoder, c=15).stratified(num_bins=20).evaluate(dataset)
>>>
>>> # K-means complexity with class-balanced selection
>>> result = Prioritize.kmeans_complexity(encoder, c=10).class_balanced().evaluate(labeled_data)
Direct instantiation:
>>> prioritizer = Prioritize(
...     encoder=encoder,
...     method="knn",
...     k=10,
...     policy="stratified",
...     num_bins=20,
... )
>>> result = prioritizer.evaluate(dataset)
Active learning with reference data:
>>> prioritizer = Prioritize.knn(encoder, k=10, reference=labeled_data)
>>> result = prioritizer.hard_first().evaluate(unlabeled_data)
Using configuration:
>>> config = Prioritize.Config(encoder=encoder, method="knn", k=10)
>>> prioritizer = Prioritize(config=config)
- class_balanced(class_labels=None)
Return a new instance configured with class_balance policy.
Ensures balanced representation across class labels while maintaining priority order within each class.
- Parameters:
- class_labels : NDArray[np.integer] or None, default None
Class labels for each sample. If None, will be extracted from AnnotatedDataset metadata during evaluate().
- Returns:
New instance with policy set to “class_balance”.
- Return type:
Examples
>>> result = Prioritize.knn(encoder, k=5).class_balanced(class_labels).evaluate(dataset)
With AnnotatedDataset (labels extracted automatically):
>>> result = Prioritize.knn(encoder, k=5).class_balanced().evaluate(labeled_data)
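The idea of balancing classes while keeping priority order within each class can be illustrated with a small standalone sketch. This is not the library's implementation: the helper name class_balanced_order and the round-robin interleaving are assumptions chosen for illustration, and the inputs are made-up values.

```python
from collections import defaultdict
from itertools import zip_longest


def class_balanced_order(ranked_indices, class_labels):
    """Interleave classes round-robin, keeping priority order within each class.

    Hypothetical helper for illustration only; not part of dataeval.
    """
    per_class = defaultdict(list)
    # ranked_indices is assumed to be in priority order already (best first),
    # so each per-class list inherits that order.
    for idx in ranked_indices:
        per_class[class_labels[idx]].append(idx)

    ordered = []
    # Take the next-best sample from every class in turn until all are used.
    for group in zip_longest(*(per_class[c] for c in sorted(per_class))):
        ordered.extend(i for i in group if i is not None)
    return ordered


ranked = [4, 0, 3, 1, 2, 5]   # made-up priority order (best first)
labels = [0, 0, 0, 1, 1, 0]   # made-up class label per sample index
balanced = class_balanced_order(ranked, labels)
print(balanced)
```

Here the minority class (label 1) appears early in the output instead of being crowded out by the majority class, while samples of the same class keep their relative ranking.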
- easy_first()
Return a new instance configured with easy_first policy.
Selects easy/prototypical samples first (ascending order of difficulty).
Examples
>>> result = Prioritize.knn(encoder, k=5).easy_first().evaluate(dataset)
- evaluate(dataset)
Evaluate the dataset and return prioritized indices.
Uses the configured method and policy to rank samples.
- Parameters:
- dataset : AnnotatedDataset[Any] | Embeddings
The incoming dataset to prioritize. Can be either:
AnnotatedDataset: Will compute embeddings using the encoder
Embeddings: Pre-computed embeddings
- Returns:
Output containing prioritized indices, scores (if available), and configuration information.
- Return type:
- Raises:
ValueError – If class_labels is None when using class_balance policy with Embeddings. If stratified policy is used with kmeans_complexity method.
TypeError – If dataset is neither an AnnotatedDataset nor Embeddings.
Examples
Using factory methods:
>>> result = Prioritize.knn(encoder, k=5).hard_first().evaluate(dataset)
Using direct instantiation:
>>> prioritizer = Prioritize(encoder=encoder, method="knn", k=5, policy="hard_first")
>>> result = prioritizer.evaluate(dataset)
- hard_first()
Return a new instance configured with hard_first policy.
Selects hard/challenging samples first (descending order of difficulty).
Examples
>>> result = Prioritize.knn(encoder, k=5).hard_first().evaluate(dataset)
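The difference between the hard_first and easy_first policies comes down to sort direction over per-sample difficulty scores. The following is a minimal sketch under that reading; order_by_policy is a hypothetical helper, not a dataeval function, and the scores are made up.

```python
def order_by_policy(scores, policy="hard_first"):
    """Return sample indices sorted by score according to the policy.

    Hypothetical helper for illustration only; not part of dataeval.
    """
    if policy not in ("hard_first", "easy_first"):
        raise ValueError(f"unsupported policy: {policy}")
    # hard_first: highest score (largest distance) first.
    # easy_first: lowest score (most prototypical) first.
    descending = policy == "hard_first"
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=descending)


scores = [0.2, 1.5, 0.7, 0.1]   # made-up difficulty scores
hard = order_by_policy(scores, "hard_first")
easy = order_by_policy(scores, "easy_first")
print(hard, easy)
```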
- classmethod kmeans_complexity(encoder, c=None, n_init='auto', reference=None)
Create a Prioritize instance using K-means complexity method.
Uses weighted sampling based on intra/inter-cluster distances.
Note: This method does not produce scores, so “stratified” policy is not available.
- Parameters:
- encoder : EmbeddingEncoder
Encoder to use for extracting embeddings from data.
- c : int or None, default None
Number of clusters. If None, uses sqrt(n_samples).
- n_init : int or "auto", default "auto"
Number of K-means initializations.
- reference : AnnotatedDataset or Embeddings or None, default None
Optional reference dataset for relative prioritization.
- Returns:
Configured instance ready for policy selection and evaluation.
- Return type:
Examples
>>> result = Prioritize.kmeans_complexity(encoder, c=10).hard_first().evaluate(dataset)
- classmethod kmeans_distance(encoder, c=None, n_init='auto', reference=None)
Create a Prioritize instance using K-means distance method.
Ranks samples by distance to their assigned cluster centers.
- Parameters:
- encoder : EmbeddingEncoder
Encoder to use for extracting embeddings from data.
- c : int or None, default None
Number of clusters. If None, uses sqrt(n_samples).
- n_init : int or "auto", default "auto"
Number of K-means initializations.
- reference : AnnotatedDataset or Embeddings or None, default None
Optional reference dataset for relative prioritization.
- Returns:
Configured instance ready for policy selection and evaluation.
- Return type:
Examples
>>> result = Prioritize.kmeans_distance(encoder, c=15).stratified().evaluate(dataset)
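To make "distance to the assigned cluster center" concrete, here is a small standalone sketch. It assumes the cluster centers have already been fitted and that each sample is assigned to its nearest center by Euclidean distance; kmeans_distance_scores is a hypothetical helper and the data are made up, so this should not be read as the library's code.

```python
import math


def kmeans_distance_scores(embeddings, centers):
    """Score each embedding by its Euclidean distance to the nearest center.

    Hypothetical helper for illustration only; not part of dataeval.
    The nearest center is treated as the assigned cluster.
    """
    return [min(math.dist(x, c) for c in centers) for x in embeddings]


centers = [(0.0, 0.0), (10.0, 10.0)]                  # made-up, pre-fitted centers
embeddings = [(0.0, 1.0), (10.0, 10.0), (3.0, 4.0)]   # made-up 2-D embeddings
scores = kmeans_distance_scores(embeddings, centers)
print(scores)
```

A sample sitting on a center scores zero (prototypical), while a sample far from every center gets a large score and would surface early under hard_first.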
- classmethod knn(encoder, k=None, reference=None)
Create a Prioritize instance using k-nearest neighbors method.
- Parameters:
- encoder : EmbeddingEncoder
Encoder to use for extracting embeddings from data.
- k : int or None, default None
Number of nearest neighbors. If None, uses sqrt(n_samples).
- reference : AnnotatedDataset or Embeddings or None, default None
Optional reference dataset for relative prioritization.
- Returns:
Configured instance ready for policy selection and evaluation.
- Return type:
Examples
>>> result = Prioritize.knn(encoder, k=10).hard_first().evaluate(dataset)
>>> result = Prioritize.knn(encoder, k=5).easy_first().evaluate(dataset)
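As a rough illustration of k-nearest-neighbor distance scoring, the sketch below scores each sample by the Euclidean distance to its k-th nearest neighbor. Two points are assumptions rather than documented behavior: that the k-th neighbor distance (and not, say, a mean over the k neighbors) is the score, and that the sqrt(n_samples) default is rounded down to an integer. knn_scores is a hypothetical helper, not a dataeval function.

```python
import math


def knn_scores(embeddings, k=None):
    """Score each sample by the distance to its k-th nearest neighbor.

    Hypothetical helper for illustration only; not part of dataeval.
    """
    n = len(embeddings)
    if k is None:
        # Assumption: the sqrt(n_samples) default is floored to an integer.
        k = max(1, math.isqrt(n))
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n_samples")

    scores = []
    for i, x in enumerate(embeddings):
        # Distances to every other sample, nearest first (brute force).
        dists = sorted(math.dist(x, y) for j, y in enumerate(embeddings) if j != i)
        scores.append(dists[k - 1])
    return scores


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]  # made-up embeddings
scores = knn_scores(points, k=1)
print(scores)
```

The isolated point at (5, 5) receives the largest score, so it would be ranked first under hard_first and last under easy_first.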
- stratified(num_bins=DEFAULT_PRIORITIZE_NUM_BINS)
Return a new instance configured with stratified policy.
Balances selection across different difficulty levels by binning scores and sampling uniformly from bins.
Note: Only available with methods that produce scores (“knn”, “kmeans_distance”).
- Parameters:
- num_bins : int, default 50
Number of bins for stratification.
- Returns:
New instance with policy set to “stratified”.
- Return type:
Examples
>>> result = Prioritize.knn(encoder, k=5).stratified(num_bins=20).evaluate(dataset)
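The binning idea behind the stratified policy can be sketched as follows. This version assumes equal-width bins over the score range and a deterministic round-robin pass across non-empty bins; the library may bin or sample differently (for example, randomly within bins). stratified_order is a hypothetical helper and the scores are made up.

```python
from itertools import zip_longest


def stratified_order(scores, num_bins=50):
    """Order sample indices by cycling through equal-width score bins.

    Hypothetical helper for illustration only; not part of dataeval.
    """
    lo, hi = min(scores), max(scores)
    # Fall back to a width of 1.0 when all scores are identical.
    width = (hi - lo) / num_bins or 1.0

    bins = [[] for _ in range(num_bins)]
    for idx, s in enumerate(scores):
        # Clamp so the maximum score lands in the last bin.
        b = min(int((s - lo) / width), num_bins - 1)
        bins[b].append(idx)

    ordered = []
    # Take one sample from each non-empty bin per pass.
    for group in zip_longest(*(b for b in bins if b)):
        ordered.extend(i for i in group if i is not None)
    return ordered


scores = [0.1, 0.2, 0.9, 0.15, 0.95, 0.5]   # made-up difficulty scores
order = stratified_order(scores, num_bins=3)
print(order)
```

Each pass draws from the low, middle, and high score ranges in turn, so the first few selected samples cover all difficulty levels. A method that yields no scores has nothing to bin, which is consistent with the note that this policy requires “knn” or “kmeans_distance”.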
Classes
Configuration for Prioritize evaluator.