dataeval.data.selections.Prioritize
class dataeval.data.selections.Prioritize(model: dataeval.protocols.EmbeddingModel | None, batch_size: int, device: dataeval.config.DeviceLike | None, method: 'knn', policy: 'hard_first' | 'easy_first' | 'stratified' | 'class_balance', *, k: int | None = None, class_label: numpy.typing.NDArray[numpy.integer[Any]] | None = None)
class Prioritize(model, batch_size, device, method, policy, *, c=None, class_label=None)
class Prioritize(model, batch_size, device, method, policy, *, k=None, c=None, class_label)
class Prioritize(model, batch_size, device, method, policy, *, k=None, c=None, class_label=None)
Sort the dataset indices in order of highest priority data in the embedding space.
- Parameters:
- model : EmbeddingModel or None
Model to use for encoding images.
- batch_size : int
Batch size to use when encoding images.
- device : DeviceLike or None
Device to use for encoding images.
- method : Literal["knn", "kmeans_distance", "kmeans_complexity"]
Method to use for prioritization.
- k : int or None, default None
Number of nearest neighbors to use for prioritization. If None, uses the square root of the number of samples. Only used for method="knn", ignored otherwise.
- c : int or None, default None
Number of clusters to use for prioritization. If None, uses the square root of the number of samples. Only used for method="kmeans_*", ignored otherwise.
Notes
k is only used for method ["knn"].
c is only used for methods ["kmeans_distance", "kmeans_complexity"].
- Raises:
ValueError – If method is not one of the supported methods.
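The interaction between method, k, and c described above can be sketched in plain Python. This is only an illustration, not the library's implementation: the helper name resolve_count is hypothetical, and flooring the square root with math.isqrt is an assumption, since the documentation does not say how the square root is rounded.

```python
import math

# Method names taken from the documented Literal type.
SUPPORTED_METHODS = ("knn", "kmeans_distance", "kmeans_complexity")


def resolve_count(method, n_samples, k=None, c=None):
    """Hypothetical helper: pick the neighbor or cluster count for a method.

    Assumption: the default is the floored square root of the sample count.
    """
    if method not in SUPPORTED_METHODS:
        raise ValueError(f"method must be one of {SUPPORTED_METHODS}, got {method!r}")

    default = max(1, math.isqrt(n_samples))
    if method == "knn":
        # c is ignored for "knn".
        return k if k is not None else default
    # k is ignored for the "kmeans_*" methods.
    return c if c is not None else default
```

With 100 samples and no explicit k, this sketch yields 10 neighbors for method="knn"; passing c to "knn" has no effect, mirroring the Notes above.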
classmethod using(method: 'knn', policy: 'hard_first' | 'easy_first' | 'stratified' | 'class_balance', *, k: int | None = None, embeddings: dataeval.data.Embeddings | None = None, reference: dataeval.data.Embeddings | None = None, class_label: numpy.typing.NDArray[numpy.integer[Any]] | None = None) → Prioritize
classmethod using(method: 'kmeans_distance' | 'kmeans_complexity', policy: 'hard_first' | 'easy_first' | 'stratified' | 'class_balance', *, c: int | None = None, embeddings: dataeval.data.Embeddings | None = None, reference: dataeval.data.Embeddings | None = None, class_label: numpy.typing.NDArray[numpy.integer[Any]] | None = None) → Prioritize
Use precalculated embeddings to sort the dataset indices in order of highest priority data in the embedding space.
- Parameters:
- method : Literal["knn", "kmeans_distance", "kmeans_complexity"]
Method to use for sample scoring during prioritization.
- policy : Literal["hard_first", "easy_first", "stratified", "class_balance"]
Selection policy for prioritizing scored samples.
- embeddings : Embeddings or None, default None
Embeddings to use during prioritization. If None, reference must be set.
- reference : Embeddings or None, default None
Reference embeddings against which the dataset embeddings are prioritized. If embeddings is None, these are used instead.
- k : int or None, default None
Number of nearest neighbors to use for prioritization. If None, uses the square root of the number of samples. Only used for method="knn", ignored otherwise.
- c : int or None, default None
Number of clusters to use for prioritization. If None, uses the square root of the number of samples. Only used for method="kmeans_*", ignored otherwise.
Notes
k is only used for method ["knn"].
c is only used for methods ["kmeans_distance", "kmeans_complexity"].
- Raises:
ValueError – If both embeddings and reference are None.
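To make the embeddings/reference fallback and the role of the policy concrete, here is a minimal pure-Python sketch. It is not dataeval's implementation. The function name prioritize_knn is hypothetical, and several choices are assumptions made only for illustration: scoring each sample by its mean Euclidean distance to its k nearest neighbors, treating "hard_first" as highest score first and "easy_first" as lowest score first, and ranking a set against itself when no separate reference is given.

```python
import math


def prioritize_knn(embeddings=None, reference=None, k=None, policy="hard_first"):
    """Hypothetical sketch: order sample indices by a kNN distance score."""
    # Documented behavior: at least one of the two inputs must be provided.
    if embeddings is None and reference is None:
        raise ValueError("embeddings and reference cannot both be None")
    # Documented behavior: reference stands in when embeddings is None.
    if embeddings is None:
        embeddings = reference

    # Assumption: without a distinct reference set, rank the points against themselves.
    self_ranked = reference is None or reference is embeddings
    pool = embeddings if self_ranked else reference

    # Assumption: default k is the floored square root of the sample count.
    if k is None:
        k = max(1, math.isqrt(len(pool)))

    scores = []
    for point in embeddings:
        dists = sorted(math.dist(point, other) for other in pool)
        if self_ranked:
            dists = dists[1:]  # drop the zero distance from the point to itself
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest) if nearest else 0.0)

    # Ascending score order; ties keep their original index order.
    order = sorted(range(len(scores)), key=scores.__getitem__)
    if policy == "hard_first":
        order.reverse()  # assumption: "hard" means far from its neighbors
    elif policy != "easy_first":
        raise NotImplementedError(f"policy {policy!r} is not covered by this sketch")
    return order


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0)]
ranking = prioritize_knn(embeddings=points, k=1)
```

In this toy data the isolated point at index 3 is ranked first under "hard_first" and last under "easy_first". The "stratified" and "class_balance" policies are deliberately left out, since the page does not describe how they order samples.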