dataeval.data.selections.Prioritize

class dataeval.data.selections.Prioritize(model: dataeval.protocols.EmbeddingModel | None, batch_size: int, device: dataeval.config.DeviceLike | None, method: 'knn', policy: 'hard_first' | 'easy_first' | 'stratified' | 'class_balance', *, k: int | None = None, class_label: numpy.typing.NDArray[numpy.integer[Any]] | None = None)

class Prioritize(model, batch_size, device, method, policy, *, c=None, class_label=None)

class Prioritize(model, batch_size, device, method, policy, *, k=None, c=None, class_label)

class Prioritize(model, batch_size, device, method, policy, *, k=None, c=None, class_label=None)

Sort the dataset indices in order of highest priority data in the embedding space.

Parameters:

model : EmbeddingModel or None: Model to use for encoding images.
batch_size : int: Batch size to use when encoding images.
device : DeviceLike or None: Device to use for encoding images.
method : Literal["knn", "kmeans_distance", "kmeans_complexity"]: Method to use for prioritization.
k : int or None, default None: Number of nearest neighbors to use for prioritization. If None, uses the square root of the number of samples. Only used for method="knn", ignored otherwise.
c : int or None, default None: Number of clusters to use for prioritization. If None, uses the square root of the number of samples. Only used for method="kmeans_*", ignored otherwise.

Notes

k is only used for method "knn".
c is only used for methods "kmeans_distance" and "kmeans_complexity".

Raises: ValueError – If method is not one of the supported methods.
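The default for k and the effect of the "hard_first" and "easy_first" policies can be illustrated with a small standard-library sketch. This is not the library's implementation. It assumes that a sample's score is the distance to its k-th nearest neighbor, that the square root is rounded down, and that "hard_first" means highest score first; none of these details is stated on this page. The names knn_scores and order_by_policy are hypothetical helpers.

```python
import math


def knn_scores(embeddings, k=None):
    """Score each sample by the distance to its k-th nearest neighbor.

    Assumption: this scoring rule is illustrative, not taken from dataeval.
    """
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two samples to find a neighbor")
    if k is None:
        # Assumption: the square root of the sample count is rounded down.
        k = max(1, math.isqrt(n))
    k = min(k, n - 1)  # a sample has at most n - 1 neighbors
    scores = []
    for i, point in enumerate(embeddings):
        # Distances from this sample to every other sample, nearest first.
        distances = sorted(
            math.dist(point, other)
            for j, other in enumerate(embeddings)
            if j != i
        )
        scores.append(distances[k - 1])
    return scores


def order_by_policy(scores, policy="hard_first"):
    """Return sample indices sorted by score.

    Assumption: "hard_first" puts the most isolated samples first and
    "easy_first" puts the most typical samples first.
    """
    if policy not in ("hard_first", "easy_first"):
        raise ValueError(f"unsupported policy in this sketch: {policy}")
    return sorted(
        range(len(scores)),
        key=lambda i: scores[i],
        reverse=(policy == "hard_first"),
    )


# Four points on a unit square plus one distant outlier at index 4.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = knn_scores(points)  # k defaults to isqrt(5) == 2
hard = order_by_policy(scores, "hard_first")
easy = order_by_policy(scores, "easy_first")
```

Under these assumptions the outlier is ranked first by "hard_first" and last by "easy_first".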

Use precalculated embeddings to sort the dataset indices in order of highest priority data in the embedding space.

Parameters:

method : Literal["knn", "kmeans_distance", "kmeans_complexity"]: Method to use for sample scoring during prioritization.
policy : Literal["hard_first", "easy_first", "stratified", "class_balance"]: Selection policy for prioritizing scored samples.
embeddings : Embeddings or None, default None: Embeddings to use during prioritization. If None, reference must be set.
reference : Embeddings or None, default None: Reference embeddings against which the dataset embeddings are prioritized. If embeddings is None, the reference embeddings are used instead.
k : int or None, default None: Number of nearest neighbors to use for prioritization. If None, uses the square root of the number of samples. Only used for method="knn", ignored otherwise.
c : int or None, default None: Number of clusters to use for prioritization. If None, uses the square root of the number of samples. Only used for method="kmeans_*", ignored otherwise.

Notes

k is only used for method "knn".
c is only used for methods "kmeans_distance" and "kmeans_complexity".
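To make the role of c concrete, the following standard-library sketch clusters the samples into c groups and scores each sample by its distance to the nearest cluster center. It is only an illustration. The scoring rule, the plain Lloyd iterations, the deterministic initialization from the first c samples, and the rounding down of the square root are all assumptions, and kmeans_distance_scores is a hypothetical name rather than a dataeval function.

```python
import math


def kmeans_distance_scores(points, c=None, iterations=10):
    """Score each sample by its distance to the nearest k-means centroid.

    Assumption: illustrative scoring rule, not taken from dataeval.
    """
    n = len(points)
    if n == 0:
        raise ValueError("need at least one sample")
    if c is None:
        # Assumption: the square root of the sample count is rounded down.
        c = max(1, math.isqrt(n))
    c = min(c, n)
    # Deterministic initialization: use the first c samples as centroids.
    centroids = [list(p) for p in points[:c]]
    for _ in range(iterations):
        # Assignment step: attach every sample to its nearest centroid.
        clusters = [[] for _ in range(c)]
        for p in points:
            nearest = min(range(c), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return [min(math.dist(p, ctr) for ctr in centroids) for p in points]


# Two tight pairs of points; c defaults to isqrt(4) == 2 clusters.
samples = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
kmeans_scores = kmeans_distance_scores(samples)
```

In this toy data each sample ends up half a unit from the center of its own pair, so all four scores are equal.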

Raises: ValueError – If both embeddings and reference are None.
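The interaction between embeddings and reference described above can be summarized as a small decision rule. The sketch below is a hypothetical helper, not part of dataeval. It assumes that when only one of the two is given, that set is both the data being scored and the data it is scored against.

```python
def resolve_embeddings(embeddings=None, reference=None):
    """Return a (to_score, scored_against) pair.

    Hypothetical helper mirroring the documented fallback rules.
    """
    if embeddings is None and reference is None:
        # Documented behavior: at least one of the two must be provided.
        raise ValueError("either embeddings or reference must be provided")
    if embeddings is None:
        # Documented behavior: reference is used in place of embeddings.
        return reference, reference
    if reference is None:
        # Assumption: without a reference, samples are scored among themselves.
        return embeddings, embeddings
    return embeddings, reference


data = [(0.0, 0.0), (1.0, 1.0)]
ref = [(5.0, 5.0)]
only_reference = resolve_embeddings(reference=ref)
both = resolve_embeddings(embeddings=data, reference=ref)
```

Calling resolve_embeddings() with no arguments raises ValueError, matching the Raises entry above.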