dataeval.data.selections.Prioritize

class dataeval.data.selections.Prioritize(model: torch.nn.Module | None, batch_size: int, device: dataeval.config.DeviceLike | None, method: 'knn', policy: 'hard_first' | 'easy_first' | 'stratified' | 'class_balance', *, k: int | None = None, class_label: numpy.typing.NDArray[numpy.integer[Any]] | None = None)
class Prioritize(model, batch_size, device, method, policy, *, c=None, class_label=None)
class Prioritize(model, batch_size, device, method, policy, *, k=None, c=None, class_label)
class Prioritize(model, batch_size, device, method, policy, *, k=None, c=None, class_label=None)

Sort the dataset indices so that the highest-priority samples, as scored in the embedding space, come first.

Parameters:
model : torch.nn.Module | None

Model to use for encoding images

batch_size : int

Batch size to use when encoding images

device : DeviceLike or None

Device to use for encoding images

method : Literal["knn", "kmeans_distance", "kmeans_complexity"]

Method to use for prioritization

policy : Literal["hard_first", "easy_first", "stratified", "class_balance"]

Selection policy for prioritizing scored samples

k : int or None, default None

Number of nearest neighbors to use for prioritization. If None, uses the square root of the number of samples. Only used for method="knn"; ignored otherwise.

c : int or None, default None

Number of clusters to use for prioritization. If None, uses the square root of the number of samples. Only used for method="kmeans_*"; ignored otherwise.

Notes

  1. k is only used for method ["knn"].

  2. c is only used for methods ["kmeans_distance", "kmeans_complexity"].

Raises:

ValueError – If method is not one of the supported methods
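
The parameter rules above (k applies only to "knn", c only to the "kmeans_*" methods, both default to the square root of the number of samples, and an unsupported method raises ValueError) can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name resolve_count is hypothetical, and rounding the square root down to an integer is an assumption the documentation does not state.

```python
import math

KNN_METHODS = ("knn",)
KMEANS_METHODS = ("kmeans_distance", "kmeans_complexity")


def resolve_count(method, n_samples, k=None, c=None):
    """Hypothetical helper: pick the neighbor or cluster count for a method."""
    if method in KNN_METHODS:
        chosen = k  # c is ignored for "knn"
    elif method in KMEANS_METHODS:
        chosen = c  # k is ignored for the "kmeans_*" methods
    else:
        raise ValueError(f"Unsupported method: {method!r}")
    # Assumption: the square-root default is rounded down to an integer.
    return chosen if chosen is not None else math.isqrt(n_samples)


print(resolve_count("knn", 100))                   # 10
print(resolve_count("kmeans_distance", 100, c=4))  # 4
```

Note that passing k together with a "kmeans_*" method has no effect in this sketch, which mirrors the "ignored otherwise" wording above.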

classmethod using(method: 'knn', policy: 'hard_first' | 'easy_first' | 'stratified' | 'class_balance', *, k: int | None = None, embeddings: dataeval.data.Embeddings | None = None, reference: dataeval.data.Embeddings | None = None, class_label: numpy.typing.NDArray[numpy.integer[Any]] | None = None) Prioritize
classmethod using(method: 'kmeans_distance' | 'kmeans_complexity', policy: 'hard_first' | 'easy_first' | 'stratified' | 'class_balance', *, c: int | None = None, embeddings: dataeval.data.Embeddings | None = None, reference: dataeval.data.Embeddings | None = None, class_label: numpy.typing.NDArray[numpy.integer[Any]] | None = None) Prioritize

Use precalculated embeddings to sort the dataset indices so that the highest-priority samples, as scored in the embedding space, come first.

Parameters:
method : Literal["knn", "kmeans_distance", "kmeans_complexity"]

Method to use for sample scoring during prioritization.

policy : Literal["hard_first", "easy_first", "stratified", "class_balance"]

Selection policy for prioritizing scored samples.

embeddings : Embeddings or None, default None

Embeddings to use during prioritization. If None, reference must be set.

reference : Embeddings or None, default None

Reference embeddings against which the dataset embeddings are prioritized. If embeddings is None, the reference embeddings are used in their place.

k : int or None, default None

Number of nearest neighbors to use for prioritization. If None, uses the square root of the number of samples. Only used for method="knn"; ignored otherwise.

c : int or None, default None

Number of clusters to use for prioritization. If None, uses the square root of the number of samples. Only used for method="kmeans_*"; ignored otherwise.

Notes

  1. k is only used for method ["knn"].

  2. c is only used for methods ["kmeans_distance", "kmeans_complexity"].
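
To make the idea of scoring samples and then applying a policy more concrete, here is a minimal NumPy sketch of one plausible reading of method="knn" with the "hard_first" and "easy_first" policies. It is not the library's implementation. It assumes that a sample's score is the distance to its k-th nearest neighbor and that "hard" means a large distance; the function names knn_scores and order_indices are hypothetical, and the "stratified" and "class_balance" policies are not covered.

```python
import numpy as np


def knn_scores(embeddings, k):
    """Distance from each sample to its k-th nearest neighbor (excluding itself)."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Column 0 of each sorted row is the zero self-distance,
    # so column k holds the distance to the k-th neighbor.
    return np.sort(dists, axis=1)[:, k]


def order_indices(scores, policy):
    """Hypothetical ordering for the two simplest policies."""
    ascending = np.argsort(scores, kind="stable")
    if policy == "easy_first":
        return ascending  # Assumption: small distance means "easy"
    if policy == "hard_first":
        return ascending[::-1]  # Assumption: large distance means "hard"
    raise ValueError(f"Policy not covered by this sketch: {policy!r}")


# Three tightly grouped points and one isolated point.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
scores = knn_scores(points, k=1)
print(order_indices(scores, "hard_first")[0])  # 3, the isolated point
```

The pairwise distance matrix keeps the sketch short but grows quadratically with the number of samples, so it is only suitable for small arrays.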

Raises:

ValueError – If both embeddings and reference are None
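
The fallback between embeddings and reference described above can be summarized with a small sketch. The helper resolve_embeddings is hypothetical and only mirrors the documented behavior (use reference when embeddings is None, raise ValueError when both are None); the error message text is an assumption, and the real method may handle the two arguments differently internally.

```python
def resolve_embeddings(embeddings=None, reference=None):
    """Hypothetical helper: choose which embeddings get prioritized."""
    if embeddings is None and reference is None:
        raise ValueError("Either embeddings or reference must be provided")
    # When no dataset embeddings are given, the reference embeddings stand in.
    return reference if embeddings is None else embeddings


print(resolve_embeddings(reference="ref"))                    # ref
print(resolve_embeddings(embeddings="emb", reference="ref"))  # emb
```

The strings here are stand-ins for Embeddings objects, used only to keep the sketch self-contained.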