dataeval.Embeddings

class dataeval.Embeddings(dataset=None, extractor=None, batch_size=None, path=None, memory_threshold=0.8, progress_callback=None)

Collection of image embeddings from a dataset.

Embeddings are accessed by index or slice and are loaded on demand. For large datasets, embeddings are automatically memory-mapped to disk to avoid exceeding available memory.
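The on-demand access pattern described above can be illustrated with a small pure-Python sketch. It is not DataEval's implementation: the class name LazyEmbeddingCache, the embed callable, and the n_cached property are hypothetical, chosen only to show how index and slice access can trigger computation for just the requested items.

```python
from typing import Callable, Sequence


class LazyEmbeddingCache:
    """Hypothetical stand-in: computes one embedding per item on first access."""

    def __init__(self, items: Sequence, embed: Callable) -> None:
        self._items = items
        self._embed = embed
        self._cache: dict = {}  # index -> embedding, filled on demand

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Resolve the slice against the dataset length, then load each index.
            return [self[i] for i in range(*key.indices(len(self)))]
        index = key + len(self) if key < 0 else key
        if not 0 <= index < len(self):
            raise IndexError(key)
        if index not in self._cache:
            # First access: run the (assumed expensive) embedding step once.
            self._cache[index] = self._embed(self._items[index])
        return self._cache[index]

    @property
    def n_cached(self) -> int:
        """Number of embeddings computed so far."""
        return len(self._cache)


cache = LazyEmbeddingCache(
    [[1, 2], [3, 4], [5, 6]],
    embed=lambda image: [value * 0.5 for value in image],
)
first = cache[0]   # computes only item 0
tail = cache[1:]   # computes items 1 and 2; item 0 stays cached
```

A real implementation would batch the requested indices and store results in an array rather than a dict; the sketch keeps one item per call for readability.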

This class also implements the FeatureExtractor protocol, allowing it to be used directly with drift detectors and quality metrics that accept feature extractors.

Parameters:
dataset : ImageClassificationDataset, ObjectDetectionDataset, or None, default None

Dataset to access original images from. When None, creates an unbound instance that can be used as a reusable feature extractor. Use bind() to attach a dataset later, or pass data directly to __call__().

extractor : FeatureExtractor or None, default None

Feature extractor for converting images to embeddings. Handles model inference, device management, and transforms. When None, uses FlattenExtractor, a simple model-free baseline that is compatible with all DataEval tools.

batch_size : int or None, default None

Number of samples to process per batch. When None, uses DataEval’s configured batch size via get_batch_size().

path : Path, str, or None, default None

File path for memory-mapped storage. When None, embeddings are cached in memory only. When a Path or string is provided, embeddings are memory-mapped to that file once their estimated size exceeds memory_threshold.

memory_threshold : float, default 0.8

Fraction of available memory (0-1) that triggers memory-mapped storage. When estimated embedding size exceeds this threshold, uses disk-backed memmap instead of in-memory arrays. Only applies when path is provided.

progress_callback : ProgressCallback or None, default None

Callback to report progress during embedding computation.

memory_threshold

Fraction of available memory (0-1) that triggers memory-mapped storage.

Type:

float
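As a rough illustration of how memory_threshold and path interact, the following sketch compares an estimated embedding size against a fraction of available memory. The function name should_use_memmap and its arguments are hypothetical, and the exact size estimate DataEval uses is not documented here; this only mirrors the rule stated in the parameter descriptions.

```python
def should_use_memmap(
    n_samples,
    embedding_dim,
    itemsize,
    available_bytes,
    memory_threshold=0.8,
    path=None,
):
    """Hypothetical helper: decide between in-memory and disk-backed storage."""
    if path is None:
        # Without a path there is nowhere to memory-map, so stay in memory.
        return False
    estimated_bytes = n_samples * embedding_dim * itemsize
    return estimated_bytes > memory_threshold * available_bytes


two_gib = 2 * 1024**3

# 1,000,000 float32 embeddings of width 512 need 2,048,000,000 bytes,
# which is more than 80% of 2 GiB (about 1,717,986,918 bytes).
large = should_use_memmap(1_000_000, 512, 4, two_gib, path="embeddings.npy")

# 100,000 such embeddings need 204,800,000 bytes, well under the threshold.
small = should_use_memmap(100_000, 512, 4, two_gib, path="embeddings.npy")

# With no path configured, the threshold is never applied.
no_path = should_use_memmap(1_000_000, 512, 4, two_gib, path=None)
```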

Example

Using with a PyTorch model:

>>> from dataeval import Embeddings
>>> from dataeval.extractors import TorchExtractor
>>>
>>> extractor = TorchExtractor(my_model)
>>> embeddings = Embeddings(train_dataset, extractor=extractor, batch_size=32)
>>> train_emb = embeddings[:]
>>> train_emb.shape
(40, 32)

Using with default flattening (no model):

>>> import numpy as np
>>>
>>> # Uses FlattenExtractor by default
>>> embeddings = Embeddings(dataset)
>>> flat_features = np.asarray(embeddings)

bind(dataset)

Bind this instance to a dataset.

Attaches a dataset to this Embeddings instance for embedding computation. Any previously cached embeddings are cleared.

Parameters:
dataset : ImageClassificationDataset or ObjectDetectionDataset

Dataset to bind for embedding computation.

Returns:

Returns self for method chaining.

Return type:

Self

Raises:

ValueError – When called on an embeddings-only instance.

Example

>>> from dataeval import Embeddings
>>> from dataeval.extractors import TorchExtractor
>>>
>>> extractor = TorchExtractor(my_model)
>>> emb = Embeddings(extractor=extractor, batch_size=32)
>>> _ = emb.bind(train_dataset)
>>> embeddings = emb()

compute(force=False)

Compute and cache all embeddings.

Forces evaluation of all lazy embeddings, storing them in memory or memmap according to the configured storage strategy. Progress updates are reported via the progress_callback if configured.

Parameters:
force : bool, default False

If True, recomputes all embeddings even if already cached. If False, only computes uncached embeddings.

Returns:

Returns self for method chaining.

Return type:

Embeddings
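The difference between force=False and force=True can be sketched with a toy cache. The class CachedSquares and its touch method are hypothetical and stand in for a lazily filled embedding store; squaring a number replaces the expensive embedding step. This shows the documented semantics only, not DataEval's internals.

```python
class CachedSquares:
    """Hypothetical stand-in for a lazily filled embedding cache."""

    def __init__(self, values):
        self._values = list(values)
        self._cache = {}
        self.calls = 0  # how many times the expensive step has run

    def _expensive(self, value):
        self.calls += 1
        return value * value

    def touch(self, index):
        """Lazily compute a single entry, reusing the cache when possible."""
        if index not in self._cache:
            self._cache[index] = self._expensive(self._values[index])
        return self._cache[index]

    def compute(self, force=False):
        """Fill the cache; with force=True, discard cached entries first."""
        if force:
            self._cache.clear()
        for index in range(len(self._values)):
            self.touch(index)
        return self  # returning self allows method chaining


squares = CachedSquares([1, 2, 3])
squares.touch(0)                   # one entry computed lazily
squares.compute()                  # only the two uncached entries are computed
calls_after_fill = squares.calls
squares.compute(force=True)        # all three entries are recomputed
calls_after_force = squares.calls
```

Because compute() returns the instance, calls such as squares.compute().touch(0) can be chained.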

new(dataset)

Create new Embeddings instance with a different dataset.

Generate a new Embeddings object using the same extractor and configuration but with a different dataset.

Parameters:
dataset : ImageClassificationDataset or ObjectDetectionDataset

Dataset that provides images for the new Embeddings instance.

Returns:

New Embeddings object configured identically to the current instance.

Return type:

Embeddings

Raises:

ValueError – When called on an embeddings-only instance that lacks an extractor.

save(path=None)

Compute all embeddings and save to disk.

Forces computation of all embeddings if not already computed, then saves to the specified file path. Progress updates are reported via the progress_callback if configured during computation.

Parameters:
path : Path, str, or None, default None

File path where embeddings will be saved. When None, uses the configured path from initialization. Raises ValueError if no path is available.

Raises:

ValueError – When no path is specified and instance has no configured path.
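The path fallback described for save() can be sketched as a small helper. The name resolve_save_path and the configured_path argument are hypothetical; the sketch only mirrors the documented rule that an explicit path wins, the path configured at initialization is used otherwise, and a ValueError is raised when neither exists.

```python
from pathlib import Path


def resolve_save_path(path=None, configured_path=None):
    """Hypothetical helper: pick the file path that a save call would use."""
    target = path if path is not None else configured_path
    if target is None:
        raise ValueError("No path specified and no path configured at initialization.")
    return Path(target)


explicit = resolve_save_path("run2.npy", configured_path="run1.npy")  # explicit path wins
fallback = resolve_save_path(None, configured_path="run1.npy")        # configured path used

try:
    resolve_save_path()
    raised = False
except ValueError:
    raised = True
```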

property batch_size : int

Return the batch size used for embedding computation.

Returns:

Number of samples processed per batch.

Return type:

int

property is_bound : bool

Whether this instance is bound to a dataset.

Returns:

True if a dataset is bound, False otherwise.

Return type:

bool