dataeval.Embeddings¶

class dataeval.Embeddings(dataset=None, extractor=None, batch_size=None, path=None, memory_threshold=0.8, progress_callback=None)¶

Collection of image embeddings from a dataset.

Embeddings are accessed by index or slice and are loaded on demand. For large datasets, when a path is provided and the estimated embedding size exceeds memory_threshold, embeddings are memory-mapped to disk to avoid exceeding available memory.
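The on-demand access pattern can be illustrated with a small toy model. This is a sketch only and not DataEval code: the class name LazyVectors, the per-item cache, and the call counter are assumptions made for illustration, and the library may compute and cache in batches instead.

```python
class LazyVectors:
    """Toy stand-in: computes each item's vector on first access and caches it."""

    def __init__(self, items, extract):
        self._items = list(items)
        self._extract = extract  # hypothetical per-item feature function
        self._cache = {}
        self.calls = 0  # counts how many times extract() actually ran

    def __len__(self):
        return len(self._items)

    def _get(self, i):
        # Compute only if this index has not been requested before.
        if i not in self._cache:
            self.calls += 1
            self._cache[i] = self._extract(self._items[i])
        return self._cache[i]

    def __getitem__(self, key):
        if isinstance(key, slice):
            return [self._get(i) for i in range(*key.indices(len(self)))]
        if key < 0:
            key += len(self)
        return self._get(key)


lazy = LazyVectors([[1, 2], [3, 4], [5, 6]], extract=lambda x: [float(v) for v in x])
first = lazy[0]      # computes one item
head = lazy[:2]      # reuses item 0, computes item 1
everything = lazy[:]  # computes only the remaining item
```

In this sketch, repeated access to an index never re-runs the extractor, which is the property that makes lazy access cheap after the first pass.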

This class also implements the FeatureExtractor protocol, allowing it to be used directly with drift detectors and quality metrics that accept feature extractors.

Parameters:¶

dataset : ImageClassificationDataset, ObjectDetectionDataset, or None, default None¶: Dataset to access original images from. When None, creates an unbound instance that can be used as a reusable feature extractor. Use bind() to attach a dataset later, or pass data directly to __call__().
extractor : FeatureExtractor or None, default None¶: Feature extractor for converting images to embeddings. Handles model inference, device management, and transforms. When None, uses FlattenExtractor for simple baseline compatibility with all DataEval tools.
batch_size : int or None, default None¶: Number of samples to process per batch. When None, uses DataEval’s configured batch size via get_batch_size().
path : Path, str, or None, default None¶: File path for memory-mapped storage. When None, embeddings are cached in memory only. When a Path or string is provided, embeddings are stored in a memory-mapped file at that location if their estimated size exceeds memory_threshold.
memory_threshold : float, default 0.8¶: Fraction of available memory (0-1) that triggers memory-mapped storage. When estimated embedding size exceeds this threshold, uses disk-backed memmap instead of in-memory arrays. Only applies when path is provided.
progress_callback : ProgressCallback or None, default None¶: Callback to report progress during embedding computation.
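The interaction between path and memory_threshold can be sketched as a simple size check. The helper below is hypothetical and not part of DataEval: the name should_use_memmap, its arguments, and the size estimate (samples × embedding dimension × bytes per element) are assumptions made for illustration, and the library's actual estimate may differ.

```python
def should_use_memmap(n_samples, embedding_dim, itemsize, available_bytes,
                      memory_threshold=0.8, path=None):
    """Hypothetical helper: choose between in-memory and memmap storage."""
    # Per the parameter description, the threshold only applies when a path is given.
    if path is None:
        return False
    # Assumed size estimate: one dense array of n_samples x embedding_dim elements.
    estimated_bytes = n_samples * embedding_dim * itemsize
    return estimated_bytes > memory_threshold * available_bytes


two_gib = 2 * 1024**3

# About 2.05 GB of float32 embeddings against 2 GiB of available memory.
large = should_use_memmap(1_000_000, 512, 4, two_gib, path="embeddings.npy")

# About 2 MB of embeddings fits comfortably in memory.
small = should_use_memmap(1_000, 512, 4, two_gib, path="embeddings.npy")

# Without a path, storage stays in memory regardless of size.
no_path = should_use_memmap(1_000_000, 512, 4, two_gib, path=None)
```

Under these assumptions, only the first case would switch to disk-backed storage.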

memory_threshold¶

Fraction of available memory (0-1) that triggers memory-mapped storage.

Type:¶: float

Example

Using with a PyTorch model:

>>> from dataeval import Embeddings
>>> from dataeval.extractors import TorchExtractor
>>>
>>> extractor = TorchExtractor(my_model)
>>> embeddings = Embeddings(train_dataset, extractor=extractor, batch_size=32)
>>> train_emb = embeddings[:]
>>> train_emb.shape
(40, 32)

Using with default flattening (no model):

>>> import numpy as np
>>>
>>> # Uses FlattenExtractor by default
>>> embeddings = Embeddings(dataset)
>>> flat_features = np.asarray(embeddings)

bind(dataset)¶

Bind this instance to a dataset.

Attaches a dataset to this Embeddings instance for embedding computation. Any previously cached embeddings are cleared.

Parameters:¶

dataset : ImageClassificationDataset or ObjectDetectionDataset¶: Dataset to bind for embedding computation.

Returns:¶

Returns self for method chaining.

Return type:¶

Self

Raises:¶

ValueError – When called on an embeddings-only instance.

Example

>>> from dataeval import Embeddings
>>> from dataeval.extractors import TorchExtractor
>>>
>>> extractor = TorchExtractor(my_model)
>>> emb = Embeddings(extractor=extractor, batch_size=32)
>>> _ = emb.bind(train_dataset)
>>> embeddings = emb()

compute(force=False)¶

Compute and cache all embeddings.

Forces evaluation of all lazy embeddings, storing them in memory or memmap according to the configured storage strategy. Progress updates are reported via the progress_callback if configured.

Parameters:¶

force : bool, default False¶: If True, recomputes all embeddings even if already cached. If False, only computes uncached embeddings.

Returns:¶

Returns self for method chaining.

Return type:¶

Embeddings
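The effect of force can be illustrated with a toy model of the documented behavior. This sketch is not DataEval code: the class ToyEmbeddings, its dictionary cache, and the extract_calls counter are assumptions introduced for illustration only.

```python
class ToyEmbeddings:
    """Toy model of the documented compute(force=...) behavior."""

    def __init__(self, items, extract):
        self._items = list(items)
        self._extract = extract  # hypothetical per-item feature function
        self._cache = {}
        self.extract_calls = 0

    def compute(self, force=False):
        for i, item in enumerate(self._items):
            # force=True recomputes everything; otherwise skip cached entries.
            if force or i not in self._cache:
                self._cache[i] = self._extract(item)
                self.extract_calls += 1
        return self  # returning self allows method chaining


toy = ToyEmbeddings([1, 2, 3], extract=lambda x: [x * 2.0])
toy.compute()            # computes all three items
toy.compute()            # everything is cached, so nothing is recomputed
toy.compute(force=True)  # recomputes all three items
```

In this sketch the second call performs no work, while the forced call repeats the full computation.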

new(dataset)¶

Create a new Embeddings instance with a different dataset.

Generate a new Embeddings object using the same extractor and configuration but with a different dataset.

Parameters:¶

dataset : ImageClassificationDataset or ObjectDetectionDataset¶: Dataset that provides images for the new Embeddings instance.

Returns:¶

New Embeddings object configured identically to the current instance.

Return type:¶

Embeddings

Raises:¶

ValueError – When called on an embeddings-only instance that lacks an extractor.

save(path=None)¶

Compute all embeddings and save to disk.

Forces computation of all embeddings if not already computed, then saves to the specified file path. Progress updates are reported via the progress_callback if configured during computation.

Parameters:¶

path : Path, str, or None, default None¶: File path where embeddings will be saved. When None, uses the configured path from initialization. Raises ValueError if no path is available.

Raises:¶

ValueError – When no path is specified and the instance has no configured path.
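The path fallback described above can be sketched as follows. The helper resolve_save_path and its argument names are hypothetical and not part of DataEval; the sketch only mirrors the documented order of precedence (explicit argument first, then the path configured at initialization, otherwise ValueError).

```python
from pathlib import Path


def resolve_save_path(path=None, configured_path=None):
    """Hypothetical helper mirroring the documented path fallback."""
    # An explicit argument takes precedence over the configured path.
    chosen = path if path is not None else configured_path
    if chosen is None:
        raise ValueError("No path specified and no path configured at initialization.")
    return Path(chosen)


explicit = resolve_save_path("run2.npy", configured_path="run1.npy")
fallback = resolve_save_path(None, configured_path="run1.npy")
```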

property batch_size : int¶

Return the batch size used for embedding computation.

Returns:¶: Number of samples processed per batch.
Return type:¶: int

property is_bound : bool¶

Whether this instance is bound to a dataset.

Returns:¶: True if a dataset is bound, False otherwise.
Return type:¶: bool

property ndim : int¶

Number of dimensions of the array.

property shape : tuple[int, ...]¶

Shape of the array.