dataeval.data.Embeddings¶

class dataeval.data.Embeddings(dataset, batch_size, transforms=None, model=None, layer_name=None, use_output=True, device=None, path=None, memory_threshold=0.8, verbose=False)¶

Collection of image embeddings from a dataset.

Embeddings are accessed by index or slice and are computed on demand. For large datasets, embeddings are automatically memory-mapped to disk to avoid exceeding available memory.

Parameters:¶

dataset : ImageClassificationDataset or ObjectDetectionDataset¶: Dataset to access original images from.
batch_size : int¶: Batch size to use when encoding images. Values less than 1 are automatically set to 1.
transforms : Transform or Sequence[Transform] or None, default None¶: Image transformations to apply before encoding. When None, uses raw images without preprocessing.
model : EmbeddingModel or None, default None¶: A model, such as a PyTorch neural network, that generates embeddings from images. When None, uses a Flatten layer as a simple baseline that works with all DataEval tools without requiring pre-trained weights or GPU resources.
layer_name : str or None, default None¶: Network layer from which to extract embeddings. When None, uses the model output. If specified, extracts either the input or output tensors of this layer, depending on the value of use_output.
use_output : bool, default True¶: Which side of layer_name to capture. If True, captures the output tensors from layer_name. If False, captures the input tensors to layer_name. Ignored if layer_name is None.
device : DeviceLike or None, default None¶: Hardware device for computation. When None, automatically selects DataEval’s configured device, falling back to PyTorch’s default.
path : Path, str, or None, default None¶: File path for memory-mapped storage. When None, caches embeddings in memory only. When a Path or string is provided, uses memory-mapped storage if the estimated embedding size exceeds memory_threshold.
memory_threshold : float, default 0.8¶: Fraction of available memory (0-1) that triggers memory-mapped storage. When estimated embedding size exceeds this threshold, uses disk-backed memmap instead of in-memory arrays. Only applies when path is provided.
verbose : bool, default False¶: When True, displays a progress bar when encoding images. Default False reduces console output for cleaner automated workflows.
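
As a rough illustration of the default `model=None` behaviour described above, the sketch below mimics a flatten baseline with NumPy: each image is reshaped into one vector, so the embedding width equals channels × height × width. The `flatten_embeddings` helper and the image shapes are assumptions made for illustration; this is not DataEval's implementation.

```python
import numpy as np


def flatten_embeddings(images: np.ndarray) -> np.ndarray:
    """Hypothetical helper: turn a batch of images (N, C, H, W) into (N, C*H*W)."""
    # Keep the batch axis and collapse every remaining axis into one vector.
    return images.reshape(images.shape[0], -1)


# Eight fake RGB images of 16x16 pixels (assumed shape, for illustration only).
images = np.zeros((8, 3, 16, 16), dtype=np.float32)
embeddings = flatten_embeddings(images)
print(embeddings.shape)  # (8, 768)
```

A baseline like this needs no trained weights, which is why it can serve as a fallback, but the resulting vectors grow quickly with image resolution.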

batch_size¶

Number of images processed per batch during encoding. Minimum value of 1.

Type:¶: int

device¶

Hardware device used for tensor computations.

Type:¶: torch.device

memory_threshold¶

Fraction of available memory (0-1) that triggers memory-mapped storage.

Type:¶: float
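
The threshold rule can be sketched as a simple size comparison. The helper below is hypothetical: its name, its arguments, and the way available memory is supplied are assumptions, and the library may estimate sizes differently.

```python
def exceeds_memory_threshold(
    n_samples: int,
    embedding_dim: int,
    bytes_per_value: int,
    available_bytes: int,
    memory_threshold: float = 0.8,
) -> bool:
    """Hypothetical check: is the estimated array larger than the allowed fraction?"""
    estimated_bytes = n_samples * embedding_dim * bytes_per_value
    return estimated_bytes > memory_threshold * available_bytes


available = 2 * 1024**3  # pretend 2 GiB of memory are free

# 1,000,000 float32 embeddings of width 512 need about 2.05 GB.
print(exceeds_memory_threshold(1_000_000, 512, 4, available))  # True
# 10,000 such embeddings need about 20 MB.
print(exceeds_memory_threshold(10_000, 512, 4, available))  # False
```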

verbose¶

Whether progress information is displayed during operations.

Type:¶: bool

compute(force=False)¶

Compute and cache all embeddings.

Forces evaluation of all lazy embeddings, storing them in memory or memmap according to the configured storage strategy.

Parameters:¶

force : bool, default False¶: If True, recomputes all embeddings even if already cached. If False, only computes uncached embeddings.

Returns:¶

Returns self for method chaining.

Return type:¶

Embeddings
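
The difference between `force=False` and `force=True`, and the effect of returning self, can be shown with a toy cache. `LazyRows` is a stand-in written for this illustration, not DataEval's class, and its placeholder `_encode` method takes the place of a real model.

```python
import numpy as np


class LazyRows:
    """Toy stand-in (not DataEval's class) for a lazily filled embedding cache."""

    def __init__(self, n_rows: int, dim: int) -> None:
        self._data = np.zeros((n_rows, dim), dtype=np.float32)
        self._cached = set()
        self.encode_calls = 0  # counts how many rows were actually encoded

    def _encode(self, index: int) -> np.ndarray:
        # Placeholder for running a model on one image.
        self.encode_calls += 1
        return np.full(self._data.shape[1], index, dtype=np.float32)

    def __getitem__(self, index: int) -> np.ndarray:
        # Compute on first access only, then serve from the cache.
        if index not in self._cached:
            self._data[index] = self._encode(index)
            self._cached.add(index)
        return self._data[index]

    def compute(self, force: bool = False) -> "LazyRows":
        if force:
            self._cached.clear()  # drop the cache so every row is recomputed
        for index in range(len(self._data)):
            self[index]
        return self  # returning self allows chaining


rows = LazyRows(n_rows=4, dim=2)
_ = rows[1]               # encodes one row on demand
rows.compute()            # encodes only the three remaining rows
print(rows.encode_calls)  # 4
rows.compute(force=True)  # re-encodes all four rows
print(rows.encode_calls)  # 8
```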

classmethod from_array(array)¶

Create Embeddings instance from an existing array.

Parameters:¶

array : ArrayLike¶: In-memory data to wrap in an Embeddings object. Can be a numpy array, memmap, or the result of np.load(). Memmap arrays are preserved as-is.

Returns:¶

Embeddings-only instance containing the provided data.

Return type:¶

Embeddings

Example

>>> import tempfile
>>> from pathlib import Path
>>> import numpy as np
>>> from dataeval.data import Embeddings
>>> # From in-memory array
>>> array = np.random.randn(100, 512)
>>> embeddings = Embeddings.from_array(array)
>>> # From saved file (preserves memmap)
>>> tmp_path = Path(tempfile.mkdtemp())  # any writable directory works
>>> tmp_file = tmp_path / "embeddings.npy"
>>> np.save(tmp_file, array)
>>> loaded = np.load(tmp_file, mmap_mode="r")
>>> embeddings = Embeddings.from_array(loaded)
>>> print(embeddings.shape)
(100, 512)

classmethod load(path, mmap_mode=None)¶

Load embeddings from a saved .npy file.

Parameters:¶

path : Path or str¶: File path to the saved .npy file containing embeddings.
mmap_mode : str or None, default None¶: Mode for memory-mapping the file. When None, loads the entire array into memory as an ndarray. When specified, uses memory-mapping which is more efficient for large files. Valid modes are:
- ‘r’: Open existing file for reading only
- ‘r+’: Open existing file for reading and writing
- ‘w+’: Open existing file and overwrite
- ‘c’: Copy-on-write mode without updating file
See numpy.load documentation for more details.

Returns:¶

Embeddings-only instance containing the loaded data.

Return type:¶

Embeddings

Example

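>>> import tempfile
>>> from pathlib import Path
>>> import numpy as np
>>> from dataeval.data import Embeddings
>>> # Save some embeddings
>>> array = np.random.randn(100, 512)
>>> tmp_path = Path(tempfile.mkdtemp())  # any writable directory works
>>> tmp_file = tmp_path / "embeddings.npy"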
>>> np.save(tmp_file, array)
>>> # Load as in-memory array
>>> embeddings = Embeddings.load(tmp_file)
>>> # Load as memmap for large files
>>> embeddings_mmap = Embeddings.load(tmp_file, mmap_mode="r")
>>> print(embeddings.shape)
(100, 512)
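
To make the mmap_mode choices more concrete, the sketch below exercises three of them directly through `numpy.load`, which the parameter description refers to. It assumes that `Embeddings.load` passes the mode through to NumPy unchanged; only NumPy calls are used here, so it shows NumPy's behaviour rather than anything specific to DataEval.

```python
import tempfile
from pathlib import Path

import numpy as np

path = Path(tempfile.mkdtemp()) / "embeddings.npy"
np.save(path, np.zeros((4, 3), dtype=np.float32))

# 'r': read-only view of the file; writes are rejected.
read_only = np.load(path, mmap_mode="r")
try:
    read_only[0, 0] = 1.0
    write_rejected = False
except ValueError:
    write_rejected = True

# 'c': copy-on-write; the change lives in memory and never reaches the file.
copy_on_write = np.load(path, mmap_mode="c")
copy_on_write[0, 0] = 2.0
unchanged_on_disk = float(np.load(path)[0, 0])

# 'r+': read-write; flushed changes are written back to the file.
read_write = np.load(path, mmap_mode="r+")
read_write[0, 0] = 3.0
read_write.flush()
updated_on_disk = float(np.load(path)[0, 0])

print(write_rejected, unchanged_on_disk, updated_on_disk)  # True 0.0 3.0
```

For embeddings that only need to be read, `"r"` is usually the safest choice because it cannot modify the saved file.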

new(dataset)¶

Create new Embeddings instance with a different dataset.

Generate a new Embeddings object using the same model, transforms, and configuration but with a different dataset.

Parameters:¶

dataset : ImageClassificationDataset or ObjectDetectionDataset¶: Dataset that provides images for the new Embeddings instance.

Returns:¶

New Embeddings object configured identically to the current instance.

Return type:¶

Embeddings

Raises:¶

ValueError – When called on an embeddings-only instance that lacks a model.

save(path=None)¶

Compute all embeddings and save to disk.

Forces computation of all embeddings if not already computed, then saves to the specified file path.

Parameters:¶

path : Path, str, or None, default None¶: File path where embeddings will be saved. When None, uses the configured path from initialization. Raises ValueError if no path is available.

Raises:¶

ValueError – When no path is specified and the instance has no configured path.

Return type:¶

None

to_tensor(indices=None)¶

Convert embeddings to PyTorch tensor.

Process specified dataset indices through the model in batches and return concatenated embeddings as a single tensor on the configured device.

Parameters:¶

indices : Sequence[int] or None, default None¶: Dataset indices to convert to embeddings. When None, processes entire dataset.

Returns:¶

Concatenated embeddings with shape (n_samples, embedding_dim) on configured device.

Return type:¶

torch.Tensor

Warning

Processing large datasets can be memory and compute intensive. Consider using numpy arrays via __getitem__ for memory efficiency.
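
The batch-and-concatenate pattern described for to_tensor can be sketched as follows. This is an illustration under stated assumptions: it uses NumPy arrays instead of PyTorch tensors, `encode_in_batches` is a hypothetical helper, and `flatten` stands in for a real model. It assumes at least one index is selected.

```python
from typing import Callable, Optional, Sequence

import numpy as np


def encode_in_batches(
    images: np.ndarray,
    encode: Callable[[np.ndarray], np.ndarray],
    batch_size: int,
    indices: Optional[Sequence[int]] = None,
) -> np.ndarray:
    """Hypothetical helper: encode the selected images batch by batch."""
    batch_size = max(1, batch_size)  # mirror the documented minimum of 1
    selected = list(range(len(images))) if indices is None else list(indices)
    outputs = []
    for start in range(0, len(selected), batch_size):
        batch_indices = selected[start : start + batch_size]
        outputs.append(encode(images[batch_indices]))
    # Stack the per-batch results into one (n_samples, embedding_dim) array.
    return np.concatenate(outputs, axis=0)


def flatten(batch: np.ndarray) -> np.ndarray:
    """Stand-in "model": one flat vector per image."""
    return batch.reshape(len(batch), -1)


images = np.arange(10 * 2 * 2, dtype=np.float32).reshape(10, 2, 2)
subset = encode_in_batches(images, flatten, batch_size=4, indices=[0, 3, 9])
everything = encode_in_batches(images, flatten, batch_size=4)
print(subset.shape, everything.shape)  # (3, 4) (10, 4)
```

Only one batch of inputs is passed to the model at a time, but the concatenated result still holds every requested embedding at once, which is the memory cost the warning above refers to.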