dataeval.data.Embeddings

class dataeval.data.Embeddings(dataset, batch_size, transforms=None, model=None, layer_name=None, use_output=True, device=None, path=None, memory_threshold=0.8, verbose=False)

Collection of image embeddings from a dataset.

Embeddings are accessed by index or slice and are loaded on-demand. For large datasets, embeddings are automatically memory-mapped to disk to avoid exceeding available memory.

Parameters:
dataset : ImageClassificationDataset or ObjectDetectionDataset

Dataset to access original images from.

batch_size : int

Batch size to use when encoding images. Values less than 1 are automatically set to 1.

transforms : Transform or Sequence[Transform] or None, default None

Image transformations to apply before encoding. When None, uses raw images without preprocessing.

model : EmbeddingModel or None, default None

A model, such as a PyTorch neural network, that generates embeddings from images. When None, uses a Flatten layer as a simple baseline that is compatible with all DataEval tools and requires no pre-trained weights or GPU resources.

layer_name : str or None, default None

Network layer from which to extract embeddings. When None, uses the model output. If specified, extracts either the input or output tensors of this layer, depending on the value of use_output.

use_output : bool, default True

Where to capture intermediate tensors relative to layer_name. If True, captures the output tensors from layer_name. If False, captures the input tensors to layer_name. Ignored if layer_name is None.

device : DeviceLike or None, default None

Hardware device for computation. When None, automatically selects DataEval’s configured device, falling back to PyTorch’s default.

path : Path, str, or None, default None

File path for memory-mapped storage. When None, caches embeddings in memory only. When a Path or string is provided, switches to memory-mapped storage automatically if the estimated embedding size exceeds memory_threshold.

memory_threshold : float, default 0.8

Fraction of available memory (0-1) that triggers memory-mapped storage. When estimated embedding size exceeds this threshold, uses disk-backed memmap instead of in-memory arrays. Only applies when path is provided.

verbose : bool, default False

When True, displays a progress bar when encoding images. Default False reduces console output for cleaner automated workflows.
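The layer_name and use_output parameters above can be illustrated with a PyTorch forward hook. This is only a sketch of the general technique, not necessarily how DataEval implements it; the helper name capture_layer and the toy model are assumptions made for illustration.

```python
import torch
from torch import nn


def capture_layer(model, layer_name, use_output=True):
    """Hypothetical helper: record the input or output of a named layer."""
    captured = {}

    def hook(module, inputs, output):
        # inputs is a tuple of positional arguments passed to the layer
        captured["value"] = output if use_output else inputs[0]

    handle = model.get_submodule(layer_name).register_forward_hook(hook)
    return captured, handle


# Toy model; submodules of nn.Sequential are named "0", "1", "2", "3"
model = nn.Sequential(nn.Flatten(), nn.Linear(12, 8), nn.ReLU(), nn.Linear(8, 4))
images = torch.zeros(2, 3, 2, 2)

# Capture the tensors entering layer "2" (the ReLU), i.e. use_output=False
captured, handle = capture_layer(model, "2", use_output=False)
with torch.no_grad():
    final = model(images)
handle.remove()  # detach the hook once the forward pass is done

layer_input = captured["value"]
```

Here layer_input holds the 8-dimensional activations feeding the ReLU, while final holds the 4-dimensional model output that would be used when layer_name is None.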

batch_size

Number of images processed per batch during encoding. Minimum value of 1.

Type:

int

device

Hardware device used for tensor computations.

Type:

torch.device

memory_threshold

Fraction of available memory (0-1) that triggers memory-mapped storage.

Type:

float

verbose

Whether progress information is displayed during operations.

Type:

bool
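The interaction between path and memory_threshold can be sketched as a simple size check. The function name should_use_memmap, the float32 default, and the size estimate (samples × dimension × item size) are assumptions for illustration; the library's actual estimate may differ.

```python
import numpy as np


def should_use_memmap(n_samples, embedding_dim, available_bytes,
                      memory_threshold=0.8, path=None, dtype=np.float32):
    """Hypothetical decision rule for choosing disk-backed storage."""
    if path is None:
        # Without a path there is nowhere to place a memmap file
        return False
    estimated_bytes = n_samples * embedding_dim * np.dtype(dtype).itemsize
    return estimated_bytes > memory_threshold * available_bytes


two_gib = 2 * 1024**3

# About 2.05 GB of float32 embeddings against 2 GiB of available memory
large_with_path = should_use_memmap(1_000_000, 512, two_gib, path="emb.npy")
large_no_path = should_use_memmap(1_000_000, 512, two_gib, path=None)
small_with_path = should_use_memmap(100_000, 512, two_gib, path="emb.npy")
```

Under these assumptions only the large dataset with a configured path would be memory-mapped.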

compute(force=False)

Compute and cache all embeddings.

Forces evaluation of all lazy embeddings, storing them in memory or memmap according to the configured storage strategy.

Parameters:
force : bool, default False

If True, recomputes all embeddings even if already cached. If False, only computes uncached embeddings.

Returns:

Returns self for method chaining.

Return type:

Embeddings
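The lazy evaluation and force semantics described for compute can be sketched with a toy cache in plain Python. The class LazyCache and its encode_calls counter are hypothetical and exist only to make the behavior observable; they are not part of DataEval.

```python
class LazyCache:
    """Toy stand-in for lazily computed, cached embeddings."""

    def __init__(self, items, encode):
        self._items = list(items)
        self._encode = encode
        self._cache = {}
        self.encode_calls = 0  # counts how often encode actually runs

    def __getitem__(self, index):
        # Encode on first access only, then serve from the cache
        if index not in self._cache:
            self._cache[index] = self._encode(self._items[index])
            self.encode_calls += 1
        return self._cache[index]

    def compute(self, force=False):
        if force:
            self._cache.clear()  # discard cached values so all are recomputed
        for index in range(len(self._items)):
            self[index]
        return self  # returning self allows method chaining


cache = LazyCache([1, 2, 3], encode=lambda x: x * 10)
first = cache[0]                              # encodes item 0 only
calls_after_compute = cache.compute().encode_calls
calls_after_force = cache.compute(force=True).encode_calls
```

In this sketch, compute() only fills in the two uncached items, while compute(force=True) re-encodes all three.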

classmethod from_array(array)

Create Embeddings instance from an existing array.

Parameters:
array : ArrayLike

Array data to wrap in an Embeddings object. Can be a numpy array, a memmap, or the result of np.load(). Memmap arrays are preserved as-is.

Returns:

Embeddings-only instance containing the provided data.

Return type:

Embeddings

Example

>>> import tempfile
>>> from pathlib import Path
>>> import numpy as np
>>> from dataeval.data import Embeddings
>>> # From in-memory array
>>> array = np.random.randn(100, 512)
>>> embeddings = Embeddings.from_array(array)
>>> # From saved file (preserves memmap)
>>> tmp_file = Path(tempfile.mkdtemp()) / "embeddings.npy"
>>> np.save(tmp_file, array)
>>> loaded = np.load(tmp_file, mmap_mode="r")
>>> embeddings = Embeddings.from_array(loaded)
>>> print(embeddings.shape)
(100, 512)

classmethod load(path, mmap_mode=None)

Load embeddings from a saved .npy file.

Parameters:
path : Path or str

File path to the saved .npy file containing embeddings.

mmap_mode : str or None, default None

Mode for memory-mapping the file. When None, loads the entire array into memory as an ndarray. When specified, uses memory-mapping, which is more efficient for large files. Valid modes are:

- 'r': Open existing file for reading only
- 'r+': Open existing file for reading and writing
- 'w+': Open existing file and overwrite
- 'c': Copy-on-write mode without updating file

See numpy.load documentation for more details.

Returns:

Embeddings-only instance containing the loaded data.

Return type:

Embeddings
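The effect of mmap_mode can be seen with plain NumPy, since the modes are those of numpy.load. The sketch below does not call DataEval at all; the file name and array shape are arbitrary choices for illustration.

```python
import tempfile
from pathlib import Path

import numpy as np

tmp_file = Path(tempfile.mkdtemp()) / "embeddings.npy"
np.save(tmp_file, np.zeros((4, 3), dtype=np.float32))

# mmap_mode=None (default): the whole array is read into memory
in_memory = np.load(tmp_file)

# mmap_mode="r": disk-backed and read-only
read_only = np.load(tmp_file, mmap_mode="r")

# mmap_mode="c": copy-on-write, changes stay in memory and never reach the file
copy_on_write = np.load(tmp_file, mmap_mode="c")
copy_on_write[0, 0] = 1.0

# Reload from disk to confirm the file itself is unchanged
on_disk = np.load(tmp_file)
```

Read-only mode is usually the safest choice when the embeddings are only consumed, since accidental writes raise an error instead of modifying the file.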

Example

>>> import tempfile
>>> from pathlib import Path
>>> import numpy as np
>>> from dataeval.data import Embeddings
>>> # Save some embeddings
>>> array = np.random.randn(100, 512)
>>> tmp_file = Path(tempfile.mkdtemp()) / "embeddings.npy"
>>> np.save(tmp_file, array)
>>> # Load as in-memory array
>>> embeddings = Embeddings.load(tmp_file)
>>> # Load as memmap for large files
>>> embeddings_mmap = Embeddings.load(tmp_file, mmap_mode="r")
>>> print(embeddings.shape)
(100, 512)

new(dataset)

Create new Embeddings instance with a different dataset.

Generate a new Embeddings object using the same model, transforms, and configuration but with a different dataset.

Parameters:
dataset : ImageClassificationDataset or ObjectDetectionDataset

Dataset that provides images for the new Embeddings instance.

Returns:

New Embeddings object configured identically to the current instance.

Return type:

Embeddings

Raises:

ValueError – When called on an embeddings-only instance that lacks a model.

save(path=None)

Compute all embeddings and save to disk.

Forces computation of all embeddings if not already computed, then saves to the specified file path.

Parameters:
path : Path, str, or None, default None

File path where embeddings will be saved. When None, uses the configured path from initialization. Raises ValueError if no path is available.

Raises:

ValueError – When no path is specified and the instance has no configured path.

Return type:

None
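The path fallback described for save can be sketched as follows. The helper resolve_save_path is hypothetical, and writing with np.save is an assumption based on the .npy format mentioned for load; the library's internals may differ.

```python
import tempfile
from pathlib import Path

import numpy as np


def resolve_save_path(path, configured_path):
    """Hypothetical helper: prefer the explicit path, else the configured one."""
    target = path if path is not None else configured_path
    if target is None:
        raise ValueError("No path provided and no configured path available.")
    return Path(target)


array = np.arange(6, dtype=np.float32).reshape(2, 3)
configured = Path(tempfile.mkdtemp()) / "embeddings.npy"

# path=None falls back to the path configured at initialization
np.save(resolve_save_path(None, configured), array)
restored = np.load(configured)

# With neither path available, a ValueError is raised
try:
    resolve_save_path(None, None)
    raised = False
except ValueError:
    raised = True
```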

to_tensor(indices=None)

Convert embeddings to PyTorch tensor.

Process specified dataset indices through the model in batches and return concatenated embeddings as a single tensor on the configured device.

Parameters:
indices : Sequence[int] or None, default None

Dataset indices to convert to embeddings. When None, processes entire dataset.

Returns:

Concatenated embeddings with shape (n_samples, embedding_dim) on configured device.

Return type:

torch.Tensor

Warning

Processing large datasets can be memory and compute intensive. Consider using numpy arrays via __getitem__ for memory efficiency.
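The batching and concatenation described for to_tensor can be sketched with NumPy standing in for PyTorch tensors. The function encode_in_batches and the flatten encoder are assumptions for illustration; flatten mirrors the baseline used when no model is given.

```python
import numpy as np


def encode_in_batches(images, indices, batch_size, encode):
    """Hypothetical helper: encode selected images batch by batch."""
    batch_size = max(1, batch_size)  # values below 1 are treated as 1
    chunks = []
    for start in range(0, len(indices), batch_size):
        batch_indices = indices[start:start + batch_size]
        batch = np.stack([images[i] for i in batch_indices])
        chunks.append(encode(batch))
    # Join the per-batch results into one (n_samples, embedding_dim) array
    return np.concatenate(chunks, axis=0)


def flatten(batch):
    # Baseline encoder: flatten each image into a vector
    return batch.reshape(len(batch), -1)


images = np.arange(5 * 2 * 2, dtype=np.float32).reshape(5, 2, 2)
result = encode_in_batches(images, [0, 2, 4], batch_size=2, encode=flatten)
```

Only one batch is held by the encoder at a time, but the concatenated result still occupies memory for every requested sample, which is the cost the warning above refers to.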