dataeval.data.Embeddings¶

class dataeval.data.Embeddings(dataset, batch_size, transforms=None, model=None, device=None, cache=False, verbose=False)¶

Collection of image embeddings from a dataset.

Embeddings are accessed by index or slice and are loaded on-demand.
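The on-demand access pattern described above can be sketched in plain Python. This is a toy stand-in, not DataEval's implementation: the `LazyEmbeddings` class, its `encode` callable, and the `encode_calls` counter are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, List, Optional, Sequence, Union

Vector = List[float]


class LazyEmbeddings:
    """Toy stand-in (NOT DataEval's class): items are encoded only when indexed."""

    def __init__(
        self,
        items: Sequence[float],
        encode: Callable[[float], Vector],
        cache: bool = False,
    ) -> None:
        self._items = items
        self._encode = encode
        # In-memory cache keyed by item index; None disables caching.
        self._cache: Optional[Dict[int, Vector]] = {} if cache else None
        self.encode_calls = 0  # counts how often the encoder actually runs

    def __len__(self) -> int:
        return len(self._items)

    def _embed(self, index: int) -> Vector:
        if self._cache is not None and index in self._cache:
            return self._cache[index]
        self.encode_calls += 1
        vector = self._encode(self._items[index])
        if self._cache is not None:
            self._cache[index] = vector
        return vector

    def __getitem__(self, key: Union[int, slice]) -> Union[Vector, List[Vector]]:
        if isinstance(key, slice):
            # slice.indices() resolves negative and open-ended bounds.
            return [self._embed(i) for i in range(*key.indices(len(self)))]
        if key < 0:
            key += len(self)
        if not 0 <= key < len(self):
            raise IndexError("index out of range")
        return self._embed(key)
```

Nothing is encoded at construction time; indexing a single item or a slice triggers encoding only for the requested positions, and with `cache=True` a repeated access reuses the stored vector.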

Parameters:¶

dataset : ImageClassificationDataset or ObjectDetectionDataset¶: Dataset to access original images from.
batch_size : int¶: Batch size to use when encoding images. Values less than 1 are automatically set to 1.
transforms : Transform or Sequence[Transform] or None, default None¶: Image transformations to apply before encoding. When None, uses raw images without preprocessing.
model : torch.nn.Module or None, default None¶: Neural network model that generates embeddings from images. When None, uses a Flatten layer as a simple baseline that is compatible with all DataEval tools and requires no pre-trained weights or GPU resources.
device : DeviceLike or None, default None¶: Hardware device for computation. When None, automatically selects DataEval’s configured device, falling back to PyTorch’s default.
cache : Path, str, or bool, default False¶: When True, caches embeddings in memory for faster repeated access. When a Path or string is provided, persists embeddings to disk for reuse across sessions. The default of False minimizes memory usage.
verbose : bool, default False¶: When True, displays a progress bar when encoding images. Default False reduces console output for cleaner automated workflows.
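To make the `model=None` baseline concrete, the sketch below mimics what a Flatten layer does to each image: every dimension after the batch dimension is collapsed into one vector. It uses nested Python lists instead of tensors so it runs without PyTorch; the `flatten` and `flatten_batch` helpers are hypothetical and are not part of DataEval.

```python
from typing import List


def flatten(image) -> List[float]:
    """Recursively collapse a nested list (e.g. channels x height x width) into one vector."""
    if isinstance(image, (int, float)):
        return [float(image)]
    flat: List[float] = []
    for part in image:
        flat.extend(flatten(part))
    return flat


def flatten_batch(images) -> List[List[float]]:
    """Flatten each image separately, keeping the batch dimension intact."""
    return [flatten(image) for image in images]


# Two toy "images" with 2 channels of 2x2 pixels each.
images = [
    [[[1, 2], [3, 4]], [[5, 6], [7, 8]]],
    [[[0, 0], [0, 0]], [[1, 1], [1, 1]]],
]
vectors = flatten_batch(images)
print(len(vectors), len(vectors[0]))  # batch size, then 2 * 2 * 2 values per image
```

With this baseline the embedding dimension equals the number of values in one image, so large images yield very long vectors.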

batch_size¶

Number of images processed per batch during encoding. Minimum value of 1.

Type:¶: int
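The minimum-of-1 rule can be illustrated with a small batching helper. This is a sketch of the documented behavior under the assumption that the clamp is a simple `max(1, batch_size)`; `iter_batches` is a hypothetical name, not a DataEval function.

```python
from typing import Iterator, List, Sequence


def iter_batches(indices: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of indices, clamping batch_size to at least 1."""
    size = max(1, batch_size)  # values below 1 fall back to 1
    for start in range(0, len(indices), size):
        yield list(indices[start:start + size])


print(list(iter_batches(range(5), 2)))
print(list(iter_batches(range(3), 0)))
```

A batch size of 0 or a negative value therefore degrades to one image per batch instead of failing.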

cache¶

Disk path where embeddings are stored, or True when cached in memory.

Type:¶: Path or bool
Return type:¶: pathlib.Path | bool
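The constructor accepts a Path, str, or bool for `cache`, while this attribute is typed as Path or bool. One plausible reading, shown here only as an assumption, is that string paths are converted to `pathlib.Path` and booleans pass through unchanged. `normalize_cache` is a hypothetical helper, not DataEval code.

```python
from pathlib import Path
from typing import Union


def normalize_cache(cache: Union[Path, str, bool]) -> Union[Path, bool]:
    """Map the constructor argument onto the documented attribute type (assumed behavior)."""
    if isinstance(cache, bool):
        return cache  # True: in-memory cache, False: no caching
    return Path(cache)  # str or Path: on-disk cache location


print(normalize_cache(True))
print(normalize_cache("embeddings_cache"))
```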

device¶

Hardware device used for tensor computations.

Type:¶: torch.device

verbose¶

Whether progress information is displayed during operations.

Type:¶: bool

classmethod from_array(array, device=None)¶

Create an Embeddings instance from an existing image array.

Parameters:¶

array : ArrayLike¶: In-memory image data to wrap in an Embeddings object.
device : DeviceLike or None, default None¶: Hardware device for computation. When None, automatically selects DataEval’s configured device, falling back to PyTorch’s default.

Return type:¶

Embeddings

Example

>>> import numpy as np
>>> from dataeval.data import Embeddings
>>> array = np.random.randn(100, 3, 224, 224)
>>> embeddings = Embeddings.from_array(array)
>>> print(embeddings.to_tensor().shape)
torch.Size([100, 3, 224, 224])

classmethod load(path)¶

Load embeddings from disk.

Create an Embeddings instance from previously saved embedding data.

Parameters:¶

path : Path or str¶: File path to load embeddings from.

Returns:¶

Embeddings-only instance containing the loaded data.

Return type:¶

Embeddings

Raises:¶

FileNotFoundError – When the specified file path does not exist.
Exception – When file loading or parsing fails.
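Because a missing file is reported with `FileNotFoundError` while other failures raise a different exception, callers can treat "no saved embeddings yet" as a normal case and let real loading errors surface. The wrapper below is a sketch: `load_if_present` and its `loader` parameter are hypothetical, and the loader is passed in so the example runs without DataEval installed.

```python
from pathlib import Path
from typing import Callable, Optional, TypeVar, Union

T = TypeVar("T")


def load_if_present(path: Union[Path, str], loader: Callable[[Path], T]) -> Optional[T]:
    """Return loader(path), or None when the file does not exist.

    Any other exception (e.g. a parsing failure) is deliberately not caught.
    """
    try:
        return loader(Path(path))
    except FileNotFoundError:
        return None
```

With DataEval available, `Embeddings.load` as documented above would be the natural `loader` argument; a `None` result then signals that the embeddings still need to be computed and saved.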

new(dataset)¶

Create a new Embeddings instance with a different dataset.

Generate a new Embeddings object using the same model, transforms, and configuration but with a different dataset.

Parameters:¶

dataset : ImageClassificationDataset or ObjectDetectionDataset¶: Dataset that provides images for the new Embeddings instance.

Returns:¶

New Embeddings object configured identically to the current instance.

Return type:¶

Embeddings

Raises:¶

ValueError – When called on an embeddings-only instance that lacks a model.

save(path)¶

Save embeddings to disk.

Persist current embeddings to the specified file path for later loading and reuse.

Parameters:¶

path : Path or str¶: File path where embeddings will be saved.

Return type:¶

None

to_numpy(indices=None)¶

Convert dataset items to an embedding array.

Parameters:¶

indices : Sequence[int] or None, default None¶: Dataset indices to convert to embeddings. When None, processes the entire dataset.

Returns:¶

Embedding array with shape (n_samples, embedding_dim).

Return type:¶

NDArray[Any]

Warning

Processing large datasets can be memory and compute intensive.

to_tensor(indices=None)¶

Convert dataset items to an embedding tensor.

Process specified dataset indices through the model in batches and return concatenated embeddings as a single tensor.

Parameters:¶

indices : Sequence[int] or None, default None¶: Dataset indices to convert to embeddings. When None, processes the entire dataset.

Returns:¶

Concatenated embeddings with shape (n_samples, embedding_dim).

Return type:¶

torch.Tensor

Warning

Processing large datasets can be memory and compute intensive.
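The flow described for `to_tensor` (select indices, run the model batch by batch, concatenate the results) can be sketched without PyTorch. This is an illustration of the documented steps, not the library's code: `embed_indices`, the list-based `dataset`, and the batch-level `model` callable are hypothetical stand-ins.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def embed_indices(
    dataset: Sequence[float],
    model: Callable[[List[float]], List[Vector]],
    batch_size: int,
    indices: Optional[Sequence[int]] = None,
) -> List[Vector]:
    """Encode the selected items in batches and concatenate the per-batch outputs."""
    if indices is None:
        indices = range(len(dataset))  # None means the entire dataset
    size = max(1, batch_size)
    embeddings: List[Vector] = []
    for start in range(0, len(indices), size):
        batch = [dataset[i] for i in indices[start:start + size]]
        embeddings.extend(model(batch))  # one vector per item in the batch
    return embeddings


dataset = [1.0, 2.0, 3.0, 4.0, 5.0]
model = lambda batch: [[x, -x] for x in batch]  # toy encoder with embedding_dim = 2
everything = embed_indices(dataset, model, batch_size=2)
subset = embed_indices(dataset, model, batch_size=2, indices=[4, 0])
print(len(everything), len(everything[0]))  # n_samples, embedding_dim
print(subset)
```

The output keeps the order of the requested indices, and its first dimension equals the number of indices, which is why encoding a whole large dataset at once can be costly in memory and compute.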