dataeval.data.Embeddings

class dataeval.data.Embeddings(dataset, batch_size, transforms=None, model=None, device=None, cache=False, verbose=False)

Collection of image embeddings from a dataset.

Embeddings are accessed by index or slice and are loaded on-demand.
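
The on-demand behavior can be pictured with a small stand-in class. This is a minimal sketch, not DataEval's implementation: `LazyEmbeddings`, `encode`, and `encode_calls` are hypothetical names, and the "model" is a toy function on plain lists. It only illustrates that an embedding is computed when it is first requested by index or slice, and is reused afterwards when in-memory caching is enabled.

```python
from typing import Callable, Dict, List, Optional, Sequence, Union

Vector = List[float]


class LazyEmbeddings:
    """Hypothetical stand-in: computes embeddings only when they are indexed."""

    def __init__(
        self,
        images: Sequence[Vector],
        encode: Callable[[Vector], Vector],
        cache: bool = False,
    ) -> None:
        self._images = images
        self._encode = encode
        # An empty dict enables in-memory caching; None disables it.
        self._cache: Optional[Dict[int, Vector]] = {} if cache else None
        self.encode_calls = 0  # counts how often the toy model actually runs

    def __len__(self) -> int:
        return len(self._images)

    def _get_one(self, index: int) -> Vector:
        if self._cache is not None and index in self._cache:
            return self._cache[index]
        self.encode_calls += 1
        embedding = self._encode(self._images[index])
        if self._cache is not None:
            self._cache[index] = embedding
        return embedding

    def __getitem__(self, key: Union[int, slice]):
        if isinstance(key, slice):
            return [self._get_one(i) for i in range(*key.indices(len(self)))]
        return self._get_one(key)


# Ten toy "images", each a two-element vector [i, i + 1].
images = [[float(i), float(i + 1)] for i in range(10)]
lazy = LazyEmbeddings(images, encode=lambda img: [sum(img)], cache=True)

first = lazy[2]     # encodes item 2 only
window = lazy[0:4]  # encodes items 0, 1, and 3; item 2 comes from the cache
```

With `cache=False` the same two accesses would run the toy model five times instead of four, which is the trade-off between memory use and repeated computation that the `cache` parameter controls.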

Parameters:
dataset : ImageClassificationDataset or ObjectDetectionDataset

Dataset to access original images from.

batch_size : int

Batch size to use when encoding images. Values less than 1 are set to 1 for safe processing.

transforms : Transform or Sequence[Transform] or None, default None

Image transformations to apply before encoding. When None, raw images are used without preprocessing.

model : torch.nn.Module or None, default None

Neural network model that generates embeddings from images. When None, a Flatten layer is used as a simple baseline that works with all DataEval tools and requires neither pre-trained weights nor GPU resources.

device : DeviceLike or None, default None

Hardware device for computation. When None, automatically selects DataEval’s configured device, falling back to PyTorch’s default.

cache : Path, str, or bool, default False

When True, caches embeddings in memory for faster repeated access. When a Path or string is provided, persists embeddings to disk for reuse across sessions. The default, False, minimizes memory usage.

verbose : bool, default False

When True, displays a progress bar when encoding images. Default False reduces console output for cleaner automated workflows.
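
The `model` description above says that a Flatten layer is used when no model is given. As a rough illustration of what flattening means for the embedding size, the sketch below turns each image of shape (C, H, W) into one vector of length C * H * W. It is an assumption-laden stand-in, not the library's code: `flatten_image` is a hypothetical helper and nested Python lists stand in for image tensors.

```python
from math import prod
from typing import Any, List


def flatten_image(image: Any) -> List[float]:
    """Recursively flatten a nested-list image into a single flat vector."""
    if isinstance(image, (int, float)):
        return [float(image)]
    flat: List[float] = []
    for part in image:
        flat.extend(flatten_image(part))
    return flat


# Two toy "images" with shape (channels=3, height=2, width=2).
channels, height, width = 3, 2, 2
batch = [
    [
        [[float(n + c + h + w) for w in range(width)] for h in range(height)]
        for c in range(channels)
    ]
    for n in range(2)
]

# One flat vector per image; its length is the product of the image dimensions.
embeddings = [flatten_image(image) for image in batch]
embedding_dim = prod((channels, height, width))
```

Because the flattened length grows with image resolution, this baseline can produce very wide embeddings for large images; a trained model usually yields a much smaller `embedding_dim`.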

batch_size

Number of images processed per batch during encoding. Minimum value of 1.

Type:

int

cache

Disk path where embeddings are stored, or True when cached in memory.

Type:

Path or bool

Return type:

pathlib.Path | bool

device

Hardware device used for tensor computations.

Type:

torch.device

verbose

Whether progress information is displayed during operations.

Type:

bool

classmethod from_array(array, device=None)

Create an Embeddings instance from an existing image array.

Parameters:
array : ArrayLike

In-memory image data to wrap in an Embeddings object.

device : DeviceLike or None, default None

Hardware device for computation. When None, automatically selects DataEval’s configured device, falling back to PyTorch’s default.

Return type:

Embeddings

Example

>>> import numpy as np
>>> from dataeval.data import Embeddings
>>> array = np.random.randn(100, 3, 224, 224)
>>> embeddings = Embeddings.from_array(array)
>>> print(embeddings.to_tensor().shape)
torch.Size([100, 3, 224, 224])

classmethod load(path)

Load embeddings from disk.

Create an Embeddings instance from previously saved embedding data.

Parameters:
path : Path or str

File path to load embeddings from.

Returns:

Embeddings-only instance containing the loaded data.

Return type:

Embeddings

Raises:
  • FileNotFoundError – When the specified file path does not exist.

  • Exception – When file loading or parsing fails.

new(dataset)

Create a new Embeddings instance with a different dataset.

Generate a new Embeddings object using the same model, transforms, and configuration but with a different dataset.

Parameters:
dataset : ImageClassificationDataset or ObjectDetectionDataset

Dataset that provides images for the new Embeddings instance.

Returns:

New Embeddings object configured identically to the current instance.

Return type:

Embeddings

Raises:

ValueError – When called on an embeddings-only instance that lacks a model.

save(path)

Save embeddings to disk.

Persist current embeddings to the specified file path for later loading and reuse.

Parameters:
path : Path or str

File path where embeddings will be saved.

Return type:

None

to_numpy(indices=None)

Convert dataset items to embedding array.

Parameters:
indices : Sequence[int] or None, default None

Dataset indices to convert to embeddings. When None, processes entire dataset.

Returns:

Embedding array with shape (n_samples, embedding_dim).

Return type:

NDArray[Any]

Warning

Processing large datasets can be memory and compute intensive.

to_tensor(indices=None)

Convert dataset items to embedding tensor.

Process specified dataset indices through the model in batches and return concatenated embeddings as a single tensor.

Parameters:
indices : Sequence[int] or None, default None

Dataset indices to convert to embeddings. When None, processes entire dataset.

Returns:

Concatenated embeddings with shape (n_samples, embedding_dim).

Return type:

torch.Tensor

Warning

Processing large datasets can be memory and compute intensive.
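
The batching described for `to_tensor` can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: `encode_in_batches` and `toy_model` are hypothetical names, plain lists stand in for tensors, and the clamp to a minimum batch size of 1 mirrors the `batch_size` parameter description.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def encode_in_batches(
    items: Sequence[Vector],
    model: Callable[[List[Vector]], List[Vector]],
    batch_size: int,
    indices: Optional[Sequence[int]] = None,
) -> List[Vector]:
    """Run the selected items through `model` batch by batch and concatenate."""
    batch_size = max(1, batch_size)  # values below 1 fall back to 1
    # indices=None means "process the entire dataset".
    selected = list(range(len(items))) if indices is None else list(indices)

    outputs: List[Vector] = []
    for start in range(0, len(selected), batch_size):
        batch = [items[i] for i in selected[start:start + batch_size]]
        outputs.extend(model(batch))  # concatenate along the sample axis
    return outputs


def toy_model(batch: List[Vector]) -> List[Vector]:
    """Toy encoder: maps each input vector to a 2-element embedding."""
    return [[sum(x), max(x)] for x in batch]


# Five toy inputs of the form [i, 2 * i].
items = [[float(i), float(2 * i)] for i in range(5)]

all_embeddings = encode_in_batches(items, toy_model, batch_size=2)
subset = encode_in_batches(items, toy_model, batch_size=0, indices=[4, 1])
```

Only one batch of inputs is passed to the model at a time, but the concatenated result still holds every requested embedding, which is why the warning above applies to large datasets regardless of batch size.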