dataeval.data.Embeddings
- class dataeval.data.Embeddings(dataset, batch_size, transforms=None, model=None, device=None, cache=False, verbose=False)
Collection of image embeddings from a dataset.
Embeddings are accessed by index or slice and are loaded on-demand.
- Parameters:
- dataset : ImageClassificationDataset or ObjectDetectionDataset
Dataset that provides the original images.
- batch_size : int
Batch size to use when encoding images. Values less than 1 are automatically set to 1.
- transforms : Transform or Sequence[Transform] or None, default None
Image transformations to apply before encoding. When None, uses raw images without preprocessing.
- model : torch.nn.Module or None, default None
Neural network model that generates embeddings from images. When None, uses a Flatten layer as a simple baseline that is compatible with all DataEval tools and requires no pre-trained weights or GPU resources.
- device : DeviceLike or None, default None
Hardware device for computation. When None, uses DataEval’s configured device, falling back to PyTorch’s default.
- cache : Path, str, or bool, default False
When True, caches embeddings in memory for faster repeated access. When a Path or string is provided, persists embeddings to disk for reuse across sessions. The default of False minimizes memory usage.
- verbose : bool, default False
When True, displays a progress bar while encoding images. The default of False keeps console output quiet for automated workflows.
- cache
Disk path where embeddings are stored, or True when cached in memory.
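The on-demand access and in-memory caching described above can be pictured with a small pure-Python sketch. The names `LazyEmbeddings`, `encode`, and `calls` are hypothetical and exist only for illustration; this is not DataEval's implementation, and the stand-in encoder simply flattens each item to floats.

```python
from typing import Callable, Dict, List, Optional, Sequence, Union

Vector = List[float]


class LazyEmbeddings:
    """Hypothetical stand-in: computes an embedding only when it is requested."""

    def __init__(
        self,
        dataset: Sequence[Sequence[float]],
        encode: Callable[[Sequence[float]], Vector],
        cache: bool = False,
    ) -> None:
        self._dataset = dataset
        self._encode = encode
        # Assumption: cache=True keeps results in a dict keyed by dataset index.
        self._cache: Optional[Dict[int, Vector]] = {} if cache else None
        self.calls = 0  # how many times the encoder actually ran

    def __len__(self) -> int:
        return len(self._dataset)

    def _get_one(self, index: int) -> Vector:
        if self._cache is not None and index in self._cache:
            return self._cache[index]
        self.calls += 1
        embedding = self._encode(self._dataset[index])
        if self._cache is not None:
            self._cache[index] = embedding
        return embedding

    def __getitem__(self, key: Union[int, slice]):
        if isinstance(key, slice):
            return [self._get_one(i) for i in range(*key.indices(len(self)))]
        index = key + len(self) if key < 0 else key
        return self._get_one(index)


def flatten(item: Sequence[float]) -> Vector:
    """Toy encoder standing in for a model."""
    return [float(v) for v in item]


dataset = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
lazy = LazyEmbeddings(dataset, flatten, cache=True)
first_two = lazy[0:2]  # encodes items 0 and 1 only
again = lazy[0]        # served from the cache; the encoder is not called again
```

With `cache=False` the same two lookups of index 0 would run the encoder twice, which is the memory-versus-recomputation trade-off the `cache` parameter describes.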
- classmethod from_array(array, device=None)
Create Embeddings instance from an existing image array.
Example
>>> import numpy as np
>>> from dataeval.data import Embeddings
>>> array = np.random.randn(100, 3, 224, 224)
>>> embeddings = Embeddings.from_array(array)
>>> print(embeddings.to_tensor().shape)
torch.Size([100, 3, 224, 224])
- classmethod load(path)
Load embeddings from disk.
Create an Embeddings instance from previously saved embedding data.
- new(dataset)
Create new Embeddings instance with a different dataset.
Generate a new Embeddings object using the same model, transforms, and configuration but with a different dataset.
- Parameters:
- dataset : ImageClassificationDataset or ObjectDetectionDataset
Dataset that provides images for the new Embeddings instance.
- Returns:
New Embeddings object configured identically to the current instance.
- Return type:
Embeddings
- Raises:
ValueError – When called on an embeddings-only instance that lacks a model.
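The configuration-reuse pattern behind `new` can be sketched as follows. `EmbeddingsSketch`, `from_precomputed`, and the `_embeddings_only` flag are hypothetical names chosen for this illustration; how DataEval tracks an embeddings-only instance internally is an assumption here, not something this page states.

```python
class EmbeddingsSketch:
    """Hypothetical stand-in showing how new() might reuse a configuration."""

    def __init__(self, dataset, batch_size, transforms=None, model=None):
        self.dataset = dataset
        self.batch_size = batch_size
        self.transforms = transforms
        self.model = model
        self._embeddings_only = False  # assumption: flag for model-less instances

    @classmethod
    def from_precomputed(cls, values):
        """Build an embeddings-only instance: stored values, no dataset or model."""
        instance = cls(dataset=None, batch_size=1)
        instance.values = values
        instance._embeddings_only = True
        return instance

    def new(self, dataset):
        """Return a fresh instance with the same settings and a different dataset."""
        if self._embeddings_only:
            raise ValueError("Embeddings-only instance has no model for a new dataset.")
        return EmbeddingsSketch(dataset, self.batch_size, self.transforms, self.model)


def toy_model(item):
    """Toy encoder standing in for a torch.nn.Module."""
    return [float(v) for v in item]


train = EmbeddingsSketch([[1, 2], [3, 4]], batch_size=8, model=toy_model)
holdout = train.new([[5, 6]])  # same batch size, transforms, and model

precomputed = EmbeddingsSketch.from_precomputed([[0.1, 0.2]])
try:
    precomputed.new([[7, 8]])
    raised = False
except ValueError:
    raised = True
```

Reusing one configuration for several datasets keeps their embeddings comparable, since every dataset passes through the same model and transforms.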
- save(path)
Save embeddings to disk.
Persist current embeddings to the specified file path for later loading and reuse.
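A save-then-load round trip might look like the sketch below. The `SavedEmbeddings` class and the JSON file format are assumptions made purely for illustration; this page does not specify the on-disk format DataEval uses, and the real `save` and `load` may behave differently.

```python
import json
import tempfile
from pathlib import Path


class SavedEmbeddings:
    """Hypothetical stand-in holding precomputed embedding values."""

    def __init__(self, values):
        self.values = values

    def save(self, path):
        # Assumption: JSON is used here only because it is in the standard library.
        Path(path).write_text(json.dumps(self.values))

    @classmethod
    def load(cls, path):
        return cls(json.loads(Path(path).read_text()))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "embeddings.json"
    original = SavedEmbeddings([[0.1, 0.2], [0.3, 0.4]])
    original.save(path)
    restored = SavedEmbeddings.load(path)
```

The restored object carries only the stored values, which is the situation in which an instance has embeddings but no model attached.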
- to_numpy(indices=None)
Convert dataset items to embedding array.
- Parameters:
- indices : Sequence[int] or None, default None
Dataset indices to convert to embeddings. When None, processes entire dataset.
- Returns:
Embedding array with shape (n_samples, embedding_dim).
- Return type:
NDArray[Any]
Warning
Processing large datasets can be memory and compute intensive.
- to_tensor(indices=None)
Convert dataset items to embedding tensor.
Process specified dataset indices through the model in batches and return concatenated embeddings as a single tensor.
- Parameters:
- indices : Sequence[int] or None, default None
Dataset indices to convert to embeddings. When None, processes entire dataset.
- Returns:
Concatenated embeddings with shape (n_samples, embedding_dim).
- Return type:
torch.Tensor
Warning
Processing large datasets can be memory and compute intensive.
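The batched processing described for `to_tensor` can be sketched in plain Python: select the requested indices, encode them one batch at a time, and concatenate the results. `encode_in_batches` and `flatten_batch` are hypothetical helpers, and plain lists stand in for tensors; this illustrates the pattern under those assumptions rather than DataEval's actual code.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def encode_in_batches(
    dataset: Sequence[Sequence[float]],
    encode_batch: Callable[[List[Sequence[float]]], List[Vector]],
    batch_size: int,
    indices: Optional[Sequence[int]] = None,
) -> List[Vector]:
    """Encode the selected items batch by batch and concatenate the results."""
    batch_size = max(1, batch_size)  # values below 1 fall back to 1
    selected = list(range(len(dataset))) if indices is None else list(indices)
    embeddings: List[Vector] = []
    for start in range(0, len(selected), batch_size):
        batch = [dataset[i] for i in selected[start:start + batch_size]]
        embeddings.extend(encode_batch(batch))
    return embeddings


batch_sizes_seen: List[int] = []


def flatten_batch(batch: List[Sequence[float]]) -> List[Vector]:
    """Toy encoder that records each batch size and flattens items to floats."""
    batch_sizes_seen.append(len(batch))
    return [[float(v) for v in item] for item in batch]


dataset = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
all_embeddings = encode_in_batches(dataset, flatten_batch, batch_size=2)
subset = encode_in_batches(dataset, flatten_batch, batch_size=0, indices=[4, 0])
```

Only one batch is held by the encoder at a time, but the concatenated result still grows with the number of selected items, which is why the warning above applies to large datasets.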