dataeval.Embeddings
class dataeval.Embeddings(dataset=None, extractor=None, batch_size=None, path=None, memory_threshold=0.8, progress_callback=None)
Collection of image embeddings from a dataset.
Embeddings are accessed by index or slice and are loaded on-demand. For large datasets, embeddings are automatically memory-mapped to disk to avoid exceeding available memory.
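The on-demand access described above can be illustrated with a small stand-in class. This is a minimal sketch and not DataEval's implementation: the class name LazyFlatEmbeddings, its flattening step, and the batches_computed counter are all hypothetical and exist only to show that indexing or slicing computes just the requested rows.

```python
import numpy as np


class LazyFlatEmbeddings:
    """Hypothetical stand-in (not DataEval code): flattens images on demand."""

    def __init__(self, images, batch_size=2):
        self._images = images
        self._batch_size = batch_size
        self._cache = {}  # maps dataset index -> 1-D embedding
        self.batches_computed = 0  # bookkeeping for the demonstration only

    def __len__(self):
        return len(self._images)

    def _compute(self, indices):
        # Only indices that are not cached yet are processed, in batches.
        missing = [i for i in indices if i not in self._cache]
        for start in range(0, len(missing), self._batch_size):
            batch = missing[start:start + self._batch_size]
            stacked = np.stack([np.asarray(self._images[i]) for i in batch])
            flat = stacked.reshape(len(batch), -1)
            for i, row in zip(batch, flat):
                self._cache[i] = row
            self.batches_computed += 1

    def __getitem__(self, key):
        if isinstance(key, slice):
            indices = list(range(*key.indices(len(self))))
            self._compute(indices)
            return np.stack([self._cache[i] for i in indices])
        index = key if key >= 0 else key + len(self)
        self._compute([index])
        return self._cache[index]


# Five fake 3x4x4 "images"; image i is filled with the value i.
images = [np.full((3, 4, 4), i, dtype=np.float32) for i in range(5)]
lazy = LazyFlatEmbeddings(images, batch_size=2)
first = lazy[0]      # computes a single embedding
subset = lazy[1:4]   # computes only indices 1, 2 and 3
```

In this sketch nothing is computed at construction time, and index 4 is never touched because no access requested it.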
This class also implements the FeatureExtractor protocol, allowing it to be used directly with drift detectors and quality metrics that accept feature extractors.
- Parameters:
- dataset : ImageClassificationDataset, ObjectDetectionDataset, or None, default None
Dataset to access original images from. When None, creates an unbound instance that can be used as a reusable feature extractor. Use bind() to attach a dataset later, or pass data directly to __call__().
- extractor : FeatureExtractor or None, default None
Feature extractor for converting images to embeddings. Handles model inference, device management, and transforms. When None, uses FlattenExtractor as a simple baseline that is compatible with all DataEval tools.
- batch_size : int or None, default None
Number of samples to process per batch. When None, uses DataEval’s configured batch size via get_batch_size().
- path : Path, str, or None, default None
File path for memory-mapped storage. When None, embeddings are cached in memory only. When a Path or string is provided, large embeddings are stored in a memory-mapped file at that path; the switch is automatic and based on memory_threshold.
- memory_threshold : float, default 0.8
Fraction of available memory (0-1) that triggers memory-mapped storage. When the estimated embedding size exceeds this threshold, a disk-backed memmap is used instead of in-memory arrays. Only applies when path is provided.
- progress_callback : ProgressCallback or None, default None
Callback to report progress during embedding computation.
- memory_threshold
Fraction of available memory (0-1) that triggers memory-mapped storage.
- Type:
float
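The interaction between path and memory_threshold can be sketched with plain NumPy. The helper below is an illustration under stated assumptions and not DataEval's code: allocate_storage and its available_bytes argument are hypothetical, and how the library actually measures available memory is not specified here.

```python
import os
import tempfile

import numpy as np


def allocate_storage(shape, dtype, available_bytes, memory_threshold=0.8, path=None):
    """Hypothetical helper: choose in-memory or disk-backed storage."""
    required_bytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    exceeds = required_bytes > memory_threshold * available_bytes
    if path is not None and exceeds:
        # Disk-backed array; the OS pages data in as it is accessed.
        return np.memmap(path, dtype=dtype, mode="w+", shape=shape)
    # Without a path, or when the estimate fits, stay in memory.
    return np.zeros(shape, dtype=dtype)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.dat")
    shape = (1000, 128)  # 1000 * 128 * 4 bytes = 512,000 bytes as float32
    in_memory = allocate_storage(shape, np.float32, available_bytes=10_000_000, path=path)
    on_disk = allocate_storage(shape, np.float32, available_bytes=500_000, path=path)
    no_path = allocate_storage(shape, np.float32, available_bytes=500_000, path=None)
    kinds = (type(in_memory).__name__, type(on_disk).__name__, type(no_path).__name__)
    del on_disk  # release the mapped file before the directory is removed
```

The third call shows the documented caveat: without a path, the threshold has no effect and the array stays in memory.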
Example
Using with a PyTorch model:
>>> from dataeval import Embeddings
>>> from dataeval.extractors import TorchExtractor
>>>
>>> extractor = TorchExtractor(my_model)
>>> embeddings = Embeddings(train_dataset, extractor=extractor, batch_size=32)
>>> train_emb = embeddings[:]
>>> train_emb.shape
(40, 32)
Using with default flattening (no model):
>>> import numpy as np
>>> # Uses FlattenExtractor by default
>>> embeddings = Embeddings(dataset)
>>> flat_features = np.asarray(embeddings)
- bind(dataset)
Bind this instance to a dataset.
Attaches a dataset to this Embeddings instance for embedding computation. Any previously cached embeddings are cleared.
- Parameters:
- dataset : ImageClassificationDataset or ObjectDetectionDataset
Dataset to bind for embedding computation.
- Returns:
Returns self for method chaining.
- Return type:
Self
- Raises:
ValueError – When called on an embeddings-only instance.
Example
>>> from dataeval import Embeddings
>>> from dataeval.extractors import TorchExtractor
>>>
>>> extractor = TorchExtractor(my_model)
>>> emb = Embeddings(extractor=extractor, batch_size=32)
>>> _ = emb.bind(train_dataset)
>>> embeddings = emb()
- compute(force=False)
Compute and cache all embeddings.
Forces evaluation of all lazy embeddings, storing them in memory or memmap according to the configured storage strategy. Progress updates are reported via the progress_callback if configured.
- new(dataset)
Create a new Embeddings instance with a different dataset.
Generates a new Embeddings object that uses the same extractor and configuration but a different dataset.
- Parameters:
- dataset : ImageClassificationDataset or ObjectDetectionDataset
Dataset that provides images for the new Embeddings instance.
- Returns:
New Embeddings object configured identically to the current instance.
- Return type:
Embeddings
- Raises:
ValueError – When called on an embeddings-only instance that lacks an extractor.
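The difference between bind() and new() as documented above (bind() returns the same instance and clears its cache, new() returns a separate instance with the same configuration) can be shown with a miniature stand-in. This is a sketch only: MiniEmbeddings and its array() method are hypothetical and mirror just the behavior stated in this reference, not the library's internals.

```python
import numpy as np


class MiniEmbeddings:
    """Hypothetical miniature (not DataEval code) of the bind()/new() contract."""

    def __init__(self, dataset=None, extractor=None, batch_size=32):
        self._dataset = dataset
        # Default mimics a flattening extractor: each item becomes a 1-D vector.
        self._extractor = extractor or (lambda x: np.asarray(x).reshape(-1))
        self._batch_size = batch_size
        self._cache = None

    @property
    def is_bound(self):
        return self._dataset is not None

    def array(self):
        # Hypothetical accessor: compute once, then reuse the cached result.
        if not self.is_bound:
            raise ValueError("No dataset is bound to this instance.")
        if self._cache is None:
            self._cache = np.stack([self._extractor(x) for x in self._dataset])
        return self._cache

    def bind(self, dataset):
        # Same object, new dataset; previously cached embeddings are dropped.
        self._dataset = dataset
        self._cache = None
        return self

    def new(self, dataset):
        # Separate object that shares the extractor and configuration.
        return MiniEmbeddings(dataset, self._extractor, self._batch_size)


train = [np.ones((2, 2)) for _ in range(3)]
test = [np.zeros((2, 2)) for _ in range(5)]

unbound = MiniEmbeddings(batch_size=16)
train_emb = unbound.bind(train)   # returns the same instance
test_emb = train_emb.new(test)    # returns an independent instance
```

Because new() leaves the original untouched, it suits comparing two datasets with one extractor, while bind() suits reusing a single instance across datasets one at a time.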
- save(path=None)
Compute all embeddings and save to disk.
Forces computation of all embeddings if not already computed, then saves to the specified file path. Progress updates are reported via the progress_callback if configured during computation.
- property batch_size : int
Return the batch size used for embedding computation.
- property is_bound : bool
Whether this instance is bound to a dataset.
- property ndim : int
Number of dimensions of the array.
- property shape : tuple[int, ...]
Shape of the array.