dataeval.protocols.EmbeddingEncoder
- class dataeval.protocols.EmbeddingEncoder
Protocol for embedding encoders that extract features from datasets.
Implementations handle all backend-specific logic, including:
- Model/function management
- Device handling (if applicable)
- Transforms (preprocessing)
- Batching strategy
- Layer extraction (if applicable)
The encode() method supports both streaming and non-streaming modes via the stream parameter.
Example
Creating a custom encoder:
>>> import numpy as np
>>> from numpy.typing import NDArray
>>> from dataeval.protocols import EmbeddingEncoder, Dataset
>>>
>>> class MyEncoder:
...     def __init__(self, batch_size: int = 32):
...         self._batch_size = batch_size
...
...     @property
...     def batch_size(self) -> int:
...         return self._batch_size
...
...     def encode(self, dataset, indices, stream=False):
...         def _generate():
...             for batch_start in range(0, len(indices), self._batch_size):
...                 batch_idx = list(indices[batch_start : batch_start + self._batch_size])
...                 results = []
...                 for idx in batch_idx:
...                     item = dataset[idx]
...                     image = item[0] if isinstance(item, tuple) else item
...                     results.append(np.asarray(image).flatten())
...                 yield batch_idx, np.vstack(results)
...
...         if stream:
...             return _generate()
...         return np.vstack([emb for _, emb in _generate()])
>>>
>>> encoder = MyEncoder(batch_size=32)
>>> isinstance(encoder, EmbeddingEncoder)
True
- encode(dataset: Dataset[tuple[ArrayLike, Any, Any]] | Dataset[ArrayLike], indices: collections.abc.Sequence[int], stream: True) -> collections.abc.Iterator[tuple[collections.abc.Sequence[int], Array]]
- encode(dataset: Dataset[tuple[ArrayLike, Any, Any]] | Dataset[ArrayLike], indices: collections.abc.Sequence[int], stream: False = ...) -> Array
Encode images at specified indices to embeddings.
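The two encode() signatures above differ only in the type of stream and in the return type. A minimal sketch of how such a pair can be declared with typing.overload and Literal follows. EncoderSketch, HasBoth, and MissingBatchSize are hypothetical names used for illustration only, and the batch_size member is an assumption based on the MyEncoder example; this is not dataeval's actual definition.

```python
from collections.abc import Iterator, Sequence
from typing import Any, Literal, Protocol, overload, runtime_checkable


@runtime_checkable
class EncoderSketch(Protocol):
    """Hypothetical stand-in, not dataeval's actual EmbeddingEncoder definition."""

    # Assumption: the protocol requires batch_size, as the MyEncoder example suggests.
    @property
    def batch_size(self) -> int: ...

    # stream=True selects the iterator-returning overload.
    @overload
    def encode(
        self, dataset: Any, indices: Sequence[int], stream: Literal[True]
    ) -> Iterator[tuple[Sequence[int], Any]]: ...

    # stream=False (or omitted) selects the single-array overload.
    @overload
    def encode(
        self, dataset: Any, indices: Sequence[int], stream: Literal[False] = ...
    ) -> Any: ...


class HasBoth:
    """Provides both members, so the structural check passes."""

    batch_size = 8

    def encode(self, dataset, indices, stream=False):
        return iter(()) if stream else []


class MissingBatchSize:
    """Lacks batch_size, so the structural check fails."""

    def encode(self, dataset, indices, stream=False):
        return []


has_both = isinstance(HasBoth(), EncoderSketch)
missing = isinstance(MissingBatchSize(), EncoderSketch)
```

Note that isinstance() against a runtime-checkable protocol only verifies that the member names exist. It does not check signatures or return types, so a static type checker is still needed to confirm that an encoder honours both overloads.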
- Parameters:
- dataset : Dataset
Dataset providing images to encode. Can return either (image, label, metadata) tuples or just images.
- indices : Sequence[int]
Indices of images to encode from the dataset.
- stream : bool, default False
If True, yields (batch_indices, batch_embeddings) tuples for memory-efficient streaming. If False (default), returns all embeddings as a single array.
- Returns:
When stream=False: Embeddings array of shape (len(indices), embedding_dim).
When stream=True: Iterator yielding (batch_indices, batch_embeddings) tuples.
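The two return shapes can be compared side by side with a small sketch. ToyEncoder is a hypothetical, dependency-free encoder that uses plain lists in place of arrays and treats each item as its own embedding. It only illustrates the calling convention described here, not any dataeval implementation.

```python
from collections.abc import Iterator, Sequence


class ToyEncoder:
    """Hypothetical encoder: an item's 'embedding' is the item as a list of floats."""

    def __init__(self, batch_size: int = 2) -> None:
        self.batch_size = batch_size

    def encode(self, dataset, indices: Sequence[int], stream: bool = False):
        def _generate() -> Iterator[tuple[list[int], list[list[float]]]]:
            # Walk the requested indices in chunks of batch_size.
            for start in range(0, len(indices), self.batch_size):
                batch_idx = list(indices[start : start + self.batch_size])
                yield batch_idx, [[float(x) for x in dataset[i]] for i in batch_idx]

        if stream:
            return _generate()
        # Non-streaming mode: flatten all batches into one result.
        return [row for _, batch in _generate() for row in batch]


dataset = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
indices = [4, 0, 3]
encoder = ToyEncoder(batch_size=2)

# stream=False: one result with len(indices) rows, in the order of indices.
full = encoder.encode(dataset, indices)

# stream=True: consume one batch at a time, keyed by dataset index.
by_index = {}
batch_sizes = []
for batch_idx, batch_emb in encoder.encode(dataset, indices, stream=True):
    batch_sizes.append(len(batch_idx))
    for idx, emb in zip(batch_idx, batch_emb):
        by_index[idx] = emb

streamed = [by_index[i] for i in indices]
```

In this sketch both modes produce the same rows. The streaming loop holds only one batch at a time, which is the memory benefit the stream parameter is meant to provide.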
- Return type: