dataeval.encoders.TorchEmbeddingEncoder

class dataeval.encoders.TorchEmbeddingEncoder(model, batch_size=None, transforms=None, device=None, layer_name=None, use_output=True)

PyTorch-based embedding encoder.

Encapsulates all PyTorch-specific logic for embedding extraction:

- Model management (torch.nn.Module)
- Device handling
- Transform pipeline
- Batch processing via DataLoader
- Layer hooking for intermediate layer extraction

Parameters:
model : torch.nn.Module

PyTorch model for embedding extraction.

batch_size : int or None, default None

Number of samples per batch. When None, uses DataEval’s configured batch size.

transforms : Transform or Sequence[Transform] or None, default None

Preprocessing transforms to apply before encoding. When None, uses raw images.

device : DeviceLike or None, default None

Device for computation. When None, uses DataEval’s configured device.

layer_name : str or None, default None

Layer to extract embeddings from. When None, uses model output.

use_output : bool, default True

If True, captures layer output; if False, captures layer input. Only used when layer_name is specified.
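
To make the difference between layer output and layer input concrete, the following sketch uses a plain PyTorch forward hook. It is not taken from the TorchEmbeddingEncoder source; `capture` is a hypothetical helper written only for illustration, and it assumes that `layer_name` corresponds to a key from `model.named_modules()`.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 128))
model.eval()


def capture(model, layer_name, batch, use_output=True):
    """Hypothetical helper: return what a forward hook sees at `layer_name`."""
    # Assumption: the layer is looked up by its name in named_modules().
    layer = dict(model.named_modules())[layer_name]
    captured = {}

    def hook(module, args, output):
        # `args` is the tuple of positional inputs to the layer;
        # `output` is the value the layer returned.
        captured["value"] = output if use_output else args[0]

    handle = layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(batch)
    finally:
        # Always detach the hook so later forward passes are unaffected.
        handle.remove()
    return captured["value"]


batch = torch.zeros(4, 1, 28, 28)
flat_out = capture(model, "0", batch, use_output=True)    # shape (4, 784)
flat_in = capture(model, "0", batch, use_output=False)    # shape (4, 1, 28, 28)
linear_in = capture(model, "1", batch, use_output=False)  # shape (4, 784)
```

In this sketch, capturing the input of layer "1" gives the same tensor as capturing the output of layer "0", so `use_output=False` is mainly useful when the layer of interest is easier to name than the one preceding it.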

Example

Basic usage with a model:

>>> import torch.nn as nn
>>> from dataeval.encoders import TorchEmbeddingEncoder
>>> from dataeval import Embeddings
>>>
>>> model = nn.Sequential(nn.Flatten(), nn.Linear(784, 128))
>>> encoder = TorchEmbeddingEncoder(model, batch_size=32, device="cpu")
>>> embeddings = Embeddings(dataset, encoder=encoder)

Extracting from an intermediate layer:

>>> encoder = TorchEmbeddingEncoder(
...     model,
...     batch_size=32,
...     layer_name="0",  # Extract from Flatten layer
...     use_output=True,
... )

encode(dataset: dataeval.protocols.Dataset[tuple[dataeval.protocols.ArrayLike, Any, Any]] | dataeval.protocols.Dataset[dataeval.protocols.ArrayLike], indices: collections.abc.Sequence[int], stream: True) → collections.abc.Iterator[tuple[collections.abc.Sequence[int], numpy.typing.NDArray[Any]]]
encode(dataset: dataeval.protocols.Dataset[tuple[dataeval.protocols.ArrayLike, Any, Any]] | dataeval.protocols.Dataset[dataeval.protocols.ArrayLike], indices: collections.abc.Sequence[int], stream: False = ...) → numpy.typing.NDArray[Any]

Encode images at specified indices to embeddings.

Parameters:
dataset : Dataset

Dataset providing images to encode.

indices : Sequence[int]

Indices of images to encode from the dataset.

stream : bool, default False

If True, yields (batch_indices, batch_embeddings) tuples. If False, returns all embeddings as a single array.

Returns:

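A single array containing all embeddings when stream is False, or an iterator of (batch_indices, batch_embeddings) tuples when stream is True.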

Return type:

NDArray[Any] or Iterator[tuple[Sequence[int], NDArray[Any]]]

Raises:

IndexError – If any indices are out of range for the dataset.
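
The two return modes of encode can be sketched in pure Python. This is an illustration of the documented contract only, not the library's implementation: `encode_sketch`, `_stream_batches`, and `embed` are hypothetical names, plain lists stand in for NumPy arrays, and treating negative indices as out of range is an assumption.

```python
from collections.abc import Callable, Iterator, Sequence


def _stream_batches(
    dataset: Sequence[list[float]],
    indices: Sequence[int],
    batch_size: int,
    embed: Callable[[list[float]], list[float]],
) -> Iterator[tuple[list[int], list[list[float]]]]:
    # Walk the requested indices in order, one batch at a time.
    for start in range(0, len(indices), batch_size):
        batch_indices = list(indices[start:start + batch_size])
        yield batch_indices, [embed(dataset[i]) for i in batch_indices]


def encode_sketch(dataset, indices, batch_size, embed, stream=False):
    # Assumption: indices are validated before any encoding work starts,
    # and negative indices count as out of range.
    n = len(dataset)
    bad = [i for i in indices if not 0 <= i < n]
    if bad:
        raise IndexError(f"indices out of range for dataset of length {n}: {bad}")
    batches = _stream_batches(dataset, indices, batch_size, embed)
    if stream:
        # Streaming mode: hand back the iterator of (indices, embeddings).
        return batches
    # Non-streaming mode: flatten every batch into one result.
    return [vec for _, vecs in batches for vec in vecs]


def embed(item: list[float]) -> list[float]:
    # Stand-in for a model: a one-dimensional "embedding".
    return [sum(item)]


dataset = [[float(i), float(i) * 2] for i in range(10)]
indices = [0, 5, 2, 7, 9]

streamed = list(encode_sketch(dataset, indices, batch_size=2, embed=embed, stream=True))
full = encode_sketch(dataset, indices, batch_size=2, embed=embed)
# streamed batch indices: [[0, 5], [2, 7], [9]]
# full: [[0.0], [15.0], [6.0], [21.0], [27.0]]
```

Streaming is the natural choice when the full embedding array would not fit in memory, since each batch can be processed or written out before the next one is produced.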

property batch_size : int

Return the batch size used for encoding.

property layer_name : str | None

Return the layer name for intermediate extraction, if set.

property use_output : bool

Return whether output (True) or input (False) is captured from the layer.