dataeval.extractors.BoVWExtractor

class dataeval.extractors.BoVWExtractor(vocab_size=2048)

Computes Bag of Visual Words histograms using SIFT keypoints.

This class implements the FeatureExtractor protocol for use with drift detectors and duplicate detection. It extracts SIFT keypoints from images and quantizes their descriptors into a visual vocabulary, producing rotation- and scale-invariant histogram embeddings.

The BoVW approach works by:

  1. Extracting local SIFT descriptors from each image

  2. Clustering all descriptors to form a “visual vocabulary” (codebook)

  3. Representing each image as a histogram of visual word occurrences

The resulting embeddings are invariant to image rotation, scale changes, and minor viewpoint variations, which makes the approach effective for finding near-duplicate images even after they have been transformed.
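The three steps above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the class's implementation: random 128-dimensional vectors stand in for real SIFT descriptors, a plain Lloyd's k-means replaces MiniBatchKMeans, and the helper names (`kmeans_codebook`, `assign_words`, `bovw_histogram`) are hypothetical.

```python
import numpy as np

def assign_words(descriptors, centroids):
    """Index of the nearest centroid (visual word) for each descriptor."""
    # Squared Euclidean distance between every descriptor and every centroid.
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans_codebook(descriptors, k, n_iter=10, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of visual words."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k distinct descriptors.
    start = rng.choice(len(descriptors), size=k, replace=False)
    centroids = descriptors[start].astype(np.float64)
    for _ in range(n_iter):
        labels = assign_words(descriptors, centroids)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def bovw_histogram(descriptors, centroids):
    """Count visual-word occurrences for one image."""
    words = assign_words(descriptors, centroids)
    return np.bincount(words, minlength=len(centroids)).astype(np.float64)

# Step 1 stand-in: random vectors in place of real SIFT descriptors.
rng = np.random.default_rng(0)
per_image = [rng.random((n, 128)) for n in (30, 45, 25)]

# Step 2: cluster all descriptors into a codebook of 8 visual words.
codebook = kmeans_codebook(np.vstack(per_image), k=8)

# Step 3: one histogram per image.
histograms = np.stack([bovw_histogram(d, codebook) for d in per_image])
print(histograms.shape)  # (3, 8)
```

Each row of `histograms` sums to the number of descriptors in that image, since every descriptor is assigned to exactly one visual word.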

Parameters:
vocab_size : int, default 2048

Number of visual words (clusters) in the vocabulary. Larger vocabularies capture finer visual distinctions but require more training data and memory. Common values range from 256 to 4096. The actual vocabulary size may be smaller if the training data contains fewer total SIFT descriptors than the requested size.

vocab_size

The configured vocabulary size.

Type:

int

kmeans

The fitted k-means clustering model. None before fit() is called.

Type:

MiniBatchKMeans or None

sift

The SIFT feature detector/descriptor extractor.

Type:

cv2.SIFT

Examples

Basic usage with fit/transform pattern

>>> import numpy as np
>>> from dataeval.extractors import BoVWExtractor
>>>
>>> # Create sample images (C, H, W format)
>>> rng = np.random.default_rng(42)
>>> images = [rng.integers(0, 256, (3, 64, 64), dtype=np.uint8) for _ in range(10)]
>>>
>>> # Create extractor and fit vocabulary
>>> extractor = BoVWExtractor(vocab_size=64)
>>> extractor.fit(images)
>>>
>>> # Transform images to embeddings
>>> embeddings = extractor.transform(images)
>>> embeddings.shape
(10, 64)

Using with duplicate detection

>>> from dataeval.quality import Duplicates
>>>
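>>> # reference_data and unlabeled_data are placeholders for your own image datasets
>>> # Fit extractor on reference dataset
>>> extractor = BoVWExtractor(vocab_size=128)
>>> extractor.fit(reference_data)
>>>
>>> # Compute embeddings to pass to Duplicates (see its documentation for usage)
>>> embeddings = extractor.transform(unlabeled_data)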

One-shot fit and transform

>>> # Calling the extractor directly fits the vocabulary and transforms in one call
>>> extractor = BoVWExtractor(vocab_size=64)
>>> embeddings = extractor(images)
>>> embeddings.shape
(10, 64)

Notes

Vocabulary Training: The vocabulary should be trained on a representative sample of images. Once fitted, the same extractor can transform new images into comparable embeddings. Calling fit() again will replace the existing vocabulary.

Image Format: Images should be in (C, H, W) channel-first format, which is standard for PyTorch datasets. Both RGB (3 channels) and grayscale (1 channel) images are supported. Images are automatically converted to uint8 if needed.
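As a rough illustration of the kind of preprocessing this implies, the sketch below turns a (C, H, W) array into a 2-D uint8 grayscale image with NumPy. The helper `to_gray_uint8` is hypothetical, and two details are assumptions rather than documented behavior: floating-point inputs with values in [0, 1] are rescaled to [0, 255], and RGB is reduced to gray with ITU-R BT.601 luma weights.

```python
import numpy as np

def to_gray_uint8(image):
    """Convert a (C, H, W) image to a 2-D uint8 grayscale array."""
    image = np.asarray(image)
    if image.ndim != 3 or image.shape[0] not in (1, 3):
        raise ValueError(f"expected (C, H, W) with 1 or 3 channels, got {image.shape}")
    if image.dtype != np.uint8:
        # Assumption: float images in [0, 1] are rescaled to [0, 255].
        is_unit_float = np.issubdtype(image.dtype, np.floating) and image.max() <= 1.0
        scale = 255.0 if is_unit_float else 1.0
        image = np.clip(np.round(image.astype(np.float64) * scale), 0, 255).astype(np.uint8)
    if image.shape[0] == 1:
        return image[0]  # grayscale: drop the channel axis
    # Assumption: ITU-R BT.601 luma weights for RGB -> gray.
    weights = np.array([0.299, 0.587, 0.114])
    gray = np.tensordot(weights, image.astype(np.float64), axes=1)
    return np.clip(np.round(gray), 0, 255).astype(np.uint8)

rgb_float = np.zeros((3, 4, 5), dtype=np.float32)
print(to_gray_uint8(rgb_float).shape)  # (4, 5)
```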

Empty Features: Images with no detected SIFT features (e.g., uniform color images) will have zero-valued histogram embeddings.
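Because transform() is documented as returning L2-normalized histograms, an all-zero histogram needs a guard against division by zero. A minimal sketch of that guard follows; the helper name `l2_normalize` is hypothetical and the library's own handling may differ.

```python
import numpy as np

def l2_normalize(histogram):
    """Scale a histogram to unit L2 norm, leaving all-zero input unchanged."""
    hist = np.asarray(histogram, dtype=np.float64)
    norm = np.linalg.norm(hist)
    if norm == 0.0:
        return hist  # no features detected: the embedding stays all zeros
    return hist / norm

normalized = l2_normalize([3, 4, 0])  # values 0.6, 0.8, 0.0
empty = l2_normalize([0, 0, 0])       # values 0.0, 0.0, 0.0
```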

Reproducibility: The k-means clustering uses a random seed from DataEval’s global configuration via get_seed(). Set a seed with set_seed() for reproducible results.

See also

dataeval.quality.Duplicates

Duplicate detection using embeddings

dataeval.extractors.ClassifierUncertaintyExtractor

Uncertainty-based feature extraction

fit(data)

Train the visual vocabulary on the provided images.

Extracts SIFT descriptors from all images and clusters them using MiniBatchKMeans to form the visual vocabulary (codebook).

Parameters:
data : Any

Iterable of images in (C, H, W) format. Supports RGB (3 channels) and grayscale (1 channel) images. Also accepts (image, label) tuples as returned by PyTorch datasets.

Returns:

Returns self for method chaining.

Return type:

BoVWExtractor

Raises:

ValueError – If no SIFT features are found in any image. This typically occurs when all images are uniform (e.g., solid color).
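The data parameter accepts either bare images or (image, label) tuples. One way such input could be handled is sketched below; the helper `iter_images` is hypothetical, and treating the first tuple element as the image is an assumption based on the usual PyTorch dataset layout.

```python
import numpy as np

def iter_images(data):
    """Yield the image from each item, dropping labels if present."""
    for item in data:
        # Assumption: in a tuple, the image is always the first element.
        if isinstance(item, tuple):
            item = item[0]
        yield np.asarray(item)

images_only = [np.zeros((3, 8, 8), dtype=np.uint8) for _ in range(2)]
with_labels = [(img, label) for label, img in enumerate(images_only)]

shapes_plain = [im.shape for im in iter_images(images_only)]
shapes_labeled = [im.shape for im in iter_images(with_labels)]
print(shapes_plain == shapes_labeled)  # True
```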

transform(data)

Transform images into BoVW histogram embeddings.

Uses the fitted vocabulary to convert images into L2-normalized histograms of visual word occurrences.

Parameters:
data : Any

Iterable of images in (C, H, W) format. Supports RGB (3 channels) and grayscale (1 channel) images. Also accepts (image, label) tuples as returned by PyTorch datasets.

Returns:

Embeddings array of shape (n_images, vocab_size). Each row is an L2-normalized histogram of visual word occurrences. Images with no detected features have zero-valued histograms.

Return type:

Array

Raises:

RuntimeError – If called before fit().
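Since the kmeans attribute is documented as None before fit() is called, callers can check it rather than catching the RuntimeError. The sketch below relies only on the documented kmeans, fit() and transform() members; `transform_or_fit` and `StubExtractor` are hypothetical names, and the stub exists only so the example runs without DataEval installed.

```python
def transform_or_fit(extractor, data):
    """Fit first if the extractor has no vocabulary yet, then transform.

    `data` must be re-iterable (e.g. a list), since it may be read twice.
    """
    if extractor.kmeans is None:  # documented as None before fit()
        extractor.fit(data)
    return extractor.transform(data)

class StubExtractor:
    """Hypothetical stand-in exposing only kmeans, fit() and transform()."""

    def __init__(self):
        self.kmeans = None

    def fit(self, data):
        self.kmeans = object()  # placeholder for a fitted clustering model
        return self

    def transform(self, data):
        if self.kmeans is None:
            raise RuntimeError("call fit() before transform()")
        return [[0.0] for _ in data]

stub = StubExtractor()
result = transform_or_fit(stub, [1, 2, 3])
print(len(result))  # 3
```

Fitting on the same data that is being transformed is only appropriate when that data is itself a representative sample; otherwise fit on a reference set first.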