dataeval.extractors.BoVWExtractor

class dataeval.extractors.BoVWExtractor(vocab_size=2048)

Computes Bag of Visual Words histograms using SIFT keypoints.

This class implements the FeatureExtractor protocol for use with drift detectors and duplicate detection. It extracts SIFT keypoints from images and quantizes their descriptors into a visual vocabulary, producing rotation- and scale-invariant histogram embeddings.

The BoVW approach works by:

1. Extracting local SIFT descriptors from each image
2. Clustering all descriptors to form a “visual vocabulary” (codebook)
3. Representing each image as a histogram of visual word occurrences

The resulting embeddings are largely invariant to image rotation, scale changes, and minor viewpoint variations, which makes them effective for finding near-duplicate images even after such transformations.
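
The clustering and histogram steps can be sketched with NumPy alone. This is an illustrative sketch, not DataEval's implementation. It assumes descriptors have already been extracted (random 128-dimensional vectors stand in for SIFT descriptors), uses plain k-means rather than MiniBatchKMeans, and keeps histograms as raw counts because the documentation does not say whether the library normalizes them. The helper names build_codebook and bovw_histogram are hypothetical.

```python
import numpy as np


def build_codebook(descriptors, vocab_size, n_iter=10, seed=0):
    """Cluster descriptors into at most `vocab_size` visual words (plain k-means)."""
    rng = np.random.default_rng(seed)
    # The vocabulary cannot hold more words than there are descriptors.
    k = min(vocab_size, len(descriptors))
    # Start from k distinct descriptors chosen at random.
    start = rng.choice(len(descriptors), size=k, replace=False)
    centroids = descriptors[start].astype(float)
    for _ in range(n_iter):
        # Assign every descriptor to its nearest centroid.
        dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its members; leave empty clusters as-is.
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids


def bovw_histogram(descriptors, centroids):
    """Count how many of an image's descriptors fall into each visual word."""
    if len(descriptors) == 0:
        # No local features detected: the histogram is all zeros.
        return np.zeros(len(centroids))
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    return np.bincount(words, minlength=len(centroids)).astype(float)


# Stand-in descriptors for three images; the third has no detected features.
rng = np.random.default_rng(42)
per_image = [rng.normal(size=(n, 128)) for n in (30, 45, 0)]

codebook = build_codebook(np.concatenate(per_image), vocab_size=8)
embeddings = np.stack([bovw_histogram(d, codebook) for d in per_image])

print(embeddings.shape)        # (3, 8)
print(embeddings.sum(axis=1))  # per-image descriptor counts: 30, 45 and 0
```

Each row sums to the number of descriptors found in that image, so an image without features maps to a zero vector. Every row has the same length regardless of how many keypoints were detected.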

Parameters:

vocab_size : int, default 2048: Number of visual words (clusters) in the vocabulary. Larger vocabularies capture finer visual distinctions but require more training data and memory. Common values range from 256 to 4096. The actual vocabulary size may be smaller if the training data contains fewer total SIFT descriptors than the requested size.

vocab_size

The configured vocabulary size.

Type: int

kmeans

The fitted k-means clustering model. None before fit() is called.

Type: MiniBatchKMeans or None

sift

The SIFT feature detector/descriptor extractor.

Type: cv2.SIFT

Example

Basic usage with fit/transform pattern

>>> import numpy as np
>>> from dataeval.extractors import BoVWExtractor
>>>
>>> # Create sample images (C, H, W format)
>>> rng = np.random.default_rng(42)
>>> images = [rng.integers(0, 256, (3, 64, 64), dtype=np.uint8) for _ in range(10)]
>>>
>>> # Create extractor and fit vocabulary
>>> extractor = BoVWExtractor(vocab_size=64)
>>> extractor.fit(images)
>>>
>>> # Transform images to embeddings
>>> embeddings = extractor.transform(images)
>>> embeddings.shape
(10, 64)

Using with duplicate detection

>>> from dataeval.quality import Duplicates
>>>
>>> # reference_data and unlabeled_data are placeholders for your own
>>> # collections of images in (C, H, W) format
>>> # Fit extractor on reference dataset
>>> extractor = BoVWExtractor(vocab_size=128)
>>> extractor.fit(reference_data)
>>>
>>> # Compute embeddings to pass on to duplicate detection
>>> embeddings = extractor.transform(unlabeled_data)

One-shot fit and transform

>>> # Convenience method that fits and transforms in one call
>>> extractor = BoVWExtractor(vocab_size=64)
>>> embeddings = extractor(images)
>>> embeddings.shape
(10, 64)

Notes

Vocabulary Training: The vocabulary should be trained on a representative sample of images. Once fitted, the same extractor can transform new images into comparable embeddings. Calling fit() again will replace the existing vocabulary.

Image Format: Images should be in (C, H, W) channel-first format, which is standard for PyTorch datasets. Both RGB (3 channels) and grayscale (1 channel) images are supported. Images are automatically converted to uint8 if needed.
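
Many image loaders return channel-last (H, W, C) arrays, so a conversion step is often needed first. The snippet below is a minimal NumPy sketch of that reordering. The array names are hypothetical and the zero-filled arrays stand in for real image data.

```python
import numpy as np

# Hypothetical channel-last RGB image, as many loaders return it: (H, W, C).
hwc_image = np.zeros((48, 64, 3), dtype=np.uint8)

# Move the channel axis to the front to obtain (C, H, W).
chw_image = np.transpose(hwc_image, (2, 0, 1))

# A 2-D grayscale image gains a leading channel axis of size 1.
gray_image = np.zeros((48, 64), dtype=np.uint8)
gray_chw = gray_image[np.newaxis, ...]

print(chw_image.shape)  # (3, 48, 64)
print(gray_chw.shape)   # (1, 48, 64)
```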

Empty Features: Images with no detected SIFT features (e.g., uniform color images) will have zero-valued histogram embeddings.

Reproducibility: The k-means clustering uses a random seed from DataEval’s global configuration via get_seed(). Set a seed with set_seed() for reproducible results.
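
Why the seed matters can be illustrated without DataEval. K-means starts from randomly chosen centroids, so the same seed yields the same starting point and a different seed usually does not. The sketch below uses NumPy only. The function pick_initial_centroids is hypothetical and is not how the library initializes its model.

```python
import numpy as np


def pick_initial_centroids(descriptors, k, seed):
    """Choose k descriptors as starting centroids using a seeded generator."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(descriptors), size=k, replace=False)
    return descriptors[idx]


# Random vectors stand in for SIFT descriptors.
descriptors = np.random.default_rng(0).normal(size=(100, 128))

a = pick_initial_centroids(descriptors, k=8, seed=123)
b = pick_initial_centroids(descriptors, k=8, seed=123)
c = pick_initial_centroids(descriptors, k=8, seed=456)

print(np.array_equal(a, b))  # True: same seed, same starting centroids
print(np.array_equal(a, c))  # a different seed generally picks different centroids
```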