dataeval.extractors.BoVWExtractor
class dataeval.extractors.BoVWExtractor(vocab_size=2048)
Computes Bag of Visual Words histograms using SIFT keypoints.
This class implements the FeatureExtractor protocol for use with drift detectors and duplicate detection. It extracts SIFT keypoints from images and quantizes them into a visual vocabulary, producing rotation- and scale-invariant histogram embeddings.
The BoVW approach works by:
1. Extracting local SIFT descriptors from each image
2. Clustering all descriptors to form a “visual vocabulary” (codebook)
3. Representing each image as a histogram of visual word occurrences
This produces embeddings that are invariant to image rotation, scale changes, and minor viewpoint variations, making it effective for finding near-duplicate images even when they have been transformed.
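The quantization and histogram steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not DataEval's implementation: the descriptors are toy 2-D points rather than real SIFT vectors, the codebook is written by hand instead of being learned with k-means, and the names `nearest_word` and `word_histogram` are hypothetical.

```python
import math
from collections import Counter


def nearest_word(descriptor, codebook):
    """Return the index of the codebook entry closest to the descriptor."""
    return min(range(len(codebook)), key=lambda i: math.dist(descriptor, codebook[i]))


def word_histogram(descriptors, codebook):
    """Count how many descriptors are assigned to each visual word."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    return [counts.get(i, 0) for i in range(len(codebook))]


# Toy codebook with three "visual words" (hand-picked, not learned).
codebook = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
# Toy descriptors standing in for the local features of one image.
descriptors = [(0.5, 0.2), (9.0, 1.0), (9.5, -0.5), (1.0, 8.0)]

histogram = word_histogram(descriptors, codebook)
print(histogram)
```

The histogram length equals the codebook size regardless of how many descriptors the image yields, which is why images with different numbers of keypoints still produce embeddings of the same shape.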
- Parameters:
- vocab_size : int, default 2048
Number of visual words (clusters) in the vocabulary. Larger vocabularies capture finer visual distinctions but require more training data and memory. Common values range from 256 to 4096. The actual vocabulary size may be smaller if the training data contains fewer total SIFT descriptors than the requested size.
- kmeans
The fitted k-means clustering model. None before fit() is called.
- Type:
MiniBatchKMeans or None
Example
Basic usage with fit/transform pattern
>>> import numpy as np
>>> from dataeval.extractors import BoVWExtractor
>>>
>>> # Create sample images (C, H, W format)
>>> rng = np.random.default_rng(42)
>>> images = [rng.integers(0, 256, (3, 64, 64), dtype=np.uint8) for _ in range(10)]
>>>
>>> # Create extractor and fit vocabulary
>>> extractor = BoVWExtractor(vocab_size=64)
>>> extractor.fit(images)
>>>
>>> # Transform images to embeddings
>>> embeddings = extractor.transform(images)
>>> embeddings.shape
(10, 64)

Using with duplicate detection
>>> from dataeval.quality import Duplicates
>>>
>>> # Fit extractor on reference dataset
>>> extractor = BoVWExtractor(vocab_size=128)
>>> extractor.fit(reference_data)
>>>
>>> # Use embeddings for duplicate detection
>>> embeddings = extractor.transform(unlabeled_data)

One-shot fit and transform
>>> # Convenience method that fits and transforms in one call
>>> extractor = BoVWExtractor(vocab_size=64)
>>> embeddings = extractor(images)
>>> embeddings.shape
(10, 64)

Notes
Vocabulary Training: The vocabulary should be trained on a representative sample of images. Once fitted, the same extractor can transform new images into comparable embeddings. Calling fit() again will replace the existing vocabulary.
Image Format: Images should be in (C, H, W) channel-first format, which is standard for PyTorch datasets. Both RGB (3 channels) and grayscale (1 channel) images are supported. Images are automatically converted to uint8 if needed.
Empty Features: Images with no detected SIFT features (e.g., uniform color images) will have zero-valued histogram embeddings.
Reproducibility: The k-means clustering uses a random seed from DataEval’s global configuration via get_seed(). Set a seed with set_seed() for reproducible results.

See also
dataeval.quality.Duplicates: Duplicate detection using embeddings
dataeval.extractors.ClassifierUncertaintyExtractor: Uncertainty-based feature extraction
- fit(data)
Train the visual vocabulary on the provided images.
Extracts SIFT descriptors from all images and clusters them using MiniBatchKMeans to form the visual vocabulary (codebook).
- Parameters:
- data : Any
Iterable of images in (C, H, W) format. Supports RGB (3 channels) and grayscale (1 channel) images. Also accepts (image, label) tuples as returned by PyTorch datasets.
- Returns:
Returns self for method chaining.
- Return type:
- Raises:
ValueError – If no SIFT features are found in any image. This typically occurs when all images are uniform (e.g., solid color).
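The input handling described for fit() can be pictured with a short sketch. It is an illustration only and assumes nothing about DataEval's internals: `unwrap_image`, `collect_descriptors`, and the `describe` callable (a stand-in for SIFT extraction) are hypothetical names, and the "images" are flat lists of numbers rather than (C, H, W) arrays.

```python
def unwrap_image(item):
    """Return the image from a bare image or an (image, label, ...) tuple."""
    return item[0] if isinstance(item, tuple) else item


def collect_descriptors(data, describe):
    """Gather descriptors from every image; fail if none are found at all."""
    descriptors = []
    for item in data:
        descriptors.extend(describe(unwrap_image(item)))
    if not descriptors:
        raise ValueError("No features found in any image")
    return descriptors


def describe(image):
    """Stand-in detector: treats each non-zero value as one descriptor."""
    return [value for value in image if value != 0]


# A bare image followed by an (image, label) tuple.
mixed = [[0, 3, 0], ([5, 0, 7], "cat")]
found = collect_descriptors(mixed, describe)
print(found)

# A dataset of uniform images yields no descriptors and triggers the error.
try:
    collect_descriptors([[0, 0, 0]], describe)
    raised = False
except ValueError:
    raised = True
print(raised)
```

A single featureless image is not an error in this sketch; the exception is raised only when the whole dataset produces no descriptors, matching the condition stated above.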
- transform(data)
Transform images into BoVW histogram embeddings.
Uses the fitted vocabulary to convert images into normalized histograms of visual word occurrences.
- Parameters:
- data : Any
Iterable of images in (C, H, W) format. Supports RGB (3 channels) and grayscale (1 channel) images. Also accepts (image, label) tuples as returned by PyTorch datasets.
- Returns:
Embeddings array of shape (n_images, vocab_size). Each row is an L2-normalized histogram of visual word occurrences. Images with no detected features have zero-valued histograms.
- Return type:
- Raises:
RuntimeError – If called before fit().
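The transform() contract above (L2-normalized rows, all-zero rows for featureless images, and a RuntimeError before fitting) can be sketched as follows. This is a toy under stated assumptions, not the library's code: `l2_normalize` and `ToyExtractor` are hypothetical, the inputs are precomputed word counts rather than images, and the "vocabulary" is only a stored histogram length.

```python
import math


def l2_normalize(histogram):
    """Scale a histogram to unit L2 norm; keep an all-zero histogram as zeros."""
    norm = math.sqrt(sum(v * v for v in histogram))
    if norm == 0:
        return [0.0 for _ in histogram]
    return [v / norm for v in histogram]


class ToyExtractor:
    """Hypothetical stand-in that models only the fitted-state check."""

    def __init__(self):
        self.vocabulary = None  # None until fit() is called

    def fit(self, histograms):
        self.vocabulary = len(histograms[0])
        return self  # returning self allows fit(...).transform(...)

    def transform(self, histograms):
        if self.vocabulary is None:
            raise RuntimeError("fit() must be called before transform()")
        return [l2_normalize(h) for h in histograms]


# Second row models an image with no detected features.
raw = [[3, 4, 0], [0, 0, 0]]
rows = ToyExtractor().fit(raw).transform(raw)
print(rows)
```

Guarding the zero-norm case avoids a division by zero and keeps featureless images as zero vectors instead of NaN rows.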