dataeval.shift.MetadataFeatureExtractor

class dataeval.shift.MetadataFeatureExtractor(continuous_factor_bins=None, auto_bin_method=None, exclude=None, include=None, use_binned=True, add_stats=None, metadata=None)

Extract metadata factors from datasets for drift detection.

This class implements the FeatureExtractor protocol for use with drift detectors. It extracts and bins metadata factors from annotated datasets, with support for reusing pre-computed metadata to avoid redundant processing.

The extractor maintains state to cache reference metadata and avoid recomputation when the same dataset is passed multiple times.

Parameters:
continuous_factor_bins : Mapping[str, int | Sequence[float]] or None, default None

Binning configuration for continuous factors. Maps factor names to either the number of bins or explicit bin edges.

auto_bin_method : {"uniform_width", "uniform_count", "clusters"}, default "uniform_width"

Automatic binning strategy for continuous factors without explicit bins.

exclude : Sequence[str] or None, default None

Factor names to exclude from processing.

include : Sequence[str] or None, default None

Factor names to include in processing.

use_binned : bool, default True

If True, returns binned_data (discrete integers). If False, returns factor_data (original continuous/categorical values).

metadata : Metadata or None, default None

Pre-computed Metadata object to reuse. When provided, avoids recomputation for the same dataset. This is useful when you’ve already processed metadata and want to use it for drift detection without redundant binning.

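add_stats : ImageStats or None, default None

Image statistics to compute and add as metadata factors, given as ImageStats flags (for example, ImageStats.VISUAL_BRIGHTNESS | ImageStats.VISUAL_CONTRAST).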

continuous_factor_bins

Binning configuration for continuous factors.

Type:

Mapping[str, int | Sequence[float]]

auto_bin_method

Automatic binning strategy.

Type:

{"uniform_width", "uniform_count", "clusters"}

use_binned

Whether to return binned or raw factor data.

Type:

bool
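The two forms accepted by continuous_factor_bins (a bin count or explicit bin edges) and the difference between the "uniform_width" and "uniform_count" strategies can be illustrated with a minimal, standard-library sketch. This is not dataeval's implementation: the helper names (uniform_width_edges, uniform_count_edges, resolve_edges, assign_bins) are hypothetical, the edge and clamping conventions are assumptions, and the "clusters" strategy is not covered.

```python
from bisect import bisect_right
from typing import List, Sequence, Union

# Hypothetical alias mirroring the documented type: int | Sequence[float]
BinSpec = Union[int, Sequence[float]]


def uniform_width_edges(values: Sequence[float], n_bins: int) -> List[float]:
    """Split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(n_bins)] + [hi]


def uniform_count_edges(values: Sequence[float], n_bins: int) -> List[float]:
    """Place edges at quantiles so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    inner = [ordered[(i * len(ordered)) // n_bins] for i in range(1, n_bins)]
    return [ordered[0]] + inner + [ordered[-1]]


def resolve_edges(
    values: Sequence[float], spec: BinSpec, method: str = "uniform_width"
) -> List[float]:
    """Turn a bin count into edges with the chosen strategy; pass explicit edges through."""
    if isinstance(spec, int):
        if method == "uniform_width":
            return uniform_width_edges(values, spec)
        if method == "uniform_count":
            return uniform_count_edges(values, spec)
        raise ValueError(f"strategy not covered by this sketch: {method}")
    return list(spec)


def assign_bins(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Map each value to a bin index; values outside the edges fall into the end bins."""
    interior = list(edges[1:-1])
    return [bisect_right(interior, v) for v in values]


# One large outlier makes the difference between the strategies visible.
brightness = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0]

width_bins = assign_bins(brightness, resolve_edges(brightness, 4, "uniform_width"))
count_bins = assign_bins(brightness, resolve_edges(brightness, 4, "uniform_count"))
explicit_bins = assign_bins(brightness, resolve_edges(brightness, [0.0, 3.0, 10.0, 100.0]))

print(width_bins)     # [0, 0, 0, 0, 0, 0, 0, 3]
print(count_bins)     # [0, 0, 1, 1, 2, 2, 3, 3]
print(explicit_bins)  # [0, 0, 0, 1, 1, 1, 1, 2]
```

In this toy data the outlier stretches the equal-width bins so that almost every value lands in bin 0, while the equal-count edges keep the bins balanced. Explicit edges are the option to reach for when the cut points carry domain meaning.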

Example

Basic usage with a dataset:

>>> from dataeval.flags import ImageStats
>>> from dataeval.shift import DriftUnivariate, MetadataFeatureExtractor
>>>
>>> # ExampleDataset is a doctest fixture provided by conftest (not imported here)
>>> train_dataset = ExampleDataset(100, seed=42)
>>> test_dataset = ExampleDataset(50, seed=43)
>>>
>>> # Create metadata extractor
>>> metadata_extractor = MetadataFeatureExtractor(
...     continuous_factor_bins={"brightness": 10, "contrast": 10},
...     use_binned=False,
...     add_stats=ImageStats.VISUAL_BRIGHTNESS | ImageStats.VISUAL_CONTRAST,
... )
>>>
>>> # Use with drift detector on raw datasets
>>> drift_detector = DriftUnivariate(
...     data=train_dataset,
...     method="ks",
...     feature_extractor=metadata_extractor,
... )
>>> result = drift_detector.predict(test_dataset)
>>> print(f"Drift detected: {result.drifted}")
Drift detected: True

Reusing pre-computed metadata:

>>> from dataeval import Metadata
>>> from dataeval.core import calculate
>>> from dataeval.flags import ImageStats
>>>
>>> # Create dataset for metadata extraction
>>> train_ds_meta = ExampleDataset(100, seed=42)
>>>
>>> # Compute metadata once with additional image statistics
>>> stats_flags = ImageStats.VISUAL_BRIGHTNESS | ImageStats.VISUAL_CONTRAST
>>> stats = calculate(train_ds_meta, stats=stats_flags)
>>> train_metadata = Metadata(
...     train_ds_meta,
...     continuous_factor_bins={"brightness": 10, "contrast": 10},
... )
>>> train_metadata.add_factors(stats["stats"])
>>>
>>> # Reuse metadata with drift detector
>>> metadata_extractor = MetadataFeatureExtractor(
...     metadata=train_metadata,
...     use_binned=False,
...     add_stats=stats_flags,
... )
>>> drift_detector = DriftUnivariate(
...     data=train_ds_meta,
...     method="ks",
...     feature_extractor=metadata_extractor,
... )

Notes

The extractor caches a reference to the dataset used during initialization. When that same dataset is passed again, which is common while a drift detector initializes its reference data, the cached metadata is reused instead of being recomputed.
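The caching behavior described in this note can be pictured as an identity check against a stored dataset reference. The sketch below is a generic illustration of that pattern, not dataeval's code: the CachingExtractor class, its attributes, and the choice to cache the first dataset seen are all assumptions made for the example.

```python
from typing import Any, Callable, Optional


class CachingExtractor:
    """Hypothetical extractor that reuses features for its reference dataset."""

    def __init__(self, compute: Callable[[Any], Any]) -> None:
        self._compute = compute                  # expensive feature computation
        self._ref_dataset: Optional[Any] = None  # dataset whose features are cached
        self._ref_features: Optional[Any] = None
        self.calls = 0                           # number of real computations

    def __call__(self, dataset: Any) -> Any:
        # Identity (not equality) check: only the very same object hits the cache.
        if self._ref_dataset is not None and dataset is self._ref_dataset:
            return self._ref_features
        self.calls += 1
        features = self._compute(dataset)
        if self._ref_dataset is None:
            # Assumption: the first dataset seen is treated as the reference.
            self._ref_dataset, self._ref_features = dataset, features
        return features


reference = [1, 2, 3]
test = [4, 5]
extractor = CachingExtractor(lambda ds: [x * 2 for x in ds])

first = extractor(reference)   # computed
second = extractor(reference)  # served from the cache
other = extractor(test)        # computed, different object

print(extractor.calls)  # 2
```

Comparing by identity keeps the check cheap, at the cost of missing a cache hit when an equal but distinct dataset object is passed.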

Binning configuration is preserved when reusing metadata to ensure consistent discretization across reference and test data.
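To see why shared bin edges matter, consider a toy factor whose test values are shifted upward relative to the reference. The following standard-library sketch is only an illustration under stated assumptions (the helper names are hypothetical and equal-width bins are used for simplicity); it does not reproduce how dataeval bins data.

```python
from bisect import bisect_right
from collections import Counter
from typing import List, Sequence


def uniform_width_edges(values: Sequence[float], n_bins: int) -> List[float]:
    """Split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(n_bins)] + [hi]


def assign_bins(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """Map each value to a bin index; out-of-range values fall into the end bins."""
    interior = list(edges[1:-1])
    return [bisect_right(interior, v) for v in values]


reference = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
shifted = [v + 4.0 for v in reference]  # same shape, moved upward

ref_edges = uniform_width_edges(reference, 4)
ref_hist = Counter(assign_bins(reference, ref_edges))

# Consistent discretization: test data is binned with the reference edges.
shared_hist = Counter(assign_bins(shifted, ref_edges))

# Inconsistent discretization: edges are recomputed from the test data itself.
own_hist = Counter(assign_bins(shifted, uniform_width_edges(shifted, 4)))

print(sorted(ref_hist.items()))     # [(0, 2), (1, 2), (2, 2), (3, 2)]
print(sorted(shared_hist.items()))  # [(2, 2), (3, 6)]
print(sorted(own_hist.items()))     # [(0, 2), (1, 2), (2, 2), (3, 2)]
```

With shared edges the shifted data piles up in the upper bins, so the change is visible. With recomputed edges the histogram is identical to the reference and the shift disappears from the binned view.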

See also

Metadata

Underlying metadata processing class

ImageStats

Supported image statistics

DriftUnivariate

Univariate drift detection with multiple statistical tests