dataeval.shift.DriftMMD
- class dataeval.shift.DriftMMD(p_val=None, update_strategy=None, sigma=None, n_permutations=None, permutation_batch_size=None, device=None, config=None)
Drift detector based on Maximum Mean Discrepancy (MMD) with a permutation test.
Detects distributional differences by comparing kernel embeddings of reference and test datasets in a reproducing kernel Hilbert space (RKHS). Uses permutation testing to assess statistical significance of the observed MMD² statistic.
Uses a fit/predict lifecycle: construct with hyperparameters, call fit() with reference data, then call predict() with test data. Supports chunked mode when chunking parameters are provided to fit().
MMD is particularly effective for high-dimensional data like images as it can capture complex distributional differences that univariate tests might miss. The kernel-based approach enables detection of both marginal and dependency changes between features.
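To make the MMD² statistic concrete, the following is a minimal pure-Python sketch of a biased (V-statistic) estimate with a Gaussian RBF kernel. It is an illustration only: the function names, the kernel parameterization exp(-||x - y||² / (2σ²)), and the choice of the biased estimator are assumptions, and DriftMMD itself may compute the statistic differently.

```python
import math
import random


def rbf_kernel(x, y, sigma):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma**2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd2_biased(ref, test, sigma=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (
        mean_kernel(ref, ref, sigma)
        + mean_kernel(test, test, sigma)
        - 2.0 * mean_kernel(ref, test, sigma)
    )


# Hypothetical toy data: 40 points in 4 dimensions per sample.
rng = random.Random(0)
ref = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(40)]
same = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(40)]
shifted = [[rng.gauss(3.0, 1.0) for _ in range(4)] for _ in range(40)]

print(mmd2_biased(ref, same))     # small: both samples share a distribution
print(mmd2_biased(ref, shifted))  # larger: the mean has moved
```

The statistic alone has no natural scale, which is why the detector pairs it with a permutation test rather than a fixed cutoff.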
- Parameters:
- p_val : float, default 0.05
Significance threshold for statistical tests, between 0 and 1. For FDR correction, this represents the acceptable false discovery rate. The default of 0.05 corresponds to a 95% confidence level for drift detection.
- update_strategy : UpdateStrategy or None, default None
Strategy for updating reference data when new data arrives. When None, reference data remains fixed throughout detection. Ignored in chunked mode.
- sigma : Array or None, default None
Bandwidth parameter(s) for the Gaussian RBF kernel. Controls the kernel’s sensitivity to distance between data points. When None, the bandwidth is selected automatically using the median heuristic. Multiple values can be provided as an array to average over different scales.
- n_permutations : int, default 100
Number of random permutations used in the permutation test to estimate the null distribution of MMD² under no drift. Higher values provide more accurate p-value estimates but increase computation time. The default of 100 balances statistical accuracy with computational efficiency.
- permutation_batch_size : int or "auto", default "auto"
Batch size for computing permutations to reduce memory usage. When "auto" (default), an appropriate batch size is detected from available GPU memory (on CUDA devices), or all permutations are computed at once (on CPU). Set to an integer to specify the batch size manually. Useful when working with large kernel matrices or many permutations to avoid GPU out-of-memory errors.
- device : DeviceLike or None, default None
Hardware device for computation. When None, automatically selects DataEval’s configured device, falling back to PyTorch’s default.
- config : DriftMMD.Config or None, default None
Optional configuration object with default parameters. Parameters specified directly in __init__ override config defaults.
- update_strategy
Reference data update strategy.
- Type:
UpdateStrategy or None
- permutation_batch_size
Batch size for computing permutations, or "auto" for automatic detection.
- Type:
int or "auto"
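The sigma parameter mentions a median heuristic and averaging over several bandwidths. The sketch below shows one common form of each, assuming the heuristic takes the median Euclidean distance over all (reference, test) pairs and that multiple bandwidths are combined by a plain average of kernel values. Both helper names are hypothetical, and the variant DriftMMD actually uses may differ (for example, squared distances or a pooled sample).

```python
import math
import statistics


def median_heuristic_sigma(ref, test):
    """Median Euclidean distance over all (ref, test) pairs."""
    return statistics.median(math.dist(x, y) for x in ref for y in test)


def multi_scale_rbf(x, y, sigmas):
    """Average of Gaussian RBF kernel values evaluated at several bandwidths."""
    sq_dist = math.dist(x, y) ** 2
    return sum(math.exp(-sq_dist / (2.0 * s**2)) for s in sigmas) / len(sigmas)


# Hypothetical toy data: two points per sample in 2 dimensions.
ref = [[0.0, 0.0], [1.0, 0.0]]
test = [[0.0, 1.0], [1.0, 1.0]]

sigma = median_heuristic_sigma(ref, test)
print(sigma)
print(multi_scale_rbf(ref[0], test[0], [0.5 * sigma, sigma, 2.0 * sigma]))
```

Tying the bandwidth to typical pairwise distances keeps kernel values away from the degenerate regimes where every pair looks identical (sigma too large) or every pair looks unrelated (sigma too small).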
Example
Initialize with image embeddings
>>> import numpy as np
>>> from dataeval.shift import DriftMMD
>>> train_emb = np.ones((100, 128), dtype=np.float32)
>>> drift = DriftMMD().fit(train_emb)
Test incoming images for drift
>>> test_emb = np.zeros((20, 128), dtype=np.float32)
>>> result = drift.predict(test_emb)
>>> print(f"Drift detected: {result.drifted}")
Drift detected: True
>>> print(f"Mean MMD statistic: {result.distance:.2f}")
Mean MMD statistic: 1.26
Chunked drift detection with z-score thresholds
>>> drift = DriftMMD().fit(train_emb, chunk_size=20)
>>> result = drift.predict(test_emb)
Using configuration:
>>> config = DriftMMD.Config(p_val=0.01, n_permutations=200)
>>> drift = DriftMMD(config=config).fit(train_emb)
- fit(data, chunker=None, chunk_size=None, chunk_count=None, chunks=None, chunk_indices=None, threshold=None)
Fit detector with reference data.
Stores reference data, initializes the kernel, and precomputes the reference kernel matrix. Optionally enables chunked mode.
- Parameters:
- data : Array
Reference dataset used as baseline distribution for drift detection.
- chunker : ArrayChunker or None, default None
Explicit chunker instance for chunked mode.
- chunk_size : int or None, default None
Create fixed-size chunks of this many samples.
- chunk_count : int or None, default None
Split reference into this many equal chunks.
- chunks : list[ArrayLike] or None, default None
Pre-split reference data arrays for chunked mode.
- chunk_indices : list[list[int]] or None, default None
Index groupings for chunking reference data.
- threshold : Threshold or None, default None
Threshold strategy for chunked mode. Defaults to StandardDeviationThreshold (mean +/- 3*std).
- Return type:
Self
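The default chunked-mode threshold is described as mean +/- 3*std. The sketch below shows that rule applied to a list of per-chunk baseline values. It is not the StandardDeviationThreshold implementation: the helper names are hypothetical, the baseline numbers are made up, and the use of the population standard deviation is an assumption.

```python
import statistics


def std_threshold_bounds(baseline_values, multiplier=3.0):
    """Return (lower, upper) = mean -/+ multiplier * std of the baseline values."""
    mean = statistics.fmean(baseline_values)
    std = statistics.pstdev(baseline_values)  # population std (an assumption)
    return mean - multiplier * std, mean + multiplier * std


def flag_chunks(values, lower, upper):
    """Mark each value that falls outside the [lower, upper] band."""
    return [not (lower <= v <= upper) for v in values]


# Hypothetical per-chunk statistics measured on reference chunks.
baseline = [0.010, 0.012, 0.011, 0.009, 0.013]
lower, upper = std_threshold_bounds(baseline)

# Hypothetical statistics for two incoming chunks.
flags = flag_chunks([0.012, 0.05], lower, upper)
print(lower, upper)
print(flags)
```

A band derived from the reference chunks adapts to how noisy the statistic is on in-distribution data, at the cost of needing enough reference chunks for the mean and standard deviation to be stable.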
- predict(data=None, chunks=None, chunk_indices=None)
Predict whether a batch of data has drifted from the reference data.
In non-chunked mode, uses a permutation test. In chunked mode, computes per-chunk MMD² and compares it against baseline thresholds.
- Parameters:
- data : Any, optional
Batch of instances to predict drift on. Required for non-chunked mode and for chunked mode unless pre-built chunks are provided.
- chunks : list[ArrayLike] or None, default None
Pre-built test data chunks.
- chunk_indices : list[list[int]] or None, default None
Index groupings for chunking data.
- Returns:
Non-chunked mode: details is a DriftMMDStats TypedDict. Chunked mode: details is a polars.DataFrame with per-chunk results.
- Return type:
- score(data)
Compute the p-value resulting from a permutation test using the maximum mean discrepancy.
The maximum mean discrepancy is used as a distance measure between the reference data and the data to be tested.
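As a rough illustration of how a permutation test turns an MMD² value into a p-value, the sketch below pools both samples, reshuffles them n_permutations times, and reports the fraction of shuffled statistics that reach the observed one. The helper names and 1-D toy data are hypothetical, and the exact estimator is an assumption: some implementations use (count + 1) / (n_permutations + 1) instead, and DriftMMD permutes a precomputed kernel matrix rather than recomputing the statistic from raw data.

```python
import math
import random


def mmd2(xs, ys, sigma=1.0):
    """Biased squared MMD for 1-D samples with a Gaussian RBF kernel."""
    def k_mean(a, b):
        total = sum(math.exp(-((u - v) ** 2) / (2.0 * sigma**2)) for u in a for v in b)
        return total / (len(a) * len(b))
    return k_mean(xs, xs) + k_mean(ys, ys) - 2.0 * k_mean(xs, ys)


def permutation_p_value(ref, test, statistic, n_permutations=100, seed=0):
    """Return (p_value, observed) for the given two-sample statistic."""
    rng = random.Random(seed)
    observed = statistic(ref, test)
    pooled = list(ref) + list(test)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        perm_ref, perm_test = pooled[: len(ref)], pooled[len(ref):]
        if statistic(perm_ref, perm_test) >= observed:
            exceed += 1
    return exceed / n_permutations, observed


# Hypothetical toy data: 30 scalar observations per sample.
rng = random.Random(1)
ref = [rng.gauss(0.0, 1.0) for _ in range(30)]
same = [rng.gauss(0.0, 1.0) for _ in range(30)]
shifted = [rng.gauss(2.0, 1.0) for _ in range(30)]

p_same, observed_same = permutation_p_value(ref, same, mmd2)
p_shift, observed_shift = permutation_p_value(ref, shifted, mmd2)
print(p_same, p_shift)
```

With 100 permutations the smallest nonzero p-value this estimator can report is 0.01, which is one reason a larger n_permutations gives finer-grained results.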
- property is_chunked : bool
Whether the detector is operating in chunked mode.
- property x_ref : numpy.typing.NDArray[numpy.float32]
Reference data, lazily encoded on first access.
Overrides BaseDrift.x_ref via MRO when this mixin appears before BaseDrift in the inheritance list.
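The MRO note can be illustrated with a small stand-alone sketch. All class names here are hypothetical stand-ins, not DataEval classes: a mixin defines a lazily computed x_ref property, and it only takes precedence over the base class property when it is listed first in the inheritance list.

```python
class BaseDetector:
    """Hypothetical base class exposing the raw reference data."""

    def __init__(self, raw):
        self._raw = raw

    @property
    def x_ref(self):
        return self._raw


class LazyEncodeMixin:
    """Hypothetical mixin that encodes reference data on first access."""

    _encoded = None
    encode_calls = 0

    @property
    def x_ref(self):
        if self._encoded is None:
            self.encode_calls += 1  # count how often encoding actually runs
            self._encoded = [float(v) for v in self._raw]
        return self._encoded


class MixinFirst(LazyEncodeMixin, BaseDetector):
    """Mixin precedes the base class, so its x_ref wins in the MRO."""


class BaseFirst(BaseDetector, LazyEncodeMixin):
    """Base class precedes the mixin, so the base x_ref wins."""


lazy = MixinFirst([1, 2, 3])
first = lazy.x_ref   # triggers encoding
second = lazy.x_ref  # reuses the cached result

plain = BaseFirst([1, 2, 3])
print(first, lazy.encode_calls)
print(plain.x_ref, plain.encode_calls)
```

Swapping the order of the bases silently changes which property is used, which is why the ordering requirement is worth stating in the documentation.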
Classes
DriftMMD.Config | Configuration for DriftMMD detector.