dataeval.shift.DriftUnivariate

class dataeval.shift.DriftUnivariate(method=None, p_val=None, update_strategy=None, correction=None, alternative=None, n_features=None, extractor=None, config=None)

Drift detector using univariate statistical tests.

Detects distributional changes by comparing empirical distributions of reference and test datasets using classical statistical tests. For multivariate data, applies the test independently to each feature and aggregates results using multiple testing correction.

Uses a fit/predict lifecycle: construct with hyperparameters, call fit() with reference data, then call predict() with test data. Use chunked() to create a chunked wrapper for time-series monitoring.

Supports five statistical methods with different strengths:

  • Kolmogorov-Smirnov (ks): Measures maximum distance between empirical CDFs. General-purpose test, sensitive to middle portions of distributions. Supports directional alternatives for detecting systematic shifts.

  • Cramér-von Mises (cvm): Measures integrated squared distance between CDFs. More sensitive than KS to subtle distributional differences across the entire domain. Higher statistical power for many drift types.

  • Mann-Whitney U (mwu): Nonparametric rank-based test for stochastic ordering. Robust to outliers and effective for detecting location (median) shifts. Works well with non-normal distributions. Supports directional alternatives.

  • Anderson-Darling (anderson): Tests equality of distributions with emphasis on tail differences. More sensitive than KS to heavy-tailed distributions. Ideal for detecting drift in extreme values. Two-sided only.

  • Baumgartner-Weiss-Schindler (bws): Modern test emphasizing tail differences with higher power than KS. Balanced sensitivity to both tails and center. Supports directional alternatives. Requires scipy>=1.12.0.
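As an illustration of the quantity behind the ks method, the following is a minimal pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic for a single feature. The helper name `ks_statistic` is hypothetical and not part of dataeval; the library's own computation may differ in detail and also produces a p-value, which this sketch omits.

```python
from bisect import bisect_right


def ks_statistic(reference, test):
    """Maximum absolute distance between two empirical CDFs (hypothetical helper)."""
    ref = sorted(reference)
    tst = sorted(test)
    d = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so the maximum distance is attained at one of those values.
    for x in ref + tst:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_tst = bisect_right(tst, x) / len(tst)
        d = max(d, abs(cdf_ref - cdf_tst))
    return d


reference = [0.1, 0.4, 0.5, 0.9]
shifted = [1.1, 1.4, 1.5, 1.9]  # every value lies above the reference range

same = ks_statistic(reference, reference)  # identical samples
far = ks_statistic(reference, shifted)     # fully separated samples
```

Identical samples give a statistic of 0 and fully separated samples give 1; partially overlapping samples fall in between.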

Choosing a Method:

  • Use ks for general-purpose drift detection with directional testing

  • Use cvm for higher sensitivity to overall distributional changes

  • Use mwu for robust detection of median shifts, especially with outliers

  • Use anderson when tail behavior is critical (SLA violations, rare events)

  • Use bws for best overall power with tail sensitivity and directional testing

Parameters:
method : "ks", "cvm", "mwu", "anderson", or "bws", default "ks"

Statistical test method to use. See method descriptions above. Default “ks” provides a well-established general-purpose test.

p_val : float, default 0.05

Significance threshold for drift detection, between 0 and 1. Default 0.05 limits false drift alerts to 5% when no drift exists (Type I error rate).

update_strategy : UpdateStrategy or None, default None

Strategy for updating reference data when new data arrives. When None, reference data remains fixed throughout detection.

correction : "bonferroni" or "fdr", default "bonferroni"

Multiple testing correction method for multivariate drift detection. “bonferroni” provides conservative family-wise error control by dividing the significance threshold by the number of features. “fdr” uses the Benjamini-Hochberg procedure, which controls the false discovery rate and is less conservative. Default “bonferroni” minimizes false positive drift detections.
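The difference between the two corrections can be sketched in plain Python. The function names below are hypothetical, and the use of `<=` at the threshold is an assumption; this illustrates the standard Bonferroni and Benjamini-Hochberg decision rules rather than the library's internal code.

```python
def bonferroni_drift(p_values, p_val=0.05):
    """Flag drift if any feature passes the Bonferroni-adjusted threshold."""
    threshold = p_val / len(p_values)
    return min(p_values) <= threshold


def fdr_drift(p_values, p_val=0.05):
    """Flag drift if Benjamini-Hochberg rejects at least one feature."""
    m = len(p_values)
    ranked = sorted(p_values)
    # BH rejects something exactly when some i-th smallest p-value
    # is at or below (i / m) * p_val.
    return any(p <= p_val * (i / m) for i, p in enumerate(ranked, start=1))


# Four features with moderately small p-values.
p_values = [0.015, 0.02, 0.025, 0.5]

bonf = bonferroni_drift(p_values)  # threshold is 0.05 / 4 = 0.0125
bh = fdr_drift(p_values)           # second-smallest: 0.02 <= 0.025
```

With these p-values, no single feature clears the Bonferroni threshold, while the Benjamini-Hochberg rule does flag drift, which is the sense in which “fdr” is less conservative.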

alternative : "two-sided", "less" or "greater", default "two-sided"

Alternative hypothesis for the statistical test. Applies to: ks, mwu, bws methods only.

  • “two-sided”: detects any distributional difference

  • “less”: tests if test distribution is stochastically smaller

  • “greater”: tests if test distribution is stochastically larger

Default “two-sided” provides the most general drift detection. Ignored for cvm and anderson, which support only two-sided tests.

n_features : int | None, default None

Number of features to analyze in univariate tests. When None, automatically inferred from the flattened shape of the first data sample.

extractor : FeatureExtractor or None, default None

Optional feature extraction function to convert input data to arrays. When provided, enables drift detection on non-array inputs such as datasets, metadata, or raw model outputs. The extractor is applied to both reference and test data before drift detection. When None, data must already be Array-like.

config : DriftUnivariate.Config or None, default None

Optional configuration object with default parameters. Parameters specified directly in __init__ will override config defaults.
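The override rule can be pictured with a small stand-alone sketch. `SketchConfig` and `resolve` are hypothetical names, not the dataeval API; the sketch assumes that a constructor argument left at None means "use the config value", which matches the all-None defaults in the signature above but is not confirmed by this page.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical stand-in for a detector config with default parameters."""
    method: str = "ks"
    p_val: float = 0.05
    correction: str = "bonferroni"


def resolve(config=None, **explicit):
    """Merge explicit arguments over config defaults (None means 'not given')."""
    base = config if config is not None else SketchConfig()
    overrides = {name: value for name, value in explicit.items() if value is not None}
    return replace(base, **overrides)


config = SketchConfig(method="cvm", p_val=0.01)
# p_val is given explicitly and wins; method is None, so the config value is kept.
resolved = resolve(config, method=None, p_val=0.1)
```

Here `resolved` keeps `method="cvm"` from the config, takes `p_val=0.1` from the explicit argument, and falls back to the default `correction="bonferroni"`.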

See also

DriftUnivariate.Stats

Per-prediction statistics returned in DriftOutput.details.

Example

Basic drift detection with Kolmogorov-Smirnov test

>>> rng = np.random.default_rng(42)
>>> train_emb = rng.standard_normal((100, 128)).astype(np.float32)
>>> drift_detector = DriftUnivariate(method="ks").fit(train_emb)
>>> test_emb = np.zeros((20, 128), dtype=np.float32)
>>> result = drift_detector.predict(test_emb)
>>> print(f"Drift detected: {result.drifted}")
Drift detected: True

Chunked drift detection with z-score thresholds

>>> chunked = DriftUnivariate(method="ks").chunked(chunk_size=20)
>>> chunked.fit(train_emb)
ChunkedDrift(DriftUnivariate(method='ks', p_val=0.05, correction='bonferroni', alternative='two-sided', n_features=None, update_strategy=None, extractor=None), chunker=SizeChunker(chunk_size=20, incomplete='keep'), fitted=True)
>>> result = chunked.predict(test_emb)
>>> print(f"Drift detected: {result.drifted}, chunks: {len(result.details)}")
Drift detected: True, chunks: 1

Using configuration:

>>> config = DriftUnivariate.Config(method="cvm", p_val=0.01, correction="fdr")
>>> drift = DriftUnivariate(config=config).fit(train_emb)
chunked(chunker=None, chunk_size=None, chunk_count=None, threshold=None)

Create a chunked wrapper around this drift detector.

Returns a ChunkedDrift that splits data into chunks during fit and predict, computing per-chunk metrics and comparing against baseline thresholds.

Parameters:
chunker : BaseChunker or None, default None

Explicit chunker instance.

chunk_size : int or None, default None

Create fixed-size chunks of this many samples.

chunk_count : int or None, default None

Split into this many equal chunks.

threshold : Threshold or None, default None

Threshold strategy for determining drift bounds from baseline. When None, uses the detector’s default threshold.

Returns:

A chunked drift wrapper around this detector.

Return type:

ChunkedDrift[TDetails]
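The two chunking modes can be sketched as index ranges over the samples. The helpers below are hypothetical and independent of dataeval; keeping a shorter final chunk mirrors the `incomplete='keep'` setting visible in the repr above, while the way a remainder is spread across count-based chunks is an assumption.

```python
def chunk_by_size(n_samples, chunk_size):
    """Fixed-size chunks as (start, stop) pairs; a shorter last chunk is kept."""
    return [
        (start, min(start + chunk_size, n_samples))
        for start in range(0, n_samples, chunk_size)
    ]


def chunk_by_count(n_samples, chunk_count):
    """Split into chunk_count nearly equal chunks (earlier chunks take the remainder)."""
    base, extra = divmod(n_samples, chunk_count)
    bounds, start = [], 0
    for i in range(chunk_count):
        stop = start + base + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


by_size = chunk_by_size(50, 20)   # [(0, 20), (20, 40), (40, 50)]
by_count = chunk_by_count(10, 3)  # [(0, 4), (4, 7), (7, 10)]
```

With `chunk_size=20` and 20 test samples, as in the chunked example above, this yields a single chunk.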

fit(reference_data)

Fit detector with reference data.

Parameters:
reference_data : Any

Reference dataset used as the baseline for drift detection. Can be an Array or any type supported by the configured extractor.

Return type:

Self

predict(data)

Predict drift and optionally update reference data.

Performs feature-wise drift detection with multiple testing correction.

Parameters:
data : Any

Test dataset to analyze for drift.

Returns:

Drift prediction with per-feature statistics.

Return type:

DriftOutput[DriftUnivariate.Stats]

score(data)

Compute feature-wise p-values and test statistics.

Applies the detector’s statistical test independently to each feature, comparing the distribution of each feature between reference and test data.

Parameters:
data : Array

Test dataset to compare against reference data.

Returns:

First array contains p-values for each feature (all between 0 and 1). Second array contains test statistics for each feature (all >= 0). Both arrays have shape (n_features,).

Return type:

tuple[NDArray[np.float32], NDArray[np.float32]]

Notes

Lower p-values indicate stronger evidence of drift for that feature. Higher test statistics indicate greater distributional differences.
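One way to use the two returned arrays is to rank features by evidence of drift. The sketch below works on plain lists standing in for the p-value and statistic arrays; `rank_features` is a hypothetical helper and the input values are made up for illustration.

```python
def rank_features(p_values, statistics, top_k=3):
    """Return (index, p_value, statistic) for the features with the strongest drift evidence."""
    # Sort by ascending p-value; break ties by descending test statistic.
    order = sorted(
        range(len(p_values)),
        key=lambda i: (p_values[i], -statistics[i]),
    )
    return [(i, p_values[i], statistics[i]) for i in order[:top_k]]


# Illustrative per-feature values, not real detector output.
p_values = [0.5, 0.001, 0.2, 0.001]
statistics = [0.1, 0.6, 0.2, 0.8]

top = rank_features(p_values, statistics, top_k=2)
```

Features 3 and 1 share the lowest p-value, and feature 3 is listed first because its test statistic is larger.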

property n_features : int

Number of features in the reference data.

Lazily computes the number of features from the first data sample if not provided during initialization. Features correspond to the flattened dimensionality of the input data (e.g., pixels for images).

Returns:

Number of features (flattened dimensions) in the reference data. Always > 0 for valid datasets.

Return type:

int

Notes

For image data, this equals C x H x W. Computed once and cached for efficiency.
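The flattened feature count is the product of the per-sample dimensions, which can be checked with a short stand-alone sketch (the helper name is hypothetical):

```python
from math import prod


def flattened_feature_count(sample_shape):
    """Number of features after flattening one sample of the given shape."""
    return prod(sample_shape)


image_features = flattened_feature_count((3, 32, 32))  # C x H x W
embedding_features = flattened_feature_count((128,))   # already flat
```

A 3 x 32 x 32 image yields 3072 features, while a 128-dimensional embedding such as the one in the example above yields 128.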

property reference_data : numpy.typing.NDArray[numpy.float32]

Reference data, lazily encoded on first access.

Overrides BaseDrift.reference_data via MRO when this mixin appears before BaseDrift in the inheritance list.

Classes

Config

Configuration for DriftUnivariate detector.

Stats

Per-feature statistics from univariate drift detection.