dataeval.detectors.drift.DriftCVM
class dataeval.detectors.drift.DriftCVM(data, p_val=0.05, update_strategy=None, correction='bonferroni', n_features=None)
Drift detector using the Cramér-von Mises (CVM) Test.
Detects distributional changes in continuous data by comparing empirical cumulative distribution functions between reference and test datasets. For multivariate data, applies the CVM test independently to each feature and aggregates the results using either the Bonferroni or the False Discovery Rate (FDR) correction.
Because the CVM statistic accumulates squared differences between the two empirical distribution functions over the entire domain, it is particularly effective at detecting subtle distributional shifts and often provides higher power than the Kolmogorov-Smirnov test, which considers only the largest difference.
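To make the comparison with Kolmogorov-Smirnov concrete, the following minimal sketch computes both two-sample statistics directly from the empirical distribution functions. It uses the textbook definitions and only the standard library. The helper names (`ecdf`, `cvm_statistic`, `ks_statistic`) and the sample values are illustrative and are not part of dataeval, whose internal implementation may differ.

```python
from bisect import bisect_right


def ecdf(sorted_sample, x):
    """Empirical CDF: fraction of sample values that are <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def cvm_statistic(ref, test):
    """Two-sample Cramér-von Mises statistic (textbook form, statistic only)."""
    ref_sorted, test_sorted = sorted(ref), sorted(test)
    n, m = len(ref_sorted), len(test_sorted)
    # Sum squared ECDF differences at every pooled observation.
    total = sum(
        (ecdf(ref_sorted, z) - ecdf(test_sorted, z)) ** 2
        for z in ref_sorted + test_sorted
    )
    return n * m / (n + m) ** 2 * total


def ks_statistic(ref, test):
    """Two-sample Kolmogorov-Smirnov statistic: largest ECDF difference."""
    ref_sorted, test_sorted = sorted(ref), sorted(test)
    return max(
        abs(ecdf(ref_sorted, z) - ecdf(test_sorted, z))
        for z in ref_sorted + test_sorted
    )


reference = [0.1, 0.4, 0.5, 0.7, 0.9]    # illustrative feature values
shifted = [x + 0.25 for x in reference]  # hypothetical drifted feature
print(cvm_statistic(reference, reference))
print(cvm_statistic(reference, shifted))
print(ks_statistic(reference, shifted))
```

The sketch returns statistics only. Turning them into p-values requires the null distribution of each statistic, which is omitted here.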
- Parameters:
- data : Embeddings or Array
Reference dataset used as the baseline distribution for drift detection. Should represent the expected data distribution.
- p_val : float, default 0.05
Significance threshold for drift detection, between 0 and 1. The default of 0.05 limits false drift alerts to 5% when no drift exists (Type I error rate).
- update_strategy : UpdateStrategy or None, default None
Strategy for updating reference data when new data arrives. When None, reference data remains fixed throughout detection.
- correction : "bonferroni" or "fdr", default "bonferroni"
Multiple testing correction method for multivariate drift detection. "bonferroni" provides conservative family-wise error control by dividing the significance threshold by the number of features. "fdr" uses the Benjamini-Hochberg procedure for less conservative control. The default "bonferroni" minimizes false positive drift detections.
- n_features : int or None, default None
Number of features to analyze in univariate tests. When None, automatically inferred from the flattened shape of the first data sample.
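The difference between the two `correction` options can be sketched in plain Python. The functions below apply the standard Bonferroni and Benjamini-Hochberg rules to a list of per-feature p-values. The function names and the example p-values are hypothetical, and dataeval's internal implementation may differ in detail.

```python
def bonferroni_drift(p_vals, p_val=0.05):
    """Flag drift if any feature p-value is below p_val / n_features."""
    threshold = p_val / len(p_vals)
    return min(p_vals) < threshold


def fdr_drift(p_vals, p_val=0.05):
    """Flag drift if the Benjamini-Hochberg procedure rejects any feature."""
    n = len(p_vals)
    # Compare the i-th smallest p-value against its threshold p_val * i / n.
    return any(p <= p_val * i / n for i, p in enumerate(sorted(p_vals), start=1))


# Illustrative p-values for four features (not real results).
p_vals = [0.013, 0.02, 0.03, 0.5]
print(bonferroni_drift(p_vals))  # 0.013 is not below 0.05 / 4 = 0.0125
print(fdr_drift(p_vals))         # 0.02 <= 0.05 * 2 / 4 = 0.025
```

With these values the Bonferroni rule reports no drift while the Benjamini-Hochberg rule does, which illustrates why "fdr" is described as the less conservative choice.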
Example
Basic drift detection with image embeddings

>>> from dataeval.data import Embeddings
>>> from dataeval.detectors.drift import DriftCVM
>>> train_emb = Embeddings(train_images, model=encoder, batch_size=64)
>>> drift_detector = DriftCVM(train_emb)

Test incoming images for distributional drift

>>> result = drift_detector.predict(test_images)
>>> print(f"Drift detected: {result.drifted}")
Drift detected: True
>>> print(f"Mean CVM statistic: {result.distance:.4f}")
Mean CVM statistic: 24.1325

Using different correction methods

>>> drift_fdr = DriftCVM(train_emb, correction="fdr", p_val=0.1)
>>> result = drift_fdr.predict(test_images)

Access feature level results

>>> n_features = result.feature_drift
>>> print(f"Features showing drift: {n_features.sum()} / {len(n_features)}")
Features showing drift: 576 / 576

- predict(data)
Predict drift and update reference data using the specified strategy.
Performs feature-wise drift detection, applies multiple testing correction, and optionally updates the reference dataset based on the configured update strategy.
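One way to picture the optional update step is a sliding window that keeps only the most recently seen samples as the reference. The helper below is a hypothetical stand-in written with plain lists. It is not dataeval's `UpdateStrategy` API, and the window size `n` is an assumed parameter.

```python
def last_seen_update(x_ref, x_new, n):
    """Return the n most recent rows of the reference followed by the new data."""
    if n <= 0:
        raise ValueError("n must be a positive window size")
    combined = list(x_ref) + list(x_new)
    return combined[-n:]


x_ref = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # illustrative reference rows
x_new = [[0.7, 0.8], [0.9, 1.0]]              # illustrative incoming rows
x_ref = last_seen_update(x_ref, x_new, n=3)
print(x_ref)
```

After the update, the oldest two reference rows have been replaced by the incoming rows, so later comparisons are made against a more recent baseline.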
- Parameters:
- data : Embeddings or Array
Test dataset to analyze for drift against reference data.
- Returns:
Complete drift detection results including overall drift prediction, corrected thresholds, feature-level analysis, and summary statistics.
- Return type:
- score(data)
Calculate feature-wise p-values and test statistics.
Applies the detector’s statistical test independently to each feature, comparing the distribution of each feature between reference and test data.
- Parameters:
- data : Embeddings or Array
Test dataset to compare against reference data.
- Returns:
First array contains p-values for each feature (all between 0 and 1). Second array contains test statistics for each feature (all >= 0). Both arrays have shape (n_features,).
- Return type:
tuple[NDArray[np.float32], NDArray[np.float32]]
Notes
Lower p-values indicate stronger evidence of drift for that feature. Higher test statistics indicate greater distributional differences.
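As a small illustration of how the two arrays might be read together, the sketch below ranks features by p-value and lists those that fall under a Bonferroni-adjusted threshold. The values are made up for illustration and the helper name is hypothetical. With a real detector they would come from `score(data)`.

```python
def rank_drifting_features(p_vals, stats, p_val=0.05):
    """Return (index, p-value, statistic) for features below the adjusted threshold."""
    threshold = p_val / len(p_vals)  # Bonferroni-adjusted per-feature threshold
    flagged = [
        (i, p, s) for i, (p, s) in enumerate(zip(p_vals, stats)) if p < threshold
    ]
    # Smallest p-value (strongest evidence of drift) first.
    return sorted(flagged, key=lambda item: item[1])


# Illustrative values only; in practice: p_vals, stats = detector.score(data)
p_vals = [0.40, 0.001, 0.0004, 0.03]
stats = [0.12, 1.85, 2.40, 0.46]
for index, p, stat in rank_drifting_features(p_vals, stats):
    print(f"feature {index}: p={p:.4f}, statistic={stat:.2f}")
```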
- property n_features : int
Number of features in the reference data.
Lazily computes the number of features from the first data sample if not provided during initialization. Features correspond to the flattened dimensionality of the input data (e.g., pixels for images).
- Returns:
Number of features (flattened dimensions) in the reference data. Always > 0 for valid datasets.
- Return type:
int
Notes
For image data, this equals C x H x W. Computed once and cached for efficiency.
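As a quick arithmetic check, the flattened feature count is the product of the sample's dimensions. The image shape below is a hypothetical example, not a value taken from the library.

```python
import math

image_shape = (3, 32, 32)  # hypothetical (C, H, W) image shape
n_features = math.prod(image_shape)
print(n_features)  # 3 * 32 * 32 = 3072
```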
- property x_ref : numpy.typing.NDArray[numpy.float32]
Reference data for drift detection.
Lazily encodes the reference dataset on first access. Data is flattened and converted to 32-bit floating point for consistent numerical processing across different input types.
- Returns:
Reference data as flattened 32-bit floating point array. Shape is (n_samples, n_features_flattened).
- Return type:
NDArray[np.float32]
Notes
Data is cached after first access to avoid repeated encoding overhead.
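The lazy-then-cached behaviour described here can be pictured with `functools.cached_property`. The class below is a hypothetical illustration using plain Python lists and floats. It is not dataeval's implementation, which works with NumPy arrays and may cache differently.

```python
from functools import cached_property


class LazyReference:
    """Hypothetical illustration of lazy, cached reference encoding."""

    def __init__(self, samples):
        self._samples = samples
        self.encode_calls = 0  # counts how often the encoding actually runs

    @cached_property
    def x_ref(self):
        self.encode_calls += 1
        # Flatten each 2-D sample into one row of floats.
        return [[float(v) for row in sample for v in row] for sample in self._samples]


ref = LazyReference([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
first = ref.x_ref   # encoding runs here
second = ref.x_ref  # cached value is returned
print(first, ref.encode_calls)
```

Nothing is encoded when the object is constructed. The work happens on the first access, and every later access reuses the stored result.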