Drift MMD
Drift refers to the phenomenon where the statistical properties of the data change over time. It occurs when the underlying distribution of the input features or the target variable (what the model is trying to predict) shifts, leading to a discrepancy between the training data and the real-world data the model encounters during deployment.
Drawing on the methods examined in the NeurIPS 2019 paper Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift, we can apply several tests to determine whether drift has occurred. For high-dimensional data, we typically reduce the dimensionality before running tests against the dataset. To do so, we provide Untrained AutoEncoders (UAE) and Black-Box Shift Estimation (BBSE) predictors based on the classifier’s softmax outputs as out-of-the-box preprocessing methods, and note that Principal Component Analysis can also be implemented easily using scikit-learn. Preprocessing methods that do not rely on the classifier will usually pick up drift in the input data, while BBSE focuses on label shift.
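As an illustration of reducing dimensionality before a drift test, here is a minimal PCA sketch. It uses a plain NumPy SVD rather than scikit-learn so that it stays self-contained. The helper name `fit_pca` and the synthetic data are assumptions made for this example and are not part of DataEval.

```python
import numpy as np


def fit_pca(x_ref: np.ndarray, n_components: int):
    """Fit PCA on reference data and return a function that projects new data."""
    # Flatten each instance (e.g. an image) into a single feature vector.
    x_flat = x_ref.reshape(len(x_ref), -1).astype(np.float64)
    mean = x_flat.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(x_flat - mean, full_matrices=False)
    components = vt[:n_components]

    def transform(x: np.ndarray) -> np.ndarray:
        x = np.asarray(x, dtype=np.float64).reshape(len(x), -1)
        return (x - mean) @ components.T

    return transform


rng = np.random.default_rng(0)
x_ref = rng.normal(size=(200, 8, 8))  # stand-in for 200 small images
reduce_dim = fit_pca(x_ref, n_components=4)
x_ref_low = reduce_dim(x_ref)  # low-dimensional features for the drift test
```

A function shaped like `reduce_dim` (array in, array out) is the kind of callable that a preprocessing hook such as `preprocess_fn` expects. The projection is fitted on the reference data only, so test batches are mapped into the same low-dimensional space.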
How-To Guides
Check out this how-to to begin using the Drift Detection class.
DataEval API
Maximum Mean Discrepancy
The Maximum Mean Discrepancy (MMD) detector is a kernel-based method for multivariate two-sample testing. The MMD is a distance-based measure between two distributions \(p\) and \(q\) based on the mean embeddings \(\mu_{p}\) and \(\mu_{q}\) in a reproducing kernel Hilbert space \(F\):
\[MMD^2(F, p, q) = \| \mu_{p} - \mu_{q} \|^2_{F}\]
We can compute unbiased estimates of \(MMD^2\) from the samples of the two distributions after applying the kernel trick. By default we use a radial basis function kernel, but users are free to pass their own kernel of preference to the detector. We obtain a \(p\)-value via a permutation test on the values of \(MMD^2\).
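The estimate and the permutation test described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not DataEval's implementation: the function names, the fixed bandwidth `sigma=1.0`, and the synthetic data are all hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq_dist = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * sigma**2))


def mmd2_unbiased(k, n):
    """Unbiased MMD^2 from the kernel matrix of the stacked samples.

    The first n rows/columns of k belong to the reference sample."""
    m = k.shape[0] - n
    k_xx, k_yy, k_xy = k[:n, :n], k[n:, n:], k[:n, n:]
    # Exclude the diagonal (i == j) terms to keep the estimate unbiased.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (n * (n - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (m * (m - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()


def mmd_permutation_test(x_ref, x, sigma=1.0, n_permutations=100, seed=0):
    rng = np.random.default_rng(seed)
    stacked = np.concatenate([x_ref, x])
    k = rbf_kernel(stacked, stacked, sigma)  # computed once, reused per permutation
    n = len(x_ref)
    observed = mmd2_unbiased(k, n)
    permuted = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffling sample membership simulates the "no drift" null hypothesis.
        idx = rng.permutation(len(stacked))
        permuted[i] = mmd2_unbiased(k[np.ix_(idx, idx)], n)
    p_value = float((permuted >= observed).mean())
    return p_value, float(observed), permuted


rng = np.random.default_rng(1)
x_ref = rng.normal(0.0, 1.0, size=(100, 5))
x_same = rng.normal(0.0, 1.0, size=(100, 5))
x_shifted = rng.normal(1.0, 1.0, size=(100, 5))  # mean shift in every feature
p_same, mmd2_same, _ = mmd_permutation_test(x_ref, x_same)
p_shift, mmd2_shift, _ = mmd_permutation_test(x_ref, x_shifted)
```

The p-value is the fraction of permuted \(MMD^2\) values that are at least as large as the observed one, so a strongly shifted batch such as `x_shifted` yields a p-value near zero.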
- class dataeval.detectors.DriftMMD(x_ref: ArrayLike, p_val: float = 0.05, x_ref_preprocessed: bool = False, update_x_ref: UpdateStrategy | None = None, preprocess_fn: Callable[[ArrayLike], ArrayLike] | None = None, kernel: Callable = GaussianRBF, sigma: ArrayLike | None = None, configure_kernel_from_x_ref: bool = True, n_permutations: int = 100, device: str | None = None)
Maximum Mean Discrepancy (MMD) data drift detector using a permutation test.
- Parameters:
x_ref (ArrayLike) – Data used as reference distribution.
p_val (float, default 0.05) – p-value used for the significance of the permutation test.
x_ref_preprocessed (bool, default False) – Whether the given reference data x_ref has been preprocessed yet. If x_ref_preprocessed=True, only the test data x will be preprocessed at prediction time. If x_ref_preprocessed=False, the reference data will also be preprocessed.
preprocess_at_init (bool, default True) – Whether to preprocess the reference data when the detector is instantiated. Otherwise, the reference data will be preprocessed at prediction time. Only applies if x_ref_preprocessed=False.
update_x_ref (Optional[UpdateStrategy], default None) – Reference data can optionally be updated using an UpdateStrategy class. Update using the last n instances seen by the detector with dataeval.detectors.LastSeenUpdateStrategy or via reservoir sampling with dataeval.detectors.ReservoirSamplingUpdateStrategy.
preprocess_fn (Optional[Callable], default None) – Function to preprocess the data before computing the data drift metrics.
kernel (Callable, default dataeval.detectors.GaussianRBF) – Kernel used for the MMD computation, defaults to Gaussian RBF kernel.
sigma (Optional[ArrayLike], default None) – Optionally set the GaussianRBF kernel bandwidth. Can also pass multiple bandwidth values as an array. The kernel evaluation is then averaged over those bandwidths.
configure_kernel_from_x_ref (bool, default True) – Whether to already configure the kernel bandwidth from the reference data.
n_permutations (int, default 100) – Number of permutations used in the permutation test.
device (Optional[str], default None) – Device type used. The default None uses the GPU and falls back on CPU. Can be specified by passing either ‘cuda’, ‘gpu’ or ‘cpu’.
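To make the sigma and configure_kernel_from_x_ref descriptions more concrete, the sketch below averages an RBF kernel over several bandwidths and derives a bandwidth from the reference data. The median-distance rule shown here is a common heuristic and is only an assumption about how a bandwidth might be configured from x_ref; the source does not state which rule DataEval uses. All function names are hypothetical.

```python
import numpy as np


def squared_distances(a, b):
    """Pairwise squared Euclidean distances, clipped at zero for numerical safety."""
    sq = (a**2).sum(1)[:, None] + (b**2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.maximum(sq, 0.0)


def gaussian_rbf(a, b, sigma):
    """RBF kernel averaged over one or more bandwidths."""
    sigmas = np.atleast_1d(np.asarray(sigma, dtype=np.float64))
    sq = squared_distances(a, b)
    # One kernel matrix per bandwidth, then the mean across bandwidths.
    kernels = np.exp(-sq[None, :, :] / (2.0 * sigmas[:, None, None] ** 2))
    return kernels.mean(axis=0)


def median_heuristic(x_ref):
    """Assumed bandwidth rule: median pairwise distance within the reference data."""
    sq = squared_distances(x_ref, x_ref)
    upper = sq[np.triu_indices(len(x_ref), k=1)]  # each distinct pair once
    return float(np.sqrt(np.median(upper)))


rng = np.random.default_rng(0)
x_ref = rng.normal(size=(50, 3))
sigma_ref = median_heuristic(x_ref)
k_single = gaussian_rbf(x_ref, x_ref, sigma_ref)
k_multi = gaussian_rbf(x_ref, x_ref, [0.5 * sigma_ref, sigma_ref, 2.0 * sigma_ref])
```

Averaging over several bandwidths makes the test less sensitive to a single poorly chosen length scale, at the cost of computing one kernel matrix per bandwidth.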
- predict(x: ArrayLike) → Dict[str, int | float]
Predict whether a batch of data has drifted from the reference data, and then update the reference data using the specified strategy.
- Parameters:
x (ArrayLike) – Batch of instances.
- Return type:
Dictionary containing the drift prediction, p-value, threshold and MMD metric.
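The two reference-update ideas mentioned for update_x_ref (keeping the last n instances, or reservoir sampling) can be sketched as follows. These standalone functions are hypothetical illustrations of the underlying techniques and do not reproduce the LastSeenUpdateStrategy or ReservoirSamplingUpdateStrategy classes.

```python
import numpy as np


def update_last_seen(x_ref, x_new, n):
    """Keep only the n most recent instances."""
    return np.concatenate([x_ref, x_new])[-n:]


def update_reservoir(x_ref, x_new, n, n_seen, rng):
    """Reservoir sampling (Algorithm R).

    Every instance observed so far ends up in the reference set with equal
    probability. n_seen counts the instances observed before this batch."""
    reservoir = list(x_ref)
    for item in x_new:
        n_seen += 1
        if len(reservoir) < n:
            reservoir.append(item)
        else:
            j = rng.integers(0, n_seen)  # uniform integer in [0, n_seen)
            if j < n:
                reservoir[j] = item  # replace a slot with probability n / n_seen
    return np.stack(reservoir), n_seen


rng = np.random.default_rng(0)
x_ref = np.arange(10, dtype=float).reshape(10, 1)
batch = np.arange(10, 15, dtype=float).reshape(5, 1)
ref_last = update_last_seen(x_ref, batch, n=10)
ref_res, seen = update_reservoir(x_ref, batch, n=10, n_seen=10, rng=rng)
```

The last-seen approach tracks gradual change closely but forgets the original reference, while reservoir sampling keeps a uniform sample of everything seen so far.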
- score(x: ArrayLike) → Tuple[float, float, float]
Compute the p-value resulting from a permutation test using the maximum mean discrepancy as a distance measure between the reference data and the data to be tested.
- Parameters:
x (ArrayLike) – Batch of instances.
- Returns:
p-value obtained from the permutation test, the MMD^2 between the reference and test set, and the MMD^2 threshold above which drift is flagged.
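One way the three returned quantities could relate to a drift decision is sketched below. The thresholding rule (taking the permuted \(MMD^2\) value at the position given by p_val in the descending ordering) and the dictionary keys are assumptions for illustration, since the source does not list the exact keys or the internal rule.

```python
import numpy as np


def summarize_permutation_test(mmd2, mmd2_permuted, p_val=0.05):
    """Turn an observed MMD^2 and its permuted values into a drift summary."""
    permuted = np.asarray(mmd2_permuted, dtype=np.float64)
    # Fraction of permutations at least as extreme as the observed statistic.
    p_value = float((permuted >= mmd2).mean())
    # MMD^2 level exceeded by roughly a fraction p_val of the permutations.
    ordered = np.sort(permuted)[::-1]
    distance_threshold = float(ordered[int(p_val * len(ordered))])
    return {
        "is_drift": int(p_value < p_val),  # hypothetical key names throughout
        "p_val": p_value,
        "threshold": p_val,
        "distance": float(mmd2),
        "distance_threshold": distance_threshold,
    }


permuted = np.arange(100) / 1000.0  # toy null distribution: 0.000, 0.001, ..., 0.099
clear_drift = summarize_permutation_test(0.2, permuted)
no_drift = summarize_permutation_test(0.0505, permuted)
```

In this toy setup an observed value above every permuted value gives a p-value of zero and flags drift, while a value in the middle of the null distribution does not.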