Maximum Mean Discrepancy
Maximum Mean Discrepancy (MMD) drift detection is a kernel-based method for comparing two distributions by computing the distance between their mean embeddings in a reproducing kernel Hilbert space (RKHS). The MMD test statistic is defined as:
\[ \textrm{MMD}(p, q) = \|\mu_{p} - \mu_{q}\|_{\mathcal{H}} \]
where \(\mu_{p}\) and \(\mu_{q}\) are the mean embeddings of distributions \(p\) and \(q\) in the RKHS \(\mathcal{H}\). The MMD test is particularly useful for detecting complex, multivariate distributional differences. Unbiased estimates of \(\textrm{MMD}^2\) can be computed from kernel evaluations alone (the kernel trick), and a permutation test is used to obtain the p-value.
A common choice for the kernel is the radial basis function (RBF) kernel, though other kernels can be used depending on the application.
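As a sketch of how the kernel trick makes the statistic computable, the squared MMD can be written purely in terms of kernel expectations, and a standard unbiased estimate replaces those expectations with sample averages. The notation here is introduced for illustration and is an assumption, not taken from the text above: \(x_1, \dots, x_m\) are reference samples from \(p\), \(y_1, \dots, y_n\) are test samples from \(q\), and \(k\) is the kernel.
\[ \textrm{MMD}^2(p, q) = \mathbb{E}_{x, x' \sim p}[k(x, x')] + \mathbb{E}_{y, y' \sim q}[k(y, y')] - 2\,\mathbb{E}_{x \sim p,\, y \sim q}[k(x, y)] \]
\[ \widehat{\textrm{MMD}}^2_u = \frac{1}{m(m-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j) \]
Excluding the \(i = j\) terms from the two within-sample sums is what makes the estimate unbiased. One consequence is that the estimate can be slightly negative when the two samples come from the same distribution.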
Key characteristics:
Kernel trick: Compares samples in a high-dimensional (possibly infinite-dimensional) feature space implicitly, using only kernel evaluations
Multivariate: Naturally handles multiple features and their dependencies
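Universal: With characteristic kernels such as the RBF kernel, the population MMD is zero only if the two distributions are identical, so any distributional difference can be detected given enough samples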
Non-parametric: No assumptions about distribution shapes
Interpretability: Lower than univariate tests; doesn’t identify which features drifted
Common kernels:
Radial Basis Function (RBF) / Gaussian kernel: \( k(x, y) = \exp\left(-\frac{\|x-y\|^2}{2\sigma^2}\right) \)
Most common choice; universal kernel
Bandwidth \(\sigma\) controls sensitivity to local vs. global differences
Polynomial kernel: \( k(x, y) = (x^T y + c)^d \)
Captures polynomial interactions up to degree \(d\)
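The two kernels above can be sketched in NumPy as follows. This is a minimal illustration, not the implementation of any particular library: the function names `rbf_kernel` and `polynomial_kernel` and the default parameter values are assumptions chosen for the example. Each function takes two arrays of shape `(n_samples, n_features)` and returns the full kernel matrix.

```python
import numpy as np


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian kernel matrix: K[i, j] = exp(-||x_i - y_j||^2 / (2 * sigma^2))."""
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    # Round-off can produce tiny negative values; clip them to zero.
    sq_dists = np.maximum(sq_dists, 0.0)
    return np.exp(-sq_dists / (2.0 * sigma**2))


def polynomial_kernel(x, y, c=1.0, d=2):
    """Polynomial kernel matrix: K[i, j] = (x_i . y_j + c)^d."""
    return (x @ y.T + c) ** d


# Effect of the bandwidth on two points at squared distance 2:
a = np.array([[0.0, 0.0]])
b = np.array([[1.0, 1.0]])
for sigma in (0.1, 1.0, 10.0):
    print(f"sigma={sigma}: k(a, b) = {rbf_kernel(a, b, sigma)[0, 0]:.3g}")
```

With a small bandwidth the kernel value between distinct points collapses toward 0, so only very close points count as similar (local sensitivity). With a large bandwidth the value approaches 1 for all pairs, so only large-scale differences between the samples remain visible.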
Statistical testing:
A permutation test is used to obtain the p-value:
Pool reference and test samples
Repeatedly shuffle the pooled sample and split it into two groups of the original sizes
Compute MMD for each permutation
P-value = proportion of permutations with MMD ≥ observed MMD
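The four steps above can be sketched in NumPy as follows. This is a hedged illustration under stated assumptions rather than a reference implementation: it uses an RBF kernel with a fixed bandwidth, the unbiased \(\textrm{MMD}^2\) estimate as the statistic, and hypothetical function names (`rbf_kernel_matrix`, `mmd2_unbiased`, `mmd_permutation_test`) and defaults (`sigma`, `n_permutations`, `seed`) chosen for the example.

```python
import numpy as np


def rbf_kernel_matrix(z, sigma):
    """RBF kernel matrix of a sample with itself."""
    sq_norms = np.sum(z**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * z @ z.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip round-off negatives
    return np.exp(-sq_dists / (2.0 * sigma**2))


def mmd2_unbiased(K, idx_x, idx_y):
    """Unbiased MMD^2 estimate from a precomputed pooled kernel matrix K."""
    m, n = len(idx_x), len(idx_y)
    K_xx = K[np.ix_(idx_x, idx_x)]
    K_yy = K[np.ix_(idx_y, idx_y)]
    K_xy = K[np.ix_(idx_x, idx_y)]
    # Drop the diagonal terms k(x_i, x_i) from the within-sample sums.
    term_xx = (K_xx.sum() - np.trace(K_xx)) / (m * (m - 1))
    term_yy = (K_yy.sum() - np.trace(K_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * K_xy.mean()


def mmd_permutation_test(x_ref, x_test, sigma=1.0, n_permutations=200, seed=0):
    """Return (observed MMD^2, p-value) for reference vs. test samples."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x_ref, x_test])       # step 1: pool the samples
    K = rbf_kernel_matrix(pooled, sigma)      # computed once, reused below
    m, total = len(x_ref), len(pooled)
    observed = mmd2_unbiased(K, np.arange(m), np.arange(m, total))
    null_stats = np.empty(n_permutations)
    for b in range(n_permutations):
        perm = rng.permutation(total)         # step 2: shuffle and split
        null_stats[b] = mmd2_unbiased(K, perm[:m], perm[m:])  # step 3
    p_value = np.mean(null_stats >= observed)  # step 4: proportion >= observed
    return observed, p_value


# Synthetic example: a mean shift of 1.0 in every feature.
rng = np.random.default_rng(42)
x_ref = rng.normal(0.0, 1.0, size=(100, 5))
x_shifted = rng.normal(1.0, 1.0, size=(100, 5))
stat, p = mmd_permutation_test(x_ref, x_shifted, sigma=2.0)
print(f"MMD^2 = {stat:.4f}, p-value = {p:.3f}")
```

Because permuting the group labels only changes which rows and columns of the pooled kernel matrix belong to each group, the matrix is computed once and re-indexed for every permutation. The cost is still quadratic in the pooled sample size, both in memory and per permutation.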
When to use:
Image/video embeddings (ResNet, CLIP, ViT, etc.) - primary use case
High-dimensional data where feature interactions matter
When drift involves changes in correlations between features
Deep learning computer vision applications
Cross-domain shifts (e.g., synthetic → real, indoor → outdoor)
When univariate tests fail to detect known drift
Limitations:
Computationally expensive for large datasets (quadratic in sample size)
Kernel selection and hyperparameter tuning required
Limited interpretability (doesn’t indicate which features drifted)
Requires sufficient samples for reliable permutation testing