Drift Detection Tutorial Using Multiple Drift Detectors
Problem Statement
When evaluating and monitoring data after model deployment, it is important to test incoming data for potential drift which may affect model performance.
When to use
Use the drift detection classes in dataeval.detectors.drift when you want to test new data for operational drift against a reference dataset.
What you will need
A set of image embeddings for each dataset (usually obtained with an autoencoder)
A python environment with the following packages installed:
dataeval[torch] or dataeval[all]
Setting up
Let’s import the libraries needed to set up a minimal working example.
from functools import partial
import numpy as np
import torch
from dataeval.detectors.drift import (
    DriftCVM,
    DriftKS,
    DriftMMD,
    preprocess_drift,
)
from dataeval.utils.torch.datasets import MNIST
from dataeval.utils.torch.models import AriaAutoencoder
device = "cuda" if torch.cuda.is_available() else "cpu"
Loading in data
Let’s start by loading the MNIST dataset with the MNIST class imported above, then examine it.
# Load in the training mnist dataset and use the first 4000
train_ds = MNIST(root="./data/", train=True, download=True, size=4000, dtype=np.float32, channels="channels_first")
# Split out the images and labels
images, labels = train_ds.data, train_ds.targets
Files already downloaded and verified
print("Number of samples: ", len(images))
print("Image shape:", images[0].shape)
Number of samples: 4000
Image shape: (1, 28, 28)
Test reference against control
Let’s check for drift between the first 2000 images and the second 2000 images from this sample.
data_reference = images[0:2000]
data_control = images[2000:]
To reduce the dimensionality of the data, we can pass the encoder of a simple autoencoder as the model used by preprocess_fn. While this is optional for the MNIST dataset, it is highly recommended for datasets with higher dimensionality.
For the purposes of the tutorial, we will use 3 forms of drift detectors: Maximum Mean Discrepancy (MMD), Cramér-von Mises (CVM), and Kolmogorov-Smirnov (KS).
# define encoder
encoder_net = AriaAutoencoder(1).encoder.to(device)
# define preprocessing function
preprocess_fn = partial(preprocess_drift, model=encoder_net, batch_size=64, device=device)
# initialise drift detectors
detectors = [detector(data_reference, preprocess_fn=preprocess_fn) for detector in [DriftMMD, DriftCVM, DriftKS]]
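For intuition about what a univariate detector such as DriftKS measures, here is a minimal NumPy sketch of the two-sample Kolmogorov-Smirnov statistic on a single feature. It is an illustration only, not dataeval's implementation; the function name ks_statistic and the synthetic Gaussian data are assumptions made for this example.

```python
import numpy as np


def ks_statistic(x_ref, x_test) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    x_ref = np.sort(np.asarray(x_ref, dtype=float))
    x_test = np.sort(np.asarray(x_test, dtype=float))
    # Evaluate both empirical CDFs at every observed value
    grid = np.concatenate([x_ref, x_test])
    cdf_ref = np.searchsorted(x_ref, grid, side="right") / len(x_ref)
    cdf_test = np.searchsorted(x_test, grid, side="right") / len(x_test)
    return float(np.max(np.abs(cdf_ref - cdf_test)))


# Synthetic stand-ins for one embedding feature (hypothetical data)
rng = np.random.default_rng(0)
same = ks_statistic(rng.normal(0, 1, 2000), rng.normal(0, 1, 2000))
shifted = ks_statistic(rng.normal(0, 1, 2000), rng.normal(1, 1, 2000))
print(f"same distribution: {same:.3f}, shifted distribution: {shifted:.3f}")
```

The statistic stays close to 0 when both samples come from the same distribution and grows toward 1 as the distributions separate. A full test converts it into a p-value, which is then compared with a significance threshold.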
We expect every detector to report no drift, since the reference and control sets both come from the same MNIST training data. A True result at this step would indicate a false positive or an unintended difference between the two halves.
[(type(detector).__name__, detector.predict(data_control).is_drift) for detector in detectors]
[('DriftMMD', True), ('DriftCVM', True), ('DriftKS', True)]
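Because the same comparison is repeated for every new dataset, it can be convenient to wrap it in a small helper. The sketch below relies only on the predict method and the is_drift attribute used above; the helper name summarize_drift and the FakeDetector stand-in are hypothetical and exist only so the example runs without dataeval installed.

```python
from types import SimpleNamespace


def summarize_drift(detectors, data):
    """Return {detector class name: is_drift} for each detector."""
    return {type(d).__name__: bool(d.predict(data).is_drift) for d in detectors}


class FakeDetector:
    """Hypothetical stand-in that mimics the predict(...).is_drift interface."""

    def __init__(self, flag):
        self.flag = flag

    def predict(self, data):
        return SimpleNamespace(is_drift=self.flag)


results = summarize_drift([FakeDetector(False)], data=[0, 1, 2])
print(results)
print("any drift detected:", any(results.values()))
```

With the real detectors, summarize_drift(detectors, data_control) would return one entry per detector class.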
Loading in corrupted data
Now let’s load in a corrupted MNIST dataset.
corruption = MNIST(
    root="./data",
    train=True,
    download=False,
    size=2000,
    dtype=np.float32,
    channels="channels_first",
    corruption="translate",
)
corrupted_images = corruption.data
Files already downloaded and verified
print("Number of corrupted samples: ", len(corrupted_images))
print("Corrupted image shape:", corrupted_images[0].shape)
Number of corrupted samples: 2000
Corrupted image shape: (1, 28, 28)
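To make the idea of a translation corruption concrete, the following NumPy sketch shifts a batch of channels-first images by a fixed pixel offset and zero-fills the exposed border. It is not necessarily how the corrupted MNIST data was generated; the function name translate and the fixed offsets are assumptions made for illustration.

```python
import numpy as np


def translate(images: np.ndarray, dx: int, dy: int) -> np.ndarray:
    """Shift (N, C, H, W) images by dx columns and dy rows, zero-filling the border."""
    shifted = np.zeros_like(images)
    h, w = images.shape[-2:]
    # Source and destination windows that still overlap after the shift
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    shifted[..., dst_y, dst_x] = images[..., src_y, src_x]
    return shifted


# Tiny 4x4 example: a single bright pixel moves one column right and two rows down
batch = np.zeros((1, 1, 4, 4), dtype=np.float32)
batch[0, 0, 1, 1] = 1.0
moved = translate(batch, dx=1, dy=2)
print(moved[0, 0])
```

Pixel intensities are unchanged and only their positions move, so a shift like this alters the per-pixel distribution of the dataset without altering the digits themselves.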
Check for drift against corrupted data
Test for drift between the corrupted dataset and the original reference set using all 3 detectors.
[(type(detector).__name__, detector.predict(corrupted_images).is_drift) for detector in detectors]
[('DriftMMD', True), ('DriftCVM', True), ('DriftKS', True)]
We conclude that the translated MNIST images are significantly different from the original images according to all 3 measures of drift.
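For readers curious about the multivariate test, here is a minimal NumPy sketch of a squared Maximum Mean Discrepancy estimate with an RBF kernel and a permutation test for the p-value. It illustrates the general technique under simplifying assumptions (a fixed kernel bandwidth sigma, the biased estimator, synthetic Gaussian data) and is not dataeval's DriftMMD implementation; all function names here are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, sigma):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma**2))


def mmd2(x, y, sigma):
    """Biased estimate of the squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, sigma).mean()
    k_yy = rbf_kernel(y, y, sigma).mean()
    k_xy = rbf_kernel(x, y, sigma).mean()
    return float(k_xx + k_yy - 2 * k_xy)


def mmd_permutation_test(x, y, sigma=3.0, n_permutations=200, seed=0):
    """Return (observed MMD^2, p-value) using random relabelling of the pooled data."""
    rng = np.random.default_rng(seed)
    observed = mmd2(x, y, sigma)
    pooled = np.concatenate([x, y])
    n = len(x)
    count = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        if mmd2(pooled[perm[:n]], pooled[perm[n:]], sigma) >= observed:
            count += 1
    # Add-one correction keeps the p-value strictly positive
    return observed, (count + 1) / (n_permutations + 1)


# Synthetic 8-dimensional "embeddings" (hypothetical data)
rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, size=(100, 8))
control = rng.normal(0.0, 1.0, size=(100, 8))
shifted = rng.normal(0.75, 1.0, size=(100, 8))

d_same, p_same = mmd_permutation_test(reference, control)
d_shift, p_shift = mmd_permutation_test(reference, shifted)
print(f"control: MMD^2={d_same:.4f}, p={p_same:.3f}")
print(f"shifted: MMD^2={d_shift:.4f}, p={p_shift:.3f}")
```

Drift is flagged when the p-value falls below the chosen significance level. The bandwidth sigma strongly affects sensitivity, so in practice it is usually derived from the data rather than fixed as it is here.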