Monitor shifts in operational data¶
This guide provides a beginner-friendly introduction to monitoring post-deployment data shifts.
Estimated time to complete: 5 minutes
Relevant ML stages: Monitoring
Relevant personas: Machine Learning Engineer, T&E Engineer
What you’ll do¶
Construct embeddings using a pretrained neural network
Compare the embeddings between a training and operational set
Compare the label distributions between a training and operational set
What you’ll learn¶
Learn how to analyze embeddings for operational drift
Learn how to analyze label distributions
What you’ll need¶
Knowledge of Python
Beginner knowledge of PyTorch or neural networks
Introduction¶
Monitoring is a critical step in the AI/ML lifecycle. When a model is deployed, data can, and generally will, drift from the distribution on which the model was originally trained. One critical step in AI T&E is the detection of changes in the operational distribution so that they may be proactively addressed. While some change might not affect performance, significant deviation is often associated with model degradation.
For this tutorial, you will use the popular 2012 VOC computer vision dataset to detect drift between the image distribution of the train split and that of the val split, which will represent an operational dataset in this guide. You will then determine whether the labels within these two datasets have high parity, that is, equivalent label distributions.
Setup¶
You’ll begin by importing the necessary libraries for this tutorial.
import numpy as np
import torch
import torch.nn as nn
from maite_datasets.object_detection import VOCDetectionTorch
from torchvision import models
from torchvision.transforms.v2 import GaussianNoise
from dataeval.data import Embeddings, Metadata
from dataeval.detectors.drift import DriftCVM, DriftKS, DriftMMD
from dataeval.metrics.bias import label_parity
# Set a random seed
rng = np.random.default_rng(213)
# Set default torch device for notebook
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.set_default_device(device)
More on device
The device is set above because it will be used in subsequent steps. The device is the piece of hardware where the model, data, and other related objects are stored in memory. If a GPU is available, this notebook will use that hardware rather than the CPU. To force running only on the CPU, change device to "cpu". For more information, see the PyTorch device page.
Step 1: Constructing Embeddings¶
A common first step in many aspects of data monitoring is reducing images down to a smaller dimension. While this step is not always necessary, it is good practice to use embeddings over raw images to improve the speed and memory efficiency of many workflows without sacrificing downstream performance.
In this step, you will use a pretrained ResNet18 model to reduce the dimensionality of the VOC dataset.
Define model architecture¶
Below is a simple PyTorch nn.Module that wraps the pre-trained ResNet18 referred to above.
# Define the embedding network
class EmbeddingNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Load in pretrained resnet18 model
        self.model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Replace the final fully connected layer with one that outputs a 128-dimensional embedding
        self.model.fc = nn.Linear(self.model.fc.in_features, 128)

    def forward(self, x):
        """Run input data through the model"""
        return self.model(x)
The model can now be instantiated in the code below.
embedding_net = EmbeddingNet()
Download VOC dataset¶
With the model created on the device set at the beginning, you will download the train and validation splits of the 2012 VOC dataset. Afterwards, you will process the data in batches so that the model runs more efficiently.
# Define pretrained model transformations
transforms = models.ResNet18_Weights.DEFAULT.transforms()
# Load the training dataset
train_ds = VOCDetectionTorch("./data", year="2012", image_set="train", download=True, transforms=transforms)
print(train_ds)
VOCDetectionTorch Dataset
-------------------------
Year: 2012
Transforms: [ImageClassification(
crop_size=[224]
resize_size=[256]
mean=[0.485, 0.456, 0.406]
std=[0.229, 0.224, 0.225]
interpolation=InterpolationMode.BILINEAR
)]
Image_set: train
Metadata: {'id': 'VOCDetectionTorch_train', 'index2label': {0: 'aeroplane', 1: 'bicycle', 2: 'bird', 3: 'boat', 4: 'bottle', 5: 'bus', 6: 'car', 7: 'cat', 8: 'chair', 9: 'cow', 10: 'diningtable', 11: 'dog', 12: 'horse', 13: 'motorbike', 14: 'person', 15: 'pottedplant', 16: 'sheep', 17: 'sofa', 18: 'train', 19: 'tvmonitor'}, 'split': 'train'}
Path: /dataeval/docs/source/notebooks/data/vocdataset/VOCdevkit/VOC2012
Size: 5717
# Load the "operational" dataset
operational_ds = VOCDetectionTorch("./data", year="2012", image_set="val", download=True, transforms=transforms)
print(operational_ds)
VOCDetectionTorch Dataset
-------------------------
Year: 2012
Transforms: [ImageClassification(
crop_size=[224]
resize_size=[256]
mean=[0.485, 0.456, 0.406]
std=[0.229, 0.224, 0.225]
interpolation=InterpolationMode.BILINEAR
)]
Image_set: val
Metadata: {'id': 'VOCDetectionTorch_val', 'index2label': {0: 'aeroplane', 1: 'bicycle', 2: 'bird', 3: 'boat', 4: 'bottle', 5: 'bus', 6: 'car', 7: 'cat', 8: 'chair', 9: 'cow', 10: 'diningtable', 11: 'dog', 12: 'horse', 13: 'motorbike', 14: 'person', 15: 'pottedplant', 16: 'sheep', 17: 'sofa', 18: 'train', 19: 'tvmonitor'}, 'split': 'val'}
Path: /dataeval/docs/source/notebooks/data/vocdataset/VOCdevkit/VOC2012
Size: 5823
Note two points about each dataset:
Number of datapoints
Resize size
Together, these two values give a rough estimate of each dataset's memory footprint. The following step reduces this footprint by replacing each image with a much smaller model embedding.
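To make the memory estimate concrete, the following back-of-the-envelope sketch compares raw image tensors with 128-dimensional embeddings. It assumes three color channels, the 224-pixel crop shown in the transform output above, and 4-byte float32 values. The helper name `tensor_bytes` is illustrative and not part of DataEval.

```python
from math import prod

BYTES_PER_FLOAT32 = 4  # assumption: values are stored as float32


def tensor_bytes(num_items: int, item_shape: tuple[int, ...]) -> int:
    """Approximate bytes needed to hold num_items tensors of item_shape."""
    return num_items * prod(item_shape) * BYTES_PER_FLOAT32


num_train = 5717  # size of the train split printed above
image_bytes = tensor_bytes(num_train, (3, 224, 224))  # assumed RGB, 224x224 crop
embedding_bytes = tensor_bytes(num_train, (128,))  # 128-dimensional embeddings

print(f"images:     {image_bytes / 1e9:.2f} GB")
print(f"embeddings: {embedding_bytes / 1e6:.2f} MB")
print(f"reduction:  {image_bytes // embedding_bytes}x")
```

Under these assumptions, the embeddings are roughly three orders of magnitude smaller than the image tensors they summarize.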
Extract Embeddings¶
Now it is time to process the datasets through your model. Aggregating the model outputs gives you the embeddings of the data. This will be helpful in determining drift between the training and operational splits.
Below you will use the Embeddings class to create embeddings for both the train and operational splits. The labels will be retrieved from the datasets in a later step.
# Create training embeddings
train_embs = Embeddings(train_ds, batch_size=64, model=embedding_net, cache=True)
# Create operational embeddings
operational_embs = Embeddings(operational_ds, batch_size=64, model=embedding_net, cache=True)
Notice that the shape of the embeddings is different from before.
Previously
Training shape - (5717, 256)
Operational shape - (5823, 256)
After embeddings
print(f"({len(train_embs)}, {train_embs[0].shape})") # (5717, shape)
print(f"({len(operational_embs)}, {operational_embs[0].shape})") # (5823, shape)
(5717, torch.Size([128]))
(5823, torch.Size([128]))
The reduced shape of both the training and operational datasets will speed up the upcoming drift algorithms, ideally with little impact on the accuracy of the results.
Step 2: Monitor drift¶
In this step, you will be checking for drift between the training embeddings and the operational embeddings from before. If drift is detected, a model trained on this training data should be retrained with new operational data. This can help mitigate performance degradation in a deployed model. Visit our About Drift page to learn more.
Drift detectors¶
DataEval offers a few drift detectors: DriftMMD, DriftCVM, and DriftKS.
Since each detector outputs a binary decision on whether drift is detected, a majority vote will be used to make the determination of drift.
To learn more about these algorithms, see the theory behind drift detection concept page.
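As a rough illustration of what a detector such as DriftKS measures, the sketch below computes the two-sample Kolmogorov-Smirnov statistic for a single feature in plain Python. This is a simplified stand-in and not DataEval's implementation. The function name `ks_statistic` and the synthetic samples are assumptions made for illustration.

```python
import bisect
import random


def ks_statistic(reference: list[float], test: list[float]) -> float:
    """Largest gap between the empirical CDFs of two one-dimensional samples."""
    ref_sorted = sorted(reference)
    test_sorted = sorted(test)
    largest_gap = 0.0
    # Both empirical CDFs are step functions, so the largest gap occurs at a data point
    for value in ref_sorted + test_sorted:
        cdf_ref = bisect.bisect_right(ref_sorted, value) / len(ref_sorted)
        cdf_test = bisect.bisect_right(test_sorted, value) / len(test_sorted)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_test))
    return largest_gap


# Synthetic one-dimensional "feature" values (illustrative only)
sampler = random.Random(0)
reference = [sampler.gauss(0.0, 1.0) for _ in range(1000)]
same_dist = [sampler.gauss(0.0, 1.0) for _ in range(1000)]
shifted = [sampler.gauss(1.0, 1.0) for _ in range(1000)]

print(f"same distribution: {ks_statistic(reference, same_dist):.3f}")
print(f"shifted mean:      {ks_statistic(reference, shifted):.3f}")
```

A statistic near 0 means the two samples look alike, while a value near 1 means they barely overlap. A full test converts the statistic into a p-value and compares it with a significance threshold.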
Fit the detectors¶
Each drift detector needs a reference set that the operational set will be compared against. In the following code, you will set the reference data to the training embeddings.
# A type alias for all of the drift detectors
DriftDetector = DriftMMD | DriftCVM | DriftKS
# Create a mapping for the detectors to iterate over
detectors: dict[str, DriftDetector] = {
    "MMD": DriftMMD(train_embs),
    "CVM": DriftCVM(train_embs),
    "KS": DriftKS(train_embs),
}
train_embs.to_tensor()
tensor([[-0.7681, 0.2673, 0.9020, ..., 0.7955, -0.8487, -0.1268],
[-1.1138, -0.2953, -0.5866, ..., 0.5392, -0.5934, 0.4686],
[-1.1482, 0.8335, -0.4806, ..., 1.1189, -0.5809, -0.2297],
...,
[-1.1345, -0.6711, -0.4499, ..., -0.2993, -0.5611, 0.4070],
[-0.7017, -0.3413, 0.0257, ..., 0.3907, 0.0611, -0.4274],
[-0.5480, -0.3742, 0.3132, ..., 0.1510, -0.4800, -0.0578]],
device='cuda:0')
Make predictions¶
Now that the detectors are set up, predictions can be made against the operational embeddings you made earlier.
# Iterate and print the name of the detector class and its boolean drift prediction
for name, detector in detectors.items():
    print(f"{name} detected drift? {detector.predict(operational_embs).drifted}")
MMD detected drift? False
CVM detected drift? False
KS detected drift? False
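The loop above prints each detector's decision separately. A majority vote over those decisions, as described under Drift detectors, could be sketched as follows. The `majority_vote` helper is hypothetical, and the votes are copied by hand from the printed output rather than read from the detectors.

```python
def majority_vote(votes: dict[str, bool]) -> bool:
    """Return True when more than half of the detectors report drift."""
    return sum(votes.values()) > len(votes) / 2


# Hypothetical votes, copied by hand from the printed output above
votes = {"MMD": False, "CVM": False, "KS": False}
print(f"Drift by majority vote? {majority_vote(votes)}")
```

With three detectors a tie is impossible; with an even number of detectors, this sketch treats a tie as no drift.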
Did you expect these results?
There is no drift detected between the train and operational embeddings because they come from very similar distributions.
Ideally, your training data and your validation data, which we used as operational, come from the same distribution. This is the purpose of data splitters.
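To see why a random split tends to keep distributions aligned, the following sketch shuffles a synthetic label list and splits it in two. The `random_split` helper, the class proportions, and the seed are illustrative assumptions, not part of DataEval or the VOC tooling.

```python
import random
from collections import Counter


def random_split(items: list[int], fraction: float, seed: int) -> tuple[list[int], list[int]]:
    """Shuffle a copy of items and split it at the given fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[:cut], shuffled[cut:]


def class_frequencies(labels: list[int]) -> dict[int, float]:
    """Fraction of items belonging to each class."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in sorted(counts.items())}


# Synthetic labels: 70% class 0, 30% class 1
labels = [0] * 7000 + [1] * 3000
first_half, second_half = random_split(labels, fraction=0.5, seed=0)

print(class_frequencies(first_half))
print(class_frequencies(second_half))
```

Both halves should show class frequencies close to the original 70/30 mix, which is the property that makes a validation split a reasonable stand-in for drift-free operational data.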
So how do we know if the detectors can detect drift?
Well, add some random Gaussian noise to the operational images before they are embedded and find out.
# Apply Gaussian noise to the images before they are embedded
noisy_embs = Embeddings(operational_ds, batch_size=64, model=embedding_net, transforms=GaussianNoise(), cache=True)
# Iterate and print the name of the detector class and its boolean drift prediction
for name, detector in detectors.items():
    print(f"{name} detected drift? {detector.predict(noisy_embs).drifted}")
MMD detected drift? True
CVM detected drift? True
KS detected drift? True
Now drift is detected!
Adding Gaussian noise was enough for all three drift detectors to flag drift, but this is not always the case. There are many types of drift that data can and will experience.
In this step, you learned how to take your generated embeddings and detect drift between the training and operational image data. While there was no drift originally, you were able to add small perturbations to the data that did affect the data distributions and cause drift.
Next, you will look at the label distributions.
Step 3: Parity¶
Instead of looking at the images, you can compare the distributions of the labels using a method called label parity.
There is parity between two sets of labels if the label frequencies are approximately equal.
You will now compare the label distributions using the label_parity function.
# Get the metadata for each dataset
train_md = Metadata(train_ds)
operational_md = Metadata(operational_ds)
# The VOC dataset has 20 classes
label_parity(train_md.class_labels, operational_md.class_labels, num_classes=20).p_value
np.float64(0.949856067521638)
The ParityOutput returned by label_parity reports a p_value of ~0.95. Because this is far above a common significance threshold such as 0.05, there is no evidence that the two label distributions differ, so they can be said to have parity, or similar distributions.
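One way to build intuition for this comparison is a chi-square statistic on label counts. The sketch below is a simplified illustration under that assumption. It is not the label_parity implementation, it computes only the statistic rather than a p-value, and the function name `chi_square_statistic` and the toy labels are hypothetical.

```python
from collections import Counter


def chi_square_statistic(expected_labels: list[int], observed_labels: list[int], num_classes: int) -> float:
    """Chi-square statistic of observed label counts against reference frequencies."""
    expected_counts = Counter(expected_labels)
    observed_counts = Counter(observed_labels)
    # Rescale the reference counts so both sets have the same total
    scale = len(observed_labels) / len(expected_labels)
    statistic = 0.0
    for label in range(num_classes):
        expected = expected_counts[label] * scale
        if expected == 0:
            continue  # simplification: skip classes absent from the reference set
        statistic += (observed_counts[label] - expected) ** 2 / expected
    return statistic


# Toy labels for two classes (illustrative only)
reference_labels = [0, 0, 1, 1]
matching_labels = [0, 1, 0, 1]
skewed_labels = [0, 0, 0, 1]

print(chi_square_statistic(reference_labels, matching_labels, num_classes=2))
print(chi_square_statistic(reference_labels, skewed_labels, num_classes=2))
```

A larger statistic means the observed label counts stray further from what the reference frequencies predict. A statistical test would convert the statistic into a p-value like the one reported above.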
Conclusion¶
In this tutorial, you have learned to create embeddings from the VOC dataset, look for drift between two sets of data, and calculate the parity of two label distributions. These are important steps when monitoring data, as drift and lack of parity can affect a model’s ability to achieve the performance recorded during model training. When data drift is detected or the label distributions lack parity, it is a good idea to consider retraining the model and incorporating operational data into the dataset.
What’s next¶
DataEval plays a small but impactful role in data monitoring as a metrics library.
Visit these additional resources for more information on other aspects:
Read about the entire Monitoring stage in AI/ML
Explore DataEval’s API reference for drift and other monitoring tools