Data Monitoring Guide
Introduction
Monitoring is a critical step in the AI/ML lifecycle. When a model is deployed, data can, and generally will, drift from the distribution on which the model was originally trained, or may be fundamentally different from the outset for a variety of reasons. A key part of AI test and evaluation (T&E) is detecting changes in the operational distribution so that they can be addressed proactively. While some changes will not affect performance, significant deviation is often associated with model degradation.
In this guide, you will walk through the steps of detecting drift and checking label parity between a training dataset and an operational dataset.
For this tutorial, you will use the VOC dataset, an image dataset used for computer vision competitions. You will be comparing the image distribution of the train split to that of the val split, pretending as though the val split represents an operational dataset.
What you’ll need
You’ll begin by importing the necessary libraries for this tutorial.
import numpy as np
import torch
import torch.nn as nn
from torchvision import datasets, models
# Drift detectors
from dataeval.detectors.drift import DriftCVM, DriftKS, DriftMMD
# Label parity
from dataeval.metrics.bias import label_parity
# Seed the random number generator for reproducibility
rng = np.random.default_rng(213)
What you’ll learn
You’ll learn how to detect drift on an object detection dataset
You’ll learn how to measure label parity between your training and operational datasets
You’ll learn how to use embeddings to run these checks efficiently on large image datasets
Step 1: Constructing Embeddings
Encoding Images
The first step in many aspects of data monitoring is reducing images to a dimensionality that the monitoring tools can handle. To do this, you will use a ResNet18 with pretrained weights and apply it to the VOC dataset. A more in-depth look at this dataset and the construction of embeddings can be found in the EDA Tutorial.
The first steps are defining the encoder network and embedding the training images.
# Define the embedding network
class EmbeddingNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Replace the classification head with a 128-dimensional output layer
        self.model.fc = nn.Linear(self.model.fc.in_features, 128)

    def forward(self, x):
        return self.model(x)
embedding_net = EmbeddingNet()
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
embedding_net.to(device)
# Extract embeddings
def extract_embeddings(dataset, model):
    model.eval()
    embeddings = torch.empty(size=(0, 128)).to(device)
    with torch.no_grad():
        images = []
        for i, (img, _) in enumerate(dataset):
            images.append(img)
            # Run the model on each full batch of 64 images
            if (i + 1) % 64 == 0:
                inputs = torch.stack(images, dim=0).to(device)
                outputs = model(inputs)
                embeddings = torch.vstack((embeddings, outputs))
                images = []
        # Embed the leftover images; torch.stack fails on an empty list,
        # so skip this when the dataset size is a multiple of 64
        if images:
            inputs = torch.stack(images, dim=0).to(device)
            outputs = model(inputs)
            embeddings = torch.vstack((embeddings, outputs))
    return embeddings.detach().cpu().numpy()
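The embedding loop above works through the dataset in batches of 64 and then handles whatever is left over. As a rough sketch of that batching pattern in isolation, the hypothetical `batched` helper below (not part of the tutorial or of DataEval) yields full batches followed by one shorter final batch, and never yields an empty one.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to `batch_size` items; the last may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:  # nothing left, so an empty batch is never yielded
            return
        yield batch


# 5717 items in batches of 64: 89 full batches plus a final batch of 21
sizes = [len(b) for b in batched(range(5717), 64)]
print(len(sizes), sizes[-1])
```

Because an empty trailing batch is never produced, a consumer that stacks each batch into a tensor does not need a special case for dataset sizes that are exact multiples of the batch size.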
Next, you will load the training dataset with the preprocessing that the pretrained model expects, and then run the model to obtain the image embeddings.
# Define pretrained model transformations
preprocess = models.ResNet18_Weights.DEFAULT.transforms()
# Load the dataset
dataset = datasets.VOCDetection("./data", year="2011", image_set="train", download=False, transform=preprocess)
# Create image embeddings
embeddings = extract_embeddings(dataset, embedding_net)
np.shape(embeddings)
(5717, 128)
Each of the 5,717 training images is now represented by a 128-dimensional embedding. Next, you will do the same for the operational dataset.
# Load the 'operational' dataset
op_dataset = datasets.VOCDetection("./data", year="2011", image_set="val", download=False, transform=preprocess)
# Create image embeddings
op_embeddings = extract_embeddings(op_dataset, embedding_net)
np.shape(op_embeddings)
(5823, 128)
Step 2: Drift
Now that you have embedded both sets of images into 128-dimensional space, you would like to determine if the val dataset has drifted from the train dataset.
You will use three DataEval drift detectors to make this determination: DriftMMD (maximum mean discrepancy), DriftCVM (Cramér-von Mises), and DriftKS (Kolmogorov-Smirnov). Each operates by comparing the distribution of the reference embeddings with that of the operational embeddings, and each produces one or more p-values. A small p-value means the observed difference would be very unlikely if the two sets of embeddings came from the same distribution, so drift has likely occurred. Based on the p-value(s), each detector outputs a binary is_drift flag, which you will examine here.
d1 = DriftMMD(embeddings)
d2 = DriftCVM(embeddings)
d3 = DriftKS(embeddings)
d1.predict(op_embeddings).is_drift
False
d2.predict(op_embeddings).is_drift
False
d3.predict(op_embeddings).is_drift
False
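To make the p-value logic more concrete, here is a minimal sketch of a kernel two-sample test based on maximum mean discrepancy with a permutation-based p-value. It is an illustration of the general technique only, not DataEval's `DriftMMD` implementation; the function names, the RBF kernel with a median-heuristic bandwidth, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def mmd2_biased(kernel: np.ndarray, idx_x: np.ndarray, idx_y: np.ndarray) -> float:
    """Biased estimate of squared MMD from a precomputed kernel matrix."""
    k_xx = kernel[np.ix_(idx_x, idx_x)].mean()
    k_yy = kernel[np.ix_(idx_y, idx_y)].mean()
    k_xy = kernel[np.ix_(idx_x, idx_y)].mean()
    return float(k_xx + k_yy - 2.0 * k_xy)


def mmd_permutation_test(x, y, n_permutations=200, seed=0):
    """Return (mmd2, p_value) under the null that x and y share a distribution."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y], axis=0)
    n = len(x)
    # Pairwise squared Euclidean distances over the pooled sample
    diffs = pooled[:, None, :] - pooled[None, :, :]
    sq_dists = (diffs**2).sum(axis=-1)
    # Assumption: RBF kernel with the median of the nonzero squared distances as bandwidth
    bandwidth = np.median(sq_dists[sq_dists > 0])
    kernel = np.exp(-sq_dists / bandwidth)
    indices = np.arange(len(pooled))
    observed = mmd2_biased(kernel, indices[:n], indices[n:])
    # Shuffle sample membership to simulate the null hypothesis
    exceed = 0
    for _ in range(n_permutations):
        perm = rng.permutation(indices)
        if mmd2_biased(kernel, perm[:n], perm[n:]) >= observed:
            exceed += 1
    p_value = (exceed + 1) / (n_permutations + 1)
    return observed, p_value


# Synthetic stand-ins for embeddings (not the VOC data)
rng = np.random.default_rng(0)
reference = rng.normal(size=(40, 3))
shifted = rng.normal(loc=3.0, size=(40, 3))
_, p_same = mmd_permutation_test(reference, reference.copy())
_, p_shift = mmd_permutation_test(reference, shifted)
print(p_same > 0.05, p_shift < 0.05)
```

The pairwise-distance step builds an array of size n × n × d, so this sketch is only suitable for small samples; a production detector would compute the kernel more economically.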
Since these two image sets are random subsets of the same dataset, you unsurprisingly do not detect any drift. However, let’s add some Gaussian noise to the operational embeddings to see what happens to the drift detectors.
# Add unit-variance Gaussian noise using the seeded generator defined earlier
perturbed_op_embeddings = (op_embeddings + rng.normal(size=op_embeddings.shape)).astype(np.float32)
d1.predict(perturbed_op_embeddings).is_drift
True
d2.predict(perturbed_op_embeddings).is_drift
True
d3.predict(perturbed_op_embeddings).is_drift
True
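One way to see why additive noise is so easy to detect is to run a two-sample Kolmogorov-Smirnov test on each embedding dimension separately and apply a Bonferroni correction across dimensions. The sketch below does this with NumPy only, using the standard asymptotic Kolmogorov series for the p-value. It is an illustrative approximation under those assumptions, not DataEval's `DriftKS` code, and all function names and the synthetic data are hypothetical.

```python
import numpy as np


def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Largest gap between the empirical CDFs of two 1-D samples."""
    grid = np.sort(np.concatenate([a, b]))
    cdf_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    cdf_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))


def ks_p_value(d: float, n: int, m: int, terms: int = 100) -> float:
    """Asymptotic two-sided p-value from the Kolmogorov distribution."""
    lam = d * np.sqrt(n * m / (n + m))
    if lam < 0.2:  # the series converges slowly here and the p-value is ~1
        return 1.0
    j = np.arange(1, terms + 1)
    series = np.sum((-1.0) ** (j - 1) * np.exp(-2.0 * (j * lam) ** 2))
    return float(np.clip(2.0 * series, 0.0, 1.0))


def featurewise_ks_drift(ref, test, alpha=0.05):
    """Flag drift if any feature's p-value falls below alpha / n_features."""
    n_features = ref.shape[1]
    p_values = np.array([
        ks_p_value(ks_statistic(ref[:, k], test[:, k]), len(ref), len(test))
        for k in range(n_features)
    ])
    return bool(np.any(p_values < alpha / n_features)), p_values


# Synthetic stand-ins for embeddings (not the VOC data)
rng = np.random.default_rng(0)
ref = rng.normal(scale=0.1, size=(500, 8))
noisy = ref + rng.normal(size=ref.shape)  # additive unit-variance noise
drift_same, _ = featurewise_ks_drift(ref, ref.copy())
drift_noisy, _ = featurewise_ks_drift(ref, noisy)
print(drift_same, drift_noisy)
```

In this synthetic setup the noise is much larger than the spread of the reference features, so every per-feature distribution widens noticeably and the corrected test flags drift.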
When you perturb the operational embeddings, you find that drift is detected. For a more realistic example, you can restrict the operational set to images containing a single class, which changes the image distribution without any synthetic noise.
labels = []
for data in op_dataset:
    objects = data[1]["annotation"]["object"]
    names = []
    for each in objects:
        names.append(each["name"])
    labels.append(names)
# Subset embeddings of images which contain a chair
chair_embeddings = op_embeddings[[("chair" in i) for i in labels], :]
d1.predict(chair_embeddings).is_drift
True
d2.predict(chair_embeddings).is_drift
True
d3.predict(chair_embeddings).is_drift
True
In both cases, the drift detectors pick up on fairly simple changes to the operational data, yet they return False when the operational data is statistically indistinguishable from the data on which the model was trained.
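The label-gathering loop in this step relies on the nested target dictionaries returned for each image, where each annotated object carries a `name` field. A small sketch of that structure follows, using a toy list of `(image, target)` pairs in place of the real dataset; the `object_names` helper and the toy data are hypothetical and exist only for illustration.

```python
from collections import Counter


def object_names(target: dict) -> list:
    """Return the class name of every annotated object in one VOC-style target."""
    return [obj["name"] for obj in target["annotation"]["object"]]


# Toy stand-in for a detection dataset: (image, target) pairs
toy_dataset = [
    ("img0", {"annotation": {"object": [{"name": "chair"}, {"name": "person"}]}}),
    ("img1", {"annotation": {"object": [{"name": "dog"}]}}),
    ("img2", {"annotation": {"object": [{"name": "chair"}, {"name": "chair"}]}}),
]

# One list of names per image, as needed for per-image subsetting
per_image = [object_names(target) for _, target in toy_dataset]
has_chair = ["chair" in names for names in per_image]

# One flat tally of names across all images, as needed for class frequencies
flat_counts = Counter(name for names in per_image for name in names)

print(has_chair)
print(flat_counts["chair"])
```

The per-image lists support boolean masks such as the chair subset, while the flattened counts are the form needed when comparing class frequencies between datasets.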
Step 3: Parity
Another task you might want to perform in monitoring is checking the parity of classes between the training and operational datasets. Two datasets have label parity if their label frequencies are (approximately) equal. Let’s check whether the distribution of object classes is the same in both datasets.
op_labels = []
for data in op_dataset:
    objects = data[1]["annotation"]["object"]
    names = []
    for each in objects:
        names.append(each["name"])
    op_labels.append(names)
op_labels = [x for i in op_labels for x in i]
labels = []
for data in dataset:
    objects = data[1]["annotation"]["object"]
    names = []
    for each in objects:
        names.append(each["name"])
    labels.append(names)
labels = [x for i in labels for x in i]
from sklearn import preprocessing
# Turn string labels into integer labels so the DataEval parity function can read them.
le = preprocessing.LabelEncoder()
le.fit(labels)
label_int = le.transform(labels)
op_label_int = le.transform(op_labels)
label_parity(label_int, op_label_int, 20).p_value
0.949856067521638
Unsurprisingly, there is no discernible difference in the distribution of classes between the two datasets: a p-value of about 0.95 is far above any conventional significance threshold, so there is no evidence against parity.
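A common way to test whether two sets of labels have matching frequencies is a chi-squared goodness-of-fit test. The sketch below illustrates that idea with a Monte Carlo p-value so that it needs only NumPy. It is a rough illustration under those assumptions and is not DataEval's `label_parity` implementation; the `label_parity_mc` name, its parameters, and the toy labels are hypothetical.

```python
import numpy as np


def label_parity_mc(ref_labels, op_labels, num_classes, n_sim=999, seed=0):
    """Monte Carlo goodness-of-fit test of operational label counts
    against the class proportions seen in the reference labels."""
    rng = np.random.default_rng(seed)
    ref_counts = np.bincount(ref_labels, minlength=num_classes)
    op_counts = np.bincount(op_labels, minlength=num_classes)
    probs = ref_counts / ref_counts.sum()
    if np.any(probs == 0):
        raise ValueError("every class must appear in the reference labels")
    # Counts expected in the operational set if it matched the reference proportions
    expected = probs * op_counts.sum()
    observed_stat = float(np.sum((op_counts - expected) ** 2 / expected))
    # Draw label counts under the null and see how often they look as extreme
    sims = rng.multinomial(op_counts.sum(), probs, size=n_sim)
    sim_stats = ((sims - expected) ** 2 / expected).sum(axis=1)
    p_value = (np.sum(sim_stats >= observed_stat) + 1) / (n_sim + 1)
    return observed_stat, float(p_value)


# Toy integer labels for three classes (not the VOC labels)
balanced = [0] * 100 + [1] * 100 + [2] * 100
skewed = [0] * 300
_, p_same = label_parity_mc(balanced, balanced, num_classes=3)
stat_skew, p_skew = label_parity_mc(balanced, skewed, num_classes=3)
print(p_same, p_skew)
```

A high p-value, as in the balanced case, means the operational label counts are consistent with the reference proportions; a very low one, as in the skewed case, indicates that parity is violated.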
Conclusion
You have checked for potential issues in the operational dataset that may affect the model after deployment. Both drift and a lack of class parity can prevent a model from achieving the performance measured at training time. If you detect that a dataset has drifted significantly, or that parity has been violated, consider retraining the model and incorporating operational data into that retraining.