Monitor shifts in operational data¶

This guide provides a beginner-friendly introduction to monitoring post-deployment data shifts.

Estimated time to complete: 5 minutes

Relevant ML stages: Monitoring

Relevant personas: Machine Learning Engineer, T&E Engineer

What you’ll do¶

Construct embeddings using a pretrained neural network
Compare different drift detectors and understand their strengths
Inspect detector-specific outputs for root-cause analysis
Use chunked drift detection to monitor drift across data segments
Compare the label distributions between a training and operational set

What you’ll learn¶

Learn the strengths and trade-offs of each drift detector
Learn how to analyze embeddings for operational drift
Learn how to inspect per-feature and per-detector statistics
Learn how to use chunked drift detection for temporal monitoring
Learn how to analyze label distributions

What you’ll need¶

Knowledge of Python
Beginner knowledge of PyTorch or neural networks

Introduction¶

Monitoring is a critical step in the AI/ML lifecycle. When a model is deployed, data can, and generally will, drift from the distribution on which the model was originally trained. One critical step in AI T&E is the detection of changes in the operational distribution so that they may be proactively addressed. While some change might not affect performance, significant deviation is often associated with model degradation.

For this tutorial, you will use the popular 2012 VOC computer vision dataset to detect drift between the image distribution of the train split and the val split, which will represent an operational dataset in this guide. You will then determine whether the labels within these two datasets have high parity, meaning equivalent label distributions.

Setup¶

You’ll begin by importing the necessary libraries for this tutorial.

import numpy as np
import polars as pl
import torch
from IPython.display import display
from maite_datasets.object_detection import VOCDetection
from torchvision.models import ResNet18_Weights, resnet18
from torchvision.transforms.v2 import GaussianNoise

from dataeval import Embeddings, Metadata
from dataeval.core import label_parity
from dataeval.extractors import TorchExtractor
from dataeval.shift import ChunkedDrift, DriftDomainClassifier, DriftKNeighbors, DriftMMD, DriftUnivariate

# Set a random seed
rng = np.random.default_rng(213)

# Set default torch device for notebook
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.set_default_device(device)

More on device

The device is set above as it will be used in subsequent steps. The device is the piece of hardware where the model, data, and other related objects are stored in memory. If a GPU is available, this notebook will use that hardware rather than the CPU. To force running only on the CPU, change device to "cpu". For more information, see the PyTorch device page.

Constructing embeddings¶

An important concept in many aspects of machine learning is dimensionality reduction. While this step is not always necessary, it is good practice to use embeddings over raw images to improve the speed and memory efficiency of many workflows, usually with little loss in downstream performance.

Define model architecture¶

In this section, you will use a pretrained ResNet18 model from Torchvision to reduce the dimensionality of the VOC dataset.

resnet = resnet18(weights=ResNet18_Weights.DEFAULT, progress=False)

# Replace the final fully connected layer with a Linear layer
resnet.fc = torch.nn.Linear(resnet.fc.in_features, 128)

Download VOC dataset¶

With the model created on the device set at the beginning, you will download the train and validation splits of the 2012 VOC Dataset.

# Load the training dataset
train_ds = VOCDetection("./data", year="2012", image_set="train", download=True)
print(train_ds)
print(f"Image 0 shape: {train_ds[0][0].shape}")

VOCDetection Dataset
--------------------
    Year: 2012
    Transforms: []
    Image Set: train
    Metadata: {'id': 'VOCDetection_train', 'index2label': {0: 'aeroplane', 1: 'bicycle', 2: 'bird', 3: 'boat', 4: 'bottle', 5: 'bus', 6: 'car', 7: 'cat', 8: 'chair', 9: 'cow', 10: 'diningtable', 11: 'dog', 12: 'horse', 13: 'motorbike', 14: 'person', 15: 'pottedplant', 16: 'sheep', 17: 'sofa', 18: 'train', 19: 'tvmonitor'}, 'split': 'train'}
    Path: /builds/jatic/aria/dataeval/docs/source/notebooks/data/vocdataset/VOCdevkit/VOC2012
    Size: 5717
Image 0 shape: (3, 442, 500)

# Load the "operational" dataset
operational_ds = VOCDetection("./data", year="2012", image_set="val", download=True)
print(operational_ds)
print(f"Image 0 shape: {train_ds[0][0].shape}")

VOCDetection Dataset
--------------------
    Year: 2012
    Transforms: []
    Image Set: val
    Metadata: {'id': 'VOCDetection_val', 'index2label': {0: 'aeroplane', 1: 'bicycle', 2: 'bird', 3: 'boat', 4: 'bottle', 5: 'bus', 6: 'car', 7: 'cat', 8: 'chair', 9: 'cow', 10: 'diningtable', 11: 'dog', 12: 'horse', 13: 'motorbike', 14: 'person', 15: 'pottedplant', 16: 'sheep', 17: 'sofa', 18: 'train', 19: 'tvmonitor'}, 'split': 'val'}
    Path: /builds/jatic/aria/dataeval/docs/source/notebooks/data/vocdataset/VOCdevkit/VOC2012
    Size: 5823
Image 0 shape: (3, 442, 500)

Note two things about each dataset:

Number of datapoints
Image size

Together, these two values give an estimate of each dataset's memory footprint. The following step reduces that footprint by replacing each image with a much smaller model embedding.

Extract embeddings¶

Now it is time to process the datasets through your model. Aggregating the model outputs gives you the embeddings of the data. This will be helpful in determining drift between the training and operational splits.

Below you will wrap the model in a TorchExtractor and create Embeddings for both the train and operational splits. The labels are retrieved separately in a later step.

# Define pretrained model transformations
transforms = ResNet18_Weights.DEFAULT.transforms()

# Create extractor with model and transforms
extractor = TorchExtractor(resnet, transforms=transforms)

# Create embeddings for the training split (computed in batches of 64)
train_embs = Embeddings(train_ds, extractor=extractor, batch_size=64)

# Create embeddings for the operational split (computed in batches of 64)
operational_embs = Embeddings(operational_ds, extractor=extractor, batch_size=64)

Notice that the shape of embeddings is different than before.

Previously¶

Training shape - (5717, 3, 442, 500)
Operational shape - (5823, 3, 442, 500)

After embeddings¶

print(f"({len(train_embs)}, {train_embs[0].shape})")  # (5717, shape)
print(f"({len(operational_embs)}, {operational_embs[0].shape})")  # (5823, shape)

(5717, (128,))
(5823, (128,))

The reduced shape of both the training and operational datasets will speed up the upcoming drift algorithms and lower their memory use. Compact embeddings from a pretrained model generally retain enough information for drift detection.
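As a rough illustration of the size reduction, the arithmetic below compares the memory footprint of the raw training images with that of their embeddings. It is a back-of-the-envelope sketch under stated assumptions: every image is treated as having the shape of image 0 (VOC images actually vary in size), pixels are assumed to be stored as uint8, and embeddings as float32.

```python
# Figures taken from the printed dataset summary and embedding shapes.
n_images = 5717
image_shape = (3, 442, 500)  # assumption: every image matches image 0
embedding_dim = 128

bytes_per_pixel = 1  # assumption: uint8 pixel values
bytes_per_float = 4  # assumption: float32 embeddings

channels, height, width = image_shape
image_bytes = n_images * channels * height * width * bytes_per_pixel
embedding_bytes = n_images * embedding_dim * bytes_per_float

print(f"Raw images: {image_bytes / 1e9:.2f} GB")
print(f"Embeddings: {embedding_bytes / 1e6:.2f} MB")
print(f"Reduction factor: {image_bytes / embedding_bytes:.0f}x")
```

Under these assumptions the embeddings are roughly three orders of magnitude smaller than the raw images.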

Understanding drift detectors¶

Before testing for drift, it helps to understand the different approaches available. Each detector has distinct strengths that make it better suited for certain scenarios. This tutorial uses four detectors that represent fundamentally different strategies for detecting distributional change.

Detector	Approach	Strengths	Best For
`DriftUnivariate` (CVM)	Statistical test per feature	Fast, interpretable per-feature results	Identifying which features drifted
`DriftMMD`	Kernel-based multivariate test	Captures feature dependencies	High-dimensional data, complex shifts
`DriftDomainClassifier`	Trains a classifier to distinguish distributions	Feature importances for root-cause analysis	Understanding why drift occurred
`DriftKNeighbors`	Compares k-NN distances	Lightweight and fast	Quick monitoring checks

Other univariate methods

DriftUnivariate supports several statistical tests beyond CVM, including Kolmogorov-Smirnov (ks), Mann-Whitney U (mwu), Anderson-Darling (anderson), and Baumgartner-Weiss-Schindler (bws). Each has different sensitivity characteristics — see the drift concept page for details.

Test for drift¶

In this step, you will be checking for drift between the training embeddings and the operational embeddings from before. If drift is detected, a model trained on this training data should be retrained with new operational data. This can help mitigate performance degradation in a deployed model. Visit our About Drift page to learn more.

Drift detectors¶

DataEval offers several drift detectors. This tutorial demonstrates four that each take a different approach: DriftUnivariate, DriftMMD, DriftDomainClassifier, and DriftKNeighbors.

Since each detector outputs a binary decision on whether drift is detected, a majority vote can be used to make the determination of drift.
To learn more about these algorithms, see the theory behind drift detection concept page.
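The majority vote mentioned above can be sketched in a few lines. The helper name `majority_vote` and the example decisions are hypothetical and are not part of DataEval. In practice, the booleans would come from each detector's `.drifted` attribute. This sketch treats a tie as "no drift", which is a design choice rather than a rule.

```python
def majority_vote(decisions: dict[str, bool]) -> bool:
    """Return True when strictly more than half of the detectors report drift."""
    votes_for_drift = sum(decisions.values())
    return votes_for_drift > len(decisions) / 2


# Hypothetical detector decisions, for illustration only
three_of_four = {"CVM": True, "MMD": True, "MVDC": True, "KNN": False}
tie = {"CVM": True, "MMD": True, "MVDC": False, "KNN": False}

print(majority_vote(three_of_four))  # True
print(majority_vote(tie))  # False (a tie is not a majority)
```

With an even number of detectors, you may prefer to break ties in favor of drift if missing a shift is more costly than a false alarm.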

Fit the detectors¶

Each drift detector needs a reference set that the operational set will be compared against. In the following code, you will set the reference data to the training embeddings.

# A type alias for all of the drift detectors
DriftDetector = DriftUnivariate | DriftMMD | DriftDomainClassifier | DriftKNeighbors

# Create a mapping for the detectors to iterate over
detectors: dict[str, DriftDetector] = {
    "CVM": DriftUnivariate(method="cvm").fit(train_embs),
    "MMD": DriftMMD().fit(train_embs),
    "MVDC": DriftDomainClassifier().fit(train_embs),
    "KNN": DriftKNeighbors().fit(train_embs),
}

Make predictions¶

Now that the detectors are set up, predictions can be made against the operational embeddings you made earlier.

# Iterate and print the name of the detector class and its boolean drift prediction
for name, detector in detectors.items():
    print(f"{name} detected drift? {detector.predict(operational_embs).drifted}")

CVM detected drift? False
MMD detected drift? False
MVDC detected drift? False
KNN detected drift? False

Did you expect these results?

There is no drift detected between the train and operational embeddings because they come from very similar distributions.
Ideally, your training data and your validation data, which we used as operational, come from the same distribution. This is the purpose of data splitters.

So how do we know if the detectors can detect drift?

Well, add some random Gaussian noise to the operational images before they are embedded and find out.

# Define transform with added gaussian noise
noisy_transforms = [transforms, GaussianNoise()]

# Create extractor with noisy transforms
noisy_extractor = TorchExtractor(resnet, transforms=noisy_transforms)

# Applies gaussian noise to images before processing
noisy_embs = Embeddings(operational_ds, extractor=noisy_extractor, batch_size=64)

# Iterate and print the name of the detector class and its boolean drift prediction
for name, detector in detectors.items():
    print(f"{name} detected drift? {detector.predict(noisy_embs).drifted}")

CVM detected drift? True
MMD detected drift? True
MVDC detected drift? True
KNN detected drift? True

Now drift is detected!

Adding Gaussian noise was enough to cause a noticeable change in the drift detectors, but this is not always the case. There are many types of drift that data can and will experience.

Inspecting detector outputs¶

Each detector doesn’t just report whether drift occurred — it provides statistics that reveal different things about the drift. Let’s look at what each detector tells us.

# Store results for inspection
results = {name: detector.predict(noisy_embs) for name, detector in detectors.items()}

DriftUnivariate: per-feature analysis¶

The univariate detector tests each feature independently and reports which features drifted and their p-values. This is useful for identifying which dimensions of the embedding space shifted.

cvm_result = results["CVM"]
cvm_details = cvm_result.details

n_drifted = sum(cvm_details["feature_drift"])
n_features = len(cvm_details["feature_drift"])
print(f"Features drifted: {n_drifted}/{n_features}")
print(f"Corrected p-value threshold: {cvm_details['feature_threshold']:.6f}")
print(f"Min feature p-value: {min(cvm_details['p_vals']):.6f}")
print(f"Max feature p-value: {max(cvm_details['p_vals']):.6f}")

Features drifted: 128/128
Corrected p-value threshold: 0.050000
Min feature p-value: 0.000000
Max feature p-value: 0.010467

DriftDomainClassifier: feature importances¶

The domain classifier trains a model to distinguish reference from test data and reports how important each feature was in making that distinction. High AUROC means the distributions are easily separable — a strong signal of drift.

mvdc_result = results["MVDC"]
mvdc_details = mvdc_result.details

print(f"AUROC: {mvdc_result.distance:.4f} (threshold: {mvdc_result.threshold})")
print(f"Per-fold AUROCs: {[round(a, 4) for a in mvdc_details['fold_aurocs']]}")

# Show top 5 most important features
importances = np.array(mvdc_details["feature_importances"])
top_indices = np.argsort(importances)[::-1][:5]
print("\nTop 5 features driving drift:")
for idx in top_indices:
    print(f"  Feature {idx}: importance = {importances[idx]:.4f}")

AUROC: 0.9986 (threshold: 0.55)
Per-fold AUROCs: [np.float32(0.9984), np.float32(0.9984), np.float32(0.9978), np.float32(0.9993), np.float32(0.9993)]

Top 5 features driving drift:
  Feature 58: importance = 153.2000
  Feature 107: importance = 119.2000
  Feature 86: importance = 112.0000
  Feature 49: importance = 109.2000
  Feature 1: importance = 99.6000

DriftKNeighbors: distance comparison¶

The k-NN detector compares how far test samples are from their nearest neighbors in the reference set versus the expected baseline distance. A large increase signals that test data occupies different regions of feature space.

knn_result = results["KNN"]
knn_details = knn_result.details

print(f"Mean reference k-NN distance: {knn_details['mean_ref_distance']:.4f}")
print(f"Mean test k-NN distance:      {knn_details['mean_test_distance']:.4f}")
print(f"Distance increase:             {knn_details['mean_test_distance'] - knn_details['mean_ref_distance']:.4f}")
print(f"P-value: {knn_details['p_val']:.6f}")

Mean reference k-NN distance: 5.0591
Mean test k-NN distance:      5.5217
Distance increase:             0.4626
P-value: 0.000000
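To make the distance comparison concrete, here is a minimal NumPy sketch of the idea on synthetic data. It is not DataEval's implementation: the function `mean_knn_distance`, the choice of k=5, and the Gaussian toy data are all assumptions made for illustration.

```python
import numpy as np


def mean_knn_distance(queries, reference, k, exclude_self=False):
    """Mean Euclidean distance from each query to its k-th nearest reference point."""
    # Pairwise distances, shape (n_queries, n_reference)
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs**2).sum(axis=2))
    dists.sort(axis=1)
    # When the queries are the reference set itself, column 0 holds the
    # zero distance of each point to itself, so skip it.
    column = k if exclude_self else k - 1
    return float(dists[:, column].mean())


rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(300, 8))
clean = rng.normal(0.0, 1.0, size=(300, 8))  # same distribution as reference
shifted = rng.normal(2.0, 1.0, size=(300, 8))  # mean shifted in every dimension

baseline = mean_knn_distance(reference, reference, k=5, exclude_self=True)
clean_dist = mean_knn_distance(clean, reference, k=5)
shifted_dist = mean_knn_distance(shifted, reference, k=5)

print(f"Baseline (reference vs. itself): {baseline:.3f}")
print(f"Clean test data:                 {clean_dist:.3f}")
print(f"Shifted test data:               {shifted_dist:.3f}")
```

The shifted data sits farther from its nearest reference neighbors than the baseline, which is the signal a k-NN style detector looks for.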

DriftMMD: multivariate distribution distance¶

MMD measures the overall distance between two distributions in a kernel feature space. It captures both marginal and joint distributional changes that univariate tests might miss.

mmd_result = results["MMD"]
mmd_details = mmd_result.details

print(f"MMD² distance:   {mmd_result.distance:.6f}")
print(f"MMD² threshold:  {mmd_details['distance_threshold']:.6f}")
print(f"P-value:         {mmd_details['p_val']:.6f}")

MMD² distance:   0.099169
MMD² threshold:  0.000055
P-value:         0.000000
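For intuition about the quantity being reported, the sketch below computes a biased estimate of MMD² with a Gaussian (RBF) kernel on synthetic data. It is only an illustration: the names `rbf_kernel` and `mmd2_biased`, the bandwidth `sigma=3.0`, and the toy data are assumptions, and DriftMMD's actual kernel, bandwidth selection, and thresholding may differ.

```python
import numpy as np


def rbf_kernel(a, b, sigma):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def mmd2_biased(x, y, sigma=3.0):
    """Biased MMD^2 estimate: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    k_xx = rbf_kernel(x, x, sigma).mean()
    k_yy = rbf_kernel(y, y, sigma).mean()
    k_xy = rbf_kernel(x, y, sigma).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)


rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(200, 8))
same = rng.normal(0.0, 1.0, size=(200, 8))  # same distribution
shifted = rng.normal(1.0, 1.0, size=(200, 8))  # mean shifted by 1 per dimension

mmd_same = mmd2_biased(reference, same)
mmd_shifted = mmd2_biased(reference, shifted)
print(f"MMD^2 (same distribution): {mmd_same:.4f}")
print(f"MMD^2 (shifted):           {mmd_shifted:.4f}")
```

The estimate stays near zero when both samples come from the same distribution and grows when they differ. A significance threshold is commonly obtained with a permutation test, which this sketch omits.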

Each detector reveals a different facet of drift: the univariate detector pinpoints which features changed, the domain classifier shows which features matter most for distinguishing the distributions, the k-NN detector quantifies how far the data moved, and MMD provides a single multivariate distance between the distributions.

Choosing the right detector¶

The best detector depends on what you need to know:

Which features drifted? Use DriftUnivariate, which provides per-feature p-values and drift flags
Why did drift occur? Use DriftDomainClassifier, whose feature importances show what drives the shift
Need sensitivity to multivariate changes? Use DriftMMD, which captures complex dependencies between features
Need fast, lightweight checks? Use DriftKNeighbors, a simple distance comparison with minimal overhead
Want robust detection? Use multiple detectors with a majority vote to reduce false positives

Monitor drift over time with chunking¶

In real deployments, operational data arrives in batches over time. Rather than comparing all operational data at once, you can use chunking to split the data into segments and monitor how drift evolves across each chunk. This helps identify when drift begins to appear.

DataEval’s drift detectors support chunking through the chunk_count or chunk_size parameters; this tutorial sets chunk_count via chunked() before calling fit(). During fitting, the detector establishes a baseline by computing the metric across chunks of the reference data. During prediction, each chunk of test data is compared against this baseline, returning a DriftOutput with a polars.DataFrame in the details field containing per-chunk results.

Simulate gradual drift onset¶

To illustrate how chunking reveals when drift begins, you will build a combined dataset where the first 40% of samples are clean operational embeddings and the remaining 60% are noisy. This simulates a scenario where data quality degrades partway through a monitoring window.

# Build a combined array: first 40% clean, last 60% noisy
n_operational = len(operational_embs)
split_idx = int(n_operational * 0.4)

combined_embs = np.concatenate([operational_embs[:split_idx], noisy_embs[split_idx:]])
print(f"Combined shape: {combined_embs.shape} (clean: {split_idx}, noisy: {n_operational - split_idx})")

Combined shape: (5823, 128) (clean: 2329, noisy: 3494)

Fit detectors with chunking¶

# Re-fit detectors with chunking enabled (5 chunks each)
chunked_detectors: dict[str, ChunkedDrift] = {
    "CVM": DriftUnivariate(method="cvm").chunked(chunk_count=5).fit(train_embs),
    "MMD": DriftMMD().chunked(chunk_count=5).fit(train_embs),
    "MVDC": DriftDomainClassifier(threshold=(0.45, 0.65)).chunked(chunk_count=5).fit(train_embs),
    "KNN": DriftKNeighbors().chunked(chunk_count=5).fit(train_embs),
}

Predict on combined data and display chunk results¶

for name, detector in chunked_detectors.items():
    result = detector.predict(combined_embs)
    print(f"\n{name} - Overall drift detected: {result.drifted} (metric: {result.metric_name})")
    if isinstance(result.details, pl.DataFrame):
        display(result.details)

CVM - Overall drift detected: True (metric: cvm_distance)

shape: (5, 8)

key	index	start_index	end_index	value	upper_threshold	lower_threshold	drifted
str	i64	i64	i64	f64	f64	f64	bool
"[0:1143]"	0	0	1143	1.609194	1.947011	0.0	false
"[1144:2287]"	1	1144	2287	0.251445	1.947011	0.0	false
"[2288:3431]"	2	2288	3431	21.420139	1.947011	0.0	true
"[3432:4575]"	3	3432	4575	23.144905	1.947011	0.0	true
"[4576:5822]"	4	4576	5822	24.681578	1.947011	0.0	true

MMD - Overall drift detected: True (metric: mmd2)

shape: (5, 8)

key	index	start_index	end_index	value	upper_threshold	lower_threshold	drifted
str	i64	i64	i64	f64	f64	f64	bool
"[0:1143]"	0	0	1143	0.006662	0.009011	-0.004314	false
"[1144:2287]"	1	1144	2287	0.000378	0.009011	-0.004314	false
"[2288:3431]"	2	2288	3431	0.094143	0.009011	-0.004314	true
"[3432:4575]"	3	3432	4575	0.101829	0.009011	-0.004314	true
"[4576:5822]"	4	4576	5822	0.101317	0.009011	-0.004314	true

MVDC - Overall drift detected: True (metric: auroc)

shape: (5, 8)

key	index	start_index	end_index	value	upper_threshold	lower_threshold	drifted
str	i64	i64	i64	f64	f64	f64	bool
"[0:1143]"	0	0	1143	0.611608	0.65	0.45	false
"[1144:2287]"	1	1144	2287	0.504783	0.65	0.45	false
"[2288:3431]"	2	2288	3431	0.973157	0.65	0.45	true
"[3432:4575]"	3	3432	4575	0.997731	0.65	0.45	true
"[4576:5822]"	4	4576	5822	0.997487	0.65	0.45	true

KNN - Overall drift detected: True (metric: knn_distance)

shape: (5, 8)

key	index	start_index	end_index	value	upper_threshold	lower_threshold	drifted
str	i64	i64	i64	f64	f64	f64	bool
"[0:1143]"	0	0	1143	4.987197	5.191182	4.927011	false
"[1144:2287]"	1	1144	2287	5.10144	5.191182	4.927011	false
"[2288:3431]"	2	2288	3431	5.523569	5.191182	4.927011	true
"[3432:4575]"	3	3432	4575	5.577302	5.191182	4.927011	true
"[4576:5822]"	4	4576	5822	5.5198	5.191182	4.927011	true

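The first two chunks (indices 0 to 2287) fall entirely within the clean 40% and show no drift. The remaining three chunks consist almost entirely of noisy samples (the switch to noisy data happens at index 2329, just inside the third chunk), and each one triggers a drift alert. This chunk-level view makes it easy to pinpoint when in a data stream drift begins.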
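Reading the onset point off the per-chunk results can also be done programmatically. The sketch below uses plain Python with the values copied from the KNN table above. The `ChunkRow` class and `first_drifted_chunk` helper are hypothetical, and the rule that a chunk drifts when its value falls outside the threshold band is an assumption that happens to match every row shown in the tables.

```python
from dataclasses import dataclass


@dataclass
class ChunkRow:
    start_index: int
    end_index: int
    value: float
    lower_threshold: float
    upper_threshold: float

    @property
    def drifted(self) -> bool:
        # Assumed rule: a chunk drifts when its value leaves the threshold band
        return not (self.lower_threshold <= self.value <= self.upper_threshold)


def first_drifted_chunk(rows):
    """Return the earliest chunk flagged as drifted, or None if there is none."""
    for row in rows:
        if row.drifted:
            return row
    return None


# Values copied from the KNN chunk results above
rows = [
    ChunkRow(0, 1143, 4.987197, 4.927011, 5.191182),
    ChunkRow(1144, 2287, 5.10144, 4.927011, 5.191182),
    ChunkRow(2288, 3431, 5.523569, 4.927011, 5.191182),
    ChunkRow(3432, 4575, 5.577302, 4.927011, 5.191182),
    ChunkRow(4576, 5822, 5.5198, 4.927011, 5.191182),
]

onset = first_drifted_chunk(rows)
print(f"Drift first appears in chunk [{onset.start_index}:{onset.end_index}]")
```

The reported onset is only as precise as the chunk size, so smaller chunks localize the change more tightly at the cost of noisier per-chunk estimates.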

Next you will look at the labels’ distributions.

Evaluate parity¶

Instead of looking at the images, you can compare the distributions of the labels using a method called label parity.
There is parity between two sets of labels if the label frequencies are approximately equal.

You will now compare the label distributions using the label_parity function.

# Get the metadata for each dataset
train_md = Metadata(train_ds)
operational_md = Metadata(operational_ds)

# The VOC dataset has 20 classes
label_parity(train_md.class_labels, operational_md.class_labels, num_classes=20)["p_value"]

0.949856067521638

The label_parity() function returns a p_value of ~0.95. A p-value this high means the test finds no evidence that the two label distributions differ, so the training and operational sets can be treated as having class label parity.
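One common way to compare label frequencies is a chi-squared goodness-of-fit statistic, sketched below on toy labels. This is an illustration under that assumption and not DataEval's implementation: the helper `chi_squared_statistic` and the example label lists are hypothetical, and the p-value step (which requires the chi-squared distribution) is left out.

```python
from collections import Counter


def chi_squared_statistic(reference_labels, observed_labels, num_classes):
    """Chi-squared statistic comparing observed label counts to the counts
    expected if the observed set followed the reference label frequencies."""
    ref_counts = Counter(reference_labels)
    obs_counts = Counter(observed_labels)
    n_ref = len(reference_labels)
    n_obs = len(observed_labels)

    statistic = 0.0
    for label in range(num_classes):
        expected = ref_counts[label] / n_ref * n_obs
        if expected == 0:
            continue  # skip classes that never appear in the reference set
        statistic += (obs_counts[label] - expected) ** 2 / expected
    return statistic


# Toy example with 3 classes and slightly different label counts
reference = [0] * 50 + [1] * 30 + [2] * 20
observed = [0] * 48 + [1] * 33 + [2] * 19

stat = chi_squared_statistic(reference, observed, num_classes=3)
print(f"Chi-squared statistic: {stat:.2f}")
```

A small statistic relative to the number of classes corresponds to a high p-value, which is the situation reported for the VOC train and val splits.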

Conclusion¶

In this tutorial, you have learned to create embeddings from the VOC dataset, compare different drift detectors and their unique outputs, use chunked monitoring to identify when drift begins, and calculate the parity of label distributions.

Key takeaways:

DriftUnivariate reveals which features drifted through per-feature statistical tests
DriftDomainClassifier explains why drift occurred through feature importances
DriftMMD provides a single multivariate distance that captures complex distributional changes
DriftKNeighbors offers fast, lightweight detection based on distance comparisons
Chunked monitoring helps pinpoint when drift begins in a data stream

These are important steps when monitoring data, as drift and lack of parity can affect a model’s ability to achieve performance recorded during model training. When data drift is detected or the label distributions lack parity, it is a good idea to consider retraining the model and incorporating operational data into the dataset.

What’s next¶

DataEval plays a small but impactful role in data monitoring as a metrics library.
Visit these additional resources for more information on other aspects:

Increase your understanding of the types of data shifts that occur during monitoring
Read about the entire monitoring stage of the AI/ML lifecycle
Explore DataEval’s API reference for drift and other monitoring tools
Learn about identifying out-of-distribution samples

To learn more about setting a global seed in DataEval, see the hardware configuration how-to.

On your own¶

Once you are familiar with DataEval and data monitoring, run this analysis using your own reference and operational datasets.

Experiment with:

Different embeddings for KNN: ResNet, ViT, CLIP, or domain-specific pretrained models
Custom architectures: Design models for your specific data type (not generic examples)
Different drift scenarios: Test on your own data with varying difficulty levels