{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Drift Detection Tutorial Using Multiple Drift Detectors\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## _Problem Statement_\n", "\n", "When evaluating and monitoring data after model deployment, it is important to test incoming data for potential drift that may affect model performance.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _When to use_\n", "\n", "Use the `dataeval.detectors` drift detection classes when you want to test new data for operational drift.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _What you will need_\n", "\n", "1. A set of image embeddings for each dataset (usually obtained with an autoencoder)\n", "2. A Python environment with the following packages installed:\n", "   - `dataeval` or `dataeval[all]`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _Setting up_\n", "\n", "Let's import the libraries needed to set up a minimal working example.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "try:\n", " import google.colab # noqa: F401\n", "\n", " # specify the version of DataEval (==X.XX.X) for versions other than the latest\n", " %pip install -q dataeval\n", "except Exception:\n", " pass" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from functools import partial\n", "\n", "import numpy as np\n", "import torch\n", "\n", "from dataeval.detectors.drift import (\n", " DriftCVM,\n", " DriftKS,\n", " DriftMMD,\n", " preprocess_drift,\n", ")\n", "from dataeval.utils.dataset.datasets import MNIST\n", "from dataeval.utils.torch.models import Autoencoder\n", "\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading in data\n", "\n", "Let's start by loading in 
the MNIST dataset,\n", "then we will examine it.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load the MNIST training set and keep the first 4000 samples\n", "train_ds = MNIST(root=\"./data/\", train=True, download=True, size=4000, dtype=np.float32, channels=\"channels_first\")\n", "\n", "# Split out the images and labels\n", "images, labels = train_ds.data, train_ds.targets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Number of samples: \", len(images))\n", "print(\"Image shape:\", images[0].shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Test reference against control\n", "\n", "Let's check for drift between the first 2000 images (the reference set) and the remaining 2000 images (the control set).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_reference = images[0:2000]\n", "data_control = images[2000:]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To reduce the dimensionality of the data, we can pass the encoder of a simple Autoencoder to the `preprocess_fn`. 
While this is optional for the MNIST dataset, it is highly recommended for datasets that have higher dimensionality.\n", "\n", "In this tutorial, we will use three drift detectors: Maximum Mean Discrepancy (MMD), Cramér-von Mises (CVM), and Kolmogorov-Smirnov (KS).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# define encoder\n", "encoder_net = Autoencoder(1).encoder.to(device)\n", "\n", "# define preprocessing function\n", "preprocess_fn = partial(preprocess_drift, model=encoder_net, batch_size=64, device=device)\n", "\n", "# initialise drift detectors\n", "detectors = [detector(data_reference, preprocess_fn=preprocess_fn) for detector in [DriftMMD, DriftCVM, DriftKS]]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We expect no detector to report drift, since both the reference and control sets come from the same MNIST training dataset.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "results = {type(detector).__name__: detector.predict(data_control).is_drift for detector in detectors}\n", "print(results)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "### TEST ASSERTION CELL ###\n", "assert all(not v for v in results.values())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading in corrupted data\n", "\n", "Now let's load a corrupted version of the MNIST dataset, in which the images have been translated.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corruption = MNIST(\n", " root=\"./data\",\n", " train=True,\n", " download=False,\n", " size=2000,\n", " dtype=np.float32,\n", " channels=\"channels_first\",\n", " corruption=\"translate\",\n", ")\n", "corrupted_images = corruption.data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Number of corrupted 
samples: \", len(corrupted_images))\n", "print(\"Corrupted image shape:\", corrupted_images[0].shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Check for drift against corrupted data\n", "\n", "Test for drift between the corrupted dataset and the original reference set using all 3 detectors.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corrupted = {type(detector).__name__: detector.predict(corrupted_images).is_drift for detector in detectors}\n", "print(corrupted)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "### TEST ASSERTION CELL ###\n", "assert all(v for v in corrupted.values())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We conclude that the translated MNIST images are significantly different from the original images according to all 3 measures of drift.\n" ] } ], "metadata": { "kernelspec": { "display_name": ".venv-3.11", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 4 }