{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Drift Detection Tutorial Using Multiple Drift Detectors\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## _Problem Statement_\n",
    "\n",
    "When evaluating and monitoring data after model deployment, it is important to test incoming data for potential drift which may affect model performance.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### _When to use_\n",
    "\n",
    "The `dataeval.detectors` drift detection classes should be used when you would like to measure new data for operational drift.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### _What you will need_\n",
    "\n",
    "1. A set of image embeddings for each dataset (usually obtained with an AutoEncoder)\n",
    "2. A python environment with the following packages installed:\n",
    "   - `dataeval` or `dataeval[all]`\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### _Setting up_\n",
    "\n",
    "Let's import the required libraries needed to set up a minimal working example\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "remove_cell"
    ]
   },
   "outputs": [],
   "source": [
    "try:\n",
    "    import google.colab  # noqa: F401\n",
    "\n",
    "    # specify the version of DataEval (==X.XX.X) for versions other than the latest\n",
    "    %pip install -q dataeval\n",
    "except Exception:\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from functools import partial\n",
    "\n",
    "import numpy as np\n",
    "import torch\n",
    "\n",
    "from dataeval.detectors.drift import (\n",
    "    DriftCVM,\n",
    "    DriftKS,\n",
    "    DriftMMD,\n",
    "    preprocess_drift,\n",
    ")\n",
    "from dataeval.utils.dataset.datasets import MNIST\n",
    "from dataeval.utils.torch.models import Autoencoder\n",
    "\n",
    "device = \"cuda\" if torch.cuda.is_available() else \"cpu\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading in data\n",
    "\n",
    "Let's start by loading in torchvision's mnist dataset,\n",
    "then we will examine it\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load in the training mnist dataset and use the first 4000\n",
    "train_ds = MNIST(root=\"./data/\", train=True, download=True, size=4000, dtype=np.float32, channels=\"channels_first\")\n",
    "\n",
    "# Split out the images and labels\n",
    "images, labels = train_ds.data, train_ds.targets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Number of samples: \", len(images))\n",
    "print(\"Image shape:\", images[0].shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test reference against control\n",
    "\n",
    "Let's check for drift between the first 2000 images and the second 2000 images from this sample.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_reference = images[0:2000]\n",
    "data_control = images[2000:]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In order to reduce the dimensionality of the data, we can set a simple Autoencoder to the `preprocess_fn`. While this is optional for the MNIST data set, it is highly recommended for datasets that have higher dimensionality.\n",
    "\n",
    "For the purposes of the tutorial, we will use 3 forms of drift detectors: Maximum Mean Discrepancy (MMD), Cramér-von Mises (CVM), and Kolmogorov-Smirnov (KS).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define encoder\n",
    "encoder_net = Autoencoder(1).encoder.to(device)\n",
    "\n",
    "# define preprocessing function\n",
    "preprocess_fn = partial(preprocess_drift, model=encoder_net, batch_size=64, device=device)\n",
    "\n",
    "# initialise drift detectors\n",
    "detectors = [detector(data_reference, preprocess_fn=preprocess_fn) for detector in [DriftMMD, DriftCVM, DriftKS]]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We estimate that the test for drift is false for all detectors as both the reference and test data set is from the same MNIST training dataset.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "results = {type(detector).__name__: detector.predict(data_control).is_drift for detector in detectors}\n",
    "print(results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "remove_cell"
    ]
   },
   "outputs": [],
   "source": [
    "### TEST ASSERTION CELL ###\n",
    "assert all(not v for v in results.values())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading in corrupted data\n",
    "\n",
    "Now let's load in a corrupted MNIST dataset.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "corruption = MNIST(\n",
    "    root=\"./data\",\n",
    "    train=True,\n",
    "    download=False,\n",
    "    size=2000,\n",
    "    dtype=np.float32,\n",
    "    channels=\"channels_first\",\n",
    "    corruption=\"translate\",\n",
    ")\n",
    "corrupted_images = corruption.data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Number of corrupted samples: \", len(corrupted_images))\n",
    "print(\"Corrupted image shape:\", corrupted_images[0].shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Check for drift against corrupted data\n",
    "\n",
    "Test for drift between the corrupted dataset and the original reference set using all 3 detectors.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "corrupted = {type(detector).__name__: detector.predict(corrupted_images).is_drift for detector in detectors}\n",
    "print(corrupted)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "remove_cell"
    ]
   },
   "outputs": [],
   "source": [
    "### TEST ASSERTION CELL ###\n",
    "assert all(v for v in corrupted.values())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We conclude that the translated MNIST images are significantly different from the original images according to all 3 measures of drift.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv-3.11",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}