{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# HP Divergence Estimation Tutorial\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## _Problem Statement_\n",
    "\n",
    "When evaluating new testing data, or comparing two datasets, we often want to have a quantitative way of comparing and evaluating shifts in covariates. HP divergence is a nonparametric divergence metric which gives the distance between two datasets. A divergence of 0 means that the two datasets are approximately identically distributed. A divergence of 1 means the two datasets are completely separable.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### _When to use_\n",
    "\n",
    "The `Divergence` class should be used when you would like to know how far two datasets are diverged for one another. For example, if you would like to measure operational drift.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### _What you will need_\n",
    "\n",
    "1. A set of image embeddings for each dataset (usually obtained with an AutoEncoder)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### _Setting up_\n",
    "\n",
    "Let's import the required libraries needed to set up a minimal working example\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "remove_cell"
    ]
   },
   "outputs": [],
   "source": [
    "try:\n",
    "    import google.colab  # noqa: F401\n",
    "\n",
    "    # specify the version of DataEval (==X.XX.X) for versions other than the latest\n",
    "    %pip install -q dataeval\n",
    "except Exception:\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataeval.metrics.estimators import divergence\n",
    "from dataeval.utils.torch.datasets import MNIST"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading in data\n",
    "\n",
    "Let's start by loading in tensorflow's MNIST dataset,\n",
    "then we will examine it.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load in the training mnist dataset and use the first 4000\n",
    "train_ds = MNIST(root=\"./data/\", train=True, download=True, size=4000, flatten=True)\n",
    "\n",
    "# Split out the images and labels\n",
    "images, labels = train_ds.data, train_ds.targets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Number of samples: \", len(images))\n",
    "print(\"Image shape:\", images[0].shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculate initial divergence\n",
    "\n",
    "Let's calculate the divergence between the first 2000 images and the second 2000 images from this sample.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_a = images[0:2000]\n",
    "data_b = images[2000:]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "div = divergence(data_a, data_b)\n",
    "print(div)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We estimate that the divergence between these (identically distributed) images sets is at or close to 0.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading in corrupted data\n",
    "\n",
    "Now let's load in a corrupted mnist dataset.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "corruption = MNIST(root=\"./data\", train=True, download=False, size=2000, flatten=True, corruption=\"translate\")\n",
    "corrupted_images = corruption.data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Number of corrupted samples: \", len(corrupted_images))\n",
    "print(\"Corrupted image shape:\", corrupted_images[0].shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculate corrupted divergence\n",
    "\n",
    "Now lets calculate the Divergence between this corrupted dataset and the original images\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "div = divergence(data_a, corrupted_images)\n",
    "print(div)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We conclude that the translated MNIST images are significantly different from the original images.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}