{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Coverage\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## _Problem Statement_\n",
    "\n",
    "For most computer vision tasks like **image classification** and **object detection**, we often have a lot of images, but certain subsets of the images can be undersampled, such as label, style within a label, etc. A way to detect this regional sparsity is through coverage analysis.\n",
    "\n",
    "To help with this, DataEval has introduced a Coverage class ( `Coverage` ), that provides a user with example images which have few similar instances within the provided dataset.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### _When to use_\n",
    "\n",
    "The `Coverage` class should be used when you have lots of images, but only a small fraction from certain regimes/labels.\n",
    "\n",
    "### _What you will need_\n",
    "\n",
    "1. Image classification dataset.\n",
    "2. Autoencoder trained on image classification dataset for dimension reduction (e.g. through the `AETrainer` class).\n",
    "3. A python environment with the following packages installed:\n",
    "   - `dataeval[torch]` or `dataeval[all]`\n",
    "   - `tabulate`\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### _Setting up_\n",
    "\n",
    "Let's import the required libraries needed to set up a minimal working example\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "remove_cell"
    ]
   },
   "outputs": [],
   "source": [
    "try:\n",
    "    import base64\n",
    "    import io\n",
    "    import json\n",
    "\n",
    "    import google.colab  # noqa: F401\n",
    "    import torch\n",
    "\n",
    "    # specify the version of DataEval (==X.XX.X) for versions other than the latest\n",
    "    %pip install -q dataeval[torch]\n",
    "    !export LC_ALL=\"en_US.UTF-8\"\n",
    "    !export LD_LIBRARY_PATH=\"/usr/lib64-nvidia\"\n",
    "    !export LIBRARY_PATH=\"/usr/local/cuda/lib64/stubs\"\n",
    "    !ldconfig /usr/lib64-nvidia\n",
    "\n",
    "    # Code below is to download the pretrained model weights stored on github\n",
    "    !mkdir models\n",
    "    !curl -o gitlfsbinary https://api.github.com/repos/aria-ml/dataeval/git/blobs/ad520d5589fdc49830f98d28aa5eaed0bbdfe5cb\n",
    "\n",
    "    with open(\"gitlfsbinary\") as f:\n",
    "        rawfile = json.load(f)\n",
    "\n",
    "    binaryfile = base64.b64decode(rawfile[\"content\"])\n",
    "    buffer = io.BytesIO(binaryfile)\n",
    "\n",
    "    temp = torch.load(buffer, weights_only=False)\n",
    "    torch.save(temp, \"models/ae\")\n",
    "\n",
    "    del rawfile\n",
    "    del binaryfile\n",
    "    del buffer\n",
    "    del temp\n",
    "except Exception:\n",
    "    pass\n",
    "\n",
    "%pip install -q tabulate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "\n",
    "import matplotlib.pyplot as plt  # type: ignore\n",
    "import numpy as np\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "from sklearn.manifold import TSNE  # type: ignore\n",
    "\n",
    "from dataeval.metrics.bias import coverage\n",
    "from dataeval.utils.torch.datasets import MNIST"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data\n",
    "\n",
    "We will use the MNIST dataset from torchvision for this tutorial on coverage.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We train a 10-d autoencoder on MNIST data for 1000 epochs with batch size 128\n",
    "num_epochs = 1000\n",
    "batch_size = 128\n",
    "\n",
    "# Set seeds\n",
    "torch.manual_seed(14)\n",
    "\n",
    "# MNIST with mean 0 unit variance\n",
    "trainset = MNIST(\n",
    "    root=\"./data\",\n",
    "    train=True,\n",
    "    download=True,\n",
    "    size=2000,\n",
    "    unit_interval=True,\n",
    "    dtype=np.float32,\n",
    "    channels=\"channels_first\",\n",
    "    normalize=(0.1307, 0.3081),\n",
    ")\n",
    "dataloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this tutorial, we will use an autoencoder to reduce the dimension of the MNIST images.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Define model architecture\n",
    "class Autoencoder(nn.Module):\n",
    "    def __init__(self):\n",
    "        super().__init__()\n",
    "        self.encoder = nn.Sequential(\n",
    "            # 28 x 28\n",
    "            nn.Conv2d(1, 4, kernel_size=5),\n",
    "            # 4 x 24 x 24\n",
    "            nn.ReLU(True),\n",
    "            nn.Conv2d(4, 8, kernel_size=5),\n",
    "            nn.ReLU(True),\n",
    "            # 8 x 20 x 20 = 3200\n",
    "            nn.Flatten(),\n",
    "            nn.Linear(3200, 10),\n",
    "            # 10\n",
    "            nn.Sigmoid(),\n",
    "        )\n",
    "        self.decoder = nn.Sequential(\n",
    "            # 10\n",
    "            nn.Linear(10, 400),\n",
    "            # 400\n",
    "            nn.ReLU(True),\n",
    "            nn.Linear(400, 4000),\n",
    "            # 4000\n",
    "            nn.ReLU(True),\n",
    "            nn.Unflatten(1, (10, 20, 20)),\n",
    "            # 10 x 20 x 20\n",
    "            nn.ConvTranspose2d(10, 10, kernel_size=5),\n",
    "            # 24 x 24\n",
    "            nn.ConvTranspose2d(10, 1, kernel_size=5),\n",
    "            # 28 x 28\n",
    "            nn.Sigmoid(),\n",
    "        )\n",
    "\n",
    "    def forward(self, x):\n",
    "        x = self.encoder(x)\n",
    "        x = self.decoder(x)\n",
    "        return x\n",
    "\n",
    "    def encode(self, x):\n",
    "        x = self.encoder(x)\n",
    "        return x"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For computational reasons, we will simply load the trained autoencoder. See the how-to [How to create image embeddings with an autoencoder](AETrainerTutorial.ipynb) for more information on how to train an autoencoder.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sd = torch.load(\"models/ae\")\n",
    "model = Autoencoder()\n",
    "model.load_state_dict(sd)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get images to predict on and predict\n",
    "pred = trainset.data\n",
    "label = trainset.targets\n",
    "mod_preds = model.encode(torch.tensor(pred)).detach().numpy()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To visualize the encodings, we will use TSNE on them to view separation.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Visualize 10d as 2d with TSNE\n",
    "tsne = TSNE(n_components=2)\n",
    "red_dim = tsne.fit_transform(mod_preds)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot results with color being label\n",
    "fig, ax = plt.subplots()\n",
    "scatter = ax.scatter(\n",
    "    x=red_dim[:, 0],\n",
    "    y=red_dim[:, 1],\n",
    "    c=label,\n",
    "    label=label,\n",
    ")\n",
    "ax.legend(*scatter.legend_elements(), loc=\"upper right\", ncols=2)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Some good separation, but you can see a few images in the \"gaps\". This could be an artifact of dimension reduction, or suggest that we have poor coverage for some covariates.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Way to calculate data-agnostic radius (probably don't want to do this)\n",
    "k = 20\n",
    "n = 2000\n",
    "d = 10\n",
    "rho = (1 / math.sqrt(math.pi)) * ((4 * 20 * math.gamma(d / 2 + 1)) / (n)) ** (1 / d)\n",
    "\n",
    "# Way to calculate data-adaptive radius (most extreme 1% are uncovered)\n",
    "percent = 0.01\n",
    "cutoff = int(n * percent)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Use data adaptive cutoff\n",
    "cvrg = coverage(mod_preds, radius_type=\"adaptive\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot the least covered 0.5%\n",
    "f, axs = plt.subplots(4, 4)\n",
    "axs = axs.flatten()\n",
    "for count, i in enumerate(axs):\n",
    "    i.imshow(np.squeeze(pred[cvrg.indices[count]]), cmap=\"gray\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Coverage tool identified that in this set of 2000 images, there is potential under-coverage when it comes to wonky/ crossed 7s.  \n",
    "Other digits have some undercovered instances, but could be they are just outliers.  \n",
    "More investigation into outlier status is needed, see [How to identify outliers and/or anomalies in a dataset](ClustererTutorial.ipynb) for more info.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.10"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}