{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Coverage\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## _Problem Statement_\n", "\n", "In computer vision tasks such as **image classification** and **object detection**, a dataset often contains many images overall while certain subsets remain undersampled, for example a particular label or a particular style within a label. Coverage analysis is one way to detect this regional sparsity.\n", "\n", "To help with this, DataEval provides a `coverage` function that identifies example images which have few similar instances within the provided dataset.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _When to use_\n", "\n", "Use `coverage` when you have many images, but only a small fraction of them come from certain regimes or labels.\n", "\n", "### _What you will need_\n", "\n", "1. An image classification dataset.\n", "2. An autoencoder trained on the image classification dataset for dimension reduction (e.g. through the `AETrainer` class).\n", "3. 
A Python environment with the following packages installed:\n", "   - `dataeval[torch]` or `dataeval[all]`\n", "   - `tabulate`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _Setting up_\n", "\n", "Let's import the libraries needed to set up a minimal working example.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "try:\n", "    import base64\n", "    import io\n", "    import json\n", "\n", "    import google.colab  # noqa: F401\n", "    import torch\n", "\n", "    # specify the version of DataEval (==X.XX.X) for versions other than the latest\n", "    %pip install -q dataeval[torch]\n", "    !export LC_ALL=\"en_US.UTF-8\"\n", "    !export LD_LIBRARY_PATH=\"/usr/lib64-nvidia\"\n", "    !export LIBRARY_PATH=\"/usr/local/cuda/lib64/stubs\"\n", "    !ldconfig /usr/lib64-nvidia\n", "\n", "    # Download the pretrained model weights stored on GitHub\n", "    !mkdir models\n", "    !curl -o gitlfsbinary https://api.github.com/repos/aria-ml/dataeval/git/blobs/ad520d5589fdc49830f98d28aa5eaed0bbdfe5cb\n", "\n", "    # The blob API returns JSON with the file content base64-encoded\n", "    with open(\"gitlfsbinary\") as f:\n", "        rawfile = json.load(f)\n", "\n", "    binaryfile = base64.b64decode(rawfile[\"content\"])\n", "    buffer = io.BytesIO(binaryfile)\n", "\n", "    temp = torch.load(buffer, weights_only=False)\n", "    torch.save(temp, \"models/ae\")\n", "\n", "    del rawfile\n", "    del binaryfile\n", "    del buffer\n", "    del temp\n", "except Exception:\n", "    pass\n", "\n", "%pip install -q tabulate" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "import matplotlib.pyplot as plt  # type: ignore\n", "import numpy as np\n", "import torch\n", "import torch.nn as nn\n", "from sklearn.manifold import TSNE  # type: ignore\n", "\n", "from dataeval.metrics.bias import coverage\n", "from dataeval.utils.torch.datasets import MNIST" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data\n", "\n", "We 
will use the MNIST dataset for this tutorial on coverage, loaded through the `MNIST` class from `dataeval.utils.torch.datasets`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We train a 10-d autoencoder on MNIST data for 1000 epochs with batch size 128\n", "num_epochs = 1000\n", "batch_size = 128\n", "\n", "# Set seeds\n", "torch.manual_seed(14)\n", "\n", "# MNIST normalized to zero mean and unit variance\n", "trainset = MNIST(\n", "    root=\"./data\",\n", "    train=True,\n", "    download=True,\n", "    size=2000,\n", "    unit_interval=True,\n", "    dtype=np.float32,\n", "    channels=\"channels_first\",\n", "    normalize=(0.1307, 0.3081),\n", ")\n", "dataloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we will use an autoencoder to reduce the dimension of the MNIST images.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Define model architecture\n", "class Autoencoder(nn.Module):\n", "    def __init__(self):\n", "        super().__init__()\n", "        self.encoder = nn.Sequential(\n", "            # 1 x 28 x 28\n", "            nn.Conv2d(1, 4, kernel_size=5),\n", "            # 4 x 24 x 24\n", "            nn.ReLU(True),\n", "            nn.Conv2d(4, 8, kernel_size=5),\n", "            nn.ReLU(True),\n", "            # 8 x 20 x 20 = 3200\n", "            nn.Flatten(),\n", "            nn.Linear(3200, 10),\n", "            # 10\n", "            nn.Sigmoid(),\n", "        )\n", "        self.decoder = nn.Sequential(\n", "            # 10\n", "            nn.Linear(10, 400),\n", "            # 400\n", "            nn.ReLU(True),\n", "            nn.Linear(400, 4000),\n", "            # 4000\n", "            nn.ReLU(True),\n", "            nn.Unflatten(1, (10, 20, 20)),\n", "            # 10 x 20 x 20\n", "            nn.ConvTranspose2d(10, 10, kernel_size=5),\n", "            # 10 x 24 x 24\n", "            nn.ConvTranspose2d(10, 1, kernel_size=5),\n", "            # 1 x 28 x 28\n", "            nn.Sigmoid(),\n", "        )\n", "\n", "    def forward(self, x):\n", "        x = self.encoder(x)\n", "        x = self.decoder(x)\n", "        return x\n", "\n", "    def encode(self, x):\n", "        x = self.encoder(x)\n", "        return x" ] }, { "cell_type": "markdown", "metadata": 
{}, "source": [ "For computational reasons, we will simply load the trained autoencoder. See the how-to [How to create image embeddings with an autoencoder](AETrainerTutorial.ipynb) for more information on how to train an autoencoder.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sd = torch.load(\"models/ae\")\n", "model = Autoencoder()\n", "model.load_state_dict(sd)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "# Encode the images into 10-d embeddings with the trained autoencoder\n", "pred = trainset.data\n", "label = trainset.targets\n", "mod_preds = model.encode(torch.tensor(pred)).detach().numpy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To visualize the encodings, we will project them to two dimensions with t-SNE and look at how well the labels separate.\n" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "# Visualize 10d as 2d with TSNE\n", "tsne = TSNE(n_components=2)\n", "red_dim = tsne.fit_transform(mod_preds)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot results with color being label\n", "fig, ax = plt.subplots()\n", "scatter = ax.scatter(\n", "    x=red_dim[:, 0],\n", "    y=red_dim[:, 1],\n", "    c=label,\n", "    label=label,\n", ")\n", "ax.legend(*scatter.legend_elements(), loc=\"upper right\", ncols=2)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There is some good separation, but a few images fall in the \"gaps\" between clusters. 
This could be an artifact of dimension reduction, or it could suggest that we have poor coverage for some covariates.\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Way to calculate data-agnostic radius (probably don't want to do this)\n", "k = 20\n", "n = 2000\n", "d = 10\n", "rho = (1 / math.sqrt(math.pi)) * ((4 * k * math.gamma(d / 2 + 1)) / (n)) ** (1 / d)\n", "\n", "# Way to calculate data-adaptive radius (most extreme 1% are uncovered)\n", "percent = 0.01\n", "cutoff = int(n * percent)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "# Use data adaptive cutoff\n", "cvrg = coverage(mod_preds, radius_type=\"adaptive\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot 16 of the least covered images in a 4 x 4 grid\n", "f, axs = plt.subplots(4, 4)\n", "axs = axs.flatten()\n", "for count, i in enumerate(axs):\n", "    i.imshow(np.squeeze(pred[cvrg.indices[count]]), cmap=\"gray\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Coverage tool identified potential under-coverage of irregularly drawn or crossed 7s in this set of 2000 images.\n", "Other digits also have some under-covered instances, but these may simply be outliers.\n", "More investigation into their outlier status is needed; see [How to identify outliers and/or anomalies in a dataset](ClustererTutorial.ipynb) for more information.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }