{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Coverage\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## _Problem Statement_\n", "\n", "For computer vision tasks such as **image classification** and **object detection**, we often have many images overall, yet certain subsets (a particular label, a style within a label, and so on) can still be undersampled. Coverage analysis is one way to detect this regional sparsity.\n", "\n", "To help with this, DataEval provides a `coverage` function that returns example images with few similar instances in the provided dataset.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _When to use_\n", "\n", "The `coverage` function should be used when you have lots of images, but only a small fraction of them come from certain regimes or labels.\n", "\n", "### _What you will need_\n", "\n", "1. An image classification dataset.\n", "2. An autoencoder trained on that dataset for dimension reduction (e.g. through the `AETrainer` class).\n", "3. 
A Python environment with the following packages installed:\n", "   - `dataeval` or `dataeval[all]`\n", "   - `tabulate`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _Setting up_\n", "\n", "Let's import the libraries needed to set up a minimal working example.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "try:\n", "    import base64\n", "    import io\n", "    import json\n", "\n", "    import google.colab  # noqa: F401\n", "    import torch\n", "\n", "    # specify the version of DataEval (==X.XX.X) for versions other than the latest\n", "    %pip install -q dataeval\n", "    !export LC_ALL=\"en_US.UTF-8\"\n", "    !export LD_LIBRARY_PATH=\"/usr/lib64-nvidia\"\n", "    !export LIBRARY_PATH=\"/usr/local/cuda/lib64/stubs\"\n", "    !ldconfig /usr/lib64-nvidia\n", "\n", "    # Download the pretrained model weights stored on GitHub\n", "    !mkdir models\n", "    !curl -o gitlfsbinary https://api.github.com/repos/aria-ml/dataeval/git/blobs/ad520d5589fdc49830f98d28aa5eaed0bbdfe5cb\n", "\n", "    # The GitHub blob API returns JSON with the file content base64-encoded\n", "    with open(\"gitlfsbinary\") as f:\n", "        rawfile = json.load(f)\n", "\n", "    binaryfile = base64.b64decode(rawfile[\"content\"])\n", "    buffer = io.BytesIO(binaryfile)\n", "\n", "    temp = torch.load(buffer, weights_only=False)\n", "    torch.save(temp, \"models/ae\")\n", "\n", "    del rawfile\n", "    del binaryfile\n", "    del buffer\n", "    del temp\n", "except Exception:\n", "    pass\n", "\n", "%pip install -q tabulate" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "import matplotlib.pyplot as plt  # type: ignore\n", "import numpy as np\n", "import torch\n", "import torch.nn as nn\n", "from sklearn.manifold import TSNE  # type: ignore\n", "\n", "from dataeval.metrics.bias import coverage\n", "from dataeval.utils.dataset.datasets import MNIST" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load the data\n", "\n", "We will use the 
MNIST dataset, loaded through DataEval's `MNIST` dataset utility, for this tutorial on coverage.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# We train a 10-d autoencoder on MNIST data for 1000 epochs with batch size 128\n", "num_epochs = 1000\n", "batch_size = 128\n", "\n", "# Set seeds\n", "torch.manual_seed(14)\n", "\n", "# MNIST normalized to zero mean and unit variance\n", "trainset = MNIST(\n", "    root=\"./data\",\n", "    train=True,\n", "    download=True,\n", "    size=2000,\n", "    unit_interval=True,\n", "    dtype=np.float32,\n", "    channels=\"channels_first\",\n", "    normalize=(0.1307, 0.3081),\n", ")\n", "dataloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size, shuffle=True, num_workers=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, we will use an autoencoder to reduce the dimension of the MNIST images.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define model architecture\n", "class Autoencoder(nn.Module):\n", "    def __init__(self):\n", "        super().__init__()\n", "        self.encoder = nn.Sequential(\n", "            # 1 x 28 x 28\n", "            nn.Conv2d(1, 4, kernel_size=5),\n", "            # 4 x 24 x 24\n", "            nn.ReLU(True),\n", "            nn.Conv2d(4, 8, kernel_size=5),\n", "            nn.ReLU(True),\n", "            # 8 x 20 x 20 = 3200\n", "            nn.Flatten(),\n", "            nn.Linear(3200, 10),\n", "            # 10\n", "            nn.Sigmoid(),\n", "        )\n", "        self.decoder = nn.Sequential(\n", "            # 10\n", "            nn.Linear(10, 400),\n", "            # 400\n", "            nn.ReLU(True),\n", "            nn.Linear(400, 4000),\n", "            # 4000\n", "            nn.ReLU(True),\n", "            nn.Unflatten(1, (10, 20, 20)),\n", "            # 10 x 20 x 20\n", "            nn.ConvTranspose2d(10, 10, kernel_size=5),\n", "            # 10 x 24 x 24\n", "            nn.ConvTranspose2d(10, 1, kernel_size=5),\n", "            # 1 x 28 x 28\n", "            nn.Sigmoid(),\n", "        )\n", "\n", "    def forward(self, x):\n", "        x = self.encoder(x)\n", "        x = self.decoder(x)\n", "        return x\n", "\n", "    def encode(self, x):\n", "        x = self.encoder(x)\n", "        return x" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "For computational reasons, we will simply load the trained autoencoder. See the how-to [How to create image embeddings with an autoencoder](AETrainerTutorial.ipynb) for more information on how to train an autoencoder.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sd = torch.load(\"models/ae\")\n", "model = Autoencoder()\n", "model.load_state_dict(sd)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get images to predict on and predict\n", "pred = trainset.data\n", "label = trainset.targets\n", "mod_preds = model.encode(torch.tensor(pred)).detach().numpy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To visualize the encodings, we will use TSNE on them to view separation.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Visualize 10d as 2d with TSNE\n", "tsne = TSNE(n_components=2)\n", "red_dim = tsne.fit_transform(mod_preds)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot results with color being label\n", "fig, ax = plt.subplots()\n", "scatter = ax.scatter(\n", " x=red_dim[:, 0],\n", " y=red_dim[:, 1],\n", " c=label,\n", " label=label,\n", ")\n", "ax.legend(*scatter.legend_elements(), loc=\"upper right\", ncols=2)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some good separation, but you can see a few images in the \"gaps\". 
This could be an artifact of the dimension reduction, or it could suggest that we have poor coverage for some covariates.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Way to calculate data-agnostic radius (probably don't want to do this)\n", "k = 20\n", "n = 2000\n", "d = 10\n", "rho = (1 / math.sqrt(math.pi)) * ((4 * k * math.gamma(d / 2 + 1)) / (n)) ** (1 / d)\n", "\n", "# Way to calculate data-adaptive radius (most extreme 1% are uncovered)\n", "percent = 0.01\n", "cutoff = int(n * percent)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use data adaptive cutoff\n", "cvrg = coverage(mod_preds, radius_type=\"adaptive\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot the 16 least-covered images\n", "f, axs = plt.subplots(4, 4)\n", "axs = axs.flatten()\n", "for count, ax in enumerate(axs):\n", "    ax.imshow(np.squeeze(pred[cvrg.indices[count]]), cmap=\"gray\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Coverage tool identified that, in this set of 2000 images, there is potential under-coverage of wonky 2s and 7s.\n", "Other digits also have some under-covered instances, but these may simply be outliers. 
\n", "Further investigation is needed to determine whether they are outliers; see [How to identify outliers and/or anomalies in a dataset](ClustererTutorial.ipynb) for more information.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "### TEST ASSERTION CELL ###\n", "# More than half of the 16 least-covered images should be 2s or 7s\n", "wonky = sum(label[i] == 7 or label[i] == 2 for i in cvrg.indices[:16])\n", "assert (wonky / 16) > 0.5" ] } ], "metadata": { "kernelspec": { "display_name": ".venv-3.12", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }