{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# HP Divergence Estimation Tutorial\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## _Problem Statement_\n", "\n", "When evaluating new testing data, or comparing two datasets, we often want to have a quantitative way of comparing and evaluating shifts in covariates. HP divergence is a nonparametric divergence metric which gives the distance between two datasets. A divergence of 0 means that the two datasets are approximately identically distributed. A divergence of 1 means the two datasets are completely separable.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _When to use_\n", "\n", "The `Divergence` class should be used when you would like to know how far two datasets are diverged for one another. For example, if you would like to measure operational drift.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _What you will need_\n", "\n", "1. A set of image embeddings for each dataset (usually obtained with an AutoEncoder)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _Setting up_\n", "\n", "Let's import the required libraries needed to set up a minimal working example\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "try:\n", " import google.colab # noqa: F401\n", "\n", " # specify the version of DataEval (==X.XX.X) for versions other than the latest\n", " %pip install -q dataeval\n", "except Exception:\n", " pass" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from dataeval.metrics.estimators import divergence\n", "from dataeval.utils.torch.datasets import MNIST" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading in data\n", "\n", "Let's start by loading in tensorflow's MNIST dataset,\n", "then we will examine it.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Load in the training mnist dataset and use the first 4000\n", "train_ds = MNIST(root=\"./data/\", train=True, download=True, size=4000, flatten=True)\n", "\n", "# Split out the images and labels\n", "images, labels = train_ds.data, train_ds.targets" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Number of samples: \", len(images))\n", "print(\"Image shape:\", images[0].shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate initial divergence\n", "\n", "Let's calculate the divergence between the first 2000 images and the second 2000 images from this sample.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "data_a = images[0:2000]\n", "data_b = images[2000:]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "div = divergence(data_a, data_b)\n", "print(div)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We estimate that the divergence between these (identically distributed) images sets is at or close to 0.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading in corrupted data\n", "\n", "Now let's load in a corrupted mnist dataset.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "corruption = MNIST(root=\"./data\", train=True, download=False, size=2000, flatten=True, corruption=\"translate\")\n", "corrupted_images = corruption.data" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"Number of corrupted samples: \", len(corrupted_images))\n", "print(\"Corrupted image shape:\", corrupted_images[0].shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Calculate corrupted divergence\n", "\n", "Now lets calculate the Divergence between this corrupted dataset and the original images\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "div = divergence(data_a, corrupted_images)\n", "print(div)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We conclude that the translated MNIST images are significantly different from the original images.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" } }, "nbformat": 4, "nbformat_minor": 4 }