HP Divergence Estimation Tutorial

Problem Statement

When evaluating new test data or comparing two datasets, we often want a quantitative way to measure shifts in covariates. HP (Henze-Penrose) divergence is a nonparametric divergence metric that measures how far apart the distributions of two datasets are. A divergence of 0 means that the two datasets are approximately identically distributed. A divergence of 1 means that the two datasets are completely separable.
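To make the 0-to-1 scale concrete, here is a minimal sketch of one way such an estimate can be formed. It pools the two datasets, counts the points whose nearest neighbor comes from the other dataset, and normalizes that count. The function name hp_divergence_nn, the brute-force neighbor search, and the toy 2-D data are all illustrative assumptions, and the normalization is assumed because it reproduces the outputs shown in this tutorial. It is not DataEval's implementation.

```python
import math
import random


def hp_divergence_nn(data_a, data_b):
    """Nearest-neighbor sketch of an HP divergence estimate (hypothetical helper).

    Returns (divergence, errors), where errors counts the points whose
    nearest neighbor in the pooled data comes from the other dataset.
    """
    m, n = len(data_a), len(data_b)
    # Tag every point with the dataset it came from (0 or 1).
    pooled = [(p, 0) for p in data_a] + [(p, 1) for p in data_b]

    errors = 0
    for i, (point, label) in enumerate(pooled):
        # Brute-force nearest neighbor, excluding the point itself.
        nearest = min(
            (j for j in range(len(pooled)) if j != i),
            key=lambda j: math.dist(point, pooled[j][0]),
        )
        if pooled[nearest][1] != label:
            errors += 1

    # Assumed normalization; clipped at 0 because the raw value can be negative.
    divergence = max(0.0, 1.0 - errors * (m + n) / (2.0 * m * n))
    return divergence, errors


rng = random.Random(0)


def cloud(center, size):
    """Draw `size` 2-D Gaussian points around (center, center)."""
    return [(rng.gauss(center, 1.0), rng.gauss(center, 1.0)) for _ in range(size)]


same_a, same_b = cloud(0.0, 200), cloud(0.0, 200)
far_b = cloud(50.0, 200)

# Two samples from the same distribution: roughly half of the nearest
# neighbors cross datasets, so the estimate lands near 0.
print(hp_divergence_nn(same_a, same_b))

# Two well-separated clouds: no nearest neighbor crosses datasets,
# so errors is 0 and the estimate is 1.0.
print(hp_divergence_nn(same_a, far_b))
```

The brute-force search is quadratic in the number of points, so this sketch is only suitable for small toy inputs.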

When to use

The divergence function should be used when you would like to know how far two datasets have diverged from one another, for example, to measure operational drift.

What you will need

A set of image embeddings for each dataset (usually obtained with an AutoEncoder). This tutorial uses flattened MNIST images (784 values per image) in place of learned embeddings.

Setting up

Let’s import the libraries needed to set up a minimal working example.

from dataeval.metrics.estimators import divergence
from dataeval.utils.torch.datasets import MNIST

Loading in data

Let’s start by loading the MNIST dataset with DataEval’s MNIST helper, then we will examine it.

# Load the MNIST training set, keep the first 4000 images, and flatten each one to a vector
train_ds = MNIST(root="./data/", train=True, download=True, size=4000, flatten=True)

# Split out the images and labels
images, labels = train_ds.data, train_ds.targets

Files already downloaded and verified

print("Number of samples: ", len(images))
print("Image shape:", images[0].shape)

Number of samples:  4000
Image shape: (784,)

Calculate initial divergence

Let’s calculate the divergence between the first 2000 images and the second 2000 images from this sample.

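# Split the 4000 images into two halves of 2000 images each
data_a = images[0:2000]
data_b = images[2000:]

# Estimate the HP divergence between the two halves
div = divergence(data_a, data_b)
print(div)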

DivergenceOutput(divergence=0.18899999999999995, errors=1622.0)

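The estimated divergence between these two (identically distributed) image sets is about 0.19, which is low, as expected for samples drawn from the same distribution. A finite-sample estimate will not generally be exactly 0.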
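The output above reports both a divergence and an errors count. The sketch below shows how the two numbers appear to be related. The helper divergence_from_errors is hypothetical, and the formula is an assumption inferred from the printed values rather than taken from DataEval's source.

```python
def divergence_from_errors(errors, m, n):
    """Assumed relation between the error count and the reported divergence.

    m and n are the sizes of the two datasets being compared.
    """
    return max(0.0, 1.0 - errors * (m + n) / (2.0 * m * n))


# Values from the output above: 1622 errors for two sets of 2000 images each.
value = divergence_from_errors(1622.0, 2000, 2000)
print(round(value, 3))  # 0.189
```

Under this assumed relation, fewer errors means a divergence closer to 1, and an error count near half the pooled sample size means a divergence closer to 0.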

Loading in corrupted data

Now let’s load in a corrupted MNIST dataset, in which the images have been translated.

corruption = MNIST(root="./data", train=True, download=False, size=2000, flatten=True, corruption="translate")
corrupted_images = corruption.data

Files already downloaded and verified

print("Number of corrupted samples: ", len(corrupted_images))
print("Corrupted image shape:", corrupted_images[0].shape)

Number of corrupted samples:  2000
Corrupted image shape: (784,)

Calculate corrupted divergence

Now let’s calculate the divergence between this corrupted dataset and the first 2000 original images.

div = divergence(data_a, corrupted_images)
print(div)

DivergenceOutput(divergence=0.962, errors=76.0)

The divergence of 0.962 is close to 1, so we conclude that the translated MNIST images are distributed very differently from the original images.
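The two results in this tutorial suggest a simple way to turn divergence into a drift check: compare each incoming batch against a reference set and flag the batches whose divergence exceeds a threshold. The sketch below is one possible arrangement. The function flag_drift, the 0.5 threshold, and the replay_divergence stand-in are all hypothetical. The stand-in simply replays the two divergences reported above so that the example runs without downloading MNIST.

```python
def flag_drift(reference, batches, divergence_fn, threshold=0.5):
    """Return (index, score) for every batch whose divergence exceeds threshold.

    divergence_fn is any callable that takes (reference, batch) and
    returns a divergence between 0 and 1.
    """
    flagged = []
    for index, batch in enumerate(batches):
        score = divergence_fn(reference, batch)
        if score > threshold:
            flagged.append((index, score))
    return flagged


# Stand-in scorer: replays the divergences reported in this tutorial
# instead of computing them from image data.
replayed = {"second_half": 0.189, "translated": 0.962}


def replay_divergence(reference, batch):
    return replayed[batch]


flagged = flag_drift("first_half", ["second_half", "translated"], replay_divergence)
print(flagged)  # [(1, 0.962)]
```

With DataEval, a callable such as lambda a, b: divergence(a, b).divergence could be passed as divergence_fn, assuming the output object exposes the divergence field shown in the printed results. A suitable threshold depends on the data and should be chosen by comparing against divergences measured between known in-distribution splits.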