HP Divergence Estimation Tutorial#

Problem Statement#

When evaluating new test data or comparing two datasets, we often want a quantitative way to measure shifts in covariates. HP (Henze-Penrose) divergence is a nonparametric divergence metric that measures how distinguishable two datasets are. A divergence of 0 means that the two datasets are approximately identically distributed. A divergence of 1 means the two datasets are completely separable.
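To make the two endpoints concrete, here is a minimal, dependency-free sketch of one common way to estimate HP divergence: pool both datasets, find each point's nearest neighbor, and count how often the neighbor comes from the other dataset. The function name `hp_divergence_nn`, the brute-force search, and the `sample` helper are illustrative assumptions, not DataEval's implementation; the classical estimator counts cross-dataset edges in a minimum spanning tree instead. The scaling formula is assumed here because it reproduces the `divergence` and `errors` values printed later in this tutorial.

```python
import math
import random


def hp_divergence_nn(a, b):
    """Sketch: estimate HP divergence from nearest-neighbor label mismatches."""
    n, m = len(a), len(b)
    points = list(a) + list(b)
    labels = [0] * n + [1] * m  # 0 = dataset a, 1 = dataset b

    errors = 0
    for i, p in enumerate(points):
        # Brute-force search for the closest *other* point in the pooled data.
        nearest = min(
            (j for j in range(n + m) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        # An "error" is a point whose nearest neighbor is from the other dataset.
        if labels[nearest] != labels[i]:
            errors += 1

    # Assumed scaling: many mismatches -> 0, no mismatches -> 1.
    divergence = max(0.0, 1.0 - (n + m) / (2.0 * n * m) * errors)
    return divergence, errors


def sample(center, count, dim=3):
    """Hypothetical helper: Gaussian points around `center`."""
    return [[random.gauss(center, 1.0) for _ in range(dim)] for _ in range(count)]


random.seed(0)
same_a, same_b = sample(0.0, 200), sample(0.0, 200)
far_b = sample(50.0, 200)

print(hp_divergence_nn(same_a, same_b))  # same distribution: divergence near 0
print(hp_divergence_nn(same_a, far_b))   # fully separable: (1.0, 0)
```

When the two sets are drawn from the same distribution, roughly half of all nearest neighbors belong to the other set, which drives the estimate toward 0. When the sets are far apart, no neighbor crosses over and the estimate is exactly 1.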

When to use#

Use the divergence function when you would like to know how far two datasets have diverged from one another, for example to measure operational drift.
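As a rough illustration of the drift use case, the sketch below compares a reference set against a series of incoming batches and flags those whose divergence exceeds a threshold. Everything here is hypothetical: the helper `flag_drifted_batches`, the threshold of 0.5, and `toy_divergence`, which is only a stand-in so the example runs without any dependencies and is not HP divergence. In practice you would pass a function that returns the HP divergence as a float.

```python
def flag_drifted_batches(reference, batches, divergence_fn, threshold=0.5):
    """Return indices of batches whose divergence from `reference` exceeds `threshold`."""
    return [
        i
        for i, batch in enumerate(batches)
        if divergence_fn(reference, batch) > threshold
    ]


def toy_divergence(a, b):
    """Stand-in metric (NOT HP divergence): absolute difference of means, capped at 1."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return min(1.0, abs(mean_a - mean_b))


reference = [0.0, 0.1, -0.1, 0.05]
batches = [
    [0.0, 0.05, -0.05],  # looks like the reference data
    [2.0, 2.1, 1.9],     # clearly shifted
]

print(flag_drifted_batches(reference, batches, toy_divergence))  # [1]
```

The threshold is a judgment call that depends on how much shift your application can tolerate; 0.5 is used here only for demonstration.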

What you will need#

A set of image embeddings for each dataset (usually obtained with an AutoEncoder). This tutorial uses flattened MNIST images in place of embeddings.

Setting up#

Let’s import the libraries needed to set up a minimal working example.

from dataeval.metrics.estimators import divergence
from dataeval.utils.torch.datasets import MNIST

Loading in data#

Let’s start by loading the MNIST dataset with DataEval’s MNIST dataset utility, then we will examine it.

# Load the MNIST training set and keep the first 4000 samples, flattened to 1D
train_ds = MNIST(root="./data/", train=True, download=True, size=4000, flatten=True)

# Split out the images and labels
images, labels = train_ds.data, train_ds.targets

Files already downloaded and verified

print("Number of samples: ", len(images))
print("Image shape:", images[0].shape)

Number of samples:  4000
Image shape: (784,)

Calculate initial divergence#

Let’s calculate the divergence between the first 2000 images and the second 2000 images from this sample.

data_a = images[0:2000]
data_b = images[2000:]

div = divergence(data_a, data_b)
print(div)

DivergenceOutput: {'divergence': 0.0, 'errors': np.int64(2026)}

We estimate that the divergence between these (identically distributed) image sets is at or close to 0.

Loading in corrupted data#

Now let’s load a corrupted MNIST dataset in which the images have been translated.

corruption = MNIST(root="./data", train=True, download=False, size=2000, flatten=True, corruption="translate")
corrupted_images = corruption.data

Files already downloaded and verified

print("Number of corrupted samples: ", len(corrupted_images))
print("Corrupted image shape:", corrupted_images[0].shape)

Number of corrupted samples:  2000
Corrupted image shape: (784,)

Calculate corrupted divergence#

Now let’s calculate the divergence between this corrupted dataset and the original images.

div = divergence(data_a, corrupted_images)
print(div)

DivergenceOutput: {'divergence': np.float64(0.97), 'errors': np.int64(60)}

With a divergence of 0.97, close to the maximum of 1, we conclude that the translated MNIST images are distributed very differently from the original images.
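The `errors` field in the two outputs can be related to the reported divergence with a little arithmetic. The sketch below assumes the estimate is computed as `max(0, 1 - (N + M) / (2 * N * M) * errors)`, where `N` and `M` are the two dataset sizes. This formula is an assumption, adopted because it reproduces both printed results; the helper name `divergence_from_errors` is hypothetical.

```python
def divergence_from_errors(errors, n, m):
    """Assumed mapping from an error count to a divergence in [0, 1]."""
    return max(0.0, 1.0 - (n + m) / (2.0 * n * m) * errors)


# First experiment: 2000 original vs. 2000 original images, errors = 2026.
# 1 - 2026 / 2000 is slightly negative, so the result is clipped to 0.
print(divergence_from_errors(2026, 2000, 2000))  # 0.0

# Second experiment: 2000 original vs. 2000 translated images, errors = 60.
print(round(divergence_from_errors(60, 2000, 2000), 2))  # 0.97
```

Read this way, a high error count means samples from the two datasets are thoroughly intermixed, while a low count means they rarely sit next to each other.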