How to measure train and test dataset divergence

Problem Statement

When evaluating new test data, or comparing two datasets, we often want a quantitative way to measure shifts in covariates. HP (Henze-Penrose) divergence is a nonparametric divergence metric that measures the distance between the distributions of two datasets. A divergence of 0 means that the two datasets are approximately identically distributed. A divergence of 1 means the two datasets are completely separable.
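To make the 0-to-1 scale concrete, here is a minimal pure-Python sketch of one common way to approximate this kind of divergence with a first-nearest-neighbor rule: pool both datasets, find each point's nearest neighbor, and count how often that neighbor belongs to the other dataset. This is an illustration under assumptions, not DataEval's implementation. The function name hp_divergence_fnn, the toy data, and the exact scaling of the error count are all assumptions made for this example.

```python
import math


def hp_divergence_fnn(a, b):
    """Sketch of a nearest-neighbor divergence estimate (hypothetical helper).

    a and b are sequences of equal-length numeric vectors.
    Returns (divergence, errors), where errors counts the points whose
    nearest neighbor in the pooled data comes from the other dataset.
    """
    # Pool both datasets and remember which one each point came from.
    pooled = [(x, 0) for x in a] + [(x, 1) for x in b]

    errors = 0
    for i, (xi, label_i) in enumerate(pooled):
        best_j, best_d = -1, math.inf
        for j, (xj, _) in enumerate(pooled):
            if i == j:
                continue  # a point is not its own neighbor
            d = math.dist(xi, xj)
            if d < best_d:
                best_j, best_d = j, d
        if pooled[best_j][1] != label_i:
            errors += 1

    n, m = len(a), len(b)
    # Assumed scaling: many cross-dataset neighbors push the value toward 0.
    divergence = max(0.0, 1.0 - ((n + m) / (2.0 * n * m)) * errors)
    return divergence, errors


# Two well-separated clusters: no point has a neighbor in the other set.
separated_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
separated_b = [(100.0, 100.0), (100.0, 101.0), (101.0, 100.0)]
print(hp_divergence_fnn(separated_a, separated_b))  # (1.0, 0)

# Fully interleaved points: every nearest neighbor is in the other set.
mixed_a = [(0.0,), (2.0,), (4.0,)]
mixed_b = [(1.0,), (3.0,), (5.0,)]
print(hp_divergence_fnn(mixed_a, mixed_b))  # (0.0, 6)
```

The brute-force neighbor search is quadratic in the number of points, which is fine for a toy example but not for thousands of embeddings.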

When to use

The divergence function should be used when you would like to know how far two datasets have diverged from one another, for example, when you would like to measure operational drift.

What you will need

A Python environment with the following packages installed:
- dataeval or dataeval[all]
A set of image embeddings for each dataset (usually obtained with an AutoEncoder)

Setting up

Let’s import the libraries needed to set up a minimal working example.

from maite_datasets.image_classification import MNIST

from dataeval.data import Embeddings
from dataeval.metrics.estimators import divergence

Loading in data

Load the MNIST data and create the training dataset. For the purposes of this example, we will use only the first 4000 samples of the training data.

# Load the MNIST training dataset
train_ds = MNIST(root="./data/", image_set="train", download=True)

# Extract the first 4000 embeddings
embeddings = Embeddings(train_ds, batch_size=400)[:4000]

print("Number of samples: ", len(embeddings))
print("Image shape:", embeddings[0].shape)

Number of samples:  4000
Image shape: torch.Size([784])

Calculate initial divergence

Let’s calculate the divergence between the first 2000 images and the second 2000 images from this sample.

data_a = embeddings[:2000]
data_b = embeddings[2000:]

div = divergence(data_a, data_b)
print(div)

{'divergence': np.float64(0.1855), 'errors': np.int64(1629)}

We estimate that the divergence between these (identically distributed) image sets is low, about 0.19, toward the 0 end of the scale.
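The two fields in the output appear to be tied together by simple arithmetic. Assuming the estimate has the form max(0, 1 - errors * (N + M) / (2 * N * M)), where N and M are the sizes of the two datasets, the reported error count reproduces the reported divergence. This form is an assumption about the estimator rather than something stated in this guide, and the helper name divergence_from_errors is hypothetical.

```python
def divergence_from_errors(errors: int, n: int, m: int) -> float:
    """Assumed form of the estimate: 1 - errors * (n + m) / (2 * n * m), floored at 0."""
    return max(0.0, 1.0 - errors * (n + m) / (2.0 * n * m))


# Values from the output above: 1629 errors, two sets of 2000 samples each.
estimate = divergence_from_errors(errors=1629, n=2000, m=2000)
print(round(estimate, 4))  # 0.1855
```

Read this way, a large error count means many samples sit closest to a sample from the other set, which is what pulls the divergence down.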

Loading in corrupted data

Now let’s load in a corrupted MNIST dataset.

corrupted_ds = MNIST(root="./data", image_set="train", corruption="translate", download=True)
corrupted_emb = Embeddings(corrupted_ds, batch_size=64)[:2000]

print("Number of corrupted samples: ", len(corrupted_emb))
print("Corrupted image shape:", corrupted_emb[0].shape)

Number of corrupted samples:  2000
Corrupted image shape: torch.Size([784])

Calculate corrupted divergence

Now let’s calculate the divergence between this corrupted dataset and the original images.

div = divergence(data_a, corrupted_emb)
print(div)

{'divergence': np.float64(0.963), 'errors': np.int64(74)}

With a divergence of about 0.96, close to the maximum of 1, we conclude that the translated MNIST images are significantly different from the original images.
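One way to turn the two results into a drift check is to compare the new divergence against the same-distribution baseline measured earlier. The sketch below is a minimal illustration: the helper name exceeds_baseline and the margin of 0.5 are hypothetical choices for this example, not DataEval settings, and a real tolerance would need to be tuned for the data at hand.

```python
def exceeds_baseline(baseline: float, candidate: float, margin: float = 0.5) -> bool:
    """Return True when candidate exceeds baseline by more than margin.

    margin is a hypothetical tolerance chosen only for illustration.
    """
    for value in (baseline, candidate):
        if not 0.0 <= value <= 1.0:
            raise ValueError("divergence values are expected to lie in [0, 1]")
    return (candidate - baseline) > margin


baseline_div = 0.1855   # first 2000 vs second 2000 clean images
translated_div = 0.963  # clean images vs translated images

print(exceeds_baseline(baseline_div, translated_div))  # True
```

Comparing against a measured baseline, rather than against 0, accounts for the fact that two samples from the same distribution did not produce a divergence of exactly 0 here.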