HP Divergence Estimation Tutorial#

Problem Statement#

When evaluating new test data or comparing two datasets, we often want a quantitative way to measure shifts in covariates. HP (Henze-Penrose) divergence is a nonparametric divergence metric that measures the distance between the distributions of two datasets on a scale from 0 to 1. A divergence of 0 means that the two datasets are approximately identically distributed. A divergence of 1 means the two datasets are completely separable.
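To make the metric concrete, here is a minimal sketch of one common nonparametric estimator of HP divergence. It builds a minimum spanning tree (MST) over the pooled samples and counts the edges that join points from different datasets: many such edges mean the datasets are well mixed (divergence near 0), few mean they are separable (divergence near 1). This is a pure-Python illustration on toy data, not the library's implementation. The helper names `mst_cross_edges` and `hp_divergence_mst` are hypothetical, and the estimator the library actually uses may differ.

```python
import math


def mst_cross_edges(points, labels):
    """Count MST edges that join points with different labels (Prim's algorithm)."""
    n = len(points)
    in_tree = [False] * n
    best_dist = [math.inf] * n   # cheapest known edge from the tree to each point
    best_parent = [-1] * n       # tree endpoint of that cheapest edge
    best_dist[0] = 0.0
    cross = 0
    for _ in range(n):
        # Add the point outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best_dist[i])
        in_tree[u] = True
        if best_parent[u] >= 0 and labels[u] != labels[best_parent[u]]:
            cross += 1
        # Relax the edges from the newly added point to the remaining points.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_parent[v] = u
    return cross


def hp_divergence_mst(a, b):
    """MST-based estimate: 1 - cross_edges * (n + m) / (2 * n * m), clipped at 0."""
    n, m = len(a), len(b)
    cross = mst_cross_edges(list(a) + list(b), [0] * n + [1] * m)
    return max(0.0, 1.0 - cross * (n + m) / (2.0 * n * m))


# Two far-apart clusters: only one MST edge bridges the datasets.
separated_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
separated_b = [(100.0, 0.0), (101.0, 0.0), (102.0, 0.0), (103.0, 0.0)]
print(hp_divergence_mst(separated_a, separated_b))  # 0.75

# Interleaved points: every MST edge bridges the datasets.
mixed_a = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (6.0, 0.0)]
mixed_b = [(1.0, 0.0), (3.0, 0.0), (5.0, 0.0), (7.0, 0.0)]
print(hp_divergence_mst(mixed_a, mixed_b))  # 0.0
```

With only four points per set, the separated case tops out at 0.75 rather than 1, because one bridging edge is always required to connect the tree. The estimate approaches 1 as the sample sizes grow.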

When to use#

Use the divergence function when you want to know how far two datasets have diverged from one another, for example to measure operational drift.

What you will need#

  1. A set of image embeddings for each dataset (usually obtained with an AutoEncoder)

  2. A python environment with the following packages installed:

    • dataeval

    • tensorflow-datasets

    • pytest

Setting up#

Let’s import the libraries needed to set up a minimal working example.

import numpy as np
import tensorflow_datasets as tfds

from dataeval.metrics.estimators import divergence

Loading in data#

Let’s start by loading the MNIST dataset from TensorFlow Datasets and examining it.

# Load in the mnist dataset from tensorflow datasets
images, ds_info = tfds.load(
    "mnist",
    split="train[:4000]",
    with_info=True,
)
tfds.visualization.show_examples(images, ds_info)
images = images.shuffle(images.cardinality())
images = np.array([i["image"] for i in images])
print("Number of samples: ", len(images))
print("Image shape:", images[0].shape)
Number of samples:  4000
Image shape: (28, 28, 1)

Calculate initial divergence#

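Let’s calculate the divergence between the first 2000 images and the remaining 2000 images from this sample. Each image is flattened into a vector of raw pixel values, which stands in for an embedding here.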

data_a = images[0:2000].reshape((2000, -1))
data_b = images[2000:].reshape((2000, -1))
div = divergence(data_a, data_b)
print(div)
DivergenceOutput(divergence=0.0, errors=2028.0)

We estimate that the divergence between these (identically distributed) image sets is at or close to 0.

Loading in corrupted data#

Now let’s load a corrupted MNIST dataset, in this case the translated variant.

corrupted, ds_info = tfds.load(
    "mnist_corrupted/translate",
    split="train[:2000]",
    with_info=True,
)
tfds.visualization.show_examples(corrupted, ds_info)
corrupted = corrupted.shuffle(corrupted.cardinality())
corrupted = np.array([i["image"] for i in corrupted])
print("Number of corrupted samples: ", len(corrupted))
print("Corrupted image shape:", corrupted[0].shape)
Number of corrupted samples:  2000
Corrupted image shape: (28, 28, 1)

Calculate corrupted divergence#

Now let’s calculate the divergence between this corrupted dataset and the original images.

data_corrupted = corrupted.reshape((2000, -1))
div = divergence(data_a, data_corrupted)
print(div)
DivergenceOutput(divergence=0.969, errors=62.0)

With a divergence of 0.969, close to the maximum of 1, we conclude that the translated MNIST images are almost completely separable from the original images.
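The two printed results also hint at how the `errors` field relates to the divergence value. The sketch below assumes that `errors` counts pooled-sample neighbor links that cross between the two datasets, and that the divergence is computed as 1 - errors * (n + m) / (2 * n * m), clipped at 0. This relation is inferred from the printed numbers, not taken from the library's documentation, and the helper name `divergence_from_errors` is hypothetical.

```python
def divergence_from_errors(errors: float, n: int, m: int) -> float:
    """Assumed relation between the errors count and the divergence estimate."""
    return max(0.0, 1.0 - errors * (n + m) / (2.0 * n * m))


# Original vs. original: errors=2028.0 with 2000 samples per set.
same = divergence_from_errors(2028.0, 2000, 2000)
print(same)  # 0.0 (the raw value, -0.014, is clipped to 0)

# Original vs. translated: errors=62.0 with 2000 samples per set.
shifted = divergence_from_errors(62.0, 2000, 2000)
print(round(shifted, 3))  # 0.969
```

Under this assumption, both printed outputs are reproduced: roughly 2000 crossing links is what well-mixed sets of this size would produce, while only 62 means the translated images rarely sit next to an original image.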