Class Parity Label Analysis Tutorial

Problem Statement

For machine learning tasks, a discrepancy in label frequencies between train and test datasets can result in poor model performance.

To help with this, DataEval has a tool that compares the label distributions of two datasets.

When to use

Use the label_parity function when you would like to test whether the label distribution of one dataset differs significantly from that of another.

What you will need

A labeled training image dataset
A labeled test image dataset whose label distribution you want to evaluate

Setting up

Let’s import the libraries needed to set up a minimal working example.

from dataeval.metrics.bias import label_parity
from dataeval.utils.dataset.datasets import MNIST

Load the data

We will use the MNIST dataset, loaded with the MNIST helper imported above from dataeval.utils.dataset.datasets, for this tutorial on class label statistics.

# Load a subset of 2000 training images and 500 test images
train_ds = MNIST("./data", train=True, download=True, size=2000)
test_ds = MNIST("./data", train=False, download=True, size=500)

# Extract the labels from each subset
train_labels = train_ds.targets
test_labels = test_ds.targets

Files already downloaded and verified

Files already downloaded and verified
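Before running a statistical test, it can help to look at the per-class label proportions directly. The following is a minimal sketch using only the Python standard library. The helper name label_proportions and the small stand-in label lists are hypothetical and for illustration only. In this tutorial you would pass train_labels and test_labels instead, assuming they can be iterated as integer class ids.

```python
from collections import Counter


def label_proportions(labels, num_classes):
    """Return the fraction of samples in each class, indexed by class id."""
    counts = Counter(int(label) for label in labels)
    total = len(labels)
    return [counts[c] / total for c in range(num_classes)]


# Hypothetical stand-in labels (not MNIST data)
example_train = [0, 0, 1, 1, 2, 2, 2, 2]
example_test = [0, 1, 1, 2]

train_props = label_proportions(example_train, num_classes=3)
test_props = label_proportions(example_test, num_classes=3)

# Print the two proportions side by side for each class
for c, (p_train, p_test) in enumerate(zip(train_props, test_props)):
    print(f"class {c}: train={p_train:.2f} test={p_test:.2f}")
```

Large gaps between the two columns for a class suggest a label imbalance worth testing formally.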

Evaluate label statistical independence

Now, let’s look at how to use DataEval’s label statistics analyzer. Call label_parity with the two sets of labels to be compared. It computes the chi-squared statistic for the hypothesis that test_ds has the same class distribution as train_ds, and it also returns the p-value of the test. In the call below, the number of unique classes (10 for MNIST) is not passed explicitly.

results = label_parity(train_labels, test_labels)
print(f"The chi-squared value for the two label distributions is {results.score}, with p-value {results.p_value}")

The chi-squared value for the two label distributions is 0.0, with p-value 1.0
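To build intuition for the statistic, here is a minimal sketch of a Pearson chi-squared comparison of label counts, written with the standard library only. It is not DataEval's implementation: the function name chi_squared_label_statistic, the rescaling of expected counts to the size of the observed dataset, and the toy label lists are all assumptions made for illustration. It shows that two perfectly class-balanced label sets give a statistic of 0.0, while a skewed set gives a larger value.

```python
from collections import Counter


def chi_squared_label_statistic(expected_labels, observed_labels, num_classes):
    """Pearson chi-squared statistic comparing observed label counts with
    expected counts rescaled to the size of the observed dataset."""
    expected_counts = Counter(expected_labels)
    observed_counts = Counter(observed_labels)
    scale = len(observed_labels) / len(expected_labels)

    statistic = 0.0
    for label in range(num_classes):
        expected = expected_counts[label] * scale
        observed = observed_counts[label]
        if expected == 0:
            # Skip classes absent from the reference set to avoid dividing by zero
            continue
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Toy reference labels: 200 samples per class (hypothetical, not MNIST data)
train_like = [c for c in range(10) for _ in range(200)]

# Balanced toy test labels: 50 samples per class
balanced_test = [c for c in range(10) for _ in range(50)]
balanced_stat = chi_squared_label_statistic(train_like, balanced_test, num_classes=10)

# Skewed toy test labels: class 0 over-represented, all others under-represented
skewed_test = [0] * 95 + [c for c in range(1, 10) for _ in range(45)]
skewed_stat = chi_squared_label_statistic(train_like, skewed_test, num_classes=10)

print(f"balanced: {balanced_stat}, skewed: {skewed_stat}")
```

A p-value is then obtained by comparing the statistic against a chi-squared distribution, commonly with one fewer degree of freedom than the number of classes. A large p-value, such as the 1.0 reported above, means the test found no evidence that the two label distributions differ.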