Class Parity Label Analysis Tutorial
Problem Statement
For machine learning tasks, a discrepancy in label frequencies between train and test datasets can result in poor model performance.
To help with this, DataEval has a tool that compares the label distributions of two datasets.
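To make the problem concrete, here is a minimal sketch of a label frequency discrepancy. It uses only the Python standard library, the toy labels are made up for illustration, and the helper name label_frequencies is hypothetical rather than part of DataEval.

```python
from collections import Counter


def label_frequencies(labels):
    """Return the relative frequency of each class label, keyed by label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}


# Hypothetical toy labels (not taken from MNIST)
toy_train = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
toy_test = [0, 1, 2, 2, 2]

toy_train_freq = label_frequencies(toy_train)
toy_test_freq = label_frequencies(toy_test)

# Print the per-class frequencies side by side
for label in sorted(set(toy_train_freq) | set(toy_test_freq)):
    train_share = toy_train_freq.get(label, 0.0)
    test_share = toy_test_freq.get(label, 0.0)
    print(f"class {label}: train={train_share:.2f}, test={test_share:.2f}")
```

In this toy case, class 2 makes up 20% of the training labels but 60% of the test labels, which is the kind of mismatch a label parity check is meant to flag.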
When to use
Use DataEval's label_parity() function when you want to determine whether the label distributions of two datasets differ by a statistically significant amount.
What you will need
A Python environment with the following packages installed:
dataeval or dataeval[all]
A labeled training image dataset
A labeled test image dataset whose label distribution you want to evaluate
Setting up
Let's import the libraries needed to set up a minimal working example.
from dataeval.metrics.bias import label_parity
from dataeval.utils.dataset.datasets import MNIST
Load the data
We will use the MNIST dataset, loaded through DataEval's dataset utilities, for this tutorial on class label statistics.
# Load a subset of 2000 training images and 500 test images
train_ds = MNIST("./data", train=True, download=True, size=2000)
test_ds = MNIST("./data", train=False, download=True, size=500)
train_labels = train_ds.targets
test_labels = test_ds.targets
Files already downloaded and verified
Files already downloaded and verified
Evaluate label statistical independence
Now, let’s look at how to use DataEval’s label statistics analyzer.
Using the label_parity() function, compute the chi-squared statistic for the hypothesis that test_ds has the same class distribution as train_ds by passing in the labels of the two datasets to be compared. The function also returns the p-value of the test.
results = label_parity(train_labels, test_labels)
print(f"The chi-squared value for the two label distributions is {results.score}, with p-value {results.p_value}")
The chi-squared value for the two label distributions is 0.0, with p-value 1.0
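For intuition about the score, the sketch below computes a textbook Pearson chi-squared statistic from two label lists using only the standard library. It is an illustration under stated assumptions, not DataEval's implementation: the function name chi_squared_statistic and the toy labels are hypothetical, and label_parity() may handle details such as rare or missing classes differently.

```python
from collections import Counter


def chi_squared_statistic(expected_labels, observed_labels):
    """Pearson chi-squared statistic for observed label counts against the
    counts expected from the reference set's class proportions.

    Assumes every class in observed_labels also appears in expected_labels.
    """
    expected_counts = Counter(expected_labels)
    observed_counts = Counter(observed_labels)

    unseen = set(observed_counts) - set(expected_counts)
    if unseen:
        raise ValueError(f"classes missing from the reference labels: {sorted(unseen)}")

    n_expected = len(expected_labels)
    n_observed = len(observed_labels)

    statistic = 0.0
    for label, count in expected_counts.items():
        # Rescale the reference count to the size of the observed set
        expected = count * n_observed / n_expected
        observed = observed_counts.get(label, 0)
        statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical toy labels: a balanced reference set with four classes
reference = [0, 1, 2, 3] * 50

# Same class proportions as the reference set
matching = [0, 1, 2, 3] * 10
matching_score = chi_squared_statistic(reference, matching)

# Class 0 is heavily over-represented
skewed = [0] * 25 + [1] * 5 + [2] * 5 + [3] * 5
skewed_score = chi_squared_statistic(reference, skewed)

print(matching_score, skewed_score)
```

When the observed proportions match the reference proportions exactly, the statistic is 0.0, which corresponds to a p-value of 1.0 under a chi-squared distribution; this is the same pair of values reported in the output above. Larger statistics yield smaller p-values, and a p-value below a chosen significance level (0.05 is a common convention) suggests the two label distributions differ.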