How to measure label independence

Problem Statement

For machine learning tasks, a discrepancy in label frequencies between train and test datasets can result in poor model performance.

To help with this, DataEval has a tool that compares the label distributions of two datasets.

When to use

DataEval provides a label_parity() function to use when you would like to determine if two datasets have statistically independent labels.

What you will need

  1. A Python environment with the following packages installed:

    • dataeval

  2. A labeled training image dataset

  3. A labeled test image dataset whose label distribution you want to evaluate

Setting up

Let’s import the libraries needed to set up a minimal working example.

from maite_datasets.image_classification import MNIST

from dataeval.data import Metadata
from dataeval.metrics.bias import label_parity

Load the data

While you can use your own dataset, this example uses the MNIST dataset throughout. As the import above shows, it comes from the maite_datasets package.

train_ds = MNIST("./data", image_set="train", download=True)
test_ds = MNIST("./data", image_set="test", download=True)

train_md = Metadata(train_ds)
test_md = Metadata(test_ds)

# Get the labels from the collated dataset targets
train_labels = train_md.class_labels
test_labels = test_md.class_labels
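
Before running a formal test, it can help to look at the per-class proportions side by side. The snippet below is a minimal sketch using only the Python standard library. The class_proportions helper and the toy label lists are hypothetical stand-ins, not part of DataEval. In practice you would pass train_labels and test_labels, converted to plain lists if they are arrays.

```python
from collections import Counter


def class_proportions(labels):
    """Return {class: fraction of samples} for a sequence of integer labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: count / total for cls, count in sorted(counts.items())}


# Toy stand-ins for train_labels and test_labels (hypothetical values)
toy_train = [0, 0, 1, 1, 2, 2, 2, 2]
toy_test = [0, 1, 2, 2]

train_props = class_proportions(toy_train)
test_props = class_proportions(toy_test)

# Print the share of each class in both sets and the difference between them
for cls in train_props:
    test_share = test_props.get(cls, 0.0)
    gap = test_share - train_props[cls]
    print(f"class {cls}: train={train_props[cls]:.2f} test={test_share:.2f} gap={gap:+.2f}")
```

Large gaps for individual classes are a hint that the two sets were sampled differently. The chi-squared test in the next section summarizes those gaps in a single statistic.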

Evaluate label statistical independence

Now, let’s look at how to use DataEval’s label statistics analyzer. Using the label_parity() function, compute the chi-squared statistic for the hypothesis that test_ds has the same class distribution as train_ds by passing in the labels of the two datasets. The function also returns the p-value of the test.

results = label_parity(train_labels, test_labels)
print(f"The chi-squared value for the two label distributions is {results.score}, with p-value {results.p_value}")
The chi-squared value for the two label distributions is 3.4272710666174526, with p-value 0.9449259505581811
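
To make the score easier to interpret, here is a minimal sketch of a Pearson chi-squared statistic for label counts, written in plain Python. It is an illustration of the general technique under stated assumptions, not DataEval's implementation: the chi_squared_statistic helper and the toy label lists are hypothetical, and the handling of edge cases (such as classes absent from the reference set) is an assumption of this sketch.

```python
from collections import Counter


def chi_squared_statistic(expected_labels, observed_labels):
    """Pearson chi-squared statistic of observed label counts against the
    counts expected from the reference (expected) label proportions."""
    expected_counts = Counter(expected_labels)
    observed_counts = Counter(observed_labels)

    # Assumption of this sketch: every observed class must exist in the reference set
    unseen = set(observed_counts) - set(expected_counts)
    if unseen:
        raise ValueError(f"classes missing from the reference labels: {sorted(unseen)}")

    n_expected = len(expected_labels)
    n_observed = len(observed_labels)

    statistic = 0.0
    for cls, ref_count in expected_counts.items():
        # Scale the reference proportion to the size of the observed set
        expected = ref_count / n_expected * n_observed
        statistic += (observed_counts[cls] - expected) ** 2 / expected
    return statistic


# Toy labels (hypothetical): a balanced reference set and two candidate test sets
reference = [0] * 50 + [1] * 50
matching = [0] * 10 + [1] * 10
skewed = [0] * 18 + [1] * 2

stat_matching = chi_squared_statistic(reference, matching)
stat_skewed = chi_squared_statistic(reference, skewed)
print(stat_matching, stat_skewed)
```

A statistic near zero means the observed counts closely match what the reference proportions predict, and larger values mean a larger mismatch. The p-value is the probability of seeing a statistic at least this large if both sets shared the same label distribution. In the MNIST result above, the p-value of about 0.94 is far above a conventional threshold such as 0.05, so the test gives no evidence that the train and test label distributions differ.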