How to detect undersampled data subsets¶

Problem Statement¶

For most computer vision tasks like image classification and object detection, we often have a lot of images, but certain subsets of the images can be undersampled, such as label, style within a label, etc. A way to detect this regional sparsity is through coverage analysis.

To help with this, DataEval has introduced a Coverage class, that provides a user with example images which have few similar instances within the provided dataset.

When to use¶

The Coverage class should be used when you have lots of images, but only a small fraction from certain regimes/labels.

What you will need¶

Image classification dataset.
Autoencoder trained on image classification dataset for dimension reduction (e.g. through the AETrainer class).
A Python environment with the following packages installed:
- dataeval or dataeval[all]
- tabulate

Setting up¶

Let’s import the required libraries needed to set up a minimal working example

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
from sklearn.manifold import TSNE

from dataeval.metrics.bias import coverage
from dataeval.utils.data import collate
from dataeval.utils.data.datasets import MNIST

Load the data¶

Load the MNIST data and create the training dataset. For the purposes of this example, we will use subsets of the training (2000) data.

# Set seeds
torch.manual_seed(14)

# MNIST with mean 0 unit variance
train_ds = MNIST(
    root="./data",
    train=True,
    size=2000,
    unit_interval=True,
    dtype=np.float32,
    channels="channels_first",
    normalize=(0.1307, 0.3081),
)

Determining if data needs to be downloaded
Loaded data successfully
Running data preprocessing steps

In this tutorial, we will use an autoencoder to reduce the dimension of the MNIST images.

# Define model architecture
class Autoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            # 28 x 28
            nn.Conv2d(1, 4, kernel_size=5),
            # 4 x 24 x 24
            nn.ReLU(True),
            nn.Conv2d(4, 8, kernel_size=5),
            nn.ReLU(True),
            # 8 x 20 x 20 = 3200
            nn.Flatten(),
            nn.Linear(3200, 10),
            # 10
            nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(
            # 10
            nn.Linear(10, 400),
            # 400
            nn.ReLU(True),
            nn.Linear(400, 4000),
            # 4000
            nn.ReLU(True),
            nn.Unflatten(1, (10, 20, 20)),
            # 10 x 20 x 20
            nn.ConvTranspose2d(10, 10, kernel_size=5),
            # 24 x 24
            nn.ConvTranspose2d(10, 1, kernel_size=5),
            # 28 x 28
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

    def encode(self, x):
        x = self.encoder(x)
        return x

For computational reasons, we will simply load the trained autoencoder. See the how-to How to create image embeddings with an autoencoder for more information on how to train an autoencoder.

# The trained autoencoder was trained for 1000 epochs
sd = torch.load("models/ae", weights_only=True)
model = Autoencoder()
model.load_state_dict(sd)

<All keys matched successfully>

# Split out the images and labels for each set
embeddings, targets, _ = collate(train_ds, model=model)

To visualize the encodings, we will use TSNE on them to view separation.

# Visualize 10d as 2d with TSNE
tsne = TSNE(n_components=2)
red_dim = tsne.fit_transform(embeddings.cpu().numpy())

# Plot results with color being label
fig, ax = plt.subplots()
scatter = ax.scatter(
    x=red_dim[:, 0],
    y=red_dim[:, 1],
    c=targets.labels,
    label=targets.labels,
)
ax.legend(*scatter.legend_elements(), loc="upper right", ncols=2)
plt.show()

../_images/6716e34aeabcab56755b8f1e84346c7dbe3a415938ad026f19318382488d1751.png

Some good separation, but you can see a few images in the “gaps”. This could be an artifact of dimension reduction, or suggest that we have poor coverage for some covariates.

# Use data adaptive cutoff
cvrg = coverage(embeddings, radius_type="adaptive")

# Plot the least covered 0.5%
f, axs = plt.subplots(4, 5, figsize=(5, 5))
axs = axs.flatten()
for count, i in enumerate(axs):
    idx = cvrg.uncovered_indices[count]
    i.imshow(np.squeeze(train_ds[idx][0]), cmap="gray")
    i.set_axis_off()
    i.title.set_text(int(targets.labels[idx]))

../_images/2349fc9b02720a565256375dc85a93b6f95a6021cae12578712b8ed8b044f4a0.png

The Coverage tool identified that in this set of 2000 images, there is potential under-coverage when it comes to wonky 2s and 7s. Other digits have some undercovered instances, but could be they are just outliers. More investigation into outlier status is needed, see How to identify outliers and/or anomalies in a dataset for more info.