Identify bias and correlations¶

This guide provides a beginner friendly introduction to dataset bias, including balance and diversity.

Estimated time to complete: 15 minutes

Relevant ML stages: Data Engineering

Relevant personas: Data Engineer, T&E Engineer

What you’ll do¶

Use DataEval to identify bias and correlations in the 2012 VOC dataset
Analyze the results using plots and tables

What you’ll learn¶

You will see how to identify bias and correlations present in a dataset.
You will understand the potential impact on your data and ways to mitigate them.

What you’ll need¶

Basic familiarity with Python
Basic understanding of your dataset structure, including but not limited to its metadata
An environment with DataEval installed

Introduction¶

Identifying any biases or correlations present in a dataset is essential to accurately interpreting your model’s performance and its ability to generalize to new data. A common cause of poor generalization is shortcut learning — where a model uses secondary or background information to make predictions — which is enabled or exacerbated by dataset sampling biases.

Bias and correlations¶

Understanding biases or correlations present in your dataset is a key component to creating meaningful data splits. Bias in data can lead to misleading conclusions and poor model performance on operational data. There are many different types of bias. A few of these biases occur during data collection, others occur during dataset development, others occur during model development, while others are a result of the user.

Not all forms of bias directly affect the dataset and in order to address the biases that do, you have to make a few assumptions:

All desired classes are present.
All available metadata is provided.
The metadata has been recorded correctly.

If any of the above assumptions are violated, then the analysis will not be accurate. When using your own data, you should verify the above assumptions.

This guide does not focus on eliminating all bias, rather it focuses on identifying the bias that can be found when developing a dataset.

DataEval metrics¶

DataEval has two dedicated classes for identifying and understanding the bias or correlations that may be present in a dataset: Balance and Diversity.

The Balance evaluator measures correlational relationships between metadata factors and classes by calculating the mutual information between the metadata factors and the labels.

The Diversity evaluator measures the evenness or uniformity of the sampling of metadata factors over a dataset using the inverse Simpson index or Shannon index.

These techniques help ensure that when you split the data for your projects, you minimize things like shortcut learning and leakage between training and testing sets.

Importing the necessary libraries¶

You’ll begin by importing the necessary libraries to walk through this guide.

import dataeval_plots as dep
import plotly.io as pio
from maite_datasets.object_detection import VOCDetection

# Load the functions from DataEval that are helpful for bias
# as well as the VOCDetection dataset for the tutorial
from dataeval import Metadata
from dataeval.bias import Balance, Diversity

# Use plotly to render plots
dep.set_default_backend("plotly")

# Use the notebook renderer so JS is embedded
pio.renderers.default = "notebook"

Step 1: Load the data¶

You are going to work with the PASCAL VOC 2012 dataset. This dataset is a small curated dataset that was used for a computer vision competition. The images were used for classification, object detection, and segmentation. This dataset was chosen because it has multiple classes and a variety of images and metadata.

If this data is already on your computer you can change the file location from "./data" to wherever the data is stored. Remember to also change the download value from True to False.

For the sake of ensuring that this tutorial runs quickly on most computers, you are going to analyze only the training dataset, which is a little under 6000 images.

# Download the 2012 train dataset and verify the size of the loaded dataset
ds = VOCDetection(root="./data", download=True, image_set="train", year="2012")
len(ds)

Before moving on, verify that the above code cell printed out 5717 for the size of the dataset.

This ensures that everything is working as needed for the tutorial.

Step 2: Structure the metadata¶

This guide focuses on evaluating labels and metadata of the dataset rather than the images themselves. As each dataset has its own image and metadata formats, you will need to understand how your particular metadata is structured.

Start by taking a look at the metadata structure of the VOC 2012 dataset by creating a Metadata class from the dataset.

# Extract the Metadata from the dataset
metadata = Metadata(ds)

The metadata in the dataset is provided as a dictionary entry for each datum, such that the aggregated data is a collection of N metadata dictionaries each with a nested list of M objects in the image. Start by inspecting the raw metadata of the first image.

metadata.raw[0]

{'id': 0,
 'year': '2008',
 'image_id': '000008',
 'mask_path': '/builds/jatic/aria/dataeval/docs/source/notebooks/data/vocdataset/VOCdevkit/VOC2012/SegmentationClass/2008_000008.jpg',
 'folder': 'VOC2012',
 'filename': '2008_000008.jpg',
 'database': 'The VOC2008 Database',
 'annotation_source': 'PASCAL VOC2008',
 'image_source': 'flickr',
 'image_width': 500,
 'image_height': 442,
 'image_depth': 3,
 'segmented': 0,
 'pose': ['Left', 'Unspecified'],
 'truncated': [0, 1],
 'difficult': [0, 0]}

Note

Metadata is unable to process nested lists. For this dataset, part is a factor that describes certain parts of a person object (such as head, foot and hand), each with separate bounding box coordinates. You will ignore this information for this example.

Note

The nested objects horse and person from the first metadata entry will be expanded to a complete metadata entry for each object.

Next you will want to select the factors to include for bias analysis as well as the continuous factor bins for any continuous data.

metadata.include = [
    "image_width",
    "image_height",
    "segmented",
    "pose",
    "truncated",
    "difficult",
]

metadata.continuous_factor_bins = {
    "image_width": 5,
    "image_height": 5,
}

Now that the Metadata is ready to go, you can begin analyzing the dataset for bias!

Step 3: Assess dataset balance¶

The Balance class measures correlational relationships between metadata factors and classes in a dataset. It analyzes the metadata factors against both the classes and other factors to identify relationships.

The results can be retrieved using the balance and factors attributes of the output.

bal = Balance().evaluate(metadata)

The information provided by Balance may be visually understood with a heat map.

dep.plot(bal)

The heatmap shows that the greatest correlations are in the bounding box locations (xmin with xmax and ymin with ymax) and the image dimensions (height and width).

Also the ymax of the bounding box location is correlated with the height of the image. It is not surprising that height and width have correlation since many of the images are similarly sized.

The correlations between xmin and xmax and between ymin and ymax suggests that there is repetition in bounding box width and height across the objects. However, the fact that pose has a value of 0.08 with class means that a few of the classes have specific poses across a fair percentage of the images for that class. An example of this would be most pottedplant images having the same pose value.

In addition to analyzing class and other factors, the balance function also analyzes metadata factors with individual classes to identify relationships between only one class and secondary factors.

You can visualize the classwise results for balance by setting the plot_classwise parameter to True.

dep.plot(bal, plot_classwise=True)

The classwise heatmap shows that factors other than class do not have any significant correlation with a specific class.

Classwise balance shows correlation of individual classes with all class labels, indicating relative class imbalance. In this case the person class is over-represented relative to most other classes.

This means that a model might learn a bias towards the person class label due to its frequency in the training set, which becomes a problem if the test/operational dataset doesn’t have the same imbalance.

Step 4: Assess dataset diversity¶

The Diversity evaluator measures the evenness or uniformity of the sampling of metadata factors over a dataset. Values near 1 indicate uniform sampling, while values near 0 indicate imbalanced sampling, e.g. all values taking a single value. For more information see the Diversity concept page.

The results can be retrieved using the diversity_index attribute of the output.

div = Diversity().evaluate(metadata)

Again, it’s often easiest to see the differences between the different factors when visualizing them using a bar chart to show the factor-class analysis.

dep.plot(div)

In the results above, the factors truncated and occluded have values near 1, meaning that there is relatively little or no bias in these factors.

The categories of most interest are those that are between 0.4 and 0.1 because this region represents skewed value distributions for the factor.

The following factors fall into this category:

class
width
height
segmented
difficult

These factors contain bias that should be addressed either by adding or removing data to even out the sampling. For instance, the class factor highlights that there is unevenness in the number of data points per class.

In addition to analyzing class, the diversity function also analyzes metadata factors with individual classes to assess uniformity of metadata factors within a class. You can visualize the classwise results by setting the plot_classwise parameter to True.

dep.plot(div, plot_classwise=True)

These results expand the above results on a classwise basis.

Things to look for here are large variances for a given factor across the different classes. For example, pose has values ranging from 0.01 to 0.84, which means that a few classes have almost uniform selection of the different pose values while other classes essentially only have one pose value. This makes sense as the bottle or pottedplant class does not have multiple pose directions, while the person class does.

What needs to be further investigated are things like whether the sofa class should have a pose direction, because a diversity value of 0.4 means that a few of the images do while others do not.

Also, the cat class has a low score signifying that most of the images fall into one or two categories rather than being spread even across the categories. This highlights an error in the data collection process — the value was not specified for most cat images and therefore defaulted to “Unspecified”.

An alternative error would be a dataset in which the cat images have most cats facing a specific direction, which would require additional data to overcome the bias, but that is not the case for this dataset. It has plenty of cats facing each direction, but only a few of them contain a pose value.

Conclusion¶

Having analyzed the dataset for bias with multiple metrics, the conclusion is that this dataset has bias. Training a model on this dataset has the potential to learn shortcuts and underperform on operational data if the biases are not representative of biases in the operational dataset.

The metadata categories identified by the Balance and Diversity evaluators contain issues such as imbalanced classes and imbalanced parameters per class. DataEval isn’t able to tell you exactly why they are imbalanced, but it highlights the categories that you need to check.

As you can see, the DataEval methods are here to help you gain a deep understanding of your dataset and all of its strengths and limitations. It is designed to help you create representative and reliable datasets.

Good luck with your data!

What’s next¶

In addition to identifying bias and correlations in a dataset, DataEval offers additional tutorials to help you learn about dataset analysis:

To clean a dataset use the Data Cleaning Guide.
To identify coverage gaps and outliers use the Assessing the Data Space Guide.
To monitor data for shifts during operation use the Data Monitoring Guide.

To learn more about the balance and diversity evaluators, see the Balance and Diversity concept pages.

On your own¶

Once you are familiar with DataEval and dataset analysis, you will want to run this analysis on your own dataset. When you do, make sure that you analyze all of your data and not just the training set.