How to add intrinsic factors to Metadata
Problem Statement
When performing analysis on datasets, metadata may sometimes be sparse or unavailable. In those cases it may be necessary to add metadata for analysis, either as calculated intrinsic values or as additional information that was not available in the source dataset.
This guide will show you how to add in the calculated statistics from DataEval’s
calculate() function to the metadata for bias analysis.
When to use
Add metadata factors when little or no metadata is available for the dataset, or when you need insight into a factor of interest that is not natively present in the dataset metadata.
What you will need
A dataset to analyze
A Python environment with the following packages installed:
dataeval
dataeval-plots[plotly]
maite-datasets
Getting Started
First, import the libraries needed to set up the example.
import dataeval_plots as dep
import plotly.io as pio
import polars as pl
from maite_datasets.image_classification import CIFAR10
from dataeval import Metadata
from dataeval.bias import Balance, Diversity, Parity
from dataeval.core import calculate
from dataeval.flags import ImageStats
from dataeval.selection import Limit, Select, Shuffle
_ = pl.Config.set_tbl_rows(-1)
# Use plotly to render plots
dep.set_default_backend("plotly")
# Use the notebook renderer so JS is embedded
pio.renderers.default = "notebook"
Load the dataset
Begin by loading in the CIFAR-10 dataset.
The CIFAR-10 dataset contains 60,000 images - 50,000 in the train set and 10,000 in the test set. We will use a shuffled sample of 20,000 images from both sets.
# Load in the CIFAR10 dataset and limit to 20,000 images with random shuffling
cifar10 = Select(CIFAR10("data", image_set="base", download=True), [Limit(20000), Shuffle(seed=0)])
print(cifar10)
Select Dataset
--------------
Selections: [Limit(size=20000), Shuffle(seed=0)]
Selected Size: 20000
CIFAR10 Dataset
---------------
Transforms: []
Image Set: base
Metadata: {'id': 'CIFAR10_base', 'index2label': {0: 'airplane', 1: 'automobile', 2: 'bird', 3: 'cat', 4: 'deer', 5: 'dog', 6: 'frog', 7: 'horse', 8: 'ship', 9: 'truck'}, 'split': 'base'}
Path: /builds/jatic/aria/dataeval/docs/source/notebooks/data/cifar10
Size: 60000
Inspect the metadata
You can begin by inspecting the available factor names in the dataset.
metadata = Metadata(cifar10)
print(f"Factor names: {metadata.factor_names}")
Factor names: ['batch_num', 'id']
A quick check with Balance shows almost no mutual information between the class
labels and the existing factors. Here, batch_num indicates the on-disk binary file
the image was extracted from.
# Row 0 of the balance table is always the class label; row 2 is shown here
Balance().evaluate(metadata).balance[2]
| factor_name | mi_value |
|---|---|
| cat | f64 |
| "id" | 0.009829 |
Add image statistics to the metadata
In order to perform additional bias analysis on the dataset when no meaningful metadata
are provided, you will augment the metadata with statistics of the images using the
calculate() function.
Begin by running calculate for the PIXEL and VISUAL image stats for the dataset
and adding the stats factors to the Metadata.
# Calculate pixel and visual statistics
calc_results = calculate(cifar10, stats=ImageStats.PIXEL | ImageStats.VISUAL)
# Append the factors to the metadata
metadata.add_factors(calc_results["stats"])
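To give a sense of what intrinsic per-image statistics look like, here is a minimal sketch that computes a few of them with the Python standard library only. It is not DataEval's implementation: the helper `image_stats`, the two tiny example images, and the definition of `zeros` as the fraction of zero-valued pixels are all assumptions made for illustration.

```python
import statistics


def image_stats(pixels):
    """Return simple intensity statistics for one image.

    `pixels` is a flat list of intensities in [0, 1] (hypothetical input format).
    """
    mean = statistics.fmean(pixels)
    var = statistics.pvariance(pixels, mu=mean)
    return {
        "mean": mean,
        "std": var ** 0.5,
        "var": var,
        # Assumption: "zeros" is taken here as the fraction of zero-valued pixels
        "zeros": sum(1 for p in pixels if p == 0) / len(pixels),
    }


# Two tiny hypothetical "images" given as flattened pixel intensities
dark_image = [0.0, 0.0, 0.1, 0.2]
bright_image = [0.7, 0.8, 0.9, 1.0]

# Collect one list of values per factor, one entry per image
factors = {name: [] for name in ("mean", "std", "var", "zeros")}
for image in (dark_image, bright_image):
    for name, value in image_stats(image).items():
        factors[name].append(value)

print(factors)
```

Each factor ends up as a list with one value per image, which is the general shape needed to treat a statistic as a per-sample metadata factor.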
Next, exclude factors that are uniform or without significance.
Additionally, specify a binning strategy for the statistical factors, which are continuous. For this example, bin every factor into 5 uniform-width bins.
# Exclude id and batch_num, as they are not relevant factors for bias analysis
metadata.exclude = ["id", "batch_num"]
# Provide binning for the continuous statistical factors using 5 uniform-width bins for each factor
keys = ("mean", "std", "var", "skew", "kurtosis", "entropy", "brightness", "darkness", "sharpness", "contrast", "zeros")
metadata.continuous_factor_bins = dict.fromkeys(keys, 5)
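To make the binning step concrete, the following sketch shows one way uniform-width binning can work, using only plain Python. The helper `uniform_bins` and the sample values are hypothetical, and DataEval may compute its bin edges differently.

```python
def uniform_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant factor falls entirely into a single bin
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # min() clamps the maximum value into the last bin instead of bin n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical brightness values for six images
brightness = [0.05, 0.20, 0.41, 0.59, 0.80, 1.00]
binned = uniform_bins(brightness, 5)
print(binned)  # [0, 0, 1, 2, 3, 4]
```

Uniform-width bins are simple to reason about, but a heavy-tailed factor can leave most samples in one or two bins, which matters when interpreting the diversity results.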
Perform bias analysis
Now you can run the bias analysis evaluators Balance, Diversity and
Parity on the dataset metadata augmented with intrinsic statistical factors.
balance_output = Balance().evaluate(metadata)
/builds/jatic/aria/dataeval/src/dataeval/core/_mutual_info.py:204: RuntimeWarning:
divide by zero encountered in scalar divide
/builds/jatic/aria/dataeval/src/dataeval/core/_mutual_info.py:204: RuntimeWarning:
invalid value encountered in scalar divide
dep.plot(balance_output)
Notice the very high mutual information between the variance and standard deviation of image intensities, which is expected. Mean image intensity correlates with brightness, darkness, and contrast. However, none of the intrinsic factors correlate strongly with class label.
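As a rough illustration of why two closely related factors show high mutual information while an unrelated class label shows none, here is a minimal sketch of a plug-in mutual information estimate for discrete variables. The function `mutual_information` and the eight-sample data are hypothetical, and this is not the estimator or normalization that Balance uses.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in mutual information (in nats) between two sequences of discrete labels."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs
    return sum(
        (c / n) * math.log((c * n) / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


# Hypothetical binned factors for eight images
std_bin = [0, 0, 1, 1, 2, 2, 3, 3]
var_bin = [0, 0, 1, 1, 2, 2, 3, 3]  # idealized case: var falls in the same bins as std
label = [0, 1, 0, 1, 0, 1, 0, 1]    # class label, spread evenly across every bin

mi_std_var = mutual_information(std_bin, var_bin)
mi_std_label = mutual_information(std_bin, label)
print(mi_std_var, mi_std_label)
```

In this toy data, knowing the `std` bin fully determines the `var` bin, so their mutual information equals the entropy of the bin distribution, while the label is independent of the bins and scores zero.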
dep.plot(balance_output, plot_classwise=True)
Classwise balance also indicates minimal correlation of image statistics and individual classes. Uniform mutual information between individual classes and all class labels indicates balanced class representation in the subsampled dataset.
diversity_output = Diversity().evaluate(metadata)
dep.plot(diversity_output)
The diversity index also indicates uniform sampling of classes within the dataset. The apparently low diversity of kurtosis across the dataset may indicate an inadequate binning strategy (for metric computation) given that the other statistical moments appear to be more evenly distributed. Further investigation and iteration could be done to assess sensitivity to binning strategy.
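The sensitivity to binning can be sketched with a normalized Shannon index computed over bin occupancy. This is an illustration under assumptions: `normalized_shannon` is a hypothetical helper, the bin assignments are made up, and the index DataEval reports may be defined differently.

```python
import math
from collections import Counter


def normalized_shannon(bins, n_bins):
    """Shannon entropy of bin occupancy divided by log(n_bins), its maximum value."""
    counts = Counter(bins)
    n = len(bins)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(n_bins)


# 20 hypothetical samples spread evenly over 5 bins
even = [0, 1, 2, 3, 4] * 4
# A heavy-tailed factor in uniform-width bins: most samples land in the first bin
skewed = [0] * 17 + [1, 2, 4]

even_score = normalized_shannon(even, 5)
skewed_score = normalized_shannon(skewed, 5)
print(even_score, skewed_score)
```

The evenly occupied factor scores close to 1, while the skewed one scores well below 0.5 even though the underlying values may vary smoothly. Trying a different bin count or binning scheme is one way to check whether low diversity reflects the data or the bins.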
parity_output = Parity().evaluate(metadata)
parity_output.factors
| factor_name | score | p_value | is_correlated | has_insufficient_data |
|---|---|---|---|---|
| cat | f64 | f64 | bool | bool |
| "brightness" | 0.161594 | 0.0 | false | true |
| "contrast" | 0.129615 | 1.2931e-266 | false | true |
| "darkness" | 0.164408 | 0.0 | false | true |
| "entropy" | 0.111562 | 3.8866e-193 | false | true |
| "kurtosis" | 0.034223 | 1.6581e-12 | false | true |
| "mean" | 0.157045 | 0.0 | false | false |
| "missing" | 0.0 | 1.0 | false | false |
| "sharpness" | 0.215762 | 0.0 | false | true |
| "skew" | 0.12426 | 1.3122e-243 | false | true |
| "std" | 0.203804 | 0.0 | false | true |
| "var" | 0.196433 | 0.0 | false | true |
| "zeros" | 0.012127 | 0.090827 | false | true |
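For intuition about a chi-square style association test between a binned factor and class labels, here is a minimal pure-Python sketch. It computes the Pearson chi-square statistic, Cramér's V as one common effect size, and the smallest expected cell count, since small expected counts (a common rule of thumb is below 5) make the test less reliable. The helper `chi_square` and the data are hypothetical, and this guide does not state that Parity computes its `score` or `has_insufficient_data` columns this way.

```python
import math
from collections import Counter


def chi_square(factor_bins, labels):
    """Return the Pearson chi-square statistic and the smallest expected cell count."""
    n = len(labels)
    row, col = Counter(factor_bins), Counter(labels)
    observed = Counter(zip(factor_bins, labels))
    stat, min_expected = 0.0, float("inf")
    for r in row:
        for c in col:
            # Expected count under independence of factor and label
            expected = row[r] * col[c] / n
            min_expected = min(min_expected, expected)
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat, min_expected


# Hypothetical data: 20 samples, a two-bin factor that tracks the label fairly closely
factor_bins = [0] * 10 + [1] * 10
labels = [0] * 8 + [1] * 2 + [0] * 2 + [1] * 8

stat, min_expected = chi_square(factor_bins, labels)
n_rows, n_cols = len(set(factor_bins)), len(set(labels))
cramers_v = math.sqrt(stat / (len(labels) * min(n_rows - 1, n_cols - 1)))
print(stat, min_expected, cramers_v)
```

With many bins and many classes, the contingency table has many cells, so sparse cells become more likely as the bin count grows.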
You can now augment your datasets with additional metadata, either from
additional sources or by using DataEval's statistical functions, to gain insight into your data.