Balance
What is it
Balance and classwise balance are metrics that measure correlational relationships between metadata factors and class labels. They can indicate opportunities for shortcut learning and disproportionate dataset sampling, with respect to individual classes or between metadata factors, either of which can lead to poor generalization or overestimation of model performance.
Balance metrics compute the mutual information between class labels and metadata factors. These factors may be intrinsic metadata, such as image statistics or bounding box statistics, or extrinsic metadata, such as environmental information, sensor information, and operational information. Mutual information measures the information gained about one variable, e.g. class label, by observing another variable, e.g. location. Balance metrics give a T&E engineer or model developer insight into dataset relationships that should be investigated or accounted for during model training or model evaluation.
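For reference, the standard textbook definition of mutual information for two discrete variables is sketched below. The symbols are introduced here for illustration only: X stands for one variable (e.g. class label), Y for the other (e.g. a metadata factor), p(x, y) for their joint distribution, and p(x), p(y) for the marginals.

```latex
I(X; Y) = \sum_{x} \sum_{y} p(x, y) \, \log \frac{p(x, y)}{p(x)\, p(y)}
```

The quantity is zero when X and Y are independent and grows as knowing one variable removes more uncertainty about the other.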
The balance metric returns not only the mutual information computed between
metadata factors and class labels but also the mutual information between all
pairs of metadata factors, to characterize inter-factor relationships.
The balance_classwise metric returns mutual information between individual
class labels and metadata factors to identify relationships between only one
class and secondary factors.
When to use it
For both model and dataset development, it is important to understand the correlational relationships that underlie the dataset. Often, opportunities for data collection are sparse, available only in non-operational locations and conditions, with limited target diversity, etc. A model trained on such realistic datasets could learn to use secondary information to perform the primary learning task, which reduces its ability to generalize to new domains and can cause it to behave unexpectedly when presented with new data. Balance metrics provide a method for identifying relationships between dataset factors and class labels a priori. A T&E engineer or model developer should then use that information to design tests for model generalization, or data augmentation, to mitigate the opportunity for shortcut learning or sampling imbalance.
To use balance, the user must supply their metadata in a DataEval-specific
format. To meet this requirement, DataEval provides a Metadata class that
takes in user metadata and converts it into DataEval's format. The balance
function takes a Metadata object as the input for its analysis.
Identifying opportunity for shortcut learning
The literature contains many examples of shortcut learning and adversarial perturbations that cause a model to fail to generalize. For instance, an image classifier may learn to detect cows in a grassy field but internally cue on the grassy field (background) rather than the properties of the cow. When presented with an image of a cow on a sandy beach, the model fails to identify the cow because it had previously been able to rely on secondary information about location or visual context to identify cows. Balance metrics are one way to identify the potential for learning such shortcuts.
Identifying disproportionate dataset sampling
In addition to identifying possible shortcuts, balance metrics
may identify issues where data are sampled disproportionately with respect to a
particular factor. For instance, in the example above where the model is
trained on images where cows nearly always appear in a grassy field, classwise
balance would show a strong relationship between the cow class label and
grassy field environment provided that background information is encoded in
the metadata. Given the apparent correlation between the cow class label and
grassy field background, a model developer or T&E engineer should first assess
whether the correlation is problematic and whether the dataset should be
resampled, further data collected in other environments, or augmentation
techniques used to mitigate the apparent bias.
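The cow example can be made concrete with a small sketch. The toy labels and backgrounds below are invented for illustration, and the helper names (`entropy`, `normalized_mi`) are hypothetical. Computing the classwise score by binarizing each class one-vs-rest is an assumption about how a per-class relationship might be measured, not a description of DataEval's internals. Only scikit-learn's `mutual_info_score` is used.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Hypothetical toy dataset: class label and background for 12 images.
labels = np.array(["cow"] * 6 + ["horse"] * 3 + ["sheep"] * 3)
background = np.array(
    ["grass"] * 6                  # every cow is photographed on grass
    + ["sand", "snow", "sand"]     # horses
    + ["snow", "sand", "snow"]     # sheep
)

# Encode strings as integer ids so the metric sees plain categorical codes.
class_names, label_ids = np.unique(labels, return_inverse=True)
_, background_ids = np.unique(background, return_inverse=True)


def entropy(values):
    """Shannon entropy (in nats) of a categorical array."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())


def normalized_mi(x, y):
    """Mutual information divided by the arithmetic mean of the entropies."""
    denom = 0.5 * (entropy(x) + entropy(y))
    return mutual_info_score(x, y) / denom if denom > 0 else 0.0


# Relationship between the full class label and the background factor.
overall = normalized_mi(label_ids, background_ids)

# Assumed one-vs-rest view: "is this image a cow?" versus background, etc.
classwise = {
    str(name): normalized_mi((labels == name).astype(int), background_ids)
    for name in class_names
}

for name, score in classwise.items():
    print(f"{name}: {score:.2f}")
```

In this toy data the cow class stands out because its indicator is fully determined by the background, while the horse and sheep classes show a weaker relationship. Whether such a relationship is a problem remains a judgment call for the developer.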
Not all dataset correlation and sampling biases are problematic, however. For instance, it may be expected that elevation of the sun correlates with day of the year, and we do not expect this relationship to bias our model performance. Or, consider a case where different sensors are available in different geographic regions, leading to a correlational relationship between location/region and sensor. A subject matter expert could determine, given properties of the data and sensors, whether this relationship is problematic and whether data need to be augmented for training and evaluation.
It is important to note that correlational relationships within a dataset measured by balance metrics only indicate an opportunity for shortcut learning; balance and other metrics within DataEval do not measure whether shortcut learning has occurred. It is therefore important to interrogate potential biases exhibited by the trained model and to assess the need for further data augmentation to mitigate or compensate for observed biases.
Theory behind it
Mutual information is a metric that is often used for measuring the quality of dataset clustering or for feature selection, and there are several formulations to measure relationships between two categorical variables, between categorical and continuous variables, and between two continuous variables. We consider class label a categorical variable, as there is typically no presumed ordering between classes; however, other metadata factors, such as latitude and longitude or time stamps, may take continuous (ordered) values.
The implementation of mutual information within DataEval draws on multiple
functions within the scikit-learn package, including
mutual_info_classif and mutual_info_regression.
For categorical or discrete target variables, mutual_info_classif computes
the mutual information with respect to both discrete/categorical and continuous
factors. DataEval attempts to infer whether a variable is continuous or
discrete from the fraction of unique values present, i.e. whether the data
may be binned uniquely with a relatively small number of bins.
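A heuristic of this kind might look like the sketch below. The function name `looks_continuous`, the `max_unique_fraction` parameter, and its default of 0.25 are all hypothetical choices made for illustration; the actual rule and threshold used by DataEval are not stated here and may differ.

```python
import numpy as np


def looks_continuous(values, max_unique_fraction=0.25):
    """Guess whether a factor is continuous from its fraction of unique values.

    Illustrative heuristic only: the threshold is an assumed value.
    """
    values = np.asarray(values)
    # Empty or non-numeric data (strings, booleans) is treated as categorical.
    if values.size == 0 or not np.issubdtype(values.dtype, np.number):
        return False
    unique_fraction = np.unique(values).size / values.size
    # Many distinct values relative to the sample count suggests continuous data.
    return bool(unique_fraction > max_unique_fraction)


rng = np.random.default_rng(0)
latitude = rng.normal(loc=40.0, scale=1.0, size=200)  # nearly all values unique
sensor_id = rng.integers(0, 4, size=200)              # only four distinct values

print(looks_continuous(latitude))   # many unique values
print(looks_continuous(sensor_id))  # few unique values
```

A rule like this can misclassify small datasets or coarsely quantized measurements, so inferred types are worth checking when a result looks surprising.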
Mutual information between categorical/discrete variables is computed from contingency tables, which count how often each pair of values of the two variables co-occurs, while mutual information involving continuous (ordered) data is computed using the k-nearest neighbor (KNN) graph as in Refs. [1] and [2].
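The two estimators can be seen side by side in scikit-learn's `mutual_info_classif`, which accepts a `discrete_features` mask. The sketch below uses synthetic data with invented factor names (`sensor`, `altitude`, `noise`). It illustrates the scikit-learn call only and does not reproduce how DataEval invokes it.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)
n = 500

labels = rng.integers(0, 3, size=n)                  # categorical target
sensor = (labels + rng.integers(0, 2, size=n)) % 3   # discrete factor tied to the label
altitude = labels + rng.normal(scale=0.3, size=n)    # continuous factor tied to the label
noise = rng.normal(size=n)                           # continuous factor, unrelated to the label

X = np.column_stack([sensor, altitude, noise])

# The mask marks which columns are discrete: those use the contingency-table
# estimate, while the remaining columns use the nearest-neighbor estimate.
mi = mutual_info_classif(
    X,
    labels,
    discrete_features=[True, False, False],
    n_neighbors=3,
    random_state=0,
)

# For the discrete column the result matches the contingency-table value.
mi_sensor_table = mutual_info_score(labels, sensor)

for name, value in zip(["sensor", "altitude", "noise"], mi):
    print(f"{name}: {value:.3f} nats")
```

The values are raw mutual information in nats, not yet normalized, which is why they are hard to compare across factors with different entropies.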
Normalization
Raw mutual information scores are difficult for a human to contextualize, so balance metrics normalize the mutual information by the arithmetic mean of the marginal entropies of the two variables. A variable whose values are all the same has a marginal entropy of zero; in that case the geometric mean of the entropies is zero, whereas the arithmetic mean stays nonzero as long as the other variable varies, which makes the arithmetic mean preferable.
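The difference between the two means can be checked by hand for a factor that never varies. The sketch below is a minimal illustration with made-up data and a hypothetical `entropy` helper; it is not the DataEval implementation.

```python
import numpy as np
from sklearn.metrics import mutual_info_score


def entropy(values):
    """Shannon entropy (in nats) of a categorical array."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())


labels = np.array([0, 0, 1, 1, 2, 2])
constant_factor = np.zeros(6, dtype=int)  # every sample has the same value

h_labels = entropy(labels)
h_factor = entropy(constant_factor)       # zero: the factor carries no information

arithmetic_mean = 0.5 * (h_labels + h_factor)
geometric_mean = np.sqrt(h_labels * h_factor)

mi = mutual_info_score(labels, constant_factor)

# Dividing by the arithmetic mean is well defined and gives a score of zero.
nmi = mi / arithmetic_mean

print(f"arithmetic mean of entropies: {arithmetic_mean:.3f}")
print(f"geometric mean of entropies:  {geometric_mean:.3f}")
print(f"normalized mutual information: {nmi:.3f}")
```

Dividing by the geometric mean here would mean dividing zero by zero, so a special case would be needed; the arithmetic mean avoids that unless both variables are constant.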
Currently, entropies are computed over unique values for categorical variables
and over binned values for continuous variables. Since the KNN representation
used to compute mutual information is not necessarily consistent with the
histogram representation used to compute marginal entropies, it is possible for
balance to return normalized mutual information greater than 1. However, most
values will lie in the interval [0, 1]. A value near or above 1 indicates a
high degree of correlation, and a value near zero indicates little measured
correlation.
Normalized mutual information is not adjusted for chance and may therefore be
larger than expected. In particular, the normalized mutual information
associated with random label assignments is not in general 0, so reported
values may overestimate the strength of a relationship [3]. Adjusted forms of mutual
information are implemented for the categorical-categorical case with
probabilities computed from a contingency table but, since we admit continuous
variables as well, normalized mutual information is the value reported by
balance and balance_classwise.
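The effect of chance can be illustrated for the categorical-categorical case with scikit-learn's `normalized_mutual_info_score` and `adjusted_mutual_info_score`. The data below are synthetic: the factor is drawn independently of the labels, so any measured relationship is due to finite sampling. This sketch shows the general behavior only and is not how balance computes its output.

```python
import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
n = 200

labels = rng.integers(0, 10, size=n)
factor = rng.integers(0, 10, size=n)  # drawn independently of the labels

# Normalized score: positive even though the variables are unrelated.
nmi = normalized_mutual_info_score(labels, factor)

# Chance-adjusted score: subtracts the value expected under random assignment.
ami = adjusted_mutual_info_score(labels, factor)

print(f"normalized mutual information: {nmi:.3f}")
print(f"adjusted mutual information:   {ami:.3f}")
```

The gap between the two scores widens as the number of categories grows relative to the number of samples, so small normalized values for high-cardinality factors deserve some caution.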