balance
Balance and classwise balance are metrics that measure the distributional correlation between metadata factors and the class label. They can indicate opportunities for shortcut learning, as well as disproportionate dataset sampling with respect to class labels or between metadata factors.
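To make the idea of correlation between a metadata factor and the class label concrete, here is a minimal sketch of a plug-in mutual information estimate for two discrete variables. It is an illustration only, not the library's implementation: the helper `discrete_mutual_information` and the factors `time_of_day` and `camera` are hypothetical, and the result is in nats and unnormalized, so the values should not be expected to match the output of `balance`.

```python
import math
from collections import Counter


def discrete_mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in nats) between two discrete sequences.

    Hypothetical helper for illustration; not part of dataeval.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (count / n) * math.log(count * n / (px[x] * py[y]))
    return mi


class_labels = [0, 0, 1, 1, 0, 0, 1, 1]

# Hypothetical metadata factors, one value per image.
time_of_day = ["day", "day", "night", "night", "day", "day", "night", "night"]
camera = ["a", "b", "a", "b", "a", "b", "a", "b"]

# time_of_day determines the label exactly, so MI equals the label entropy, log(2).
mi_leaky = discrete_mutual_information(class_labels, time_of_day)

# camera is evenly spread across both labels, so MI is zero.
mi_independent = discrete_mutual_information(class_labels, camera)
```

A factor like `time_of_day` in this toy data is the kind of signal a model could exploit as a shortcut, whereas `camera` carries no information about the label.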
- dataeval.metrics.bias.balance(class_labels: Sequence[int], metadata: list[dict], num_neighbors: int = 5) → BalanceOutput
Computes mutual information (MI) between factors (class label, metadata, label/image properties).
- Parameters:
class_labels (Sequence[int]) – Class label for each image
metadata (list[dict]) – List of metadata factors for each image
num_neighbors (int, default 5) – Number of nearest neighbors to use for computing MI between discrete and continuous variables.
- Returns:
A (num_factors+1) x (num_factors+1) matrix of estimated mutual information between the num_factors metadata factors and the class label. Symmetry is enforced.
- Return type:
BalanceOutput
Notes
We use mutual_info_classif from sklearn since the class label is categorical. Its outputs depend on a random seed and are consistent only up to O(1e-4). MI is computed differently for categorical and continuous variables, so we attempt to infer whether a variable is categorical from the fraction of unique values in the dataset.
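The unique-value heuristic mentioned above can be sketched as follows. This is an assumption-laden illustration rather than the library's actual rule: the helper `looks_categorical` and its `max_unique_fraction` threshold of 0.5 are hypothetical, since the note does not state which cutoff is used.

```python
def looks_categorical(values, max_unique_fraction=0.5):
    """Guess whether a variable is categorical from its fraction of unique values.

    Hypothetical helper; the threshold is chosen only for illustration.
    """
    values = list(values)
    if not values:
        raise ValueError("values must be non-empty")
    unique_fraction = len(set(values)) / len(values)
    return unique_fraction <= max_unique_fraction


# Hypothetical metadata factors, one value per image.
weather = ["sun", "rain", "sun", "sun", "rain", "sun", "rain", "sun"]  # 2 unique of 8
altitude = [101.2, 98.7, 250.0, 312.4, 87.9, 145.3, 199.9, 276.1]      # 8 unique of 8

weather_is_categorical = looks_categorical(weather)    # unique fraction 0.25
altitude_is_categorical = looks_categorical(altitude)  # unique fraction 1.0
```

Any heuristic of this kind can misjudge small datasets: a categorical factor with many distinct categories and few samples has a high unique fraction and may be treated as continuous.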
Example
Return balance (mutual information) of factors with class_labels
>>> bal = balance(class_labels, metadata)
>>> bal.balance
array([0.99999822, 0.13363788, 0.04505382, 0.02994455])
Return intra/interfactor balance (mutual information)
>>> bal.factors
array([[0.99999843, 0.03510422, 0.09725766],
       [0.03510422, 0.08433558, 0.15621459],
       [0.09725766, 0.15621459, 0.99999856]])
Return classwise balance (mutual information) of factors with individual class_labels
>>> bal.classwise
array([[0.99999822, 0.13363788, 0.        , 0.        ],
       [0.99999822, 0.13363788, 0.        , 0.        ]])
See also
sklearn.feature_selection.mutual_info_classif, sklearn.feature_selection.mutual_info_regression, sklearn.metrics.mutual_info_score