balance

Balance and classwise balance are metrics that measure the distributional correlation between metadata factors and class labels. They can indicate opportunities for shortcut learning and disproportionate dataset sampling, either with respect to class labels or between metadata factors.
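To make the idea concrete, here is a minimal pure-Python sketch of a plug-in mutual information estimate between class labels and a single discrete metadata factor. The helper `discrete_mutual_information` and the toy factors `time_of_day` and `camera` are hypothetical and are not part of dataeval; the library's own estimator may differ.

```python
import math
from collections import Counter


def discrete_mutual_information(xs, ys):
    """Plug-in MI estimate (in nats) between two equal-length discrete sequences.

    Illustrative helper only; not the dataeval implementation.
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # Each term is p(x, y) * log(p(x, y) / (p(x) * p(y))),
        # with probabilities estimated from counts.
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical toy data: "time_of_day" tracks the label exactly,
# while "camera" is spread evenly across both classes.
class_labels = ["cat", "cat", "dog", "dog", "cat", "dog", "cat", "dog"]
time_of_day = ["day", "day", "night", "night", "day", "night", "day", "night"]
camera = ["a", "b", "a", "b", "b", "a", "a", "b"]

mi_time = discrete_mutual_information(class_labels, time_of_day)
mi_camera = discrete_mutual_information(class_labels, camera)
```

In this toy data, `mi_time` equals ln 2 (the full entropy of a balanced two-class label), which is the kind of factor a model could exploit as a shortcut, while `mi_camera` is 0 because the camera is distributed identically in both classes.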

dataeval.metrics.bias.balance(class_labels: ArrayLike, metadata: Mapping[str, ArrayLike], num_neighbors: int = 5, continuous_factor_bincounts: Mapping[str, int] | None = None) → BalanceOutput

Compute mutual information (MI) between factors (class label, metadata, label/image properties).

Parameters:
  • class_labels (ArrayLike) – List of class labels for each image

  • metadata (Mapping[str, ArrayLike]) – Dict of lists of metadata factors for each image

  • num_neighbors (int, default 5) – Number of nearest neighbors to use for computing MI between discrete and continuous variables.

  • continuous_factor_bincounts (Mapping[str, int] or None, default None) – The factors in metadata that have continuous values, mapped to the number of bins to discretize each one into. All factors are treated as having discrete values unless they are specified as keys in this dictionary. Each key of this dictionary must also occur as a key in metadata.
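As a rough illustration of how metadata and continuous_factor_bincounts relate, the sketch below bins one continuous factor and leaves a discrete factor untouched. The binning strategy used by the library is not stated here, so equal-width binning is an assumption, and `equal_width_bins`, `altitude`, and `weather` are hypothetical names used only for this example.

```python
def equal_width_bins(values, num_bins):
    """Map each continuous value to an integer bin index in [0, num_bins - 1].

    Assumes equal-width bins over the observed range; the library may bin differently.
    """
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant factor falls entirely into a single bin.
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # The maximum value would land in bin num_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


# Hypothetical metadata: one continuous factor and one discrete factor.
metadata = {
    "altitude": [10.0, 12.5, 55.0, 80.0, 99.0, 100.0],
    "weather": ["sun", "rain", "sun", "sun", "rain", "sun"],
}
continuous_factor_bincounts = {"altitude": 3}

# Every key in continuous_factor_bincounts must also be a key in metadata.
missing = set(continuous_factor_bincounts) - set(metadata)
if missing:
    raise KeyError(f"factors not found in metadata: {sorted(missing)}")

# Factors without a bin count are kept as-is and treated as discrete.
discretized = {
    name: equal_width_bins(values, continuous_factor_bincounts[name])
    if name in continuous_factor_bincounts
    else list(values)
    for name, values in metadata.items()
}
```

Here `altitude` is reduced to three bin indices while `weather` passes through unchanged.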

Returns:

(num_factors+1) x (num_factors+1) estimate of mutual information between num_factors metadata factors and class label. Symmetry is enforced.

Return type:

BalanceOutput

Note

We use mutual_info_classif from sklearn since the class label is categorical. mutual_info_classif outputs depend on a random seed and are consistent across runs only to within O(1e-4). MI is computed differently for categorical and continuous variables, and we attempt to infer whether a variable is categorical from the fraction of unique values in the dataset.
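The seed dependence can be seen by calling sklearn's mutual_info_classif directly, as in the sketch below. The synthetic data and variable names are hypothetical, and passing n_neighbors=5 only mirrors the default num_neighbors of this function; how balance forwards its arguments to sklearn is an assumption, not something this page confirms.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Hypothetical synthetic data: a binary label and one continuous factor
# whose mean shifts with the label.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
continuous_factor = labels + rng.normal(scale=0.5, size=200)
X = continuous_factor.reshape(-1, 1)

# For continuous features the estimator is nearest-neighbor based and adds a
# small amount of seeded noise, so the result depends on random_state.
mi_seed0 = mutual_info_classif(
    X, labels, discrete_features=False, n_neighbors=5, random_state=0
)
mi_seed0_again = mutual_info_classif(
    X, labels, discrete_features=False, n_neighbors=5, random_state=0
)
mi_seed1 = mutual_info_classif(
    X, labels, discrete_features=False, n_neighbors=5, random_state=1
)

# Fixing the seed makes repeated calls reproducible; changing it may
# shift the estimate slightly.
seed_gap = float(abs(mi_seed0[0] - mi_seed1[0]))
```

Fixing random_state gives repeatable values, while `seed_gap` shows how much the estimate moves between two seeds on this particular data.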

Example

Return balance (mutual information) of factors with class_labels

>>> bal = balance(class_labels, metadata, continuous_factor_bincounts=continuous_factor_bincounts)
>>> bal.balance
array([0.99999822, 0.13363788, 0.04505382, 0.02994455])

Return intra/interfactor balance (mutual information)

>>> bal.factors
array([[0.99999843, 0.04133555, 0.09725766],
       [0.04133555, 0.08433558, 0.1301489 ],
       [0.09725766, 0.1301489 , 0.99999856]])

Return classwise balance (mutual information) of factors with individual class_labels

>>> bal.classwise
array([[0.99999822, 0.13363788, 0.        , 0.        ],
       [0.99999822, 0.13363788, 0.        , 0.        ]])

See also

sklearn.feature_selection.mutual_info_classif, sklearn.feature_selection.mutual_info_regression, sklearn.metrics.mutual_info_score