Balance

DataEval API

dataeval.metrics.balance(class_labels: Sequence[int], metadata: List[Dict], num_neighbors: int = 5) → BalanceOutput

Compute mutual information (MI) between factors (class label, metadata, label/image properties).

Parameters:
  • class_labels (Sequence[int]) – Class label for each image

  • metadata (List[Dict]) – Metadata factors for each image, one dictionary per image

  • num_neighbors (int, default 5) – Number of nearest neighbors to use for computing MI between discrete and continuous variables.

Returns:

A (num_factors+1) x (num_factors+1) matrix of estimated mutual information among the class label and the num_factors metadata factors. Symmetry is enforced.

Return type:

BalanceOutput
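
To make the shape of the result concrete, the following is a minimal sketch of how a symmetric pairwise MI matrix of this kind could be assembled directly with scikit-learn. It is not DataEval's implementation: the helper name pairwise_mi, the synthetic factors, the hand-written is_categorical mask, and enforcing symmetry by averaging the matrix with its transpose are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression


def pairwise_mi(data, is_categorical, num_neighbors=5, seed=0):
    """Hypothetical helper: MI between every pair of columns in `data`.

    Column 0 is taken to be the class label; the rest are metadata factors.
    """
    data = np.asarray(data, dtype=float)
    mask = np.asarray(is_categorical, dtype=bool)
    n_vars = data.shape[1]
    mi = np.empty((n_vars, n_vars))
    for j in range(n_vars):
        if mask[j]:
            # Categorical target: use the classification estimator.
            estimator, target = mutual_info_classif, data[:, j].astype(int)
        else:
            # Continuous target: use the regression estimator.
            estimator, target = mutual_info_regression, data[:, j]
        mi[j] = estimator(
            data,
            target,
            discrete_features=mask,
            n_neighbors=num_neighbors,
            random_state=seed,
        )
    # Assumed way of enforcing symmetry: average with the transpose.
    return 0.5 * (mi + mi.T)


# Synthetic data: class label plus two metadata factors.
rng = np.random.default_rng(0)
n = 300
labels = rng.integers(0, 3, size=n)
weather = (labels + (rng.random(n) < 0.2)) % 3   # categorical, tracks the label
altitude = rng.normal(size=n) + labels           # continuous, shifts with the label
data = np.column_stack([labels, weather, altitude])

mi_matrix = pairwise_mi(data, is_categorical=[True, True, False])
print(mi_matrix.shape)  # (num_factors + 1, num_factors + 1)
```

Row and column 0 of mi_matrix correspond to the class label, so the first row reads as the MI between the label and each factor.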

Notes

We use mutual_info_classif from sklearn since the class label is categorical. Its outputs depend on a random seed and are consistent only to within O(1e-4) between runs. MI is computed differently for categorical and continuous variables, and we attempt to infer whether a variable is categorical from the fraction of unique values in the dataset.
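
The behaviour described in these notes can be tried out directly in scikit-learn. The sketch below is illustrative only: infer_categorical and its 0.25 threshold are assumptions (the threshold DataEval uses is not stated here), and the two factors are synthetic.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif


def infer_categorical(values, unique_fraction_threshold=0.25):
    """Hypothetical heuristic: few unique values relative to the sample
    count suggests a categorical variable. The threshold is an assumption."""
    values = np.asarray(values)
    return bool(np.unique(values).size / values.size < unique_fraction_threshold)


rng = np.random.default_rng(0)
n = 500
class_labels = rng.integers(0, 3, size=n)
location = (class_labels + (rng.random(n) < 0.1)) % 3  # categorical, tied to label
brightness = rng.normal(size=n)                        # continuous, unrelated

factors = np.column_stack([location, brightness]).astype(float)
is_categorical = [infer_categorical(col) for col in factors.T]
mask = np.array(is_categorical, dtype=bool)

# Same data, two seeds: the seed only perturbs the continuous columns,
# so any difference between runs shows up in the continuous estimate.
mi_a = mutual_info_classif(factors, class_labels, discrete_features=mask,
                           n_neighbors=5, random_state=0)
mi_b = mutual_info_classif(factors, class_labels, discrete_features=mask,
                           n_neighbors=5, random_state=1)

print(is_categorical)
print(mi_a, mi_b)
```

Fixing random_state, as done here, is one way to make repeated runs comparable.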

See also

sklearn.feature_selection.mutual_info_classif, sklearn.feature_selection.mutual_info_regression, sklearn.metrics.mutual_info_score

dataeval.metrics.balance_classwise(class_labels: Sequence[int], metadata: List[Dict], num_neighbors: int = 5) → BalanceOutput

Compute mutual information (analogous to correlation) between metadata factors (class label, metadata, label/image properties) and individual class labels.

Parameters:
  • class_labels (Sequence[int]) – Class label for each image

  • metadata (List[Dict]) – Metadata factors for each image, one dictionary per image

  • num_neighbors (int, default 5) – Number of nearest neighbors to use for computing MI between discrete and continuous variables.

Notes

We use mutual_info_classif from sklearn since the class label is categorical. Its outputs depend on a random seed and are consistent only to within O(1e-4) between runs. MI is computed differently for categorical and continuous variables, so each variable must be identified as one or the other via is_categorical.

Returns:

A (num_classes x num_factors) matrix of estimated mutual information between each of the num_factors metadata factors and each individual class label.

Return type:

BalanceOutput
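
As an illustration of what a (num_classes x num_factors) result can look like, the sketch below scores each factor against a one-vs-rest indicator for every class using scikit-learn. Treating "individual class labels" as one-vs-rest indicators is an assumption made here, and classwise_mi, the synthetic factors, and the is_categorical mask are hypothetical names, not part of the DataEval API.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif


def classwise_mi(class_labels, factors, is_categorical, num_neighbors=5, seed=0):
    """Hypothetical helper: one row of MI estimates per class."""
    class_labels = np.asarray(class_labels)
    factors = np.asarray(factors, dtype=float)
    mask = np.asarray(is_categorical, dtype=bool)
    classes = np.unique(class_labels)
    rows = []
    for c in classes:
        # Binary target: "is this image of class c?"
        indicator = (class_labels == c).astype(int)
        rows.append(
            mutual_info_classif(
                factors,
                indicator,
                discrete_features=mask,
                n_neighbors=num_neighbors,
                random_state=seed,
            )
        )
    return classes, np.vstack(rows)


rng = np.random.default_rng(0)
n = 600
class_labels = rng.integers(0, 3, size=n)
# "night" is set almost exclusively for class 2 (5% of entries are flipped).
night = ((class_labels == 2) ^ (rng.random(n) < 0.05)).astype(int)
exposure = rng.normal(size=n)  # continuous, unrelated to the label
factors = np.column_stack([night, exposure])

classes, mi_per_class = classwise_mi(class_labels, factors, [True, False])
print(mi_per_class.shape)  # (num_classes, num_factors)
```

In this synthetic data the row for class 2 has the largest entry in the night column, which is the kind of class-specific imbalance a classwise view is meant to expose.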

See also

sklearn.feature_selection.mutual_info_classif, sklearn.feature_selection.mutual_info_regression, sklearn.metrics.mutual_info_score, compute_mutual_information