dataeval.bias.Balance

class dataeval.bias.Balance(num_neighbors=None, class_imbalance_threshold=None, factor_correlation_threshold=None, config=None)

Calculates mutual information (MI) between factors (class label, metadata, label/image properties).

Identifies imbalanced classes and highly correlated metadata factors based on mutual information thresholds.

Parameters:
num_neighbors : int, default 5

Number of points to consider as neighbors when estimating mutual information.

class_imbalance_threshold : float, default 0.3

Threshold for identifying imbalanced classes. A class is considered imbalanced if its MI with any metadata factor is above this threshold.

factor_correlation_threshold : float, default 0.5

Threshold for identifying highly correlated metadata factors. Factor pairs with MI above this threshold are considered highly correlated.

config : Balance.Config, default None

Configuration object bundling the parameters above (see the "Using configuration" example and Balance.Config).

metadata

Preprocessed metadata from the last evaluate() call.

Type:

Metadata

num_neighbors

Number of points to consider as neighbors when estimating mutual information.

Type:

int

class_imbalance_threshold

Threshold for identifying imbalanced classes

Type:

float

factor_correlation_threshold

Threshold for identifying highly correlated metadata factors

Type:

float
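
As a rough illustration of how the two thresholds act on MI scores, the sketch below applies the documented defaults to a few values copied from the evaluate() example further down. The helper name exceeds and the strict greater-than comparison at the boundary are assumptions made for illustration; this is not DataEval's implementation.

```python
# Default thresholds as documented in the parameter list.
CLASS_IMBALANCE_THRESHOLD = 0.3
FACTOR_CORRELATION_THRESHOLD = 0.5


def exceeds(mi_value: float, threshold: float) -> bool:
    """Flag an MI score that lies above the threshold (strict '>' is assumed)."""
    return mi_value > threshold


# (class_name, factor_name, mi_value), taken from the classwise example table.
classwise_scores = [("boat", "angle", 0.020807), ("boat", "id", 0.471488)]
# (factor1, factor2, mi_value), taken from the factors example table.
factor_scores = [("angle", "id", 1.0), ("angle", "location", 0.12422)]

imbalanced = [
    (cls, factor)
    for cls, factor, mi in classwise_scores
    if exceeds(mi, CLASS_IMBALANCE_THRESHOLD)
]
correlated = [
    (a, b)
    for a, b, mi in factor_scores
    if exceeds(mi, FACTOR_CORRELATION_THRESHOLD)
]

print(imbalanced)  # [('boat', 'id')]
print(correlated)  # [('angle', 'id')]
```

Lowering class_imbalance_threshold flags more class/factor pairs, and raising factor_correlation_threshold flags fewer factor pairs.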

Notes

Mutual information is estimated with mutual_info_classif from sklearn because the class label is categorical. Its outputs depend on a random seed, so repeated runs agree only to within roughly 1e-4. MI is computed differently for categorical and continuous variables.
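
For two categorical variables, MI can be estimated directly from co-occurrence counts, which is the quantity sklearn.metrics.mutual_info_score reports (in nats). The sketch below is a plain-Python version of that count-based estimate, not DataEval's code path; the function name discrete_mutual_info and the toy data are made up for illustration. It returns raw MI in nats, whereas the scores in the evaluate() example appear to be scaled to the range 0 to 1.

```python
import math
from collections import Counter


def discrete_mutual_info(x, y):
    """Count-based MI estimate (in nats) between two categorical sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    joint = Counter(zip(x, y))  # co-occurrence counts of (x, y) pairs
    count_x = Counter(x)        # marginal counts of x
    count_y = Counter(y)        # marginal counts of y
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log( p(a, b) / (p(a) * p(b)) ), written in terms of counts
        mi += (c / n) * math.log(c * n / (count_x[a] * count_y[b]))
    return mi


# Hypothetical toy data: weather fully determines the class label.
weather = ["sun", "sun", "rain", "rain"]
label = ["boat", "boat", "plane", "plane"]
dependent_mi = discrete_mutual_info(weather, label)

# Hypothetical toy data: the two variables are independent.
independent_mi = discrete_mutual_info([0, 0, 1, 1], [0, 1, 0, 1])

print(round(dependent_mi, 4))    # 0.6931, i.e. ln(2)
print(round(independent_mi, 4))  # 0.0
```

Continuous factors cannot be counted this way, which is why mutual_info_classif falls back to a nearest-neighbor estimator for them; that estimator is where the neighbor count and the random seed come into play.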

Examples

Initialize the Balance class:

>>> balance = Balance()

Specifying custom thresholds:

>>> balance = Balance(class_imbalance_threshold=0.2, factor_correlation_threshold=0.6)

Using configuration:

>>> config = Balance.Config(num_neighbors=10, class_imbalance_threshold=0.2)
>>> balance = Balance(config=config)

See also

sklearn.feature_selection.mutual_info_classif, sklearn.feature_selection.mutual_info_regression, sklearn.metrics.mutual_info_score

evaluate(data)

Compute mutual information between factors, flagging imbalanced classes and highly correlated factor pairs.

Parameters:
data : AnnotatedDataset[Any] or Metadata

Either an annotated dataset (which will be converted to Metadata) or any object implementing the Metadata protocol.

Returns:

Three DataFrames containing MI scores and threshold flags:

  • balance: Global class-to-factor mutual information

  • factors: Inter-factor mutual information

  • classwise: Per-class-to-factor mutual information

Return type:

BalanceOutput

Example

Compute the balance (mutual information) of each factor with the class labels:

>>> from dataeval import Metadata
>>> metadata = Metadata(dataset)
>>> balance = Balance()
>>> result = balance.evaluate(metadata)
>>> result.balance
shape: (6, 2)
┌─────────────┬──────────┐
│ factor_name ┆ mi_value │
│ ---         ┆ ---      │
│ cat         ┆ f64      │
╞═════════════╪══════════╡
│ class_label ┆ 1.0      │
│ angle       ┆ 0.029047 │
│ id          ┆ 0.575706 │
│ location    ┆ 0.024849 │
│ time_of_day ┆ 0.06278  │
│ weather     ┆ 0.023614 │
└─────────────┴──────────┘
>>> result.factors
shape: (20, 4)
┌─────────────┬─────────────┬──────────┬───────────────┐
│ factor1     ┆ factor2     ┆ mi_value ┆ is_correlated │
│ ---         ┆ ---         ┆ ---      ┆ ---           │
│ cat         ┆ cat         ┆ f64      ┆ bool          │
╞═════════════╪═════════════╪══════════╪═══════════════╡
│ angle       ┆ id          ┆ 1.0      ┆ true          │
│ angle       ┆ location    ┆ 0.12422  ┆ false         │
│ angle       ┆ time_of_day ┆ 0.072422 ┆ false         │
│ angle       ┆ weather     ┆ 0.037279 ┆ false         │
│ id          ┆ angle       ┆ 1.0      ┆ true          │
│ …           ┆ …           ┆ …        ┆ …             │
│ time_of_day ┆ weather     ┆ 0.023866 ┆ false         │
│ weather     ┆ angle       ┆ 0.037279 ┆ false         │
│ weather     ┆ id          ┆ 1.0      ┆ true          │
│ weather     ┆ location    ┆ 0.047246 ┆ false         │
│ weather     ┆ time_of_day ┆ 0.023866 ┆ false         │
└─────────────┴─────────────┴──────────┴───────────────┘
>>> result.classwise
shape: (24, 4)
┌────────────┬─────────────┬──────────┬───────────────┐
│ class_name ┆ factor_name ┆ mi_value ┆ is_imbalanced │
│ ---        ┆ ---         ┆ ---      ┆ ---           │
│ cat        ┆ cat         ┆ f64      ┆ bool          │
╞════════════╪═════════════╪══════════╪═══════════════╡
│ boat       ┆ angle       ┆ 0.020807 ┆ false         │
│ boat       ┆ class_label ┆ 1.0      ┆ true          │
│ boat       ┆ id          ┆ 0.471488 ┆ true          │
│ boat       ┆ location    ┆ 0.009547 ┆ false         │
│ boat       ┆ time_of_day ┆ 0.04239  ┆ false         │
│ …          ┆ …           ┆ …        ┆ …             │
│ plane      ┆ class_label ┆ 1.0      ┆ true          │
│ plane      ┆ id          ┆ 0.49531  ┆ true          │
│ plane      ┆ location    ┆ 0.033162 ┆ false         │
│ plane      ┆ time_of_day ┆ 0.040861 ┆ false         │
│ plane      ┆ weather     ┆ 0.000407 ┆ false         │
└────────────┴─────────────┴──────────┴───────────────┘
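
One way to act on these tables is to collect the flagged rows. The sketch below works on plain Python rows copied from the output above so that it runs without DataEval installed. It assumes the tables are polars DataFrames (the printed format suggests this), in which case a list of dicts like classwise_rows could be obtained with to_dicts(). The helper name flagged_factors is hypothetical.

```python
from collections import defaultdict

# A few rows copied from the classwise table above.
classwise_rows = [
    {"class_name": "boat", "factor_name": "angle", "mi_value": 0.020807, "is_imbalanced": False},
    {"class_name": "boat", "factor_name": "class_label", "mi_value": 1.0, "is_imbalanced": True},
    {"class_name": "boat", "factor_name": "id", "mi_value": 0.471488, "is_imbalanced": True},
    {"class_name": "plane", "factor_name": "class_label", "mi_value": 1.0, "is_imbalanced": True},
    {"class_name": "plane", "factor_name": "id", "mi_value": 0.49531, "is_imbalanced": True},
    {"class_name": "plane", "factor_name": "weather", "mi_value": 0.000407, "is_imbalanced": False},
]


def flagged_factors(rows):
    """Map each class to its flagged factors, skipping the class_label row."""
    flagged = defaultdict(list)
    for row in rows:
        # In the output above, class_label always scores 1.0 against the class.
        if row["is_imbalanced"] and row["factor_name"] != "class_label":
            flagged[row["class_name"]].append(row["factor_name"])
    return dict(flagged)


flagged = flagged_factors(classwise_rows)
print(flagged)  # {'boat': ['id'], 'plane': ['id']}

# The factors table lists each pair in both orders, so deduplicate the
# flagged pairs by sorting the two names within each pair.
factor_rows = [
    ("angle", "id", 1.0, True),
    ("id", "angle", 1.0, True),
    ("weather", "id", 1.0, True),
    ("angle", "location", 0.12422, False),
]
unique_pairs = sorted({tuple(sorted((a, b))) for a, b, _, is_corr in factor_rows if is_corr})
print(unique_pairs)  # [('angle', 'id'), ('id', 'weather')]
```

In this example every flag involves id, a per-sample identifier, which suggests excluding such factors before interpreting the results.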

Classes

Config

Configuration for the Balance evaluator.