dataeval.bias.Balance

class dataeval.bias.Balance(num_neighbors=5, class_imbalance_threshold=0.3, factor_correlation_threshold=0.5)

Calculates mutual information (MI) between factors (class label, metadata, label/image properties).

Identifies imbalanced classes and highly correlated metadata factors based on mutual information thresholds.

Parameters:
num_neighbors : int, default 5

Number of points to consider as neighbors

class_imbalance_threshold : float, default 0.3

Threshold for identifying imbalanced classes. Classes with MI above this threshold with any metadata factor are considered imbalanced.

factor_correlation_threshold : float, default 0.5

Threshold for identifying highly correlated metadata factors. Factor pairs with MI above this threshold are considered highly correlated.

metadata

Preprocessed metadata from the last evaluate() call.

Type:

Metadata

num_neighbors

Number of points to consider as neighbors

Type:

int

class_imbalance_threshold

Threshold for identifying imbalanced classes

Type:

float

factor_correlation_threshold

Threshold for identifying highly correlated metadata factors

Type:

float

Notes

We use mutual_info_classif from sklearn since class label is categorical. mutual_info_classif outputs are consistent up to O(1e-4) and depend on a random seed. MI is computed differently for categorical and continuous variables.

Examples

Initialize the Balance class:

>>> balance = Balance()

Specifying custom thresholds:

>>> balance = Balance(class_imbalance_threshold=0.2, factor_correlation_threshold=0.6)

See also

sklearn.feature_selection.mutual_info_classif, sklearn.feature_selection.mutual_info_regression, sklearn.metrics.mutual_info_score

evaluate(data)

Compute mutual information between factors and identify imbalanced classes.

Parameters:
data : AnnotatedDataset[Any] or Metadata

Either an annotated dataset (which will be converted to Metadata) or preprocessed Metadata directly.

Returns:

Three DataFrames containing MI scores and threshold flags: - balance: Global class-to-factor mutual information - factors: Inter-factor mutual information - classwise: Per-class-to-factor mutual information

Return type:

BalanceOutput

Example

Return balance (mutual information) of factors with class_labels

>>> metadata = generate_random_metadata(
...     labels=["doctor", "artist", "teacher"],
...     factors={"age": [25, 30, 35, 45], "income": [50000, 65000, 80000], "gender": ["M", "F"]},
...     length=100,
...     random_seed=175,
... )
>>> balance = Balance()
>>> result = balance.evaluate(metadata)
>>> result.balance
shape: (4, 2)
┌─────────────┬──────────┐
│ factor_name ┆ mi_value │
│ ---         ┆ ---      │
│ cat         ┆ f64      │
╞═════════════╪══════════╡
│ class_label ┆ 0.888187 │
│ age         ┆ 0.251485 │
│ gender      ┆ 0.00399  │
│ income      ┆ 0.362771 │
└─────────────┴──────────┘
>>> result.factors
shape: (6, 4)
┌─────────┬─────────┬──────────┬───────────────┐
│ factor1 ┆ factor2 ┆ mi_value ┆ is_correlated │
│ ---     ┆ ---     ┆ ---      ┆ ---           │
│ cat     ┆ cat     ┆ f64      ┆ bool          │
╞═════════╪═════════╪══════════╪═══════════════╡
│ age     ┆ gender  ┆ 0.046483 ┆ false         │
│ age     ┆ income  ┆ 0.078066 ┆ false         │
│ gender  ┆ age     ┆ 0.046483 ┆ false         │
│ gender  ┆ income  ┆ 0.047947 ┆ false         │
│ income  ┆ age     ┆ 0.078066 ┆ false         │
│ income  ┆ gender  ┆ 0.047947 ┆ false         │
└─────────┴─────────┴──────────┴───────────────┘
>>> result.classwise
shape: (9, 4)
┌────────────┬─────────────┬──────────┬───────────────┐
│ class_name ┆ factor_name ┆ mi_value ┆ is_imbalanced │
│ ---        ┆ ---         ┆ ---      ┆ ---           │
│ cat        ┆ cat         ┆ f64      ┆ bool          │
╞════════════╪═════════════╪══════════╪═══════════════╡
│ artist     ┆ age         ┆ 0.301469 ┆ true          │
│ artist     ┆ gender      ┆ 0.04493  ┆ false         │
│ artist     ┆ income      ┆ 0.250237 ┆ false         │
│ doctor     ┆ age         ┆ 0.164287 ┆ false         │
│ doctor     ┆ gender      ┆ 0.095962 ┆ false         │
│ doctor     ┆ income      ┆ 0.46587  ┆ true          │
│ teacher    ┆ age         ┆ 0.137221 ┆ false         │
│ teacher    ┆ gender      ┆ 0.018392 ┆ false         │
│ teacher    ┆ income      ┆ 0.160404 ┆ false         │
└────────────┴─────────────┴──────────┴───────────────┘