dataeval.bias.Balance
class dataeval.bias.Balance(num_neighbors=None, class_imbalance_threshold=None, factor_correlation_threshold=None, config=None)
Calculates mutual information (MI) between factors (class label, metadata, label/image properties).
Identifies imbalanced classes and highly correlated metadata factors based on mutual information thresholds.
- Parameters:
- num_neighbors : int, default 5
Number of nearest neighbors to consider when estimating mutual information.
- class_imbalance_threshold : float, default 0.3
Threshold for identifying imbalanced classes. A class whose MI with any metadata factor is above this threshold is considered imbalanced.
- factor_correlation_threshold : float, default 0.5
Threshold for identifying highly correlated metadata factors. Factor pairs whose MI is above this threshold are considered highly correlated.
- factor_correlation_threshold
Threshold for identifying highly correlated metadata factors.
- Type:
float
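As a rough illustration of how the two thresholds act on MI scores, the sketch below flags every score that is strictly greater than a threshold. The helper `flag_above` and the strict `>` comparison are assumptions made for illustration, not DataEval's implementation. The MI values are copied from the example output documented under evaluate.

```python
# Hypothetical helper for illustration only: not part of DataEval's API.
def flag_above(mi_by_name: dict[str, float], threshold: float) -> dict[str, bool]:
    """Return True for every entry whose MI score is above the threshold."""
    return {name: mi > threshold for name, mi in mi_by_name.items()}


# Per-class MI scores for the "boat" class, taken from the documented output.
boat_mi = {
    "angle": 0.020807,
    "id": 0.471488,
    "location": 0.009547,
    "time_of_day": 0.04239,
}
# 0.3 is the documented default for class_imbalance_threshold.
imbalanced = flag_above(boat_mi, threshold=0.3)

# Inter-factor MI scores for pairs that involve "angle".
angle_pairs = {
    "id": 1.0,
    "location": 0.12422,
    "time_of_day": 0.072422,
    "weather": 0.037279,
}
# 0.5 is the documented default for factor_correlation_threshold.
correlated = flag_above(angle_pairs, threshold=0.5)

print(imbalanced)
print(correlated)
```

With the default thresholds, only the `id` entries are flagged in both dictionaries, which matches the `true` rows in the documented tables.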
Notes
We use mutual_info_classif from sklearn because the class label is categorical. Its outputs depend on a random seed and are consistent only up to O(1e-4). MI is computed differently for categorical and continuous variables.
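To make the seed dependence concrete, here is a minimal sketch that calls scikit-learn's mutual_info_classif directly on synthetic data. The data, variable names, and seed values are illustrative assumptions; this is not a description of how Balance invokes the function internally.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n_samples = 500

# Categorical class label with three classes.
y = rng.integers(0, 3, size=n_samples)

# One continuous factor tied to the class label and one independent of it.
informative = y + rng.normal(0.0, 0.3, size=n_samples)
unrelated = rng.normal(0.0, 1.0, size=n_samples)
X = np.column_stack([informative, unrelated])

# Same seed twice, then a different seed.
mi_a = mutual_info_classif(X, y, n_neighbors=5, random_state=0)
mi_b = mutual_info_classif(X, y, n_neighbors=5, random_state=0)
mi_c = mutual_info_classif(X, y, n_neighbors=5, random_state=1)

print("seed 0:", mi_a)
print("seed 0 again:", mi_b)
print("seed 1:", mi_c)
```

Fixing `random_state` makes repeated calls reproducible, while changing it may shift the estimates slightly for continuous factors.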
Examples
Initialize the Balance class:
>>> from dataeval.bias import Balance
>>> balance = Balance()

Specifying custom thresholds:
>>> balance = Balance(class_imbalance_threshold=0.2, factor_correlation_threshold=0.6)

Using configuration:
>>> config = Balance.Config(num_neighbors=10, class_imbalance_threshold=0.2)
>>> balance = Balance(config=config)

See also
sklearn.feature_selection.mutual_info_classif, sklearn.feature_selection.mutual_info_regression, sklearn.metrics.mutual_info_score
- evaluate(data)
Compute mutual information between factors and identify imbalanced classes.
- Parameters:
- data : AnnotatedDataset[Any] or Metadata
Either an annotated dataset (which will be converted to Metadata) or any object implementing the Metadata protocol.
- Returns:
Three DataFrames containing MI scores and threshold flags:
- balance: Global class-to-factor mutual information
- factors: Inter-factor mutual information
- classwise: Per-class-to-factor mutual information
- Return type:
Example
Return balance (mutual information) of factors with class_labels
>>> from dataeval import Metadata
>>> metadata = Metadata(dataset)

>>> balance = Balance()
>>> result = balance.evaluate(metadata)
>>> result.balance
shape: (6, 2)
┌─────────────┬──────────┐
│ factor_name ┆ mi_value │
│ ---         ┆ ---      │
│ cat         ┆ f64      │
╞═════════════╪══════════╡
│ class_label ┆ 1.0      │
│ angle       ┆ 0.029047 │
│ id          ┆ 0.575706 │
│ location    ┆ 0.024849 │
│ time_of_day ┆ 0.06278  │
│ weather     ┆ 0.023614 │
└─────────────┴──────────┘

>>> result.factors
shape: (20, 4)
┌─────────────┬─────────────┬──────────┬───────────────┐
│ factor1     ┆ factor2     ┆ mi_value ┆ is_correlated │
│ ---         ┆ ---         ┆ ---      ┆ ---           │
│ cat         ┆ cat         ┆ f64      ┆ bool          │
╞═════════════╪═════════════╪══════════╪═══════════════╡
│ angle       ┆ id          ┆ 1.0      ┆ true          │
│ angle       ┆ location    ┆ 0.12422  ┆ false         │
│ angle       ┆ time_of_day ┆ 0.072422 ┆ false         │
│ angle       ┆ weather     ┆ 0.037279 ┆ false         │
│ id          ┆ angle       ┆ 1.0      ┆ true          │
│ …           ┆ …           ┆ …        ┆ …             │
│ time_of_day ┆ weather     ┆ 0.023866 ┆ false         │
│ weather     ┆ angle       ┆ 0.037279 ┆ false         │
│ weather     ┆ id          ┆ 1.0      ┆ true          │
│ weather     ┆ location    ┆ 0.047246 ┆ false         │
│ weather     ┆ time_of_day ┆ 0.023866 ┆ false         │
└─────────────┴─────────────┴──────────┴───────────────┘

>>> result.classwise
shape: (24, 4)
┌────────────┬─────────────┬──────────┬───────────────┐
│ class_name ┆ factor_name ┆ mi_value ┆ is_imbalanced │
│ ---        ┆ ---         ┆ ---      ┆ ---           │
│ cat        ┆ cat         ┆ f64      ┆ bool          │
╞════════════╪═════════════╪══════════╪═══════════════╡
│ boat       ┆ angle       ┆ 0.020807 ┆ false         │
│ boat       ┆ class_label ┆ 1.0      ┆ true          │
│ boat       ┆ id          ┆ 0.471488 ┆ true          │
│ boat       ┆ location    ┆ 0.009547 ┆ false         │
│ boat       ┆ time_of_day ┆ 0.04239  ┆ false         │
│ …          ┆ …           ┆ …        ┆ …             │
│ plane      ┆ class_label ┆ 1.0      ┆ true          │
│ plane      ┆ id          ┆ 0.49531  ┆ true          │
│ plane      ┆ location    ┆ 0.033162 ┆ false         │
│ plane      ┆ time_of_day ┆ 0.040861 ┆ false         │
│ plane      ┆ weather     ┆ 0.000407 ┆ false         │
└────────────┴─────────────┴──────────┴───────────────┘
Classes
Config: Configuration for the Balance evaluator.