dataeval.core.mutual_info
dataeval.core.mutual_info(class_labels, factor_data, discrete_features=None, num_neighbors=5)
Compute normalized mutual information between factors, transformed to lie in [0, 1].
Factors include class label, metadata, and label/image properties.
- Parameters:
- class_labels : Array1D[int]
Target class labels as integer indices. Can be a 1D list or array-like object.
- factor_data : Array2D[int | float]
Factor values after binning or digitization. Can be a 2D list or array-like object.
- discrete_features : Array1D[bool] | None = None
Boolean array indicating whether each factor is discretized. Can be a 1D list or array-like object.
- num_neighbors : int = 5
Number of points to consider as neighbors when estimating mutual information.
- Returns:
TypedDict containing:
class_to_factor: NDArray[np.float64] - 1D array of normalized MI between class labels and each factor
interfactor: NDArray[np.float64] - (num_factors) x (num_factors) matrix of normalized MI between factors only
- Return type:
See also
sklearn.feature_selection.mutual_info_classif, sklearn.feature_selection.mutual_info_regression, sklearn.metrics.mutual_info_score
Notes
We use mutual_info_classif from sklearn because the class label is categorical. Its outputs depend on a random seed and are consistent only up to O(1e-4). MI is computed differently for categorical and continuous variables. Because the entropy of a continuous distribution has no upper limit, normalizing by entropy is problematic for continuous variables. Instead, we transform mutual information into a balance metric using the Linfoot transformation.
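As a rough illustration of the transformation mentioned in the Notes, the sketch below assumes the commonly cited Linfoot form, r = sqrt(1 - exp(-2 * MI)) with MI in nats, and pairs it with a simple plug-in MI estimate for discrete data. The helper names discrete_mutual_info and linfoot are hypothetical, and this is not dataeval's implementation, which relies on sklearn estimators and may apply further normalization.

```python
import math
from collections import Counter


def discrete_mutual_info(x, y):
    """Plug-in estimate of mutual information (in nats) for two discrete sequences."""
    n = len(x)
    joint = Counter(zip(x, y))  # joint counts of (x, y) pairs
    px = Counter(x)             # marginal counts of x
    py = Counter(y)             # marginal counts of y
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) simplifies to count * n / (px[a] * py[b])
        mi += p_ab * math.log(count * n / (px[a] * py[b]))
    return mi


def linfoot(mi):
    """Assumed Linfoot form: map MI in nats to [0, 1] via sqrt(1 - exp(-2 * MI))."""
    return math.sqrt(1.0 - math.exp(-2.0 * max(mi, 0.0)))


# Independent binary variables: MI is 0, so the transformed value is 0.
independent = linfoot(discrete_mutual_info([0, 0, 1, 1], [0, 1, 0, 1]))

# Identical balanced binary variables: MI is ln(2), so the value is sqrt(0.75), about 0.866.
identical = linfoot(discrete_mutual_info([0, 0, 1, 1], [0, 0, 1, 1]))
```

Under this assumed form the output is bounded without dividing by an entropy term, which is the property the Notes rely on for continuous variables.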
Example
Return balance (normalized mutual information) of factors with class_labels:
>>> import numpy as np
>>> from dataeval.core import mutual_info
>>> rng = np.random.default_rng(175)
>>> class_labels = rng.choice([0, 1, 2], size=100)
>>> factor_data = np.column_stack([
...     rng.choice([25, 35, 45, 55], size=100),  # age
...     rng.choice([50000, 65000, 80000], size=100),  # income
...     rng.choice([0, 1], size=100),  # gender
... ])
>>> result = mutual_info(class_labels=class_labels, factor_data=factor_data)
>>> result["class_to_factor"]
array([1.   , 0.034, 0.026, 0.004])
>>> result["interfactor"]
array([[1.   , 0.017, 0.056],
       [0.017, 1.   , 0.01 ],
       [0.056, 0.01 , 1.   ]])
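One way to read the interfactor matrix is to look for the largest off-diagonal entry, which points at the pair of factors that share the most information. The sketch below is a minimal illustration using the interfactor values printed in the example; the helper strongest_pair is hypothetical and not part of dataeval, and the factor names are taken from the comments in the example.

```python
def strongest_pair(interfactor, names):
    """Return (name_i, name_j, value) for the largest off-diagonal entry."""
    best = None
    for i in range(len(names)):
        # The matrix is symmetric, so only the upper triangle is scanned.
        for j in range(i + 1, len(names)):
            value = float(interfactor[i][j])
            if best is None or value > best[2]:
                best = (names[i], names[j], value)
    return best


# Values copied from the example output; a nested list stands in for the array.
interfactor = [
    [1.0, 0.017, 0.056],
    [0.017, 1.0, 0.01],
    [0.056, 0.01, 1.0],
]
names = ["age", "income", "gender"]

pair = strongest_pair(interfactor, names)  # ("age", "gender", 0.056)
```

The helper only uses `interfactor[i][j]` indexing, so it should work the same way on the NumPy array returned under the "interfactor" key.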