dataeval.core.balance
dataeval.core.balance(class_labels, factor_data, discrete_features=None, num_neighbors=5)
Mutual information between factors (class label, metadata, label/image properties).
- Parameters:
- class_labels : NDArray[np.intp]
Target class labels as integer indices.
- factor_data : NDArray[np.intp]
Factor values after binning or digitization.
- discrete_features : Iterable[bool] | None = None
Boolean array or iterable indicating, for each factor, whether it is discrete.
- num_neighbors : int = 5
Number of nearest neighbors used when estimating mutual information for continuous factors.
- Returns:
Array of shape (num_factors + 1, num_factors + 1) holding mutual information estimates between the class label and the num_factors metadata factors. Symmetry is enforced.
- Return type:
NDArray[np.float64]
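The text above says only that symmetry is enforced, not how. One common approach, shown here as an assumption and not as dataeval's implementation, is to average a square matrix of pairwise estimates with its transpose. The helper name `symmetrize` and the sample values are hypothetical.

```python
import numpy as np


def symmetrize(mi: np.ndarray) -> np.ndarray:
    """Average a square matrix with its transpose so out[i, j] == out[j, i]."""
    mi = np.asarray(mi, dtype=np.float64)
    if mi.ndim != 2 or mi.shape[0] != mi.shape[1]:
        raise ValueError("expected a square 2-D array")
    return 0.5 * (mi + mi.T)


# Hypothetical, slightly asymmetric pairwise estimates (illustration only).
raw = np.array([[1.00, 0.03, 0.01],
                [0.05, 1.00, 0.02],
                [0.01, 0.04, 1.00]])
sym = symmetrize(raw)
```

Averaging leaves the diagonal untouched and replaces each off-diagonal pair with its mean, so the result can be read the same way by row or by column.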
Notes
Mutual information is computed with mutual_info_classif from sklearn because the class label is categorical. Its outputs depend on a random seed and are consistent only up to O(1e-4). MI is computed differently for categorical and continuous variables.
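The behavior described in these notes can be explored by calling sklearn directly. The following is a minimal sketch on synthetic data, not dataeval's internal code: the variable names are hypothetical, and passing `n_neighbors=5` to mirror `num_neighbors` is an assumption about how the parameter is forwarded. It marks two columns as discrete and one as continuous, and fixes `random_state` so repeated calls agree.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 300
labels = np.repeat([0, 1, 2], n // 3)    # balanced categorical target
copy_of_label = labels.copy()            # discrete, fully determined by the label
unrelated = rng.integers(0, 4, size=n)   # discrete, drawn independently of the label
noise = rng.normal(size=n)               # continuous, drawn independently of the label
X = np.column_stack([copy_of_label, unrelated, noise]).astype(np.float64)

# One flag per column: True selects the discrete estimator,
# False selects the nearest-neighbor estimator for continuous data.
discrete = [True, True, False]

mi_a = mutual_info_classif(
    X, labels, discrete_features=discrete, n_neighbors=5, random_state=0
)
mi_b = mutual_info_classif(
    X, labels, discrete_features=discrete, n_neighbors=5, random_state=0
)
```

With the same `random_state`, `mi_a` and `mi_b` are identical. For the first column, which duplicates a balanced three-class label, the estimate equals the label entropy ln 3 in nats. Leaving `random_state` unset lets the estimate for the continuous column vary slightly between runs.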
Example
Return balance (mutual information) of factors with class_labels
>>> metadata = generate_random_metadata(
...     labels=["doctor", "artist", "teacher"],
...     factors={
...         "age": [25, 30, 35, 45],
...         "income": [50000, 65000, 80000],
...         "gender": ["M", "F"]},
...     length=100,
...     random_seed=175)
>>> bal = balance(
...     class_labels=metadata.class_labels,
...     factor_data=metadata.binned_data,
...     discrete_features=[True, True, True])
>>> bal
array([[1.017, 0.034, 0.   , 0.028],
       [0.034, 1.   , 0.015, 0.038],
       [0.   , 0.015, 1.   , 0.008],
       [0.028, 0.038, 0.008, 1.   ]])
Return intra/interfactor balance (mutual information)
See also
sklearn.feature_selection.mutual_info_classif, sklearn.feature_selection.mutual_info_regression, sklearn.metrics.mutual_info_score