dataeval.core.mutual_info_classwise

dataeval.core.mutual_info_classwise(class_labels, factor_data, discrete_features=None, num_neighbors=5)

Mutual information (MI) between each individual class label and the factors (class label, metadata, label/image properties), transformed to lie in [0, 1].

Parameters:
class_labels : Array1D[int]

Target class labels as integer indices. Can be a 1D list or array-like object.

factor_data : Array2D[int]

Factor values after binning or digitization. Can be a 2D list or array-like object.

discrete_features : Array1D[bool] | None = None

Boolean array or iterable indicating, for each factor, whether it is discrete. Can be a 1D list or array-like object.

num_neighbors : int = 5

Number of nearest neighbors to use when estimating MI for continuous factors.
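The following is a minimal sketch of how a per-factor boolean mask and a neighbor count behave in sklearn's mutual_info_classif, which the Notes name as the underlying estimator. It assumes that discrete_features and num_neighbors correspond to sklearn's discrete_features and n_neighbors arguments; that mapping is an assumption and is not confirmed on this page. The data and variable names are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
n = 300
target = rng.integers(0, 2, size=n)                 # binary target labels
binned = rng.integers(0, 4, size=n)                 # discrete (binned) factor
continuous = target + rng.normal(0.0, 1.0, size=n)  # continuous factor related to target
features = np.column_stack([binned, continuous])

# Boolean mask: True marks a discrete column, False a continuous one.
mask = np.array([True, False])

# n_neighbors only affects the nearest-neighbor estimate used for continuous columns.
mi_a = mutual_info_classif(
    features, target, discrete_features=mask, n_neighbors=5, random_state=42
)
# Repeating the call with the same random_state reproduces the estimate.
mi_b = mutual_info_classif(
    features, target, discrete_features=mask, n_neighbors=5, random_state=42
)
print(mi_a)
```

Fixing random_state makes the sklearn call repeatable; whether mutual_info_classwise exposes or fixes a seed is not stated here.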

Returns:

num_classes x (num_factors+1) estimate of mutual information between each individual class label and the class label plus num_factors metadata factors.

Return type:

NDArray[np.float64]

Notes

We use mutual_info_classif from sklearn because the class label is categorical. Its outputs depend on a random seed and are consistent across runs only up to O(1e-4). MI is computed differently for categorical and continuous variables. The returned values are a transformation of MI onto the interval [0, 1], not raw MI.
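As an illustration of the classwise idea, the sketch below computes MI between a one-vs-rest indicator for each class and each discrete factor, then rescales it into [0, 1]. The helper names entropy_nats and classwise_mi_sketch are hypothetical, and the normalization (dividing by the mean of the two entropies) is an assumption chosen for the example; this page does not specify which transformation the library applies, so the numbers need not match mutual_info_classwise.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif


def entropy_nats(values: np.ndarray) -> float:
    """Plug-in Shannon entropy (in nats) of a discrete 1D array."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())


def classwise_mi_sketch(class_labels, factor_data, num_neighbors=5, seed=0):
    """One-vs-rest normalized MI (hypothetical helper, not the DataEval code)."""
    y = np.asarray(class_labels)
    x = np.asarray(factor_data)
    classes = np.unique(y)
    factor_entropy = np.array([entropy_nats(col) for col in x.T])
    out = np.zeros((classes.size, x.shape[1]))
    for i, cls in enumerate(classes):
        indicator = (y == cls).astype(int)  # 1 for this class, 0 for the rest
        mi = mutual_info_classif(
            x,
            indicator,
            discrete_features=True,     # all factors are treated as binned
            n_neighbors=num_neighbors,  # only matters for continuous columns
            random_state=seed,
        )
        # Assumed normalization: mean of indicator entropy and factor entropy.
        norm = 0.5 * (entropy_nats(indicator) + factor_entropy) + 1e-6
        out[i] = mi / norm
    return out


rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=200)
# Column 0 duplicates the labels; column 1 is unrelated noise.
factors = np.column_stack([labels, rng.integers(0, 4, size=200)])
result = classwise_mi_sketch(labels, factors)
print(result.shape)  # one row per class, one column per factor
```

In this toy setup the column that duplicates the class label scores much higher than the unrelated column for every class, which is the pattern a classwise MI table is meant to expose.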

Example

Generate random class labels and binned factor data

>>> class_labels, binned_data = generate_random_class_labels_and_binned_data(
...     labels=["doctor", "artist", "teacher"],
...     factors={"age": [25, 30, 35, 45], "income": [50000, 65000, 80000], "gender": ["M", "F"]},
...     length=100,
...     random_seed=175,
... )

Return classwise balance (mutual information) of factors with individual class_labels

>>> mutual_info_classwise(class_labels=class_labels, factor_data=binned_data)
array([[0.748, 0.164, 0.096, 0.466],
       [0.692, 0.301, 0.045, 0.25 ],
       [0.708, 0.137, 0.018, 0.16 ]])

See also

sklearn.feature_selection.mutual_info_classif, sklearn.feature_selection.mutual_info_regression, sklearn.metrics.mutual_info_score