dataeval.core.mutual_info

dataeval.core.mutual_info(class_labels, factor_data, discrete_features=None, num_neighbors=5)

Compute mutual information between factors (class labels, metadata, label/image properties), transformed to lie in [0, 1].

Parameters:

class_labels : Array1D[int]: Target class labels as integer indices. Can be a 1D list or array-like object.
factor_data : Array2D[int | float]: Factor values after binning or digitization. Can be a 2D list or array-like object.
discrete_features : Array1D[bool] | None = None: Boolean array indicating which factors are discrete (as opposed to continuous). Can be a 1D list or array-like object.
num_neighbors : int = 5: Number of nearest neighbors used when estimating mutual information for continuous variables.

Returns:

TypedDict containing:

class_to_factor: NDArray[np.float64] - 1D array of MI between class labels and each factor
interfactor: NDArray[np.float64] - (num_factors) x (num_factors) matrix of MI between factors only

Return type:

MutualInfoResult

Notes

mutual_info_classif from scikit-learn is used because the class label is categorical. Its outputs depend on a random seed and are consistent only up to O(1e-4). MI is computed differently for categorical and continuous variables. Because the entropy of a continuous distribution has no upper bound, normalizing MI by entropy is problematic for continuous variables. Mutual information is therefore transformed into a balance metric using the Linfoot transformation [1].
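The following is a minimal sketch of the Linfoot transformation, assuming the standard form of Linfoot's informational coefficient, r = sqrt(1 - exp(-2 I)), with I measured in nats. The helper names linfoot and gaussian_mi are hypothetical and chosen for illustration only; this is not dataeval's internal code, and whether the library applies exactly this form is an assumption. The sketch uses the closed-form MI of a bivariate Gaussian, for which the transformation recovers the absolute correlation.

```python
import math


def linfoot(mi_nats: float) -> float:
    """Map a mutual information value (in nats) onto [0, 1].

    Hypothetical helper, assuming r = sqrt(1 - exp(-2 * I)).
    """
    # Estimators can return tiny negative values; clamp them to zero.
    mi_nats = max(mi_nats, 0.0)
    return math.sqrt(1.0 - math.exp(-2.0 * mi_nats))


def gaussian_mi(rho: float) -> float:
    """Closed-form MI (nats) of a bivariate Gaussian with correlation rho."""
    return -0.5 * math.log(1.0 - rho**2)


# For a bivariate Gaussian, the transformed MI equals |rho| up to rounding.
for rho in (0.0, 0.3, 0.6, 0.9):
    print(rho, round(linfoot(gaussian_mi(rho)), 6))
```

Unlike normalization by entropy, this mapping needs no upper bound on MI: zero MI maps to 0, and the value approaches 1 as MI grows without bound.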

References

[1] Linfoot, E.H. (1957). “An Informational Measure of Correlation.” Information and Control, 1(1), 85-89.

Example

Return the balance (mutual information) of factors with class_labels:

>>> import numpy as np
>>> from dataeval.core import mutual_info
>>> rng = np.random.default_rng(175)
>>> class_labels = rng.choice([0, 1, 2], size=100)
>>> factor_data = np.column_stack(
...     [
...         rng.choice([25, 35, 45, 55], size=100),  # age
...         rng.choice([50000, 65000, 80000], size=100),  # income
...         rng.choice([0, 1], size=100),  # gender
...     ]
... )
>>> result = mutual_info(class_labels=class_labels, factor_data=factor_data)
>>> result["class_to_factor"]
array([1.   , 0.034, 0.026, 0.004])
>>> result["interfactor"]
array([[1.   , 0.017, 0.056],
       [0.017, 1.   , 0.01 ],
       [0.056, 0.01 , 1.   ]])