dataeval.core.factor_predictors

dataeval.core.factor_predictors(factors, indices, discrete_features=None)

Compute a measure of mutual information between metadata factors and flagged sample indices.

Given per-sample metadata factors and the indices of flagged samples, this function calculates the mutual information (MI) between each factor and the flagged status. In other words, it identifies which metadata factors are most strongly associated with a sample being flagged (e.g., as an outlier, an OOD sample, or another anomaly). The maximum possible MI equals the entropy of the flagged status, so the MI is normalized by that entropy to yield a measure of association on a scale from 0 to 1.
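The normalization described above can be illustrated with a minimal sketch. It assumes a single discrete factor and uses a simple plug-in (counting) estimate of entropy; the estimator actually used by `factor_predictors` is not stated here and may differ, especially for continuous factors. The helper names `entropy` and `normalized_mi` are hypothetical and not part of dataeval.

```python
import numpy as np


def entropy(labels: np.ndarray) -> float:
    """Plug-in Shannon entropy (in nats) of a 1-D array of discrete labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))


def normalized_mi(factor: np.ndarray, flagged: np.ndarray) -> float:
    """Hypothetical helper: MI(factor; flag) divided by H(flag), in [0, 1]."""
    flagged = np.asarray(flagged, dtype=bool)
    h_flag = entropy(flagged)
    if h_flag == 0.0:
        # No samples (or all samples) flagged: nothing to explain.
        return 0.0
    # Encode each (factor value, flag) pair as a single integer label.
    _, codes = np.unique(np.asarray(factor), return_inverse=True)
    joint = codes.ravel() * 2 + flagged.astype(int)
    # I(X; Y) = H(X) + H(Y) - H(X, Y)
    mi = entropy(codes.ravel()) + h_flag - entropy(joint)
    return float(np.clip(mi / h_flag, 0.0, 1.0))
```

In this sketch, a factor that fully determines the flagged status scores 1, and a factor that is statistically independent of it scores 0.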

Parameters:

factors : dict[str, NDArray]: A dictionary mapping factor names to arrays of values. All arrays must have the same length.
    - Keys: factor names (str)
    - Values: arrays of shape (n_samples,) or (n_samples, n_features_per_factor)
indices : SequenceLike[int]: Sequence of indices of the samples flagged for analysis. Each index must be a valid position in the factor arrays, i.e. less than the number of samples.
discrete_features : list[bool] | None: List indicating whether each factor is discrete (True) or continuous (False). Length must match the number of factors. If None, all factors are treated as continuous.
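The "flagged status" that each factor is compared against can be thought of as a boolean vector derived from `indices`. The following is a minimal sketch of that conversion, not the library's own code; the helper name `flagged_mask` is hypothetical, and rejecting negative indices is an assumption made here for simplicity.

```python
import numpy as np


def flagged_mask(indices, n_samples: int) -> np.ndarray:
    """Hypothetical helper: boolean vector that is True at each flagged index."""
    idx = np.asarray(indices, dtype=int)
    # Assumption: indices are 0-based and negative values are not allowed.
    if idx.size and (idx.min() < 0 or idx.max() >= n_samples):
        raise IndexError("indices must lie in the range [0, n_samples)")
    mask = np.zeros(n_samples, dtype=bool)
    mask[idx] = True
    return mask


# Five samples with the last three flagged.
mask = flagged_mask([2, 3, 4], n_samples=5)
```

An empty `indices` sequence yields an all-False vector, whose entropy is zero; this is consistent with the documented behavior of returning 0.0 for every factor in that case.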

Returns:

A mapping whose keys are the factor names and whose values give the strength of association between each named factor and the flagged status, expressed as normalized mutual information. If no indices are provided, returns a dict with a value of 0.0 for every factor.

Return type:

Mapping[str, float]

Notes

High mutual information between a factor and the flagged samples indicates association, not causation. Additional analysis is needed to determine how to handle factors with high mutual information. Note also that “high” is always relative to the information, or entropy, represented by the flagged indices, which is why that entropy is used for normalization.
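To see why "high" is relative, it helps to look at how the entropy of the flagged status, and therefore the upper bound on raw MI, changes with the fraction of samples flagged. The sketch below is illustrative only; `binary_entropy` is a hypothetical helper, and entropy is measured in nats (natural logarithm) as an assumption.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in nats of a binary flag that is True with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


# The upper bound on raw MI shrinks as flagged samples become rarer.
for n_flagged, n_samples in [(1, 1000), (100, 1000), (500, 1000)]:
    bound = binary_entropy(n_flagged / n_samples)
    print(f"{n_flagged}/{n_samples} flagged: raw MI is at most {bound:.4f} nats")
```

When half of the samples are flagged, the bound is ln 2 (about 0.693 nats); when only a handful are flagged, it is far smaller, so an unnormalized MI value that looks tiny may still account for most of the available information.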

Examples

>>> import numpy as np
>>> from dataeval.core import factor_predictors
>>> factors = {
...     "time": np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
...     "altitude": np.array([100, 110, 105, 108, 112]),
... }
>>> indices = [2, 3, 4]  # Flag last three samples
>>> factor_predictors(factors, indices)
{'time': 0.866750699769533, 'altitude': 0.0}