dataeval.core.factor_predictors

dataeval.core.factor_predictors(factors, indices, discrete_features=None)

Computes mutual information between metadata factors and flagged sample indices.

Given per-sample metadata factors and the indices of flagged samples, this function calculates the mutual information between each factor and the flagged status. In other words, it identifies which metadata factors are most strongly associated with a sample being flagged (e.g., as an outlier, an OOD sample, or another anomaly).
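To make the quantity concrete, the following is a minimal sketch of mutual information in bits between one discrete factor and a binary flagged/not-flagged label. It is a plain plug-in (count-based) estimator written for illustration only, not the estimator used by this function, and it does not handle continuous factors. The helper names `flag_mask` and `mutual_information_bits` and the sample data are hypothetical.

```python
import math
from collections import Counter


def flag_mask(n_samples, indices):
    """Return a 0/1 label per sample: 1 if the sample index is flagged."""
    flagged = set(indices)
    return [1 if i in flagged else 0 for i in range(n_samples)]


def mutual_information_bits(values, labels):
    """Plug-in estimate of I(values; labels) in bits for discrete inputs."""
    n = len(values)
    joint = Counter(zip(values, labels))
    count_x = Counter(values)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical data: the "night" samples are exactly the flagged ones.
lighting = ["day", "day", "night", "night"]
weekday = ["mon", "tue", "mon", "tue"]
labels = flag_mask(len(lighting), [2, 3])

mi_lighting = mutual_information_bits(lighting, labels)  # factor determines the flag
mi_weekday = mutual_information_bits(weekday, labels)    # factor independent of the flag
```

In this toy case `lighting` fully determines the flag and carries 1 bit, the maximum for a balanced binary label, while `weekday` is independent of the flag and carries 0 bits.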

Parameters:
factors : dict[str, NDArray]

A dictionary mapping factor names to arrays of values. All arrays must have the same length.

- Keys: factor names (str)
- Values: arrays of shape (n_samples,) or (n_samples, n_features_per_factor)

indices : SequenceLike[int]

Sequence of indices of the samples flagged for analysis. Indices are zero-based, so each index must be less than the number of samples in the factor arrays.

discrete_features : list[bool] | None

List indicating whether each factor is discrete (True) or continuous (False). Length must match the number of factors. If None, all factors are treated as continuous.
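The constraints on the three parameters can be checked before calling the function. The sketch below is one way to do that with plain Python; `check_inputs` is a hypothetical helper, not part of the library. It assumes zero-based, non-negative indices and that `discrete_features` follows the iteration order of the `factors` dictionary.

```python
def check_inputs(factors, indices, discrete_features=None):
    """Validate the documented constraints and return the number of samples."""
    lengths = {len(values) for values in factors.values()}
    if len(lengths) > 1:
        raise ValueError("all factor arrays must have the same length")
    n_samples = lengths.pop() if lengths else 0

    # Assumption: indices are zero-based and non-negative.
    out_of_range = [i for i in indices if not 0 <= i < n_samples]
    if out_of_range:
        raise ValueError(f"indices out of range: {out_of_range}")

    if discrete_features is not None and len(discrete_features) != len(factors):
        raise ValueError("discrete_features must have one entry per factor")
    return n_samples


# Hypothetical factors: one continuous, one discrete.
factors = {
    "time": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sensor_id": [0, 1, 0, 1, 1],
}
discrete_names = {"sensor_id"}
# Build the flags in the same order as the dictionary keys (assumed ordering).
discrete_features = [name in discrete_names for name in factors]

n_samples = check_inputs(factors, [2, 3, 4], discrete_features)
```

Deriving `discrete_features` from the dictionary keys keeps the flags aligned with the factors if a factor is later added or reordered.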

Returns:

A mapping from each factor name to the strength of association between that factor and the flagged status, expressed as mutual information in bits. If no indices are provided, every factor maps to 0.0.

Return type:

Mapping[str, float]
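Because the values are reported in bits, they are not directly comparable with estimators that report mutual information in nats (natural logarithm). The sketch below shows the unit conversion and the documented all-zero result for empty indices. Both helper names are hypothetical and illustrate the described behavior rather than the library's implementation.

```python
import math


def nats_to_bits(mi_nats):
    """Convert mutual information from nats to bits (1 bit = ln 2 nats)."""
    return mi_nats / math.log(2)


def empty_result(factor_names):
    """Shape of the result when no indices are flagged: 0.0 for every factor."""
    return {name: 0.0 for name in factor_names}


one_bit = nats_to_bits(math.log(2))
no_flags = empty_result(["time", "altitude"])
```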

Notes

High mutual information between a factor and the flagged samples indicates correlation, not causation. Further analysis is needed to decide how to handle factors with high mutual information.

Examples

>>> import numpy as np
>>> from dataeval.core import factor_predictors
>>> factors = {
...     "time": np.array([1.0, 2.0, 3.0, 4.0, 5.0]),
...     "altitude": np.array([100, 110, 105, 108, 112]),
... }
>>> indices = [2, 3, 4]  # Flag last three samples
>>> factor_predictors(factors, indices)
{'time': 0.8415720833333329, 'altitude': 0.0}
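As a possible next step, the returned mapping can be ranked to pick out the factors worth a closer look. This sketch reuses the output shown above as a literal dictionary. The `threshold` value is an arbitrary, hypothetical cutoff chosen for illustration and is not a recommendation from the library.

```python
# Output copied from the example above.
result = {"time": 0.8415720833333329, "altitude": 0.0}

# Sort factors from most to least informative about the flagged status.
ranked = sorted(result.items(), key=lambda item: item[1], reverse=True)

# Hypothetical cutoff in bits; tune it for the dataset at hand.
threshold = 0.1
candidates = [name for name, mi in ranked if mi > threshold]
```

Here only `time` passes the cutoff, which marks it as the factor to investigate first.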