dataeval.metadata.find_ood_predictors

dataeval.metadata.find_ood_predictors(metadata, ood)

Computes the mutual information between a set of metadata features and per-sample out-of-distribution (OOD) flags.

Given per-sample metadata features and a corresponding OODOutput that flags which samples were determined to be out of distribution, this function calculates the mutual information between each factor and the OOD flag. In other words, it identifies the metadata factors most strongly associated with a sample being out of distribution.

Note

High mutual information between a factor and OOD samples indicates correlation, not causation. Further analysis is needed to decide how to handle factors with high mutual information.
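To make the quantity concrete, the following is a minimal sketch of mutual information in bits between one discrete feature and a boolean OOD flag. It is not DataEval's implementation, which may use a different estimator (for example, one that handles continuous features). The helper name `mutual_information_bits` and the sample arrays are hypothetical.

```python
import numpy as np


def mutual_information_bits(feature, is_ood):
    """Mutual information (in bits) between a discrete feature and a boolean flag.

    Illustrative helper only; it is not part of DataEval.
    """
    feature = np.asarray(feature)
    is_ood = np.asarray(is_ood, dtype=bool)
    mi = 0.0
    for value in np.unique(feature):
        p_value = np.mean(feature == value)
        for flag in (False, True):
            p_flag = np.mean(is_ood == flag)
            p_joint = np.mean((feature == value) & (is_ood == flag))
            if p_joint == 0.0:
                continue  # empty cells contribute nothing to the sum
            mi += p_joint * np.log2(p_joint / (p_value * p_flag))
    return float(mi)


# Hypothetical data: four samples, two of them flagged as OOD.
flags = np.array([False, False, True, True])

# This feature determines the flag exactly, so it carries one full bit.
time_of_day = np.array(["day", "day", "night", "night"])
perfect = mutual_information_bits(time_of_day, flags)

# This feature is independent of the flag, so it carries no information.
altitude_band = np.array(["low", "high", "low", "high"])
independent = mutual_information_bits(altitude_band, flags)
```

Here `perfect` evaluates to 1 bit and `independent` to 0 bits, the two extremes for a balanced binary flag.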

Parameters:
metadata : Metadata

A set of arrays of values, indexed by metadata feature names, with one value per data example per feature.

ood : OODOutput

The output of one of DataEval’s OOD functions, indicating which examples are OOD.

Returns:

A dictionary with keys corresponding to metadata feature names, and values indicating the strength of association between each named feature and the OOD flag, as mutual information measured in bits.
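One way to read such a result is to rank features by score and drop values that are zero up to floating-point error. The sketch below uses a plain dict with hypothetical scores as a stand-in for the returned mapping, and the tolerance is an assumption, not a DataEval setting.

```python
# Hypothetical scores in bits; a plain dict stands in for the returned mapping.
scores = {"time": 0.42, "altitude": 1e-16, "weather": 0.07}

# Assumed cutoff: anything at or below this is treated as numerical noise.
TOLERANCE = 1e-9

# Keep features with a non-negligible score, strongest association first.
ranked = sorted(
    ((name, bits) for name, bits in scores.items() if bits > TOLERANCE),
    key=lambda item: item[1],
    reverse=True,
)

for name, bits in ranked:
    print(f"{name}: {bits:.2f} bits")
```

A ranking like this only orders candidates for follow-up analysis; it does not show that any feature causes samples to be OOD.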

Return type:

OODPredictorOutput

Examples

>>> import numpy as np
>>> from dataeval.metadata import find_ood_predictors
>>> from dataeval.outputs import OODOutput

All samples are out-of-distribution

>>> is_ood = OODOutput(np.array([True, True, True]), np.array([]), np.array([]))
>>> find_ood_predictors(metadata1, is_ood)
OODPredictorOutput({'time': 8.008566032557951e-17, 'altitude': 8.008566032557951e-17})

No out-of-distribution samples

>>> is_ood = OODOutput(np.array([False, False, False]), np.array([]), np.array([]))
>>> find_ood_predictors(metadata1, is_ood)
OODPredictorOutput({})