dataeval.metadata.find_most_deviated_factors

dataeval.metadata.find_most_deviated_factors(metadata_ref, metadata_tst, ood)

Determine which metadata factor deviates most from the reference for each out-of-distribution (OOD) sample in the test metadata.

Parameters:
metadata_ref : Metadata

A reference set of Metadata containing factor names and samples with discrete and/or continuous values per factor.

metadata_tst : Metadata

The set of Metadata that is tested against the reference metadata. This set must have the same number of features but does not require the same number of samples.

ood : OODOutput

The output of one of DataEval’s OOD functions, indicating which examples are OOD.

Returns:

An output class containing, for each OOD example in the test metadata, the name of the most deviated factor and the size of its deviation.

Return type:

MostDeviatedFactorsOutput
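
To illustrate the shape of the result (one factor name and deviation per OOD sample), here is a minimal NumPy sketch. It is not DataEval's implementation: the helper `most_deviated_sketch`, the plain-array inputs, and the choice of deviation measure (absolute distance from the reference median, scaled by the reference median absolute deviation) are all assumptions made for illustration.

```python
import numpy as np


def most_deviated_sketch(ref, tst, is_ood, factor_names):
    """Hypothetical helper: (factor name, deviation) for each OOD row of tst."""
    ref = np.asarray(ref, dtype=float)
    tst = np.asarray(tst, dtype=float)
    mask = np.asarray(is_ood, dtype=bool)
    if not mask.any():
        return []  # no OOD samples, so nothing to report

    # Assumed deviation measure: distance from the reference median,
    # scaled per factor by the reference median absolute deviation.
    center = np.median(ref, axis=0)
    scale = np.median(np.abs(ref - center), axis=0)
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero

    deviations = np.abs(tst[mask] - center) / scale
    top = np.argmax(deviations, axis=1)  # most deviated factor per OOD row
    return [(factor_names[j], float(deviations[i, j])) for i, j in enumerate(top)]


ref = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
tst = np.array([[2.0, 20.0], [8.0, 25.0]])
result = most_deviated_sketch(ref, tst, [False, True], ["time", "altitude"])
print(result)
```

Only the second test row is flagged as OOD, so the result holds a single pair. The factor names and values here are made up for the sketch.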

Notes

  1. Both Metadata inputs must have discrete and continuous data in the shape (samples, factors) and have equivalent factor names and lengths.

  2. The flag at index i in OODOutput.is_ood must indicate whether sample i of metadata_tst is out-of-distribution with respect to metadata_ref.
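
The two conditions above can be checked before calling the function. The sketch below is a hypothetical pre-flight check, not part of DataEval: `check_alignment` and its arguments are illustrative names, and it assumes that "equivalent factor names" means the same names in the same order.

```python
import numpy as np


def check_alignment(ref_names, tst_names, n_tst_samples, is_ood):
    """Hypothetical check: validate inputs and return indices of OOD test samples."""
    # Assumption: factor names must match exactly, including order.
    if list(ref_names) != list(tst_names):
        raise ValueError("reference and test metadata must share factor names")

    flags = np.asarray(is_ood, dtype=bool)
    # One flag per test sample, so flag i describes test sample i.
    if flags.shape != (n_tst_samples,):
        raise ValueError("is_ood must contain exactly one flag per test sample")

    return np.flatnonzero(flags)


ood_rows = check_alignment(
    ["time", "altitude"], ["time", "altitude"], 3, [True, False, True]
)
print(ood_rows)
```

The returned indices are the rows of the test metadata for which a most deviated factor would be reported.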

Examples

>>> import numpy as np
>>> from dataeval.detectors.ood import OODOutput

All samples are out-of-distribution

>>> is_ood = OODOutput(np.array([True, True, True]), np.array([]), np.array([]))
>>> find_most_deviated_factors(metadata1, metadata2, is_ood)
MostDeviatedFactorsOutput([('time', 2.0), ('time', 2.592), ('time', 3.51)])

No samples are out-of-distribution

>>> is_ood = OODOutput(np.array([False, False, False]), np.array([]), np.array([]))
>>> find_most_deviated_factors(metadata1, metadata2, is_ood)
MostDeviatedFactorsOutput([])