dataeval.metadata.most_deviated_factors

dataeval.metadata.most_deviated_factors(metadata_1, metadata_2, ood)

Determines the metadata factor with the greatest deviation for each out-of-distribution sample in metadata_2.

Parameters:
metadata_1 : Metadata

A reference set of Metadata containing factor names and samples with discrete and/or continuous values per factor.

metadata_2 : Metadata

The set of Metadata that is tested against the reference metadata. This set must have the same number of features but does not require the same number of samples.

ood : OODOutput

The output of one of DataEval’s OOD functions, indicating which examples are OOD.

Returns:

A list of (factor name, deviation) tuples, one for each OOD example in metadata_2, giving the factor with the highest deviation and the value of that deviation.

Return type:

list[tuple[str, float]]

Notes

  1. Both Metadata inputs must have discrete and continuous data in the shape (samples, factors) and have equivalent factor names and lengths.

  2. The flag at index i in OODOutput.is_ood must correspond directly to sample i of metadata_2 being out-of-distribution from metadata_1.
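Because of the index correspondence in note 2, the returned tuples can be paired back with sample positions in metadata_2. The following is a minimal sketch, not part of the library: the flags and results are hypothetical values shaped like the example on this page, and it assumes the returned list keeps the order of the flagged samples.

```python
# Hypothetical flags, standing in for OODOutput.is_ood (assumption for illustration).
is_ood_flags = [True, False, True, True]

# Hypothetical return value: one (factor, deviation) tuple per OOD sample.
deviations = [("time", 2.0), ("time", 2.592), ("time", 3.51)]

# Positions of the samples in metadata_2 that were flagged as OOD.
ood_indices = [i for i, flag in enumerate(is_ood_flags) if flag]

# Assumes the returned list follows the order of the flagged samples.
by_sample = dict(zip(ood_indices, deviations))

for index, (factor, deviation) in by_sample.items():
    print(f"sample {index}: {factor} deviates by {deviation}")
```

The same comprehension works on a NumPy boolean array, since iterating over it yields truthy or falsy elements.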

Examples

>>> import numpy as np
>>> from dataeval.detectors.ood import OODOutput
>>> from dataeval.metadata import most_deviated_factors

All samples are out-of-distribution

>>> is_ood = OODOutput(np.array([True, True, True]), np.array([]), np.array([]))
>>> most_deviated_factors(metadata1, metadata2, is_ood)
[('time', 2.0), ('time', 2.592), ('time', 3.51)]

If there are no out-of-distribution samples, an empty list is returned

>>> is_ood = OODOutput(np.array([False, False, False]), np.array([]), np.array([]))
>>> most_deviated_factors(metadata1, metadata2, is_ood)
[]
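Since the return value is a plain list of (factor name, deviation) tuples, it can be summarized with the standard library. The sketch below is illustrative only: it reuses the values from the first example above as a literal, and the variable names are hypothetical.

```python
from collections import Counter

# Values copied from the first example above, used here as a stand-in result.
results = [("time", 2.0), ("time", 2.592), ("time", 3.51)]

# How often each factor is the most deviated one across the OOD samples.
factor_counts = Counter(name for name, _ in results)

# The single largest deviation, or None when there are no OOD samples.
largest = max(results, key=lambda pair: pair[1]) if results else None

print(factor_counts.most_common())
print(largest)
```

The guard on an empty list matters because max raises ValueError on an empty sequence, which is the case when no samples are out-of-distribution.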