Divergence

dataeval.metrics.divergence(data_a: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], data_b: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], method: Literal['FNN', 'MST'] = 'FNN') → DivergenceOutput

Calculates the divergence between two datasets, along with the number of errors (differing edges) between them.

Parameters:
  • data_a (ArrayLike, shape - (N, P)) – A dataset in an ArrayLike format to compare. The function expects the data to have 2 dimensions: N observations in a P-dimensional space.

  • data_b (ArrayLike, shape - (N, P)) – A dataset in an ArrayLike format to compare. The function expects the data to have 2 dimensions: N observations in a P-dimensional space.

  • method (Literal["MST", "FNN"], default "FNN") – Method used to estimate dataset divergence.

Returns:

The divergence value (in the range 0.0 to 1.0) and the number of differing edges between the datasets.

Return type:

DivergenceOutput

Notes

The divergence value indicates how similar the two datasets are, with 0 indicating approximately identical data distributions.
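
To make the relationship between the error count and the divergence value concrete, here is a minimal, dependency-free sketch of a nearest-neighbor ("FNN"-style) estimate. It is an illustration under assumptions, not the library's implementation: the name `fnn_divergence` is hypothetical, the scaling `1 - errors * (N + M) / (2 * N * M)` follows the estimator described in the reference listed on this page, and the clamp at 0 is assumed so that the result stays in the documented range.

```python
import math


def fnn_divergence(data_a, data_b):
    """Hypothetical sketch: nearest-neighbor divergence estimate.

    data_a, data_b: sequences of equal-length coordinate sequences.
    Returns (divergence, errors).
    """
    points = [tuple(p) for p in data_a] + [tuple(p) for p in data_b]
    labels = [0] * len(data_a) + [1] * len(data_b)
    n, m = len(data_a), len(data_b)

    errors = 0
    for i, p in enumerate(points):
        # Brute-force nearest neighbor in the pooled data, excluding the point itself.
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        # An "error" is a point whose nearest neighbor comes from the other dataset.
        if labels[nearest] != labels[i]:
            errors += 1

    # Assumed scaling and clamp: many cross-dataset neighbors push the value toward 0.
    value = max(0.0, 1.0 - errors * (n + m) / (2.0 * n * m))
    return value, errors
```

With two well-separated clusters no point has a neighbor in the other dataset, so the sketch reports no errors and a divergence of 1.0. With tightly interleaved points every nearest neighbor belongs to the other dataset and the value is clamped to 0.0.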

Warning

MST is very slow in this implementation, unlike in MATLAB, where the two methods have comparable speeds. Overall, MST takes ~25x longer. Sources of slowdown: conversion to and from CSR format accounts for ~10% of the extra time; the difference between the 1NN and SciPy MST functions accounts for the remaining 90%.
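
For context on what the MST method counts, the sketch below builds a minimum spanning tree over the pooled data and counts the edges that join points from different datasets. It is only an illustration: `mst_cross_edges` is a hypothetical name, and it uses a plain O(n²) Prim's algorithm in pure Python rather than the SciPy routine and CSR conversion mentioned above, so its timing says nothing about the library's.

```python
import math


def mst_cross_edges(data_a, data_b):
    """Hypothetical sketch: count MST edges that connect the two datasets."""
    points = [tuple(p) for p in data_a] + [tuple(p) for p in data_b]
    labels = [0] * len(data_a) + [1] * len(data_b)
    total = len(points)

    in_tree = [False] * total
    best_dist = [math.inf] * total  # cheapest known edge into the tree
    best_from = [-1] * total        # tree vertex that edge starts from
    best_dist[0] = 0.0              # grow the tree from the first point

    cross = 0
    for _ in range(total):
        # Prim's step: add the vertex with the cheapest connecting edge.
        u = min(
            (i for i in range(total) if not in_tree[i]),
            key=lambda i: best_dist[i],
        )
        in_tree[u] = True
        # Count the new edge if its endpoints come from different datasets.
        if best_from[u] != -1 and labels[u] != labels[best_from[u]]:
            cross += 1
        # Relax edges from the new vertex to every vertex still outside the tree.
        for v in range(total):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return cross
```

Two well-separated clusters are joined by a single tree edge, so the count is 1; points that alternate between datasets along a line produce a chain in which every edge crosses.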

References

For more information about this divergence, its formal definition, and its associated estimators see https://arxiv.org/abs/1412.6534.

Examples

Evaluate the datasets:

>>> divergence(datasetA, datasetB)
DivergenceOutput(divergence=0.28, errors=36.0)