dataeval.core.divergence_fnn

dataeval.core.divergence_fnn(emb_a, emb_b)

Calculates the divergence between two datasets by counting label disagreements between nearest neighbors, where each sample's label is the dataset it came from.

Parameters:
emb_a : ArrayLike, shape - (N, P)

Image embeddings in an ArrayLike format to compare. The function expects 2-dimensional data: N observations in a P-dimensional space.

emb_b : ArrayLike, shape - (N, P)

Image embeddings in an ArrayLike format to compare. The function expects 2-dimensional data: N observations in a P-dimensional space.
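
Both parameters must be 2-dimensional, so higher-rank inputs need to be reshaped before the call. The following is a minimal sketch of one way to do that, assuming NumPy is available. The helper name `as_embedding_matrix` is hypothetical and not part of dataeval, and flattening raw arrays is only a stand-in for embeddings that would normally come from a model.

```python
import numpy as np


def as_embedding_matrix(data):
    """Return `data` as a 2-D float array of shape (N, P).

    Hypothetical helper: keeps the first axis as the observation axis
    and flattens all remaining axes into a single feature axis.
    """
    arr = np.asarray(data, dtype=float)
    if arr.ndim < 2:
        raise ValueError(f"expected at least 2 dimensions, got shape {arr.shape}")
    return arr.reshape(arr.shape[0], -1)


# Eight 3x4x4 arrays become eight observations in a 48-dimensional space.
images = np.zeros((8, 3, 4, 4))
emb = as_embedding_matrix(images)
```

Input that is already 2-dimensional passes through with its shape unchanged.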

Returns:

Mapping with keys:

  • divergence: float - The divergence value between 0.0 and 1.0

  • errors: int - The number of label disagreements

Return type:

DivergenceResult
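
As a rough illustration of the nearest-neighbor counting described in the summary, the sketch below tags each sample with its source dataset, finds each sample's nearest other sample by brute force, and counts the pairs that come from different datasets. The helper names `fnn_errors` and `divergence_from_errors` are hypothetical. Both the use of dataset membership as the label and the normalization that maps the count into [0, 1] are assumptions chosen to be consistent with the example outputs on this page; the library's actual neighbor search and formula may differ.

```python
import math


def fnn_errors(emb_a, emb_b):
    """Count samples whose nearest neighbor belongs to the other dataset.

    Assumption: the "label" of a sample is the dataset it came from
    (0 for emb_a, 1 for emb_b). Brute-force search, O(n^2) distances.
    """
    points = [(tuple(p), 0) for p in emb_a] + [(tuple(p), 1) for p in emb_b]
    errors = 0
    for i, (point, label) in enumerate(points):
        # Nearest neighbor among all other samples, excluding the sample itself.
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(point, points[j][0]),
        )
        if points[nearest][1] != label:
            errors += 1
    return errors


def divergence_from_errors(errors, n_a, n_b):
    """Map an error count to [0, 1] (assumed normalization, clamped at 0)."""
    return max(0.0, 1.0 - errors * (n_a + n_b) / (2.0 * n_a * n_b))


# Two well-separated groups: no sample has a neighbor from the other dataset.
far_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
far_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
far_errors = fnn_errors(far_a, far_b)
far_divergence = divergence_from_errors(far_errors, len(far_a), len(far_b))

# Interleaved groups: every sample's nearest neighbor is from the other dataset.
near_a = [(0.0, 0.0), (10.0, 0.0)]
near_b = [(0.0, 0.1), (10.0, 0.1)]
near_errors = fnn_errors(near_a, near_b)
near_divergence = divergence_from_errors(near_errors, len(near_a), len(near_b))
```

Under this assumed normalization, 54 errors across two 50-sample datasets give 1 - 54 * 100 / 5000 = -0.08, which is clamped to 0.0, and 0 errors give 1.0.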

Examples

Compute the divergence between two datasets (0 means no divergence, 1 means complete divergence).

>>> import numpy as np
>>> import sklearn.datasets as dsets
>>> from dataeval.core import divergence_fnn
>>> datasetA = dsets.make_blobs(
...     n_samples=50, centers=np.array([(-1, -1), (1, 1)]), cluster_std=0.3, random_state=712
... )[0]
>>> datasetB = dsets.make_blobs(
...     n_samples=50, centers=np.array([(-0.5, -0.5), (1, 1)]), cluster_std=0.3, random_state=712
... )[0] + 0.05
>>> datasetC = dsets.make_blobs(
...     n_samples=50, centers=np.array([(-0.5, 0.5), (1, -1)]), cluster_std=0.3, random_state=712
... )[0]

Overlapping datasets - divergence == 0:

>>> divergence_fnn(datasetA, datasetB)
{'divergence': 0.0, 'errors': 54}

Completely separated datasets - divergence == 1:

>>> divergence_fnn(datasetA, datasetC)
{'divergence': 1.0, 'errors': 0}