dataeval.core.divergence_mst

dataeval.core.divergence_mst(emb_a, emb_b)

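Calculate the divergence between two sets of embeddings by counting the “between dataset” edges in the minimum spanning tree, that is, the edges that connect a point from one dataset to a point from the other.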

Parameters:
emb_a : ArrayLike, shape - (N, P)

First set of image embeddings to compare, in an ArrayLike format. The data must be 2-dimensional: N observations in a P-dimensional space.

emb_b : ArrayLike, shape - (N, P)

Second set of image embeddings to compare, in an ArrayLike format. The data must be 2-dimensional: N observations in a P-dimensional space.

Returns:

Mapping with keys:

  • divergence: float - The divergence value between 0.0 and 1.0

  • errors: int - The number of cross-label edges, i.e. minimum spanning tree edges that connect a point from emb_a to a point from emb_b

Return type:

DivergenceResult
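
To make the edge-counting idea concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the library's implementation: `mst_divergence_sketch` is a hypothetical name, the tree is built with a plain O(n²) Prim's algorithm over Euclidean distances, and the mapping `max(0, 1 - 2 * errors / (N + M))` is an assumption chosen because it agrees with the example outputs on this page.

```python
import math


def mst_divergence_sketch(emb_a, emb_b):
    """Illustrative only: count cross-dataset MST edges and map the count to [0, 1]."""
    points = [tuple(map(float, p)) for p in emb_a] + [tuple(map(float, p)) for p in emb_b]
    # 0 marks rows from emb_a, 1 marks rows from emb_b
    labels = [0] * len(emb_a) + [1] * len(emb_b)
    n = len(points)

    # Prim's algorithm on the complete Euclidean graph, O(n^2) time
    in_tree = [False] * n
    best_dist = [math.inf] * n  # cheapest known link from the tree to each point
    best_from = [-1] * n        # tree vertex that provides that link
    best_dist[0] = 0.0
    errors = 0
    for _ in range(n):
        # Pick the closest point that is not yet in the tree
        u = min((i for i in range(n) if not in_tree[i]), key=best_dist.__getitem__)
        in_tree[u] = True
        # The edge (best_from[u], u) joins the MST; count it if it crosses datasets
        if best_from[u] >= 0 and labels[u] != labels[best_from[u]]:
            errors += 1
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v], best_from[v] = d, u

    # Assumed mapping (not taken from the library source)
    divergence = max(0.0, 1.0 - 2.0 * errors / n)
    return {"divergence": divergence, "errors": errors}
```

With two tight, far-apart clusters only one tree edge has to bridge them, so the count is small and the value is high; with interleaved points almost every edge crosses, and the value falls toward 0.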

Examples

Compute the divergence of two datasets (0 means no divergence, 1 means complete divergence).

>>> import numpy as np
>>> import sklearn.datasets as dsets
>>> from dataeval.core import divergence_mst
>>> datasetA = dsets.make_blobs(
...     n_samples=50, centers=np.array([(-1, -1), (1, 1)]), cluster_std=0.3, random_state=712
... )[0]
>>> datasetB = dsets.make_blobs(
...     n_samples=50, centers=np.array([(-0.5, -0.5), (1, 1)]), cluster_std=0.3, random_state=712
... )[0] + 0.05
>>> datasetC = dsets.make_blobs(
...     n_samples=50, centers=np.array([(-0.5, 0.5), (1, -1)]), cluster_std=0.3, random_state=712
... )[0]

Overlapping datasets, divergence close to 0 (most minimum spanning tree edges cross between the datasets):

>>> divergence_mst(datasetA, datasetB)
{'divergence': 0.040000000000000036, 'errors': 48}

Well-separated datasets, divergence close to 1 (very few minimum spanning tree edges cross between the datasets):

>>> divergence_mst(datasetA, datasetC)
{'divergence': 0.96, 'errors': 2}
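
The two results can look inverted at first glance: the overlapping pair reports 48 errors but a divergence near 0. The short check below reproduces both values under an assumed relationship, `max(0, 1 - 2 * errors / (N + M))`. That relationship is inferred from these outputs, not taken from the library source, and `divergence_from_errors` is a hypothetical helper.

```python
def divergence_from_errors(errors, n_total):
    """Map a cross-dataset edge count to [0, 1] (assumed relationship, see lead-in)."""
    return max(0.0, 1.0 - 2.0 * errors / n_total)


# Each example compares 50 + 50 = 100 samples
overlapping = divergence_from_errors(48, 100)  # datasetA vs datasetB
separated = divergence_from_errors(2, 100)     # datasetA vs datasetC
print(overlapping, separated)
```

Under this reading, a high error count means the two datasets are thoroughly mixed, so `errors` and `divergence` move in opposite directions.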