dataeval.core.divergence_mst
- dataeval.core.divergence_mst(emb_a, emb_b)
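Calculate the divergence between two datasets by counting the “between dataset” edges in the minimum spanning tree built over both sets of embeddings. A “between dataset” edge joins a point from emb_a to a point from emb_b.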
- Parameters:
- emb_a : ArrayLike, shape - (N, P)
First set of image embeddings to compare, in an ArrayLike format. The data must be 2-dimensional: N observations in a P-dimensional space.
- emb_b : ArrayLike, shape - (N, P)
Second set of image embeddings to compare, in an ArrayLike format. The data must be 2-dimensional: N observations in a P-dimensional space.
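- Returns:
Mapping with keys:
divergence: float - The divergence value, between 0.0 (no divergence) and 1.0 (complete divergence)
errors: int - The number of cross-label edges, i.e. minimum spanning tree edges that connect a point from emb_a to a point from emb_b
- Return type: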
Examples
Return the divergence of two datasets (0 = no divergence, 1 = complete divergence):
>>> import numpy as np
>>> import sklearn.datasets as dsets
>>> from dataeval.core import divergence_mst
>>> datasetA = dsets.make_blobs(
...     n_samples=50, centers=np.array([(-1, -1), (1, 1)]), cluster_std=0.3, random_state=712
... )[0]
>>> datasetB = (
...     dsets.make_blobs(n_samples=50, centers=np.array([(-0.5, -0.5), (1, 1)]), cluster_std=0.3, random_state=712)[0]
...     + 0.05
... )
>>> datasetC = dsets.make_blobs(
...     n_samples=50, centers=np.array([(-0.5, 0.5), (1, -1)]), cluster_std=0.3, random_state=712
... )[0]

Overlapping datasets - divergence close to 0:

>>> divergence_mst(datasetA, datasetB)
{'divergence': 0.040000000000000036, 'errors': 48}

Well-separated datasets - divergence close to 1:

>>> divergence_mst(datasetA, datasetC)
{'divergence': 0.96, 'errors': 2}
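For intuition, the edge counting behind these numbers can be sketched with NumPy and SciPy. This is a minimal illustration under stated assumptions, not the dataeval implementation: the helper names count_cross_edges and divergence_from_errors are hypothetical, Euclidean distance is assumed, and the formula that maps errors to divergence is an assumption chosen because it reproduces the values shown above (48 errors gives 0.04 and 2 errors gives 0.96 for two datasets of 50 points each).

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform


def count_cross_edges(emb_a, emb_b):
    """Count MST edges that join a row of emb_a to a row of emb_b (hypothetical helper)."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    stacked = np.vstack((a, b))
    # Label 0 marks rows that came from emb_a, label 1 marks rows from emb_b.
    labels = np.concatenate((np.zeros(len(a), dtype=int), np.ones(len(b), dtype=int)))
    # Dense pairwise Euclidean distances: only practical for small inputs.
    # Note: SciPy treats zero entries as missing edges, so exact duplicate
    # points would need special handling that this sketch omits.
    dists = squareform(pdist(stacked))
    mst = minimum_spanning_tree(dists).tocoo()
    # An edge is a "between dataset" edge when its endpoints carry different labels.
    return int(np.sum(labels[mst.row] != labels[mst.col]))


def divergence_from_errors(errors, n, m):
    """Assumed mapping from cross-edge count to a divergence in [0, 1]."""
    return max(0.0, 1.0 - ((n + m) / (2.0 * n * m)) * errors)


# Two clusters placed far apart: the tree needs only one edge to bridge them.
rng = np.random.default_rng(0)
far_a = rng.normal(0.0, 0.3, size=(50, 2))
far_b = rng.normal(0.0, 0.3, size=(50, 2)) + 100.0
errors = count_cross_edges(far_a, far_b)
print(errors, divergence_from_errors(errors, len(far_a), len(far_b)))
```

Under this assumed formula, more cross-dataset edges means the two sets are more intermixed, which drives the divergence toward 0. Fewer cross-dataset edges drives it toward 1.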