dataeval.core.completeness

dataeval.core.completeness(embeddings)

Measure the dimensional utilization of embeddings.

Completeness measures how effectively the data explores all available dimensions of its embedding space. This implementation uses a directional diversity approach based on eigenvalue entropy, which is more robust for high-dimensional data than traditional box-counting or neighbor-distance-based methods. Isotropy is a related measure: it scores directional diversity relative to the subspace actually spanned by the embeddings, rather than relative to the entire ambient space.
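To make the idea of eigenvalue entropy concrete, the following is a minimal sketch and not the library's implementation. It assumes the score is the exponential of the Shannon entropy of the normalized covariance eigenvalues (an "effective number of directions"), divided either by the ambient dimension or by the numerical rank. The function name `entropy_utilization_sketch`, the `tol` parameter, and both normalizations are hypothetical choices made for illustration.

```python
import numpy as np


def entropy_utilization_sketch(embeddings, tol=1e-10):
    """Illustrative only: NOT dataeval's formula.

    Returns (ambient_score, spanned_score), both in [0, 1].
    """
    x = np.asarray(embeddings, dtype=float)
    if x.ndim != 2:
        raise ValueError("embeddings must be 2D")
    if 0 in x.shape:
        raise ValueError("embeddings must not have a zero dimension")

    # Center the data; squared singular values are proportional to the
    # eigenvalues of the sample covariance matrix.
    x = x - x.mean(axis=0)
    eigvals = np.linalg.svd(x, compute_uv=False) ** 2
    total = eigvals.sum()
    if total == 0:
        return 0.0, 0.0  # all points identical: no direction is used

    # Shannon entropy of the normalized eigenvalue spectrum.
    p = eigvals / total
    nonzero = p[p > 0]
    entropy = -(nonzero * np.log(nonzero)).sum()
    effective_dims = np.exp(entropy)

    # Assumption: "ambient" normalizes by the full dimension, "spanned"
    # normalizes by the numerical rank of the centered data.
    rank = int((eigvals > tol * eigvals.max()).sum())
    return float(effective_dims / x.shape[1]), float(effective_dims / rank)
```

For isotropic Gaussian data confined to a 2D plane inside a 3D space, this sketch yields an ambient score of at most 2/3 and a spanned score near 1. That is the same qualitative relationship between completeness and isotropy described above, although the values returned by completeness itself may differ.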

Parameters:
embeddings : Array

Array of image embeddings, shape (n_samples, n_dimensions). Can be a 2D list, array-like object, or tensor.

Returns:

Mapping with keys:

  • completeness: float - Completeness score between 0 and 1

  • isotropy: float - Isotropy score between 0 and 1

  • nearest_neighbor_pairs: Sequence[tuple[int, int]] - Pairs of point indices and their nearest neighbors, sorted by decreasing distance

Return type:

CompletenessResult

Raises:
  • ValueError – If embeddings are not 2D

  • ValueError – If embeddings have a zero dimension
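
The nearest_neighbor_pairs entry can be pictured with a brute-force sketch. This is an assumption about the general shape of the result, not the library's algorithm: the helper name `nearest_neighbor_pairs_sketch` is hypothetical, Euclidean distance is assumed, and tie handling and any deduplication of pairs in completeness itself are not specified by this page.

```python
import numpy as np


def nearest_neighbor_pairs_sketch(embeddings):
    """Illustrative only: brute-force nearest neighbors, O(n^2) memory."""
    x = np.asarray(embeddings, dtype=float)

    # Pairwise Euclidean distances between all points.
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)

    # A point is never its own nearest neighbor.
    np.fill_diagonal(dists, np.inf)

    nearest = dists.argmin(axis=1)
    nearest_dist = dists[np.arange(len(x)), nearest]

    # Sort by decreasing distance, so the most isolated points come first.
    order = np.argsort(-nearest_dist, kind="stable")
    return [(int(i), int(nearest[i])) for i in order]
```

Under these assumptions, pairs at the front of the list mark the largest gaps in the embedding space, which is where coverage is sparsest.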

Examples

Well-spread data across 3 dimensions:

>>> import numpy as np
>>> from dataeval.core import completeness
>>> rng = np.random.default_rng(42)
>>> embeddings = rng.random((50, 3))
>>> result = completeness(embeddings)
>>> result["completeness"]
0.9963684026790749
>>> result["isotropy"]
0.9865994134108708

Data confined to a single plane in 3 dimensions (points along two lines through a common point):

>>> directions = rng.normal(size=(2, 3))  # 2 random lines
>>> directions /= np.linalg.norm(directions, axis=1, keepdims=True)
>>> t = np.random.uniform(0, 0.5, (len(directions), 25, 1))
>>> embeddings = ([0.5] * 3 + t * directions[:, np.newaxis, :]).reshape(-1, 3)
>>> result = completeness(embeddings)
>>> result["completeness"]
0.6001089325287554
>>> result["isotropy"]
0.40470070513943307

Completeness can be less than isotropy when the data evenly fills a lower-dimensional subspace:

>>> X_low = rng.normal(size=(50, 2))
>>> Q, _ = np.linalg.qr(rng.normal(size=(3, 2)))
>>> embeddings = X_low @ Q.T
>>> result = completeness(embeddings)
>>> result["completeness"]  # penalized by unused ambient dimension
0.6844547029590969
>>> result["isotropy"]  # close to 1, isotropic within 2D subspace
0.9869106459012913