dataeval.core.completeness

dataeval.core.completeness(embeddings)

Measure the dimensional utilization of embeddings.

Completeness measures how fully the data uses the available dimensions of its embedding space. This implementation uses a directional-diversity approach based on eigenvalue entropy, which is more robust for high-dimensional data than traditional box-counting or neighbor-distance-based methods.
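
To make the eigenvalue-entropy idea concrete, the sketch below shows one way such a score could be computed. It is an illustration under stated assumptions, not the library's implementation: the function name eigenvalue_entropy_score is hypothetical, and the choices of normalizing by the total variance and dividing by log(n_dimensions) are assumptions that may differ from what completeness actually does.

```python
import numpy as np


def eigenvalue_entropy_score(embeddings):
    """Hypothetical sketch: normalized entropy of covariance eigenvalues.

    Not the dataeval implementation; the normalization is an assumption.
    """
    x = np.asarray(embeddings, dtype=float)
    if x.ndim != 2:
        raise ValueError("embeddings must be 2D")
    if 0 in x.shape:
        raise ValueError("embeddings must not have a zero-sized dimension")

    n_dims = x.shape[1]
    if n_dims == 1:
        return 1.0  # a single dimension is trivially fully used

    # Covariance of the centered data; its eigenvalues give the variance
    # along each principal direction.
    centered = x - x.mean(axis=0)
    cov = centered.T @ centered / x.shape[0]
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)

    total = eigvals.sum()
    if total == 0.0:
        return 0.0  # all points identical: no direction is explored

    # Treat the variance shares as a probability distribution and take its
    # Shannon entropy, skipping zero shares (0 * log 0 is taken as 0).
    p = eigvals / total
    p = p[p > 0]
    entropy = -np.sum(p * np.log(p))

    # Divide by the maximum possible entropy so the score lies in [0, 1].
    return float(entropy / np.log(n_dims))
```

Under these assumptions, variance spread evenly over every direction gives a score near 1, while data confined to a line or plane scores lower because some eigenvalues are (near) zero.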

Parameters:
embeddings : Array

Array of image embeddings, shape (n_samples, n_dimensions). Can be a 2D list, array-like object, or tensor.

Returns:

Mapping with keys:

  • completeness: float - Completeness score between 0 and 1; higher values indicate that the data uses more of the available dimensions

  • nearest_neighbor_pairs: Sequence[tuple[int, int]] - Pairs of point indices and their nearest neighbors, sorted by decreasing distance

Return type:

CompletenessResult
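
As an illustration of what a list of nearest-neighbor pairs sorted by decreasing distance looks like, the sketch below builds one with a brute-force distance matrix. The function name nearest_neighbor_pairs_sketch is hypothetical, and the use of Euclidean distance, the tie-breaking order, and the absence of pair deduplication are all assumptions; the library may compute this field differently.

```python
import numpy as np


def nearest_neighbor_pairs_sketch(embeddings):
    """Hypothetical sketch: (index, nearest index) pairs, farthest first.

    Assumes Euclidean distance and at least two points.
    """
    x = np.asarray(embeddings, dtype=float)

    # Brute-force pairwise Euclidean distances, shape (n_samples, n_samples).
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # A point must not be reported as its own nearest neighbor.
    np.fill_diagonal(dist, np.inf)

    nearest = dist.argmin(axis=1)
    nearest_dist = dist[np.arange(len(x)), nearest]

    # Sort by decreasing nearest-neighbor distance (stable for ties).
    order = np.argsort(-nearest_dist, kind="stable")
    return [(int(i), int(nearest[i])) for i in order]
```

Read this way, the first pairs in the list point at the most isolated samples, since their closest neighbor is farther away than for any other point.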

Raises:
  • ValueError – If embeddings are not 2D

  • ValueError – If any dimension of embeddings has size zero

Examples

Well-spread data across 3 dimensions:

>>> import numpy as np
>>> from dataeval.core import completeness
>>> rng = np.random.default_rng(42)
>>> embeddings = rng.random((50, 3))
>>> result = completeness(embeddings)
>>> result["completeness"]
0.9963684026790749

Data confined to a single plane in 3 dimensions (two lines through a common point):

>>> directions = rng.normal(size=(2, 3))  # 2 random lines
>>> directions /= np.linalg.norm(directions, axis=1, keepdims=True)
>>> # Drawn from NumPy's global RNG rather than rng, so the exact score depends on its state
>>> t = np.random.uniform(0, 0.5, (len(directions), 25, 1))
>>> embeddings = ([0.5] * 3 + t * directions[:, np.newaxis, :]).reshape(-1, 3)
>>> result = completeness(embeddings)
>>> result["completeness"]
0.6001089325287554