dataeval.core.completeness

dataeval.core.completeness(embeddings)

Measure the dimensional utilization of embeddings.

Completeness measures how effectively the data explores all available dimensions in its embedding space. This implementation uses a directional diversity approach based on eigenvalue entropy, which is more robust for high-dimensional data than traditional box-counting or neighbor-distance-based methods.
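To make the idea of eigenvalue entropy concrete, here is a minimal sketch. It assumes the score is the Shannon entropy of the normalized covariance eigenvalues divided by log(n_dimensions); the library's actual computation is not shown in this page and may differ. The name `eigenvalue_entropy_score` is hypothetical.

```python
import numpy as np


def eigenvalue_entropy_score(embeddings):
    """Hypothetical sketch: normalized entropy of covariance eigenvalues.

    Assumption: this is an illustration of the general technique, not
    the library's implementation.
    """
    x = np.asarray(embeddings, dtype=float)
    if x.ndim != 2 or 0 in x.shape:
        raise ValueError("embeddings must be 2D with non-zero dimensions")
    n_dims = x.shape[1]
    if n_dims == 1:
        return 1.0  # a single dimension is trivially fully used

    # Covariance of the centered data; eigenvalues give variance per direction.
    centered = x - x.mean(axis=0)
    cov = centered.T @ centered / max(len(x) - 1, 1)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)

    total = eigvals.sum()
    if total == 0.0:
        return 0.0  # all points identical: no direction is explored

    # Treat the variance shares as a probability distribution.
    p = eigvals / total
    p = p[p > 0]
    entropy = -np.sum(p * np.log(p))

    # Divide by the maximum possible entropy so the score lies in [0, 1].
    return float(entropy / np.log(n_dims))
```

Under this sketch, variance spread evenly over all directions gives a score near 1, while data collapsed onto a single line gives a score near 0.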

Parameters:

embeddings : Array - Array of image embeddings with shape (n_samples, n_dimensions). Can be a 2D list, array-like object, or tensor.

Returns:

Mapping with keys:

completeness: float - Completeness score between 0 and 1
nearest_neighbor_pairs: Sequence[tuple[int, int]] - Pairs of point indices and their nearest neighbors, sorted by decreasing distance
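The ordering of nearest_neighbor_pairs can be illustrated with a brute-force sketch. It assumes Euclidean distance and at least two points; the metric and algorithm the library uses are not stated here, and `nearest_neighbor_pairs_sketch` is a hypothetical name.

```python
import numpy as np


def nearest_neighbor_pairs_sketch(embeddings):
    """Hypothetical sketch: (index, nearest neighbor) pairs, farthest first.

    Assumption: Euclidean distance, computed by brute force.
    """
    x = np.asarray(embeddings, dtype=float)

    # Full pairwise distance matrix; fine for small inputs only.
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    nearest = dists.argmin(axis=1)
    nearest_dist = dists[np.arange(len(x)), nearest]

    # Sort by decreasing nearest-neighbor distance (stable for ties).
    order = np.argsort(-nearest_dist, kind="stable")
    return [(int(i), int(nearest[i])) for i in order]


pairs = nearest_neighbor_pairs_sketch([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
# The isolated point (index 2) comes first: [(2, 1), (0, 1), (1, 0)]
```

Pairs at the front of the sequence are the most isolated points, which is where gaps in coverage are most likely to show up.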

Return type:

CompletenessResult

Raises:

ValueError – If embeddings are not 2D
ValueError – If embeddings have a zero dimension
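The two error conditions can be pictured with a small shape check. This is a sketch under the assumption that "a zero dimension" means an axis of length zero; `check_embeddings_shape` is a hypothetical helper, not part of the library.

```python
import numpy as np


def check_embeddings_shape(embeddings):
    """Hypothetical helper mirroring the documented ValueError conditions."""
    x = np.asarray(embeddings)
    if x.ndim != 2:
        raise ValueError(f"Expected 2D embeddings, got {x.ndim}D")
    if 0 in x.shape:
        # Assumption: "zero dimension" means an axis of length zero.
        raise ValueError(f"Embeddings have a zero-length axis: shape {x.shape}")
    return x


valid = check_embeddings_shape(np.zeros((5, 3)))  # passes

for bad in (np.zeros(5), np.zeros((0, 3))):
    try:
        check_embeddings_shape(bad)
    except ValueError as err:
        print(f"rejected: {err}")
```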

Examples

Well-spread data across 3 dimensions:

>>> import numpy as np
>>> from dataeval.core import completeness
>>> rng = np.random.default_rng(42)
>>> embeddings = rng.random((50, 3))
>>> result = completeness(embeddings)
>>> result["completeness"]
0.9963684026790749

Data confined to a single plane in 3 dimensions:

>>> directions = rng.normal(size=(2, 3))  # 2 random directions, which together span a plane
>>> directions /= np.linalg.norm(directions, axis=1, keepdims=True)
>>> t = np.random.uniform(0, 0.5, (len(directions), 25, 1))
>>> embeddings = ([0.5] * 3 + t * directions[:, np.newaxis, :]).reshape(-1, 3)
>>> result = completeness(embeddings)
>>> result["completeness"]
0.6001089325287554
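To see why the second dataset uses fewer dimensions, the construction can be checked with plain NumPy. The sketch below rebuilds similar data (the seed and the use of `rng.uniform` are choices made for this illustration, so the values differ from those above) and confirms that the points lie in a 2D subspace through (0.5, 0.5, 0.5).

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for this illustration

# Two unit directions; points are placed along each, starting at the center.
directions = rng.normal(size=(2, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
t = rng.uniform(0, 0.5, (len(directions), 25, 1))
embeddings = (0.5 + t * directions[:, np.newaxis, :]).reshape(-1, 3)

# Offsets from the shared center are combinations of only two directions,
# so the offset matrix has rank 2 and its third singular value is ~0.
offsets = embeddings - 0.5
rank = np.linalg.matrix_rank(offsets, tol=1e-9)
singular_values = np.linalg.svd(offsets, compute_uv=False)

print(rank)
print(singular_values)
```

Because one of the three directions carries no variation, a dimensional-utilization score for such data is expected to fall well below that of the well-spread example.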