dataeval.core.label_errors

dataeval.core.label_errors(embeddings, labels, k=50)

Identify potential label errors in a dataset using embedding geometry.

Computes an “Intra/Extra Class Distance Ratio” for every sample. A sample is flagged as a potential label error when its score is >= 1.0, meaning it is at least as close to samples of a different class as it is to samples of its own class.
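
The exact formula is not given here, so the following is only a minimal sketch of one plausible distance ratio, not DataEval's implementation. It assumes the score is the mean Euclidean distance to the k nearest same-class samples divided by the mean distance to the k nearest samples of any other class. The helper name distance_ratio_scores and the toy data are hypothetical.

```python
import numpy as np


def distance_ratio_scores(embeddings, labels, k=50):
    """Illustrative intra/extra class distance ratio (assumed formula).

    Assumes at least two classes and at least two samples per class.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)

    # Pairwise Euclidean distances, shape (n_samples, n_samples).
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))

    same_class = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)

    scores = np.empty(len(labels))
    for i in range(len(labels)):
        # k nearest neighbors sharing the sample's label (excluding itself).
        intra = np.sort(dists[i][same_class[i] & not_self[i]])[:k]
        # k nearest neighbors carrying any other label.
        extra = np.sort(dists[i][~same_class[i]])[:k]
        scores[i] = intra.mean() / extra.mean()
    return scores


# Toy data: two well-separated clusters; the last point sits inside
# cluster 0 but carries label 1.
embeddings = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [0.5, 0.5]]
)
labels = np.array([0, 0, 0, 1, 1, 1, 1])

scores = distance_ratio_scores(embeddings, labels, k=2)
flagged = np.flatnonzero(scores >= 1.0)  # indices meeting the documented threshold
```

Under these assumptions, only the deliberately mislabeled last sample reaches the 1.0 threshold. The dense pairwise matrix uses O(n^2) memory, so this form is suitable for small illustrations only.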

Parameters:
embeddings : NDArray

Input feature embeddings (e.g., from DINO, ResNet) with shape (n_samples, n_features).

labels : NDArray[np.int64]

Ground truth labels corresponding to the embeddings, with shape (n_samples,).

k : int, optional

Number of neighbors to use for local density estimation. Default is 50.

Returns:

A dictionary containing:

  • 'errors': Dict mapping sample indices to tuples of (original_label, [suggested_labels]). Only contains samples with a score >= 1.0.

  • 'error_rank': Array of sample indices sorted by likelihood of error (descending score).

  • 'scores': Array of raw distance ratio scores for all samples.

Return type:

LabelErrorResult
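
As a rough illustration of how the three keys relate, the sketch below builds a stand-in dictionary by hand rather than calling the library. Every value in it is invented, and only the key names and shapes follow the description above.

```python
import numpy as np

# Hand-built stand-in using the documented keys; all values are made up.
result = {
    "errors": {6: (1, [0])},  # sample index -> (original_label, [suggested_labels])
    "error_rank": np.array([6, 1, 2, 0, 5, 4, 3]),
    "scores": np.array([0.13, 0.17, 0.15, 0.07, 0.08, 0.09, 19.5]),
}

# Walk the flagged samples together with their raw scores.
for idx, (original, suggested) in result["errors"].items():
    score = result["scores"][idx]
    print(f"sample {idx}: labeled {original}, suggested {suggested}, score {score:.2f}")

# 'error_rank' orders all samples by descending score, so its head can act
# as a manual review queue even for samples below the 1.0 threshold.
top = result["error_rank"][:3]
review_queue = [(int(i), float(result["scores"][i])) for i in top]
```

Because 'errors' holds only samples scoring >= 1.0, 'error_rank' is the more useful key when a reviewer wants to inspect borderline cases as well.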