dataeval.core

Core stateless functions for performing dataset, metadata and model evaluation.

Classes

BERResult

Type definition for Bayes Error Rate bounds output.

CalculationResult

Type definition for calculation output.

ClusterResult

Type definition for cluster output.

ClusterStats

Pre-calculated statistics for adaptive outlier detection.

CompletenessResult

Type definition for completeness output.

CoverageResult

Type definition for coverage output.

DivergenceResult

Type definition for divergence output.

FeatureDistanceResult

Type definition for feature distance output.

LabelErrorResult

Type definition for label error output.

LabelParityResult

Type definition for label parity output.

LabelStatsResult

Type definition for label statistics output.

MSTResult

Type definition for minimum spanning tree output.

MutualInfoResult

Type definition for mutual information output.

NullModelMetrics

Per-model results for null-model metrics.

NullModelMetricsResult

Type definition for null model metrics output.

ParityResult

Type definition for parity output.

RankResult

Type definition for rank output.

Functions

ber_knn(embeddings, class_labels, k)

Estimate the multi-class Bayes error rate using k-nearest neighbors (KNN).

ber_mst(embeddings, class_labels)

Estimate the multi-class Bayes error rate using a minimum spanning tree.

calculate(data[, boxes, stats, per_image, per_target, ...])

Compute specified statistics on a set of images, optionally within bounding boxes.

calculate_ratios(stats_output, *[, ...])

Calculate box-to-image ratios from calculate() output.

cluster(embeddings[, algorithm, n_clusters, ...])

Use hierarchical clustering on the flattened data and return clustering information.

completeness(embeddings)

Measure the dimensional utilization of embeddings.

compute_cluster_stats(embeddings, cluster_labels)

Compute cluster centers and distance statistics for adaptive outlier detection.

compute_neighbors(data_fit[, data_query, k, algorithm])

For each sample in data_query, compute the k nearest neighbors in data_fit.

coverage_adaptive(embeddings, num_observations, percent)

Evaluate coverage using an adaptive radius calculation method.

coverage_naive(embeddings, num_observations)

Evaluate coverage using a naive radius calculation method.

dhash(image)

Compute difference hash (dHash) for an image.
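As an illustration of the idea behind a difference hash, the following is a minimal sketch and not the dataeval implementation. It assumes the image has already been converted to grayscale and downscaled to a grid of `hash_size` rows by `hash_size + 1` columns. The name `dhash_sketch`, the bit direction (1 when brightness increases left to right), and the hexadecimal output format are all assumptions made for this example.

```python
def dhash_sketch(gray, hash_size=8):
    """Difference hash of a grayscale grid (hypothetical helper, not dataeval.core.dhash).

    `gray` must have `hash_size` rows and `hash_size + 1` columns.
    """
    if len(gray) != hash_size or any(len(row) != hash_size + 1 for row in gray):
        raise ValueError("expected a hash_size x (hash_size + 1) grid")
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            # One bit per horizontally adjacent pair: 1 if brightness increases.
            bits = (bits << 1) | (1 if right > left else 0)
    # hash_size * hash_size bits, rendered as zero-padded hexadecimal.
    return f"{bits:0{hash_size * hash_size // 4}x}"


# A left-to-right brightness ramp sets every bit.
ramp = [list(range(9)) for _ in range(8)]
print(dhash_sketch(ramp))  # ffffffffffffffff
```

Because each bit depends only on the sign of a local gradient, small uniform changes in brightness or contrast leave the hash unchanged, which is what makes this family of hashes useful for near-duplicate detection.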

dhash_d4(image)

Compute orientation-invariant difference hash using gradients.

divergence_fnn(emb_a, emb_b)

Calculate the divergence by counting label disagreements between nearest neighbors.

divergence_mst(emb_a, emb_b)

Calculate the divergence by counting "between dataset" edges in the minimum spanning tree.
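The edge-counting step described above can be sketched in pure Python. This is an illustrative sketch only: the helper name `cross_edges_in_mst` is hypothetical, the Euclidean metric is an assumption, and the normalisation that turns the count into a divergence value is not given in this summary, so it is left out.

```python
import math


def cross_edges_in_mst(emb_a, emb_b):
    """Count MST edges joining a point of emb_a to a point of emb_b (hypothetical helper)."""
    points = [tuple(p) for p in emb_a] + [tuple(p) for p in emb_b]
    source = [0] * len(emb_a) + [1] * len(emb_b)  # which dataset each point came from
    n = len(points)
    if n < 2:
        return 0
    in_tree = [False] * n
    best_dist = [math.inf] * n  # cheapest known edge from the tree to each point
    best_from = [-1] * n        # tree endpoint of that cheapest edge
    best_dist[0] = 0.0
    cross = 0
    for _ in range(n):
        # Prim's algorithm: add the closest point that is not yet in the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best_dist.__getitem__)
        in_tree[u] = True
        if best_from[u] != -1 and source[u] != source[best_from[u]]:
            cross += 1
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best_dist[v]:
                    best_dist[v] = d
                    best_from[v] = u
    return cross


# Two well-separated datasets share a single bridging edge.
print(cross_edges_in_mst([(0, 0), (0, 1)], [(10, 0), (10, 1)]))  # 1
```

Few between-dataset edges suggest the two embedding sets occupy different regions, while many such edges suggest they are interleaved.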

factor_deviation(reference_factors, test_factors, indices)

Determine greatest deviation in metadata features per sample.

factor_predictors(factors, indices[, discrete_features])

Compute mutual information between metadata factors and flagged sample indices.

feature_distance(continuous_data_1, continuous_data_2)

Measure the feature-wise distance between two continuous distributions.

label_errors(embeddings, labels[, k])

Identify potential label errors in a dataset using embedding geometry.

label_parity(expected_labels, observed_labels, *[, ...])

Calculate the chi-square statistic to assess label distribution parity.
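For readers unfamiliar with the statistic, here is a minimal sketch of a chi-square comparison of two label distributions. It is not the dataeval implementation: the helper name `chi_square_label_parity` is hypothetical, and both the rescaling of expected counts to the observed total and the error raised for empty expected classes are assumptions of this example.

```python
from collections import Counter


def chi_square_label_parity(expected_labels, observed_labels):
    """Chi-square statistic between two label distributions (hypothetical helper)."""
    expected = Counter(expected_labels)
    observed = Counter(observed_labels)
    n_expected = sum(expected.values())
    n_observed = sum(observed.values())
    chi2 = 0.0
    for cls in sorted(set(expected) | set(observed)):
        # Rescale expected counts so both distributions have the same total.
        e = expected[cls] * n_observed / n_expected
        if e == 0:
            raise ValueError(f"class {cls!r} has no expected samples")
        chi2 += (observed[cls] - e) ** 2 / e
    return chi2


expected = [0] * 50 + [1] * 50
observed = [0] * 30 + [1] * 10
print(chi_square_label_parity(expected, observed))  # 10.0
```

A value of zero means the observed class proportions match the expected ones exactly; larger values indicate a larger mismatch.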

label_stats(class_labels[, item_indices, index2label, ...])

Calculate statistics for data labels.

minimum_spanning_tree(embeddings[, k])

Compute the minimum spanning tree of a dataset.

mutual_info(class_labels, factor_data[, ...])

Compute mutual information between factors, transformed to lie in [0, 1].

mutual_info_classwise(class_labels, factor_data[, ...])

Compute mutual information (MI) between factors, transformed to lie in [0, 1].

nullmodel_accuracy(class_prob, model_prob, *[, multiclass])

Calculate accuracy from binary classification results.

nullmodel_fpr(class_prob, model_prob)

Calculate FPR (False Positive Rate) from binary classification results.

nullmodel_metrics(test_labels[, train_labels])

Calculate null-model metrics (dummy-classifier metrics) for the given class distributions.
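To make the notion of a null model concrete, the sketch below computes the expected accuracy of a classifier that ignores its input and predicts each class with a fixed probability. It is a sketch under stated assumptions rather than the library's code: the function name `null_accuracy` is hypothetical, and treating both arguments as probability vectors over the same ordered classes is an assumption about how such metrics are typically defined.

```python
def null_accuracy(class_prob, model_prob):
    """Expected accuracy of an input-independent classifier (hypothetical helper).

    class_prob[i]: probability that a sample truly belongs to class i.
    model_prob[i]: probability that the null model predicts class i.
    """
    if len(class_prob) != len(model_prob):
        raise ValueError("both distributions must cover the same classes")
    # A prediction is correct when the true and predicted class coincide;
    # the two are independent, so the probabilities multiply per class.
    return sum(p * q for p, q in zip(class_prob, model_prob))


class_prob = [0.75, 0.25]
print(null_accuracy(class_prob, [0.5, 0.5]))  # uniform guessing: 0.5
print(null_accuracy(class_prob, [1.0, 0.0]))  # always the majority class: 0.75
print(null_accuracy(class_prob, class_prob))  # proportional guessing: 0.625
```

Baselines like these give a floor against which a trained model's scores can be judged on an imbalanced dataset.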

nullmodel_precision(class_prob, model_prob)

Calculate precision from binary classification results.

nullmodel_recall(class_prob, model_prob)

Calculate recall (True Positive Rate) from binary classification results.

parity(factor_data, class_labels)

Calculate statistical parity using Bias-Corrected Cramér's V.
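The bias-corrected form of Cramér's V can be sketched for a single pair of categorical variables as follows. This is an illustrative sketch, not the dataeval implementation: the helper name `cramers_v_bias_corrected` is hypothetical, and returning 0.0 when either variable is constant is an assumption of this example.

```python
import math
from collections import Counter


def cramers_v_bias_corrected(x, y):
    """Bias-corrected Cramér's V between two categorical sequences (hypothetical helper)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    joint = Counter(zip(x, y))
    rows, cols = Counter(x), Counter(y)
    r, k = len(rows), len(cols)
    # Pearson chi-square over the full contingency table, including empty cells.
    chi2 = 0.0
    for a in rows:
        for b in cols:
            expected = rows[a] * cols[b] / n
            chi2 += (joint[(a, b)] - expected) ** 2 / expected
    phi2 = chi2 / n
    # Subtract the expected value of phi2 under independence, then shrink the table size.
    phi2_corr = max(0.0, phi2 - (r - 1) * (k - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    if denom <= 0:
        return 0.0  # assumption: a constant variable carries no association
    return math.sqrt(phi2_corr / denom)


labels = [0] * 50 + [1] * 50
print(cramers_v_bias_corrected([0, 0, 1, 1], [0, 1, 0, 1]))  # independent: 0.0
print(round(cramers_v_bias_corrected(labels, labels), 6))    # identical: 1.0
```

The correction matters most for small samples or tables with many categories, where the uncorrected statistic tends to overstate association.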

phash(image)

Compute a perceptual hash using the Discrete Cosine Transform (DCT).

phash_d4(image)

Compute an orientation-invariant perceptual hash using the DCT.

rank_hdbscan_complexity(embeddings[, c, ...])

Rank samples using HDBSCAN cluster complexity weighting.

rank_hdbscan_distance(embeddings[, c, ...])

Rank samples using distance to HDBSCAN cluster centers.

rank_kmeans_complexity(embeddings[, c, n_init, reference])

Rank samples using cluster complexity weighting.

rank_kmeans_distance(embeddings[, c, n_init, reference])

Rank samples using distance to cluster centers.

rank_knn(embeddings[, k, reference])

Rank samples using k-nearest neighbors distance.
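A brute-force version of k-nearest-neighbor ranking can be sketched as below. This is only an illustration: the helper name `rank_by_knn_distance` is hypothetical, the Euclidean metric and the use of the mean distance to the k neighbors as the score are assumptions, the ascending sort order is an assumption, and the optional `reference` argument from the signature above is not modelled.

```python
import math


def rank_by_knn_distance(embeddings, k=1):
    """Order sample indices by mean distance to their k nearest neighbors (hypothetical helper)."""
    n = len(embeddings)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of samples")
    scores = []
    for i, p in enumerate(embeddings):
        # Distances from sample i to every other sample, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(embeddings) if j != i)
        scores.append(sum(dists[:k]) / k)
    # Ascending order: samples in dense regions first, isolated samples last.
    order = sorted(range(n), key=scores.__getitem__)
    return order, scores


order, scores = rank_by_knn_distance([(0, 0), (0, 1), (1, 0), (10, 10)], k=1)
print(order[-1])  # 3, the isolated point ranks last
```

The quadratic cost of this loop is acceptable only for small inputs; a spatial index or approximate neighbor search would be the usual choice at scale.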

rank_result_class_balanced(result, class_labels)

Transform RankResult indices using class-balanced selection.

rank_result_stratified(result[, num_bins])

Transform RankResult indices using stratified sampling.

uap(labels, scores)

Estimate the empirical mean precision for the upper-bound average precision.

xxhash(image)

Compute a fast non-cryptographic hash using the xxHash algorithm.