dataeval.core

Core stateless functions for performing dataset, metadata and model evaluation.

Functions

ber_knn(embeddings, class_labels, k)

Estimates the multi-class Bayes error rate using a k-nearest neighbors (KNN) test statistic.

ber_mst(embeddings, class_labels)

Estimates the multi-class Bayes error rate using the Friedman-Rafsky (FR) test statistic computed on a minimum spanning tree (MST).

calculate(data[, boxes, stats, per_image, per_target, ...])

Compute specified statistics on a set of images, optionally within bounding boxes.

calculate_ratios(stats_output, *[, ...])

Calculate box-to-image ratios from calculate() output.

cluster(embeddings[, algorithm, n_clusters, ...])

Uses hierarchical clustering on the flattened data and returns clustering

compute_cluster_stats(embeddings, cluster_labels)

Compute cluster centers and distance statistics for adaptive outlier detection.

compute_neighbors(data_fit[, data_query, k, algorithm])

For each sample in data_query, compute the k nearest neighbors in data_fit.

coverage_adaptive(embeddings, num_observations, percent)

Evaluate coverage using an adaptive radius calculation method.

coverage_naive(embeddings, num_observations)

Evaluate coverage using a naive radius calculation method.

dhash(image)

Compute difference hash (dHash) for an image.

dhash_d4(image)

Compute orientation-invariant difference hash using gradients.

divergence_fnn(emb_a, emb_b)

Calculates the divergence by counting the label disagreements between nearest neighbors

divergence_mst(emb_a, emb_b)

Calculates the divergence by counting the number of "between dataset" edges in the minimum spanning tree.

factor_deviation(reference_factors, test_factors, indices)

Determine greatest deviation in metadata features per sample.

factor_predictors(factors, indices[, discrete_features])

Computes mutual information between metadata factors and flagged sample indices.

feature_distance(continuous_data_1, continuous_data_2)

Measures the feature-wise distance between two continuous distributions and computes a

label_errors(embeddings, labels[, k])

Identifies potential label errors in a dataset using embedding geometry.

label_parity(expected_labels, observed_labels, *[, ...])

Calculate the chi-square statistic to assess the parity between expected and observed label distributions.

label_stats(class_labels[, item_indices, index2label, ...])

Calculates statistics for data labels.

minimum_spanning_tree(embeddings[, k])

Compute the minimum spanning tree of a dataset.

mutual_info(class_labels, factor_data[, ...])

Mutual information between factors (class label, metadata, label/image properties),

mutual_info_classwise(class_labels, factor_data[, ...])

Mutual information (MI) between factors (class label, metadata, label/image properties),

nullmodel_accuracy(class_prob, model_prob, *[, multiclass])

Calculates accuracy from binary classification results.

nullmodel_fpr(class_prob, model_prob)

Calculates FPR (False Positive Rate) from binary classification results.

nullmodel_metrics(test_labels[, train_labels])

Calculate null model metrics (dummy classifier metrics) for given class distributions.

nullmodel_precision(class_prob, model_prob)

Calculates precision from binary classification results.

nullmodel_recall(class_prob, model_prob)

Calculates recall (True Positive Rate) from binary classification results.

parity(factor_data, class_labels)

Calculate statistical parity using Bias-Corrected Cramér's V.

phash(image)

Compute perceptual hash using Discrete Cosine Transform (DCT).

phash_d4(image)

Compute orientation-invariant perceptual hash using DCT.

rank_hdbscan_complexity(embeddings[, c, ...])

Rank samples using HDBSCAN cluster complexity weighting.

rank_hdbscan_distance(embeddings[, c, ...])

Rank samples using distance to HDBSCAN cluster centers.

rank_kmeans_complexity(embeddings[, c, n_init, reference])

Rank samples using cluster complexity weighting.

rank_kmeans_distance(embeddings[, c, n_init, reference])

Rank samples using distance to cluster centers.

rank_knn(embeddings[, k, reference])

Rank samples using k-nearest neighbors distance.

rerank_class_balance(result, class_labels)

Rerank to balance selection across class labels.

rerank_hard_first(result)

Reverse ranking order to put hard samples first.

rerank_stratified(result[, num_bins])

Rerank by stratified sampling across score bins.

uap(labels, scores)

FR test statistic-based estimate of the empirical mean precision for the upper-bound average precision.

xxhash(image)

Compute fast non-cryptographic hash using xxHash algorithm.
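
To make the hashing entries above more concrete, the following is a minimal pure-Python sketch of the general difference-hash idea that dhash(image) is named after. It is not the dataeval implementation: the helper names dhash_sketch and hamming_distance are hypothetical, the input is assumed to be already converted to grayscale and resized to 8 rows by 9 columns, and the bit convention (1 when the left pixel is brighter than the right) and hex output format are assumptions that may differ from the library.

```python
def dhash_sketch(gray):
    """Difference hash of an 8x9 grayscale grid (8 rows, 9 columns).

    Hypothetical helper for illustration only. It assumes the image has
    already been converted to grayscale and resized, which a real
    implementation would do first.
    """
    if len(gray) != 8 or any(len(row) != 9 for row in gray):
        raise ValueError("expected 8 rows of 9 pixel values")
    bits = 0
    for row in gray:
        for left, right in zip(row, row[1:]):
            # One bit per horizontal neighbor pair: 1 if brightness drops.
            bits = (bits << 1) | (1 if left > right else 0)
    # 8 rows x 8 comparisons = 64 bits, rendered as 16 hex digits.
    return f"{bits:016x}"


def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two hex-encoded hashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# A horizontal gradient that gets darker to the right.
darkening = [[90 - 10 * col for col in range(9)] for _ in range(8)]
# The same gradient, uniformly brightened.
brighter = [[value + 20 for value in row] for row in darkening]
# The mirrored gradient, which gets brighter to the right.
mirrored = [row[::-1] for row in darkening]

print(dhash_sketch(darkening))
print(dhash_sketch(brighter))
print(hamming_distance(dhash_sketch(darkening), dhash_sketch(mirrored)))
```

Because the hash encodes only the sign of each neighbor-to-neighbor brightness change, a uniform brightness shift leaves it unchanged, while mirroring the gradient flips every bit. Comparing hashes by Hamming distance is the usual way such hashes are used to flag near-duplicate images.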