dataeval.core

Core stateless functions for performing dataset, metadata and model evaluation.

Functions

`ber_knn`(embeddings, class_labels, k)	An estimator for the multi-class Bayes error rate using a KNN test statistic basis.
`ber_mst`(embeddings, class_labels)	An estimator for the multi-class Bayes error rate using the Friedman-Rafsky (FR) test statistic with a minimum spanning tree (MST) basis.
`calculate`(data[, boxes, stats, per_image, per_target, ...])	Compute specified statistics on a set of images, optionally within bounding boxes.
`calculate_ratios`(stats_output, *[, ...])	Calculate box-to-image ratios from calculate() output.
`cluster`(embeddings[, algorithm, n_clusters, ...])	Uses hierarchical clustering on the flattened data and returns clustering
`compute_cluster_stats`(embeddings, cluster_labels)	Compute cluster centers and distance statistics for adaptive outlier detection.
`compute_neighbors`(data_fit[, data_query, k, algorithm])	For each sample in data_query, compute the k nearest neighbors in data_fit.
`coverage_adaptive`(embeddings, num_observations, percent)	Evaluate coverage using an adaptive radius calculation method.
`coverage_naive`(embeddings, num_observations)	Evaluate coverage using a naive radius calculation method.
`dhash`(image)	Compute difference hash (dHash) for an image.
`dhash_d4`(image)	Compute orientation-invariant difference hash using gradients.
`divergence_fnn`(emb_a, emb_b)	Calculates the divergence by counting the label disagreements between nearest neighbors
`divergence_mst`(emb_a, emb_b)	Calculates the divergence by counting the number of "between dataset" edges in the
`factor_deviation`(reference_factors, test_factors, indices)	Determine greatest deviation in metadata features per sample.
`factor_predictors`(factors, indices[, discrete_features])	Computes mutual information between metadata factors and flagged sample indices.
`feature_distance`(continuous_data_1, continuous_data_2)	Measures the feature-wise distance between two continuous distributions and computes a
`label_errors`(embeddings, labels[, k])	Identifies potential label errors in a dataset using embedding geometry.
`label_parity`(expected_labels, observed_labels, *[, ...])	Calculate the chi-square statistic to assess the parity between expected and observed label distributions.
`label_stats`(class_labels[, item_indices, index2label, ...])	Calculates statistics for data labels.
`minimum_spanning_tree`(embeddings[, k])	Compute the minimum spanning tree of a dataset.
`mutual_info`(class_labels, factor_data[, ...])	Mutual information between factors (class label, metadata, label/image properties),
`mutual_info_classwise`(class_labels, factor_data[, ...])	Mutual information (MI) between factors (class label, metadata, label/image properties),
`nullmodel_accuracy`(class_prob, model_prob, *[, multiclass])	Calculates accuracy from binary classification results.
`nullmodel_fpr`(class_prob, model_prob)	Calculates FPR (False Positive Rate) from binary classification results.
`nullmodel_metrics`(test_labels[, train_labels])	Calculate null model metrics (dummy classifiers metrics) for given class distributions.
`nullmodel_precision`(class_prob, model_prob)	Calculates precision from binary classification results.
`nullmodel_recall`(class_prob, model_prob)	Calculates recall (True Positive Rate) from binary classification results.
`parity`(factor_data, class_labels)	Calculate statistical parity using Bias-Corrected Cramér's V.
`phash`(image)	Compute perceptual hash using Discrete Cosine Transform (DCT).
`phash_d4`(image)	Compute orientation-invariant perceptual hash using DCT.
`rank_hdbscan_complexity`(embeddings[, c, ...])	Rank samples using HDBSCAN cluster complexity weighting.
`rank_hdbscan_distance`(embeddings[, c, ...])	Rank samples using distance to HDBSCAN cluster centers.
`rank_kmeans_complexity`(embeddings[, c, n_init, reference])	Rank samples using cluster complexity weighting.
`rank_kmeans_distance`(embeddings[, c, n_init, reference])	Rank samples using distance to cluster centers.
`rank_knn`(embeddings[, k, reference])	Rank samples using k-nearest neighbors distance.
`rerank_class_balance`(result, class_labels)	Rerank to balance selection across class labels.
`rerank_hard_first`(result)	Reverse ranking order to put hard samples first.
`rerank_stratified`(result[, num_bins])	Rerank by stratified sampling across score bins.
`uap`(labels, scores)	FR test-statistic-based estimate of the empirical mean precision for the upper-bound average precision.
`xxhash`(image)	Compute fast non-cryptographic hash using xxHash algorithm.
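
To make the hashing entries above more concrete, here is a minimal, standard-library-only sketch of the general difference-hash (dHash) technique. It is not the `dataeval.core.dhash` implementation and does not call the library. The names `dhash_sketch` and `hamming_distance`, the `hash_size` parameter, the list-of-rows grayscale input, and the block-average downsampling step are all assumptions made for illustration.

```python
def dhash_sketch(image, hash_size=8):
    """Illustrative difference hash of a 2D grayscale image (list of rows).

    Hypothetical helper: NOT dataeval's implementation. The image is
    downsampled to hash_size rows by (hash_size + 1) columns by block
    averaging, then each pixel is compared with its right-hand neighbor.
    """
    if not image or not image[0]:
        raise ValueError("image must be a non-empty 2D grid")
    height, width = len(image), len(image[0])
    rows, cols = hash_size, hash_size + 1

    # Downsample: average the pixels that fall inside each target cell.
    small = []
    for r in range(rows):
        r0 = r * height // rows
        r1 = max((r + 1) * height // rows, r0 + 1)
        row = []
        for c in range(cols):
            c0 = c * width // cols
            c1 = max((c + 1) * width // cols, c0 + 1)
            block = [image[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        small.append(row)

    # One bit per horizontal neighbor pair: 1 if brightness drops left to right.
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)

    hex_digits = (hash_size * hash_size + 3) // 4
    return f"{bits:0{hex_digits}x}"


def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two hex-encoded hashes."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


# Synthetic 16x18 images: a left-to-right ramp and its mirror image.
ramp_up = [[x for x in range(18)] for _ in range(16)]
ramp_down = [row[::-1] for row in ramp_up]

print(dhash_sketch(ramp_up))
print(dhash_sketch(ramp_down))
print(hamming_distance(dhash_sketch(ramp_up), dhash_sketch(ramp_down)))
```

Because only the sign of each neighbor difference is kept, adding a constant brightness offset leaves the hash unchanged, and near-duplicate images tend to differ in only a few bits. That is why hashes of this kind are usually compared by Hamming distance instead of by equality.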