dataeval.core

Core stateless functions for performing dataset, metadata and model evaluation.

Classes

`BERResult`	Type definition for Bayes Error Rate bounds output.
`ClusterResult`	Type definition for cluster output.
`ClusterStats`	Pre-calculated statistics for adaptive outlier detection.
`CompletenessResult`	Type definition for completeness output.
`CoverageResult`	Type definition for coverage output.
`DivergenceResult`	Type definition for divergence output.
`FeatureDistanceResult`	Type definition for feature distance output.
`LabelErrorResult`	Type definition for label error output.
`LabelParityResult`	Type definition for label parity output.
`LabelStatsResult`	Type definition for label statistics output.
`MSTResult`	Type definition for minimum spanning tree output.
`MutualInfoResult`	Type definition for mutual information output.
`NullModelMetrics`	Per-model results for null-model metrics.
`NullModelMetricsResult`	Type definition for null model metrics output.
`ParityResult`	Type definition for parity output.
`RankResult`	Type definition for rank output.
`StatsResult`	Type definition for statistics calculation output.

Functions

`ber_knn`(embeddings, class_labels, k)	Estimate the multi-class Bayes error rate using k-nearest neighbors (KNN).
`ber_mst`(embeddings, class_labels)	Estimate the multi-class Bayes error rate using a minimum spanning tree.
`cluster`(embeddings[, algorithm, n_clusters, ...])	Use hierarchical clustering on the flattened data and return clustering information.
`combine_stats_results`(results)	Combine one or more StatsResults into unified stats, source_index, and dataset_steps.
`completeness`(embeddings)	Measure the dimensional utilization of embeddings.
`compute_cluster_stats`(embeddings, cluster_labels)	Compute cluster centers and distance statistics for adaptive outlier detection.
`compute_neighbors`(data_fit[, data_query, k, algorithm])	For each sample in data_query, compute the k nearest neighbors in data_fit.
`compute_ratios`(stats_output, *[, target_stats_output, ...])	Compute box-to-image ratios from compute_stats() output.
`compute_stats`(data, *[, boxes, stats, per_image, ...])	Compute specified statistics on a set of images, optionally within bounding boxes.
`coverage_adaptive`(embeddings, num_observations, percent)	Evaluate coverage using an adaptive radius calculation method.
`coverage_naive`(embeddings, num_observations)	Evaluate coverage using a naive radius calculation method.
`dhash`(image)	Compute difference hash (dHash) for an image.
`dhash_d4`(image)	Compute orientation-invariant difference hash using gradients.
`divergence_fnn`(emb_a, emb_b)	Compute the divergence by counting label disagreements between nearest neighbors.
`divergence_mst`(emb_a, emb_b)	Compute the divergence by counting "between dataset" edges in the minimum spanning tree.
`factor_deviation`(reference_factors, test_factors, indices)	Determine greatest deviation in metadata features per sample.
`factor_predictors`(factors, indices[, discrete_features])	Compute mutual information between metadata factors and flagged sample indices.
`feature_distance`(continuous_data_1, continuous_data_2)	Measure the feature-wise distance between two continuous distributions.
`label_errors`(embeddings, labels[, k])	Identify potential label errors in a dataset using embedding geometry.
`label_parity`(expected_labels, observed_labels, *[, ...])	Compute the chi-square statistic to assess label distribution parity.
`label_stats`(class_labels[, item_indices, index2label, ...])	Compute statistics for data labels.
`minimum_spanning_tree`(embeddings[, k])	Compute the minimum spanning tree of a dataset.
`mutual_info`(class_labels, factor_data[, ...])	Compute mutual information between factors, transformed to lie in [0, 1].
`mutual_info_classwise`(class_labels, factor_data[, ...])	Compute mutual information (MI) between factors, transformed to lie in [0, 1].
`nullmodel_accuracy`(class_prob, model_prob, *[, multiclass])	Compute accuracy from binary classification results.
`nullmodel_fpr`(class_prob, model_prob)	Compute FPR (False Positive Rate) from binary classification results.
`nullmodel_metrics`(test_labels[, train_labels])	Compute null model metrics (dummy classifier metrics) for given class distributions.
`nullmodel_precision`(class_prob, model_prob)	Compute precision from binary classification results.
`nullmodel_recall`(class_prob, model_prob)	Compute recall (True Positive Rate) from binary classification results.
`parity`(factor_data, class_labels)	Compute statistical parity using Bias-Corrected Cramér's V.
`phash`(image)	Compute perceptual hash using Discrete Cosine Transform (DCT).
`phash_d4`(image)	Compute orientation-invariant perceptual hash using DCT.
`rank_hdbscan_complexity`(embeddings[, c, ...])	Rank samples using HDBSCAN cluster complexity weighting.
`rank_hdbscan_distance`(embeddings[, c, ...])	Rank samples using distance to HDBSCAN cluster centers.
`rank_kmeans_complexity`(embeddings[, c, n_init, reference])	Rank samples using cluster complexity weighting.
`rank_kmeans_distance`(embeddings[, c, n_init, reference])	Rank samples using distance to cluster centers.
`rank_knn`(embeddings[, k, reference])	Rank samples using k-nearest neighbors distance.
`rank_result_class_balanced`(result, class_labels)	Transform RankResult indices using class-balanced selection.
`rank_result_stratified`(result[, num_bins])	Transform RankResult indices using stratified sampling.
`uap`(labels, scores)	Estimate the empirical mean precision for the upper-bound average precision.
`xxhash`(image)	Compute fast non-cryptographic hash using xxHash algorithm.
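
The `nullmodel_*` functions listed above take a class probability and a model probability. As a rough illustration of the idea behind a binary null model, and not dataeval's own implementation, the sketch below computes the expected metrics of a dummy classifier that predicts the positive class with probability `model_prob`, independently of the true label, on data where the positive class has prevalence `class_prob`. The function name `expected_null_metrics`, the scalar inputs, and this interpretation of the two parameters are assumptions made for illustration.

```python
def expected_null_metrics(class_prob: float, model_prob: float) -> dict:
    """Expected binary metrics for a label-independent random classifier.

    Hypothetical helper (not part of dataeval).
    class_prob: assumed prevalence of the positive class in the data.
    model_prob: assumed probability that the dummy model predicts positive.
    """
    # Expected confusion-matrix cells, as fractions of all samples.
    tp = class_prob * model_prob
    fp = (1.0 - class_prob) * model_prob
    fn = class_prob * (1.0 - model_prob)
    tn = (1.0 - class_prob) * (1.0 - model_prob)

    def safe_div(num: float, den: float) -> float:
        # Return 0.0 when a metric is undefined (empty denominator).
        return num / den if den > 0 else 0.0

    return {
        "precision": safe_div(tp, tp + fp),  # reduces to class_prob
        "recall": safe_div(tp, tp + fn),     # reduces to model_prob
        "fpr": safe_div(fp, fp + tn),        # reduces to model_prob
        "accuracy": tp + tn,
    }


metrics = expected_null_metrics(class_prob=0.2, model_prob=0.5)
```

Under these assumptions, precision collapses to the class prevalence while recall and FPR collapse to the model's positive rate, which is why such baselines are useful as a floor when judging a trained classifier.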