dataeval.core
Core stateless functions for performing dataset, metadata and model evaluation.
Classes
Type definition for Bayes Error Rate bounds output.
Type definition for cluster output.
Pre-calculated statistics for adaptive outlier detection.
Type definition for completeness output.
Type definition for coverage output.
Type definition for divergence output.
Type definition for feature distance output.
Type definition for label error output.
Type definition for label parity output.
Type definition for label statistics output.
Type definition for minimum spanning tree output.
Type definition for normalized mutual information output.
Per-model results for null-model metrics.
Type definition for null model metrics output.
Type definition for parity output.
Type definition for rank output.
Type definition for calculation output.
Functions
Estimate the multi-class Bayes error rate using KNN.
Estimate the multi-class Bayes error rate using a minimum spanning tree.
Use hierarchical clustering on the flattened data and return clustering information.
Combine one or more StatsResults into unified stats, source_index, and dataset_steps.
Measure the dimensional utilization of embeddings.
Compute cluster centers and distance statistics for adaptive outlier detection.
For each sample in data_query, compute the k nearest neighbors in data_fit.
Compute box-to-image ratios from compute_stats() output.
Compute specified statistics on a set of images, optionally within bounding boxes.
Evaluate coverage using an adaptive radius calculation method.
Evaluate coverage using a naive radius calculation method.
Compute difference hash (dHash) for an image.
Compute orientation-invariant difference hash using gradients.
Compute the divergence by counting label disagreements between nearest neighbors.
Compute the divergence by counting "between dataset" edges in the minimum spanning tree.
Determine greatest deviation in metadata features per sample.
Compute a measure of mutual information between metadata factors and flagged sample indices.
Measure the feature-wise distance between two continuous distributions.
Identify potential label errors in a dataset using embedding geometry.
Compute the chi-square statistic to assess label distribution parity.
Compute statistics for data labels.
Compute the minimum spanning tree of a dataset.
Compute normalized mutual information between factors, transformed to lie in [0, 1].
Compute normalized mutual information (NMI) between factors.
Compute accuracy from binary classification results.
Compute FPR (False Positive Rate) from binary classification results.
Compute null model metrics (dummy classifier metrics) for given class distributions.
Compute precision from binary classification results.
Compute recall (True Positive Rate) from binary classification results.
Compute statistical parity using Bias-Corrected Cramér's V.
Compute perceptual hash using Discrete Cosine Transform (DCT).
Compute orientation-invariant perceptual hash using DCT.
Rank samples using HDBSCAN cluster complexity weighting.
Rank samples using distance to HDBSCAN cluster centers.
Rank samples using cluster complexity weighting.
Rank samples using distance to cluster centers.
Rank samples using k-nearest neighbors distance.
Transform RankResult indices using class-balanced selection.
Transform RankResult indices using stratified sampling.
Estimate the empirical mean precision for the upper-bound average precision.
Compute fast non-cryptographic hash using xxHash algorithm.
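
The accuracy, FPR, precision, and recall entries listed above all reduce to the same four confusion-matrix counts. The sketch below illustrates that relationship in plain Python. It is not DataEval's implementation: the function name `binary_metrics`, its arguments, and the choice to return 0.0 for an undefined ratio are assumptions made for illustration only.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Hypothetical helper: accuracy, precision, recall, and FPR from 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Tally the four confusion-matrix cells.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: int, den: int) -> float:
        # Assumption: report 0.0 when the denominator is zero.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),  # true positive rate
        "fpr": ratio(fp, fp + tn),     # false positive rate
    }


# Six samples: 2 true positives, 2 true negatives, 1 false positive, 1 false negative.
metrics = binary_metrics([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
```

For this input, accuracy, precision, and recall each equal 2/3 and the false positive rate is 1/3.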