Acting on Results

DataEval produces diagnostic outputs, not decisions. Every evaluator returns scores, indices, flags, or p-values — but none of them tell you what to do next. That judgment requires understanding what the result means in the context of your program, your data collection constraints, and your performance requirements.

This page maps each category of DataEval output to the decisions it informs and the actions available to you. It is not a procedure to follow linearly. Real datasets present multiple issues simultaneously, and fixing one can reveal or interact with another. Use this page as a reference: when a diagnostic produces a result you need to act on, find the relevant section and understand your options before committing to a course of action.

One framing applies across all of it: DataEval finds evidence of problems. The interpretation is always yours. A high BER upper bound is evidence that class overlap may be irreducible — it is not proof that your requirement is unachievable. A detected duplicate is evidence of potential redundancy — it is not an instruction to delete. Every DataEval output is a prompt to investigate, not a verdict.

Data Integrity findings

Duplicates and near-duplicates

Duplicates returns three groups of indices: exact matches (by xxHash), near-matches (by perceptual hash with D4 variants for rotation/flip invariance), and semantic clusters (by cluster-based grouping). The available actions differ by group.

Exact duplicates are unambiguous. An image appearing twice contributes nothing new to either training or evaluation. In training data, exact duplicates artificially inflate the effective weight of those samples during gradient updates. In test data, they introduce data leakage if the duplicate also appears in training. Remove all but one instance per exact-match group. The exception is when exact duplicates appear with different labels — that is a label error, not a redundancy issue, and requires human review before any removal.

Near-duplicates require judgment. A perceptual hash match means two images are visually similar but not pixel-identical — typically the same image with different compression, slight crop, minor brightness adjustment, or rotation. Whether to remove one depends on whether the variation is meaningful for your task. A slightly brighter version of the same image adds little; the same scene from a marginally different angle may add enough viewpoint diversity to keep. In evaluation datasets, near-duplicates between train and test are the primary concern — they degrade the validity of held-out metrics. Check the near-duplicate groups for cross-split contamination before deciding on removals.

Semantic clusters (from cluster-based detection) identify images that are different in appearance but close in embedding space — similar scene composition, same background type, or same target in very similar configurations. These represent redundancy at a higher level of abstraction. High semantic redundancy in a region of the feature space means training examples are concentrated there at the expense of underrepresented regions. This is where Prioritize becomes relevant: rather than deciding which images to delete, use prioritization to select a subset that preserves coverage while reducing redundancy.

For large datasets with known high redundancy — full motion video, overhead imagery with abundant background, repeated collection passes over the same area — systematic deduplication before any other analysis is worth doing first. It reduces the size of the problem that all subsequent evaluators have to process and removes the noise that redundant samples introduce into coverage and bias metrics. It also helps prevent leakage between dataset splits.

Outliers

Outliers flags images with unusual image statistics (brightness, blur, contrast, size) relative to the distribution of the dataset. The output includes per-metric z-scores and which threshold each flagged image exceeded.

The first question with any outlier is: is this a data quality problem or a legitimate but unusual sample? Low brightness outliers might be genuinely dark operational images (keep) or corrupted/miscalibrated sensor captures (remove). High blur might be motion blur in a class where motion blur matters for the task (keep) or a focus failure (remove).

Systematic outliers — where many images from a particular collection event, location, or sensor appear as outliers together — indicate a collection condition that was not present in the rest of the dataset. Depending on whether that condition represents a real operational scenario, the appropriate response is either to collect more data under similar conditions (to make the unusual condition normal) or to exclude the collection event (if it was a known calibration or setup error).

Isolated outliers — individual unusual images not associated with a pattern — are candidates for manual review. If the image is valid and represents a real scenario, it may be worth keeping but worth flagging as a hard case. If it is a processing artifact or error, remove it.

Outliers in evaluation data deserve extra scrutiny. An outlier in the test set that is unlike anything in the training set will produce a confident wrong prediction — not because the model is bad, but because it was never shown anything like it. This is evidence for expanding the training distribution, not evidence of model failure.

Image statistics

compute_stats() computes per-image and per-channel statistics (mean, standard deviation, entropy, sharpness, and others), with ImageStats flags controlling which statistics are calculated. These are diagnostic inputs to outlier detection and dataset characterization rather than pass/fail outputs on their own.

When the statistics reveal systematic distributional differences between conditions (day vs. night, sensor A vs. sensor B, collection campaign 1 vs. 2), the action is to make those differences visible in your metadata and analyze them explicitly with Balance and Diversity. A dataset where two-thirds of the bright images are from one class and two-thirds of the dark images are from another has a confound between image brightness and class label. That confound will not be visible in aggregate statistics — only in the correlation structure that Balance measures.


Label errors and label statistics

Label error detection (label_errors()) flags samples whose embedding geometry is inconsistent with their assigned label — specifically, samples that are closer to a different class than to their own. The error_rank output is a triage list: the highest-ranked samples are the strongest candidates for mislabeling and should be reviewed first.

The output includes suggested replacement labels derived from rank-weighted voting over the flagged sample’s out-of-class nearest neighbors. Three outcomes are possible for each flagged sample: a single confident suggestion (vote share ≥ 0.4 with a clear winner), an ambiguous pair (top two candidates within 0.2 of each other), or no suggestion (neighborhood too noisy to recommend). No suggestion does not mean the label is correct — it means the neighborhood is mixed enough that a single replacement cannot be recommended with confidence. Human review is required in all cases; the suggestion is a starting point, not a verdict.

The scores array provides the raw intra/extra class distance ratio for every sample, not just those above the threshold. Plotting the score distribution often reveals whether flagged samples cluster sharply above 1.0 (clean dataset with a few clear errors) or form a long tail (systematic mislabeling across a class boundary).

Label statistics (label_stats()) should be run before any other analysis as a basic audit. The empty_image_indices field is the most immediately actionable output: images with no annotations in an object detection dataset are almost always a pipeline error, and training on them tells the model to predict nothing. Verify whether empty images are intentional (background-only images explicitly included as negatives) or accidental (annotation tool failure or images dropped from the labeling queue). The label_counts_per_class and image_counts_per_class outputs provide the input to class imbalance analysis and are the prerequisite for ClassBalance.


Dataset Bias and Coverage findings

Balance and diversity

Balance measures mutual information (normalized to [0,1]) between metadata factors and class labels, and between metadata factors themselves. A high correlation between a metadata factor and class means the model can use that factor as a shortcut to predict class without learning the actual signal of interest.

The action depends on whether the correlation is removable. Some correlations are collection artifacts — if all images of class A were collected in summer and all images of class B in winter, the seasonal metadata factor is spuriously correlated with class. The fix is to collect more data that breaks the confound: class A images in winter, class B images in summer. Other correlations are intrinsic to the task — targets of a certain class genuinely tend to appear at certain sizes or in certain contexts. Those correlations cannot be removed; they can only be acknowledged and accounted for when interpreting model performance on operational data.

High inter-factor correlation (factor-to-factor rather than factor-to-class) indicates redundancy in the metadata itself — two factors are measuring the same underlying variation. This matters when designing metadata collection for future data: capturing highly correlated factors is wasted effort. It also matters for interpreting Parity results, where correlated factors can confound each other.

Diversity measures evenness of sampling across factor values. A low diversity score means sampling is concentrated in a few values — most images have the same lighting condition, the same background type, or the same target aspect. The action is data collection: identify which factor values are underrepresented and target collection toward them. When collection is not possible, the low diversity should be documented as a known limitation and matched against expected operational conditions. If the operational data will also have low diversity in that factor (it always will be daylight, the terrain is always desert), the training imbalance may not matter. If the operational distribution is more diverse than the training distribution, the imbalance is a real risk.

Parity

Parity (experimental) measures statistical dependence between metadata factors and labels using bias-corrected Cramér’s V with a G-test for significance. It is the most rigorous of the three bias metrics, requiring sufficient per-cell sample counts (minimum five per contingency cell) to produce valid results.

The insufficient_data flag in the output identifies factor-label combinations where the sample count requirement was not met. These combinations cannot be assessed for parity — the absence of a flag does not mean parity exists, it means the data was too sparse to test. This is itself actionable: it identifies collection targets where the current dataset has gaps too large to characterize.

Where Parity flags significant dependence, the interpretation is the same as for Balance: determine whether the dependence is a collection artifact (addressable by rebalancing data) or an intrinsic property of the task (documentable as a limitation).

Coverage and completeness

coverage() identifies regions of the embedding space with no reference samples nearby — gaps in the feature space where the model will be operating without having seen similar data during training. completeness() measures the effective dimensionality utilization of the embedding space: whether the dataset spans the available dimensions or is concentrated in a low-dimensional subspace.

Coverage gaps indicate specific underrepresented regions. The uncovered_indices output identifies existing samples nearest to the gaps — the samples at the frontier of current coverage. These are the samples most valuable to use as seeds for targeted collection: find more images that look like them, or find images that would be neighbors to them in embedding space.

Low completeness (effective dimensionality well below the total embedding dimensions) means the dataset is not exploring the full representational space. This can indicate that the dataset is too narrow in its variation — all images from similar conditions, similar viewpoints, similar backgrounds. Increasing diversity on the identified low-utilization dimensions requires understanding what those dimensions correspond to in the embedding model, which is an interpretation task requiring domain knowledge.

Both metrics are contingent on the quality of the embedding. Coverage gaps and completeness scores are only as meaningful as the embedding model is discriminative for your task. See the Embeddings concept page for guidance on embedding selection.


Prioritization: acting on redundancy and coverage together

Prioritize is the primary remediation tool in DataEval — it does not just diagnose, it returns a ranked list of sample indices you can act on directly. Where the other evaluators tell you what is wrong, Prioritize tells you which samples to include in a representative subset or which unlabeled samples to collect and annotate next.

What Prioritize ranks

Prioritize ranks samples by their position in embedding space, from prototypical (dense, central, low distance to neighbors) to atypical (sparse, peripheral, high distance to neighbors). Five ranking methods are available, all operating on embeddings:

  • knn: each sample’s score is its mean distance to its \(k\) nearest neighbors. Simple, fast, no hyperparameters beyond \(k\). Default \(k = \sqrt{n}\).

  • kmeans_distance: distance to assigned k-means cluster centroid. Prototypical samples are close to their centroid; challenging or boundary samples are far.

  • kmeans_complexity: uses the product of intra-cluster and inter-cluster distances to score clusters, then samples from each cluster proportionally. Designed to mitigate class imbalance without requiring labels.

  • hdbscan_distance: distance to assigned HDBSCAN cluster centroid. HDBSCAN discovers cluster structure without requiring a predetermined cluster count, making it more appropriate for datasets with irregular or variable-density clusters.

  • hdbscan_complexity: analogous to kmeans_complexity but using HDBSCAN clustering.

Selecting a policy

The PrioritizeOutput returned by evaluate() stores the raw ranking and computes the final index order lazily based on a policy. Changing policy is cheap — it reorders the stored result without re-running the embedding or ranking. Three policies are available:

difficulty returns samples in direct order of their difficulty score (easy_first or hard_first). Easy-first (prototypical samples first) is appropriate when the dataset is small and the model needs to learn basic concepts before seeing hard cases. Hard-first (atypical samples first) tends to produce better models than easy-first for moderate-sized datasets because hard samples sample the decision boundary rather than the cluster core. Neither policy consistently outperforms random decimation on clean, well-balanced image datasets, however.

stratified bins the score range and draws samples uniformly from each bin, producing a mixture of easy and hard samples rather than concentrating on one end. Across DataEval’s testing this was the most consistent policy for downstream model improvement — it avoids oversampling the dense, redundant core while not ignoring prototypical samples entirely.

class_balanced reorders to ensure equal representation across classes while maintaining priority order within each class. This requires class labels and is most valuable when the initial dataset has extreme class imbalance — where simple difficulty-based pruning would itself worsen the imbalance.

The library’s own recommendation: start with stratified and either knn or kmeans_distance. Neither scoring method consistently outperformed the other across datasets; both outperformed random decimation on several datasets.

Reference-based prioritization for new data

A distinct and important use case is evaluating new, unlabeled data against an existing labeled dataset. Pass the existing labeled dataset as the reference argument:

prioritizer = Prioritize.knn(extractor, k=10, reference=labeled_data)
result = prioritizer.evaluate(new_unlabeled_data)
most_novel = result.hard_first().indices

In hard-first order, the top-ranked samples are those most unlike the reference — the candidates that would add the most new information to the dataset. This is the labeling budget allocation problem: given limited annotation resources, which unlabeled samples are worth labeling? Prioritize with a reference answers that question directly.

What Prioritize does not do

Prioritize does not remove samples — it ranks them. The decision about where to draw the cutoff (how many samples to include in the prioritized subset) is yours, and it depends on your computational budget, target model performance, and downstream evaluation requirements. Prioritize gives you a principled ordering; the threshold is a program decision.

Prioritize also does not account for label quality, metadata correctness, or class distribution in its scoring (unless class_balanced policy is used). A prioritized subset of a dataset with label errors will still contain label errors. Run Duplicates and Outliers before prioritizing to clean the pool being prioritized.


Class balance remediation

ClassBalance is a selection tool, not a diagnostic. It takes a dataset with class frequency information and returns indices for a class-balanced subset, using one of two strategies:

global samples with probability proportional to the inverse square root of class frequency: \(w_c = \max(1, \sqrt{\alpha / f_c})\), where \(f_c\) is the frequency of class \(c\) and \(\alpha\) is a scaling factor. This upweights rare classes without completely equalizing them — it produces a gradient from common to rare rather than forcing uniform counts.

interclass samples equal numbers from each class, completely equalizing class frequency. Background class is excluded from the count and not drawn from.

Use global when you want to reduce imbalance while preserving some of the original frequency information. Use interclass when you need strict balance for evaluation purposes or when extreme imbalance is creating a known performance problem.

ClassBalance addresses sampling imbalance — it helps you select a balanced subset from what you have. If the imbalance exists because certain classes are simply underrepresented in the dataset (not enough images of that class exist), ClassBalance cannot fix that. The only remedy for collection imbalance is collection: acquiring more data for the underrepresented classes.


Performance Limits findings

BER upper and lower bounds

ber_mst() and ber_knn() returns upper and lower bounds on the Bayes Error Rate — the irreducible classification error given the current feature representation. The upper bound is the actionable number: if the BER upper bound exceeds your operational accuracy requirement, no classifier trained on this data in this feature space will meet the requirement.

The BER upper bound exceeds the requirement. This does not mean the task is impossible — it means it is impossible with the current data and feature representation. Three levers are available:

  1. Better embeddings: if the current feature representation conflates classes that should be separable, a better embedding model may achieve lower BER. BER is always contingent on the embedding; recomputing with different embeddings is a cheap way to check whether the representation is the bottleneck.

  2. More discriminative data: if the class overlap is genuine in the current feature space because the data does not contain the signal needed to separate the classes, additional data of the same type will not help. Different data — different sensors, different modalities, different collection conditions — may reduce the irreducible overlap.

  3. Task redefinition: if two classes are genuinely indistinguishable in the available feature space under real operational conditions, the task taxonomy may need revision. This is a program-level decision.

The BER upper and lower bounds are far apart. A wide gap means the estimate is uncertain — the available data is insufficient to tightly characterize the irreducible error. Collect more data, especially in regions of the feature space where class overlap appears high (near decision boundaries), and recompute.

BER applies to classification only. Object detection users: BER and uap() measure classification feasibility without localization error. A favorable BER result does not guarantee that a detection model will meet its requirements — localization error is an additional, separate constraint.

UAP

uap() computes the Upper-bound Average Precision over bounding box crops, treating the problem as classification and removing localization error. The result is a strict upper bound on what any detector can achieve in mAP: your operational detector must also solve localization, so actual mAP will be lower.

A UAP that already fails to meet the requirement means the classification problem itself is infeasible — even with perfect localization, the requirement cannot be met. A UAP that passes gives you headroom to lose to localization error. The gap between UAP and your requirement is the tolerance budget for localization and detection-specific factors.

Sufficiency learning curves

Sufficiency fits a power law \(f(n) = c \cdot n^{-m} + c_0\) to your model’s learning curve. The three parameters tell distinct stories:

\(c_0\) (asymptote) is the most important parameter. It is the performance ceiling: what the model converges to as \(n \to \infty\). If \(c_0\) is below your operational requirement, collecting more data will not close the gap — the model has hit its ceiling under the current data distribution and architecture. The appropriate response is a change in the data (different conditions, different sensors, different label taxonomy) or a change in the model.

\(m\) (exponent) governs how quickly performance improves with additional data. A steep curve (large \(m\)) means you are currently on the steep part of the learning curve and modest data additions will produce measurable gains. A shallow curve (small \(m\)) means you are in the diminishing returns regime.

inv_project(target_performance) returns the number of samples estimated to be required to reach a target performance level. When this returns -1, the target exceeds \(c_0\) and is unachievable under the current conditions regardless of data volume. When it returns a finite number, the result is an estimate with confidence intervals based on how well the power law fits the observed data. The estimate assumes the learning curve continues along the same trajectory — distribution shift, label errors, or changes in model architecture will all violate that assumption.


Distribution Shift findings

Drift detection results

All drift detectors return a binary drifted flag and a p-value (or AUROC for the domain classifier). A drift detection is a signal to investigate, not a signal to immediately retrain.

First response to drift detection: characterize the drift before acting. DriftUnivariate identifies which features drifted. DriftDomainClassifier provides feature importances indicating which features most distinguish the operational batch from the reference. Use this information to understand what changed: a seasonal shift in lighting, a sensor hardware change, a change in the operational scenario, or a change in the target population.

Drift that affects model performance (visible in uncertainty-based drift detection or in ground-truth performance metrics when labels are available) warrants retraining or fine-tuning on operational data. If operational data with ground truth is available, it should be incorporated into the training set. Prioritize the most novel operational samples for labeling using Prioritize with the current training set as reference — the hard-first ranking identifies the operational samples most unlike the training distribution.

Drift that does not affect model performance (feature distribution changed but uncertainty-based detection shows no confidence degradation) may not require retraining. Document the drift and increase monitoring frequency.

Persistent structural drift — where the operational distribution has systematically departed from the training distribution and is not expected to return — is a dataset problem, not a monitoring problem. The training reference needs to be updated to reflect the operational distribution. This is a program lifecycle decision involving data collection, labeling, and model qualification.

OOD detection results

OOD detectors return per-sample scores and a binary flag. The actionable question is not just whether a sample is flagged, but why it is distant from the training distribution.

Isolated OOD samples in a batch that drift detection did not flag (the batch as a whole is not shifted, but individual samples are anomalous) are candidates for human review. They may represent: legitimate novel scenarios the model has never seen (worth collecting more of), sensor or collection errors (remove), or genuinely rare events that are operationally significant (flag for model attention, consider oversampling in retraining).

High OOD rate in a batch that coincides with drift detection signals that the distribution has moved into a region the training data did not cover. The combination of drift detection and OOD scoring provides a more complete picture than either alone: drift detection says the population has shifted; high OOD rate says the new distribution is in a sparse region of the training manifold. This is the scenario most likely to produce confident wrong predictions and most likely to warrant immediate retraining.

OOD detection with reconstruction-based methods provides feature-level error maps that localize where in the image the anomaly is. When per-sample OOD flags are reviewed, use the error maps to understand whether the anomaly is a foreground change (new target type, new object configuration) or a background change (new environmental condition). The former is usually more consequential for model performance.


Diagnosing findings with metadata

When any DataEval evaluator flags samples — whether as outliers, OOD, drifted, or mislabeled — the immediate question is why. The flagged indices are a list of samples; the metadata attached to those samples is often the fastest path to understanding what they have in common.

DataEval provides two metadata diagnostic functions for this purpose: factor_deviation() and factor_predictors(). Both take flagged sample indices as input and operate on the metadata factors (image-level or object-level attributes) that were collected alongside the samples.

Identifying what is unusual about flagged samples

factor_deviation() answers the question: for each flagged sample, which metadata factors deviate most from the reference distribution?

It computes the reference median for each factor, then measures how far each flagged sample deviates from that median. Deviations are scaled asymmetrically: positive deviations are normalized by the median of the positive half of the reference distribution; negative deviations by the median of the negative half. This prevents factors with wide positive tails from dominating factors with wide negative tails.

The output is one dictionary per flagged sample, mapping factor names to their scaled deviation magnitudes, sorted in descending order. The top-ranked factor for a given sample is the metadata dimension that most strongly distinguishes that sample from the reference.

A practical use: after Outliers flags a set of images, run factor_deviation on the flagged indices against the reference metadata from the full dataset. If the top factor for most flagged samples is altitude, that suggests the outliers were collected at unusual altitudes — actionable information for both data collection planning and model limitation documentation.

At least three reference samples are required for a meaningful deviation calculation. The function returns empty dictionaries if fewer reference samples are available.

Identifying which factors predict flagged samples

factor_predictors() answers the question: which metadata factors are most associated with being flagged?

It treats the flagged/not-flagged binary as a target variable and computes the normalized mutual information between each metadata factor and that target, using scikit-learn’s mutual_info_classif. The result is a dictionary mapping each factor name to its normalized mutual information. A score of 0 means the factor carries no information about which samples were flagged; a score near 1 means the factor is predictive of flagging status.

Unlike factor_deviation, which characterizes individual flagged samples, factor_predictors characterizes the flagged set as a whole. It answers: “Is there a systematic metadata reason why these samples were flagged?” This is most useful when there are many flagged samples and you want to understand the population-level pattern rather than inspect each sample individually.

Example workflow after drift detection:

# drift_result.drifted is True; flagged_indices are samples in the drifted batch
flagged = [i for i, score in enumerate(ood_scores) if score > threshold]

# Which factors predict OOD membership?
predictors = factor_predictors(operational_metadata.factors, flagged)
# {'time_of_day': 0.84, 'sensor_id': 0.61, 'altitude': 0.02, ...}
# → time_of_day and sensor_id are the likely drivers

# For the top flagged samples, what specifically is different?
top_flagged = flagged[:20]
deviations = factor_deviation(reference_metadata.factors, operational_metadata.factors, top_flagged)
# deviations[0] → {'time_of_day': 4.2, 'sensor_id': 1.8, 'altitude': 0.3}
# → first flagged sample is 4.2 scaled deviations from reference in time_of_day

Mutual information is association, not causation. A high normalized MI (NMI) score means the factor correlates with being flagged. It does not mean the factor caused the problem, nor does it mean other factors with lower scores are irrelevant. Always interpret NMI scores alongside domain knowledge about what the factors represent and how data was collected.

Both functions require factors to be provided as dictionaries mapping factor names to arrays. If your metadata is stored in DataEval’s Metadata class, the .factors attribute provides this format directly.

Divergence scores

Where drift detectors return a binary signal — drifted or not — HP divergence (divergence()) returns a continuous score between 0 and 1. The two are complementary: drift detection tells you when to act; divergence tells you how much the situation has changed, which is the information needed to calibrate the urgency and scale of the response.

A divergence score near 0 between training and operational data confirms that the model is operating close to its training envelope. Formal drift detection results should be interpreted in this context: a statistically significant drift result with low divergence means a real but small shift — worth documenting and monitoring, but unlikely to require immediate retraining.

A divergence score approaching 1 means the two distributions are nearly separable — the operational data looks fundamentally different from anything in training. Any model performance measurements from training are likely to be poor predictors of operational performance. This is a strong signal for immediate data collection, retraining, or escalation depending on the operational stakes.

Tracking divergence over time is more informative than any single score. A slowly rising trend — each operational batch slightly more diverged from the reference than the last — identifies gradual drift that may never trigger a single-batch hypothesis test but is nonetheless eroding the relevance of the training distribution. If you are running drift monitoring in production, logging divergence scores alongside p-values gives you the trend visibility that p-values alone cannot provide.

Using divergence to localize drift. When a drift detector fires on a large batch, run divergence on subsets — by collection date, sensor, geographic region, or operational condition — to identify which subpopulation is driving the shift. The subset with the highest divergence against the reference is where to focus investigation and targeted data collection. This avoids the multiple-testing penalty of running separate drift tests on each subset.

Pre-deployment gap assessment. Before a model enters service, run divergence between the training set and a sample of expected operational data. A high divergence at this stage — before any monitoring has begun — is evidence that the training distribution was not representative of the deployment environment. This is the point at which the gap is cheapest to close: collect more data from the operational distribution, or document the limitation explicitly in the test report.

Keeping results in context

DataEval’s diagnostics are snapshots. They characterize the dataset at the time of analysis. Datasets change — new data is collected, labels are corrected, splits are revised — and monitoring data changes continuously. Any diagnostic result that drives a significant decision should be recomputed after the relevant changes have been made to verify that the action had the expected effect.

The most important integration is the connection between pre-deployment and operational findings. BER and UAP results from pre-deployment data evaluation establish expected performance bounds. Sufficiency curves establish expected trajectories. When operational monitoring detects drift or OOD conditions, those pre-deployment baselines provide the context for interpreting whether the operational change is within the expected envelope or represents a genuine threat to the established performance bounds.

  • Data Integrity — what the cleaning, label error, and label statistics diagnostics measure

  • Clustering — the algorithm underlying cluster-based Duplicates, Outliers, label_errors, and Prioritize

  • Dataset Bias and Coverage — what the bias diagnostics measure

  • Performance Limits — what BER, UAP, and Sufficiency tell you about achievability

  • Distribution Shift — what drift and OOD detection measure and how detectors work

  • Divergence — the quantitative distance metric covered in the divergence scores section above

  • Embeddings — the representation that most DataEval evaluators depend on

See this in practice

How-to guides

Tutorials