Data Integrity¶
The performance of any machine learning (ML) model is strictly bounded by the quality of its training data. This is the garbage in, garbage out principle: no amount of model sophistication can compensate for training data that is redundant, corrupted, or statistically anomalous. Data Integrity refers to the degree to which a dataset is free from these three categories of noise.
In test and evaluation (T&E) contexts, data integrity failures carry operational consequences that go beyond poor benchmark scores. A model trained on sensor-corrupted imagery may learn to associate artifacts with specific targets. A dataset inflated with near-duplicates from a single collection event may pass validation but fail badly when deployed against a different sensor platform or environmental condition. Catching these problems before training is orders of magnitude cheaper than discovering them after deployment (Polyzotis et al., 2018).
What is it¶
Data Integrity is a comprehensive condition of a dataset, not a single metric. A dataset has high integrity when every sample it contains is loadable, non-redundant, within the expected statistical range, and relevant to the problem domain. Conversely, a dataset has low integrity when it contains a significant proportion of samples that are duplicated, corrupted, or anomalous — even if those samples are technically valid files.
DataEval approaches data integrity through four complementary tools:
Duplicates for redundancy detection, Outliers for anomaly
identification, label_errors() for detecting mislabeled samples, and
label_stats() for auditing label distribution structure. Both
Duplicates and Outliers can operate on raw image statistics;
Outliers can additionally operate on embeddings
from a pre-trained feature extractor, and Duplicates can use
embedding-based clustering as a third detection mode alongside its hash methods.
Taxonomy of data noise¶
Redundancy and duplication¶
A duplicate is any sample that is identical or near-identical to another sample already in the dataset. Near-duplicates include images with minor lighting shifts, JPEG re-compression artifacts, slight crops of the same underlying scene, or rotated and flipped versions of the same image.
The training impact of redundancy is well established. When a sample appears multiple times, the model receives repeated gradient updates from information it has already processed. This creates two compounding problems. First, redundant samples inflate the effective size of the training set without adding new signal, making training less compute-efficient. Second — and more importantly for T&E — duplicate samples in a validation set cause its metrics to be non-representative. If 30% of your validation images are near-duplicates of training images, your reported accuracy is not a valid estimate of out-of-distribution generalization.
Birodkar et al. (2019) demonstrated that large benchmark datasets contain substantial rates of semantic redundancy, and that removing it does not degrade — and in some cases improves — model performance.
This is not an argument against data augmentation. Augmentation and deduplication address different problems and operate at different points in the pipeline. Deduplication is a pre-training step applied to the raw dataset: removing near-duplicates before splitting reduces the risk of accidental train/validation leakage and ensures that held-out metrics reflect genuine generalization. Runtime augmentation — applied to the clean, split dataset during training — introduces controlled variation that the model has not already memorized, which is precisely what improves generalization. The two practices are complementary, not competing.
Statistical outliers and semantic anomalies¶
Not all anomalous samples are the same, and the distinction matters for how you handle them.
Statistical outliers are samples that fall at the extreme edges of a measurable distribution — an extremely overexposed image, a frame with near-zero entropy, a dimension ratio far outside the norm for the dataset. These samples may or may not be semantically valid. A statistically extreme image might be a rare but operationally important edge case, or it might be a broken sensor frame. The statistical test alone cannot tell you which; it can only flag the sample for inspection.
Semantic anomalies are samples that are statistically unremarkable but belong to the wrong problem domain entirely — an image of a domestic animal in an industrial inspection dataset, or a clear daytime frame in a dataset intended for low-light detection. These pass all pixel-level quality checks but actively harm the model by introducing irrelevant class structure. Embedding- based outlier detection catches many semantic anomalies that statistical linting misses, because the samples arrange differently in the embedding space, even when their pixel statistics look normal.
Corruption and sensor artifacts¶
Physical sensors introduce systematic errors that statistical linting is specifically designed to catch. Common examples in computer vision include motion blur from high-speed collection platforms, lens flare from direct sun exposure, dead pixel rows from sensor damage, and blocking artifacts from aggressive JPEG compression in transmission pipelines.
The integrity concern is subtler than simple corruption. A model that trains on a dataset with a consistent rate of blur artifacts may train successfully — but against the wrong distribution. If blur is consistently associated with a particular collection scenario, the model can learn blur as a feature rather than noise, creating a form of sensor-conditioned bias that only manifests when the deployment sensor differs from the collection sensor. In situations, where collection hardware is often upgraded or substituted over multiple phases, this is a practical risk rather than an edge case.
Label noise and mislabeling¶
Image-level and pixel-level integrity checks catch problems with the data itself. A fourth category of integrity failure is quieter and harder to detect: label noise — samples that are correctly captured but incorrectly annotated.
Label noise is common in large-scale annotation pipelines. Annotators make mistakes, instructions are interpreted inconsistently, class boundaries are ambiguous at the margins, and adversarial or rushed labeling produces systematic errors in particular categories. In T&E datasets where labels may be generated from automated pipelines or inherited from third parties, label quality is often assumed rather than verified (Sculley et al., 2015).
The consequences are severe. A mislabeled sample is not just useless — it actively injects false gradient signal during training, pushing the model toward the wrong decision boundary. Clusters of mislabeled samples around class boundaries can shift the learned boundary significantly. In evaluation data, mislabeled samples produce incorrect assessments of model performance: a correct model prediction on a mislabeled sample counts as an error, and a wrong model prediction counts as correct.
label_errors() detects potential mislabeling by examining embedding
geometry. A correctly labeled sample should be closer to other samples of its
own class than to samples of any other class. The metric is the
intra/extra class distance ratio: the mean distance to the \(k\) nearest
neighbors within the same class, divided by the mean distance to the \(k\) nearest
neighbors in any other class. A ratio ≥ 1.0 means the sample is closer to a
different class than to its own — strong evidence of a labeling problem.
Label statistics (label_stats()) address a different but related
concern: structural problems in the label distribution that are not about
correctness but about completeness and consistency. The function counts
label_counts_per_class, image_counts_per_class, label_counts_per_image,
and — critically for object detection — empty_image_indices (images with no
annotations). Empty images are a common annotation error: an image that was
included in the dataset but never labeled, or one where annotators missed all
objects. An unlabeled image trains the model to predict nothing, which is
situation dependent and often wrong.
Theory¶
Information density and gradient efficiency¶
From an information theory perspective, the goal of data integrity work is to maximize the information density of the training set — the ratio of novel signal to total sample count.
Consider a training set of \(N\) samples, of which a fraction \(r\) are near-duplicates of existing samples. The effective number of unique training examples is \(N(1 - r)\). The duplicate samples still contribute gradients during training, but those gradients point in directions the optimizer has already processed. In the best case this wastes compute. In the worse case, on samples that are nearly but not exactly identical, it biases the gradient toward the over-represented region of the input space, effectively re-weighting the loss landscape without the practitioner realizing it.
Duplicate detection: hashing and clustering¶
Duplicates uses three complementary detection approaches, each
suited to a different kind of redundancy.
Exact duplicate detection uses xxHash, a fast non-cryptographic hash of the raw image bytes. Two images with identical xxHash values are guaranteed to be pixel-for-pixel identical. This is the most reliable signal: no threshold tuning is required and there are no false positives.
Near-duplicate detection uses perceptual hashing, where two images that are visually similar but not pixel-identical — due to compression, minor cropping, or brightness adjustment — produce similar hash values. DataEval supports two perceptual hash algorithms:
pHash (perceptual hash): The image is resized to a square \(N \times N\) grid, a discrete cosine transform (DCT) is applied, and the lowest-frequency components are encoded as a bit array relative to the median coefficient value. The result is a compact hex string that is robust to minor photometric perturbations. Zauner (2010) provides the foundational treatment.
dHash (difference hash): Horizontal adjacent-pixel differences are computed on a downsampled image and encoded as a bit string. This approach is particularly robust to brightness and contrast shifts.
Both algorithms are available in D4 variants (phash_d4, dhash_d4) that
apply the hash across all eight orientations of the dihedral group (four
rotations × two reflections) and take the minimum, producing a hash invariant
to 90°/180°/270° rotations and horizontal/vertical flips. This matters for
datasets where images may be re-oriented during ingestion.
Two images are considered near-duplicates when the Hamming distance between
their hash values falls below a configurable threshold. Groups detected by
multiple methods carry higher confidence. When both basic and D4 hashes are
computed, the orientation column in the duplicates DataFrame is automatically
set to "same" (detected by basic hashes) or "rotated" (detected only by D4
hashes), letting practitioners distinguish the two cases.
Cluster-based detection is an optional third mode that operates in
embedding space rather than pixel space. When a feature extractor
and a cluster_sensitivity are provided, images are projected into embedding
space, clustered, and pairs whose embeddings fall within the threshold distance
are treated as near-duplicates. Because embeddings are approximate
representations, cluster-based matches are always reported as near rather than
exact duplicates, even when their embedding distance is zero. This mode catches
semantic duplicates — distinct photographs of the same object or scene that
are not similar at the pixel level but occupy the same region of embedding
space.
Outlier detection: image statistics and embeddings¶
Outliers identifies anomalous samples through two independent paths
that can be used separately or together.
Image statistics-based detection computes pixel, visual, and dimension
statistics for each image using compute_stats() with
ImageStats flags, then applies a statistical threshold test to each
metric distribution to flag samples at the extremes. Three tests are available:
The z-score method measures how many standard deviations a sample’s value \(x_i\) lies from the distribution mean \(\mu\):
Samples exceeding the threshold (default: 3.0) are flagged. This method works well for roughly normal distributions but is sensitive to the influence of existing extreme values on the mean and standard deviation.
The modified z-score method substitutes the median \(\tilde{x}\) and median absolute deviation (MAD) for the mean and standard deviation, making it robust to that influence:
The constant 0.6745 is the 75th percentile of the standard normal distribution, chosen so that the modified z-score is on the same scale as the standard z-score for normally distributed data. The default threshold is 3.5.
The interquartile range (IQR) method flags samples whose distance from the nearest quartile boundary exceeds a multiple of the IQR:
The default threshold of 1.5 corresponds to the standard Tukey fence. This method is the most robust to extreme values and requires no distributional assumptions.
Each statistical test operates independently on each metric. A sample is flagged if it exceeds the threshold on any single metric. The output records both which metric triggered the flag and the measured value, so practitioners can distinguish a brightness outlier from a dimension outlier and prioritize accordingly.
Cluster-based detection operates in embedding space. When a feature
extractor is provided, images are projected into embedding space and clustered.
For each sample, the distance to its nearest cluster center is computed and
expressed as a number of standard deviations from that cluster’s mean
intra-cluster distance. Samples exceeding the cluster_threshold (default: 2.5)
are flagged with a cluster_distance metric value. This path catches semantic
anomalies — samples that look statistically normal at the pixel level but do
not belong to any established class or scene type in the dataset.
Both detection paths can run simultaneously and their results are merged into a single output DataFrame. A sample flagged by both paths warrants immediate inspection.
Image statistics as a linting vocabulary¶
compute_stats() is the underlying computation engine for image
statistics in DataEval. It accepts any iterable of images or a MAITE-compliant
Dataset and computes whichever statistics are requested via the
ImageStats flag parameter. ImageStats is a selector, not a
processor — it is a Flag enum whose values you combine with bitwise OR to
specify which statistics you want, and compute_stats handles the rest.
The four flag categories and their integrity signals:
Category |
Key statistics |
Integrity signal |
|---|---|---|
|
Mean, std, entropy, skewness, kurtosis |
Corrupted or near-empty frames; distribution anomalies |
|
Brightness, contrast, sharpness, darkness |
Exposure and focus failures; sensor artifacts |
|
Width, height, aspect ratio, channels, bit depth |
Incorrectly resized, cropped, or re-encoded samples |
|
xxHash, pHash, dHash (+ D4 variants) |
Exact and near-duplicate detection (used by |
Some flags have dependencies that are resolved automatically: requesting
PIXEL_ENTROPY also enables PIXEL_HISTOGRAM (which entropy requires);
requesting VISUAL_BRIGHTNESS, VISUAL_CONTRAST, or VISUAL_DARKNESS
also enables VISUAL_PERCENTILES.
For object detection datasets, compute_stats can compute statistics
separately for each bounding box (per_target=True, the default when boxes are
present) and for the full image (per_image=True). Setting per_channel=True
produces per-channel breakdowns rather than aggregating across channels. The
source_index field in the output identifies whether each row corresponds to
a full image, a specific bounding box within an image, or a specific channel.
The default configuration for Outliers uses DIMENSION | PIXEL | VISUAL, covering the full space of sensor-level integrity failures without
hash computation (which belongs to Duplicates).
Label error detection: embedding geometry¶
label_errors() operates on embeddings and class labels. For each sample,
it computes:
where \(\bar{d}_{\text{intra},i}\) is the mean distance from sample \(i\) to its
\(k\) nearest neighbors within the same class, and \(\bar{d}_{\text{extra},i}\)
is the mean distance to its \(k\) nearest neighbors in any other class (\(k\)
defaults to 50, capped at min_class_size - 1).
The output contains three fields:
errors: a dictionary mapping sample index to (original_label, [suggested_labels])
for all samples with score ≥ 1.0. Suggested labels are derived from
rank-weighted voting over the \(k\) nearest out-of-class neighbors — closer
neighbors receive higher weight. The suggestion logic applies two thresholds:
if the top candidate’s vote share is below min_confidence (default 0.4), no
suggestion is returned; if the margin between the top two candidates is smaller
than ambiguity_threshold (default 0.2), both are returned as a tie.
error_rank: all sample indices sorted by descending score — a triage
list for human review, starting with the samples most likely to be mislabeled.
scores: the raw distance ratio for every sample, useful for setting custom
thresholds or examining the distribution of scores.
A score well above 1.0 is stronger evidence of mislabeling than a score just
at the boundary. The error_rank allows reviewers to prioritize the highest-
confidence detections and stop reviewing when the confidence drops below a
practical threshold.
When to use it¶
Data Integrity assessment is appropriate at three points in the data lifecycle.
Before any training run. Running the full integrity pipeline on a new or merged dataset is standard practice before committing compute to training. Integrity failures caught here cost a data review. Integrity failures caught after a failed training run cost a training run plus a data review.
After merging datasets. Near-duplicate rates spike when datasets from different collection events or public sources are combined without deduplication. A dataset assembled from three sources with 10% internal redundancy each can easily reach 25–30% redundancy at the merged level.
When evaluation metrics look suspicious. Anomalously high validation accuracy — especially accuracy that degrades sharply in operational testing — is a common symptom of validation set contamination by training-set duplicates. If your held-out metrics seem too good, run deduplication across the train/validation split boundary.
To better understand what to do after running an assessment, review the Data Integrity section in the Acting on Results explanation page.
Limitations¶
Statistical linting and perceptual hashing are effective at catching the integrity failures they were designed to detect, but both have blind spots practitioners should understand.
Hash-based detection will not catch semantic duplicates — two different photographs of the same object taken from different angles or lighting conditions that are genuinely different images at the pixel level. Cluster-based duplicate detection in embedding space addresses this, but requires a feature extractor whose embedding space is meaningful for the target domain.
Image statistics-based outlier detection is only sensitive to pixel-level and dimensional anomalies. A semantically incorrect image with normal brightness, contrast, and dimensions will not be flagged by statistical tests. Cluster-based outlier detection addresses this case, again contingent on embedding quality.
Neither approach addresses label quality and fails to detect label errors —
images that are correctly exposed, non-duplicated, and statistically normal but
assigned the wrong class label. To understand label quality, the functions
label_stats() and {func} .label_errors have to be run independently.
The statistical outlier tests flag samples relative to the distribution of the dataset being evaluated. Results are therefore dataset-dependent: adding or removing samples changes what counts as an outlier. When comparing results across dataset versions, re-run the full analysis rather than assuming prior flags remain valid.
Related concept pages¶
Clustering — the underlying algorithm used by Duplicates (cluster mode), Outliers (cluster mode), and label_errors
Dataset Bias and Coverage — when your data is clean but still unrepresentative
Distribution Shift — when your data was clean at training time but the deployment distribution has changed
Acting on Results — how to prioritize remediation across integrity, bias, and performance findings
See this in practice¶
How-to guides¶
Tutorials¶
Data cleaning tutorial — end-to-end walkthrough of integrity assessment on a realistic dataset
References¶
Birodkar, V., Mobahi, H., & Bengio, S. (2019). Semantic redundancy in image classification datasets. arXiv preprint arXiv:1901.11409. paper
Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2018). Data management challenges in production machine learning. In Proceedings of the 2017 ACM SIGMOD International Conference on Management of Data (pp. 1723–1726). paper
Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhuri, V., Young, M., Crespo, J.-F., & Dennison, D. (2015). Hidden technical debt in machine learning systems. In Advances in Neural Information Processing Systems (Vol. 28). paper
Zauner, C. (2010). Implementation and benchmarking of perceptual image hash functions. Bachelor’s thesis, Upper Austria University of Applied Sciences. thesis