dataeval.quality.Duplicates

class dataeval.quality.Duplicates(only_exact=False)

Finds duplicate images using non-cryptographic and perceptual hashing.

Detects both exact duplicates (identical pixel values) using xxhash non-cryptographic hashing and near duplicates (visually similar) using perceptual hashing with discrete cosine transform. Supports analysis of single datasets or cross-dataset duplicate detection.
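As a rough illustration of the exact-duplicate idea described above, the sketch below groups images whose raw pixel bytes produce the same digest. It is not the library's implementation: `hashlib.blake2b` from the standard library stands in for xxhash, and `group_exact_duplicates` and the toy pixel lists are hypothetical names made up for this example.

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(images):
    """Group indices of images whose pixel bytes hash to the same digest."""
    buckets = defaultdict(list)
    for index, pixels in enumerate(images):
        # blake2b is a stand-in for xxhash (an assumption made for this sketch);
        # any stable digest of the pixel bytes can reveal identical content.
        digest = hashlib.blake2b(bytes(pixels), digest_size=8).hexdigest()
        buckets[digest].append(index)
    # Only buckets holding two or more indices form a duplicate group.
    return sorted(group for group in buckets.values() if len(group) > 1)


# Toy "images": flat sequences of 8-bit pixel values.
images = [
    [0, 10, 20, 30],
    [5, 5, 5, 5],
    [0, 10, 20, 30],  # identical to index 0
    [0, 10, 20, 31],  # one pixel value differs, so not an exact duplicate
    [5, 5, 5, 5],     # identical to index 1
]
exact_groups = group_exact_duplicates(images)
print(exact_groups)
```

Because a single changed pixel value yields a different digest, this kind of hashing cannot catch near duplicates, which is where perceptual hashing comes in.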

Perceptual hashing identifies visually similar images that may differ in compression, resolution, or minor modifications while maintaining the same visual content structure.
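The following is a minimal sketch of a DCT-based perceptual hash, written only to make the idea above concrete. It assumes a small square grayscale image given as a list of rows, keeps the lowest-frequency coefficients, and thresholds them at their median. The function names, the `keep` parameter, and the rounding step are choices made for this sketch and are not taken from the library's pchash implementation.

```python
import math
from statistics import median


def low_freq_dct(image, keep=4):
    """Return the keep x keep lowest-frequency 2-D DCT-II coefficients."""
    n = len(image)  # assumes a square grayscale image given as a list of rows
    # basis[k][i] = cos(pi * (2i + 1) * k / (2n)), the unnormalized DCT-II basis
    basis = [
        [math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)]
        for k in range(keep)
    ]
    return [
        [
            sum(
                image[y][x] * basis[u][y] * basis[v][x]
                for y in range(n)
                for x in range(n)
            )
            for v in range(keep)
        ]
        for u in range(keep)
    ]


def perceptual_hash(image, keep=4):
    """Pack one bit per low-frequency coefficient: 1 if above the median."""
    # Rounding suppresses floating-point noise in coefficients that are
    # analytically zero, so that noise cannot flip bits.
    coeffs = [round(c, 6) for row in low_freq_dct(image, keep) for c in row]
    threshold = median(coeffs)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | int(c > threshold)
    return bits


def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


gradient = [[x for x in range(8)] for _ in range(8)]
rescaled = [[2 * p for p in row] for row in gradient]  # same structure, different pixel values
mirrored = [row[::-1] for row in gradient]             # different visual structure

print(hamming(perceptual_hash(gradient), perceptual_hash(rescaled)))
print(hamming(perceptual_hash(gradient), perceptual_hash(mirrored)))
```

The rescaled image differs from the original in every non-zero pixel, so an exact hash would treat it as distinct, yet its perceptual hash is unchanged because all coefficients scale together. In practice near duplicates are usually accepted when the Hamming distance falls under some small threshold rather than only at zero.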

Parameters:

only_exact : bool, default False: Whether to detect only exact pixel-level duplicates using xxhash. When True, skips near-duplicate computation for faster processing and lower memory usage. When False (the default), detects both exact and near duplicates.

stats

Hash statistics computed during the last evaluate() call. Contains xxhash and pchash values for all processed images.

Type: CalculationResult

only_exact

Whether detection is restricted to exact duplicates (True) or also includes near duplicates (False).

Type: bool

Examples

End-to-end detection: compute hashes + find duplicates

>>> detector = Duplicates()
>>> result = detector.evaluate(dataset)

Reuse pre-computed statistics for efficiency

>>> result = detector.from_stats(hashes1)

Fast exact-only detection for large datasets

>>> fast_detector = Duplicates(only_exact=True)
>>> result = fast_detector.evaluate(duplicate_images)

evaluate(data, *, per_image=True, per_target=True)

Find duplicates by computing hashes and analyzing for duplicate groups.

Performs end-to-end duplicate detection by computing hash statistics for the provided dataset and then identifying duplicate groups. Separates item-level duplicates (full images/videos) from target-level duplicates (bounding boxes/detections). Stores computed hash statistics in the stats attribute for reuse.

Parameters:

data : Dataset[ArrayLike] or Dataset[tuple[ArrayLike, Any, Any]]: Dataset of images in array format. Can be an image-only dataset or a dataset whose items are tuples with additional elements (labels, metadata). Images should be in standard array format (C, H, W).
per_image : bool, default True: Whether to compute hashes for full items (images/videos). When True, item-level duplicates will be detected.
per_target : bool, default True: Whether to compute hashes for individual targets/detections. When True and targets are present, target-level duplicates will be detected. Has no effect for datasets without targets.

Returns:

Duplicate detection results with separate item and target duplicate groups.

items: Contains item-level duplicates (indices are simple integers)
targets: Contains target-level duplicates (indices are SourceIndex objects), or None if per_target=False or no targets present

Return type:

DuplicatesOutput
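To make the item/target split in the return value more concrete, here is a small sketch of how hash records might be bucketed separately for full images and for individual targets. Everything in it is hypothetical: `HashRecord`, `split_duplicates`, and the string digests are illustrative stand-ins and do not reflect the actual layout of CalculationResult or SourceIndex.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class HashRecord(NamedTuple):
    """Hypothetical record: one hash per full image or per target box."""
    image: int             # index of the image in the dataset
    target: Optional[int]  # index of the box within the image, None for the full image
    digest: str            # hash value (toy strings here)


def split_duplicates(records):
    """Return (item_groups, target_groups) of records sharing a digest."""
    item_buckets = defaultdict(list)
    target_buckets = defaultdict(list)
    for record in records:
        if record.target is None:
            # Full-image hashes are indexed by a plain integer.
            item_buckets[record.digest].append(record.image)
        else:
            # Target hashes are indexed by an (image, target) pair.
            target_buckets[record.digest].append((record.image, record.target))
    items = sorted(g for g in item_buckets.values() if len(g) > 1)
    targets = sorted(g for g in target_buckets.values() if len(g) > 1)
    return items, targets


records = [
    HashRecord(0, None, "i0"),
    HashRecord(1, None, "i1"),
    HashRecord(2, None, "i1"),  # same full-image hash as image 1
    HashRecord(0, 0, "t0"),
    HashRecord(0, 1, "t0"),     # same crop hash as box 0 of image 0
    HashRecord(1, 0, "t1"),
]
items, targets = split_duplicates(records)
print(items)
print(targets)
```

Keeping the two buckets apart means a full image is never reported as a duplicate of a cropped target, which matches the separate items and targets fields described above.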

Examples

Item-level duplicates only (default for non-OD datasets):

>>> detector = Duplicates()
>>> result = detector.evaluate(duplicate_images)
>>> print(result.items.exact)
[[3, 20], [16, 37]]
>>> print(result.targets.exact)
None

Target-level duplicates only:

>>> result = detector.evaluate(od_dataset, per_image=False, per_target=True)
>>> print(result.items.exact)
None
>>> print(result.targets.exact)
[[SourceIndex(0, 0), SourceIndex(0, 1)], [SourceIndex(1, 0), SourceIndex(1, 1), SourceIndex(1, 2)]]

Access computed hashes for reuse:

>>> saved_stats = detector.stats

from_clusters(cluster_result)

Find duplicates using cluster-based detection from a minimum spanning tree.

Analyzes the minimum spanning tree and cluster assignments to identify exact and near duplicates based on distance relationships within clusters. This method is particularly effective for finding semantic or visual duplicates in image embeddings.

Parameters:

cluster_result : ClusterResult: Clustering results from the cluster() function, containing the minimum spanning tree (mst) and cluster assignments needed for duplicate detection.

Returns:

Duplicate detection results with item-level duplicate groups. Cluster-based detection operates on items only (no target separation).

Return type:

DuplicatesOutput

Notes

This method uses distance-based criteria within clusters to identify duplicates:

Exact duplicates: Points at zero distance in the MST
Near duplicates: Points within cluster-specific distance thresholds

Unlike hash-based duplicate detection (from_stats/evaluate), cluster-based detection identifies duplicates in embedding space, which can capture semantic or visual similarity rather than pixel-level equality.

The only_exact parameter set during initialization controls whether near duplicates are computed. Set only_exact=True for faster processing when only exact duplicates are needed.

Cluster-based detection returns item-level duplicates only. The targets field will always be None since clustering operates on the embedding level.
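The notes above can be sketched in a simplified form: link points joined by zero-length MST edges into exact groups, and points joined by short edges into near groups. This sketch makes two assumptions that differ from the description: it uses a single global `near_threshold` instead of cluster-specific thresholds, and its near groups also include any exact duplicates linked into them. The edge format `(i, j, distance)` and the function name are hypothetical.

```python
def group_by_mst_edges(n_points, mst_edges, near_threshold):
    """Return (exact_groups, near_groups) from MST edges (i, j, distance)."""

    def connected_groups(edges):
        # Union-find over the points, merging the endpoints of each kept edge.
        parent = list(range(n_points))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path halving
                i = parent[i]
            return i

        for i, j in edges:
            parent[find(i)] = find(j)

        groups = {}
        for i in range(n_points):
            groups.setdefault(find(i), []).append(i)
        # Singletons are not duplicates of anything.
        return sorted(g for g in groups.values() if len(g) > 1)

    # Exact duplicates: endpoints of zero-distance edges.
    exact = connected_groups([(i, j) for i, j, d in mst_edges if d == 0.0])
    # Near duplicates: endpoints of edges within the (assumed global) threshold.
    near = connected_groups([(i, j) for i, j, d in mst_edges if d <= near_threshold])
    return exact, near


# Toy MST over six embedded points.
mst_edges = [(0, 1, 0.0), (1, 2, 0.05), (2, 3, 0.9), (3, 4, 0.0), (4, 5, 0.7)]
exact, near = group_by_mst_edges(6, mst_edges, near_threshold=0.1)
print(exact)
print(near)
```

Because only MST edges are inspected, the grouping is cheap, but two points are linked only through chains of short edges rather than by their direct pairwise distance.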

See also

dataeval.core.cluster: Function to compute clusters from embeddings
from_stats: Find duplicates from pre-computed hash statistics
evaluate: Find duplicates by computing hashes from images

from_stats(stats: dataeval.core._calculate.CalculationResult) → DuplicatesOutput

from_stats(stats: collections.abc.Sequence[dataeval.core._calculate.CalculationResult]) → DuplicatesOutput

Find duplicates from pre-computed hash statistics.

Analyzes previously computed hash values to identify duplicate groups without recomputing hashes. Separates item-level duplicates (full images/videos) from target-level duplicates (bounding boxes/detections). Supports both single dataset and cross-dataset duplicate detection.

Parameters:

stats : CalculationResult or Sequence[CalculationResult]

Hash statistics from calculate() with ImageStats.HASH. Must include source_index information to distinguish items from targets.

Single CalculationResult: within-dataset duplicate detection
Sequence of CalculationResults: cross-dataset duplicate detection

Returns:

Duplicate detection results with separate item and target duplicate groups.

items: Contains item-level duplicates (indices are simple integers for single dataset, DatasetItemTuple objects for cross-dataset)
targets: Contains target-level duplicates (indices are SourceIndex objects for single dataset, DatasetItemTuple objects for cross-dataset), or None if no targets were processed

For a single dataset: indices are simple integers or SourceIndex objects
For multiple datasets: indices are DatasetItemTuple objects with dataset_id and id fields

Return type:

DuplicatesOutput
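The cross-dataset indexing described above can be illustrated with a short sketch that pools hashes from several datasets and tags each entry with the dataset it came from. `DatasetItem` is a hypothetical stand-in that only mirrors the dataset_id and id fields mentioned for DatasetItemTuple; the function name and the toy string hashes are likewise assumptions.

```python
from collections import defaultdict
from typing import NamedTuple


class DatasetItem(NamedTuple):
    """Hypothetical stand-in mirroring the dataset_id and id fields."""
    dataset_id: int
    id: int


def cross_dataset_groups(hash_lists):
    """Group items across datasets that share a hash value."""
    buckets = defaultdict(list)
    for dataset_id, hashes in enumerate(hash_lists):
        for item_id, value in enumerate(hashes):
            # Tag each entry with its dataset so indices stay unambiguous.
            buckets[value].append(DatasetItem(dataset_id, item_id))
    return sorted(group for group in buckets.values() if len(group) > 1)


hashes_a = ["aa", "bb", "cc", "bb"]  # dataset 0
hashes_b = ["dd", "cc", "cc"]        # dataset 1
groups = cross_dataset_groups([hashes_a, hashes_b])
print(groups)
```

A group may stay within one dataset or span several, which is why every index carries its dataset identifier in the cross-dataset case.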

Examples

Single dataset - item-level duplicates only:

>>> detector = Duplicates()
>>> stats = calculate(duplicate_images, None, ImageStats.HASH, per_image=True, per_target=False)
>>> result = detector.from_stats(stats)
>>> print(result.items.exact)
[[3, 20], [16, 37]]
>>> print(result.targets.exact)
None

Single dataset - both item and target duplicates:

>>> stats = calculate(od_dataset, None, ImageStats.HASH, per_image=True, per_target=True)
>>> result = detector.from_stats(stats)
>>> print(result.items.exact)
[[1, 2]]
>>> print(result.targets.exact)
[[SourceIndex(0, 0), SourceIndex(0, 1)], [SourceIndex(1, 0), SourceIndex(1, 1), SourceIndex(1, 2)]]

Cross-dataset duplicate detection:

>>> stats1 = calculate(duplicate_images, None, ImageStats.HASH)
>>> stats2 = calculate(od_dataset, None, ImageStats.HASH)
>>> result = detector.from_stats([stats1, stats2])
>>> print(result.items.exact)
[[(0, 3), (0, 20)], [(0, 5), (1, 1), (1, 2)], [(0, 16), (0, 37)]]