dataeval.detectors.linters.Duplicates

class dataeval.detectors.linters.Duplicates(only_exact=False)

Finds duplicate images using non-cryptographic and perceptual hashing.

Detects both exact duplicates (identical pixel values) using xxhash non-cryptographic hashing and near duplicates (visually similar) using perceptual hashing with discrete cosine transform. Supports analysis of single datasets or cross-dataset duplicate detection.
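To make the exact-duplicate idea concrete, here is a minimal sketch of grouping images by a hash of their raw pixel bytes. It is an illustration only, not the library's implementation: `exact_duplicate_groups` and `toy_images` are hypothetical names, and the standard-library `hashlib.blake2b` stands in for xxhash so that the snippet has no third-party dependency.

```python
import hashlib
from collections import defaultdict


def exact_duplicate_groups(images: list[bytes]) -> list[list[int]]:
    """Group indices of byte-identical images; images without a twin are dropped."""
    by_digest: dict[bytes, list[int]] = defaultdict(list)
    for index, pixels in enumerate(images):
        # Assumption: blake2b is a stdlib stand-in; the detector itself uses xxhash.
        digest = hashlib.blake2b(pixels, digest_size=8).digest()
        by_digest[digest].append(index)
    return [group for group in by_digest.values() if len(group) > 1]


# Three toy "images" as raw pixel bytes; the first and third are identical.
toy_images = [bytes([0, 10, 20, 30]), bytes([0, 10, 20, 31]), bytes([0, 10, 20, 30])]
print(exact_duplicate_groups(toy_images))  # [[0, 2]]
```

A single changed pixel value (30 versus 31 above) produces a different digest, which is why this kind of hash only finds pixel-identical copies.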

Perceptual hashing identifies visually similar images that may differ in compression, resolution, or minor modifications while maintaining the same visual content structure.
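The general shape of a DCT-based perceptual hash can be sketched in a few lines of pure Python. This is a generic textbook-style illustration under stated assumptions, not dataeval's pchash: the 8x8 input size, the 4x4 low-frequency block, the median threshold, and the names `dct_2d`, `perceptual_hash`, and `hamming` are all choices made for this example.

```python
import math


def dct_2d(image: list[list[float]]) -> list[list[float]]:
    """Unnormalized 2D DCT-II of a square grayscale image (naive, for clarity)."""
    n = len(image)
    # cos_table[k][i] is the DCT-II basis value for frequency k at position i.
    cos_table = [
        [math.cos((2 * i + 1) * k * math.pi / (2 * n)) for i in range(n)]
        for k in range(n)
    ]
    return [
        [
            sum(
                image[y][x] * cos_table[v][y] * cos_table[u][x]
                for y in range(n)
                for x in range(n)
            )
            for u in range(n)
        ]
        for v in range(n)
    ]


def perceptual_hash(image: list[list[float]], keep: int = 4) -> int:
    """Pack the low-frequency DCT block into an integer, one bit per coefficient."""
    coeffs = dct_2d(image)
    # Keep the keep x keep low-frequency corner, skipping the DC term at [0][0].
    low = [coeffs[v][u] for v in range(keep) for u in range(keep)][1:]
    median = sorted(low)[len(low) // 2]
    bits = 0
    for value in low:
        bits = (bits << 1) | (value > median)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


# A toy 8x8 image and a copy with every pixel value doubled.
toy = [[(3 * x + 5 * y * y) % 17 for x in range(8)] for y in range(8)]
brighter = [[2 * p for p in row] for row in toy]
print(hamming(perceptual_hash(toy), perceptual_hash(brighter)))  # 0
```

Doubling every pixel scales all DCT coefficients and their median by the same factor, so each bit is unchanged and the Hamming distance is 0, even though a byte-level hash of the two images would differ. Near duplicates are typically those whose hashes match or differ in only a few bits.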

Parameters:

only_exact : bool, default False – Whether to detect only exact pixel-level duplicates using xxhash. When True, near duplicate computation is skipped, which is faster and uses less memory. When False (the default), both exact and near duplicates are detected.

stats

Hash statistics computed during the last evaluate() call. Contains xxhash and pchash values for all processed images.

Type: HashStatsOutput

only_exact

Configuration for duplicate detection scope.

Type: bool

Examples

End-to-end detection: compute hashes + find duplicates

>>> detector = Duplicates()
>>> result = detector.evaluate(dataset)

Reuse pre-computed hashes for efficiency

>>> result = detector.from_stats(hashes1)

Fast exact-only detection for large datasets

>>> fast_detector = Duplicates(only_exact=True)
>>> result = fast_detector.evaluate(duplicate_images)

evaluate(data)

Find duplicates by computing hashes and analyzing for duplicate groups.

Performs end-to-end duplicate detection by computing hash statistics for the provided dataset and then identifying duplicate groups. Stores computed hash statistics in the stats attribute for reuse.

Parameters:

data : Dataset[ArrayLike] or Dataset[tuple[ArrayLike, Any, Any]] – Dataset of images in array format. Can be an image-only dataset or a dataset whose items are tuples with additional elements (labels, metadata). Images should be arrays in (C, H, W) layout.

Returns:

Duplicate detection results with exact and near duplicate groups as lists of image indices within the dataset.

Return type:

DuplicatesOutput[DuplicateGroup]

Examples

Basic duplicate detection:

>>> detector = Duplicates()
>>> result = detector.evaluate(duplicate_images)

>>> print(f"Exact duplicates: {result.exact}")
Exact duplicates: [[3, 20], [16, 37]]

>>> print(f"Near duplicates: {result.near}")
Near duplicates: [[3, 20, 22], [12, 18], [13, 36], [14, 31], [17, 27], [19, 38, 47]]

Access computed hashes for reuse

>>> saved_stats = detector.stats
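A common follow-up is deciding which images to remove. The sketch below is a hypothetical helper, not part of dataeval: it assumes the groups are plain lists of integer indices, as printed in the example output above, and keeps the lowest index in each group.

```python
def indices_to_drop(groups: list[list[int]]) -> list[int]:
    """Keep the lowest index of every group and return the rest, sorted."""
    drop: set[int] = set()
    for group in groups:
        drop.update(sorted(group)[1:])
    return sorted(drop)


# Groups copied from the example output above.
exact = [[3, 20], [16, 37]]
near = [[3, 20, 22], [12, 18], [13, 36], [14, 31], [17, 27], [19, 38, 47]]

print(indices_to_drop(exact))         # [20, 37]
print(indices_to_drop(exact + near))  # [18, 20, 22, 27, 31, 36, 37, 38, 47]
```

Whether near duplicates should be dropped at all depends on the use case; reviewing those groups by eye before filtering is usually the safer choice.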

from_stats(hashes: dataeval.outputs.HashStatsOutput) → dataeval.outputs.DuplicatesOutput[dataeval.outputs._linters.DuplicateGroup]

from_stats(hashes: collections.abc.Sequence[dataeval.outputs.HashStatsOutput]) → dataeval.outputs.DuplicatesOutput[dataeval.outputs._linters.DatasetDuplicateGroupMap]

Find duplicates from pre-computed hash statistics.

Analyzes previously computed hash values to identify duplicate groups without recomputing hashes. Supports both single dataset analysis and cross-dataset duplicate detection across multiple hash outputs.

Parameters:

hashes : HashStatsOutput or Sequence[HashStatsOutput] – Hash statistics produced by the hashstats function. Pass a single HashStatsOutput to find duplicates within one dataset, or a sequence to find duplicates across datasets.

Returns:

DuplicatesOutput[DuplicateGroup] – When a single HashStatsOutput is provided. Contains exact and near duplicate groups as lists of image indices within the dataset.
DuplicatesOutput[DatasetDuplicateGroupMap] – When a sequence is provided. Each group maps dataset indices to lists of image indices, enabling cross-dataset duplicate identification.

Raises:

TypeError – If hashes is not a HashStatsOutput or a Sequence[HashStatsOutput].

Examples

Single dataset duplicate detection:

>>> detector = Duplicates()
>>> result = detector.from_stats(hashes1)
>>> print(f"Exact duplicates: {result.exact}")
Exact duplicates: [[3, 20]]

>>> print(f"Near duplicates: {result.near}")
Near duplicates: [[3, 20, 22], [12, 18]]

Cross-dataset duplicate detection:

>>> result = detector.from_stats([hashes1, hashes2])
>>> print(f"Exact duplicates: {result.exact}")
Exact duplicates: [{0: [3, 20]}, {0: [16], 1: [12]}]
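In the cross-dataset output above, a group with a single key lies entirely within one dataset, while a group with several keys spans datasets. The following sketch separates the two cases. It assumes each group behaves like a plain dict, as in the printed output; `spans_datasets` and `as_pairs` are hypothetical helper names, not dataeval functions.

```python
def spans_datasets(group: dict[int, list[int]]) -> bool:
    """True when the duplicate group has members in more than one dataset."""
    return len(group) > 1


def as_pairs(group: dict[int, list[int]]) -> list[tuple[int, int]]:
    """Flatten a group into (dataset_index, image_index) pairs."""
    return [(dataset, image) for dataset, images in group.items() for image in images]


# Groups copied from the example output above.
groups = [{0: [3, 20]}, {0: [16], 1: [12]}]

cross = [group for group in groups if spans_datasets(group)]
print(cross)               # [{0: [16], 1: [12]}]
print(as_pairs(cross[0]))  # [(0, 16), (1, 12)]
```

Here image 16 of the first dataset and image 12 of the second are exact copies of each other, which is the kind of overlap to look for when checking for leakage between, say, training and test splits.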