dataeval.detectors.linters.Duplicates

class dataeval.detectors.linters.Duplicates(only_exact=False)

Finds duplicate images in a dataset, using xxhash to detect exact duplicates and pchash to detect near duplicates.

stats

Output class of stats

Type:

StatsOutput

Parameters:
only_exact : bool, default False

If True, only inspect the dataset for exact image matches; near matches are not reported

evaluate(data)

Returns duplicate image indices for both exact matches and near matches

Parameters:
data : Iterable[ArrayLike], shape - (N, C, H, W) | StatsOutput | Sequence[StatsOutput]

A dataset of images in an ArrayLike format or the output(s) from a hashstats analysis

Returns:

List of groups of indices that are exact and near matches

Return type:

DuplicatesOutput

See also

hashstats

Example

>>> all_dupes = Duplicates()
>>> all_dupes.evaluate(duplicate_images)
DuplicatesOutput(exact=[[3, 20], [16, 37]], near=[[3, 20, 22], [12, 18], [13, 36], [14, 31], [17, 27], [19, 38, 47]])
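The exact-match groups above can be pictured as indices grouped by a content hash. The following is a minimal sketch of that idea only, not the library's implementation: it uses hashlib.sha256 from the standard library as a stand-in for xxhash, operates on raw bytes rather than image arrays, and the helper name exact_duplicate_groups is hypothetical.

```python
import hashlib
from collections import defaultdict


def exact_duplicate_groups(items: list[bytes]) -> list[list[int]]:
    """Group the indices of byte-identical items (hypothetical helper).

    Assumption: sha256 stands in for xxhash; any stable content hash
    illustrates the same grouping step.
    """
    by_digest: dict[str, list[int]] = defaultdict(list)
    for index, raw in enumerate(items):
        by_digest[hashlib.sha256(raw).hexdigest()].append(index)
    # Keep only digests shared by two or more items.
    return [group for group in by_digest.values() if len(group) > 1]


# Toy "images" as raw bytes; indices 0, 2, 5 and 1, 4 are identical.
items = [b"cat", b"dog", b"cat", b"bird", b"dog", b"cat"]
groups = exact_duplicate_groups(items)
print(groups)  # [[0, 2, 5], [1, 4]]
```

Near matches cannot be found this way, because a cryptographic or checksum-style hash changes completely when a single byte changes; that is why a separate hash (pchash) is used for them.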

from_stats(hashes: dataeval.metrics.stats._hashstats.HashStatsOutput) -> DuplicatesOutput[DuplicateGroup]
from_stats(hashes: collections.abc.Sequence[dataeval.metrics.stats._hashstats.HashStatsOutput]) -> DuplicatesOutput[DatasetDuplicateGroupMap]

Returns duplicate image indices for both exact matches and near matches

Parameters:
hashes : HashStatsOutput | Sequence[HashStatsOutput]

The output(s) from a hashstats analysis

Returns:

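List of groups of indices that are exact and near matches. When a sequence of outputs is passed, each group maps a dataset index to the matching indices within that dataset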

Return type:

DuplicatesOutput

See also

hashstats

Example

>>> exact_dupes = Duplicates(only_exact=True)
>>> exact_dupes.from_stats([hashes1, hashes2])
DuplicatesOutput(exact=[{0: [3, 20]}, {0: [16], 1: [12]}], near=[])
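One way to read the multi-dataset output above: each group is a mapping from a dataset's position in the input sequence to the matching indices within that dataset. The sketch below reproduces that shape from plain lists of hash strings. It is an illustration under stated assumptions, not the library's code: cross_dataset_groups is a hypothetical helper, and the hash values are made-up placeholders.

```python
from collections import defaultdict


def cross_dataset_groups(datasets: list[list[str]]) -> list[dict[int, list[int]]]:
    """Group matching hashes across datasets (hypothetical helper).

    Each returned group maps dataset index -> indices within that dataset.
    """
    by_hash: dict[str, dict[int, list[int]]] = defaultdict(lambda: defaultdict(list))
    for dataset_index, hashes in enumerate(datasets):
        for image_index, value in enumerate(hashes):
            by_hash[value][dataset_index].append(image_index)

    groups = []
    for per_dataset in by_hash.values():
        # A group needs at least two members in total, in any dataset(s).
        if sum(len(indices) for indices in per_dataset.values()) > 1:
            groups.append(dict(per_dataset))
    return groups


# Placeholder hash strings: "h0" repeats inside the first dataset,
# "h2" appears once in each dataset.
hashes_a = ["h0", "h1", "h0", "h2"]
hashes_b = ["h3", "h2", "h4"]
result = cross_dataset_groups([hashes_a, hashes_b])
print(result)  # [{0: [0, 2]}, {0: [3], 1: [1]}]
```

A group such as {0: [0, 2]} is a duplicate pair inside a single dataset, while {0: [3], 1: [1]} links one item in the first dataset to one item in the second.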