dataeval.quality.Duplicates

class dataeval.quality.Duplicates(flags=None, cluster_threshold=None, cluster_algorithm=None, n_clusters=None, merge_near_duplicates=None, config=None, feature_extractor=None)

Finds duplicate images using hashing and/or embedding-based clustering.

Supports multiple complementary detection methods:

  • Hash-based exact (xxhash): Detects exact duplicates (identical pixel values) using xxhash.

  • Hash-based near (phash): DCT-based perceptual hashing for compression/resize detection.

  • Hash-based near (dhash): Gradient hash for brightness-invariant detection.

  • Multidirectional hashing (phash_d4, dhash_d4): Rotation/flip-invariant variants that detect duplicates regardless of orientation.

  • Cluster-based: Uses neural network embeddings to find semantic duplicates.

The perceptual hash methods (phash, dhash) are complementary: each catches different kinds of image modification. Using all hashes gives more robust near-duplicate detection without requiring a trained model.
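To illustrate why a gradient hash tolerates brightness shifts, here is a minimal pure-Python sketch of a difference hash. It is not DataEval's implementation: the names `dhash_bits` and `hamming` are hypothetical, and the input is assumed to be a small grayscale grid that has already been resized.

```python
def dhash_bits(gray):
    """Return one bit per horizontally adjacent pixel pair: 1 if brightness increases."""
    return [int(row[i] < row[i + 1]) for row in gray for i in range(len(row) - 1)]


def hamming(a, b):
    """Count the positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


gray = [
    [10, 40, 20, 90],
    [70, 30, 60, 50],
    [15, 25, 35, 5],
]

# A uniform brightness shift keeps every left/right comparison unchanged.
# (This sketch ignores clipping at the top of the pixel range.)
brighter = [[p + 50 for p in row] for row in gray]

# Changing a single pixel flips only the comparisons that involve it.
altered = [row[:] for row in gray]
altered[0][1] = 0

base = dhash_bits(gray)
same_after_shift = dhash_bits(brighter) == base
distance_after_edit = hamming(base, dhash_bits(altered))
```

Because the bits encode only the sign of neighbouring differences, the shifted grid hashes identically, while the edited grid differs in a small number of bits. A near-duplicate test would typically compare that Hamming distance against a tolerance.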

Two convenience flags are provided for common use cases:

  • ImageStats.HASH_DUPLICATES_BASIC: Standard duplicate detection (xxhash + phash + dhash)

  • ImageStats.HASH_DUPLICATES_D4: Rotation/flip-invariant detection (xxhash + phash_d4 + dhash_d4)
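The D4 variants refer to the eight orientations of an image: four rotations, each with and without a mirror. The sketch below enumerates those orientations for a 2D grid and derives an orientation-independent key by taking the smallest one. This is only one way to obtain invariance; the helper names are hypothetical, and how phash_d4 and dhash_d4 are actually computed is not stated here.

```python
def rotate90(grid):
    """Rotate a 2D list 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_h(grid):
    """Mirror a 2D list left to right."""
    return [row[::-1] for row in grid]


def d4_orientations(grid):
    """Return the 8 orientations: 4 rotations, each with and without a mirror."""
    out = []
    current = [list(row) for row in grid]
    for _ in range(4):
        out.append(current)
        out.append(flip_h(current))
        current = rotate90(current)
    return out


def orientation_invariant_key(grid):
    """Hypothetical canonical key: the smallest flattened orientation."""
    return min(tuple(v for row in g for v in row) for g in d4_orientations(grid))


image = [[1, 2, 3], [4, 5, 6]]
key = orientation_invariant_key(image)
```

Any rotated or mirrored copy of `image` produces the same set of eight orientations, so it yields the same key.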

Parameters:
flags : ImageStats, default ImageStats.HASH_DUPLICATES_BASIC

Statistics to compute for hash-based duplicate detection:

  • ImageStats.HASH_DUPLICATES_BASIC (default): Standard detection with exact and perceptual hashes (xxhash + phash + dhash)

  • ImageStats.HASH_DUPLICATES_D4: Rotation/flip-invariant detection using D4 symmetry hashes (xxhash + phash_d4 + dhash_d4)

  • ImageStats.HASH: Compute all hash types (includes both basic and D4 variants)

  • ImageStats.HASH_XXHASH: Compute only exact duplicates (fastest)

  • ImageStats.HASH_PHASH: Compute only phash-based near duplicates

  • ImageStats.HASH_DHASH: Compute only dhash-based near duplicates

  • ImageStats.NONE: Skip hash computation, use only cluster-based detection

feature_extractor : FeatureExtractor, optional

Feature extractor for cluster-based duplicate detection. When provided, embeddings are extracted and clustered to find semantic duplicates.

cluster_threshold : float, optional

Threshold for cluster-based near duplicate detection. This does not affect exact duplicates, which sit at zero distance in the minimum spanning tree (MST). When None, only exact cluster duplicates are detected. Lower values are stricter.

cluster_algorithm : {"kmeans", "hdbscan"}, default "hdbscan"

Clustering algorithm for cluster-based detection.

n_clusters : int, optional

Expected number of clusters. For HDBSCAN, this is a hint that adjusts min_cluster_size. For KMeans, this is the exact number of clusters.

merge_near_duplicates : bool, default True

If True, overlapping near duplicate groups from different detection methods are merged into unified groups. Each group tracks which methods detected it, providing confidence information. If False, groups from each method are kept separate.
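The merging behavior can be pictured with a short sketch: groups that share at least one index are unioned, and the merged group records every method that contributed. The function `merge_groups` and its input layout are assumptions made for illustration, not the library's internals.

```python
def merge_groups(groups_by_method):
    """Merge index groups that share any index, tracking contributing methods.

    groups_by_method maps a method name to a list of index groups.
    Returns a sorted list of (sorted_indices, sorted_methods) tuples.
    """
    merged = []  # disjoint (set_of_indices, set_of_methods) pairs
    for method, groups in groups_by_method.items():
        for group in groups:
            indices, methods = set(group), {method}
            remaining = []
            for other_indices, other_methods in merged:
                if indices & other_indices:
                    # Absorb any existing group that overlaps the new one.
                    indices |= other_indices
                    methods |= other_methods
                else:
                    remaining.append((other_indices, other_methods))
            remaining.append((indices, methods))
            merged = remaining
    return sorted((sorted(i), sorted(m)) for i, m in merged)


found = {
    "phash": [[0, 1], [5, 6]],
    "dhash": [[1, 2], [8, 9]],
}
result = merge_groups(found)
# [([0, 1, 2], ['dhash', 'phash']), ([5, 6], ['phash']), ([8, 9], ['dhash'])]
```

A group flagged by several methods, such as the first one above, is a stronger duplicate candidate than one flagged by a single method.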

config : Duplicates.Config or None, default None

Optional configuration object with default parameters. Parameters specified directly in __init__ will override config defaults.
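The override rule can be sketched as a None-means-unset fallback. `SketchConfig` and `resolve` below are hypothetical stand-ins, not Duplicates.Config or any DataEval function; they only illustrate the stated precedence.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a configuration object."""

    cluster_algorithm: str = "hdbscan"
    merge_near_duplicates: bool = True


def resolve(explicit, config, name):
    """Use the explicit argument unless it is None, else fall back to the config."""
    return explicit if explicit is not None else getattr(config, name)


config = SketchConfig(cluster_algorithm="kmeans", merge_near_duplicates=False)

# No explicit argument: the config value is used.
algorithm = resolve(None, config, "cluster_algorithm")

# Explicit argument: it wins over the config value.
merge = resolve(True, config, "merge_near_duplicates")
```

Under this pattern, an argument left as None defers to the configuration, which may be why the constructor signature lists None for every parameter.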

stats

Hash statistics computed during the last evaluate() call.

Type:

CalculationResult

flags

Statistics to compute for duplicate detection.

Type:

ImageStats

feature_extractor

Feature extractor for cluster-based detection.

Type:

FeatureExtractor | None

cluster_threshold

Threshold for cluster-based near duplicate detection.

Type:

float | None

cluster_algorithm

Clustering algorithm to use.

Type:

Literal["kmeans", "hdbscan"]

n_clusters

Expected number of clusters.

Type:

int | None

merge_near_duplicates

Whether to merge overlapping near duplicate groups.

Type:

bool

Examples

Basic hash-based detection (default):

>>> detector = Duplicates()
>>> result = detector.evaluate(images)

Fast exact-only detection for large datasets:

>>> fast_detector = Duplicates(flags=ImageStats.HASH_XXHASH)
>>> result = fast_detector.evaluate(images)

Combined hash and cluster-based detection:

>>> from dataeval import Embeddings
>>> extractor = Embeddings(encoder=encoder)
>>> detector = Duplicates(feature_extractor=extractor, cluster_threshold=1.0)
>>> result = detector.evaluate(train_ds)

Using configuration:

>>> config = Duplicates.Config(cluster_algorithm="kmeans", merge_near_duplicates=False)
>>> detector = Duplicates(config=config)

evaluate(data, *, per_image=True, per_target=True)

Find duplicates by computing hashes and/or analyzing embeddings.

Performs duplicate detection using hash statistics and/or cluster-based analysis depending on configuration.

Parameters:
data : Dataset[ArrayLike] or Dataset[tuple[ArrayLike, Any, Any]]

Dataset of images in array format.

per_image : bool, default True

Whether to compute hashes for full items (images/videos).

per_target : bool, default True

Whether to compute hashes for individual targets/detections.

Returns:

Duplicate detection results with separate item and target groups.

  • items.exact: Exact duplicates (hash-based and/or cluster-based)

  • items.near: Near duplicate groups with detection method metadata. Each group has indices and methods (e.g., {"phash", "rhash"}).

  • targets: Target-level duplicates (hash-based only)

Return type:

DuplicatesOutput

Raises:

ValueError – If flags is NONE and no feature_extractor is provided.

Examples

Hash-based duplicates with merged near duplicates (default):

>>> detector = Duplicates()
>>> result = detector.evaluate(images)
>>> print(result.items.exact)
[[3, 20], [7, 11, 18, 25], [16, 37]]
>>> for group in result.items.near:
...     print(f"Index count: {len(group.indices)}, Methods: {sorted(group.methods)}")
Index count: 50, Methods: ['dhash', 'phash']

Fast exact-only detection:

>>> detector = Duplicates(flags=ImageStats.HASH_XXHASH)
>>> result = detector.evaluate(images)

Combined hash and cluster-based detection:

>>> from dataeval import Embeddings
>>> extractor = Embeddings(encoder=encoder)
>>> detector = Duplicates(feature_extractor=extractor, cluster_threshold=1.0)
>>> result = detector.evaluate(train_ds)

from_clusters(cluster_result)

Find duplicates using cluster-based detection from minimum spanning tree.

Analyzes the minimum spanning tree and cluster assignments to identify exact and near duplicates based on distance relationships within clusters.

Parameters:
cluster_result : ClusterResult

Clustering results from the cluster() function.

Returns:

Duplicate detection results with item-level duplicate groups. Cluster-based detection operates on items only (no target separation).

Return type:

DuplicatesOutput

Notes

This method identifies duplicates in embedding space:

  • Exact duplicates: Points at zero distance in the MST

  • Near duplicates: Points within cluster-specific distance thresholds
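The two rules above can be sketched on a list of MST edges. This is a simplified illustration, not the library's algorithm: it uses one global threshold rather than cluster-specific thresholds, and the function name and edge format are assumptions.

```python
def duplicates_from_mst(edges, threshold=None):
    """Group points joined by zero-length edges (exact) or short edges (near).

    edges: iterable of (i, j, distance) tuples from a minimum spanning tree.
    threshold: maximum distance for a near duplicate; None disables near detection.
    """
    edges = list(edges)

    def components(selected):
        # Union-find over the selected edges, returning sorted index groups.
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        for i, j in selected:
            parent[find(i)] = find(j)
        groups = {}
        for node in parent:
            groups.setdefault(find(node), []).append(node)
        return sorted(sorted(g) for g in groups.values())

    exact = components((i, j) for i, j, d in edges if d == 0)
    near = []
    if threshold is not None:
        near = components((i, j) for i, j, d in edges if d <= threshold)
        # Drop groups that are already reported as exact duplicates.
        near = [g for g in near if g not in exact]
    return exact, near


mst_edges = [(0, 1, 0.0), (1, 2, 0.4), (2, 3, 2.5), (3, 4, 0.0)]
exact, near = duplicates_from_mst(mst_edges, threshold=0.5)
# exact == [[0, 1], [3, 4]], near == [[0, 1, 2]]
```

With no threshold, only the zero-distance groups are returned, which mirrors the documented behavior of cluster_threshold=None.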

See also

dataeval.core.cluster

Function to compute clusters from embeddings

from_stats

Find duplicates from pre-computed hash statistics

evaluate

Find duplicates by computing hashes from images

from_stats(stats: dataeval.core._calculate.CalculationResult) -> DuplicatesOutput
from_stats(stats: dataeval.core._calculate.CalculationResult, *other_stats: dataeval.core._calculate.CalculationResult) -> DuplicatesOutput
from_stats(stats: collections.abc.Sequence[dataeval.core._calculate.CalculationResult]) -> DuplicatesOutput

Find duplicates from pre-computed hash statistics.

Use this method when hash statistics have already been computed via calculate() to avoid redundant computation.

Parameters:
stats : CalculationResult | Sequence[CalculationResult]

Pre-computed statistics containing hash values. Must include at least one of: xxhash, phash, dhash, rhash. Can be a single result, a sequence of results, or multiple results passed as positional arguments.

*other_stats : CalculationResult

Additional statistics from other datasets for cross-dataset duplicate detection.

Returns:

Duplicate detection results with separate item and target groups. For cross-dataset detection, indices are DatasetItemTuple objects.
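As a rough picture of cross-dataset results, the sketch below groups identical hash values from several datasets and labels each member with its dataset and index. `ItemRef` is a hypothetical stand-in for DatasetItemTuple (whose real field names may differ), and plain lists of strings stand in for the hash values held in a CalculationResult.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for DatasetItemTuple; the real field names may differ.
ItemRef = namedtuple("ItemRef", ["dataset", "index"])


def cross_dataset_exact(*hash_lists):
    """Group items with identical hash values across several datasets."""
    buckets = defaultdict(list)
    for dataset_id, hashes in enumerate(hash_lists):
        for index, value in enumerate(hashes):
            buckets[value].append(ItemRef(dataset_id, index))
    # Only hash values seen more than once form a duplicate group.
    return [group for group in buckets.values() if len(group) > 1]


train_hashes = ["a1", "b2", "c3"]
test_hashes = ["zz", "b2"]
groups = cross_dataset_exact(train_hashes, test_hashes)
# [[ItemRef(dataset=0, index=1), ItemRef(dataset=1, index=1)]]
```

A plain integer index would be ambiguous once several datasets are combined, which is why each member carries a dataset identifier as well.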

Return type:

DuplicatesOutput

See also

evaluate

Compute hashes and find duplicates in one call

from_clusters

Find duplicates using cluster-based detection

Classes

Config

Configuration for Duplicates detector.