dataeval.quality.Duplicates¶
-
class dataeval.quality.Duplicates(flags=None, cluster_sensitivity=None, merge_near_duplicates=None, extractor=None, batch_size=None, cluster_algorithm=None, n_clusters=None, config=None)¶
Finds duplicate images using hashing and/or embedding-based clustering.
Supports multiple complementary detection methods:
Hash-based exact (xxhash): Detects exact duplicates (identical pixel values) using xxhash.
Hash-based near (phash): DCT-based perceptual hashing for compression/resize detection.
Hash-based near (dhash): Gradient hash for brightness-invariant detection.
Multidirectional hashing (phash_d4, dhash_d4): Rotation/flip-invariant variants that detect duplicates regardless of orientation.
Cluster-based: Uses neural network embeddings to find semantic duplicates.
The multiple perceptual hash methods (phash, dhash) are complementary and can catch different types of image modifications. Using all hashes provides more robust near-duplicate detection without requiring a trained model.
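To make the "gradient hash" idea concrete, here is a minimal pure-Python sketch of a difference hash. It is not DataEval's implementation: the helper names `dhash_bits` and `hamming` are hypothetical, and the resize-to-a-fixed-grid step that a real dhash performs is skipped. It only illustrates why comparing neighbouring pixels is insensitive to a uniform brightness shift.

```python
# Illustrative difference hash (dhash). NOT the library's implementation:
# a real dhash first resizes the image to a small fixed grid.
def dhash_bits(gray):
    """Return one bit per horizontal neighbour pair: 1 if left pixel is brighter."""
    return tuple(
        int(row[x] > row[x + 1])
        for row in gray
        for x in range(len(row) - 1)
    )


def hamming(a, b):
    """Count positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


image = [
    [10, 40, 30, 90],
    [80, 20, 60, 50],
    [15, 15, 70, 10],
]
brighter = [[p + 25 for p in row] for row in image]  # uniform brightness shift
mirrored = [row[::-1] for row in image]              # different gradient pattern

h = dhash_bits(image)
print(hamming(h, dhash_bits(brighter)))  # 0: the gradients are unchanged
print(hamming(h, dhash_bits(mirrored)))  # non-zero: the gradients differ
```

In practice two images are treated as near duplicates when the Hamming distance between their hashes is small; the exact threshold is a design choice that this sketch does not fix.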
Three convenience flags are provided for common use cases:
ImageStats.HASH_DUPLICATES_BASIC: Standard duplicate detection (xxhash + phash + dhash)
ImageStats.HASH_DUPLICATES_D4: Rotation/flip-invariant detection (xxhash + phash_d4 + dhash_d4)
ImageStats.HASH: All hash statistics (enables rotation/flip awareness)
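One common way to obtain a rotation/flip-invariant hash is to hash all eight orientations of the D4 symmetry group (four rotations, each with and without a mirror) and keep a canonical one, such as the minimum. The sketch below assumes that approach. Whether `phash_d4` and `dhash_d4` work this way internally is not stated here, and `simple_hash` is a hypothetical stand-in for a real perceptual hash.

```python
def rotate90(img):
    """Rotate a 2D list of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip(img):
    """Mirror a 2D list of pixels left to right."""
    return [row[::-1] for row in img]


def d4_orientations(img):
    """Yield the 8 orientations: 4 rotations, each with and without a flip."""
    current = [list(row) for row in img]
    for _ in range(4):
        yield current
        yield flip(current)
        current = rotate90(current)


def simple_hash(img):
    """Hypothetical stand-in hash: 1 where a pixel is above the image mean."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(int(p > mean) for p in pixels)


def d4_hash(img):
    """Canonical hash: the smallest hash over all 8 orientations."""
    return min(simple_hash(o) for o in d4_orientations(img))


img = [
    [200, 10, 10],
    [10, 10, 10],
    [10, 10, 10],
]
rotated = rotate90(img)

print(simple_hash(img) == simple_hash(rotated))  # False: orientation-sensitive
print(d4_hash(img) == d4_hash(rotated))          # True: orientation-invariant
```

Because the eight orientations of a rotated or flipped image are the same set as those of the original, the minimum over that set does not depend on the starting orientation.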
- Parameters:¶
- flags : ImageStats, default ImageStats.HASH_DUPLICATES_BASIC¶
Statistics to compute for hash-based duplicate detection. Set to ImageStats.NONE to disable hash-based detection.
- extractor : FeatureExtractor, optional¶
Feature extractor for cluster-based duplicate detection. Must be provided together with cluster_sensitivity to enable clustering. When provided alone without cluster_sensitivity, clustering is skipped.
- cluster_sensitivity : float, optional¶
Controls how aggressively points within a cluster are considered duplicates, by scaling the cluster’s standard deviation of MST edge distances. An edge is flagged as a duplicate link when its distance is less than cluster_sensitivity * cluster_std. Lower values (e.g. 0.5) are stricter (fewer duplicates); higher values (e.g. 2.0) are more sensitive. Typical range: 0.1 – 3.0. Must be positive. Must be provided together with extractor to enable clustering. When None or when extractor is None, cluster-based detection is skipped entirely.
- cluster_algorithm : {"kmeans", "hdbscan"}, default "hdbscan"¶
Clustering algorithm for cluster-based detection.
- n_clusters : int, optional¶
Expected number of clusters. For HDBSCAN, this is a hint that adjusts min_cluster_size. For KMeans, this is the exact number of clusters.
- merge_near_duplicates : bool, default True¶
If True, overlapping near-duplicate groups from different detection methods are merged into unified groups. Each group tracks which methods detected it, providing confidence information. If False, groups from each method are kept separate.
- config : Duplicates.Config or None, default None¶
Optional configuration object with default parameters. Parameters specified directly in __init__ will override config defaults.
- extractor¶
Feature extractor for cluster-based detection.
- Type:¶
FeatureExtractor | None
- cluster_sensitivity¶
Sensitivity for cluster-based near duplicate detection. Values should be positive, with typical range 0.1 – 3.0. When None, cluster-based detection is disabled.
- Type:¶
float | None
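The thresholding rule described for cluster_sensitivity (an edge is a duplicate link when its distance is below cluster_sensitivity * cluster_std) can be sketched in a few lines. This is an illustration only: `duplicate_links` is a hypothetical helper, the edge list is made up, and the use of the population standard deviation is an assumption.

```python
from statistics import pstdev


def duplicate_links(mst_edges, cluster_sensitivity):
    """Flag MST edges of one cluster whose distance is below the scaled std.

    mst_edges: list of (i, j, distance) tuples for a single cluster.
    """
    if cluster_sensitivity <= 0:
        raise ValueError("cluster_sensitivity must be positive")
    # Assumption: population standard deviation of the edge distances.
    cluster_std = pstdev([d for _, _, d in mst_edges])
    threshold = cluster_sensitivity * cluster_std
    return [(i, j) for i, j, d in mst_edges if d < threshold]


# Made-up edges: one very short edge and three ordinary ones.
edges = [(0, 1, 0.05), (1, 2, 0.9), (2, 3, 1.1), (3, 4, 1.0)]

print(duplicate_links(edges, 0.1))  # [] : strictest, nothing is flagged
print(duplicate_links(edges, 0.5))  # [(0, 1)] : only the very short edge
print(len(duplicate_links(edges, 3.0)))  # 4 : every edge is flagged
```

The three calls show the direction of the parameter: raising cluster_sensitivity raises the distance threshold, so more edges are treated as duplicate links.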
Examples
Basic hash-based detection (default):
>>> detector = Duplicates()
>>> result = detector.evaluate(images)
Fast exact-only detection for large datasets:
>>> fast_detector = Duplicates(flags=ImageStats.HASH_XXHASH)
>>> result = fast_detector.evaluate(images)
Combined hash and cluster-based detection:
>>> from dataeval.extractors import FlattenExtractor
>>> detector = Duplicates(extractor=FlattenExtractor(), cluster_sensitivity=1.0)
>>> result = detector.evaluate(train_ds)
Using configuration:
>>> config = Duplicates.Config(
...     extractor=FlattenExtractor(),
...     cluster_algorithm="kmeans",
...     merge_near_duplicates=False,
... )
>>> detector = Duplicates(config=config)
-
evaluate(data: _DatasetInput, *, per_image: bool = True, per_target: False = ...) → SingleDuplicatesOutput¶
-
evaluate(data: _DatasetInput, *, per_image: bool = True, per_target: True) → SingleTargetDuplicatesOutput
-
evaluate(data: _DatasetInput, *other: _DatasetInput, per_image: bool = True, per_target: False = ...) → MultiDuplicatesOutput
-
evaluate(data: _DatasetInput, *other: _DatasetInput, per_image: bool = True, per_target: True) → MultiTargetDuplicatesOutput
Find duplicates by computing hashes and/or analyzing embeddings.
Performs duplicate detection using hash statistics and/or cluster-based analysis depending on configuration. Supports single or multiple datasets.
- Parameters:¶
- data : Dataset¶
A dataset of images.
- *other : Dataset¶
Additional datasets for cross-dataset duplicate detection.
- per_image : bool, default True¶
Whether to compute hashes for full items (images/videos).
- per_target : bool, default False¶
Whether to compute hashes for individual targets/detections. When True, accessor properties return SourceIndex indices; when False, they return plain int item indices.
- Returns:¶
Duplicate detection results as a DataFrame of duplicate groups. For multi-dataset input, includes a dataset_index column.
- Return type:¶
SingleDuplicatesOutput or MultiDuplicatesOutput (SingleTargetDuplicatesOutput or MultiTargetDuplicatesOutput when per_target is True)
- Raises:¶
ValueError – If flags is NONE and no extractor is provided.
Examples
Hash-based duplicates with merged near duplicates (default):
>>> detector = Duplicates()
>>> result = detector.evaluate(images)
>>> result
shape: (4, 5)
┌──────────┬───────┬──────────┬───────────────┬────────────────────┐
│ group_id ┆ level ┆ dup_type ┆ item_indices  ┆ methods            │
│ ---      ┆ ---   ┆ ---      ┆ ---           ┆ ---                │
│ i64      ┆ str   ┆ str      ┆ list[i64]     ┆ list[str]          │
╞══════════╪═══════╪══════════╪═══════════════╪════════════════════╡
│ 0        ┆ item  ┆ exact    ┆ [3, 20]       ┆ ["xxhash"]         │
│ 1        ┆ item  ┆ exact    ┆ [7, 11, … 25] ┆ ["xxhash"]         │
│ 2        ┆ item  ┆ exact    ┆ [16, 37]      ┆ ["xxhash"]         │
│ 3        ┆ item  ┆ near     ┆ [0, 1, … 49]  ┆ ["dhash", "phash"] │
└──────────┴───────┴──────────┴───────────────┴────────────────────┘
Cross-dataset detection:
>>> detector = Duplicates() >>> result = detector.evaluate(train_ds, test_ds)
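The effect of merge_near_duplicates=True, where overlapping groups from different methods become one group that records every method that found it, can be sketched with a small union-find. This is a hypothetical illustration, not the library's code: `merge_groups` and the input layout are assumptions.

```python
def merge_groups(groups_by_method):
    """Merge overlapping index groups and record which methods found each.

    groups_by_method: dict mapping a method name to a list of sets of item indices.
    Returns a sorted list of (sorted item indices, sorted method names).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Link every member of a group to the group's first member.
    for groups in groups_by_method.values():
        for group in groups:
            items = sorted(group)
            for other in items[1:]:
                union(items[0], other)

    # Collect items and contributing methods per connected component.
    merged = {}
    for method, groups in groups_by_method.items():
        for group in groups:
            if not group:
                continue
            root = find(next(iter(group)))
            items, methods = merged.setdefault(root, (set(), set()))
            items.update(group)
            methods.add(method)

    return sorted((sorted(i), sorted(m)) for i, m in merged.values())


found = {
    "phash": [{0, 1}, {5, 6}],
    "dhash": [{1, 2}, {8, 9}],
}
for items, methods in merge_groups(found):
    print(items, methods)
# [0, 1, 2] ['dhash', 'phash']
# [5, 6] ['phash']
# [8, 9] ['dhash']
```

Groups {0, 1} and {1, 2} share item 1, so they collapse into a single group credited to both methods. A group found by several methods is generally a stronger duplicate signal than one found by a single method.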
- from_clusters(cluster_result)¶
Find duplicates using cluster-based detection from minimum spanning tree.
Analyzes the minimum spanning tree and cluster assignments to identify near duplicates based on distance relationships within clusters.
- Parameters:¶
- cluster_result : ClusterResult¶
Clustering results from the cluster() function.
- Returns:¶
Duplicate detection results with item-level duplicate groups. Cluster-based detection operates on items only (no target separation).
- Return type:¶
Notes
This method identifies duplicates in embedding space. All cluster-based duplicates are returned as near duplicates because embeddings are approximate representations: identical embeddings don’t guarantee pixel-identical images.
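For readers unfamiliar with the structure this method analyzes, the sketch below builds a minimum spanning tree over a handful of made-up 2-D "embeddings" using Prim's algorithm. It is purely illustrative: `minimum_spanning_tree` is a hypothetical helper, Euclidean distance is an assumption, and the library's own cluster() function is what produces the real ClusterResult.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns a list of (i, j, distance) edges, where i is already in the tree
    and j is the point being attached.
    """
    n = len(points)
    if n < 2:
        return []
    # best[j] = (cheapest known distance from the tree to j, tree node giving it)
    best = {j: (math.dist(points[0], points[j]), 0) for j in range(1, n)}
    edges = []
    while best:
        j = min(best, key=lambda k: best[k][0])
        d, i = best.pop(j)
        edges.append((i, j, d))
        # The newly attached point may offer a shorter link to remaining points.
        for k in best:
            dk = math.dist(points[j], points[k])
            if dk < best[k][0]:
                best[k] = (dk, j)
    return edges


# Made-up embeddings: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 0.1), (5.0, 5.0), (5.0, 5.1), (10.0, 0.0)]
edges = minimum_spanning_tree(points)

short = sorted((i, j) for i, j, d in edges if d < 1.0)
print(len(edges))  # 4 edges for 5 points
print(short)       # [(0, 1), (2, 3)]: the two tight pairs
```

The very short tree edges connect points that are nearly coincident in embedding space, which is the kind of relationship a distance-based duplicate rule looks for.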
See also
dataeval.core.cluster – Function to compute clusters from embeddings
from_stats – Find duplicates from pre-computed hash statistics
evaluate – Find duplicates by computing hashes from images
-
from_stats(stats: dataeval.core.StatsResult, *, per_image: bool = True, per_target: False = ...) → SingleDuplicatesOutput¶
-
from_stats(stats: dataeval.core.StatsResult, *, per_image: bool = True, per_target: True) → SingleTargetDuplicatesOutput
-
from_stats(stats: collections.abc.Sequence[dataeval.core.StatsResult], *, per_image: bool = True, per_target: False = ...) → MultiDuplicatesOutput
-
from_stats(stats: collections.abc.Sequence[dataeval.core.StatsResult], *, per_image: bool = True, per_target: True) → MultiTargetDuplicatesOutput
Find duplicates from pre-computed hash statistics.
Use this method when hash statistics have already been computed via calculate() to avoid redundant computation.
- Parameters:¶
- stats : StatsResult | Sequence[StatsResult]¶
Pre-computed statistics containing hash values. Must include at least one of: xxhash, phash, dhash, rhash. Can be a single result or a sequence of results.
- per_image : bool, default True¶
Whether to include item-level (image) duplicate groups.
- per_target : bool, default False¶
Whether to include target-level duplicate groups. When True, accessor properties return SourceIndex indices; when False, they return plain int item indices.
- Returns:¶
Duplicate detection results as a DataFrame of duplicate groups. For cross-dataset detection, includes a dataset_index column.
- Return type:¶
See also
evaluate – Compute hashes and find duplicates in one call
from_clusters – Find duplicates using cluster-based detection
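Conceptually, finding exact duplicates from pre-computed hashes amounts to bucketing items by hash value, and with several datasets each member also needs a dataset index. The sketch below shows that idea on plain lists of hash strings. It does not use StatsResult, and `exact_groups` and the (dataset_index, item_index) tuple layout are assumptions for illustration.

```python
from collections import defaultdict


def exact_groups(hashes_per_dataset):
    """Group items that share a hash value, within and across datasets.

    hashes_per_dataset: one list of hash strings per dataset.
    Returns lists of (dataset_index, item_index) pairs, one list per group.
    """
    buckets = defaultdict(list)
    for dataset_index, hashes in enumerate(hashes_per_dataset):
        for item_index, value in enumerate(hashes):
            buckets[value].append((dataset_index, item_index))
    # Only hash values seen more than once form a duplicate group.
    return [members for members in buckets.values() if len(members) > 1]


# Made-up hash values standing in for pre-computed xxhash results.
train_hashes = ["a1", "b2", "a1", "c3"]
test_hashes = ["d4", "b2"]

for group in exact_groups([train_hashes, test_hashes]):
    print(group)
# [(0, 0), (0, 2)]   duplicate pair inside the first dataset
# [(0, 1), (1, 1)]   the same item appears in both datasets
```

The second group is the cross-dataset case: it pairs an item from dataset 0 with one from dataset 1, which is the kind of train/test leakage that multi-dataset detection is meant to surface.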
Classes¶
Configuration for Duplicates detector.