dataeval.quality.DuplicatesOutput
class dataeval.quality.DuplicatesOutput(data, *, calculation_results=None, cluster_result=None, cluster_sensitivity=None, merge_near_duplicates=True, flags=ImageStats.NONE)
Output class for the Duplicates detector.
Wraps a Polars DataFrame of duplicate groups with aggregation helpers and threshold-based redetection for cluster duplicates.
DataFrame of duplicate groups with columns:
group_id: int - Auto-incrementing ID for each duplicate group
level: str - "item" or "target"
dup_type: str - "exact" or "near"
item_indices: list[int] - Item indices of members in the group
target_indices: list[int] - Target indices within items (only when target-level groups exist, positionally aligned with item_indices)
methods: list[str] - Detection method names (e.g., ["phash", "dhash"])
orientation: str | None - "same", "rotated", or None (only present when both basic and D4 hashes were computed)
dataset_index: list[int] - Dataset indices for cross-dataset results (only present for multi-dataset output, positionally aligned with item_indices)
- calculation_results
The original hash statistics. Used internally for redetection via with_sensitivity().
- Type:
StatsResult or Sequence[StatsResult] or None
- cluster_result
The clustering result (MST + cluster assignments). Used internally for redetection via with_sensitivity().
- Type:
ClusterResult or None
- cluster_sensitivity
Factor used for cluster-based near duplicate detection. Scales the cluster standard deviation to set the duplicate cutoff.
- Type:
float or None
- aggregate_by_group()
Return a DataFrame summarizing each duplicate group.
Adds a member_count column showing the size of each group.
- Returns:
DataFrame with columns:
group_id: int - Group identifier
level: str - "item" or "target"
dup_type: str - "exact" or "near"
member_count: int - Number of members in the group
methods: list[str] - Detection methods
orientation: str | None - Only present when both basic and D4 hashes were computed
- Return type:
pl.DataFrame
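The per-group summary can be pictured with plain Python dictionaries standing in for DataFrame rows. This is a minimal sketch of the idea, not the library's implementation: the sample rows, the `summarize_groups` helper, and the `"exact_hash"` method label are all hypothetical.

```python
# Hypothetical rows mirroring the documented columns; not real detector output.
groups = [
    {"group_id": 0, "level": "item", "dup_type": "exact",
     "item_indices": [3, 17, 42], "methods": ["exact_hash"]},  # placeholder method name
    {"group_id": 1, "level": "item", "dup_type": "near",
     "item_indices": [8, 9], "methods": ["phash", "dhash"]},
]


def summarize_groups(rows):
    """Swap each group's index list for a member_count (illustrative helper)."""
    return [
        {
            "group_id": row["group_id"],
            "level": row["level"],
            "dup_type": row["dup_type"],
            "member_count": len(row["item_indices"]),
            "methods": row["methods"],
        }
        for row in rows
    ]


summary = summarize_groups(groups)
print([row["member_count"] for row in summary])  # [3, 2]
```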
- aggregate_by_image()
Return a DataFrame listing each unique image involved in duplicates.
Explodes item_indices so each image appears once, with counts and metadata about which groups and methods flagged it.
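As an illustration of the explode step, the sketch below flattens `item_indices` so that each image gets one entry listing the groups and methods that flagged it. The output field names (`item_index`, `group_count`, `group_ids`) and the `explode_by_image` helper are assumptions made for this example; the actual columns returned by the method are not listed here.

```python
# Hypothetical rows; image 17 belongs to two groups.
groups = [
    {"group_id": 0, "item_indices": [3, 17], "methods": ["phash"]},
    {"group_id": 1, "item_indices": [17, 42], "methods": ["dhash"]},
]


def explode_by_image(rows):
    """Return one entry per image with the groups and methods that flagged it."""
    per_image = {}
    for row in rows:
        for idx in row["item_indices"]:
            entry = per_image.setdefault(idx, {"group_ids": [], "methods": set()})
            entry["group_ids"].append(row["group_id"])
            entry["methods"].update(row["methods"])
    return [
        {
            "item_index": idx,
            "group_count": len(entry["group_ids"]),
            "group_ids": entry["group_ids"],
            "methods": sorted(entry["methods"]),
        }
        for idx, entry in sorted(per_image.items())
    ]


by_image = explode_by_image(groups)
print([(r["item_index"], r["group_count"]) for r in by_image])  # [(3, 1), (17, 2), (42, 1)]
```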
- aggregate_by_method()
Return a DataFrame summarizing duplicate counts per detection method.
Explodes the methods list so each method is counted individually.
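A rough picture of the per-method count, assuming the count is taken over groups: a group flagged by two methods contributes one count to each. The sample rows and the `count_groups_per_method` helper are hypothetical.

```python
from collections import Counter

# Hypothetical rows; only the methods column matters here.
groups = [
    {"group_id": 0, "methods": ["phash", "dhash"]},
    {"group_id": 1, "methods": ["phash"]},
]


def count_groups_per_method(rows):
    """Explode each methods list and count how many groups each method flagged."""
    return Counter(method for row in rows for method in row["methods"])


counts = count_groups_per_method(groups)
print(dict(counts))  # {'phash': 2, 'dhash': 1}
```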
- data()
Return the output data as a Polars DataFrame.
- with_sensitivity(cluster_sensitivity)
Re-detect cluster-based duplicates with a different distance factor.
Hash-based duplicates are deterministic and are not affected. Only cluster-based near duplicates are recomputed using the stored clustering result (MST + cluster assignments).
- Parameters:
- cluster_sensitivity : float
Controls how aggressively points are considered duplicates by scaling the cluster’s standard deviation. An edge is flagged as a duplicate link when its distance is below cluster_sensitivity * cluster_std. Lower values are stricter (fewer near duplicates); higher values are more sensitive. Typical range: 0.1 – 3.0. Must be positive.
- Returns:
New output with re-detected duplicates using the new distance factor.
- Return type:
- Raises:
ValueError – If this output was not created from an evaluation with cluster results.
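The cutoff rule described for cluster_sensitivity can be sketched in a few lines. This only illustrates the stated comparison (distance below cluster_sensitivity * cluster_std). The edge list, the `cluster_std` value, and the `duplicate_links` helper are hypothetical, and how the library derives the standard deviation from the clustering result is not shown.

```python
def duplicate_links(edges, cluster_std, cluster_sensitivity):
    """Return the (a, b) edges whose distance falls below the scaled cutoff."""
    if cluster_sensitivity <= 0:
        raise ValueError("cluster_sensitivity must be positive")
    cutoff = cluster_sensitivity * cluster_std
    return [(a, b) for a, b, distance in edges if distance < cutoff]


# Hypothetical MST edges as (node_a, node_b, distance).
edges = [(0, 1, 0.05), (1, 2, 0.40), (2, 3, 1.20)]
cluster_std = 0.5  # assumed value for illustration

strict = duplicate_links(edges, cluster_std, 0.5)  # cutoff = 0.25
loose = duplicate_links(edges, cluster_std, 2.0)   # cutoff = 1.0
print(strict)  # [(0, 1)]
print(loose)   # [(0, 1), (1, 2)]
```

Raising the factor widens the cutoff, so more edges qualify as duplicate links, which matches the description of higher values being more sensitive.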
- property exact : TExactDuplicatesGroup
Exact duplicate groups as lists of indices.
For single-dataset item results: list[list[int]]
For single-dataset target results: list[list[SourceIndex]]
For cross-dataset item results: dict[int, list[list[int]]]
For cross-dataset target results: dict[int, list[list[SourceIndex]]]
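One common use of the single-dataset item form (list[list[int]]) is deciding which items to drop while keeping one representative per group. The sketch below assumes that shape only; the sample groups and the `indices_to_drop` helper are hypothetical, and keeping the lowest index is just one possible policy.

```python
def indices_to_drop(exact_groups):
    """Keep the lowest index in each group and return the remaining indices, sorted."""
    drop = []
    for group in exact_groups:
        _keep, *rest = sorted(group)
        drop.extend(rest)
    return sorted(drop)


# Hypothetical value of the exact property for a single-dataset item result.
exact = [[3, 17, 42], [9, 8]]
to_drop = indices_to_drop(exact)
print(to_drop)  # [9, 17, 42]
```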
- property items : Self
Return a filtered DuplicatesOutput containing only item-level duplicate groups.
The returned object supports the same properties (exact, near) and aggregation methods as the original output.
- property near : TNearDuplicatesGroup
Near-duplicate groups as (indices, methods) tuples.
Each group is a tuple of (indices, methods) where methods is the list[str] of detection methods that flagged the group.
For single-dataset item results: list[tuple[list[int], list[str]]]
For single-dataset target results: list[tuple[list[SourceIndex], list[str]]]
For cross-dataset item results: dict[int, list[tuple[list[int], list[str]]]]
For cross-dataset target results: dict[int, list[tuple[list[SourceIndex], list[str]]]]
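Because each near-duplicate group carries its methods, the single-dataset item form can be filtered by detection method with ordinary tuple unpacking. The sample data and the `groups_flagged_by` helper below are hypothetical and cover only the list[tuple[list[int], list[str]]] shape.

```python
def groups_flagged_by(near_groups, method):
    """Return the index lists of the groups that the given method flagged."""
    return [indices for indices, methods in near_groups if method in methods]


# Hypothetical value of the near property for a single-dataset item result.
near = [
    ([0, 5], ["phash", "dhash"]),
    ([2, 7, 9], ["dhash"]),
]

print(groups_flagged_by(near, "phash"))  # [[0, 5]]
print(groups_flagged_by(near, "dhash"))  # [[0, 5], [2, 7, 9]]
```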
- property targets : Self
Return a filtered DuplicatesOutput containing only target-level duplicate groups.
The returned object supports the same properties (exact, near) and aggregation methods as the original output.