dataeval.quality.DuplicatesOutput

class dataeval.quality.DuplicatesOutput(data, *, calculation_results=None, cluster_result=None, cluster_sensitivity=None, merge_near_duplicates=True, flags=ImageStats.NONE)

Output class for the Duplicates detector.

Wraps a Polars DataFrame of duplicate groups and provides aggregation helpers and sensitivity-based redetection of cluster-based near duplicates.

The wrapped DataFrame of duplicate groups has the following columns:

group_id: int - Auto-incrementing ID for each duplicate group
level: str - "item" or "target"
dup_type: str - "exact" or "near"
item_indices: list[int] - Item indices of members in the group
target_indices: list[int] - Target indices within items (only when target-level groups exist, positionally aligned with item_indices)
methods: list[str] - Detection method names (e.g., ["phash", "dhash"])
orientation: str | None - "same", "rotated", or None (only present when both basic and D4 hashes were computed)
dataset_index: list[int] - Dataset indices for cross-dataset results (only present for multi-dataset output, positionally aligned with item_indices)
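
To make the column layout concrete, the sketch below writes two hypothetical groups as plain Python dictionaries. It shows only the columns that are always present. Every value is invented for illustration, including the "exact_hash" method label, and a real output stores these rows in a Polars DataFrame rather than a list.

```python
# Hypothetical rows that follow the documented column layout.
# All values, including the "exact_hash" method label, are invented.
groups = [
    {"group_id": 0, "level": "item", "dup_type": "exact",
     "item_indices": [3, 17], "methods": ["exact_hash"]},
    {"group_id": 1, "level": "item", "dup_type": "near",
     "item_indices": [3, 8, 21], "methods": ["phash", "dhash"]},
]

# The only values the documentation lists for these two columns.
ALLOWED_LEVELS = {"item", "target"}
ALLOWED_DUP_TYPES = {"exact", "near"}


def check_row(row):
    """Return True if a row uses only the documented level and dup_type values."""
    return row["level"] in ALLOWED_LEVELS and row["dup_type"] in ALLOWED_DUP_TYPES


all_valid = all(check_row(row) for row in groups)
```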

calculation_results

The original hash statistics. Used internally for redetection via with_sensitivity().

Type: StatsResult or Sequence[StatsResult] or None

cluster_result

The clustering result (MST + cluster assignments). Used internally for redetection via with_sensitivity().

Type: ClusterResult or None

cluster_sensitivity

Factor used for cluster-based near duplicate detection. Scales the cluster standard deviation to set the duplicate cutoff.

Type: float or None

merge_near_duplicates

Whether overlapping near duplicate groups were merged.

Type: bool

flags

The hash statistics flags used for detection.

Type: ImageStats

aggregate_by_group()

Return a DataFrame summarizing each duplicate group.

Adds a member_count column showing the size of each group.

Returns:

DataFrame with columns:

group_id: int - Group identifier
level: str - "item" or "target"
dup_type: str - "exact" or "near"
member_count: int - Number of members in the group
methods: list[str] - Detection methods
orientation: str | None - Only present when both basic and D4 hashes were computed

Return type:

pl.DataFrame
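
The per-group summary can be sketched in plain Python. The sketch assumes that member_count equals the length of item_indices, which the text implies but does not state outright. The rows and the summarize_groups helper are hypothetical and are not part of the library.

```python
# Hypothetical duplicate groups; values are invented for illustration.
groups = [
    {"group_id": 0, "level": "item", "dup_type": "exact",
     "item_indices": [3, 17], "methods": ["exact_hash"]},
    {"group_id": 1, "level": "item", "dup_type": "near",
     "item_indices": [3, 8, 21], "methods": ["phash", "dhash"]},
]


def summarize_groups(rows):
    """Replace each group's index list with its length (assumed member_count)."""
    return [
        {
            "group_id": row["group_id"],
            "level": row["level"],
            "dup_type": row["dup_type"],
            "member_count": len(row["item_indices"]),
            "methods": row["methods"],
        }
        for row in rows
    ]


summary = summarize_groups(groups)
```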

aggregate_by_image()

Return a DataFrame listing each unique image involved in duplicates.

Explodes item_indices so each image appears once, with counts and metadata about which groups and methods flagged it.

Returns:

DataFrame with columns:

item_index: int - The image index
group_count: int - Number of duplicate groups this image appears in
dup_types: list[str] - Unique duplicate types for this image
methods: list[str] - All unique methods that detected this image

Return type:

pl.DataFrame
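
The explode-then-aggregate step can be illustrated without Polars. In this sketch each group is unrolled into one entry per image, and the entries are then combined per image. The rows and the summarize_images helper are hypothetical, and counting an image only once per group is an assumption rather than documented behavior.

```python
from collections import defaultdict

# Hypothetical duplicate groups; image 3 belongs to both of them.
groups = [
    {"group_id": 0, "dup_type": "exact",
     "item_indices": [3, 17], "methods": ["exact_hash"]},
    {"group_id": 1, "dup_type": "near",
     "item_indices": [3, 8, 21], "methods": ["phash", "dhash"]},
]


def summarize_images(rows):
    """Collect group count, duplicate types, and methods for each image index."""
    per_image = defaultdict(
        lambda: {"group_count": 0, "dup_types": set(), "methods": set()}
    )
    for row in rows:
        # Assumption: an image is counted once per group it appears in.
        for item_index in set(row["item_indices"]):
            entry = per_image[item_index]
            entry["group_count"] += 1
            entry["dup_types"].add(row["dup_type"])
            entry["methods"].update(row["methods"])
    return [
        {
            "item_index": item_index,
            "group_count": entry["group_count"],
            "dup_types": sorted(entry["dup_types"]),
            "methods": sorted(entry["methods"]),
        }
        for item_index, entry in sorted(per_image.items())
    ]


by_image = summarize_images(groups)
```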

aggregate_by_method()

Return a DataFrame summarizing duplicate counts per detection method.

Explodes the methods list so each method is counted individually.

Returns:

DataFrame with columns:

method: str - Detection method name
group_count: int - Number of groups detected by this method
total_members: int - Total members across those groups

Return type:

pl.DataFrame
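
The per-method counts can be sketched the same way. Because the methods list is exploded, a group flagged by two methods contributes to both totals. The rows and the summarize_methods helper are hypothetical, and treating total_members as the sum of group sizes is an assumption.

```python
# Hypothetical near-duplicate groups; values are invented for illustration.
groups = [
    {"group_id": 1, "item_indices": [3, 8, 21], "methods": ["phash", "dhash"]},
    {"group_id": 2, "item_indices": [5, 9], "methods": ["phash"]},
]


def summarize_methods(rows):
    """Count groups and members per method, one tally per listed method."""
    totals = {}
    for row in rows:
        for method in row["methods"]:
            stats = totals.setdefault(
                method, {"group_count": 0, "total_members": 0}
            )
            stats["group_count"] += 1
            # Assumption: total_members sums the sizes of the matching groups.
            stats["total_members"] += len(row["item_indices"])
    return [{"method": method, **stats} for method, stats in sorted(totals.items())]


by_method = summarize_methods(groups)
```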

data()

Return the output data as a Polars DataFrame.

meta()

Return metadata about the execution of the function or method that produced this output.

Return type: ExecutionMetadata

with_sensitivity(cluster_sensitivity)

Re-detect cluster-based duplicates with a different distance factor.

Hash-based duplicates are deterministic and are not affected. Only cluster-based near duplicates are recomputed using the stored clustering result (MST + cluster assignments).

Parameters:

cluster_sensitivity: float - Controls how aggressively points are considered duplicates by scaling the cluster’s standard deviation. An edge is flagged as a duplicate link when its distance is below cluster_sensitivity * cluster_std. Lower values are stricter and yield fewer near duplicates; higher values are more sensitive and yield more. Typical range: 0.1 to 3.0. Must be positive.

Returns:

New output with re-detected duplicates using the new distance factor.

Return type:

DuplicatesOutput

Raises:

ValueError – If this output was not created from an evaluation with cluster results.
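
The cutoff rule described for cluster_sensitivity can be sketched as a filter over weighted edges. This is a minimal illustration, not the library's implementation: the duplicate_links helper, the edge list, and the use of a single cluster_std value for all edges are assumptions made for the example.

```python
def duplicate_links(edges, cluster_std, cluster_sensitivity):
    """Keep edges whose distance is below cluster_sensitivity * cluster_std."""
    if cluster_sensitivity <= 0:
        raise ValueError("cluster_sensitivity must be positive")
    cutoff = cluster_sensitivity * cluster_std
    return [(a, b) for a, b, distance in edges if distance < cutoff]


# Hypothetical MST edges as (node_a, node_b, distance).
edges = [(0, 1, 0.05), (1, 2, 0.40), (2, 3, 1.20)]
cluster_std = 0.5  # assumed to be shared by every edge in this sketch

strict = duplicate_links(edges, cluster_std, 0.5)  # cutoff is 0.25
loose = duplicate_links(edges, cluster_std, 3.0)   # cutoff is 1.5
```

With the lower factor only the shortest edge is kept; with the higher factor all three edges fall under the cutoff.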

property exact : TExactDuplicatesGroup

Exact duplicate groups as lists of indices.

For single-dataset item results: list[list[int]]
For single-dataset target results: list[list[SourceIndex]]
For cross-dataset item results: dict[int, list[list[int]]]
For cross-dataset target results: dict[int, list[list[SourceIndex]]]
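
One common use of the single-dataset item shape, list[list[int]], is choosing which indices to discard. The sketch below keeps the lowest index in each group and marks the rest for removal. The group values and the indices_to_drop helper are hypothetical, and keeping the lowest index is only one possible policy.

```python
# Hypothetical value with the single-dataset item shape: list[list[int]].
exact_groups = [[3, 17], [12, 4, 9]]


def indices_to_drop(groups):
    """Keep the lowest index of each group and return the others, sorted."""
    drop = set()
    for group in groups:
        _keep, *rest = sorted(group)
        drop.update(rest)
    return sorted(drop)


to_drop = indices_to_drop(exact_groups)
```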

property items : Self

Return a filtered DuplicatesOutput containing only item-level duplicate groups.

The returned object supports the same properties (exact, near) and aggregation methods as the original output.

property near : TNearDuplicatesGroup

Near-duplicate groups as (indices, methods) tuples.

Each group is a tuple of (indices, methods) where methods is the list[str] of detection methods that flagged the group.

For single-dataset item results: list[tuple[list[int], list[str]]]
For single-dataset target results: list[tuple[list[SourceIndex], list[str]]]
For cross-dataset item results: dict[int, list[tuple[list[int], list[str]]]]
For cross-dataset target results: dict[int, list[tuple[list[SourceIndex], list[str]]]]
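
Because each near-duplicate group carries its detection methods, groups can be filtered by method after unpacking the tuple. The sketch below uses the single-dataset item shape. The values and the groups_flagged_by helper are hypothetical.

```python
# Hypothetical value with the shape list[tuple[list[int], list[str]]].
near_groups = [
    ([3, 8, 21], ["phash", "dhash"]),
    ([5, 9], ["phash"]),
]


def groups_flagged_by(groups, method):
    """Return the index lists of groups that the given method flagged."""
    return [indices for indices, methods in groups if method in methods]


dhash_groups = groups_flagged_by(near_groups, "dhash")
phash_groups = groups_flagged_by(near_groups, "phash")
```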

property targets : Self

Return a filtered DuplicatesOutput containing only target-level duplicate groups.

The returned object supports the same properties (exact, near) and aggregation methods as the original output.