dataeval.quality.DuplicatesOutput

class dataeval.quality.DuplicatesOutput(data, *, calculation_results=None, cluster_result=None, cluster_sensitivity=None, merge_near_duplicates=True, flags=ImageStats.NONE)

Output class for Duplicates detector.

Wraps a Polars DataFrame of duplicate groups and provides aggregation helpers, plus sensitivity-based redetection of cluster-based near duplicates.

DataFrame of duplicate groups with columns:

  • group_id: int - Auto-incrementing ID for each duplicate group

  • level: str - "item" or "target"

  • dup_type: str - "exact" or "near"

  • item_indices: list[int] - Item indices of members in the group

  • target_indices: list[int] - Target indices within items (only when target-level groups exist, positionally aligned with item_indices)

  • methods: list[str] - Detection method names (e.g., ["phash", "dhash"])

  • orientation: str | None - "same", "rotated", or None (only present when both basic and D4 hashes were computed)

  • dataset_index: list[int] - Dataset indices for cross-dataset results (only present for multi-dataset output, positionally aligned with item_indices)
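To make the schema concrete, here is a minimal sketch of what a few rows might look like, written as plain Python dictionaries rather than an actual Polars DataFrame. All values are invented for illustration, the optional columns (target_indices, orientation, dataset_index) are omitted, and exact_hash is a placeholder method name, not one documented here.

```python
# Hypothetical rows that mirror the documented columns; all values are invented.
rows = [
    {"group_id": 0, "level": "item", "dup_type": "exact",
     "item_indices": [3, 17], "methods": ["exact_hash"]},  # placeholder method name
    {"group_id": 1, "level": "item", "dup_type": "near",
     "item_indices": [5, 8, 21], "methods": ["phash", "dhash"]},
    {"group_id": 2, "level": "item", "dup_type": "near",
     "item_indices": [8, 40], "methods": ["phash"]},
]

# Select the near-duplicate groups by filtering on the dup_type column.
near_rows = [r for r in rows if r["dup_type"] == "near"]
near_group_ids = [r["group_id"] for r in near_rows]
print(near_group_ids)
```

Note that item 8 appears in two near groups in this sample, which is the kind of overlap the merge_near_duplicates option is described as handling.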

calculation_results

The original hash statistics. Used internally for redetection via with_sensitivity().

Type:

StatsResult or Sequence[StatsResult] or None

cluster_result

The clustering result (MST + cluster assignments). Used internally for redetection via with_sensitivity().

Type:

ClusterResult or None

cluster_sensitivity

Factor used for cluster-based near duplicate detection. Scales the cluster standard deviation to set the duplicate cutoff.

Type:

float or None

merge_near_duplicates

Whether overlapping near duplicate groups were merged.

Type:

bool

flags

The hash statistics flags used for detection.

Type:

ImageStats

aggregate_by_group()

Return a DataFrame summarizing each duplicate group.

Adds a member_count column showing the size of each group.

Returns:

DataFrame with columns:

  • group_id: int - Group identifier

  • level: str - "item" or "target"

  • dup_type: str - "exact" or "near"

  • member_count: int - Number of members in the group

  • methods: list[str] - Detection methods

  • orientation: str | None - Only present when both basic and D4 hashes were computed

Return type:

pl.DataFrame
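As a rough illustration of the summary described above, the sketch below derives member_count from plain Python rows. It assumes member_count equals the length of item_indices, which is an interpretation of the description and not a statement about the library's implementation. The sample rows are invented.

```python
# Hypothetical duplicate-group rows (plain dicts standing in for DataFrame rows).
rows = [
    {"group_id": 0, "level": "item", "dup_type": "exact",
     "item_indices": [3, 17], "methods": ["exact_hash"]},  # placeholder method name
    {"group_id": 1, "level": "item", "dup_type": "near",
     "item_indices": [5, 8, 21], "methods": ["phash", "dhash"]},
    {"group_id": 2, "level": "item", "dup_type": "near",
     "item_indices": [8, 40], "methods": ["phash"]},
]

# One summary row per group; member_count is assumed to be len(item_indices).
group_summary = [
    {
        "group_id": r["group_id"],
        "level": r["level"],
        "dup_type": r["dup_type"],
        "member_count": len(r["item_indices"]),
        "methods": r["methods"],
    }
    for r in rows
]

for row in group_summary:
    print(row)
```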

aggregate_by_image()

Return a DataFrame listing each unique image involved in duplicates.

Explodes item_indices so each image appears once, with counts and metadata about which groups and methods flagged it.

Returns:

DataFrame with columns:

  • item_index: int - The image index

  • group_count: int - Number of duplicate groups this image appears in

  • dup_types: list[str] - Unique duplicate types for this image

  • methods: list[str] - All unique methods that detected this image

Return type:

pl.DataFrame
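The "explode" step can be pictured with a small pure-Python sketch: each group's item_indices list is flattened so that every image contributes one entry per group it belongs to, and the entries are then regrouped by image. This is only an illustration of the described result on invented rows. The actual method operates on the Polars DataFrame, and the sorting of the output lists is this sketch's choice.

```python
from collections import defaultdict

# Hypothetical duplicate-group rows (plain dicts standing in for DataFrame rows).
rows = [
    {"group_id": 0, "dup_type": "exact", "item_indices": [3, 17],
     "methods": ["exact_hash"]},  # placeholder method name
    {"group_id": 1, "dup_type": "near", "item_indices": [5, 8, 21],
     "methods": ["phash", "dhash"]},
    {"group_id": 2, "dup_type": "near", "item_indices": [8, 40],
     "methods": ["phash"]},
]

per_image = defaultdict(
    lambda: {"group_count": 0, "dup_types": set(), "methods": set()}
)
for r in rows:
    # "Explode": visit every image index in the group individually.
    for idx in set(r["item_indices"]):
        entry = per_image[idx]
        entry["group_count"] += 1
        entry["dup_types"].add(r["dup_type"])
        entry["methods"].update(r["methods"])

# One row per unique image, ordered by item_index.
image_table = [
    {
        "item_index": idx,
        "group_count": e["group_count"],
        "dup_types": sorted(e["dup_types"]),
        "methods": sorted(e["methods"]),
    }
    for idx, e in sorted(per_image.items())
]

for row in image_table:
    print(row)
```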

aggregate_by_method()

Return a DataFrame summarizing duplicate counts per detection method.

Explodes the methods list so each method is counted individually.

Returns:

DataFrame with columns:

  • method: str - Detection method name

  • group_count: int - Number of groups detected by this method

  • total_members: int - Total members across those groups

Return type:

pl.DataFrame
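A small pure-Python sketch of the per-method counting may help. It assumes that a group flagged by several methods is counted once under each of them, and that total_members sums the group sizes. Both are readings of the description above, not confirmed implementation details, and the rows are invented.

```python
from collections import defaultdict

# Hypothetical duplicate-group rows (plain dicts standing in for DataFrame rows).
rows = [
    {"group_id": 1, "item_indices": [5, 8, 21], "methods": ["phash", "dhash"]},
    {"group_id": 2, "item_indices": [8, 40], "methods": ["phash"]},
]

counts = defaultdict(lambda: {"group_count": 0, "total_members": 0})
for r in rows:
    # "Explode" the methods list: the group contributes to every method in it.
    for method in r["methods"]:
        counts[method]["group_count"] += 1
        counts[method]["total_members"] += len(r["item_indices"])

method_table = [
    {"method": m, **c} for m, c in sorted(counts.items())
]

for row in method_table:
    print(row)
```

Because groups can be shared between methods, the group_count values in this sketch can add up to more than the number of groups.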

data()

Return the output data as a Polars DataFrame.

meta()

Return metadata about the execution of the function or method that produced this output.

Return type:

ExecutionMetadata

with_sensitivity(cluster_sensitivity)

Re-detect cluster-based duplicates with a different distance factor.

Hash-based duplicates are deterministic and are not affected. Only cluster-based near duplicates are recomputed using the stored clustering result (MST + cluster assignments).

Parameters:
cluster_sensitivity : float

Controls how aggressively points are considered duplicates by scaling the cluster’s standard deviation. An edge is flagged as a duplicate link when its distance is below cluster_sensitivity * cluster_std. Lower values are stricter (fewer near duplicates); higher values are more sensitive (more near duplicates). Typical range: 0.1 – 3.0. Must be positive.

Returns:

New output with re-detected duplicates using the new distance factor.

Return type:

DuplicatesOutput

Raises:

ValueError – If this output was not created from an evaluation with cluster results.
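The cutoff rule described for cluster_sensitivity can be sketched in a few lines. The function below is a hypothetical helper, not part of the library. It assumes a single cluster_std value for all edges and represents MST edges as (u, v, distance) tuples, both of which are simplifications made for illustration.

```python
def duplicate_links(edges, cluster_std, cluster_sensitivity):
    """Return the (u, v) pairs whose distance falls below the scaled cutoff.

    Hypothetical helper: edges is a list of (u, v, distance) tuples.
    """
    # The source only says the value must be positive; raising is this sketch's choice.
    if cluster_sensitivity <= 0:
        raise ValueError("cluster_sensitivity must be positive")
    cutoff = cluster_sensitivity * cluster_std
    return [(u, v) for u, v, dist in edges if dist < cutoff]


# Invented edges and standard deviation for one cluster.
edges = [(0, 1, 0.05), (1, 2, 0.40), (2, 3, 1.20)]
cluster_std = 0.5

strict = duplicate_links(edges, cluster_std, 0.5)  # cutoff = 0.25
loose = duplicate_links(edges, cluster_std, 2.0)   # cutoff = 1.0

print(strict)
print(loose)
```

With the lower factor only the tightest edge passes the cutoff, while the higher factor also admits the second edge, matching the "lower is stricter" behavior described above.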

property exact : TExactDuplicatesGroup

Exact duplicate groups as lists of indices.

  • For single-dataset item results: list[list[int]]

  • For single-dataset target results: list[list[SourceIndex]]

  • For cross-dataset item results: dict[int, list[list[int]]]

  • For cross-dataset target results: dict[int, list[list[SourceIndex]]]
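For the single-dataset item case, the list[list[int]] shape lends itself to simple deduplication. The sketch below uses an invented value in that shape and keeps the first member of each group while marking the rest for removal. Which member to keep is a choice made for this example, not something the library prescribes.

```python
# Invented value in the single-dataset, item-level shape: list[list[int]].
exact = [[3, 17], [4, 9, 12]]

# Keep the first index in each group and collect the remaining ones to drop.
to_keep = [group[0] for group in exact]
to_drop = sorted(i for group in exact for i in group[1:])

print(to_keep)
print(to_drop)
```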

property items : Self

Return a filtered DuplicatesOutput containing only item-level duplicate groups.

The returned object supports the same properties (exact, near) and aggregation methods as the original output.

property near : TNearDuplicatesGroup

Near-duplicate groups as (indices, methods) tuples.

In each tuple, methods is the list[str] of detection methods that flagged the group.

  • For single-dataset item results: list[tuple[list[int], list[str]]]

  • For single-dataset target results: list[tuple[list[SourceIndex], list[str]]]

  • For cross-dataset item results: dict[int, list[tuple[list[int], list[str]]]]

  • For cross-dataset target results: dict[int, list[tuple[list[SourceIndex], list[str]]]]
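As an illustration of the single-dataset item shape, the sketch below unpacks an invented list[tuple[list[int], list[str]]] value. It collects every flagged index and then picks out the groups that were flagged by both phash and dhash. The filtering criterion is only an example of how the methods list might be used.

```python
# Invented value in the single-dataset, item-level shape.
near = [
    ([5, 8, 21], ["phash", "dhash"]),
    ([8, 40], ["phash"]),
]

# Every index that appears in at least one near-duplicate group.
flagged = sorted({i for indices, _methods in near for i in indices})

# Groups on which both hash methods agree.
required = {"phash", "dhash"}
agreed = [indices for indices, methods in near if required <= set(methods)]

print(flagged)
print(agreed)
```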

property targets : Self

Return a filtered DuplicatesOutput containing only target-level duplicate groups.

The returned object supports the same properties (exact, near) and aggregation methods as the original output.