dataeval.quality.Duplicates.Config

class dataeval.quality.Duplicates.Config

Configuration for the Duplicates detector.

flags

Statistics to compute for hash-based duplicate detection.

Type:

ImageStats, default ImageStats.HASH

cluster_threshold

Threshold for cluster-based near-duplicate detection. Clustering is enabled only when this is provided together with extractor.

Type:

float or None, default None

merge_near_duplicates

Whether to merge overlapping near-duplicate groups.

Type:

bool, default True
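To make "merging overlapping groups" concrete, here is a minimal sketch of one way such a merge can work: groups that share at least one index are combined with a union-find structure. The function name merge_overlapping_groups is hypothetical and this is an illustration of the idea only, not the detector's actual implementation.

```python
def merge_overlapping_groups(groups):
    """Merge groups of indices that share at least one member.

    Illustrative only: a hypothetical helper, not part of the library.
    """
    parent = {}

    def find(x):
        # Register unseen items as their own root, then walk up to the root
        # while shortening the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller index as the root for deterministic output.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    for group in groups:
        if not group:
            continue
        first = group[0]
        find(first)
        for item in group[1:]:
            union(first, item)

    # Collect every item under its root, then sort for stable output.
    merged = {}
    for item in list(parent):
        merged.setdefault(find(item), []).append(item)
    return sorted(sorted(members) for members in merged.values())


if __name__ == "__main__":
    # Groups [0, 1] and [1, 2] share index 1, so they collapse into one group.
    print(merge_overlapping_groups([[0, 1], [1, 2], [5, 6]]))  # [[0, 1, 2], [5, 6]]
```

With merging disabled, the two overlapping groups in this example would presumably be reported separately; that reading is an assumption based on the attribute description above.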

extractor

Feature extractor for cluster-based duplicate detection.

Type:

FeatureExtractor or None, default None

cluster_algorithm

Clustering algorithm for cluster-based detection.

Type:

{"kmeans", "hdbscan"}, default "hdbscan"

n_clusters

Expected number of clusters.

Type:

int or None, default None
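The attributes and defaults listed above can be summarized in a small stand-in dataclass. This is a sketch for orientation only: DuplicatesConfigSketch and clustering_enabled are hypothetical names, the flags field uses a plain string in place of ImageStats.HASH, and nothing here imports or reproduces the real class. It also encodes the documented rule that clustering requires both cluster_threshold and extractor.

```python
from dataclasses import dataclass
from typing import Any, Literal, Optional


@dataclass
class DuplicatesConfigSketch:
    """Hypothetical stand-in mirroring the documented fields and defaults."""

    flags: str = "HASH"  # placeholder for ImageStats.HASH
    cluster_threshold: Optional[float] = None
    merge_near_duplicates: bool = True
    extractor: Optional[Any] = None  # placeholder for a FeatureExtractor
    cluster_algorithm: Literal["kmeans", "hdbscan"] = "hdbscan"
    n_clusters: Optional[int] = None

    def clustering_enabled(self) -> bool:
        # Documented rule: the threshold and the extractor must both be set.
        return self.cluster_threshold is not None and self.extractor is not None


if __name__ == "__main__":
    defaults = DuplicatesConfigSketch()
    print(defaults.clustering_enabled())  # False: hash-based detection only

    # Any object stands in for a feature extractor in this sketch.
    clustered = DuplicatesConfigSketch(cluster_threshold=0.5, extractor=object())
    print(clustered.clustering_enabled())  # True
```

The threshold value 0.5 is arbitrary and chosen only to show both fields being set together.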