Clustering, Deduplication and Outlier Detection

As part of data exploration, we often want to know how the data groups together. The Clusterer class uses hierarchical clustering to group the data and flags duplicate images as well as outlier images.

The Clusterer identifies both exact duplicate and near duplicate images based on their distance. Near duplicate images are defined as images whose distance is within the standard deviation of the cluster to which they belong. Because the threshold is derived from each image's own cluster, near duplicate detection accounts for differences in cluster density.
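To make the per-cluster threshold concrete, here is a minimal sketch. It is not DataEval's implementation. It assumes one-dimensional toy values, absolute difference as the distance, and that "the standard deviation of the cluster" means the standard deviation of the pairwise distances inside that cluster. The helper name `near_duplicate_pairs` and the sample data are hypothetical.

```python
from itertools import combinations
from statistics import pstdev


def near_duplicate_pairs(points, cluster_ids):
    """Return (exact, near) lists of index pairs, judged cluster by cluster.

    Assumption: the threshold is the standard deviation of the pairwise
    distances within each cluster. This is an illustrative reading of the
    rule, not the library's code.
    """
    # Group sample indices by their cluster id.
    clusters = {}
    for idx, cid in enumerate(cluster_ids):
        clusters.setdefault(cid, []).append(idx)

    exact, near = [], []
    for members in clusters.values():
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # a single-member cluster has no pairs to compare
        dists = [abs(points[i] - points[j]) for i, j in pairs]
        threshold = pstdev(dists)  # per-cluster threshold
        for (i, j), d in zip(pairs, dists):
            if d == 0:
                exact.append((i, j))
            elif d <= threshold:
                near.append((i, j))
    return exact, near


# Cluster 0 is dense, cluster 1 is sparse.
points = [0.0, 0.0, 0.1, 5.0, 20.0, 40.0, 43.0, 70.0]
cluster_ids = [0, 0, 0, 0, 1, 1, 1, 1]
exact, near = near_duplicate_pairs(points, cluster_ids)
print(exact)  # [(0, 1)]
print(near)   # [(0, 2), (1, 2), (5, 6)]
```

In this toy data the pair (5, 6) is 3.0 apart, which would exceed the dense cluster's threshold but falls within the sparse cluster's larger one. That is the density effect described above.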

The Clusterer identifies outliers based on their distance. After determining where the data splits into groups, outliers are defined as samples that lie outside of 2 standard deviations of the average intra-cluster distance.
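The 2 standard deviation rule can be sketched as follows. This is an illustration under stated assumptions, not the library's algorithm: it uses one-dimensional toy values, measures each sample's distance to its cluster mean, and flags samples whose distance exceeds the cluster's average distance by more than 2 standard deviations. The helper name `flag_outliers` and the data are hypothetical.

```python
from statistics import mean, pstdev


def flag_outliers(points, cluster_ids, n_std=2.0):
    """Return sorted indices of samples far from their own cluster.

    Assumption: "intra-cluster distance" is taken here as the distance of
    each sample to its cluster mean. DataEval may measure it differently.
    """
    # Group sample indices by their cluster id.
    clusters = {}
    for idx, cid in enumerate(cluster_ids):
        clusters.setdefault(cid, []).append(idx)

    outliers = []
    for members in clusters.values():
        center = mean(points[i] for i in members)
        dists = [abs(points[i] - center) for i in members]
        # Cutoff: average distance plus n_std standard deviations.
        cutoff = mean(dists) + n_std * pstdev(dists)
        outliers.extend(i for i, d in zip(members, dists) if d > cutoff)
    return sorted(outliers)


# Cluster 0 has one sample (index 9) far from the rest; cluster 1 is tight.
points = [1.0, 1.1, 0.9, 1.0, 1.2, 0.8, 1.0, 1.1, 0.9, 6.0,
          10.0, 10.2, 9.8, 10.1, 9.9, 10.0]
cluster_ids = [0] * 10 + [1] * 6
outliers = flag_outliers(points, cluster_ids)
print(outliers)  # [9]
```

Because the cutoff is computed per cluster, the tight second cluster gets a much smaller cutoff than the first, yet none of its samples exceed it.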

Tutorials

Check out this tutorial to begin using the Clusterer class

Clusterer Tutorial

How To Guides

There are currently no how-to guides for the Clusterer. If there are scenarios that you want us to explain, contact us!

DataEval API

class dataeval.detectors.Clusterer(dataset: ndarray)

Uses hierarchical clustering to flag dataset properties of interest like outliers and duplicates

Parameters:: dataset (np.ndarray) – An array of images or image embeddings to cluster

evaluate()

Finds and flags indices of the data for outliers and duplicates

Returns:: Dictionary containing lists of outliers, potential outliers, duplicates, and near duplicates under the keys “outliers”, “potential_outliers”, “duplicates”, and “near_duplicates”, respectively
Return type:: Dict[str, Union[List[int], List[List[int]]]]
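As a hedged usage sketch, the snippet below shows one way to read the dictionary returned by evaluate(). Only the four key names and the value shapes come from the documentation above. The helper `summarize_flags` is hypothetical, and `example_result` is a hand-written stand-in rather than real Clusterer output.

```python
def summarize_flags(result):
    """Count the entries under each documented key of an evaluate() result.

    For "outliers" and "potential_outliers" the count is a number of sample
    indices; for "duplicates" and "near_duplicates" it is a number of groups.
    """
    keys = ("outliers", "potential_outliers", "duplicates", "near_duplicates")
    return {key: len(result.get(key, [])) for key in keys}


# With DataEval installed, the documented call pattern would be:
#     from dataeval.detectors import Clusterer
#     result = Clusterer(dataset).evaluate()
# The dictionary below is made-up data with the documented shape.
example_result = {
    "outliers": [12, 47],
    "potential_outliers": [3],
    "duplicates": [[0, 5, 9], [21, 22]],
    "near_duplicates": [[7, 8]],
}

summary = summarize_flags(example_result)
print(summary)
# {'outliers': 2, 'potential_outliers': 1, 'duplicates': 2, 'near_duplicates': 1}
```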

find_duplicates(last_merge_levels: Dict[int, int]) → Tuple[List[List[int]], List[List[int]]]

Finds duplicate and near duplicate data based on the last good merge levels when building the cluster

Parameters:: last_merge_levels (Dict[int, int]) – A mapping of a cluster id to its last good merge level
Returns:: The exact duplicates and near duplicates as lists of related indices
Return type:: Tuple[List[List[int]], List[List[int]]]
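Since duplicates come back as lists of related indices, one possible follow-up is to keep a single representative per group and drop the rest. The sketch below is an illustrative policy, not something DataEval prescribes. The helper name `indices_to_drop` and the sample groups are hypothetical.

```python
def indices_to_drop(groups):
    """Keep the lowest index in each group and return the others, sorted.

    Assumption: keeping the lowest index is an arbitrary but deterministic
    choice; any other member of the group would serve equally well.
    """
    drop = set()
    for group in groups:
        if not group:
            continue  # ignore empty groups defensively
        keep = min(group)
        drop.update(i for i in group if i != keep)
    return sorted(drop)


# Made-up groups shaped like the documented List[List[int]] return value.
exact_duplicates = [[0, 5, 9], [4, 3]]
to_drop = indices_to_drop(exact_duplicates)
print(to_drop)  # [4, 5, 9]
```

The same helper applies to near duplicate groups, although those usually deserve a visual check before anything is removed.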

find_outliers(last_merge_levels: Dict[int, int]) → Tuple[List[int], List[int]]

Retrieves outliers based on when the sample was added to the cluster and how far it was from the cluster when it was added

Parameters:: last_merge_levels (Dict[int, int]) – A mapping of a cluster id to its last good merge level
Returns:: The outliers and potential outliers as sorted lists of indices
Return type:: Tuple[List[int], List[int]]