Clustering, Deduplication and Outlier Detection
As part of data exploration, we often want to know how the data groups together. The Clusterer class uses hierarchical clustering to group the data and flags both duplicate and outlier images.
The Clusterer identifies both exact duplicates and near duplicates based on the distance between images. Near duplicates are images whose distance is within the standard deviation of the cluster to which they belong. Because the threshold is derived from each image's own cluster, near-duplicate detection accounts for differences in cluster density.
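The per-cluster threshold described above can be illustrated with a minimal sketch. This is not the Clusterer's implementation: the function name `find_duplicate_pairs`, the use of Euclidean distance, and the choice to take the standard deviation over a cluster's pairwise distances are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist
from statistics import pstdev


def find_duplicate_pairs(points, labels):
    """Return (exact, near) lists of index pairs, evaluated cluster by cluster.

    Hypothetical helper: a pair is "exact" when its distance is zero and
    "near" when its distance is within the cluster's standard deviation.
    """
    exact, near = [], []
    for cluster in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == cluster]
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # a single-member cluster has no pairs to compare
        distances = [dist(points[i], points[j]) for i, j in pairs]
        # Assumption: the cluster's spread is the population standard
        # deviation of its pairwise distances.
        threshold = pstdev(distances)
        for (i, j), d in zip(pairs, distances):
            if d == 0.0:
                exact.append((i, j))
            elif d <= threshold:
                near.append((i, j))
    return exact, near


# Cluster 0 holds two identical points, one very close point, and two distant ones.
# Cluster 1 is sparser, so its threshold is computed separately.
points = [(0, 0), (0, 0), (0.1, 0), (5, 0), (0, 5),
          (100, 100), (100, 103), (100, 110)]
labels = [0, 0, 0, 0, 0, 1, 1, 1]

exact, near = find_duplicate_pairs(points, labels)
print("exact duplicates:", exact)
print("near duplicates:", near)
```

Because each cluster sets its own threshold, the same absolute distance can count as a near duplicate in one cluster and not in another.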
The Clusterer also identifies outliers based on distance. Once the data has been split into groups, outliers are defined as samples that lie more than 2 standard deviations from the average intra-cluster distance.
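As a rough illustration of the 2 standard deviation rule, the sketch below flags samples whose distance exceeds the mean by more than two standard deviations. The function name `flag_outliers`, the one-sided (upper tail) cutoff, and the sample distances are assumptions for illustration only, not the Clusterer's internals.

```python
from statistics import mean, pstdev


def flag_outliers(distances, num_std=2.0):
    """Return sorted indices whose distance exceeds mean + num_std * std.

    Hypothetical helper: `distances` holds one intra-cluster distance per sample.
    """
    center = mean(distances)
    spread = pstdev(distances)
    # Assumption: only unusually large distances are treated as outliers.
    cutoff = center + num_std * spread
    return sorted(i for i, d in enumerate(distances) if d > cutoff)


# Seven samples sit close to their cluster; the last one is far away.
distances = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.0, 6.0]
outlier_indices = flag_outliers(distances)
print(outlier_indices)
```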
Tutorials
Check out this tutorial to begin using the Clusterer class.
How To Guides
There are currently no how-to guides for the Clusterer. If there are scenarios that you want us to explain, contact us!
DataEval API
- class dataeval.detectors.Clusterer(dataset: ndarray)
Uses hierarchical clustering to flag dataset properties of interest like outliers and duplicates
- Parameters:
dataset (np.ndarray) – An array of images or image embeddings to perform clustering
- evaluate()
Finds and flags indices of the data for outliers and duplicates
- Returns:
Dictionary containing list of outliers, potential outliers, duplicates, and near duplicates in keys “outliers”, “potential_outliers”, “duplicates”, “near_duplicates” respectively
- Return type:
Dict[str, Union[List[int], List[List[int]]]]
- find_duplicates(last_merge_levels: Dict[int, int]) → Tuple[List[List[int]], List[List[int]]]
Finds duplicate and near duplicate data based on the last good merge levels when building the cluster
- Parameters:
last_merge_levels (Dict[int, int]) – A mapping of a cluster id to its last good merge level
- Returns:
The exact duplicates and near duplicates as lists of related indices
- Return type:
Tuple[List[List[int]], List[List[int]]]
- find_outliers(last_merge_levels: Dict[int, int]) → Tuple[List[int], List[int]]
Retrieves outliers based on when the sample was added to the cluster and how far it was from the cluster when it was added
- Parameters:
last_merge_levels (Dict[int, int]) – A mapping of a cluster id to its last good merge level
- Returns:
The outliers and potential outliers as sorted lists of indices
- Return type:
Tuple[List[int], List[int]]
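The dictionary returned by evaluate() can be post-processed along the following lines. This sketch does not call the library: `example_results` is a hand-written stand-in that only mirrors the documented keys, and `indices_to_review` is a hypothetical helper. Keeping the lowest index of each duplicate group is a design choice made for the example, not library behavior.

```python
from typing import Dict, List, Set


def indices_to_review(results: Dict[str, list],
                      include_near_duplicates: bool = False) -> List[int]:
    """Collect outliers plus all but one member of each duplicate group."""
    flagged: Set[int] = set(results.get("outliers", []))
    groups = list(results.get("duplicates", []))
    if include_near_duplicates:
        groups += list(results.get("near_duplicates", []))
    for group in groups:
        # Keep the lowest index in each group and flag the remaining members.
        flagged.update(sorted(group)[1:])
    return sorted(flagged)


# Hand-written stand-in shaped like the documented evaluate() output.
example_results = {
    "outliers": [12],
    "potential_outliers": [3, 40],
    "duplicates": [[0, 7], [5, 9, 21]],
    "near_duplicates": [[2, 30]],
}

print(indices_to_review(example_results))
print(indices_to_review(example_results, include_near_duplicates=True))
```

Potential outliers are left out of the flagged set here on the assumption that they call for manual inspection rather than automatic removal.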