Clusterer
- class dataeval.detectors.Clusterer(dataset: ArrayLike)
Uses hierarchical clustering to flag dataset properties of interest, such as outliers and duplicates.
- Parameters:
dataset (ArrayLike, shape - (N, P)) – A dataset in an ArrayLike format. The data must be 2-dimensional: N observations in a P-dimensional feature space.
Warning
The Clusterer class is computationally intensive and may fail due to insufficient memory.
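To get a rough sense of when memory might become a problem, the sketch below estimates the size of a condensed pairwise distance matrix. It assumes that one float64 distance is stored per pair of samples, which is common for hierarchical clustering but is not confirmed for this class. The helper name condensed_distance_bytes is hypothetical and not part of dataeval.

```python
def condensed_distance_bytes(n_samples: int, bytes_per_value: int = 8) -> int:
    """Bytes needed to hold one distance per unordered pair of samples.

    Assumption: N * (N - 1) / 2 float64 values are stored; the actual
    memory use of Clusterer may differ.
    """
    n_pairs = n_samples * (n_samples - 1) // 2
    return n_pairs * bytes_per_value


# Print the estimate for a few dataset sizes.
for n in (1_000, 10_000, 100_000):
    gib = condensed_distance_bytes(n) / 2**30
    print(f"{n:>7} samples: {gib:.2f} GiB")
```

Under this assumption the requirement grows quadratically: about 0.37 GiB for 10,000 samples and about 37 GiB for 100,000.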
Note
The Clusterer works best when the length of the feature dimension, P, is less than 500. If flattening a CxHxW image yields more than 500 features, reduce the dimensionality before clustering.
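One possible way to prepare image data is to flatten each image and, if the result is too wide, project it onto its leading principal components. The sketch below uses plain NumPy. The function name flatten_and_reduce, the random test images, and the choice of PCA are illustrative assumptions, not something dataeval prescribes.

```python
import numpy as np


def flatten_and_reduce(images: np.ndarray, limit: int = 500) -> np.ndarray:
    """Flatten (N, C, H, W) images to (N, P); reduce with PCA if P >= limit."""
    n_samples = images.shape[0]
    flat = images.reshape(n_samples, -1).astype(np.float64)
    if flat.shape[1] < limit:
        return flat  # already narrow enough

    # PCA via SVD of the mean-centered data; rows of vt are principal axes.
    centered = flat - flat.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)

    # Keep fewer than `limit` components (and no more than are available).
    k = min(limit - 1, vt.shape[0])
    return centered @ vt[:k].T


rng = np.random.default_rng(0)
images = rng.random((64, 3, 16, 16))  # 3 * 16 * 16 = 768 features per image
reduced = flatten_and_reduce(images)
print(reduced.shape)
```

The reduced array has the (N, P) layout described under Parameters. Other reduction methods, such as a learned embedding, would serve the same purpose.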
Example
Initialize the Clusterer class:
>>> from dataeval.detectors import Clusterer
>>> cluster = Clusterer(dataset)
- evaluate() → ClustererOutput
Finds and flags the indices of outliers and duplicates in the data.
- Returns:
The indices of the outliers and duplicates found in the data
- Return type:
ClustererOutput
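One way to act on the result is to drop the flagged outliers and keep a single representative from each group of exact duplicates. The sketch below is not part of dataeval: ClustererOutputStandIn and indices_to_keep are hypothetical names, the stand-in only mirrors the four fields that appear in this page's example output, and the dataset size of 50 is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClustererOutputStandIn:
    """Hypothetical stand-in with the fields shown in the documented output."""
    outliers: List[int]
    potential_outliers: List[int]
    duplicates: List[List[int]]
    potential_duplicates: List[List[int]]


def indices_to_keep(result: ClustererOutputStandIn, n_samples: int) -> List[int]:
    """Drop outliers and all but the lowest index of each duplicate group."""
    drop = set(result.outliers)
    for group in result.duplicates:
        drop.update(sorted(group)[1:])  # keep the first member of the group
    return [i for i in range(n_samples) if i not in drop]


# Values copied from the documented example output.
result = ClustererOutputStandIn(
    outliers=[18, 21, 34, 35, 45],
    potential_outliers=[13, 15, 42],
    duplicates=[[9, 24], [23, 48]],
    potential_duplicates=[[1, 11]],
)
keep = indices_to_keep(result, n_samples=50)  # assumed dataset size
print(len(keep))
```

Potential outliers and potential duplicates are left in place here; whether to remove them is a judgment call that usually benefits from manual review.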
Example
>>> cluster.evaluate()
ClustererOutput(outliers=[18, 21, 34, 35, 45], potential_outliers=[13, 15, 42], duplicates=[[9, 24], [23, 48]], potential_duplicates=[[1, 11]])
- find_duplicates(last_merge_levels: Dict[int, int]) → Tuple[List[List[int]], List[List[int]]]
Finds duplicate and near-duplicate data based on the last good merge levels recorded when building the clusters.
- Parameters:
last_merge_levels (Dict[int, int]) – A mapping of a cluster id to its last good merge level
- Returns:
The exact duplicates and near duplicates as lists of related indices
- Return type:
Tuple[List[List[int]], List[List[int]]]
- find_outliers(last_merge_levels: Dict[int, int]) → Tuple[List[int], List[int]]
Retrieves outliers based on when each sample was added to its cluster and how far it was from the cluster at that time.
- Parameters:
last_merge_levels (Dict[int, int]) – A mapping of a cluster id to its last good merge level
- Returns:
The outliers and possible outliers as sorted lists of indices
- Return type:
Tuple[List[int], List[int]]