Clusterer

class dataeval.detectors.linters.Clusterer(dataset: ArrayLike)

Uses hierarchical clustering to flag dataset properties of interest, such as outliers and duplicates.

Parameters:

dataset (ArrayLike, shape - (N, P)) – A dataset in an ArrayLike format. The data must be 2-dimensional: N observations in a P-dimensional feature space.

Warning

The Clusterer class is computationally intensive and may fail if there is insufficient memory.

Note

The Clusterer works best when the length of the feature dimension, P, is less than 500. If flattening a CxHxW image results in more than 500 features, it is recommended to reduce the dimensionality before clustering.
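The note above can be illustrated with a minimal sketch, assuming NumPy is available. The helper names `flatten_images` and `reduce_with_pca` are hypothetical and not part of dataeval, and PCA is only one possible reduction technique; the documentation does not prescribe a specific method.

```python
import numpy as np


def flatten_images(images):
    """Reshape a batch of images (N, C, H, W) into a 2-D array (N, C*H*W)."""
    images = np.asarray(images)
    return images.reshape(images.shape[0], -1)


def reduce_with_pca(data, n_components):
    """Project (N, P) data onto its top principal components (hypothetical helper)."""
    # Center each feature so the SVD captures variance rather than the mean
    centered = data - data.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Stand-in data: 100 random 3x16x16 images, so P = 768 after flattening
rng = np.random.default_rng(0)
images = rng.random((100, 3, 16, 16))

flat = flatten_images(images)                      # shape (100, 768), P > 500
reduced = reduce_with_pca(flat, n_components=50)   # shape (100, 50), P < 500
```

The resulting `reduced` array has the (N, P) shape the Clusterer expects and could be passed as `dataset`. The choice of 50 components is arbitrary here.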

Example

Initialize the Clusterer class:

>>> cluster = Clusterer(dataset)

evaluate() → ClustererOutput

Finds and flags the indices of outliers and duplicates in the data.

Returns:

The indices of the outliers and duplicates found in the data.

Return type:

ClustererOutput

Example

>>> cluster.evaluate()
ClustererOutput(outliers=[18, 21, 34, 35, 45], potential_outliers=[13, 15, 42], duplicates=[[9, 24], [23, 48]], potential_duplicates=[[1, 11]])
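As a rough sketch of how such a result might be used, the snippet below derives a list of indices to remove from the dataset. The values are copied from the example output above as plain lists. The helper `indices_to_drop` is hypothetical, and dropping all outliers while keeping one sample per duplicate group is only one possible policy, not something the library prescribes.

```python
# Values copied from the example output above; in practice they would be
# read from the ClustererOutput returned by cluster.evaluate()
outliers = [18, 21, 34, 35, 45]
duplicates = [[9, 24], [23, 48]]


def indices_to_drop(outlier_indices, duplicate_groups):
    """Collect outliers plus every duplicate except the lowest index in each group."""
    drop = set(outlier_indices)
    for group in duplicate_groups:
        # Keep the first (lowest) index of each group, drop the rest
        drop.update(sorted(group)[1:])
    return sorted(drop)


to_drop = indices_to_drop(outliers, duplicates)
print(to_drop)  # [18, 21, 24, 34, 35, 45, 48]
```

The potential outliers and potential duplicates are left out here on the assumption that they would be reviewed manually before removal.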
find_duplicates(last_merge_levels: dict[int, int]) → tuple[list[list[int]], list[list[int]]]

Finds duplicate and near-duplicate data based on the last good merge levels recorded when building the clusters.

Parameters:

last_merge_levels (Dict[int, int]) – A mapping from each cluster ID to its last good merge level.

Returns:

The exact duplicates and near duplicates, each as a list of groups of related indices.

Return type:

Tuple[List[List[int]], List[List[int]]]

find_outliers(last_merge_levels: dict[int, int]) → tuple[list[int], list[int]]

Retrieves outliers based on when each sample was added to its cluster and how far it was from the cluster at that point.

Parameters:

last_merge_levels (Dict[int, int]) – A mapping from each cluster ID to its last good merge level.

Returns:

The outliers and potential outliers as sorted lists of indices.

Return type:

Tuple[List[int], List[int]]