dataeval.detectors.linters.Clusterer
====================================

.. py:class:: dataeval.detectors.linters.Clusterer(dataset)

   Uses hierarchical clustering to flag dataset properties of interest, such as Outliers and :term:`duplicates`.

   :param dataset: A dataset in an ArrayLike format. The data must have 2 dimensions: N observations in a P-dimensional space.
   :type dataset: ArrayLike, shape - (N, P)

   .. warning::
      The Clusterer class is heavily dependent on computational resources and may fail due to insufficient memory.

   .. note::
      The Clusterer works best when the length of the feature dimension, P, is less than 500.
      If flattening a CxHxW image results in a dimension larger than 500, it is recommended to reduce the dimensions.

   .. py:method:: evaluate()

      Finds and flags the indices of the data that are Outliers or :term:`duplicates`.

      :returns: The Outlier and duplicate indices found in the data
      :rtype: ClustererOutput

      .. rubric:: Example

      >>> cluster = Clusterer(clusterer_images)
      >>> cluster.evaluate()
      ClustererOutput(outliers=[18, 21, 34, 35, 45], potential_outliers=[13, 15, 42], duplicates=[[9, 24], [23, 48]], potential_duplicates=[[1, 11]])

   .. py:method:: find_duplicates(last_merge_levels)

      Finds duplicate and near-duplicate data based on the last good merge levels recorded when building the cluster.

      :param last_merge_levels: A mapping of a cluster ID to its last good merge level
      :type last_merge_levels: Dict[int, int]
      :returns: The exact :term:`duplicates` and near duplicates as lists of related indices
      :rtype: Tuple[List[List[int]], List[List[int]]]

   .. py:method:: find_outliers(last_merge_levels)

      Retrieves Outliers based on when each sample was added to the cluster and how far it was from the cluster when it was added.

      :param last_merge_levels: A mapping of a cluster ID to its last good merge level
      :type last_merge_levels: Dict[int, int]
      :returns: The outliers and possible outliers as sorted lists of indices
      :rtype: Tuple[List[int], List[int]]
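
The note on the feature dimension P can be illustrated with a small preprocessing sketch. It flattens CxHxW images into the (N, P) layout that ``Clusterer`` expects and, when P is 500 or more, projects the data onto fewer components. PCA via ``numpy.linalg.svd`` is used here only as one possible reduction technique; the documentation above does not prescribe a method. The helper name ``flatten_and_reduce`` and the ``max_features`` parameter are hypothetical and not part of the library.

```python
import numpy as np


def flatten_and_reduce(images: np.ndarray, max_features: int = 500) -> np.ndarray:
    """Flatten (N, C, H, W) images to (N, P); apply PCA if P >= max_features.

    Hypothetical helper for illustration only, not part of dataeval.
    """
    n = images.shape[0]
    flat = images.reshape(n, -1).astype(np.float64)
    if flat.shape[1] < max_features:
        return flat

    # PCA via SVD: the rows of vt are the principal directions.
    centered = flat - flat.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)

    # Keep fewer than max_features components (at most min(N, P) exist).
    k = min(max_features - 1, vt.shape[0])
    return centered @ vt[:k].T


rng = np.random.default_rng(0)
images = rng.random((50, 3, 16, 16))  # 3 * 16 * 16 = 768 features per image
embeddings = flatten_and_reduce(images)
print(embeddings.shape)  # (50, 50)
```

The resulting 2-dimensional array could then be passed as the ``dataset`` argument. A learned embedding, such as the output of an autoencoder, would be another reasonable choice in place of PCA.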
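
One way the indices reported by ``evaluate()`` might be used is to filter the flagged samples out of a dataset. The sketch below copies the ``outliers`` and ``duplicates`` values from the example output above into plain lists, so it runs without the library. It assumes a dataset of 50 samples and a policy of keeping the first member of each duplicate group; both are illustrative assumptions, and ``indices_to_drop`` is a hypothetical helper, not a library function.

```python
import numpy as np


def indices_to_drop(outliers, duplicates):
    """Return sorted indices of outliers plus all but the first of each duplicate group.

    Hypothetical helper for illustration only, not part of dataeval.
    """
    drop = set(outliers)
    for group in duplicates:
        # Keep the lowest index in each group and drop the rest.
        drop.update(sorted(group)[1:])
    return sorted(drop)


# Values copied from the example ClustererOutput shown above.
outliers = [18, 21, 34, 35, 45]
duplicates = [[9, 24], [23, 48]]

# Assumed stand-in dataset: 50 samples with 4 features each.
data = np.arange(50 * 4).reshape(50, 4)

drop = indices_to_drop(outliers, duplicates)
keep_mask = np.ones(len(data), dtype=bool)
keep_mask[drop] = False
cleaned = data[keep_mask]

print(drop)           # [18, 21, 24, 34, 35, 45, 48]
print(cleaned.shape)  # (43, 4)
```

The ``potential_outliers`` and ``potential_duplicates`` fields are left untouched here, since those entries are less certain and likely merit manual review before removal.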