dataeval.detectors.linters.Clusterer
====================================

.. py:class:: dataeval.detectors.linters.Clusterer(dataset)

   Uses hierarchical clustering to flag dataset properties of interest, such as outliers and :term:`duplicates<Duplicates>`.

   :param dataset: A dataset in an ArrayLike format.
                   The data must be two-dimensional: N observations in a P-dimensional feature space.
   :type dataset: ArrayLike, shape - (N, P)

   .. warning:: The Clusterer class is computationally intensive and may fail if insufficient memory is available.

   .. note::

      The Clusterer works best when the length of the feature dimension, P, is less than 500.
      If flattening a CxHxW image yields more than 500 features, reducing the dimensionality before clustering is recommended.
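
   As a rough illustration of the note above, the following sketch flattens a batch of CxHxW images and projects them onto their leading principal components with plain NumPy, producing an (N, P) array with a smaller P. The array shapes, the random data, and the helper name ``reduce_dimensions`` are assumptions made for this example. They are not part of the DataEval API, and any other dimensionality reduction technique could be substituted.

   .. code-block:: python

      import numpy as np

      def reduce_dimensions(images, n_components):
          """Flatten (N, C, H, W) images to (N, P), then project onto the top principal components."""
          flat = np.asarray(images, dtype=np.float64).reshape(len(images), -1)
          centered = flat - flat.mean(axis=0)
          # SVD of the centered data: the rows of vt are the principal directions
          _, _, vt = np.linalg.svd(centered, full_matrices=False)
          k = min(n_components, vt.shape[0])
          return centered @ vt[:k].T

      rng = np.random.default_rng(0)
      images = rng.random((50, 3, 16, 16))  # hypothetical data: P = 3 * 16 * 16 = 768 > 500
      embeddings = reduce_dimensions(images, n_components=32)
      print(embeddings.shape)  # (50, 32)

   The resulting two-dimensional array matches the (N, P) shape described for ``dataset``.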


   .. py:method:: evaluate()

      Finds and flags the indices of outliers and :term:`duplicates<Duplicates>` in the data.

      :returns: The indices of the outliers and duplicates found in the data
      :rtype: ClustererOutput

      .. rubric:: Example

      >>> from dataeval.detectors.linters import Clusterer
      >>> cluster = Clusterer(clusterer_images)
      >>> cluster.evaluate()
      ClustererOutput(outliers=[18, 21, 34, 35, 45], potential_outliers=[13, 15, 42], duplicates=[[9, 24], [23, 48]], potential_duplicates=[[1, 11]])
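
      One possible way to act on these results is to drop the flagged outliers and keep a single representative from each duplicate group. The sketch below copies the index lists from the example output into plain Python variables rather than reading them from a ``ClustererOutput``. The helper ``indices_to_drop`` and the dataset size of 50 are hypothetical and not part of the library.

      .. code-block:: python

         def indices_to_drop(outliers, duplicates):
             """Return sorted indices of all outliers plus every duplicate except the first in each group."""
             drop = set(outliers)
             for group in duplicates:
                 # keep the lowest index of each group, drop the rest
                 drop.update(sorted(group)[1:])
             return sorted(drop)

         # Values copied from the example output above
         outliers = [18, 21, 34, 35, 45]
         duplicates = [[9, 24], [23, 48]]

         drop = indices_to_drop(outliers, duplicates)
         keep = [i for i in range(50) if i not in drop]  # assumes a dataset of 50 samples
         print(drop)  # [18, 21, 24, 34, 35, 45, 48]

      Whether the potential outliers and potential duplicates should also be removed depends on the dataset and is best decided after inspecting them.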


   .. py:method:: find_duplicates(last_merge_levels)

      Finds duplicate and near-duplicate data based on the last good merge levels recorded while building the clusters.

      :param last_merge_levels: A mapping from each cluster ID to its last good merge level
      :type last_merge_levels: Dict[int, int]

      :returns: The exact :term:`duplicates<Duplicates>` and near-duplicates as lists of related indices
      :rtype: Tuple[List[List[int]], List[List[int]]]


   .. py:method:: find_outliers(last_merge_levels)

      Retrieves outliers based on when each sample was added to its cluster
      and how far it was from the cluster at that point.

      :param last_merge_levels: A mapping from each cluster ID to its last good merge level
      :type last_merge_levels: Dict[int, int]

      :returns: The outliers and potential outliers as sorted lists of indices
      :rtype: Tuple[List[int], List[int]]