dataeval.core.cluster

dataeval.core.cluster(embeddings, n_expected_clusters=None, max_cluster_size=None)

Clusters the flattened data with hierarchical clustering and returns the cluster assignments together with supporting clustering information.

Parameters:
embeddings : ArrayND, shape - (N, ...)

A dataset given as a list or array-like object. The data must have 2 or more dimensions and is flattened to (N, P), where N is the number of observations and P is the number of dimensions of the feature space.
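As a rough illustration of the flattening described above, the sketch below reshapes a batch of image-like arrays to (N, P) with NumPy. The array shape and variable names are invented for the example; it shows what the flattening amounts to, not necessarily how `cluster` performs it internally.

```python
import numpy as np

# Hypothetical batch: 10 observations, each a 3 x 8 x 8 array (C x H x W).
images = np.zeros((10, 3, 8, 8), dtype=np.float32)

# Keep the first axis (N) and collapse every remaining axis into one (P).
flattened = images.reshape(images.shape[0], -1)

# N = 10 observations, P = 3 * 8 * 8 = 192 features.
n_observations, n_features = flattened.shape
```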

n_expected_clusters : int, optional

Hint for the expected number of clusters (e.g., number of classes in dataset). If provided, adaptively adjusts min_cluster_size to encourage finding approximately this many clusters. Useful when you have domain knowledge about the data structure.

max_cluster_size : int, optional

Upper limit on the size of any identified cluster. Useful when you have domain knowledge about the data structure.

Returns:

Mapping with keys:

- clusters : NDArray[np.int64] - Assigned clusters
- mst : NDArray[np.float32] - The minimum spanning tree of the data
- linkage_tree : NDArray[np.float32] - The linkage array of the data
- condensed_tree : CondensedTree(Mapping) - Derived from fast_hdbscan.cluster_trees.CondensedTree
- membership_strengths : NDArray[np.float32] - The strength of the data point belonging to the assigned cluster
- k_neighbors : NDArray[np.int64] - Indices of the nearest points in the population matrix
- k_distances : NDArray[np.float32] - Array representing the lengths to points

Return type:

ClusterResult

Notes

The cluster function works best when the length of the feature dimension, P, is less than 500. If flattening a CxHxW image results in a dimension larger than 500, then it is recommended to reduce the dimensions.
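One possible way to follow this recommendation is principal component analysis (PCA). The sketch below implements PCA with a plain NumPy SVD; the helper name `reduce_dimensions`, the image shape, and the choice of 32 components are all assumptions made for illustration, and any other dimensionality reduction method could be substituted.

```python
import numpy as np


def reduce_dimensions(data, n_components):
    """Project flattened data onto its top principal components (PCA via SVD).

    Hypothetical helper, not part of dataeval.
    """
    flat = np.asarray(data, dtype=np.float64).reshape(len(data), -1)
    centered = flat - flat.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
# Hypothetical images: 100 observations of 3 x 16 x 16, so P = 768 > 500.
images = rng.normal(size=(100, 3, 16, 16))
reduced = reduce_dimensions(images, n_components=32)  # shape (100, 32)
```

The reduced array is two-dimensional, so it could then be passed as `embeddings`. Note that `n_components` cannot exceed the smaller of N and P with this approach.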

Examples

Return dataset clusters

>>> import numpy as np
>>> import sklearn.datasets as dsets
>>> from dataeval.core import cluster
>>> clusterer_images = dsets.make_blobs(
...     n_samples=50, centers=np.array([(-1, -1), (1, 1)]), cluster_std=0.5, random_state=33
... )[0]

Two distinct clusters

>>> output = cluster(clusterer_images)
>>> output["clusters"]
array([0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 1, 1])
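To summarize an assignment like this one, the cluster sizes can be tallied with NumPy. This is a minimal sketch that works on a copy of the label array printed above rather than calling `cluster` again; the variable names are illustrative.

```python
import numpy as np

# Cluster labels copied from the example output above.
clusters = np.array([
    0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
    0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
    1, 0, 0, 1, 1, 1,
])

# Count how many observations were assigned to each cluster label.
labels, counts = np.unique(clusters, return_counts=True)
sizes = dict(zip(labels.tolist(), counts.tolist()))
```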