dataeval.core.cluster
- dataeval.core.cluster(embeddings, n_expected_clusters=None, max_cluster_size=None)
Uses hierarchical clustering on the flattened data and returns clustering information.
- Parameters:
- embeddings : ArrayND, shape - (N, ...)
A dataset given as a list or array-like object. The function expects the data to have 2 or more dimensions, which are flattened to (N, P), where N is the number of observations in a P-dimensional space.
- n_expected_clusters : int, optional
Hint for the expected number of clusters (e.g., number of classes in dataset). If provided, adaptively adjusts min_cluster_size to encourage finding approximately this many clusters. Useful when you have domain knowledge about the data structure.
- max_cluster_size : int, optional
Optional limit on the size of the identified clusters. Useful when you have domain knowledge about the data structure.
- Returns:
Mapping with keys:
  - clusters : NDArray[np.int64] - Assigned clusters
  - mst : NDArray[np.float32] - The minimum spanning tree of the data
  - linkage_tree : NDArray[np.float32] - The linkage array of the data
  - condensed_tree : CondensedTree(Mapping) - Derived from fast_hdbscan.cluster_trees.CondensedTree
  - membership_strengths : NDArray[np.float32] - The strength of the data point belonging to the assigned cluster
  - k_neighbors : NDArray[np.int64] - Indices of the nearest points in the population matrix
  - k_distances : NDArray[np.float32] - Array representing the lengths to points
- Return type:
ClusterResult
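The (N, ...) to (N, P) flattening described for embeddings can be illustrated with plain NumPy. This is a minimal sketch of the shape convention only, not the function's internal code; the image batch and its dimensions are hypothetical.

```python
import numpy as np

# Hypothetical batch of 10 RGB images, 8x8 pixels each: shape (N, C, H, W).
rng = np.random.default_rng(0)
images = rng.random((10, 3, 8, 8))

# Keep the observation axis and collapse every remaining axis into one,
# giving an (N, P) matrix with P = C * H * W.
flattened = images.reshape(images.shape[0], -1)

n_observations, n_features = flattened.shape  # N and P
```

Here P is 3 * 8 * 8 = 192, which is the feature dimension referred to in the Notes.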
Notes
The cluster function works best when the length of the feature dimension, P, is less than 500. If flattening a CxHxW image results in a dimension larger than 500, it is recommended to reduce the dimensionality before clustering.
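One way to bring P under 500 is a linear projection such as PCA. The sketch below is an assumption about how a caller might do this, not a method prescribed by the library; reduce_dimensions is a hypothetical helper, and the image batch and the choice of 32 components are illustrative.

```python
import numpy as np


def reduce_dimensions(flattened, n_components):
    """Project (N, P) data onto its first n_components principal axes.

    Hypothetical helper: plain PCA computed with an SVD of the centered data.
    """
    centered = flattened - flattened.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Hypothetical batch of 40 images with C=3, H=16, W=16, so P = 768 > 500.
rng = np.random.default_rng(0)
images = rng.random((40, 3, 16, 16))
flattened = images.reshape(len(images), -1)

# Reduced embeddings of shape (40, 32), small enough to pass on for clustering.
reduced = reduce_dimensions(flattened, n_components=32)
```

With this approach the number of components cannot exceed min(N, P), so very small datasets limit how many dimensions can be kept.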
Examples
Return dataset clusters
>>> import numpy as np
>>> import sklearn.datasets as dsets
>>> from dataeval.core import cluster
>>> clusterer_images = dsets.make_blobs(
...     n_samples=50, centers=np.array([(-1, -1), (1, 1)]), cluster_std=0.5, random_state=33
... )[0]

Two distinct clusters

>>> output = cluster(clusterer_images)
>>> output["clusters"]
array([0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 1, 1])
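A label array like the one under the clusters key can be summarized with NumPy. The following is a minimal sketch using a short, hypothetical label array in place of a real result, so it runs without calling cluster.

```python
import numpy as np

# Hypothetical label array standing in for output["clusters"].
labels = np.array([0, 1, 1, 0, 2, 2, 2, 1, 0, 2], dtype=np.int64)

# Number of observations assigned to each cluster label.
cluster_ids, counts = np.unique(labels, return_counts=True)
sizes = {int(c): int(n) for c, n in zip(cluster_ids, counts)}

# Row indices of the observations belonging to each cluster.
members = {int(c): np.flatnonzero(labels == c) for c in cluster_ids}
```

The indices in members can be used to select the matching rows of the original embeddings for inspection.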