dataeval.core.cluster
dataeval.core.cluster(embeddings, algorithm='hdbscan', n_clusters=None, max_cluster_size=None, n_init='auto')
Use hierarchical clustering on the flattened data and return clustering information.
- Parameters:
- embeddings : ArrayND, shape (N, ...)
A dataset that can be a list or an array-like object. The function expects the data to have 2 or more dimensions, which are flattened to (N, P), where N is the number of observations in a P-dimensional space.
- algorithm : "kmeans" | "hdbscan", default "hdbscan"
The clustering algorithm to use.
- n_clusters : int, optional
The expected number of clusters (e.g., the number of classes in the dataset). For KMeans, this is the exact number of clusters to find. For HDBSCAN, min_cluster_size is adjusted adaptively to encourage finding approximately this many clusters.
- max_cluster_size : int, optional
Limits the size of the identified clusters. Useful when you have domain knowledge about the data structure. (HDBSCAN only)
- n_init : int | "auto", default "auto"
Number of KMeans initializations (KMeans only).
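The flattening of (N, ...) data to (N, P) described for embeddings can be pictured with a minimal NumPy sketch. The `images` array and its shape are hypothetical, and the reshape shown here is only an illustration of the idea, not necessarily how cluster flattens its input internally.

```python
import numpy as np

# Hypothetical batch of 8 images shaped (N, C, H, W) = (8, 3, 4, 4)
images = np.zeros((8, 3, 4, 4), dtype=np.float32)

# Collapse every axis after the first so each row is one observation
flattened = images.reshape(images.shape[0], -1)

# N = 8 observations in a P = 3 * 4 * 4 = 48 dimensional space
n_observations, n_features = flattened.shape
```

Here each image contributes one row, so N stays 8 and P becomes the product of the remaining axes.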
- Returns:
Mapping with keys:
- clusters : NDArray[np.int64] - Assigned clusters
- mst : NDArray[np.float32] - The minimum spanning tree of the data
- linkage_tree : NDArray[np.float32] - The linkage array of the data
- membership_strengths : NDArray[np.float32] - The strength of the data point belonging to the assigned cluster
- k_neighbors : NDArray[np.int64] - Indices of the nearest points in the population matrix
- k_distances : NDArray[np.float32] - Array representing the lengths to points
- Return type:
Notes
The cluster function works best when the feature dimension, P, is less than 500. If flattening a CxHxW image yields more than 500 features, reducing the dimensionality before clustering is recommended.
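One possible way to bring P under 500 is a linear projection such as PCA. The sketch below implements PCA with a plain NumPy SVD; the helper `pca_reduce`, the random data, and the choice of 50 components are all hypothetical, and the documentation does not prescribe any particular reduction method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical flattened images: 100 observations, P = 3 * 16 * 16 = 768 features
flattened = rng.normal(size=(100, 768)).astype(np.float32)


def pca_reduce(data: np.ndarray, n_components: int) -> np.ndarray:
    """Project data onto its top principal components (PCA via SVD)."""
    # Center each feature so the SVD captures variance, not the mean
    centered = data - data.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Reduce 768 features to 50, well under the suggested limit of 500
reduced = pca_reduce(flattened, n_components=50)
```

The resulting (100, 50) array could then be passed as the embeddings argument. A learned embedding model or another reduction technique would serve the same purpose.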
Examples
Two distinct clusters
>>> import numpy as np
>>> import sklearn.datasets as dsets
>>> from dataeval.core import cluster
>>> clusterer_images = dsets.make_blobs(
...     n_samples=50, centers=np.array([(-1, -1), (1, 1)]), cluster_std=0.5, random_state=33
... )[0]

Clustering via HDBSCAN

>>> output = cluster(clusterer_images, algorithm="hdbscan")
>>> output["clusters"]
array([0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 1, 1])

Clustering via KMeans

>>> output = cluster(clusterer_images, algorithm="kmeans", n_clusters=2)
>>> output["clusters"]
array([0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 1, 1])
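As a follow-up, the assigned labels can be summarized with plain NumPy. The sketch below copies the label array from the example output rather than calling cluster, so it runs on its own. The noise check assumes that negative labels mark unclustered points, as in scikit-learn's HDBSCAN; that convention is not confirmed here for dataeval.

```python
import numpy as np

# Cluster assignments copied from the example output above
clusters = np.array([
    0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
    0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
    1, 0, 0, 1, 1, 1,
])

# Count how many observations fall into each cluster label
labels, counts = np.unique(clusters, return_counts=True)
sizes = {int(label): int(count) for label, count in zip(labels, counts)}

# Assumption: negative labels, if present, would mark unclustered points
# (the scikit-learn HDBSCAN convention); not confirmed for dataeval
n_noise = int(np.sum(clusters < 0))

print(sizes)    # {0: 25, 1: 25}
print(n_noise)  # 0
```

Both blobs are recovered with 25 points each, which matches the 50 samples split across two centers in the generated data.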