dataeval.core.cluster

dataeval.core.cluster(embeddings, algorithm='hdbscan', n_clusters=None, max_cluster_size=None, n_init='auto')

Cluster the flattened data with HDBSCAN or KMeans and return clustering information.

Parameters:
embeddings : ArrayND, shape - (N, ...)

A dataset given as a list or array-like object. The function expects data with 2 or more dimensions, which is flattened to (N, P), where N is the number of observations in a P-dimensional space.

algorithm : "kmeans" | "hdbscan", default "hdbscan"

The clustering algorithm to use.

n_clusters : int, optional

The expected number of clusters (e.g., the number of classes in the dataset). For KMeans, this is the exact number of clusters to find. For HDBSCAN, min_cluster_size is adaptively adjusted to encourage finding approximately this many clusters.

max_cluster_size : int, optional

Upper limit on the size of the identified clusters (HDBSCAN only). Useful when you have domain knowledge about the data structure.

n_init : int | "auto", default "auto"

Number of K-means initializations (KMeans only).
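
The (N, ...) to (N, P) flattening described for embeddings can be pictured with plain NumPy. This is only a sketch of the shape change, not necessarily how the function flattens its input internally; the image batch and its dimensions are hypothetical.

```python
import numpy as np

# Hypothetical batch of 8 RGB images, 4x4 pixels each: shape (N, C, H, W).
images = np.zeros((8, 3, 4, 4), dtype=np.float32)

# Keep the first axis (observations) and collapse the rest into P features.
flat = images.reshape(images.shape[0], -1)

# N = 8 observations, P = 3 * 4 * 4 = 48 features.
n_observations, n_features = flat.shape
```

Each row of flat is one observation, so P grows quickly with image size.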

Returns:

Mapping with keys:

- clusters : NDArray[np.int64] - Assigned clusters
- mst : NDArray[np.float32] - The minimum spanning tree of the data
- linkage_tree : NDArray[np.float32] - The linkage array of the data
- membership_strengths : NDArray[np.float32] - The strength of the data point belonging to the assigned cluster
- k_neighbors : NDArray[np.int64] - Indices of the nearest points in the population matrix
- k_distances : NDArray[np.float32] - Array representing the lengths to points

Return type:

ClusterResult

Notes

The cluster function works best when the feature dimension, P, is less than 500. If flattening a CxHxW image yields more than 500 features, reducing the dimensionality before clustering is recommended.
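
One possible way to bring P below 500 is a linear projection such as PCA. The sketch below uses a plain NumPy SVD; the batch shape and the choice of 32 components are illustrative assumptions, and PCA is not a method prescribed by this function. A learned embedding model would be another option.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: 64 images of shape 3x16x16, so P = 768 after flattening.
images = rng.normal(size=(64, 3, 16, 16)).astype(np.float32)
flat = images.reshape(images.shape[0], -1)

# PCA via SVD: center the features, then project onto the top components.
n_components = 32  # illustrative choice; must be <= min(N, P)
centered = flat - flat.mean(axis=0, keepdims=True)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:n_components].T

# `reduced` has shape (64, 32) and could be passed as `embeddings`.
```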

Examples

Two distinct clusters

>>> import numpy as np
>>> import sklearn.datasets as dsets
>>> from dataeval.core import cluster
>>> clusterer_images = dsets.make_blobs(
...     n_samples=50, centers=np.array([(-1, -1), (1, 1)]), cluster_std=0.5, random_state=33
... )[0]

Clustering via HDBSCAN

>>> output = cluster(clusterer_images, algorithm="hdbscan")
>>> output["clusters"]
array([0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 1, 1])

Clustering via KMeans

>>> output = cluster(clusterer_images, algorithm="kmeans", n_clusters=2)
>>> output["clusters"]
array([0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 1, 1])
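
Summarizing cluster sizes

A label array like the one above can be summarized with NumPy alone. The sketch below copies the printed labels rather than calling cluster, and it assumes that negative labels, if present, mark unclustered points, as is common for HDBSCAN implementations; that convention is not stated on this page.

```python
import numpy as np

# Label array copied from the example output above.
labels = np.array([0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0,
                   0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0,
                   1, 0, 0, 1, 1, 1])

# Assumption: negative labels, if any, mark unclustered (noise) points.
assigned = labels[labels >= 0]
ids, counts = np.unique(assigned, return_counts=True)
sizes = {int(i): int(c) for i, c in zip(ids, counts)}
n_noise = int((labels < 0).sum())

print(sizes, n_noise)  # {0: 25, 1: 25} 0
```

The even 25/25 split matches the two equally sized blobs generated by make_blobs with n_samples=50.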