{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Clustering Tutorial\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## _Problem Statement_\n", "\n", "Data does not typically come labeled and labeling/verifying labels is a time and resource intensive process.\n", "Exploratory data analysis (EDA) can often be enhanced by splitting data into similar groups.\n", "\n", "Clustering is a method which groups data in the format of (samples, features). This can be used with images or image embeddings as long as the arrays are flattened to only contain 2 dimensions.\n", "\n", "The `Clusterer` class utilizes a clustering algorithm based on the HDBSCAN algorithm and outputs outliers and duplicates.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _When to use_\n", "\n", "The Clusterer can be used during the EDA process to perform the following:\n", "\n", "- group a dataset into clusters\n", "- verify labeling as a quality control\n", "- identify outliers in your dataset\n", "- identify duplicates in your dataset\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _What you will need_\n", "\n", "1. A 2 dimensional dataset (samples, features)\n", "2. A python environment with the following packages installed:\n", " - `matplotlib`\n", "\n", "This could be a set of flattened images or image embeddings. We recommend using image embeddings (with the feature dimension being <=1000).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## _Getting Started_\n", "\n", "Let's import the required libraries needed to set up a minimal working example.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "try:\n", " import google.colab # noqa: F401\n", "\n", " # specify the version of DataEval (==X.XX.X) for versions other than the latest\n", " %pip install -q dataeval\n", "except Exception:\n", " pass" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import sklearn.datasets as dsets\n", "\n", "from dataeval.detectors.linters import Clusterer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading in data\n", "\n", "For the purposes of this demonstration, we are just going to create a generic set of blobs for clustering.\n", "\n", "This is to help show all of the functionalities of the clusterer in one tutorial.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Creating 5 clusters\n", "test_data, labels = dsets.make_blobs(\n", " n_samples=100,\n", " centers=[(-1.5, 1.8), (-1, 3), (0.8, 2.1), (2.8, 1.5), (2.5, 3.5)],\n", " cluster_std=0.3,\n", " random_state=33,\n", ") # type: ignore" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because the clusterer can also detect duplicate data, we are going to modify the dataset to contain a few duplicate datapoints.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "test_data[79] = test_data[24]\n", "test_data[63] = test_data[58] + 1e-5\n", "labels[79] = labels[24]\n", "labels[63] = labels[58]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Visualizing the clusters\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Mapping from labels to colors\n", "label_to_color = np.array([\"b\", \"r\", \"g\", \"y\", \"m\"])\n", "\n", "# Translate labels to colors using vectorized operation\n", "color_array = label_to_color[labels]\n", "\n", "# Additional parameters for plotting\n", "plot_kwds = {\"alpha\": 0.5, \"s\": 50, \"linewidths\": 0}\n", "\n", "# Create scatter plot\n", "plt.scatter(test_data.T[0], test_data.T[1], c=color_array, **plot_kwds)\n", "\n", "# Annotate each point in the scatter plot\n", "for i, (x, y) in enumerate(test_data):\n", " plt.annotate(str(i), (x, y), textcoords=\"offset points\", xytext=(0, 1), ha=\"center\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Verify the number of datapoints and that the shape is 2 dimensional\n", "print(\"Number of samples: \", len(test_data))\n", "print(\"Array shape:\", test_data.ndim)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running the Clusterer\n", "\n", "We are now ready to run the data through the clusterer and inspect the results.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize the clusterer\n", "clusterer = Clusterer(test_data)\n", "\n", "# Evaluate the data\n", "results = clusterer.evaluate()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results\n", "\n", "We can list out each category followed by the number of items in the category and then display those items on the line below.\n", "\n", "For the outlier and potential outlier results, the clusterer provides a list of all points that it found to be an outlier.\n", "\n", "For the duplicates and near duplicate results, the clusterer provides a list of sets of points which it identified as duplicates.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Show results\n", "for category, finding in results.dict().items():\n", " print(f\"\\t{category} - {len(finding)}\")\n", " print(finding)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that there was 6 outliers and 5 potential outliers.\n", "There was also 2 sets of duplicates and 10 sets of near duplicates.\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 2 }