{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataset Sufficiency Analysis for Classification Tutorial\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## _Problem Statement_\n", "\n", "For machine learning tasks, we often want to evaluate the performance of a model on a small, preliminary dataset. In situations where data collection is expensive, we would like to extrapolate hypothetical performance out to a larger dataset.\n", "\n", "DataEval has introduced a method for projecting performance via _sufficiency curves_.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _When to use_\n", "\n", "Use the `Sufficiency` class when you would like to extrapolate hypothetical performance, for example when you have a small dataset and want to know whether it is worthwhile to collect more data.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _What you will need_\n", "\n", "1. A particular model architecture.\n", "2. Metric(s) that you would like to evaluate.\n", "3. A dataset of interest.\n", "4. 
A Python environment with the following packages installed:\n", "   - `dataeval` or `dataeval[all]`\n", "   - `tabulate`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _Setting up_\n", "\n", "Let's import the libraries needed to set up a minimal working example.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "try:\n", "    import google.colab  # noqa: F401\n", "\n", "    # specify the version of DataEval (==X.XX.X) for versions other than the latest\n", "    %pip install -q dataeval\n", "    !export LC_ALL=\"en_US.UTF-8\"\n", "    !export LD_LIBRARY_PATH=\"/usr/lib64-nvidia\"\n", "    !export LIBRARY_PATH=\"/usr/local/cuda/lib64/stubs\"\n", "    !ldconfig /usr/lib64-nvidia\n", "except Exception:\n", "    pass\n", "\n", "%pip install -q tabulate" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "import random\n", "from typing import Dict, Sequence, cast\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import torch\n", "import torch.nn as nn\n", "import torch.nn.functional as F\n", "import torch.optim as optim\n", "import torchmetrics\n", "from tabulate import tabulate\n", "from torch.utils.data import DataLoader, Dataset, Subset\n", "\n", "from dataeval.utils.dataset.datasets import MNIST\n", "from dataeval.workflows import Sufficiency\n", "\n", "np.random.seed(0)\n", "np.set_printoptions(formatter={\"float\": lambda x: f\"{x:0.4f}\"})\n", "torch.manual_seed(0)\n", "torch.set_float32_matmul_precision(\"high\")\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "torch._dynamo.config.suppress_errors = True\n", "\n", "random.seed(0)\n", "torch.use_deterministic_algorithms(True)\n", "os.environ[\"CUBLAS_WORKSPACE_CONFIG\"] = \":4096:8\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Load data and define functions\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": 
[ "Load the MNIST data and create the training and test datasets.\n", "For the purposes of this example, we will use subsets of the training (2000) and test (500) data.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Download the MNIST training and test subsets\n", "train_ds = MNIST(\n", "    root=\"./data\", train=True, download=True, size=2000, unit_interval=True, dtype=np.float32, channels=\"channels_first\"\n", ")\n", "test_ds = MNIST(\n", "    root=\"./data\", train=False, download=True, size=500, unit_interval=True, dtype=np.float32, channels=\"channels_first\"\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Preview the first training image of each digit class\n", "fig = plt.figure(figsize=(8, 3))\n", "\n", "for lbl in range(10):\n", "    i = (train_ds.targets == lbl).nonzero()[0][0]\n", "    img = train_ds.data[i, 0, :, :]\n", "    ax = fig.add_subplot(2, 5, lbl + 1)\n", "    ax.xaxis.set_visible(False)\n", "    ax.yaxis.set_visible(False)\n", "    ax.imshow(img, cmap=\"gray_r\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we define the network architecture we will be using.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Define our network architecture\n", "class Net(nn.Module):\n", "    def __init__(self):\n", "        super().__init__()\n", "        self.conv1 = nn.Conv2d(1, 6, 5)\n", "        self.conv2 = nn.Conv2d(6, 16, 5)\n", "        self.fc1 = nn.Linear(6400, 120)\n", "        self.fc2 = nn.Linear(120, 84)\n", "        self.fc3 = nn.Linear(84, 10)\n", "\n", "    def forward(self, x):\n", "        x = F.relu(self.conv1(x))\n", "        x = F.relu(self.conv2(x))\n", "        x = torch.flatten(x, 1)  # flatten all dimensions except batch\n", "        x = F.relu(self.fc1(x))\n", "        x = F.relu(self.fc2(x))\n", "        x = self.fc3(x)\n", "        return x\n", "\n", "\n", "# Compile the model\n", "model = torch.compile(Net().to(device))\n", "\n", "# Type cast the model back to Net as torch.compile returns an Unknown type\n", "# Nothing 
internally changes from the cast; we are simply signaling the type\n", "model = cast(Net, model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we define our custom training and evaluation functions. `Sufficiency` requires the evaluation function to return a dictionary of results.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def custom_train(model: nn.Module, dataset: Dataset, indices: Sequence[int]):\n", "    # Defined only for this testing scenario\n", "    criterion = torch.nn.CrossEntropyLoss().to(device)\n", "    optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)\n", "    epochs = 10\n", "\n", "    # Define the dataloader for training\n", "    dataloader = DataLoader(Subset(dataset, indices), batch_size=16)\n", "\n", "    for epoch in range(epochs):\n", "        for batch in dataloader:\n", "            # Load data/images to device\n", "            X = torch.Tensor(batch[0]).to(device)\n", "            # Load targets/labels to device\n", "            y = torch.Tensor(batch[1]).to(device)\n", "            # Zero out gradients\n", "            optimizer.zero_grad()\n", "            # Forward propagation\n", "            outputs = model(X)\n", "            # Compute loss\n", "            loss = criterion(outputs, y)\n", "            # Back prop\n", "            loss.backward()\n", "            # Update weights/parameters\n", "            optimizer.step()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def custom_eval(model: nn.Module, dataset: Dataset) -> Dict[str, float]:\n", "    metric = torchmetrics.Accuracy(task=\"multiclass\", num_classes=10).to(device)\n", "    result = 0\n", "\n", "    # Set model layers into evaluation mode\n", "    model.eval()\n", "    dataloader = DataLoader(dataset, batch_size=16)\n", "    # Tell PyTorch to not track gradients, greatly speeds up processing\n", "    with torch.no_grad():\n", "        for batch in dataloader:\n", "            # Load data/images to device\n", "            X = torch.Tensor(batch[0]).to(device)\n", "            # Load targets/labels to device\n", "            y = torch.Tensor(batch[1]).to(device)\n", "            preds 
= model(X)\n", "            metric.update(preds, y)\n", "        result = metric.compute().cpu()\n", "    return {\"Accuracy\": result}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Initialize sufficiency metric\n", "\n", "Attach the custom training and evaluation functions to the `Sufficiency` metric and define the number of models to train in parallel for stability (`runs`), as well as the number of steps along the learning curve to evaluate (`substeps`).\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Instantiate sufficiency metric\n", "suff = Sufficiency(\n", "    model=model,\n", "    train_ds=train_ds,\n", "    test_ds=test_ds,\n", "    train_fn=custom_train,\n", "    eval_fn=custom_eval,\n", "    runs=5,\n", "    substeps=10,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate Sufficiency\n", "\n", "Now we can evaluate the metric to train the models and produce the learning curve.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Train & test model\n", "output = suff.evaluate()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print out sufficiency output in a table format\n", "formatted = {\"Steps\": output.steps, **output.measures}\n", "print(tabulate(formatted, headers=list(formatted), tablefmt=\"pretty\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print out projected output values\n", "projection = output.project([1000, 2000, 4000])\n", "projected = {\"Steps\": projection.steps, **projection.measures}\n", "print(tabulate(projected, list(projected), tablefmt=\"pretty\"))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "### TEST ASSERTION CELL ###\n", "assert -0.015 < output.measures[\"Accuracy\"][-1] - projection.measures[\"Accuracy\"][-2] < 0.015" ] }, { "cell_type": "code", 
"execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot the output using the convenience function\n", "_ = output.plot()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Results\n", "\n", "Using this learning curve, we can project performance under much larger datasets (with the same model).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Predicting sample requirements\n", "\n", "We can also predict the amount of training samples required to achieve a desired accuracy.\n", "\n", "Let's say we wanted to see how many samples are needed to hit 90%, 95% and 99% accuracy given the learning curve.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize the array of desired accuracies\n", "desired_accuracies = np.array([0.90, 0.95, 0.99])\n", "\n", "# Evaluate the learning curve to infer the needed amount of training data\n", "samples_needed = output.inv_project({\"Accuracy\": desired_accuracies})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Print the amount of needed data needed to achieve the accuracies of interest\n", "for i, accuracy in enumerate(desired_accuracies):\n", " print(f\"To achieve {int(accuracy*100)}% accuracy, {int(samples_needed['Accuracy'][i])} samples are needed.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The projection shows that given the current model, hitting an accuracy of 99% is improbable.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "# Future BER tie in?\n", "\n", "images, labels = [], []\n", "for data in train_ds:\n", " images.append(np.array(data[0]))\n", " labels.append(data[1])\n", "\n", "images = np.array(images)\n", "labels = np.array(labels)\n", "\n", "from dataeval.metrics.estimators import ber\n", "\n", "ber_output = ber(images, labels)\n", "np.round(1 - 
ber_output.ber_lower, 3) * 100" ] } ], "metadata": { "kernelspec": { "display_name": ".venv-3.11", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 2 }