{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Dataset Linting Tutorial\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Problem Statement\n", "\n", "Exploratory data analysis (EDA) can be overwhelming because there are so many things to check:\n", "duplicates in your dataset, bad or corrupted images, blurred or overly bright or dark images, and the list goes on.\n", "\n", "DataEval provides linting classes, such as the `Outliers` class used in this tutorial, to assist with your EDA so that you can start training your models on high-quality data.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _When to use_\n", "\n", "The `Outliers` class should be used during the initial EDA process or when you want to verify that you have the right data in your dataset.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### _What you will need_\n", "\n", "1. A dataset to analyze\n", "2. A Python environment with the following packages installed:\n", "   - `dataeval[torch]` or `dataeval[all]`\n", "   - `torchvision`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## _Getting Started_\n", "\n", "Let's import the libraries needed to set up a minimal working example.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "try:\n", "    import google.colab  # noqa: F401\n", "\n", "    # specify the version of DataEval (==X.XX.X) for versions other than the latest\n", "    %pip install -q dataeval[torch]\n", "except Exception:\n", "    pass" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import torch\n", "import torchvision.datasets as datasets\n", "import torchvision.transforms.v2 as v2\n", "\n", "from dataeval.detectors.linters import Outliers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading in the data\n", "\n", "We are going to start by loading in torchvision's CIFAR-10 dataset.\n", "\n", "The 
CIFAR-10 dataset contains 60,000 images: 50,000 in the training set and 10,000 in the test set.\n", "For this demonstration, we will use only the test set.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load in the CIFAR-10 dataset from torchvision\n", "to_tensor = v2.Compose([v2.ToImage(), v2.ToDtype(torch.float32, scale=True)])\n", "testing_dataset = datasets.CIFAR10(\"./data\", train=False, download=True, transform=to_tensor)\n", "test_data = np.array(testing_dataset.data, dtype=float)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linting the Dataset\n", "\n", "Now we can begin finding the images that differ significantly from the rest of the data.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize the Outliers class\n", "outliers = Outliers()\n", "\n", "# Evaluate the data\n", "results = outliers.evaluate(test_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results contain a dictionary, `results.issues`, whose keys identify each image that has an issue in at least one of the properties listed below:\n", "\n", "- Brightness\n", "- Blurriness\n", "- Missing\n", "- Zero\n", "- Width\n", "- Height\n", "- Size\n", "- Aspect Ratio\n", "- Channels\n", "- Depth\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(f\"Total number of images with an issue: {len(results.issues)}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Show each image that has at least one issue\n", "for image, issue in results.issues.items():\n", "    print(f\"{image} - {issue}\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", 
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 2 }