{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dataset Linting Tutorial\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Problem Statement\n",
    "\n",
    "Exploratory data analysis (EDA) can be overwhelming because there are so many things to check:\n",
    "duplicates in your dataset, bad or corrupted images, blurred images, overly bright or dark images, and the list goes on.\n",
    "\n",
    "DataEval provides linting classes, such as the `Outliers` class used in this tutorial, to assist you with your EDA so you can start training your models on high-quality data.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### _When to use_\n",
    "\n",
    "The linting classes should be used during the initial EDA process, or whenever you need to verify that your dataset contains the data you expect.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### _What you will need_\n",
    "\n",
    "1. A dataset to analyze\n",
    "2. A Python environment with the following packages installed:\n"
    "   - `dataeval[torch]` or `dataeval[all]`\n",
    "   - `torchvision`\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## _Getting Started_\n",
    "\n",
    "Let's import the libraries needed to set up a minimal working example.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "remove_cell"
    ]
   },
   "outputs": [],
   "source": [
    "try:\n",
    "    import google.colab  # noqa: F401\n",
    "\n",
    "    # specify the version of DataEval (==X.XX.X) for versions other than the latest\n",
    "    %pip install -q dataeval[torch]\n",
    "except Exception:\n",
    "    pass"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import torch\n",
    "import torchvision.datasets as datasets\n",
    "import torchvision.transforms.v2 as v2\n",
    "\n",
    "from dataeval.detectors.linters import Outliers"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading in the data\n",
    "\n",
    "We are going to start by loading in torchvision's CIFAR-10 dataset.\n",
    "\n",
    "The CIFAR-10 dataset contains 60,000 images: 50,000 in the training set and 10,000 in the test set.\n",
    "For the purposes of this demonstration, we are just going to use the test set.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load the CIFAR-10 test set from torchvision\n",
    "to_tensor = v2.Compose([v2.ToImage(), v2.ToDtype(torch.float32, scale=True)])\n",
    "testing_dataset = datasets.CIFAR10(\"./data\", train=False, download=True, transform=to_tensor)\n",
    "\n",
    "# Pull out the raw image array as floats (the transform is not applied to `.data`)\n",
    "test_data = np.array(testing_dataset.data, dtype=float)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linting the Dataset\n",
    "\n",
    "Now we can begin finding the images that differ significantly from the rest of the data.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize the Outliers class\n",
    "outliers = Outliers()\n",
    "\n",
    "# Evaluate the data\n",
    "results = outliers.evaluate(test_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results contain an `issues` dictionary whose keys are the images that have an issue in one or more of the properties listed below:\n",
    "\n",
    "- Brightness\n",
    "- Blurriness\n",
    "- Missing\n",
    "- Zero\n",
    "- Width\n",
    "- Height\n",
    "- Size\n",
    "- Aspect Ratio\n",
    "- Channels\n",
    "- Depth\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Total number of images with an issue: {len(results.issues)}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Show each image that has at least one issue\n",
    "for image, issue in results.issues.items():\n",
    "    print(f\"{image} - {issue}\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}