{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Cleaning Guide\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize the main characteristics and identify incongruencies in the data.\n", "Before diving into machine learning or statistical modeling, it is crucial to understand the data you are working with.\n", "EDA helps in understanding the patterns, detecting anomalies, checking assumptions, and determining relationships in the data.\n", "\n", "One of the most important aspects of EDA is data cleaning.\n", "Part of DataEval is dedicated to identifying duplicates, outliers, and data points that have missing values or too many extreme values.\n", "These techniques help ensure that you only include high-quality data in your projects and avoid problems like leakage between training and testing sets.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Step-by-Step Guide\n", "\n", "This guide will walk through how to use DataEval to perform basic data cleaning.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Environment Requirements\n", "\n", "You will need a Python environment with the following packages installed:\n", "\n", "- `dataeval[torch]` or `dataeval[all]`\n", "- `torchvision`\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'll begin by installing the necessary libraries to walk through this guide.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [ "remove_cell" ] }, "outputs": [], "source": [ "try:\n", "    import google.colab  # noqa: F401\n", "\n", "    # specify the version of DataEval (==X.XX.X) for versions other than the latest\n", "    %pip install -q dataeval[torch]\n", "except Exception:\n", "    pass" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "# You will need matplotlib 
for visualizing the dataset and numpy for handling the data.\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import torchvision.transforms.v2 as v2\n", "from numpy.typing import NDArray\n", "\n", "# You are only using torchvision to load in the dataset.\n", "# If you already have the data stored on your computer in a NumPy-friendly format,\n", "# then feel free to load it directly into numpy arrays.\n", "from torchvision import datasets\n", "\n", "# Load the classes from DataEval that are helpful for EDA\n", "from dataeval.detectors.linters import Duplicates, Outliers\n", "from dataeval.metrics.stats import channelstats, datasetstats, hashstats, labelstats\n", "\n", "# Seed the random number generator so the results are reproducible\n", "rng = np.random.default_rng(213)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 1: Understand the Data\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load the Data\n", "\n", "You are going to work with the PASCAL VOC 2011 dataset.\n", "This dataset is a small curated dataset that was used for a computer vision competition.\n", "The images were used for classification, object detection, and segmentation.\n", "This dataset was chosen because it has multiple classes and images with a variety of sizes and objects.\n", "\n", "If this data is already on your computer, you can change the file location from `\"./data\"` to wherever the data is stored.\n", "Just remember to also change the `download` value from `True` to `False`.\n", "\n", "To keep this tutorial fast on most computers, you are going to analyze only the training set, which contains a little under 6000 images.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Download the data and then load it as a torch Tensor\n", "to_tensor = v2.ToImage()\n", "ds = datasets.VOCDetection(\n", "    \"./data\",\n", "    year=\"2011\",\n", "    image_set=\"train\",\n", "    download=True,\n", "    
transform=to_tensor,\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Verify the size of the loaded dataset\n", "len(ds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before moving on, verify that the above code cell output 5717 for the size of the [dataset](http://host.robots.ox.ac.uk/pascal/VOC/voc2011/dbstats.html).\n", "\n", "This ensures that everything is working as needed for the tutorial." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Inspect the Data\n", "\n", "As this data was used for a computer vision competition, it will most likely have very few issues, but it is always worth checking.\n", "Many of the large web-scraped datasets available for use do contain image issues.\n", "Verifying at the start that you have a high-quality dataset is always easier than finding out later that you trained a model on a dataset with erroneous images or a set of splits with leakage." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In addition to the images, you'll also need to load the labels.\n", "However, there is no standard for metadata associated with images.\n", "Thus, you will load the metadata associated with the first image to explore its structure and determine where each piece of information is stored.\n", "This way you can extract just the labels for each image.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Check the label structure\n", "ds[0][1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Having viewed the metadata for the first image, you now know that the metadata comes through as a nested dictionary.\n", "What you need is the _\"object\"_ key, nested under the _\"annotation\"_ key, which contains a list of objects in the image.\n", "Inside the list are additional dictionaries, one for each object found in the image.\n", "Inside these dictionaries, the label can be 
found via the _\"name\"_ key.\n", "\n", "Let's run through all of the metadata and create a list of lists that contains only the name of each object in each image.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Extract the class name of every annotated object in each image\n", "labels = [[o[\"name\"] for o in d[1][\"annotation\"][\"object\"]] for d in ds]\n", "\n", "labels[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double-check that the values output by the code above match the object names from the original metadata you viewed above.\n", "\n", "Now that the labels for each image are contained in a friendly list, you can run some label statistics to explore the different objects found in the images.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Calculate basic label statistics from the dataset\n", "lstats = labelstats(labels)\n", "\n", "# Display basic counts\n", "print(f\"Class Count: {lstats.class_count}\")\n", "print(f\"Label Count: {lstats.label_count}\")\n", "print(\"--------------------------------------\")\n", "\n", "# Display counts per class\n", "print(\" Object: Total Count - Image Count\")\n", "for cls in lstats.label_counts_per_class:\n", "    print(f\"{cls:>11}: {lstats.label_counts_per_class[cls]:>4}\\\n", " - {lstats.image_counts_per_label[cls]:>4}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above table shows that this dataset has a total of 20 classes. \n", "\n", "Of the classes, `person` is the class with the highest total object count, followed by `chair` and `car`, while `person`, `chair` and `dog` are the classes with the highest number of images. 
\n", "\n", "`cow`, `sheep`, and `bus` are the classes with the fewest objects, while `bus`, `train` and `cow` are the classes with the fewest images.\n", "\n", "This table helps point out the wide variation in\n", "\n", "- the number of classes per image,\n", "- the number of objects per image,\n", "- and the number of objects of each class per image.\n", "\n", "This highlights an important concept: class balance. \n", "A dataset that is imbalanced can result in a model that chooses the more prominent class more often just because it's more prominent. \n", "To explore this concept further, see the bias tutorial in the [What's Next](#whats-next) section at the end of this tutorial.\n", "\n", "Now that the label set has been explored, it's important to visually inspect random images across the different classes to get an idea of the quality of the data.\n", "When inspecting the random images, you want to get an idea of the variety of backgrounds, the range of colors, the locations of objects in images,\n", "and how often an image is seen with a single object versus multiple objects.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Helper function to get image and permute to channels last for matplotlib\n", "def get_image(index: int) -> NDArray:\n", "    return np.moveaxis(ds[index][0].numpy(), 0, -1)\n", "\n", "\n", "# Plot random images from each category\n", "_, axs = plt.subplots(5, 4, figsize=(8, 10))\n", "\n", "for ax, (category, indices) in zip(axs.flat, lstats.image_indices_per_label.items()):\n", "    # Randomly select an index from the list of indices\n", "    selected_index = rng.choice(indices)\n", "\n", "    ax.imshow(get_image(selected_index))\n", "    ax.set_title(category)\n", "    ax.axis(\"off\")\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting the images displays the variety in the images, including image sizes, image brightness, 
object sizes, backgrounds, number of objects in the image, and even the lack of color in a few images which are black and white.\n", "\n", "This is where DataEval comes in: it is designed to help you make sense of the many different aspects that affect building representative datasets and robust models.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summarize the Data\n", "\n", "To begin, you are going to use multiple statistical analysis functions.\n", "\n", "The `datasetstats` function produces statistical information covering various categories of image metrics. The results obtained are equivalent to the individual outputs of the `dimensionstats`, `pixelstats`, and `visualstats` functions run on the dataset." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This cell takes about 1-3 minutes to run depending on your hardware\n", "\n", "# Calculate the datasetstats for the images\n", "# Note: the stat functions expect the images as an iterable in the (C,H,W) format\n", "\n", "stats = datasetstats(d[0] for d in ds)\n", "\n", "# Aggregate all of the results in a single dictionary\n", "dataset_stats = stats.dict()\n", "\n", "# View the list of metrics used to analyze the images\n", "list(dataset_stats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the stats are computed, you should visualize the results.\n", "To help you adequately see all of the trends in the plots, plot them once normally and once on a log scale. 
\n", "Sometimes there are only a few extreme values in a category and they can be easily overlooked if a log scale is not used.\n" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [], "source": [ "# Helper function to plot histograms by metric with options for per_channel\n", "def plot_metric_histogram(stats, log: bool, per_channel: bool):\n", "    ch_mask = None\n", "    # Map each metric to whether it has per-channel values\n", "    metrics_perchannel = {\n", "        \"size\": False,\n", "        \"aspect_ratio\": False,\n", "        \"channels\": False,\n", "        \"mean\": True,\n", "        \"std\": True,\n", "        \"var\": True,\n", "        \"skew\": True,\n", "        \"kurtosis\": False,\n", "        \"zeros\": True,\n", "        \"brightness\": True,\n", "        \"contrast\": True,\n", "        \"darkness\": True,\n", "        \"sharpness\": False,\n", "        \"entropy\": True,\n", "        \"missing\": False,\n", "    }\n", "    metrics = {k: v for k, v in metrics_perchannel.items() if not per_channel or v}\n", "    label_kwargs = {\"label\": [\"Channel 0\", \"Channel 1\", \"Channel 2\"]}\n", "\n", "    rows = int(len(metrics) / 3)\n", "    _, axs = plt.subplots(rows, 3, figsize=(10, rows * 2.5))\n", "    for ax, (metric, channelwise) in zip(\n", "        axs.flat,\n", "        metrics.items(),\n", "    ):\n", "        # Plot the histogram for the chosen metric\n", "        if per_channel:\n", "            if not channelwise:\n", "                continue\n", "            ch_mask = ch_mask or stats.pixelstats.get_channel_mask(None, 3)\n", "            ax.hist(\n", "                stats.dict()[metric][ch_mask].reshape(-1, 3),\n", "                bins=20,\n", "                density=True,\n", "                color=[\"red\", \"green\", \"blue\"],\n", "                log=log,\n", "                **label_kwargs,\n", "            )\n", "            # Only plot the labels once for channels\n", "            if label_kwargs:\n", "                ax.legend()\n", "                label_kwargs = {}\n", "        else:\n", "            ax.hist(dataset_stats[metric], bins=20, log=log)\n", "\n", "        ax.set_title(metric)\n", "\n", "    plt.tight_layout()\n", "    plt.show()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_metric_histogram(stats, log=False, per_channel=False)" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_metric_histogram(stats, log=True, per_channel=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting the distribution of values for each metric allows you to quickly inspect the metrics for unusual distributions.\n", "Without knowing anything about the images, assume that each metric should follow one of two types of distributions: normal or uniform.\n", "\n", "With a [uniform distribution](https://en.wikipedia.org/wiki/Discrete_uniform_distribution), you want to notice if any of the plots have areas that are a lot shorter or a lot taller than the rest of the values.\n", "\n", "With a [normal distribution](https://en.wikipedia.org/wiki/Normal_distribution), you are looking at the edges of the bell curve to see if the values near the edges of the plot rise or if there are gaps between the edge values and the next value in.\n", "\n", "Looking at the plots, there are a few key things to point out:\n", "\n", "1. The channels metric has only one value, 3. This is interesting since some of the images from the random plot above are greyscale, and greyscale images usually have only 1 channel.\n", "2. The entropy, zeros, kurtosis, and contrast metrics are single-tailed with a long tail, which indicates that the images whose values lie at the end of the tail are potentially problematic.\n", "3. Size, aspect ratio, variance, skew, brightness and darkness have skewed or off-center distributions, which is another sign of problematic images.\n", "4. Mean, standard deviation and sharpness appear to have a normal distribution and none have an extended tail, which is a good sign.\n", "\n", "While this does not tell you which images are the problematic ones, it provides some insight into the metrics the `Outliers` class should flag. 
\n", "From these plots, you should expect the `Outliers` class to flag images with issues in the following metrics:\n", "\n", "- size\n", "- aspect ratio\n", "- variance\n", "- skew\n", "- kurtosis\n", "- zeros\n", "- brightness\n", "- contrast\n", "- darkness\n", "- entropy\n", "\n", "Before moving on to the `Outliers` class, you should analyze the channel stats using the `channelstats` function to see if there are any additional metrics that might be problematic. The `channelstats` function performs a subset of the `datasetstats` function, covering per-channel `pixelstats` and `visualstats` metrics.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This cell takes about 1-3 minutes to run depending on your hardware\n", "\n", "# Calculate the per-channel pixelstats and visualstats for the images\n", "# Note: the stat functions expect the images as an iterable in the (C,H,W) format\n", "ch_stats = channelstats(d[0] for d in ds)\n", "\n", "# Aggregate all of the results in a single dictionary\n", "ds_channel_stats = ch_stats.dict()\n", "\n", "# View the list of metrics to analyze the image channels\n", "list(ds_channel_stats)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "plot_metric_histogram(ch_stats, log=False, per_channel=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the understanding from above about uniform and normal distributions, you want to analyze the channel-based metrics with the same principle.\n", "\n", "Here the plots show that the overall shape for each channel metric matches the shape of its image metric counterpart.\n", "When analyzing the channel metrics, you should be interested not in the overall shape of these plots but in how the shape compares across the individual channels.\n", "You want to see if the same shape holds across each channel or if there are large differences between the channels.\n", 
"This is important because discrepancies across channels can help detect image processing errors and channel bias.\n", "\n", "However, there is very little difference across the channels for each metric. \n", "There is a slight shift in the blue channel for the mean, skew, brightness, and darkness metrics, but it is not enough of a difference to warrant suspicion.\n", "Thus, no additional metric is added to the list of metrics expected to be flagged by the `Outliers` class.\n", "\n", "Now, you can move on to identifying which images have a statistical difference from the rest of the images.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 2: Identify any Outlying Data Points\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extreme/Missing Values\n", "\n", "Here you will detect and identify the images associated with the extreme values from the plotted metrics above.\n", "To detect these extreme values, you will use the `Outliers` class.\n", "The `Outliers` class has multiple methods to determine the extreme values, which are discussed in the [Data Cleaning explanation](../concepts/DataCleaning.md).\n", "For this guide, you will use the \"zscore\" method, because the Z score measures how many standard deviations a value lies from the mean, which is a natural way to define outliers in a normal distribution.\n", "\n", "The output of the `Outliers` class contains a dictionary where the image number is the key and the value is a dictionary containing the flagged metrics and their values.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize the Outliers class\n", "outliers = Outliers(outlier_method=\"zscore\")\n", "\n", "# Find the extreme images\n", "outlier_imgs = outliers.from_stats(stats)\n", "\n", "# View the number of extreme images\n", "print(f\"Number of images with extreme values: {len(outlier_imgs)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This class can flag a lot of images, depending on how varied the dataset is and which method you use to 
define extreme values. \n", "Using the \"zscore\" method, it flagged 498 of the 5717 images in the dataset across 15 metrics.\n", "However, switching the method can give different results.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# List the metrics with an extreme value\n", "metrics = {}\n", "for img, group in outlier_imgs.issues.items():\n", "    for extreme in group:\n", "        if extreme in metrics:\n", "            metrics[extreme].append(img)\n", "        else:\n", "            metrics[extreme] = [img]\n", "print(f\"Number of metrics with extremes: {len(metrics)}\")\n", "\n", "# Show the total number of extreme values for each metric\n", "for group, imgs in sorted(metrics.items(), key=lambda item: len(item[1]), reverse=True):\n", "    print(f\" {group} - {len(imgs)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Digging into the flagged images and organizing them by category shows that the metric with the most extreme values is \"size\" while \"sharpness\" has the fewest extreme values.\n", "It is also worth noting that the result from `Outliers` found issues with more metrics than anticipated.\n", "Going back to the expected list, there was\n", "\n", "- size\n", "- aspect ratio\n", "- variance\n", "- skew\n", "- kurtosis\n", "- zeros\n", "- brightness\n", "- contrast\n", "- darkness\n", "- entropy\n", "\n", "However, the result from `Outliers` added `mean`, `std`, and `sharpness` (because `size` was an expected issue, `width` and `height` are counted as expected).\n", "The result from `Outliers` is not perfect, but it is designed to flag any image that might be problematic.\n", "It is then up to you, the user, to sift through the information provided by the result from `Outliers`.\n", "\n", "Part of exploring the results includes displaying how the flagged images are spread across the 20 classes.\n" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "# Show each 
metric by class\n", "# Determine which classes are present in each image\n", "class_wise = {obj: {} for obj in sorted(lstats.image_indices_per_label)}\n", "for group, imgs in metrics.items():\n", "    for img in imgs:\n", "        unique_items = set(labels[img])\n", "        for cat in unique_items:\n", "            if group not in class_wise[cat]:\n", "                class_wise[cat][group] = 0\n", "            class_wise[cat][group] += 1\n", "\n", "# Create the table for displaying\n", "table_header = [f\"{'Class':>11}\"]\n", "for group in sorted(metrics.keys()):\n", "    table_header.append(f\"{group:^10}\")\n", "table_header.append(\" Total\")\n", "table = [table_header]\n", "for class_cat, results in class_wise.items():\n", "    table_rows = [f\"{class_cat:>11}\"]\n", "    total = 0\n", "    for group in sorted(metrics.keys()):\n", "        # \"aspect_ratio\" is 12 characters long, so its column is wider\n", "        width = 12 if group == \"aspect_ratio\" else 10\n", "        count = results.get(group, 0)\n", "        table_rows.append(f\"{count:^{width}}\")\n", "        total += count\n", "    table_rows.append(f\" {total:^5}\")\n", "    table.append(table_rows)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(linting-issues-by-metric-table)=\n", "\n", "#### Linting Issues by Metric Table\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Display the table\n", "for row in table:\n", "    print(\" | \".join(row))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some of the things to note from splitting up the issues by class and metric:\n", "\n", "- An image with an unusual aspect ratio is most likely to contain a boat or aeroplane\n", "- An image with an issue in brightness is most likely to contain a person\n", "- An image with an issue in darkness is most likely to contain an aeroplane\n", "- Images with high contrast are likely to fall within 1 of 4 classes: bottle, 
cat, chair, person\n", "- Images with low entropy (think of an image with nearly constant pixels) are likely to fall within 1 of 4 classes: aeroplane, bird, bottle, person\n", "- Images with unusual skew and kurtosis follow a similar trend to entropy\n", "- Every class has images with size issues\n", "\n", "There appear to be other trends as well. \n", "Something to remember is that each class has a different number of images.\n", "For example, 36 low entropy images out of the 2000 for person (about 2%) might be outliers while 28 low entropy images out of 300 for aeroplane (about 9%) might not be;\n", "low entropy might be an inherent characteristic of class aeroplane.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to understand the above table, you will plot sample images from a few of the metrics, specifically:\n", "\n", "- entropy\n", "- size\n", "- zeros\n", "- sharpness\n", "\n", "Entropy, variance, standard deviation, kurtosis, and skew all measure (in different ways) how much change there is across the pixels in the image, and entropy will be the easiest to understand.\n", "\n", "Size, width, height and aspect ratio are all interrelated, and of those, size has the most flagged images.\n", "\n", "Zeros is a category unto itself but it is closely related to brightness, contrast, darkness, and mean. Zeros measures the percentage of pixels with a zero value compared to the average image.\n", "\n", "Sharpness is also in its own category and it measures the perceived edges in an image.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(questions)=\n", "\n", "#### Questions\n", "\n", "When looking at these images, you want to think about the following questions:\n", "\n", "- Does this image represent something that would be expected in operation?\n", "- Is there commonality to the objects in the images? Such as all the objects being found on the left side of the images.\n", "- Is there commonality to the backgrounds of the images? 
Such as similar colors, darkness/brightness, places, things (like water or snow).\n", "- Is there commonality to the class of objects in the images? Such as a specific pose for person or specific pot color for pottedplant.\n" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [], "source": [ "# Helper method to plot images of interest\n", "def plot_sample_images(metric: str, layout: tuple[int, int]) -> None:\n", "    _, axs = plt.subplots(*layout, figsize=(10, layout[0] * 4))\n", "    # Sample without replacement, capped at the number of flagged images\n", "    selected_index = rng.choice(metrics[metric], min(np.prod(layout), len(metrics[metric])), replace=False)\n", "\n", "    # Hide every axis first so that unused panels stay blank\n", "    for ax in axs.flat:\n", "        ax.axis(\"off\")\n", "\n", "    # zip stops at the shorter input, so fewer samples than panels is safe\n", "    for ax, idx in zip(axs.flat, selected_index):\n", "        ax.imshow(get_image(idx))\n", "        ax.set_title(f\"{metric}={np.round(dataset_stats[metric][idx], 2)}\")\n", "\n", "    print(f\"metric={metric}\")\n", "    print(f\"quantiles={np.round(np.quantile(dataset_stats[metric], [0, 0.25, 0.5, 0.75, 1]), 2)}\")\n", "    plt.tight_layout()\n", "    plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Entropy\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot images flagged for \"entropy\"\n", "plot_sample_images(\"entropy\", (2, 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the flagged images for entropy, what do you see?\n", "\n", "Many of the flagged images here have an almost constant background.\n", "Thinking back to the questions above, how many of these backgrounds will you see in operation? Are you likely to find water in your images or an object in the sky? \n", "It is also worth pointing out the number of images that have a relatively dark background. How likely are you to encounter nighttime or dark images in your operation? \n", "If water or objects in the sky or dark backgrounds are expected, then you may just need to collect more images with these kinds of backgrounds. 
If not, then they are outliers that can be discarded. \n", "To learn more about data collection, go [here](https://viso.ai/computer-vision/data-collection/).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Size" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot images flagged for \"size\"\n", "plot_sample_images(\"size\", (2, 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If this were a real workflow where you were going to train a model on these images,\n", "you would need to decide whether your model workflow will preprocess images to be the exact same size or whether you want to include only images of a specific size. \n", "If preprocessing the images, you will want to make sure that your method does not cause distortions to the image (such as resizing)\n", "and that you still have the desired information in the image (such as when cropping).\n", "If you are expecting an image of a specific size, then you can simply discard the incorrectly sized images.\n", "However, you would then need to make sure that this does not introduce any bias into your dataset.\n", "\n", "Now that you've thought about your workflow, let's look at the flagged images for size.\n", "\n", "The first thing of note is that there are a lot of images here with animals.\n", "Here you want to think about how images of animals are taken: is this discrepancy in size a natural artifact of animal pictures or just a byproduct of the data collection methods?\n", "Recalling from the [table](#linting-issues-by-metric-table) above, issues with size are spread fairly evenly across all classes, so dropping all of them might be okay, but you will definitely want to check for bias after dropping them.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Zeros\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot images flagged for 
\"zeros\"\n", "plot_sample_images(\"zeros\", (2, 4))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the flagged images for zeros, what do you see?\n", "\n", "Similarly to entropy, which was addressed above, some of these images have a dark background. Is this expected for this dataset? \n", "Also of note are the greyscale images. Here, you want to think about how often the model will come across greyscale images in operation. Can a malfunction in the pipeline (either hardware or software) produce greyscale images, and if so, how likely is that kind of malfunction?\n", "\n", "For both of those cases, dark backgrounds and greyscale images, do they occur proportionately throughout all of the classes or do they exist in only 1 or 2 classes?\n", "If they occur in only 1 or 2 classes, then you might just want to throw them out so that your model doesn't just learn to associate dark backgrounds or greyscale with those classes.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Blurriness\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot images flagged for \"sharpness\"\n", "plot_sample_images(\"sharpness\", (1, 2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the flagged images for sharpness, what do you see?\n", "\n", "Notice the grass and the leaves in the background of the images. Are those common backgrounds or are these the only close-up images with leaves and grass in the background? Is this operationally relevant? If not, then these two images should just be removed. 
If yes, then additional images are needed with these two backgrounds.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Linting Summary\n", "\n", "The `Outliers` class cannot tell you what is operationally relevant, but it does tell you which images stand out from the rest in one way or another.\n", "\n", "After viewing these images that stand out, there are two key takeaways to keep in mind:\n", "\n", "1. Many of the flagged images will be flagged by more than one metric.\n", "2. Plotting the flagged metrics allows you to get an idea of what the `Outliers` class calls an outlier.\n", "   Not all of these images are outliers; some of them could represent areas of the dataset that are underrepresented.\n", "\n", "DataEval is used to identify images which _may be_ problematic in your dataset, but it cannot specify whether an image is actually an outlier or not. \n", "Applying the four [questions](#questions) above to each image that stands out will help you determine whether or not the image should be removed from the dataset.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Step 3: Identify duplicate data\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Duplicates\n", "\n", "Now that you know how to identify poor quality images in your dataset, another important aspect of data cleaning is detecting and removing any duplicates.\n", "\n", "The `Duplicates` class identifies both exact duplicates and potential (near) duplicates.\n", "Potential duplicates can occur in a variety of ways:\n", "\n", "- Intentional permutations\n", "  - Images with varying brightness\n", "  - Translating the image\n", "  - Padding the image\n", "  - Cropping the image\n", "- Unintentional changes\n", "  - Copying the image from one format to another (png->jpeg)\n", "  - Including a permuted image and the original\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Initialize the 
Duplicates class\n", "dups = Duplicates()\n", "\n", "# Find the duplicates\n", "dups.evaluate(d[0] for d in ds)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, there are no duplicates in this dataset, since it was curated for a specific competition.\n", "\n", "However, to highlight the abilities of the `Duplicates` class, you will add some duplicates to the dataset and then rerun the `Duplicates` class.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create exact and near duplicate images\n", "\n", "# Copy images 23 and 46 to create exact duplicates\n", "# Copy and crop images 5 and 4376 to create near duplicates\n", "dupes = [\n", " ds[23][0],\n", " ds[46][0],\n", " ds[5][0][:, 5:-5, 5:-5],\n", " ds[4376][0][:, :-5, 5:],\n", "]\n", "\n", "dupes_stats = hashstats(dupes)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Find the duplicates appended to the dataset\n", "duplicates = dups.from_stats([dups.stats, dupes_stats])\n", "print(f\"exact: {duplicates.exact}\")\n", "print(f\"near: {duplicates.near}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As shown above, the `Duplicates` class identified all images from the second dataset as exact or near duplicates.\n", "\n", "Images 0 and 1 from dataset 1 are identified as exact duplicates of images 23 and 46, respectively, from the original dataset (dataset 0). 
Images 2 and 3 from dataset 1, which were cropped from images 5 and 4376 of the original dataset (dataset 0), are identified as near duplicates of those images, respectively.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Conclusion\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now comes the fun part: determining which data points belong in the dataset, which need to be removed, and whether you need to collect more data points for a given class or style of image.\n", "\n", "You will need to inspect the flagged images.\n", "Viewing the flagged images in relation to the other images with the same class and the rest of the dataset will help you determine what to do with each image.\n", "Examples of issues include mislabeled images, under-represented classes, and discrepancies in image characteristics (e.g. brightness) between classes.\n", "\n", "As you can see, the DataEval methods are here to help you gain a deep understanding of your dataset and all of its strengths and limitations.\n", "DataEval is designed to help you create representative and reliable datasets.\n", "\n", "Good luck with your data!\n", "\n", "---\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What's Next\n", "\n", "In addition to exploring a dataset in its feature space, DataEval also offers the following tutorials:\n", "\n", "- Explore images through clustering with the [Assessing the Data Space Guide](EDA_Part2.ipynb)\n", "- Identify bias or other factors in a dataset which may influence model performance with the [Identifying Bias and Correlations Guide](EDA_Part3.ipynb)\n", "- Monitor data for shifts during operation with the [Data Monitoring Guide](Data_Monitoring.ipynb)\n", "\n", "To learn more about specific functions or classes, see the [Concept pages](../concepts/index.md).\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## On your own\n", "\n", "Now that you've gone through a tutorial on exploring a 
dataset, try going through the tutorial again with the test set, the full dataset, or even your own dataset.\n", "One thing to look for when checking other sets of data is how the stats of each grouping of data change or stay the same.\n", "\n", "You can also play around with the different statistical methods that the `Outliers` class employs to see how the method affects the number and type of issues detected." ] } ], "metadata": { "kernelspec": { "display_name": ".venv-3.11", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.9" } }, "nbformat": 4, "nbformat_minor": 2 }