How to visualize cleaning issues¶
Problem statement¶
Exploratory data analysis (EDA) can be overwhelming because there are so many things to check: duplicates in your dataset, bad or corrupted images, blurred images, overly bright or dark images, and the list goes on.
DataEval provides a data cleaning class, Outliers, to assist you with your EDA so you can start training your models on high-quality data.
When to use¶
Use the cleaning class during the initial EDA process, or whenever you need to verify that your dataset contains the data you expect.
What you will need¶
A dataset to analyze
A Python environment with the following packages installed:
dataeval
maite-datasets
Getting started¶
Let’s import the libraries needed to set up a minimal working example.
import polars as pl
from maite_datasets.image_classification import CIFAR10
from dataeval import Metadata
from dataeval.config import set_max_processes
from dataeval.quality import Outliers
set_max_processes(4)
_ = pl.Config.set_tbl_rows(-1)
Loading in the data¶
We are going to start by loading in the CIFAR-10 dataset.
The CIFAR-10 dataset contains 60,000 images - 50,000 in the train set and 10,000 in the test set. For the purposes of this demonstration, we are just going to use the test set.
# Load in the CIFAR10 dataset
testing_dataset = CIFAR10("./data", image_set="test", download=True)
# Create the metadata for the dataset
metadata = Metadata(testing_dataset)
Cleaning the dataset¶
Now we can begin finding the images that differ significantly from the rest of the data.
# Initialize the Outliers class
outliers = Outliers()
# Evaluate the data
results = outliers.evaluate(testing_dataset)
# Also evaluate the data classwise
results_classwise = results.classwise(metadata)
The results are a dictionary whose keys are the images that have an issue in one of the properties listed below:
Brightness
Blurriness
Missing
Zero
Width
Height
Size
Aspect Ratio
Channels
Depth
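To make "significantly different" concrete, here is a minimal sketch of one common way to flag outliers in a single per-image statistic: the modified z-score, which is based on the median and the median absolute deviation. This is an illustration only and is not taken from DataEval's implementation. The brightness values, the function names, and the 3.5 threshold are all assumptions chosen for the example.

```python
from statistics import median


def modified_zscores(values):
    """Return the modified z-score of each value (median/MAD based)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values sit at the median (or nearly so): nothing stands out.
        return [0.0 for _ in values]
    return [0.6745 * (v - med) / mad for v in values]


def flag_outliers(values, threshold=3.5):
    """Return the positions whose modified z-score exceeds the threshold."""
    scores = modified_zscores(values)
    return [i for i, s in enumerate(scores) if abs(s) > threshold]


# Hypothetical mean brightness per image, scaled to [0, 1].
brightness = [0.42, 0.45, 0.40, 0.47, 0.44, 0.43, 0.98, 0.41]
flagged = flag_outliers(brightness)
print(flagged)  # position 6 (the 0.98 image) is the only one flagged
```

Because the median and MAD are themselves robust to extreme values, a single very bright image does not shift the threshold the way it would with a mean and standard deviation.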
print(f"Total number of images with an issue: {len(results.aggregate_by_item())}")
Total number of images with an issue: 319
print(f"Total number of images with an issue (classwise): {len(results_classwise.aggregate_by_item())}")
Total number of images with an issue (classwise): 322
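The two counts differ (319 versus 322) because a classwise evaluation compares each image only with images of the same class, so the thresholds shift. The sketch below illustrates that idea with a modified z-score on made-up values; it does not reproduce DataEval's internals. The helper names, the values, and the labels are hypothetical.

```python
from collections import defaultdict
from statistics import median


def outlier_positions(values, threshold=3.5):
    """Positions whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return set()
    return {
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    }


def global_outliers(values):
    """Compare every image against the whole dataset."""
    return outlier_positions(values)


def classwise_outliers(values, labels):
    """Compare every image only against images sharing its label."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    flagged = set()
    for indices in by_class.values():
        local = outlier_positions([values[i] for i in indices])
        flagged.update(indices[j] for j in local)
    return flagged


# Hypothetical brightness values: "ship" images are bright, "frog" images dark.
values = [0.80, 0.82, 0.79, 0.81, 0.83, 0.45,
          0.20, 0.22, 0.19, 0.21, 0.23, 0.24]
labels = ["ship"] * 6 + ["frog"] * 6

print(global_outliers(values))             # nothing stands out dataset-wide
print(classwise_outliers(values, labels))  # index 5 is unusually dark for a ship
```

The same effect can work in the other direction: an image that looks extreme against the whole dataset may be typical for its class, so it is flagged globally but not classwise.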
# View issues by metric
results.aggregate_by_metric()
| metric_name | Total |
|---|---|
| cat | u32 |
| "zeros" | 197 |
| "kurtosis" | 94 |
| "entropy" | 60 |
| "skew" | 29 |
| "contrast" | 15 |
| "brightness" | 8 |
| "var" | 1 |
# View issues by metric (classwise)
results_classwise.aggregate_by_metric()
| metric_name | Total |
|---|---|
| cat | u32 |
| "zeros" | 199 |
| "kurtosis" | 83 |
| "entropy" | 61 |
| "contrast" | 15 |
| "skew" | 13 |
| "brightness" | 5 |
| "var" | 4 |
| "darkness" | 3 |
# View issues by class
results.aggregate_by_class(metadata)
| class_name | brightness | contrast | entropy | kurtosis | skew | var | zeros | Total |
|---|---|---|---|---|---|---|---|---|
| cat | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
| "airplane" | 7 | 1 | 35 | 44 | 19 | 0 | 21 | 127 |
| "bird" | 1 | 1 | 6 | 20 | 4 | 1 | 24 | 57 |
| "cat" | 0 | 1 | 3 | 8 | 3 | 0 | 27 | 42 |
| "frog" | 0 | 1 | 4 | 3 | 0 | 0 | 31 | 39 |
| "automobile" | 0 | 1 | 3 | 1 | 0 | 0 | 31 | 36 |
| "deer" | 0 | 6 | 1 | 9 | 2 | 0 | 15 | 33 |
| "horse" | 0 | 1 | 1 | 1 | 0 | 0 | 19 | 22 |
| "ship" | 0 | 0 | 3 | 7 | 0 | 0 | 10 | 20 |
| "dog" | 0 | 1 | 2 | 0 | 0 | 0 | 12 | 15 |
| "truck" | 0 | 2 | 2 | 1 | 1 | 0 | 7 | 13 |
| "Total" | 8 | 15 | 60 | 94 | 29 | 1 | 197 | 404 |
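The "Total" row adds up to 404 flags, which is more than the 319 flagged images reported earlier. Presumably this is because a single image can be flagged on several metrics at once, so it is counted once per item but once for every metric in the tables. The sketch below shows the two ways of counting on a small hypothetical mapping; the `issues` dictionary and the function names are illustrative and are not part of the DataEval API.

```python
from collections import Counter

# Hypothetical flags: image index -> set of metrics it was flagged on.
issues = {
    3: {"zeros"},
    17: {"kurtosis", "skew"},
    42: {"entropy", "kurtosis", "zeros"},
}


def count_by_item(issues):
    """Number of distinct images with at least one flag."""
    return len(issues)


def count_by_metric(issues):
    """Number of flags per metric, counting an image once per metric."""
    counts = Counter()
    for metrics in issues.values():
        counts.update(metrics)
    return counts


n_items = count_by_item(issues)
by_metric = count_by_metric(issues)
total_flags = sum(by_metric.values())

print(n_items)      # 3 distinct images
print(total_flags)  # 6 flags across all metrics
```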
# View issues by class (classwise)
results_classwise.aggregate_by_class(metadata)
| class_name | brightness | contrast | darkness | entropy | kurtosis | skew | var | zeros | Total |
|---|---|---|---|---|---|---|---|---|---|
| cat | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
| "airplane" | 0 | 2 | 0 | 2 | 18 | 2 | 1 | 21 | 46 |
| "cat" | 0 | 1 | 0 | 7 | 10 | 3 | 0 | 24 | 45 |
| "truck" | 3 | 2 | 2 | 10 | 10 | 2 | 0 | 16 | 45 |
| "frog" | 0 | 1 | 0 | 6 | 3 | 0 | 0 | 31 | 41 |
| "automobile" | 0 | 0 | 0 | 11 | 3 | 0 | 0 | 25 | 39 |
| "bird" | 0 | 0 | 0 | 3 | 10 | 2 | 2 | 20 | 37 |
| "deer" | 1 | 6 | 0 | 6 | 6 | 3 | 0 | 15 | 37 |
| "horse" | 0 | 2 | 1 | 7 | 5 | 0 | 0 | 19 | 34 |
| "ship" | 1 | 0 | 0 | 3 | 14 | 1 | 1 | 13 | 33 |
| "dog" | 0 | 1 | 0 | 6 | 4 | 0 | 0 | 15 | 26 |
| "Total" | 5 | 15 | 3 | 61 | 83 | 13 | 4 | 199 | 383 |
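Once you have reviewed the flagged images and decided which ones to drop, a cleaned subset can be built by skipping their indices. The sketch below assumes you have already collected the flagged image indices as plain integers. How you extract them from the results object is not shown here, and `kept_indices` is a hypothetical helper, not a DataEval function.

```python
def kept_indices(dataset_size, flagged):
    """Return the indices that remain after removing the flagged ones."""
    flagged = set(flagged)
    out_of_range = [i for i in flagged if not 0 <= i < dataset_size]
    if out_of_range:
        raise ValueError(f"flagged indices out of range: {sorted(out_of_range)}")
    return [i for i in range(dataset_size) if i not in flagged]


# Hypothetical example: a 6-image dataset with images 1 and 4 flagged.
clean = kept_indices(6, {1, 4})
print(clean)  # [0, 2, 3, 5]
```

The resulting index list can then be used to select items from any indexable dataset, for example with a list comprehension over `clean`.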