Introduction to Data Cleaning

Part 1 of our introduction to exploratory data analysis guide

What you’ll do

  • You will use DataEval’s linters to assess the 2012 VOC dataset.

  • You will analyze the results through various plots and tables.

What you’ll learn

  • You’ll learn how to assess a dataset for extreme and/or redundant data points.

  • You’ll learn helpful questions to determine when to remove or collect additional data.

What you’ll need

  • Environment Requirements

    • dataeval or dataeval[all]


Introduction

Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize the main characteristics and identify incongruencies in the data. Before diving into machine learning or statistical modeling, it is crucial to understand the data you are working with. EDA helps in understanding the patterns, detecting anomalies, checking assumptions, and determining relationships in the data.

One of the most important aspects of EDA is data cleaning. A portion of DataEval is dedicated to being able to identify duplicates and outliers as well as data points that have missing or too many extreme values. These techniques help ensure that you only include high quality data for your projects.

Step-by-Step Guide

This guide will walk through how to use DataEval to perform basic data cleaning.

Setup

You’ll begin by importing the necessary libraries to walk through this guide.

# You will need matplotlib for visualing our dataset and
# numpy to be able to handle the data.
import matplotlib.pyplot as plt
import numpy as np
from maite_datasets.object_detection import VOCDetection

# Load the classes from DataEval that are helpful for EDA
from dataeval.detectors.linters import Duplicates, Outliers
from dataeval.metrics.stats import hashstats, labelstats

# Set the random value
rng = np.random.default_rng(213)
# Helper method to plot images of interest
def plot_sample_images(
    dataset, outlier_class, outlier_result, metric: str, metric_dict: dict[str, list[int]], layout: tuple[int, int]
) -> None:
    _, axs = plt.subplots(*layout, figsize=(10, layout[0] * 4))
    selected_index = rng.choice(metric_dict[metric], min(int(np.prod(layout)), len(metric_dict[metric])), replace=False)

    for i, ax in enumerate(axs.flat):
        ax.imshow(dataset[selected_index[i]][0].transpose(1, 2, 0))
        ax.set_title(f"{metric}={np.round(outlier_result.issues[selected_index[i]][metric], 2)}")
        ax.axis("off")

    print(f"metric={metric}")
    print(f"quantiles={np.round(np.quantile(outlier_class.stats[metric], [0, 0.25, 0.5, 0.75, 1]), 2)}")
    plt.tight_layout()
    plt.show()

Step 1: Understand the Data

Load the Data

You are going to work with the PASCAL VOC 2012 dataset. This dataset is a small curated dataset that was used for a computer vision competition. The images were used for classification, object detection, and segmentation. This dataset was chosen because it has multiple classes and images with a variety of sizes and objects.

If this data is already on your computer you can change the file location from "./data" to wherever the data is stored. Just remember to also change the download value from True to False.

For the sake of ensuring that this tutorial runs quickly on most computers, you are going to analyze only the training set of the data, which is a little under 6000 images.

# Download the data and then load it as a torch Tensor
ds = VOCDetection("./data", image_set="train", year="2012", download=True)
print(ds)
VOCDetection Dataset
--------------------
    Year: 2012
    Transforms: []
    Image_set: train
    Metadata: {'id': 'VOCDetection_train', 'index2label': {0: 'aeroplane', 1: 'bicycle', 2: 'bird', 3: 'boat', 4: 'bottle', 5: 'bus', 6: 'car', 7: 'cat', 8: 'chair', 9: 'cow', 10: 'diningtable', 11: 'dog', 12: 'horse', 13: 'motorbike', 14: 'person', 15: 'pottedplant', 16: 'sheep', 17: 'sofa', 18: 'train', 19: 'tvmonitor'}, 'split': 'train'}
    Path: /dataeval/docs/source/notebooks/data/vocdataset/VOCdevkit/VOC2012
    Size: 5717

Inspect the Data

As this data was used for a computer vision competition, it will most likely have very few issues, but it is always worth it to check. Many of the large webscraped datasets available for use do contain image issues. Verifying in the beginning that you have a high quality dataset is always easier than finding out later that you trained a model on a dataset with erroneous images or a set of splits with leakage.

# Calculate basic label statistics from the dataset
lstats = labelstats(ds)

# Display label stats
print(lstats.to_table())
Class Count: 20
Label Count: 15774
Average # Labels per Image: 2.76
--------------------------------------
      Label: Total Count - Image Count
  aeroplane:     470     -     328
    bicycle:     410     -     281
       bird:     592     -     399
       boat:     508     -     264
     bottle:     749     -     399
        bus:     317     -     219
        car:    1191     -     621
        cat:     609     -     540
      chair:    1457     -     656
        cow:     355     -     155
diningtable:     373     -     318
        dog:     768     -     636
      horse:     377     -     238
  motorbike:     375     -     274
     person:    5019     -    2142
pottedplant:     557     -     289
      sheep:     509     -     171
       sofa:     399     -     359
      train:     327     -     275
  tvmonitor:     412     -     299

The above table shows that this dataset has a total of 20 classes.

Of the classes, person is the class with the highest total object count followed by chair and car, while person, chair and dog are the classes with the highest number of images.

cow, sheep, and bus are the classes with least number of objects, while bus, train and cow are the classes with the least number of images.

This table helps point out the wide variation in

  • the number of classes per image,

  • the number of objects per image,

  • and the number of objects of each class per image.

This highlights an important concept - class balance. A dataset that is imbalanced can result in a model that chooses the more prominent class more often just because there are more samples in that class. To explore this concept further, see the bias tutorial in the What’s Next section at the end of this tutorial.

Now that the metadata has been examined, it’s important to inspect random images to get an idea of the variety of backgrounds, the range of colors, the locations of objects in images, and how often an image is seen with a single object versus multiple objects.

# Plot random images from each category
_, axs = plt.subplots(5, 4, figsize=(8, 10))

for ax, (category, indices) in zip(axs.flat, lstats.image_indices_per_class.items()):
    # Randomly select an index from the list of indices
    ax.imshow(ds[rng.choice(indices)][0].transpose(1, 2, 0))
    ax.set_title(lstats.class_names[category])
    ax.axis("off")

plt.tight_layout()
plt.show()
../_images/dda63286f6ab2e7ba86e802d49560535b672c3b63bafeb2a27ed2b2282b0bbaf.png

Plotting the images displays the variety in the images, including image sizes, image brightness, object sizes, backgrounds, number of objects in the image, and even the lack of color in a few images which are black and white.

This is where DataEval comes in. It’s designed to help you make sense of the many different aspects that affect building representative datasets and robust models.

In addition to making sure that you understand the structure of the labels and have visualized some of the images from the dataset, you can also visualize the data distribution across different statistics such as the image size or the pixel mean. In order to view these distributions, you have to use DataEval’s stat functions and plot the results. For more information, see :module:dataeval.metrics.stats.

Now, you can move on to identifying which images have a statistical difference from the rest of the images.

Step 2: Identify any Outlying Data Points

Extreme/Missing Values

Here you will detect and identify the images associated with the extreme values from DataEval’s stat functions. To detect these extreme values, you will use the :class:.Outliers class. The Outliers class has multiple methods to determine the extreme values, which are discussed in the Data Cleaning explanation. For this guide, you will use the “zscore” as the Z score defines outliers in a normal distribution.

The output of the Outliers class contains a dictionary where the image number is the key and the value is a dictionary containing the flagged metrics and their value.

# This cell takes about 1-5 minutes to run depending on your hardware

# Initialize the Outliers class
outliers = Outliers(outlier_method="zscore")

# Find the extreme images
outlier_imgs = outliers.evaluate(ds)

# View the number of extreme images
print(f"Number of images with extreme values: {len(outlier_imgs)}")
Number of images with extreme values: 480

This class can flag a lot of images, depending on how varied the dataset is and which method you use to define extreme values. Using the zscore, it flagged 480 images across 15 metrics out of the 5717 images in the dataset. However, switching the method can give different results.

# List the metrics with an extreme value
metrics = {}
for img, group in outlier_imgs.issues.items():
    for extreme in group:
        if extreme in metrics:
            metrics[extreme].append(img)
        else:
            metrics[extreme] = [img]
print(f"Number of metrics with extremes: {len(metrics)}")

# Show the total number of extreme values for each metric
for group, imgs in sorted(metrics.items(), key=lambda item: len(item[1]), reverse=True):
    print(f"  {group} - {len(imgs)}")
Number of metrics with extremes: 15
  size - 173
  entropy - 124
  contrast - 92
  skew - 89
  zeros - 77
  kurtosis - 73
  brightness - 52
  width - 43
  var - 35
  aspect_ratio - 33
  mean - 29
  std - 22
  height - 22
  darkness - 20
  sharpness - 2

Digging into the flagged images and organizing them by category shows that the metric with the most extreme values is “size” while “sharpness” has the least number of extreme values.

Outliers is designed to flag any images on the edge of each metric’s data distribution. Some images will get flagged as an outlier by multiple metrics, while others will get flagged by only a single metric. It is then up to you, the user, to shift through the information provided by the result from Outliers.

Part of exploring the results includes displaying how the flagged images are spread across the 20 classes.

# Display the table
print(outlier_imgs.to_table(lstats))
      Class | aspect_ratio | brightness | contrast | darkness | entropy | height | kurtosis | mean  | sharpness | size  | skew  |  std  |  var  | width | zeros | Total
  aeroplane |      7       |     22     |    4     |    2     |   28    |   5    |    23    |   7   |     0     |  12   |  30   |   6   |   1   |   1   |   1   |  149 
    bicycle |      0       |     1      |    3     |    0     |    3    |   0    |    2     |   1   |     0     |  10   |   3   |   0   |   0   |   5   |   2   |  30  
       bird |      3       |     8      |    3     |    1     |   16    |   1    |    12    |   1   |     0     |  10   |  14   |   5   |   0   |   1   |   7   |  82  
       boat |      7       |     2      |    2     |    1     |    6    |   4    |    2     |   1   |     0     |   8   |   3   |   2   |   0   |   0   |   1   |  39  
     bottle |      1       |     2      |    15    |    5     |   10    |   0    |    8     |   3   |     0     |  10   |   8   |   1   |   3   |   6   |   9   |  81  
        bus |      2       |     0      |    1     |    0     |    2    |   0    |    2     |   1   |     0     |   5   |   2   |   1   |   1   |   1   |   0   |  18  
        car |      4       |     0      |    7     |    2     |    6    |   3    |    2     |   1   |     0     |  18   |   1   |   0   |   3   |   3   |   7   |  57  
        cat |      1       |     2      |    10    |    1     |    5    |   3    |    3     |   1   |     0     |  24   |   2   |   3   |   9   |   4   |  11   |  79  
      chair |      1       |     4      |    8     |    2     |    9    |   1    |    5     |   3   |     0     |  13   |   5   |   0   |   1   |   3   |  10   |  65  
        cow |      0       |     0      |    1     |    0     |    1    |   1    |    1     |   0   |     0     |   3   |   1   |   0   |   1   |   2   |   1   |  12  
diningtable |      0       |     1      |    3     |    0     |    2    |   0    |    1     |   0   |     0     |   6   |   2   |   0   |   2   |   0   |   2   |  19  
        dog |      2       |     3      |    4     |    0     |    8    |   2    |    1     |   2   |     1     |  23   |   2   |   3   |   2   |   7   |   5   |  65  
      horse |      1       |     0      |    2     |    0     |    2    |   0    |    0     |   0   |     0     |   5   |   0   |   0   |   0   |   0   |   0   |  10  
  motorbike |      1       |     1      |    5     |    2     |    3    |   0    |    2     |   1   |     0     |   8   |   2   |   0   |   2   |   1   |   3   |  31  
     person |      4       |     7      |    51    |    10    |   36    |   2    |    24    |   9   |     0     |  61   |  27   |   1   |   7   |  18   |  30   |  287 
pottedplant |      0       |     1      |    4     |    0     |    1    |   0    |    0     |   1   |     0     |   4   |   2   |   0   |   4   |   0   |   1   |  18  
      sheep |      1       |     0      |    1     |    0     |    0    |   0    |    1     |   0   |     1     |   3   |   1   |   0   |   1   |   0   |   1   |  10  
       sofa |      1       |     0      |    4     |    0     |    3    |   1    |    1     |   0   |     0     |   8   |   1   |   0   |   3   |   3   |   3   |  28  
      train |      3       |     3      |    0     |    0     |    4    |   2    |    0     |   2   |     0     |   9   |   0   |   1   |   4   |   2   |   2   |  32  
  tvmonitor |      1       |     0      |    5     |    0     |    2    |   2    |    3     |   0   |     0     |   7   |   2   |   0   |   0   |   1   |   7   |  30  

Some of the trends to note from the table above which splits the issues by class and metric:

  • An image with an unusual aspect ratio is most likely to contain a boat or aeroplane.

  • An image with an issue in brightness is most likely to contain an aeroplane.

  • An image with an issue in darkness is most likely to be a person.

  • Images with high contrast are likely to fall within 1 of 4 classes: bottle, cat, chair, person.

  • Images with low entropy (think image with constant pixels) are likely to fall within 1 of 4 classes: aeroplane, bird, bottle, person.

  • Unusual skew and kurtosis images follow a similar trend as entropy.

  • Every class has images with size issues.

Something to remember is that there are different number of images for each class and that effective use of this tool requires understanding the dataset in question. For example, 36 low entropy images out of the 2000 for person might be outliers while 28 low entropy images out of 300 for aeroplane might not be; low entropy might be an inherent characteristic of the aeroplane class.

In order to understand the above table, you will plot sample images from a few of the metrics, specifically:

  • entropy

  • size

  • zeros

  • sharpness

Entropy, variance, standard deviation, kurtosis, and skew all measure (in different ways) how much change there is across the pixels in the image, and entropy will be the easiest to understand.

Size, width, height and aspect ratio are all interrelated and size has the most extreme images from those.

Zeros is a category unto itself but it is closely related to brightness, contrast, darkness, and mean. Zeros measures the percentage of pixels with a zero value compared to the average image.

Sharpness is also in it’s own category and it measures the perceived edges in an image.

Questions

When looking at these images, you want to think about the following questions:

  • Does this image represent something that would be expected in operation?

  • Is there commonality to the objects in the images?

  • Is there commonality to the backgrounds of the images?

  • Is there commonality to the class of objects in the images?

Asking these questions will help you notice things like all objects being located on the leftside of the image or all the images of a specific class have a specific background. Training a model with data that has commonalities can cause your model to develop biases or limit your model’s ability to generalize to non-training data.

Entropy

# Plot images flagged for "entropy"
plot_sample_images(ds, outliers, outlier_imgs, "entropy", metrics, (2, 4))
metric=entropy
quantiles=[0.28 5.05 5.23 5.35 5.53]
../_images/398446cb3803b5cae975aceaaec11c33bfeb776facbdf096853145e03e775b5b.png

When you examine the flagged images for entropy, look for patterns in the content of the images. Many of these images may feature backgrounds with very little variation, such as water or sky. Others might have darker backgrounds than usual.

For example, in an operational setting, water or sky backgrounds may or may not appear frequently, depending on the expected use case. Similarly, darker images may indicate low-light conditions, which could suggest either operational relevance (e.g., night operations) or anomalies that need to be addressed.

To refine your dataset, decide whether these flagged images represent scenarios that align with your goals. If they do, consider collecting more data with similar characteristics to balance your dataset. If not, these images may be excluded as outliers.

Size

# Plot images flagged for "size"
plot_sample_images(ds, outliers, outlier_imgs, "size", metrics, (2, 4))
metric=size
quantiles=[ 41250. 166500. 187500. 187500. 250000.]
../_images/f2e913ca80812cbf05b03ff728e3e4789f69a98a527d86f15bf121cc2b6cab27.png

Flagged images for size often include examples where the objects in the image are unusually large or small relative to the rest of the dataset. For instance, animal images might have a wide range of sizes depending on how the photographs were taken.

If your workflow involves preprocessing images to a uniform size, verify that resizing does not distort important details. For example, cropping could remove key parts of the image, while resizing could stretch or compress objects. Alternatively, if you plan to filter images based on size, ensure this doesn’t introduce bias—for example, by disproportionately excluding images of certain classes or contexts.

After evaluating the flagged images, you may notice that size discrepancies are common across multiple classes, as shown in the earlier table. This observation suggests that these issues are a general feature of the dataset, and dropping all size outliers might be an appropriate step. However, be cautious and verify whether this action creates any imbalances.

Zeros

# Plot images flagged for "zeros"
plot_sample_images(ds, outliers, outlier_imgs, "zeros", metrics, (2, 4))
metric=zeros
quantiles=[0.   0.   0.   0.   0.71]
../_images/00b3f70110fe21e12eab7b554acb8516db81490fb10d763a363b79a267e8ab24.png

Images flagged for zeros typically feature large regions of completely black or gray pixels. Some of these may also appear in grayscale. These characteristics could indicate issues like underexposed photos, scanning errors, or specific use cases.

Grayscale images, in particular, might stand out if the rest of your dataset is primarily in color. Check whether grayscale images are relevant to your operational scenario or whether they are artifacts of the data collection process.

For instance, if grayscale images are operationally irrelevant, consider removing them. However, if grayscale scenarios are possible, ensure that you have sufficient representation of these types of images to train a robust model. Similarly, dark images with many zero-value pixels may indicate rare but valid scenarios (e.g., nighttime operations) or irrelevant anomalies.

Sharpness

# Plot images flagged for "sharpness"
plot_sample_images(ds, outliers, outlier_imgs, "sharpness", metrics, (1, 2))
metric=sharpness
quantiles=[  6.11  39.67  52.42  64.96 107.38]
../_images/783a477d6003958db3f3925e8f9f65f95134d8a1b384c52d411e75421c4830c1.png

Sharpness measures the clarity of edges in an image. Flagged images often include those with unusually crisp or blurry details. For instance, you might notice a close-up shot of leaves or grass, where the texture stands out significantly compared to other images in the dataset.

Evaluate whether these highly detailed images are typical of your use case. If they are uncommon in your operational scenario, they might skew your model’s ability to generalize. In such cases, consider excluding these images. Conversely, if they are operationally relevant, ensure that similar images are sufficiently represented in your dataset to prevent biases.

Linting Summary

The Outliers class identifies images that deviate significantly from the dataset’s overall distribution. While it cannot determine operational relevance, it highlights patterns that may require further investigation.

For example, flagged images might reflect real-world scenarios underrepresented in your dataset, such as night operations or objects photographed from unusual angles. Alternatively, they may reveal anomalies, such as artifacts from the data collection process.

By reviewing flagged images for multiple metrics and plotting examples, you can better understand how the Outliers class identifies extremes. This hands-on exploration helps you decide whether to include or exclude specific images based on your dataset’s intended use.

Step 3: Identify duplicate data

Duplicates

Now that you know how to identify poor quality images in your dataset, another important aspect of data cleaning is detecting and removing any duplicates.

The Duplicates class identifies both exact duplicates and potential (near) duplicates. Potential duplicates can occur in a variety of ways:

  • Intentional perturbations

    • Images with varying brightness

    • Translating the image

    • Padding the image

    • Cropping the image

  • Unintentional changes

    • Copying the image from one format to another (png->jpeg)

    • Using the same image with two different filenames

    • Duplicate frames from video extraction

    • Oversight in the data collection process

# Initialize the Duplicates class
dups = Duplicates()

# Find the duplicates
dups.evaluate(ds)
DuplicatesOutput(exact=[], near=[])

As expected there are no duplicates in this dataset, since it was curated for a specific competition.

However, to highlight the abilities of the Duplicates class, you will add some duplicates to the dataset and then rerun the Duplicates class.

# Create exact and duplicate images

# Copy images 23 and 46 to create exact duplicates
# Copy and crop images 5 and 4376 to create near duplicates
dupes = [
    ds[23][0],
    ds[46][0],
    ds[5][0][:, 5:-5, 5:-5],
    ds[4376][0][:, :-5, 5:],
]

dupes_stats = hashstats(dupes)
# Find the duplicates appended to the dataset
duplicates = dups.from_stats([dups.stats, dupes_stats])
print(f"exact: {duplicates.exact}")
print(f"near: {duplicates.near}")
exact: [{0: [23], 1: [0]}, {0: [46], 1: [1]}]
near: [{0: [5], 1: [2]}, {0: [4376], 1: [3]}]

As shown above, the Duplicates class identified all images from the second dataset as exact or near duplicates.

Images 0 and 1 from dataset 1 are identified as exact duplicates of images 23 and 46, respectively from the original dataset (dataset 0). Images 2 and 3 from dataset 1 are identified as near duplicates of images 5 and 4376, respectively, which were cropped from the original dataset (dataset 0).

Conclusion

Through this process, you’ve learned how to use DataEval’s Outliers class to identify and analyze images that deviate from the overall distribution of your dataset and DataEval’s Duplicates class to identify exact and near duplicates. By examining the images flagged by the different metrics, you gained a deeper understanding of potential issues within your dataset. In this tutorial, the following were covered:

  • Underrepresented classes that may require additional data collection.

  • Inconsistencies in image characteristics, such as brightness, sharpness, or size, which could affect model performance.

  • Duplicate data that can affect model performance.

This work has provided a clearer picture of your dataset’s strengths and limitations. You are now equipped to make informed decisions about which data points to keep, remove, or augment. For example, you may decide to exclude irrelevant outliers, collect more data for underrepresented scenarios, or address biases that could impact your model’s generalizability.

By using DataEval, you are not just refining your dataset—you are laying the groundwork for creating a more representative, balanced, and reliable dataset. These insights ultimately enable the development of models that perform robustly in real-world operational settings.

DataEval’s tools empower you to move from raw data to actionable insights, ensuring your dataset is not only comprehensive but also aligned with your specific goals and requirements.

Good luck with your data!


What’s Next

Learn how to do the following:

To learn more about specific functions or classes, see the API Reference section. To learn more about data cleaning, see the Data Cleaning explanation page.

On your own

Now that you’ve gone through a tutorial on exploring a dataset, try going through the tutorial again with the test set, full dataset, or even your own dataset. One thing to look for when checking other sets of data is to observe how the stats of each grouping of data changes or doesn’t change.

You can also play around with the different statistical methods that the Outlier class employs to see how the method affects the number and type of issues detected.