Dataset Deduplication Tutorial

Problem Statement

Exploratory data analysis (EDA) can be overwhelming. There are many things to check: duplicates in your dataset, bad or corrupted images, blurred or overly bright or dark images, and the list goes on.

DataEval created a Duplicates class to assist you with your EDA so you can start training your models on high-quality data.

When to use

The Duplicates class should be used if you need to check for duplicates in your dataset.

What you will need

A dataset to analyze

Getting Started

Let’s import the libraries needed to set up a minimal working example.

try:
    import google.colab  # noqa: F401

    %pip install -q dataeval
except Exception:
    pass

import os

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

import numpy as np
import tensorflow_datasets as tfds

from dataeval.detectors import Duplicates

Loading in the data

Let’s start by loading in TensorFlow’s MNIST dataset, then we will examine it.

The MNIST dataset contains 70,000 images: 60,000 in the train set and 10,000 in the test set. For the purposes of this demonstration, we are just going to use the test set.

# Load in the mnist dataset from tensorflow datasets
dataset, ds_info = tfds.load(
    "mnist",
    split="test",
    shuffle_files=True,
    with_info=True,
)  # type: ignore
tfds.visualization.show_examples(dataset, ds_info)  # type: ignore

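# Translate the dataset from tensorflow to numpy for use with the Duplicates class.
# Iterate over the dataset once so that labels and images stay aligned.
examples = list(tfds.as_numpy(dataset))  # type: ignore
labels = np.array([ex["label"] for ex in examples])
test_data = np.array([ex["image"] for ex in examples], dtype=np.float32)
# Move channels first, (N, H, W, C) -> (N, C, H, W), then drop the singleton channel axis
test_data = np.squeeze(test_data.transpose(0, 3, 1, 2))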

[Figure: sample images from the MNIST test set, as displayed by tfds.visualization.show_examples]
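The layout conversion above can be checked on a small stand-in array. The sketch below assumes the images arrive in (N, H, W, C) layout with a single channel, as MNIST does in TensorFlow Datasets, and uses random values rather than the real dataset, so only the shapes are meaningful.

```python
import numpy as np

# Stand-in for a batch of 5 grayscale images in (N, H, W, C) layout (random values, not MNIST)
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(5, 28, 28, 1)).astype(np.float32)

# Move the channel axis forward: (N, H, W, C) -> (N, C, H, W)
channels_first = fake_images.transpose(0, 3, 1, 2)

# Drop the singleton channel axis: (N, 1, H, W) -> (N, H, W)
squeezed = np.squeeze(channels_first)

print(channels_first.shape)
print(squeezed.shape)
```

The pixel values are untouched by these steps; only the axis order and the number of axes change.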

Because the MNIST dataset does not contain any exact duplicates, we are going to adjust the dataset to include some.

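# Create some exact duplicates by overwriting one image of a digit with another image of the same digit
print("Exact duplicates")
duplicates = {}
for i in [1, 2, 5, 9]:
    # Indices of all test images labeled with digit i
    matching_indices = np.where(labels == i)[0]
    # Overwrite the 79th image of digit i with a copy of the 24th
    test_data[matching_indices[78]] = test_data[matching_indices[23]]
    print(f"\t{i} - ({matching_indices[23]}, {matching_indices[78]})")
    # Record the duplicated pair for later comparison
    duplicates[i] = (matching_indices[23], matching_indices[78])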

Exact duplicates
	1 - (274, 706)
	2 - (210, 648)
	5 - (285, 887)
	9 - (193, 825)
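As a sanity check, planted duplicates can be confirmed with np.array_equal. The sketch below repeats the same idea on hypothetical stand-in data (random arrays and synthetic labels, not MNIST), so the indices it prints are unrelated to the ones above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 200 random 28x28 "images" with labels cycling through 0-9
images = rng.random((200, 28, 28), dtype=np.float32)
labels = np.arange(200) % 10

planted = {}
for digit in [1, 2, 5, 9]:
    idx = np.where(labels == digit)[0]
    src, dst = idx[3], idx[7]
    images[dst] = images[src]  # overwrite dst with a copy of src
    planted[digit] = (int(src), int(dst))

# Every planted pair should now be element-for-element identical
for digit, (src, dst) in planted.items():
    assert np.array_equal(images[src], images[dst])

print(planted)
```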

print("Number of samples: ", len(test_data))

Number of samples:  10000

Finding the Duplicates

Now we ask the Duplicates class to find the needle in the haystack: among the 10,000 images, only the 4 pairs created above are exact duplicates.

# Initialize the Duplicates class
duplicator = Duplicates()

# Evaluate the data
results = duplicator.evaluate(test_data)

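The results are a dictionary with the keys exact and near. Each key maps to a list of groups, and each group is a list of indices of images that are duplicates of one another. Let’s print each category to view the results.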

for category, images in results.items():
    print(f"{category} - {len(images)}")
    print(f"\t{images}")

exact - 4
	[[193, 825], [210, 648], [274, 706], [285, 887]]
near - 97
	[[92, 6629], [108, 2709], [245, 1063, 3633, 7862], [278, 8502, 8701], [281, 5004], [287, 608, 2535, 2794, 5240, 5653, 6127, 6183, 7664, 7667, 8540], [301, 9350], [366, 1432, 1939, 2282, 3413, 5326, 5433, 7157], [424, 1538], [470, 7409], [499, 4108, 9488], [539, 601, 3605, 4100, 6842, 8560], [559, 5566], [606, 1773, 2821], [756, 944], [834, 4975], [853, 6144], [889, 897], [892, 1413, 1668, 1756, 3916, 7295, 7519, 8126], [902, 3336, 6153], [919, 3323, 9007], [940, 8019], [1001, 8358], [1125, 4576], [1133, 2075], [1172, 2032], [1186, 3914], [1214, 1454], [1279, 3989], [1309, 4097, 6349], [1417, 9500], [1422, 5472], [1426, 3639, 4682, 8042, 8185], [1496, 1853], [1597, 5109], [1650, 4870], [1676, 1875, 3241, 5519, 5967], [1693, 2495], [1719, 6977], [1751, 7869], [1767, 4838], [1865, 7605], [1998, 5517, 6645, 6684, 8083], [2105, 4839], [2113, 6393], [2234, 7422], [2260, 4832, 8622, 9484, 9878], [2350, 2807], [2428, 5502], [2442, 2858], [2714, 5222, 9976], [2724, 3425], [2736, 9278], [2752, 8605, 9362], [2754, 2864], [2787, 2791, 5154, 5422, 8105, 8760], [2909, 7911], [2990, 9078], [3014, 5620], [3069, 6378], [3091, 7342], [3093, 3534, 5429, 9530], [3179, 8947], [3298, 3863], [3480, 7852, 9492], [3692, 4233], [3757, 4451], [3918, 9416], [3977, 6165], [4022, 4698, 8464, 8861], [4075, 9745], [4165, 6695], [4303, 9621], [4357, 7881], [4408, 6757], [4435, 9526], [4455, 8695], [4629, 8096], [4892, 8643], [4919, 8025], [4960, 7892], [5149, 8697], [5199, 6039], [5229, 8808], [5323, 8284], [5483, 6609], [5552, 6940], [5942, 7064], [6036, 8173], [6552, 9588], [7373, 9962], [7418, 8868], [7861, 8573], [8250, 9018], [8294, 8722], [8568, 8743], [9046, 9848]]

Recall from above that our exact duplicates were:

(274, 706), (210, 648), (285, 887), (193, 825)

This exactly matches what the Duplicates class found.
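One common way to detect exact duplicates is to hash each image's raw bytes and group the indices that share a hash. The sketch below illustrates that general idea with the standard library's hashlib on random stand-in data. It is not a description of how DataEval implements the Duplicates class, and find_exact_duplicates is a hypothetical helper name.

```python
import hashlib
from collections import defaultdict

import numpy as np


def find_exact_duplicates(images: np.ndarray) -> list[list[int]]:
    """Group indices of byte-identical images (hypothetical helper, not DataEval's API)."""
    groups = defaultdict(list)
    for index, image in enumerate(images):
        # Identical arrays of the same shape and dtype produce identical bytes
        digest = hashlib.sha256(image.tobytes()).hexdigest()
        groups[digest].append(index)
    # Keep only hashes shared by more than one image
    return sorted(group for group in groups.values() if len(group) > 1)


# Random stand-in images with two planted duplicate pairs
rng = np.random.default_rng(0)
images = rng.random((50, 28, 28), dtype=np.float32)
images[40] = images[7]
images[12] = images[3]

print(find_exact_duplicates(images))
```

Byte-level hashing only catches images that are identical in every pixel, which is why near duplicates need a separate, more tolerant comparison.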

It also found 97 groups of images that are very closely related to each other. Since we are using handwritten digits, we would expect some images to be nearly identical.
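Once the duplicate groups are known, a typical next step is to keep one image per group and drop the rest. The sketch below assumes the results have the dictionary-of-index-groups shape printed above. It uses a small hand-written stand-in for the results and a placeholder array for the images, and indices_to_drop is a hypothetical helper.

```python
import numpy as np


def indices_to_drop(groups: list[list[int]]) -> list[int]:
    """Keep the first index of each group and return the remaining indices, sorted."""
    drop = set()
    for group in groups:
        drop.update(group[1:])
    return sorted(drop)


# Hand-written stand-in shaped like the results shown above (not the full output)
results = {
    "exact": [[193, 825], [210, 648]],
    "near": [[92, 6629], [245, 1063, 3633]],
}

# Placeholder array standing in for the 10,000 test images
data = np.zeros((10000, 4, 4), dtype=np.float32)

drop = indices_to_drop(results["exact"])
deduplicated = np.delete(data, drop, axis=0)

print(drop)
print(deduplicated.shape)
```

Only the exact groups are dropped here. Near duplicates may still be distinct, legitimate samples, so it is usually worth inspecting those groups before removing anything.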