How to identify duplicates¶

Problem Statement¶

One of the first steps in Exploratory Data Analysis (EDA) is to check for duplicates. Duplicates add no new information and can distort model training by over-emphasizing features that in appear in the duplicates.

DataEval provides a Duplicates class to assist you in removing duplicates so you can start training your models on high quality data.

When to use¶

The Duplicates class should be used if you need to find duplicate images in your dataset.

What you will need¶

A python envornment with following packages installed:
- dataeval or dataeval[all]
A dataset to analyze

Getting Started¶

Let’s import the required libraries needed to set up a minimal working example

import numpy as np
from maite_datasets.image_classification import MNIST
from torch.utils.data import Subset

from dataeval.data import Metadata
from dataeval.detectors.linters import Duplicates

Loading in the data¶

Load the MNIST data and create the dataset.

The MNIST dataset contains 70,000 images - 60,000 in the train set and 10,000 in the test set. For the purposes of this demonstration, we are just going to use the test set.

# Load in the mnist dataset
testing_dataset = MNIST(root="./data/", image_set="test", download=True)

# Get the labels
labels = Metadata(testing_dataset).class_labels

Because the MNIST dataset does not contain any exact duplicates we are going to adjust the dataset to include some.

# Creating some indices to duplicate
print("Exact duplicates")
duplicates = {}
for i in [1, 2, 5, 9]:
    matching_indices = np.where(labels == i)[0]
    print(f"\t{i} - ({matching_indices[23]}, {matching_indices[78]})")
    duplicates[int(matching_indices[78])] = int(matching_indices[23])

Exact duplicates
- (180, 663)
- (249, 728)
- (219, 866)
- (212, 773)

# Create a subset with the identified duplicate indices swapped
indices_with_duplicates = [duplicates.get(i, i) for i in range(len(testing_dataset))]
duplicates_ds = Subset(testing_dataset, indices_with_duplicates)

Finding the Duplicates¶

Now we are asking our Duplicates class to find the needle in the haystack. There are only 4 exact duplicates.

# Initialize the Duplicates class to begin to identify duplicate images.
identifyDuplicates = Duplicates()

# Evaluate the data
results = identifyDuplicates.evaluate(duplicates_ds)

The results can be returned as a dictionary with exact and near as the keys. So we will extract those to view the results.

for category, images in results.data().items():
    print(f"{category} - {len(images)}")
    print(f"\t{images}")

exact - 4
	[[180, 663], [212, 773], [219, 866], [249, 728]]
near - 96
	[[57, 4039], [176, 5872], [178, 1867, 1876, 6141, 6901, 6917, 7280, 8068], [203, 272, 2822, 3003, 3430, 5637, 5699, 5896, 9368, 9395, 9415], [204, 8418], [223, 6171], [255, 3969], [330, 1238, 2541, 4191, 9282], [348, 1193], [377, 7340], [416, 9324], [430, 1657, 4651, 7717, 9499], [476, 1040, 7270, 9348], [652, 1760, 3152], [675, 831], [745, 9836], [772, 7636], [783, 2827, 3070, 5651, 7686, 8459], [809, 6913], [889, 6073], [920, 3480, 4524, 5211, 9464], [929, 1213, 4273, 4774, 6125, 6799], [941, 6830], [964, 8672], [993, 2937], [1011, 2867, 4179], [1025, 6634], [1075, 4050], [1083, 5880], [1137, 2164, 2379], [1280, 2626], [1424, 5917], [1448, 4858], [1484, 2984, 7303], [1674, 3777], [1729, 3452], [1836, 1844, 2239, 4006, 8633, 8666, 9143, 9434], [1861, 2674], [1939, 5614], [2357, 4147], [2418, 8352], [2434, 4643], [2442, 7950], [2688, 3741], [2736, 4936], [2786, 2997, 4104, 6159, 8488], [2874, 9586], [3054, 7362], [3196, 7344], [3249, 8757], [3281, 8090], [3421, 5453], [3546, 4871], [3590, 5632], [3639, 4977], [3894, 6338], [3895, 3959], [4014, 6891, 8658, 9120], [4021, 9081], [4074, 6408], [4138, 8113], [4144, 9335], [4169, 7323], [4198, 8899], [4304, 4676], [4428, 6083], [4580, 6308], [4606, 5072], [4624, 9101], [4646, 5566], [4754, 7019], [4773, 9990], [4975, 8781], [5042, 8006], [5154, 8911], [5323, 7399, 8252], [5399, 6484, 6976], [5466, 8401], [5534, 5544, 6397, 6848], [5616, 8903], [5848, 6232], [5903, 7300], [6115, 6224, 6231], [6306, 7375, 7390], [6670, 7582, 9878], [6928, 7353], [7059, 8364], [7149, 7756], [7226, 8809], [7255, 9951], [7626, 9301], [7784, 8905], [8390, 9802], [8758, 9705], [8850, 9570], [9772, 9796]]

The Duplicates class was able to find all 4 exact duplicates out of the 10,000 samples.

It also found several sets of images that are very closely related to each other, and since we are using hand written digits we would expect it to find some images that were nearly identical.