Detecting duplicates#
What is it#
The Duplicates class identifies exact and near duplicate data.
When to use it#
The Duplicates class should be used if you need to check for duplicates in your dataset.
Theory behind it#
With the Duplicates class, exact matches are found using a byte hash of the data information, while near matches (such as a crop of another image or a distoration of another image) use a perception based hash.
The byte hash is achieved through the use of the python-xxHash Python module, which is based on Yann Collet’s xxHash C library.
The perceptual hash is achieved on an image by resizing to a square NxN image using the Lanczos algorithm where N is 32x32 or the largest multiple of 8 that is smaller than the input image dimensions. The resampled image is compressed using a discrete cosine transform and the lowest frequency component is encoded as a bit array of greater or less than median value and returned as a hex string.