Detecting duplicates

What is it

The Duplicates class identifies exact duplicates and near duplicates in a dataset.

When to use it

Use the Duplicates class when you need to check your dataset for duplicates.

Theory behind it

With the Duplicates class, exact matches are found using a byte hash of the data, while near matches (such as a crop or a distortion of another image) are found using a perception-based hash.

The byte hash is computed with the python-xxHash Python module, which is based on Yann Collet’s xxHash C library.
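As a rough illustration of exact matching by byte hash, the sketch below groups items whose raw bytes hash to the same digest. It is not the Duplicates class implementation: the helper names `byte_hash` and `find_exact_duplicates` are hypothetical, the choice of the `xxh64` variant is an assumption (the text does not say which xxHash variant is used), and the `hashlib` fallback is only a stand-in so the example runs when python-xxHash is not installed.

```python
import hashlib

try:
    import xxhash  # third-party python-xxHash module (pip install xxhash)
except ImportError:
    xxhash = None  # the sketch falls back to hashlib below


def byte_hash(data: bytes) -> str:
    """Hypothetical helper: hex digest of the raw bytes."""
    if xxhash is not None:
        # Assumption: xxh64 is used here purely for illustration.
        return xxhash.xxh64(data).hexdigest()
    # Stand-in only: BLAKE2b is not xxHash, but identical bytes still
    # produce identical digests, which is all this demo relies on.
    return hashlib.blake2b(data, digest_size=8).hexdigest()


def find_exact_duplicates(items):
    """Hypothetical helper: group the indices of items with equal hashes."""
    groups = {}
    for index, data in enumerate(items):
        groups.setdefault(byte_hash(data), []).append(index)
    # Keep only the groups that contain more than one item.
    return [indices for indices in groups.values() if len(indices) > 1]


samples = [b"\x00\x01\x02", b"\xff\xfe", b"\x00\x01\x02", b"\xff\xfe", b"unique"]
print(find_exact_duplicates(samples))  # → [[0, 2], [1, 3]]
```

Because the hash covers every byte, changing even a single pixel value produces a different digest, which is why near matches need a separate, perception-based hash.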

The perceptual hash of an image is computed by first resizing it to a square NxN image using the Lanczos algorithm, where N is 32 or the largest multiple of 8 that is smaller than the input image dimensions. The resampled image is then transformed with a discrete cosine transform, and the lowest-frequency components are encoded as a bit array in which each bit records whether a component is greater or less than the median value. The bit array is returned as a hex string.
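The steps after resizing can be sketched in pure Python as follows. This is a minimal illustration under stated assumptions, not the library's code: the function names are hypothetical, the input is assumed to be an already resized NxN grayscale grid (the Lanczos resize itself is omitted; an image library such as Pillow could perform it), the DCT is an unnormalized DCT-II, and keeping an 8x8 block of low-frequency coefficients is an assumption that the text does not confirm.

```python
import math
from statistics import median


def resize_target(height, width, max_size=32):
    """Hypothetical helper: pick N as max_size, or the largest multiple
    of 8 that fits the smaller image side when the image is smaller."""
    smallest = min(height, width)
    if smallest < 8:
        raise ValueError("image is too small for this sketch")
    return max_size if smallest >= max_size else (smallest // 8) * 8


def low_freq_dct(pixels, keep=8):
    """Unnormalized 2-D DCT-II, computing only the keep x keep
    lowest-frequency coefficients of a square grid."""
    n = len(pixels)
    # cos_table[u][x] = cos(pi * (2x + 1) * u / (2n))
    cos_table = [
        [math.cos(math.pi * (2 * x + 1) * u / (2 * n)) for x in range(n)]
        for u in range(keep)
    ]
    # Transform along each row first ...
    rows = [
        [sum(p * c for p, c in zip(row, cos_table[v])) for v in range(keep)]
        for row in pixels
    ]
    # ... then along each column.
    return [
        [sum(rows[y][v] * cos_table[u][y] for y in range(n)) for v in range(keep)]
        for u in range(keep)
    ]


def perceptual_hash(pixels, keep=8):
    """Threshold the low-frequency coefficients at their median and
    return the resulting bits as a hex string."""
    coeffs = [c for row in low_freq_dct(pixels, keep) for c in row]
    med = median(coeffs)
    bits = "".join("1" if c > med else "0" for c in coeffs)
    return format(int(bits, 2), f"0{keep * keep // 4}x")


# A synthetic 32x32 grayscale grid standing in for a resized image.
grid = [[(3 * x * x + 7 * y + x * y) % 256 for x in range(32)] for y in range(32)]
print(perceptual_hash(grid))
```

Because only coarse frequency content survives the thresholding, small changes such as mild noise or a uniform brightness shift tend to leave most bits unchanged, so near duplicates end up with equal or very similar hashes.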