Detecting common augmentations as duplicates

This tutorial demonstrates how DataEval’s duplicate detection methods handle common torchvision augmentations.

Estimated time to complete: 10 minutes

Relevant ML stages: Data Engineering

Relevant personas: Data Engineer, ML Engineer

What you’ll do

  • Create synthetic test images and apply 30 torchvision transformations

  • Run both D4 hash-based and BoVW embedding-based duplicate detection

  • Compare which transformations each method catches or misses

  • Tune detection sensitivity with the cluster_sensitivity parameter

What you’ll learn

  • Which augmentation types are detectable as near-duplicates (and which aren’t)

  • When to use D4 hashes vs BoVW embeddings for duplicate detection

  • How D4 and BoVW have complementary strengths that improve coverage when combined

Quick reference: detection methods

Method                          Best For                                   Speed    Rotation Invariant
------------------------------  -----------------------------------------  -------  -------------------
D4 Hashes (phash_d4, dhash_d4)  Detecting rotated/flipped copies           Fast     Only 90° increments
BoVWExtractor                   Semantic similarity, different viewpoints  Slower   Any angle
Basic Hashes (phash, dhash)     Same-orientation near-duplicates           Fastest  No

Key insight: D4 hashes only handle the 8 symmetries of a square (rotations of 0°, 90°, 180°, and 270°, each with or without a flip). BoVW builds on SIFT features, which are designed to be rotation invariant, so it is the better choice for detecting copies rotated by arbitrary angles.
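To make the eight symmetries concrete, the sketch below enumerates them with NumPy and builds a toy orientation-independent signature by taking the minimum over all eight variants. The names d4_variants and toy_d4_signature are hypothetical and exist only for this illustration; this is not how DataEval computes phash_d4 or dhash_d4.

```python
import numpy as np


def d4_variants(img: np.ndarray) -> list[np.ndarray]:
    """Return the 8 dihedral (D4) variants of a 2-D array.

    Four rotations (0, 90, 180, 270 degrees), each with and without a left-right flip.
    """
    variants = []
    for k in range(4):
        rotated = np.rot90(img, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))
    return variants


def toy_d4_signature(img: np.ndarray) -> bytes:
    """Hypothetical signature: the smallest byte string among the 8 variants.

    Because the 8 variants form a closed set, any rotated or flipped copy
    produces the same set of candidates and therefore the same minimum.
    """
    return min(np.ascontiguousarray(v).tobytes() for v in d4_variants(img))


img = np.arange(16, dtype=np.uint8).reshape(4, 4)
print(toy_d4_signature(img) == toy_d4_signature(np.rot90(img)))   # 90° rotation
print(toy_d4_signature(img) == toy_d4_signature(np.flipud(img)))  # vertical flip
```

A 15° or 45° rotation is not one of these eight variants, which is why a signature built this way cannot match it.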

What you’ll need

  • A Python environment with the following packages installed:

    • dataeval

    • opencv-python or opencv-python-headless

    • torchvision

    • matplotlib

Introduction

Data augmentation is a common technique in deep learning, but augmented images can inadvertently appear in both training and test sets, or be saved as “new” images when they’re really transformations of existing ones. Understanding which augmentations are detectable as near-duplicates helps you:

  1. Identify data leakage - Find augmented copies that leaked between train/test splits

  2. Clean datasets - Remove redundant transformed images

  3. Validate augmentation pipelines - Ensure augmentations create sufficiently distinct images
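As a sketch of the first use case, suppose a duplicate detector has returned groups of image indices over a train set followed by a test set. The helper below, find_cross_split_groups, is a hypothetical name and assumes train images occupy indices 0 to n_train - 1; it only illustrates the bookkeeping and is not part of DataEval.

```python
def find_cross_split_groups(groups, n_train):
    """Return the groups that contain at least one train index and one test index.

    Assumes train images occupy indices [0, n_train) and test images follow.
    """
    leaking = []
    for group in groups:
        in_train = [i for i in group if i < n_train]
        in_test = [i for i in group if i >= n_train]
        if in_train and in_test:
            leaking.append({"train": in_train, "test": in_test})
    return leaking


# Hypothetical example: 5 training images (0-4) followed by 3 test images (5-7)
example_groups = [[0, 3], [2, 6], [5, 7]]
leaks = find_cross_split_groups(example_groups, n_train=5)
print(leaks)  # [{'train': [2], 'test': [6]}]
```

Only the group that spans both splits is reported; duplicates that stay within one split are a redundancy problem rather than a leakage problem.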

Getting started

Let’s import the required libraries.

from numbers import Number
from typing import cast

import matplotlib.pyplot as plt
import numpy as np
import torch
import torchvision.transforms.v2 as T
from PIL import Image

from dataeval import config
from dataeval.extractors import BoVWExtractor
from dataeval.flags import ImageStats
from dataeval.quality import Duplicates, DuplicatesOutput

config.set_batch_size(64)
config.set_max_processes(4)
config.set_seed(42)

Creating test data

We’ll create a synthetic image with rich texture patterns that SIFT can detect. Then we’ll apply various torchvision transformations to test detection capabilities.

def create_textured_image(seed: int, size: int) -> np.ndarray:
    """Create an image with texture patterns that SIFT can detect.

    Returns image in CHW format (3, H, W) with uint8 values.
    """
    rng = np.random.default_rng(seed)

    # Use the seed to generate random frequencies and phases
    # so each seed produces a genuinely different pattern
    freqs = rng.uniform(1.0, 5.0, size=6)
    phases = rng.uniform(0, 2 * np.pi, size=6)
    channel_offsets = rng.integers(5, 30, size=4)

    x = np.linspace(0, 6 * np.pi, size)
    y = np.linspace(0, 6 * np.pi, size)
    xx, yy = np.meshgrid(x, y)

    # Create pattern with seed-dependent frequency components
    pattern = (
        np.sin(xx * freqs[0] + phases[0]) * np.cos(yy * freqs[1] + phases[1])
        + np.sin(xx * freqs[2] + phases[2]) * np.cos(yy * freqs[3] + phases[3]) * 0.5
        + np.sin(xx * freqs[4] + yy * freqs[5] + phases[4]) * 0.3
        + rng.random((size, size)) * 0.2
    )

    # Normalize to 0-255
    pattern = ((pattern - pattern.min()) / (pattern.max() - pattern.min()) * 255).astype(np.uint8)

    # Create RGB image with seed-dependent channel variations
    img = np.stack(
        [
            pattern,
            np.roll(pattern, int(channel_offsets[0]), axis=0),
            np.roll(pattern, int(channel_offsets[1]), axis=1),
        ],
        axis=0,
    )  # Shape: (3, H, W)

    return img.astype(np.uint8)


def numpy_to_pil(img: np.ndarray) -> Image.Image:
    """Convert CHW numpy array to PIL Image."""
    return Image.fromarray(np.transpose(img, (1, 2, 0)))


def pil_to_numpy(img: Image.Image) -> np.ndarray:
    """Convert PIL Image to CHW numpy array."""
    return np.transpose(np.array(img), (2, 0, 1))


def tensor_to_numpy(tensor: torch.Tensor) -> np.ndarray:
    """Convert torch tensor (CHW, float 0-1 or uint8) to CHW numpy uint8."""
    if tensor.is_floating_point():
        # Clamp before scaling so out-of-range floats cannot wrap around in uint8
        tensor = (tensor.clamp(0, 1) * 255).to(torch.uint8)
    return tensor.numpy()


IMG_SIZE = 224

# Create base images
base_img1 = create_textured_image(seed=67, size=IMG_SIZE)
base_img2 = create_textured_image(seed=123, size=IMG_SIZE)
base_img3 = create_textured_image(seed=789, size=IMG_SIZE)

# Display base images
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for i, (img, title) in enumerate(
    [
        (base_img1, "Base Image 1 (seed=67)"),
        (base_img2, "Base Image 2 (seed=123)"),
        (base_img3, "Base Image 3 (seed=789)"),
    ]
):
    axes[i].imshow(np.transpose(img, (1, 2, 0)))
    axes[i].set_title(title)
    axes[i].axis("off")
plt.tight_layout()
plt.show()

[Figure: the three base images (seeds 67, 123, and 789), side by side]

Defining torchvision transformations

We’ll test a comprehensive set of common torchvision transformations, organized by category:

Category     Transformations                      Expected Detection
-----------  -----------------------------------  -----------------------------------------------------------
Geometric    Rotation, Flip, Affine, Perspective  High (SIFT is designed to be rotation- and scale-invariant)
Color        ColorJitter, Grayscale, Invert       Medium (depends on intensity)
Blur/Noise   GaussianBlur, Noise                  Medium to Low
Crop/Resize  RandomCrop, Resize, CenterCrop       Medium (depends on overlap)
Severe       RandomErasing, Heavy distortion      Low (features destroyed)

Important setup notes:

  • We use expand=True with a resize-back step for rotation transforms so that the full rotated content is preserved (nothing is clipped). Expanding the canvas exposes corner regions, which are filled with the fill value.

  • We use fill=128 (gray) instead of the default fill=0 (black) where fill is unavoidable. Black fill creates strong artificial edges that SIFT detects, corrupting the BoVW histogram.
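The effect of expand=True followed by a resize can be estimated with a little trigonometry. The sketch below is a back-of-the-envelope calculation, not torchvision code: expanded_side and content_scale are hypothetical helper names, and the exact canvas size torchvision produces may differ by a pixel because of rounding.

```python
import math


def expanded_side(side: float, degrees: float) -> float:
    """Side of the axis-aligned box that encloses a square rotated by `degrees`."""
    theta = math.radians(degrees)
    return side * (abs(math.cos(theta)) + abs(math.sin(theta)))


def content_scale(degrees: float) -> float:
    """Factor by which the original content shrinks once the canvas is resized back."""
    return 1.0 / expanded_side(1.0, degrees)


for angle in (15, 45, 90):
    print(
        f"{angle:>2}°: canvas ≈ {expanded_side(224, angle):.1f} px, "
        f"content scaled to ≈ {content_scale(angle):.0%}"
    )
```

For a 224-pixel image, a 45° rotation needs a canvas of roughly 317 pixels, so after resizing back the original content occupies about 71% of its former width and the rest is gray fill. A 90° rotation needs no extra canvas at all.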

FILL = 128  # Gray fill avoids artificial SIFT edges that black (0) would create


def _n(degrees: int) -> Number:
    """Helper to cast degrees to Number for Pylance."""
    return cast(Number, degrees)


# Helper: rotate with expand=True to preserve full content, then resize back
def _rotate_and_resize(degrees):
    return T.Compose([T.RandomRotation(degrees=(degrees, degrees), expand=True, fill=FILL), T.Resize(IMG_SIZE)])


# Define transformation categories
transformations = {
    # Geometric transformations - SIFT should handle these well
    "Rotation 15°": _rotate_and_resize(15),
    "Rotation 45°": _rotate_and_resize(45),
    "Rotation 90°": _rotate_and_resize(90),
    "Rotation 180°": _rotate_and_resize(180),
    "Horizontal Flip": T.RandomHorizontalFlip(p=1.0),
    "Vertical Flip": T.RandomVerticalFlip(p=1.0),
    "Affine (rotate+translate)": T.RandomAffine(degrees=_n(30), translate=(0.1, 0.1), fill=FILL),
    "Affine (rotate+scale)": T.RandomAffine(degrees=_n(15), scale=(0.8, 1.2), fill=FILL),
    "Perspective (mild)": T.RandomPerspective(distortion_scale=0.2, p=1.0, fill=FILL),
    "Perspective (strong)": T.RandomPerspective(distortion_scale=0.5, p=1.0, fill=FILL),
    # Color transformations - may or may not be detected
    "Brightness +30%": T.ColorJitter(brightness=(1.3, 1.3)),
    "Brightness -30%": T.ColorJitter(brightness=(0.7, 0.7)),
    "Contrast +50%": T.ColorJitter(contrast=(1.5, 1.5)),
    "Saturation +50%": T.ColorJitter(saturation=(1.5, 1.5)),
    "Hue Shift": T.ColorJitter(hue=(0.3, 0.3)),
    "Full ColorJitter": T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1),
    "Grayscale": T.Grayscale(num_output_channels=3),
    "Color Invert": T.RandomInvert(p=1.0),
    # Blur and noise
    "Gaussian Blur (mild)": T.GaussianBlur(kernel_size=5, sigma=(1.0, 1.0)),
    "Gaussian Blur (strong)": T.GaussianBlur(kernel_size=11, sigma=(3.0, 3.0)),
    # Crop and resize
    "Center Crop (80%)": T.Compose([T.CenterCrop(180), T.Resize(IMG_SIZE)]),
    "Center Crop (50%)": T.Compose([T.CenterCrop(112), T.Resize(IMG_SIZE)]),
    "Random Crop (80%)": T.Compose([T.RandomCrop(180), T.Resize(IMG_SIZE)]),
    "Resize Down+Up": T.Compose([T.Resize(112), T.Resize(IMG_SIZE)]),
    "Resize Down+Up (severe)": T.Compose([T.Resize(56), T.Resize(IMG_SIZE)]),
    # Severe transformations - likely to break detection
    "Random Erasing (10%)": T.RandomErasing(p=1.0, scale=(0.02, 0.1)),
    "Random Erasing (33%)": T.RandomErasing(p=1.0, scale=(0.2, 0.33)),
    # Combinations (common augmentation pipelines)
    "Augment: Flip+Rotate": T.Compose(
        [
            T.RandomHorizontalFlip(p=1.0),
            T.RandomRotation(degrees=_n(15), expand=True, fill=FILL),
            T.Resize(IMG_SIZE),
        ]
    ),
    "Augment: Flip+Color": T.Compose(
        [
            T.RandomHorizontalFlip(p=1.0),
            T.ColorJitter(brightness=0.2, contrast=0.2),
        ]
    ),
    "Augment: Full Pipeline": T.Compose(
        [
            T.RandomHorizontalFlip(p=0.5),
            T.RandomRotation(degrees=_n(10), expand=True, fill=FILL),
            T.Resize(IMG_SIZE),
            T.ColorJitter(brightness=0.1, contrast=0.1),
            T.GaussianBlur(kernel_size=3, sigma=(0.5, 0.5)),
        ]
    ),
}

# Apply all transformations to base image 1
images = []
labels = []

# Add the original image first
images.append(base_img1)
labels.append("Original (Base 1)")

# Apply each transformation to base image 1
base_pil = numpy_to_pil(base_img1)

for name, transform in transformations.items():
    torch.manual_seed(42)  # For reproducibility
    transformed = transform(base_pil)
    images.append(pil_to_numpy(transformed))
    labels.append(name)

# Add other base images as "unique" images (should NOT be detected as duplicates)
images.append(base_img2)
labels.append("Unique: Base 2")
images.append(base_img3)
labels.append("Unique: Base 3")

print(f"Created {len(images)} test images:")
print(f"  - {1} original")
print(f"  - {len(transformations)} transformations")
print(f"  - {2} unique (different base images)")

Created 33 test images:
  - 1 original
  - 30 transformations
  - 2 unique (different base images)

# Visualize a sample of transformations
sample_indices = [0, 1, 2, 5, 6, 10, 15, 17, 20, 25, 28, 30]
sample_indices = [i for i in sample_indices if i < len(images)]

n_cols = 4
n_rows = (len(sample_indices) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(14, 3.5 * n_rows))
axes = axes.flatten()

for ax_idx, img_idx in enumerate(sample_indices):
    img = images[img_idx]
    axes[ax_idx].imshow(np.transpose(img, (1, 2, 0)))
    axes[ax_idx].set_title(f"[{img_idx}] {labels[img_idx]}", fontsize=9)
    axes[ax_idx].axis("off")

for i in range(len(sample_indices), len(axes)):
    axes[i].axis("off")

plt.tight_layout()
plt.suptitle("Sample of Torchvision Transformations Applied to Base Image", y=1.02, fontsize=12)
plt.show()

[Figure: Sample of Torchvision Transformations Applied to Base Image]

Running near-duplicate detection

We’ll use both hash-based detection (D4 hashes) and BoVWExtractor to compare their effectiveness on different transformations.

# Method 1: D4 Hash-based detection (rotation/flip invariant at 90° increments)
d4_detector = Duplicates(flags=ImageStats.HASH_DUPLICATES_D4)
d4_results = d4_detector.evaluate(images)

print("=== D4 Hash Results ===")
print("\nNear duplicates detected:")
if d4_results.near:
    for i, (indices, methods) in enumerate(d4_results.near):
        print(f"  Group: {i} - {methods}")
        for idx in indices:
            print(f"    {idx:<2} - {labels[idx]}")
        print()
else:
    print("  None found")

=== D4 Hash Results ===

Near duplicates detected:
  Group: 0 - ['dhash_d4', 'phash_d4']
    0  - Original (Base 1)
    3  - Rotation 90°
    4  - Rotation 180°
    5  - Horizontal Flip
    6  - Vertical Flip
    12 - Brightness -30%
    14 - Saturation +50%
    18 - Color Invert
    19 - Gaussian Blur (mild)
    20 - Gaussian Blur (strong)

  Group: 1 - ['dhash_d4']
    17 - Grayscale
    25 - Resize Down+Up (severe)
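One plausible reason the hash-based run groups a darkened or blurred copy with the original is that difference hashes encode only whether each cell is brighter than its neighbor, not absolute pixel values. The sketch below is a generic toy difference hash written for this illustration (toy_dhash and hamming are hypothetical names); it is not DataEval's dhash_d4 implementation.

```python
import numpy as np


def toy_dhash(gray: np.ndarray) -> np.ndarray:
    """Toy 64-bit difference hash of a 2-D array.

    Assumes the height is a multiple of 8 and the width a multiple of 9.
    """
    h, w = gray.shape
    # Shrink to 8 rows x 9 columns by averaging equal-sized blocks
    small = gray.astype(np.float64).reshape(8, h // 8, 9, w // 9).mean(axis=(1, 3))
    # One bit per horizontally adjacent pair: is the right cell brighter than the left?
    return (small[:, 1:] > small[:, :-1]).flatten()


def hamming(a: np.ndarray, b: np.ndarray) -> int:
    """Number of differing bits between two hashes."""
    return int(np.count_nonzero(a != b))


rng = np.random.default_rng(0)
img = rng.uniform(0, 255, size=(72, 72))
other = rng.uniform(0, 255, size=(72, 72))

print("darkened copy:  ", hamming(toy_dhash(img), toy_dhash(img * 0.7)))
print("unrelated image:", hamming(toy_dhash(img), toy_dhash(other)))
```

Scaling every pixel by the same factor leaves the brighter-than relationships between neighboring cells intact, so the darkened copy should hash the same as the original, while an unrelated image differs in many bits.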
# Method 2: BoVW-based detection (rotation invariant at any angle)
# Use a smaller vocab_size for this small dataset (33 images).
# Large vocabularies create sparse histograms that cluster poorly.
bovw_extractor = BoVWExtractor(vocab_size=32)
cluster_sensitivity = 1.75

bovw_detector = Duplicates(
    flags=ImageStats.NONE,  # Skip hash computation, use only clustering
    extractor=bovw_extractor,
    batch_size=64,
    cluster_sensitivity=cluster_sensitivity,
)
bovw_results = bovw_detector.evaluate(images)

print("=== BoVW Results ===")
print("\nNear duplicates detected:")
if bovw_results.near:
    for i, (indices, methods) in enumerate(bovw_results.near):
        print(f"  Group: {i} - {methods}")
        for idx in indices:
            print(f"    {idx:<2} - {labels[idx]}")
        print()
else:
    print("  None found")

=== BoVW Results ===

Near duplicates detected:
  Group: 0 - ['cluster']
    0  - Original (Base 1)
    1  - Rotation 15°
    2  - Rotation 45°
    3  - Rotation 90°
    4  - Rotation 180°
    7  - Affine (rotate+translate)
    9  - Perspective (mild)
    10 - Perspective (strong)
    11 - Brightness +30%
    13 - Contrast +50%
    14 - Saturation +50%
    17 - Grayscale
    18 - Color Invert
    19 - Gaussian Blur (mild)
    24 - Resize Down+Up
    30 - Augment: Full Pipeline

  Group: 1 - ['cluster']
    5  - Horizontal Flip
    6  - Vertical Flip
    28 - Augment: Flip+Rotate
    29 - Augment: Flip+Color

  Group: 2 - ['cluster']
    20 - Gaussian Blur (strong)
    25 - Resize Down+Up (severe)
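The code comment about large vocabularies producing sparse histograms can be illustrated with a small simulation. This sketch makes two loud assumptions that do not come from DataEval: each image yields 300 descriptors, and each descriptor lands on a uniformly random visual word. Real SIFT descriptors are not uniformly distributed, so treat the numbers as an order-of-magnitude illustration only.

```python
import numpy as np


def empty_bin_fraction(n_descriptors: int, vocab_size: int, seed: int = 0) -> float:
    """Fraction of histogram bins left at zero for one simulated image."""
    rng = np.random.default_rng(seed)
    # Assumption: every descriptor is assigned to a uniformly random visual word
    words = rng.integers(0, vocab_size, size=n_descriptors)
    hist = np.bincount(words, minlength=vocab_size)
    return float(np.mean(hist == 0))


for vocab_size in (32, 2048):
    fraction = empty_bin_fraction(n_descriptors=300, vocab_size=vocab_size)
    print(f"vocab_size={vocab_size:<5} empty bins: {fraction:.0%}")
```

With 300 descriptors spread over 2048 words, at least 85% of the bins must be empty, so two images of the same scene can end up with few bins in common. A 32-word vocabulary keeps nearly every bin populated.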
# Method 3: Combined detection (both hashes and BoVW)
combined_detector = Duplicates(
    flags=ImageStats.HASH_DUPLICATES_D4,
    extractor=bovw_extractor,
    cluster_sensitivity=cluster_sensitivity,
)
combined_results = combined_detector.evaluate(images)

print("=== Combined (D4 Hash + BoVW) Results ===")
print("\nNear duplicates detected:")
if combined_results.near:
    for i, (indices, methods) in enumerate(combined_results.near):
        print(f"  Group: {i} - {methods}")
        for idx in indices:
            print(f"    {idx:<2} - {labels[idx]}")
        print()
else:
    print("  None found")

=== Combined (D4 Hash + BoVW) Results ===

Near duplicates detected:
  Group: 0 - ['cluster', 'dhash_d4', 'phash_d4']
    0  - Original (Base 1)
    1  - Rotation 15°
    2  - Rotation 45°
    3  - Rotation 90°
    4  - Rotation 180°
    5  - Horizontal Flip
    6  - Vertical Flip
    7  - Affine (rotate+translate)
    9  - Perspective (mild)
    10 - Perspective (strong)
    11 - Brightness +30%
    12 - Brightness -30%
    13 - Contrast +50%
    14 - Saturation +50%
    17 - Grayscale
    18 - Color Invert
    19 - Gaussian Blur (mild)
    20 - Gaussian Blur (strong)
    24 - Resize Down+Up
    25 - Resize Down+Up (severe)
    28 - Augment: Flip+Rotate
    29 - Augment: Flip+Color
    30 - Augment: Full Pipeline

Analyzing detection results by transformation type

Let’s analyze which transformations were detected as near-duplicates.

def get_detected_indices(results: DuplicatesOutput):
    """Extract all indices detected as duplicates of index 0 (original)."""
    detected = set()
    if results.near:
        for indices, _ in results.near:
            if 0 in indices:  # Group contains the original
                detected.update(indices)
    detected.discard(0)  # Remove the original itself
    return detected


d4_detected = get_detected_indices(d4_results)
bovw_detected = get_detected_indices(bovw_results)
combined_detected = get_detected_indices(combined_results)

print("Detection Summary:")
print(f"  D4 Hashes detected: {len(d4_detected)} transformations")
print(f"  BoVW detected: {len(bovw_detected)} transformations")
print(f"  Combined detected: {len(combined_detected)} transformations")

Detection Summary:
  D4 Hashes detected: 9 transformations
  BoVW detected: 15 transformations
  Combined detected: 22 transformations

# Create a detailed comparison table
print("\nDetailed Detection Results:")
print("=" * 70)
print(f"{'Transformation':<35} {'D4 Hash':<10} {'BoVW':<10} {'Combined':<10}")
print("=" * 70)

# Skip index 0 (original) and last 2 (unique images)
for i in range(1, len(images) - 2):
    d4_status = "Yes" if i in d4_detected else "No"
    bovw_status = "Yes" if i in bovw_detected else "No"
    combined_status = "Yes" if i in combined_detected else "No"
    print(f"{labels[i]:<35} {d4_status:<10} {bovw_status:<10} {combined_status:<10}")

print("=" * 70)

# Check unique images (should NOT be detected)
print("\nUnique Image Verification (should NOT be detected):")
for i in range(len(images) - 2, len(images)):
    d4_status = "DETECTED" if i in d4_detected else "OK"
    bovw_status = "DETECTED" if i in bovw_detected else "OK"
    combined_status = "DETECTED" if i in combined_detected else "OK"
    print(f"  {labels[i]}: D4={d4_status}, BoVW={bovw_status}, Combined={combined_status}")

Detailed Detection Results:
======================================================================
Transformation                      D4 Hash    BoVW       Combined  
======================================================================
Rotation 15°                        No         Yes        Yes       
Rotation 45°                        No         Yes        Yes       
Rotation 90°                        Yes        Yes        Yes       
Rotation 180°                       Yes        Yes        Yes       
Horizontal Flip                     Yes        No         Yes       
Vertical Flip                       Yes        No         Yes       
Affine (rotate+translate)           No         Yes        Yes       
Affine (rotate+scale)               No         No         No        
Perspective (mild)                  No         Yes        Yes       
Perspective (strong)                No         Yes        Yes       
Brightness +30%                     No         Yes        Yes       
Brightness -30%                     Yes        No         Yes       
Contrast +50%                       No         Yes        Yes       
Saturation +50%                     Yes        Yes        Yes       
Hue Shift                           No         No         No        
Full ColorJitter                    No         No         No        
Grayscale                           No         Yes        Yes       
Color Invert                        Yes        Yes        Yes       
Gaussian Blur (mild)                Yes        Yes        Yes       
Gaussian Blur (strong)              Yes        No         Yes       
Center Crop (80%)                   No         No         No        
Center Crop (50%)                   No         No         No        
Random Crop (80%)                   No         No         No        
Resize Down+Up                      No         Yes        Yes       
Resize Down+Up (severe)             No         No         Yes       
Random Erasing (10%)                No         No         No        
Random Erasing (33%)                No         No         No        
Augment: Flip+Rotate                No         No         Yes       
Augment: Flip+Color                 No         No         Yes       
Augment: Full Pipeline              No         Yes        Yes       
======================================================================

Unique Image Verification (should NOT be detected):
  Unique: Base 2: D4=OK, BoVW=OK, Combined=OK
  Unique: Base 3: D4=OK, BoVW=OK, Combined=OK

Visualizing detected vs missed transformations

# Categorize results
detected_by_both = bovw_detected & d4_detected
detected_by_bovw_only = bovw_detected - d4_detected
detected_by_d4_only = d4_detected - bovw_detected
missed_by_both = set(range(1, len(images) - 2)) - bovw_detected - d4_detected

print("Categorized Results:")
print(f"\nDetected by BOTH D4 and BoVW ({len(detected_by_both)}):")
for i in sorted(detected_by_both):
    print(f"  [{i}] {labels[i]}")

print(f"\nDetected by BoVW ONLY ({len(detected_by_bovw_only)}):")
for i in sorted(detected_by_bovw_only):
    print(f"  [{i}] {labels[i]}")

print(f"\nDetected by D4 ONLY ({len(detected_by_d4_only)}):")
for i in sorted(detected_by_d4_only):
    print(f"  [{i}] {labels[i]}")

print(f"\nMissed by BOTH ({len(missed_by_both)}):")
for i in sorted(missed_by_both):
    print(f"  [{i}] {labels[i]}")

Categorized Results:

Detected by BOTH D4 and BoVW (5):
  [3] Rotation 90°
  [4] Rotation 180°
  [14] Saturation +50%
  [18] Color Invert
  [19] Gaussian Blur (mild)

Detected by BoVW ONLY (10):
  [1] Rotation 15°
  [2] Rotation 45°
  [7] Affine (rotate+translate)
  [9] Perspective (mild)
  [10] Perspective (strong)
  [11] Brightness +30%
  [13] Contrast +50%
  [17] Grayscale
  [24] Resize Down+Up
  [30] Augment: Full Pipeline

Detected by D4 ONLY (4):
  [5] Horizontal Flip
  [6] Vertical Flip
  [12] Brightness -30%
  [20] Gaussian Blur (strong)

Missed by BOTH (11):
  [8] Affine (rotate+scale)
  [15] Hue Shift
  [16] Full ColorJitter
  [21] Center Crop (80%)
  [22] Center Crop (50%)
  [23] Random Crop (80%)
  [25] Resize Down+Up (severe)
  [26] Random Erasing (10%)
  [27] Random Erasing (33%)
  [28] Augment: Flip+Rotate
  [29] Augment: Flip+Color

# Visualize some of the detected and missed transformations
def visualize_category(indices, title, max_display=6):
    """Visualize images in a category."""
    if not indices:
        print(f"{title}: No images")
        return

    indices = sorted(indices)[:max_display]
    n_cols = min(len(indices), 3)
    n_rows = (len(indices) + n_cols - 1) // n_cols

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(4 * n_cols, 4 * n_rows))
    axes = [axes] if n_rows * n_cols == 1 else axes.flatten()

    for ax_idx, img_idx in enumerate(indices):
        axes[ax_idx].imshow(np.transpose(images[img_idx], (1, 2, 0)))
        axes[ax_idx].set_title(f"[{img_idx}] {labels[img_idx]}", fontsize=9)
        axes[ax_idx].axis("off")

    for i in range(len(indices), len(axes)):
        axes[i].axis("off")

    plt.suptitle(title, fontsize=12)
    plt.tight_layout()
    plt.show()


# Show original for reference
fig, ax = plt.subplots(1, 1, figsize=(4, 4))
ax.imshow(np.transpose(images[0], (1, 2, 0)))
ax.set_title("Original Image (reference)", fontsize=12)
ax.axis("off")
plt.show()

# Show each category
visualize_category(detected_by_both, "Detected by BOTH D4 Hash and BoVW")
visualize_category(detected_by_bovw_only, "Detected by BoVW ONLY (D4 missed these)")
visualize_category(missed_by_both, "MISSED by Both Methods")

[Figures: Original Image (reference); Detected by BOTH D4 Hash and BoVW; Detected by BoVW ONLY (D4 missed these); MISSED by Both Methods]

Adjusting detection sensitivity

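The cluster_sensitivity parameter controls how strict the near-duplicate detection is: lower values require images to be more alike before they are grouped, and higher values are more permissive. Let’s see how different thresholds affect detection.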

# Test different cluster thresholds
thresholds = [0.75, 1.0, 1.5, 2.0, 2.5]
threshold_results = {}

for threshold in thresholds:
    detector = Duplicates(
        flags=ImageStats.NONE,
        extractor=bovw_extractor,
        cluster_sensitivity=threshold,
    )
    results = detector.evaluate(images)
    detected = get_detected_indices(results)
    threshold_results[threshold] = detected
    print(f"Threshold {threshold}: {len(detected)} transformations detected")

Threshold 0.75: 0 transformations detected
Threshold 1.0: 3 transformations detected
Threshold 1.5: 6 transformations detected
Threshold 2.0: 17 transformations detected
Threshold 2.5: 22 transformations detected

# Show how detection changes with threshold
print("\nTransformations detected at each threshold:")
print("=" * 90)
header = f"{'Transformation':<35}"
for t in thresholds:
    header += f" {t:<8}"
print(header)
print("=" * 90)

for i in range(1, len(images) - 2):
    row = f"{labels[i]:<35}"
    for t in thresholds:
        status = "Yes" if i in threshold_results[t] else "-"
        row += f" {status:<8}"
    print(row)

print("=" * 90)

Transformations detected at each threshold:
==========================================================================================
Transformation                      0.75     1.0      1.5      2.0      2.5     
==========================================================================================
Rotation 15°                        -        -        -        Yes      Yes     
Rotation 45°                        -        -        -        Yes      Yes     
Rotation 90°                        -        -        Yes      Yes      Yes     
Rotation 180°                       -        -        Yes      Yes      Yes     
Horizontal Flip                     -        -        -        -        -       
Vertical Flip                       -        -        -        -        -       
Affine (rotate+translate)           -        -        -        Yes      Yes     
Affine (rotate+scale)               -        -        -        Yes      Yes     
Perspective (mild)                  -        -        -        Yes      Yes     
Perspective (strong)                -        -        -        Yes      Yes     
Brightness +30%                     -        -        -        Yes      Yes     
Brightness -30%                     -        -        -        Yes      Yes     
Contrast +50%                       -        -        -        Yes      Yes     
Saturation +50%                     -        -        Yes      Yes      Yes     
Hue Shift                           -        -        -        -        -       
Full ColorJitter                    -        -        -        -        Yes     
Grayscale                           -        Yes      Yes      Yes      Yes     
Color Invert                        -        -        -        Yes      Yes     
Gaussian Blur (mild)                -        Yes      Yes      Yes      Yes     
Gaussian Blur (strong)              -        -        -        -        Yes     
Center Crop (80%)                   -        -        -        -        Yes     
Center Crop (50%)                   -        -        -        -        -       
Random Crop (80%)                   -        -        -        -        Yes     
Resize Down+Up                      -        Yes      Yes      Yes      Yes     
Resize Down+Up (severe)             -        -        -        -        Yes     
Random Erasing (10%)                -        -        -        -        -       
Random Erasing (33%)                -        -        -        -        -       
Augment: Flip+Rotate                -        -        -        -        -       
Augment: Flip+Color                 -        -        -        -        -       
Augment: Full Pipeline              -        -        -        Yes      Yes     
==========================================================================================
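A compact way to read the table is to ask, for each image, what the smallest threshold is at which it first joins the original's group. The helper below is a sketch (first_detected_at is a hypothetical name) that works on any dictionary shaped like threshold_results. The example data copies a few rows from the table above rather than rerunning the detector.

```python
def first_detected_at(threshold_results):
    """Map each image index to the smallest threshold at which it was detected."""
    first = {}
    for threshold in sorted(threshold_results):
        for idx in threshold_results[threshold]:
            # setdefault keeps the first (smallest) threshold seen for this index
            first.setdefault(idx, threshold)
    return first


# A subset of rows from the table above, keyed by threshold:
# 1, 2 = Rotation 15°/45°; 3, 4 = Rotation 90°/180°; 14 = Saturation +50%;
# 17 = Grayscale; 19 = Gaussian Blur (mild); 20 = Gaussian Blur (strong);
# 24 = Resize Down+Up
example_results = {
    0.75: set(),
    1.0: {17, 19, 24},
    1.5: {3, 4, 14, 17, 19, 24},
    2.0: {1, 2, 3, 4, 14, 17, 19, 24},
    2.5: {1, 2, 3, 4, 14, 17, 19, 20, 24},
}

for idx, threshold in sorted(first_detected_at(example_results).items()):
    print(f"index {idx:<2} first detected at {threshold}")
```

Images that never appear, such as the flips at indices 5 and 6, are simply absent from the returned dictionary.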

Key findings and recommendations

Transformations detected as near-duplicates

Transformation Type                 D4 Hash  BoVW     Notes
----------------------------------  -------  -------  ------------------------------------------------------------
Rotation (90° increments)           Yes      Yes      Both methods detected the 90° and 180° rotations
Rotation (arbitrary angles)         No       Yes      SIFT features are designed to be rotation invariant; 15° and 45° were both detected
Horizontal/Vertical Flip            Yes      No       BoVW clusters flips separately from the original; D4 is designed for this
Perspective                         No       Yes      BoVW detects both mild and strong perspective distortion
Affine (rotate+translate)           No       Yes      BoVW handles combined rotation and translation
Brightness / Contrast / Saturation  Partial  Partial  D4 caught Brightness -30% and Saturation +50%; BoVW caught Brightness +30%, Contrast +50%, and Saturation +50%
Grayscale                           No       Yes      SIFT operates on luminance, so grayscale conversion preserves features
Color Inversion                     Yes      Yes      Both methods detect inversion
Gaussian Blur (mild)                Yes      Yes      Both methods tolerate mild blur
Gaussian Blur (strong)              Yes      No       D4 hashes are more resilient to strong blur than SIFT
Resize Down+Up                      No       Yes      BoVW detects mild resolution loss; both miss severe downsampling

Transformations missed by both methods

Transformation Type           Why Missed
----------------------------  ----------------------------------------------------------------------------------
Hue shift / Full ColorJitter  Changes pixel values enough to alter both hashes and SIFT descriptors
All crops (center, random)    Removes too much content; remaining features don’t match the full-image histogram
Severe downsampling           Destroys fine-grained SIFT keypoints and alters hash signatures
Random erasing                Destroys local features in erased regions
Affine (rotate+scale)         Combined scaling with rotation changes SIFT descriptor distributions

Complementary strengths

A key finding is that D4 hashes and BoVW have complementary detection strengths:

  • D4 detects but BoVW misses: Flips, brightness reduction, strong blur

  • BoVW detects but D4 misses: Arbitrary rotations, perspective, affine, grayscale, mild resize, contrast shifts

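The combined method detected 22 out of 30 transformations (73%) by merging groups across both methods. That is three more than the 19 found by taking the union of the two individual runs (9 + 15, minus the 5 detected by both). Resize Down+Up (severe), Augment: Flip+Rotate, and Augment: Flip+Color are pulled into the original's group because each shares a group with an image that the other method links to the original.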
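The count of 22 can be reproduced from the printed groups alone, under the assumption that the combined run merges any groups that share an index. That assumption is consistent with the outputs above but has not been checked against DataEval's source. The sketch uses a small union-find, and merge_overlapping_groups is a hypothetical helper name.

```python
def merge_overlapping_groups(groups):
    """Merge groups that share at least one index, using a small union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # Path halving keeps lookups short
            x = parent[x]
        return x

    for group in groups:
        root = find(group[0])
        for idx in group[1:]:
            parent[find(idx)] = root

    merged = {}
    for idx in parent:
        merged.setdefault(find(idx), set()).add(idx)
    return sorted((sorted(g) for g in merged.values()), key=lambda g: g[0])


# Groups copied from the D4 and BoVW outputs printed earlier
d4_groups = [[0, 3, 4, 5, 6, 12, 14, 18, 19, 20], [17, 25]]
bovw_groups = [
    [0, 1, 2, 3, 4, 7, 9, 10, 11, 13, 14, 17, 18, 19, 24, 30],
    [5, 6, 28, 29],
    [20, 25],
]

merged = merge_overlapping_groups(d4_groups + bovw_groups)
original_group = next(g for g in merged if 0 in g)
print(len(original_group) - 1)  # 22 transformations, excluding the original itself
```

The flips (5, 6) act as the bridge for Flip+Rotate and Flip+Color, while Grayscale (17) and Gaussian Blur (strong) (20) bridge the severe resize.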

Recommendations

  1. Use both methods together for best coverage; they complement each other well

  2. For detecting rotated copies: D4 hashes handle 90° increments and flips; add BoVW for arbitrary angles

  3. For data augmentation validation: Use BoVW with a higher cluster_sensitivity (1.5–2.0) to catch subtle duplicates

  4. For large datasets: Start with fast D4 hashes, then run BoVW on remaining candidates

  5. Adjust cluster_sensitivity: lower values (1.0–1.25) match strictly, higher values (1.5–2.0) match permissively. In this test, 0.75 detected nothing, while 2.5 detected 22 of the 30 transformations with BoVW alone
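The two-stage idea in recommendation 4 mostly comes down to index bookkeeping. The sketch below shows one way to do it and is not a DataEval feature: pick_representatives and map_back are hypothetical helpers, and the groups are assumed to be disjoint lists of indices such as the first element of each (indices, methods) pair in a result's near attribute.

```python
def pick_representatives(n_items, groups):
    """Keep ungrouped items plus the lowest index of each duplicate group."""
    dropped = set()
    for group in groups:
        dropped.update(sorted(group)[1:])
    return [i for i in range(n_items) if i not in dropped]


def map_back(groups, kept):
    """Translate group indices from the reduced list back to original positions."""
    return [[kept[i] for i in group] for group in groups]


# Hypothetical numbers: 8 images, and the fast pass groups {0, 3} and {2, 6}
kept = pick_representatives(8, [[0, 3], [2, 6]])
print(kept)  # [0, 1, 2, 4, 5, 7]

# Suppose the slower pass, run only on the kept images, groups reduced positions 1 and 4
print(map_back([[1, 4]], kept))  # [[1, 5]]
```

In practice the fast pass would be a Duplicates detector with flags=ImageStats.HASH_DUPLICATES_D4, and the slower pass a detector with a BoVWExtractor evaluated on [images[i] for i in kept]. The trade-off is that anything removed in the first stage can no longer act as a bridge between groups in the second.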

What’s next

In addition to exploring the duplicates in a dataset, DataEval offers additional tutorials on exploratory data analysis.

Explore deeper explanations on topics such as duplicates, outliers, and coverage in the Concept pages.

To learn more about setting a global seed in DataEval, see the hardware configuration how-to.

On your own

Once you are familiar with DataEval and data analysis, run this analysis on your own dataset. When you do, make sure that you analyze all of your data and not just the training set.