Detecting common augmentations as duplicates¶

This tutorial demonstrates how DataEval’s duplicate detection methods handle common torchvision augmentations.

Estimated time to complete: 10 minutes

Relevant ML stages: Data Engineering

Relevant personas: Data Engineer, ML Engineer

What you’ll do¶

Create synthetic test images and apply 30 torchvision transformations
Run both D4 hash-based and BoVW embedding-based duplicate detection
Compare which transformations each method catches or misses
Tune detection sensitivity with the cluster_sensitivity parameter

What you’ll learn¶

Which augmentation types are detectable as near-duplicates (and which aren’t)
When to use D4 hashes vs BoVW embeddings for duplicate detection
How D4 and BoVW have complementary strengths that improve coverage when combined

Quick reference: detection methods¶

Method	Best For	Speed	Rotation Invariant
D4 Hashes (phash_d4, dhash_d4)	Detecting rotated/flipped copies	Fast	Only 90° increments
BoVWExtractor	Semantic similarity, different viewpoints	Slower	Any angle
Basic Hashes (phash, dhash)	Same-orientation near-duplicates	Fastest	No

Key insight: D4 hashes only handle the 8 symmetries of a square (0°, 90°, 180°, 270° + flips). BoVW using SIFT features is invariant to any rotation angle, making it better for detecting arbitrarily rotated duplicates.
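To make the eight symmetries concrete, the sketch below enumerates them for a tiny grid and builds an orientation-independent key by taking the minimum of a toy difference hash over all eight orientations. This is a pure-Python illustration under stated assumptions: `d4_variants`, `toy_dhash`, and `d4_key` are hypothetical helpers, and DataEval's `phash_d4` and `dhash_d4` may be computed differently.

```python
Grid = tuple[tuple[int, ...], ...]


def rotate90(grid: Grid) -> Grid:
    """Rotate a grid 90 degrees clockwise."""
    return tuple(zip(*grid[::-1]))


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror a grid left to right."""
    return tuple(row[::-1] for row in grid)


def d4_variants(grid: Grid) -> list[Grid]:
    """Return the 8 orientations of a square: 4 rotations, each with and without a flip."""
    variants = []
    current = grid
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants


def toy_dhash(grid: Grid) -> tuple[int, ...]:
    """Toy difference hash: 1 where a pixel is brighter than its right neighbor."""
    return tuple(int(a > b) for row in grid for a, b in zip(row, row[1:]))


def d4_key(grid: Grid) -> tuple[int, ...]:
    """Orientation-independent key: the smallest toy hash over all 8 orientations."""
    return min(toy_dhash(v) for v in d4_variants(grid))


image = ((1, 2, 3), (4, 5, 6), (7, 8, 9))
rotated = rotate90(image)
mirrored = flip_horizontal(image)

print(len(set(d4_variants(image))))       # number of distinct orientations
print(d4_key(rotated) == d4_key(image))   # a 90° rotation maps to the same key
print(d4_key(mirrored) == d4_key(image))  # so does a flip
```

Because the eight orientations form a closed group, a rotated or flipped copy produces the same set of eight hashes and therefore the same key. A 15° rotation is not in that set, which is why it needs a different kind of descriptor.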

What you’ll need¶

A Python environment with the following packages installed:
- dataeval
- opencv-python or opencv-python-headless
- torchvision
- matplotlib

Introduction¶

Data augmentation is a common technique in deep learning, but augmented images can inadvertently appear in both training and test sets, or be saved as “new” images when they’re really transformations of existing ones. Understanding which augmentations are detectable as near-duplicates helps you:

Identify data leakage - Find augmented copies that leaked between train/test splits
Clean datasets - Remove redundant transformed images
Validate augmentation pipelines - Ensure augmentations create sufficiently distinct images
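As a sketch of the data-leakage use case, suppose the train and test images are concatenated into one list before duplicate detection, so that every index below `n_train` belongs to the training split. The hypothetical helper `cross_split_groups` keeps only the groups that contain images from both splits. The group format (lists of integer indices) mirrors the output printed later in this tutorial; the helper itself is not part of DataEval.

```python
def cross_split_groups(groups: list[list[int]], n_train: int) -> list[list[int]]:
    """Return the duplicate groups that span both the train and test splits.

    Assumes the dataset was built as train images followed by test images,
    so any index < n_train is a training image.
    """
    leaked = []
    for group in groups:
        has_train = any(idx < n_train for idx in group)
        has_test = any(idx >= n_train for idx in group)
        if has_train and has_test:
            leaked.append(sorted(group))
    return leaked


# Hypothetical duplicate groups over 6 train images (0-5) and 4 test images (6-9)
groups = [[0, 3], [2, 7], [8, 9], [1, 4, 6]]
print(cross_split_groups(groups, n_train=6))
```

Groups that stay within one split indicate redundancy; groups that cross the boundary indicate possible leakage.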

Getting started¶

Let’s import the required libraries.

from numbers import Number
from typing import cast

import matplotlib.pyplot as plt
import numpy as np
import torch
import torchvision.transforms.v2 as T
from PIL import Image

from dataeval import config
from dataeval.extractors import BoVWExtractor
from dataeval.flags import ImageStats
from dataeval.quality import Duplicates, DuplicatesOutput

config.set_batch_size(64)
config.set_max_processes(4)
config.set_seed(42)

Creating test data¶

We’ll create a synthetic image with rich texture patterns that SIFT can detect. Then we’ll apply various torchvision transformations to test detection capabilities.

def create_textured_image(seed: int, size: int) -> np.ndarray:
    """Create an image with texture patterns that SIFT can detect.

    Returns image in CHW format (3, H, W) with uint8 values.
    """
    rng = np.random.default_rng(seed)

    # Use the seed to generate random frequencies and phases
    # so each seed produces a genuinely different pattern
    freqs = rng.uniform(1.0, 5.0, size=6)
    phases = rng.uniform(0, 2 * np.pi, size=6)
    channel_offsets = rng.integers(5, 30, size=4)

    x = np.linspace(0, 6 * np.pi, size)
    y = np.linspace(0, 6 * np.pi, size)
    xx, yy = np.meshgrid(x, y)

    # Create pattern with seed-dependent frequency components
    pattern = (
        np.sin(xx * freqs[0] + phases[0]) * np.cos(yy * freqs[1] + phases[1])
        + np.sin(xx * freqs[2] + phases[2]) * np.cos(yy * freqs[3] + phases[3]) * 0.5
        + np.sin(xx * freqs[4] + yy * freqs[5] + phases[4]) * 0.3
        + rng.random((size, size)) * 0.2
    )

    # Normalize to 0-255
    pattern = ((pattern - pattern.min()) / (pattern.max() - pattern.min()) * 255).astype(np.uint8)

    # Create RGB image with seed-dependent channel variations
    img = np.stack(
        [
            pattern,
            np.roll(pattern, int(channel_offsets[0]), axis=0),
            np.roll(pattern, int(channel_offsets[1]), axis=1),
        ],
        axis=0,
    )  # Shape: (3, H, W)

    return img.astype(np.uint8)


def numpy_to_pil(img: np.ndarray) -> Image.Image:
    """Convert CHW numpy array to PIL Image."""
    return Image.fromarray(np.transpose(img, (1, 2, 0)))


def pil_to_numpy(img: Image.Image) -> np.ndarray:
    """Convert PIL Image to CHW numpy array."""
    return np.transpose(np.array(img), (2, 0, 1))


def tensor_to_numpy(tensor: torch.Tensor) -> np.ndarray:
    """Convert torch tensor (CHW, float 0-1 or uint8) to CHW numpy uint8."""
    if tensor.is_floating_point():
        # Scale any float dtype (not only float32) and clamp to avoid uint8 wraparound
        tensor = (tensor * 255).clamp(0, 255).to(torch.uint8)
    return tensor.numpy()

IMG_SIZE = 224

# Create base images
base_img1 = create_textured_image(seed=67, size=IMG_SIZE)
base_img2 = create_textured_image(seed=123, size=IMG_SIZE)
base_img3 = create_textured_image(seed=789, size=IMG_SIZE)

# Display base images
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for i, (img, title) in enumerate(
    [
        (base_img1, "Base Image 1 (seed=67)"),
        (base_img2, "Base Image 2 (seed=123)"),
        (base_img3, "Base Image 3 (seed=789)"),
    ]
):
    axes[i].imshow(np.transpose(img, (1, 2, 0)))
    axes[i].set_title(title)
    axes[i].axis("off")
plt.tight_layout()
plt.show()

[Figure: the three base images (seeds 67, 123, and 789)]

Defining Torchvision transformations¶

We’ll test a comprehensive set of common torchvision transformations, organized by category:

Category	Transformations	Expected Detection
Geometric	Rotation, Flip, Affine, Perspective	High (SIFT is rotation- and scale-invariant, but not mirror-invariant)
Color	ColorJitter, Grayscale, Invert	Medium (depends on intensity)
Blur/Noise	GaussianBlur, Noise	Medium to Low
Crop/Resize	RandomCrop, Resize, CenterCrop	Medium (depends on overlap)
Severe	RandomErasing, Heavy distortion	Low (features destroyed)

Important setup notes:

We use expand=True with a resize-back step for rotation transforms so that the full rotated content is preserved rather than clipped. The corners exposed by rotations that are not multiples of 90° are filled with the fill value described next.
We use fill=128 (gray) instead of the default fill=0 (black) where fill is unavoidable. Black fill creates strong artificial edges that SIFT detects, corrupting the BoVW histogram.
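The effect of expand=True followed by a resize can be estimated with a little trigonometry. The sketch below computes the side of the bounding box around a rotated square, the shrink factor applied when that canvas is resized back, and the share of the canvas that ends up as fill. `expanded_side` is a hypothetical helper; torchvision's exact canvas size may differ by a pixel because of rounding.

```python
import math


def expanded_side(size: int, degrees: float) -> float:
    """Side length of the bounding box around a size x size square rotated by `degrees`."""
    theta = math.radians(degrees)
    return size * (abs(math.cos(theta)) + abs(math.sin(theta)))


IMG_SIZE = 224
for degrees in (15, 45, 90):
    side = expanded_side(IMG_SIZE, degrees)
    scale = IMG_SIZE / side                     # shrink factor from the resize-back step
    fill_fraction = 1 - (IMG_SIZE / side) ** 2  # share of the canvas that is fill
    print(f"{degrees:>3}°: canvas ≈ {side:.0f}px, content scale ≈ {scale:.2f}, fill ≈ {fill_fraction:.0%}")
```

At 45° roughly half of the expanded canvas is fill, which is why the fill color matters for any detector that responds to edges.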

FILL = 128  # Gray fill avoids artificial SIFT edges that black (0) would create


def _n(degrees: int) -> Number:
    """Helper to cast degrees to Number for Pylance."""
    return cast(Number, degrees)


# Helper: rotate with expand=True to preserve full content, then resize back
def _rotate_and_resize(degrees):
    return T.Compose([T.RandomRotation(degrees=(degrees, degrees), expand=True, fill=FILL), T.Resize(IMG_SIZE)])


# Define transformation categories
transformations = {
    # Geometric transformations - SIFT should handle these well
    "Rotation 15°": _rotate_and_resize(15),
    "Rotation 45°": _rotate_and_resize(45),
    "Rotation 90°": _rotate_and_resize(90),
    "Rotation 180°": _rotate_and_resize(180),
    "Horizontal Flip": T.RandomHorizontalFlip(p=1.0),
    "Vertical Flip": T.RandomVerticalFlip(p=1.0),
    "Affine (rotate+translate)": T.RandomAffine(degrees=_n(30), translate=(0.1, 0.1), fill=FILL),
    "Affine (rotate+scale)": T.RandomAffine(degrees=_n(15), scale=(0.8, 1.2), fill=FILL),
    "Perspective (mild)": T.RandomPerspective(distortion_scale=0.2, p=1.0, fill=FILL),
    "Perspective (strong)": T.RandomPerspective(distortion_scale=0.5, p=1.0, fill=FILL),
    # Color transformations - may or may not be detected
    "Brightness +30%": T.ColorJitter(brightness=(1.3, 1.3)),
    "Brightness -30%": T.ColorJitter(brightness=(0.7, 0.7)),
    "Contrast +50%": T.ColorJitter(contrast=(1.5, 1.5)),
    "Saturation +50%": T.ColorJitter(saturation=(1.5, 1.5)),
    "Hue Shift": T.ColorJitter(hue=(0.3, 0.3)),
    "Full ColorJitter": T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.1),
    "Grayscale": T.Grayscale(num_output_channels=3),
    "Color Invert": T.RandomInvert(p=1.0),
    # Blur and noise
    "Gaussian Blur (mild)": T.GaussianBlur(kernel_size=5, sigma=(1.0, 1.0)),
    "Gaussian Blur (strong)": T.GaussianBlur(kernel_size=11, sigma=(3.0, 3.0)),
    # Crop and resize
    "Center Crop (80%)": T.Compose([T.CenterCrop(180), T.Resize(IMG_SIZE)]),
    "Center Crop (50%)": T.Compose([T.CenterCrop(112), T.Resize(IMG_SIZE)]),
    "Random Crop (80%)": T.Compose([T.RandomCrop(180), T.Resize(IMG_SIZE)]),
    "Resize Down+Up": T.Compose([T.Resize(112), T.Resize(IMG_SIZE)]),
    "Resize Down+Up (severe)": T.Compose([T.Resize(56), T.Resize(IMG_SIZE)]),
    # Severe transformations - likely to break detection
    "Random Erasing (10%)": T.RandomErasing(p=1.0, scale=(0.02, 0.1)),
    "Random Erasing (33%)": T.RandomErasing(p=1.0, scale=(0.2, 0.33)),
    # Combinations (common augmentation pipelines)
    "Augment: Flip+Rotate": T.Compose(
        [
            T.RandomHorizontalFlip(p=1.0),
            T.RandomRotation(degrees=_n(15), expand=True, fill=FILL),
            T.Resize(IMG_SIZE),
        ]
    ),
    "Augment: Flip+Color": T.Compose(
        [
            T.RandomHorizontalFlip(p=1.0),
            T.ColorJitter(brightness=0.2, contrast=0.2),
        ]
    ),
    "Augment: Full Pipeline": T.Compose(
        [
            T.RandomHorizontalFlip(p=0.5),
            T.RandomRotation(degrees=_n(10), expand=True, fill=FILL),
            T.Resize(IMG_SIZE),
            T.ColorJitter(brightness=0.1, contrast=0.1),
            T.GaussianBlur(kernel_size=3, sigma=(0.5, 0.5)),
        ]
    ),
}

# Apply all transformations to base image 1
images = []
labels = []

# Add the original image first
images.append(base_img1)
labels.append("Original (Base 1)")

# Apply each transformation to base image 1
base_pil = numpy_to_pil(base_img1)

for name, transform in transformations.items():
    torch.manual_seed(42)  # For reproducibility
    transformed = transform(base_pil)
    images.append(pil_to_numpy(transformed))
    labels.append(name)

# Add other base images as "unique" images (should NOT be detected as duplicates)
images.append(base_img2)
labels.append("Unique: Base 2")
images.append(base_img3)
labels.append("Unique: Base 3")

print(f"Created {len(images)} test images:")
print(f"  - {1} original")
print(f"  - {len(transformations)} transformations")
print(f"  - {2} unique (different base images)")

Created 33 test images:
  - 1 original
  - 30 transformations
  - 2 unique (different base images)

# Visualize a sample of transformations
sample_indices = [0, 1, 2, 5, 6, 10, 15, 17, 20, 25, 28, 30]
sample_indices = [i for i in sample_indices if i < len(images)]

n_cols = 4
n_rows = (len(sample_indices) + n_cols - 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(14, 3.5 * n_rows))
axes = axes.flatten()

for ax_idx, img_idx in enumerate(sample_indices):
    img = images[img_idx]
    axes[ax_idx].imshow(np.transpose(img, (1, 2, 0)))
    axes[ax_idx].set_title(f"[{img_idx}] {labels[img_idx]}", fontsize=9)
    axes[ax_idx].axis("off")

for i in range(len(sample_indices), len(axes)):
    axes[i].axis("off")

plt.tight_layout()
plt.suptitle("Sample of Torchvision Transformations Applied to Base Image", y=1.02, fontsize=12)
plt.show()

[Figure: Sample of Torchvision Transformations Applied to Base Image]

Running near-duplicate detection¶

We’ll use both hash-based detection (D4 hashes) and BoVWExtractor to compare their effectiveness on different transformations.

# Method 1: D4 Hash-based detection (rotation/flip invariant at 90° increments)
d4_detector = Duplicates(flags=ImageStats.HASH_DUPLICATES_D4)
d4_results = d4_detector.evaluate(images)

print("=== D4 Hash Results ===")
print("\nNear duplicates detected:")
if d4_results.near:
    for i, (indices, methods) in enumerate(d4_results.near):
        print(f"  Group: {i} - {methods}")
        for idx in indices:
            print(f"    {idx:<2} - {labels[idx]}")
        print()
else:
    print("  None found")

=== D4 Hash Results ===

Near duplicates detected:
  Group: 0 - ['dhash_d4', 'phash_d4']
    0  - Original (Base 1)
    3  - Rotation 90°
    4  - Rotation 180°
    5  - Horizontal Flip
    6  - Vertical Flip
    12 - Brightness -30%
    14 - Saturation +50%
    18 - Color Invert
    19 - Gaussian Blur (mild)
    20 - Gaussian Blur (strong)

  Group: 1 - ['dhash_d4']
    17 - Grayscale
    25 - Resize Down+Up (severe)

# Method 2: BoVW-based detection (rotation invariant at any angle)
# Use a smaller vocab_size for this small dataset (33 images).
# Large vocabularies create sparse histograms that cluster poorly.
bovw_extractor = BoVWExtractor(vocab_size=32)
cluster_sensitivity = 1.75

bovw_detector = Duplicates(
    flags=ImageStats.NONE,  # Skip hash computation, use only clustering
    extractor=bovw_extractor,
    batch_size=64,
    cluster_sensitivity=cluster_sensitivity,
)
bovw_results = bovw_detector.evaluate(images)

print("=== BoVW Results ===")
print("\nNear duplicates detected:")
if bovw_results.near:
    for i, (indices, methods) in enumerate(bovw_results.near):
        print(f"  Group: {i} - {methods}")
        for idx in indices:
            print(f"    {idx:<2} - {labels[idx]}")
        print()
else:
    print("  None found")

=== BoVW Results ===

Near duplicates detected:
  Group: 0 - ['cluster']
    0  - Original (Base 1)
    1  - Rotation 15°
    2  - Rotation 45°
    3  - Rotation 90°
    4  - Rotation 180°
    7  - Affine (rotate+translate)
    9  - Perspective (mild)
    10 - Perspective (strong)
    11 - Brightness +30%
    13 - Contrast +50%
    14 - Saturation +50%
    17 - Grayscale
    18 - Color Invert
    19 - Gaussian Blur (mild)
    24 - Resize Down+Up
    30 - Augment: Full Pipeline

  Group: 1 - ['cluster']
    5  - Horizontal Flip
    6  - Vertical Flip
    28 - Augment: Flip+Rotate
    29 - Augment: Flip+Color

  Group: 2 - ['cluster']
    20 - Gaussian Blur (strong)
    25 - Resize Down+Up (severe)
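For intuition about what the BoVW detector clusters, the sketch below builds a bag-of-visual-words histogram by hand: each descriptor is assigned to its nearest visual word, and the normalized counts become the image's embedding. The 2-D points are toy stand-ins for SIFT descriptors, and `nearest_word` and `bovw_histogram` are hypothetical helpers rather than the internals of BoVWExtractor.

```python
import math


def nearest_word(descriptor: tuple[float, ...], vocabulary: list[tuple[float, ...]]) -> int:
    """Index of the visual word (cluster center) closest to a descriptor."""
    return min(range(len(vocabulary)), key=lambda i: math.dist(descriptor, vocabulary[i]))


def bovw_histogram(descriptors: list[tuple[float, ...]], vocabulary: list[tuple[float, ...]]) -> list[float]:
    """Normalized count of descriptors assigned to each visual word."""
    counts = [0] * len(vocabulary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(vocabulary)


# Toy 2-D "descriptors" standing in for real SIFT descriptors
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (0.0, 0.2), (1.0, 0.9), (0.8, 0.0)]

hist = bovw_histogram(descriptors, vocabulary)
shuffled = bovw_histogram(descriptors[::-1], vocabulary)
print(hist)              # [0.4, 0.4, 0.0, 0.2]
print(hist == shuffled)  # the histogram ignores keypoint order and position
```

The histogram discards where each keypoint sits in the image, so a transformation that leaves the descriptors themselves largely intact also leaves the histogram largely intact. Even with four words, one bin is already empty here; with few descriptors, a much larger vocabulary leaves most bins empty, which is the sparsity the vocab_size comment warns about.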

# Method 3: Combined detection (both hashes and BoVW)
combined_detector = Duplicates(
    flags=ImageStats.HASH_DUPLICATES_D4,
    extractor=bovw_extractor,
    cluster_sensitivity=cluster_sensitivity,
)
combined_results = combined_detector.evaluate(images)

print("=== Combined (D4 Hash + BoVW) Results ===")
print("\nNear duplicates detected:")
if combined_results.near:
    for i, (indices, methods) in enumerate(combined_results.near):
        print(f"  Group: {i} - {methods}")
        for idx in indices:
            print(f"    {idx:<2} - {labels[idx]}")
        print()
else:
    print("  None found")

=== Combined (D4 Hash + BoVW) Results ===

Near duplicates detected:
  Group: 0 - ['cluster', 'dhash_d4', 'phash_d4']
    0  - Original (Base 1)
    1  - Rotation 15°
    2  - Rotation 45°
    3  - Rotation 90°
    4  - Rotation 180°
    5  - Horizontal Flip
    6  - Vertical Flip
    7  - Affine (rotate+translate)
    9  - Perspective (mild)
    10 - Perspective (strong)
    11 - Brightness +30%
    12 - Brightness -30%
    13 - Contrast +50%
    14 - Saturation +50%
    17 - Grayscale
    18 - Color Invert
    19 - Gaussian Blur (mild)
    20 - Gaussian Blur (strong)
    24 - Resize Down+Up
    25 - Resize Down+Up (severe)
    28 - Augment: Flip+Rotate
    29 - Augment: Flip+Color
    30 - Augment: Full Pipeline

Analyzing detection results by transformation type¶

Let’s analyze which transformations were detected as near-duplicates.

def get_detected_indices(results: DuplicatesOutput):
    """Extract all indices detected as duplicates of index 0 (original)."""
    detected = set()
    if results.near:
        for indices, _ in results.near:
            if 0 in indices:  # Group contains the original
                detected.update(indices)
    detected.discard(0)  # Remove the original itself
    return detected


d4_detected = get_detected_indices(d4_results)
bovw_detected = get_detected_indices(bovw_results)
combined_detected = get_detected_indices(combined_results)

print("Detection Summary:")
print(f"  D4 Hashes detected: {len(d4_detected)} transformations")
print(f"  BoVW detected: {len(bovw_detected)} transformations")
print(f"  Combined detected: {len(combined_detected)} transformations")

Detection Summary:
  D4 Hashes detected: 9 transformations
  BoVW detected: 15 transformations
  Combined detected: 22 transformations

# Create a detailed comparison table
print("\nDetailed Detection Results:")
print("=" * 70)
print(f"{'Transformation':<35} {'D4 Hash':<10} {'BoVW':<10} {'Combined':<10}")
print("=" * 70)

# Skip index 0 (original) and last 2 (unique images)
for i in range(1, len(images) - 2):
    d4_status = "Yes" if i in d4_detected else "No"
    bovw_status = "Yes" if i in bovw_detected else "No"
    combined_status = "Yes" if i in combined_detected else "No"
    print(f"{labels[i]:<35} {d4_status:<10} {bovw_status:<10} {combined_status:<10}")

print("=" * 70)

# Check unique images (should NOT be detected)
print("\nUnique Image Verification (should NOT be detected):")
for i in range(len(images) - 2, len(images)):
    d4_status = "DETECTED" if i in d4_detected else "OK"
    bovw_status = "DETECTED" if i in bovw_detected else "OK"
    combined_status = "DETECTED" if i in combined_detected else "OK"
    print(f"  {labels[i]}: D4={d4_status}, BoVW={bovw_status}, Combined={combined_status}")

Detailed Detection Results:
======================================================================
Transformation                      D4 Hash    BoVW       Combined  
======================================================================
Rotation 15°                        No         Yes        Yes       
Rotation 45°                        No         Yes        Yes       
Rotation 90°                        Yes        Yes        Yes       
Rotation 180°                       Yes        Yes        Yes       
Horizontal Flip                     Yes        No         Yes       
Vertical Flip                       Yes        No         Yes       
Affine (rotate+translate)           No         Yes        Yes       
Affine (rotate+scale)               No         No         No        
Perspective (mild)                  No         Yes        Yes       
Perspective (strong)                No         Yes        Yes       
Brightness +30%                     No         Yes        Yes       
Brightness -30%                     Yes        No         Yes       
Contrast +50%                       No         Yes        Yes       
Saturation +50%                     Yes        Yes        Yes       
Hue Shift                           No         No         No        
Full ColorJitter                    No         No         No        
Grayscale                           No         Yes        Yes       
Color Invert                        Yes        Yes        Yes       
Gaussian Blur (mild)                Yes        Yes        Yes       
Gaussian Blur (strong)              Yes        No         Yes       
Center Crop (80%)                   No         No         No        
Center Crop (50%)                   No         No         No        
Random Crop (80%)                   No         No         No        
Resize Down+Up                      No         Yes        Yes       
Resize Down+Up (severe)             No         No         Yes       
Random Erasing (10%)                No         No         No        
Random Erasing (33%)                No         No         No        
Augment: Flip+Rotate                No         No         Yes       
Augment: Flip+Color                 No         No         Yes       
Augment: Full Pipeline              No         Yes        Yes       
======================================================================

Unique Image Verification (should NOT be detected):
  Unique: Base 2: D4=OK, BoVW=OK, Combined=OK
  Unique: Base 3: D4=OK, BoVW=OK, Combined=OK

Visualizing detected vs missed transformations¶

# Categorize results
detected_by_both = bovw_detected & d4_detected
detected_by_bovw_only = bovw_detected - d4_detected
detected_by_d4_only = d4_detected - bovw_detected
missed_by_both = set(range(1, len(images) - 2)) - bovw_detected - d4_detected

print("Categorized Results:")
print(f"\nDetected by BOTH D4 and BoVW ({len(detected_by_both)}):")
for i in sorted(detected_by_both):
    print(f"  [{i}] {labels[i]}")

print(f"\nDetected by BoVW ONLY ({len(detected_by_bovw_only)}):")
for i in sorted(detected_by_bovw_only):
    print(f"  [{i}] {labels[i]}")

print(f"\nDetected by D4 ONLY ({len(detected_by_d4_only)}):")
for i in sorted(detected_by_d4_only):
    print(f"  [{i}] {labels[i]}")

print(f"\nMissed by BOTH ({len(missed_by_both)}):")
for i in sorted(missed_by_both):
    print(f"  [{i}] {labels[i]}")

Categorized Results:

Detected by BOTH D4 and BoVW (5):
  [3] Rotation 90°
  [4] Rotation 180°
  [14] Saturation +50%
  [18] Color Invert
  [19] Gaussian Blur (mild)

Detected by BoVW ONLY (10):
  [1] Rotation 15°
  [2] Rotation 45°
  [7] Affine (rotate+translate)
  [9] Perspective (mild)
  [10] Perspective (strong)
  [11] Brightness +30%
  [13] Contrast +50%
  [17] Grayscale
  [24] Resize Down+Up
  [30] Augment: Full Pipeline

Detected by D4 ONLY (4):
  [5] Horizontal Flip
  [6] Vertical Flip
  [12] Brightness -30%
  [20] Gaussian Blur (strong)

Missed by BOTH (11):
  [8] Affine (rotate+scale)
  [15] Hue Shift
  [16] Full ColorJitter
  [21] Center Crop (80%)
  [22] Center Crop (50%)
  [23] Random Crop (80%)
  [25] Resize Down+Up (severe)
  [26] Random Erasing (10%)
  [27] Random Erasing (33%)
  [28] Augment: Flip+Rotate
  [29] Augment: Flip+Color

# Visualize some of the detected and missed transformations
def visualize_category(indices, title, max_display=6):
    """Visualize images in a category."""
    if not indices:
        print(f"{title}: No images")
        return

    indices = sorted(indices)[:max_display]
    n_cols = min(len(indices), 3)
    n_rows = (len(indices) + n_cols - 1) // n_cols

    fig, axes = plt.subplots(n_rows, n_cols, figsize=(4 * n_cols, 4 * n_rows))
    axes = [axes] if n_rows * n_cols == 1 else axes.flatten()

    for ax_idx, img_idx in enumerate(indices):
        axes[ax_idx].imshow(np.transpose(images[img_idx], (1, 2, 0)))
        axes[ax_idx].set_title(f"[{img_idx}] {labels[img_idx]}", fontsize=9)
        axes[ax_idx].axis("off")

    for i in range(len(indices), len(axes)):
        axes[i].axis("off")

    plt.suptitle(title, fontsize=12)
    plt.tight_layout()
    plt.show()


# Show original for reference
fig, ax = plt.subplots(1, 1, figsize=(4, 4))
ax.imshow(np.transpose(images[0], (1, 2, 0)))
ax.set_title("Original Image (reference)", fontsize=12)
ax.axis("off")
plt.show()

# Show each category
visualize_category(detected_by_both, "Detected by BOTH D4 Hash and BoVW")
visualize_category(detected_by_bovw_only, "Detected by BoVW ONLY (D4 missed these)")
visualize_category(missed_by_both, "MISSED by Both Methods")

[Figure: Original Image (reference)]

[Figure: Detected by BOTH D4 Hash and BoVW]

[Figure: Detected by BoVW ONLY (D4 missed these)]

[Figure: MISSED by Both Methods]

Adjusting detection sensitivity¶

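The cluster_sensitivity parameter controls how permissive near-duplicate clustering is: in the runs below, higher values group more images with the original, while lower values require closer matches. Let’s see how different values affect detection.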

# Test different cluster thresholds
thresholds = [0.75, 1.0, 1.5, 2.0, 2.5]
threshold_results = {}

for threshold in thresholds:
    detector = Duplicates(
        flags=ImageStats.NONE,
        extractor=bovw_extractor,
        cluster_sensitivity=threshold,
    )
    results = detector.evaluate(images)
    detected = get_detected_indices(results)
    threshold_results[threshold] = detected
    print(f"Threshold {threshold}: {len(detected)} transformations detected")

Threshold 0.75: 0 transformations detected
Threshold 1.0: 3 transformations detected
Threshold 1.5: 6 transformations detected
Threshold 2.0: 17 transformations detected
Threshold 2.5: 22 transformations detected

# Show how detection changes with threshold
print("\nTransformations detected at each threshold:")
print("=" * 90)
header = f"{'Transformation':<35}"
for t in thresholds:
    header += f" {t:<8}"
print(header)
print("=" * 90)

for i in range(1, len(images) - 2):
    row = f"{labels[i]:<35}"
    for t in thresholds:
        status = "Yes" if i in threshold_results[t] else "-"
        row += f" {status:<8}"
    print(row)

print("=" * 90)

Transformations detected at each threshold:
==========================================================================================
Transformation                      0.75     1.0      1.5      2.0      2.5     
==========================================================================================
Rotation 15°                        -        -        -        Yes      Yes     
Rotation 45°                        -        -        -        Yes      Yes     
Rotation 90°                        -        -        Yes      Yes      Yes     
Rotation 180°                       -        -        Yes      Yes      Yes     
Horizontal Flip                     -        -        -        -        -       
Vertical Flip                       -        -        -        -        -       
Affine (rotate+translate)           -        -        -        Yes      Yes     
Affine (rotate+scale)               -        -        -        Yes      Yes     
Perspective (mild)                  -        -        -        Yes      Yes     
Perspective (strong)                -        -        -        Yes      Yes     
Brightness +30%                     -        -        -        Yes      Yes     
Brightness -30%                     -        -        -        Yes      Yes     
Contrast +50%                       -        -        -        Yes      Yes     
Saturation +50%                     -        -        Yes      Yes      Yes     
Hue Shift                           -        -        -        -        -       
Full ColorJitter                    -        -        -        -        Yes     
Grayscale                           -        Yes      Yes      Yes      Yes     
Color Invert                        -        -        -        Yes      Yes     
Gaussian Blur (mild)                -        Yes      Yes      Yes      Yes     
Gaussian Blur (strong)              -        -        -        -        Yes     
Center Crop (80%)                   -        -        -        -        Yes     
Center Crop (50%)                   -        -        -        -        -       
Random Crop (80%)                   -        -        -        -        Yes     
Resize Down+Up                      -        Yes      Yes      Yes      Yes     
Resize Down+Up (severe)             -        -        -        -        Yes     
Random Erasing (10%)                -        -        -        -        -       
Random Erasing (33%)                -        -        -        -        -       
Augment: Flip+Rotate                -        -        -        -        -       
Augment: Flip+Color                 -        -        -        -        -       
Augment: Full Pipeline              -        -        -        Yes      Yes     
==========================================================================================
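One way to read the table is to ask for the lowest tested sensitivity at which each transformation is first detected. The sketch below does that with a hypothetical helper, `first_detected_at`, applied to a five-row subset of the table; in the notebook the same function could be called on `threshold_results` directly.

```python
from typing import Optional


def first_detected_at(results_by_sensitivity: dict[float, set[int]], index: int) -> Optional[float]:
    """Lowest tested sensitivity whose detections include `index`, or None if never detected."""
    for sensitivity in sorted(results_by_sensitivity):
        if index in results_by_sensitivity[sensitivity]:
            return sensitivity
    return None


# Five rows copied from the table above: image index -> label
labels_subset = {
    17: "Grayscale",
    3: "Rotation 90°",
    1: "Rotation 15°",
    20: "Gaussian Blur (strong)",
    5: "Horizontal Flip",
}

# Detections for those five images only, per tested sensitivity
results_subset = {
    0.75: set(),
    1.0: {17},
    1.5: {17, 3},
    2.0: {17, 3, 1},
    2.5: {17, 3, 1, 20},
}

for index, label in labels_subset.items():
    print(f"{label:<25} {first_detected_at(results_subset, index)}")
```

A result of None, as for the horizontal flip, means that no tested sensitivity placed the image in the original's cluster, so a different method is needed for that transformation.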

Key findings and recommendations¶

Transformations detected as near-duplicates¶

Transformation Type	D4 Hash	BoVW	Notes
Rotation (90° increments)	Yes	Yes	Both methods detect 90° and 180° reliably
Rotation (arbitrary angles)	No	Yes	BoVW’s SIFT features are rotation-invariant at any angle
Horizontal/Vertical Flip	Yes	No	BoVW clusters flips separately from the original; D4 is designed for this
Perspective	No	Yes	BoVW detects both mild and strong perspective distortion
Affine (rotate+translate)	No	Yes	BoVW handles combined rotation and translation
Brightness / Contrast / Saturation	Partial	Partial	D4 detects Brightness -30% and Saturation +50%; BoVW detects Brightness +30%, Contrast +50%, and Saturation +50%
Grayscale	No	Yes	SIFT operates on luminance, so grayscale conversion preserves features
Color Inversion	Yes	Yes	Both methods detect inversion
Gaussian Blur (mild)	Yes	Yes	Both methods tolerate mild blur
Gaussian Blur (strong)	Yes	No	D4 hashes are more resilient to strong blur than SIFT
Resize Down+Up	No	Yes	BoVW detects mild resolution loss; both miss severe downsampling
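One plausible explanation for why a hash can tolerate darkening but not brightening is clipping. The sketch below uses a generic difference hash on a single row of pixels: scaling by 0.7 preserves the order of neighboring pixels, while scaling by 1.3 clips bright pixels to 255 and erases some of the differences. This is an illustration only; `dhash_bits` is a hypothetical, simplified hash, and whether this effect accounts for the D4 results above is an assumption.

```python
def dhash_bits(row: list[int]) -> list[int]:
    """Generic difference hash on one row: 1 where a pixel is brighter than its right neighbor."""
    return [int(a > b) for a, b in zip(row, row[1:])]


def scale_brightness(row: list[int], factor: float) -> list[int]:
    """Multiply pixel values by `factor` and clip to the uint8 range."""
    return [min(255, round(v * factor)) for v in row]


def hamming(a: list[int], b: list[int]) -> int:
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


row = [40, 250, 230, 120, 200, 90]
original = dhash_bits(row)
darker = dhash_bits(scale_brightness(row, 0.7))    # no clipping: neighbor order is preserved
brighter = dhash_bits(scale_brightness(row, 1.3))  # 250 and 230 both clip to 255

print(hamming(original, darker))    # 0
print(hamming(original, brighter))  # 1
```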

Transformations missed by both methods¶

Transformation Type	Why Missed
Hue shift / Full ColorJitter	Changes pixel values enough to alter both hashes and SIFT descriptors
All crops (center, random)	Removes too much content; remaining features don’t match the full-image histogram
Severe downsampling	Destroys fine-grained SIFT keypoints and alters hash signatures
Random erasing	Destroys local features in erased regions
Affine (rotate+scale)	Combined scaling with rotation changes SIFT descriptor distributions
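The crop labels refer to side length, so the share of pixels that survives is smaller than the label suggests. A quick calculation, using the hypothetical helper `crop_stats` and the crop sizes from the transformation definitions, shows how much content remains and how much the remainder is magnified by the resize-back step.

```python
def crop_stats(crop: int, size: int) -> tuple[float, float, float]:
    """Return (side fraction, area fraction, upscale factor) for a square crop resized back."""
    side_fraction = crop / size
    return side_fraction, side_fraction**2, size / crop


IMG_SIZE = 224
for crop in (180, 112):
    side, area, upscale = crop_stats(crop, IMG_SIZE)
    print(f"crop {crop}px: side {side:.0%}, area {area:.0%}, upscale x{upscale:.2f}")
```

The "80%" crop keeps about 65% of the image area and the "50%" crop keeps 25%, so a large share of the original keypoints is simply absent from the cropped image.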

Complementary strengths¶

A key finding is that D4 hashes and BoVW have complementary detection strengths:

D4 detects but BoVW misses: Flips, brightness reduction, strong blur
BoVW detects but D4 misses: Arbitrary rotations, perspective, affine, grayscale, mild resize, contrast shifts

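The combined method detected 22 out of 30 transformations (73%) by merging groups across both methods. Individually, the two methods cover 19 transformations (5 shared, 10 BoVW-only, 4 D4-only). The remaining 3 (Resize Down+Up (severe), Augment: Flip+Rotate, and Augment: Flip+Color) join the original's group through shared members: for example, BoVW groups the two flip-based augmentations with the plain flips, and D4 links the plain flips to the original.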
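The merging step can be sketched as repeatedly joining any groups that share an index. The groups below are the D4 and BoVW near-duplicate groups reported earlier, written as image indices. `merge_overlapping` is a hypothetical helper, and treating the combined detector as merging groups this way is an assumption, although it reproduces the combined count.

```python
def merge_overlapping(groups: list[set[int]]) -> list[set[int]]:
    """Merge groups that share at least one index until no two groups overlap."""
    merged: list[set[int]] = []
    for group in groups:
        group = set(group)
        remaining = []
        for existing in merged:
            if existing & group:
                group |= existing  # absorb any group that shares a member
            else:
                remaining.append(existing)
        merged = remaining + [group]
    return merged


# Near-duplicate groups from the D4 and BoVW runs (image indices)
d4_groups = [{0, 3, 4, 5, 6, 12, 14, 18, 19, 20}, {17, 25}]
bovw_groups = [
    {0, 1, 2, 3, 4, 7, 9, 10, 11, 13, 14, 17, 18, 19, 24, 30},
    {5, 6, 28, 29},
    {20, 25},
]

merged = merge_overlapping(d4_groups + bovw_groups)
with_original = next(g for g in merged if 0 in g)
print(len(merged), len(with_original) - 1)  # 1 group, 22 transformations
```

Chaining is what pulls in images that neither method links to the original directly. The same property can over-merge on real data, since one weak link joins two otherwise separate groups.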

Recommendations¶

Use both methods together for best coverage — they complement each other well
For detecting rotated copies: D4 hashes handle 90° increments and flips; add BoVW for arbitrary angles
For data augmentation validation: Use BoVW with a higher cluster_sensitivity (1.5–2.0) to catch subtle duplicates
For large datasets: Start with fast D4 hashes, then run BoVW on remaining candidates
Adjust cluster_sensitivity: Use lower values (1.0–1.25) for strict matching and higher values (1.5–2.0) for permissive matching. In this experiment, no transformations were detected at 0.75.
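The large-dataset recommendation can be sketched as a two-stage pass: run the cheap detector on everything, keep one representative per group, and run the expensive detector only on the reduced set. The code below uses toy stand-in detectors on strings rather than DataEval calls; `two_stage` and `group_by` are hypothetical helpers, and the choice of the smallest index as representative is an assumption.

```python
from typing import Callable

Groups = list[list[int]]
Detector = Callable[[list], Groups]


def two_stage(items: list, fast: Detector, slow: Detector) -> tuple[Groups, Groups]:
    """Run a cheap detector first, then an expensive one on the reduced set.

    Each fast group is collapsed to one representative (its smallest index),
    so the slow detector sees fewer items. Returned indices refer to `items`.
    """
    fast_groups = fast(items)
    grouped = {i for group in fast_groups for i in group}
    representatives = {min(group) for group in fast_groups}
    candidates = sorted(representatives | (set(range(len(items))) - grouped))
    slow_local = slow([items[i] for i in candidates])
    slow_groups = [[candidates[j] for j in group] for group in slow_local]
    return fast_groups, slow_groups


def group_by(items: list, key: Callable) -> Groups:
    """Stand-in detector: group indices whose key matches (groups of 2 or more only)."""
    buckets: dict = {}
    for i, item in enumerate(items):
        buckets.setdefault(key(item), []).append(i)
    return [g for g in buckets.values() if len(g) > 1]


# Toy data: exact equality plays the role of the fast hash,
# case-insensitive equality the role of the slower embedding comparison.
items = ["a", "a", "A", "b", "B", "c"]
fast_groups, slow_groups = two_stage(
    items,
    fast=lambda xs: group_by(xs, key=lambda s: s),
    slow=lambda xs: group_by(xs, key=str.lower),
)
print(fast_groups)  # [[0, 1]]
print(slow_groups)  # [[0, 2], [3, 4]]
```

The slow groups are reported in the original index space, so they can be merged with the fast groups afterwards if a single set of groups is needed.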

What’s next¶

Beyond duplicate detection, DataEval offers additional tutorials on exploratory data analysis:

Clean a dataset with the labels in the Data Cleaning Guide
Identify Bias and Correlations in your metadata
Determine how the data groups by assessing the data space

Explore deeper explanations on topics such as duplicates, outliers, and coverage in the Concept pages.

To learn more about setting a global seed in DataEval, see the hardware configuration how-to.

On your own¶

Once you are familiar with DataEval and data analysis, run this analysis on your own dataset. When you do, make sure that you analyze all of your data and not just the training set.