How to choose a duplicate detection method¶
Problem Statement¶
DataEval offers multiple approaches for detecting duplicate images:
Hash-based methods (phash_d4, dhash_d4) - Fast perceptual hashing with rotation/flip invariance
Embedding-based methods (BoVWExtractor) - SIFT-based Bag of Visual Words for semantic similarity
This notebook compares these approaches to help you choose the right method for your use case.
When to use each method¶
Method |
Best For |
Speed |
Rotation Invariant |
|---|---|---|---|
D4 Hashes (phash_d4, dhash_d4) |
Detecting rotated/flipped copies |
Fast |
Only 90° increments |
BoVWExtractor |
Semantic similarity, different viewpoints |
Slower |
Any angle |
Basic Hashes (phash, dhash) |
Same-orientation near-duplicates |
Fastest |
No |
Key insight: D4 hashes only handle the 8 symmetries of a square (0°, 90°, 180°, 270° + flips). BoVW using SIFT features is invariant to any rotation angle, making it better for detecting arbitrarily rotated duplicates.
What you will need¶
A Python environment with the following packages installed:
dataevalopencv-pythonoropencv-python-headlessmatplotlib
Sample images to analyze
Getting Started¶
Let’s import the required libraries.
import time
import cv2
import matplotlib.pyplot as plt
import numpy as np
from dataeval import config
from dataeval.extractors._bovw import BoVWExtractor
from dataeval.flags import ImageStats
from dataeval.quality import Duplicates
config.set_seed(67) # six seven
Creating Test Data¶
We’ll create a synthetic dataset with different types of “duplicates” to test each method’s capabilities:
Original images - Base images with texture (for SIFT detection)
90° rotations - 90°, 180°, 270° rotations (D4 hashes can detect these)
Diagonal rotations - 45°, 135° rotations (only BoVW can detect these!)
Flipped copies - Horizontal and vertical flips
Unique images - Should NOT be detected as duplicates
def create_textured_image(seed: int, size: int = 128) -> np.ndarray:
"""Create an image with texture patterns that SIFT can detect."""
rng = np.random.default_rng(seed)
# Create base with gradient and noise
x = np.linspace(0, 4 * np.pi, size)
y = np.linspace(0, 4 * np.pi, size)
xx, yy = np.meshgrid(x, y)
# Create pattern with multiple frequency components
pattern = (
np.sin(xx * (1 + seed % 3)) * np.cos(yy * (2 + seed % 2))
+ np.sin(xx * 3 + seed) * 0.5
+ rng.random((size, size)) * 0.3
)
# Normalize to 0-255
pattern = ((pattern - pattern.min()) / (pattern.max() - pattern.min()) * 255).astype(np.uint8)
# Create RGB image (same pattern in all channels with slight variation)
img = np.stack(
[pattern, np.roll(pattern, seed % 10, axis=0), np.roll(pattern, seed % 7, axis=1)], axis=0
) # Shape: (3, H, W) - channels first
return img.astype(np.uint8)
def rotate_image(img: np.ndarray, angle: float) -> np.ndarray:
"""Rotates image by any angle"""
angle = angle % 360
if angle == 0:
return img
# Transpose to HWC for OpenCV
img_hwc = np.transpose(img, (1, 2, 0))
# Orthogonal rotations (90, 180, 270)
if angle % 90 == 0:
rotate_code = {1: cv2.ROTATE_90_COUNTERCLOCKWISE, 2: cv2.ROTATE_180, 3: cv2.ROTATE_90_CLOCKWISE}
rotated = cv2.rotate(img_hwc, rotate_code[int((angle // 90) % 4)])
# Affine rotation (Diagonal)
else:
h, w = img_hwc.shape[:2]
center = (w // 2, h // 2)
matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
cos, sin = np.abs(matrix[0, 0]), np.abs(matrix[0, 1])
new_w, new_h = int(h * sin + w * cos), int(h * cos + w * sin)
matrix[0, 2] += (new_w - w) / 2
matrix[1, 2] += (new_h - h) / 2
rotated = cv2.warpAffine(img_hwc, matrix, (new_w, new_h), borderValue=(128, 128, 128))
# Transpose back to CHW
return np.transpose(rotated, (2, 0, 1))
def flip_image(img: np.ndarray, direction: str) -> np.ndarray:
"""Flip image horizontally or vertically"""
img_hwc = np.transpose(img, (1, 2, 0))
flipped = cv2.flip(img_hwc, 1) if direction == "horizontal" else cv2.flip(img_hwc, 0)
return np.transpose(flipped, (2, 0, 1))
images, labels, group_info = [], [], []
experiments = [
(
42,
"Group 1: Orthogonal (D4 Detectable)",
[(rotate_image, 90, "Rot 90°"), (rotate_image, 180, "Rot 180°"), (flip_image, "horizontal", "Flip H")],
),
(
63,
"Group 2: Diagonal (BoVW Only)",
[(rotate_image, 45, "Rot 45°"), (rotate_image, 135, "Rot 135°"), (rotate_image, 30, "Rot 30°")],
),
(
89,
"Group 3: Mixed Rotations",
[(rotate_image, 90, "Rot 90°"), (rotate_image, 60, "Rot 60°"), (flip_image, "vertical", "Flip V")],
),
]
for seed, desc, transforms in experiments:
start_idx = len(images)
base_img = create_textured_image(seed=seed)
images.append(base_img)
labels.append(f"Original (Seed {seed})")
for func, arg, suffix in transforms:
images.append(func(base_img, arg))
labels.append(f"Seed {seed} - {suffix}")
group_info.append((desc, start_idx, len(images) - 1))
start_idx = len(images)
unique_seeds = [777, 888, 999]
for i, seed in enumerate(unique_seeds):
images.append(create_textured_image(seed=seed))
labels.append(f"Unique {i + 1}")
group_info.append(("Group 4: Unique Images", start_idx, len(images) - 1))
print(f"Created {len(images)} test images\n" + "=" * 60)
for desc, start, end in group_info:
print(f"{desc:<40} (indices {start}-{end})")
print("=" * 60)
for i, label in enumerate(labels):
print(f" [{i:2d}] {label}")
Created 15 test images
============================================================
Group 1: Orthogonal (D4 Detectable) (indices 0-3)
Group 2: Diagonal (BoVW Only) (indices 4-7)
Group 3: Mixed Rotations (indices 8-11)
Group 4: Unique Images (indices 12-14)
============================================================
[ 0] Original (Seed 42)
[ 1] Seed 42 - Rot 90°
[ 2] Seed 42 - Rot 180°
[ 3] Seed 42 - Flip H
[ 4] Original (Seed 63)
[ 5] Seed 63 - Rot 45°
[ 6] Seed 63 - Rot 135°
[ 7] Seed 63 - Rot 30°
[ 8] Original (Seed 89)
[ 9] Seed 89 - Rot 90°
[10] Seed 89 - Rot 60°
[11] Seed 89 - Flip V
[12] Unique 1
[13] Unique 2
[14] Unique 3
# Visualize the test images
fig, axes = plt.subplots(3, 5, figsize=(12, 6))
axes = axes.flatten()
for i, (img, label) in enumerate(zip(images, labels)):
if i < len(axes):
# Convert CHW to HWC for display
img_display = np.transpose(img, (1, 2, 0))
axes[i].imshow(img_display)
axes[i].set_title(f"[{i}] {label}", fontsize=8)
axes[i].axis("off")
# Hide empty subplots
for i in range(len(images), len(axes)):
axes[i].axis("off")
plt.tight_layout()
plt.suptitle("Test Images: 90° rotations (Group 1) vs Diagonal rotations (Group 2)", y=1.02, fontsize=14)
plt.show()
Method 1: D4 Hash-based Detection¶
D4 hashes (phash_d4, dhash_d4) compute perceptual hashes that are invariant to the 8 symmetries of a square (rotations by 0°, 90°, 180°, 270° and their horizontal flips).
Strengths:
Very fast computation
Detects rotated and flipped versions reliably (90° increments only)
No training required
Weaknesses:
Cannot detect diagonal rotations (45°, 30°, 60°, etc.)
Only detects near-exact copies (with transformations)
Cannot detect semantic similarity
Expected result: Should detect Group 1 (90° rotations) but miss Group 2 (diagonal rotations).
# Run D4 hash-based detection
start_time = time.time()
d4_detector = Duplicates(flags=ImageStats.HASH_DUPLICATES_D4)
d4_results = d4_detector.evaluate(images)
d4_time = time.time() - start_time
print(f"D4 Hash Detection completed in {d4_time:.3f} seconds")
D4 Hash Detection completed in 0.081 seconds
print("\n=== D4 Hash Results ===")
print("\nNear duplicates (perceptual similarity):")
if d4_results.items.near:
for group in d4_results.items.near:
indices = list(group.indices)
methods = sorted(group.methods)
print(f" Indices: {indices}")
print(f" Methods: {methods}")
print(f" Labels: {[labels[int(i)] for i in indices]}")
print()
=== D4 Hash Results ===
Near duplicates (perceptual similarity):
Indices: [0, 1, 2, 3]
Methods: ['dhash_d4', 'phash_d4']
Labels: ['Original (Seed 42)', 'Seed 42 - Rot 90°', 'Seed 42 - Rot 180°', 'Seed 42 - Flip H']
Indices: [8, 9, 11]
Methods: ['dhash_d4', 'phash_d4']
Labels: ['Original (Seed 89)', 'Seed 89 - Rot 90°', 'Seed 89 - Flip V']
Method 2: BoVW Embedding-based Detection¶
BoVWExtractor uses SIFT features to create rotation-invariant image representations. It clusters local features into a “visual vocabulary” and represents each image as a histogram of visual words.
Strengths:
Rotation invariant at ANY angle (SIFT features are inherently rotation invariant)
Can detect semantic similarity (similar objects, different viewpoints)
Works well for natural images with texture
Weaknesses:
Slower than hash-based methods
Requires images with detectable features (may fail on uniform/simple images)
Results depend on vocabulary size parameter
Expected result: Should detect BOTH Group 1 (90° rotations) AND Group 2 (diagonal rotations).
# Run BoVW-based detection
bovw_extractor = BoVWExtractor(vocab_size=128) # Smaller vocab for small dataset
start_time = time.time()
bovw_detector = Duplicates(
flags=ImageStats.NONE, # Skip hash computation
extractor=bovw_extractor,
cluster_threshold=1.25,
)
bovw_results = bovw_detector.evaluate(images)
bovw_time = time.time() - start_time
print(f"BoVW Detection completed in {bovw_time:.3f} seconds")
BoVW Detection completed in 22.252 seconds
print("\n=== BoVW Results ===")
print("\nNear duplicates (embedding similarity):")
if bovw_results.items.near:
for group in bovw_results.items.near:
indices = list(group.indices)
methods = sorted(group.methods)
print(f" Indices: {indices}")
print(f" Methods: {methods}")
print(f" Labels: {[labels[int(i)] for i in indices]}")
print()
else:
print(" No near duplicates found")
=== BoVW Results ===
Near duplicates (embedding similarity):
Indices: [0, 1, 2, 3]
Methods: ['cluster']
Labels: ['Original (Seed 42)', 'Seed 42 - Rot 90°', 'Seed 42 - Rot 180°', 'Seed 42 - Flip H']
Indices: [4, 5, 6, 7]
Methods: ['cluster']
Labels: ['Original (Seed 63)', 'Seed 63 - Rot 45°', 'Seed 63 - Rot 135°', 'Seed 63 - Rot 30°']
Indices: [8, 9, 10]
Methods: ['cluster']
Labels: ['Original (Seed 89)', 'Seed 89 - Rot 90°', 'Seed 89 - Rot 60°']
Performance Comparison¶
print("\n=== Performance Summary ===")
print("\nExecution Time:")
print(f" D4 Hashes: {d4_time:.3f}s")
print(f" BoVW: {bovw_time:.3f}s ({bovw_time / d4_time:.1f}x slower)")
print("\nDetection Results:")
print(f" {'Method':<15} {'Exact':<10} {'Near Groups':<15}")
print(f" {'-' * 40}")
d4_exact = len(d4_results.items.exact) if d4_results.items.exact else 0
d4_near = len(d4_results.items.near) if d4_results.items.near else 0
print(f" {'D4 Hashes':<15} {d4_exact:<10} {d4_near:<15}")
bovw_exact = len(bovw_results.items.exact) if bovw_results.items.exact else 0
bovw_near = len(bovw_results.items.near) if bovw_results.items.near else 0
print(f" {'BoVW':<15} {bovw_exact:<10} {bovw_near:<15}")
=== Performance Summary ===
Execution Time:
D4 Hashes: 0.081s
BoVW: 22.252s (274.8x slower)
Detection Results:
Method Exact Near Groups
----------------------------------------
D4 Hashes 0 2
BoVW 0 3
Recommendations¶
Use D4 Hashes (HASH_DUPLICATES_D4) when:¶
You need fast processing of large datasets
You’re looking for rotated/flipped copies at 90° increments only
Images are near-exact duplicates (same content, possibly transformed)
Use BoVWExtractor when:¶
You need to detect arbitrarily rotated duplicates (45°, 30°, etc.)
You need semantic similarity detection
Images may have different viewpoints of same objects
Processing time is not critical
Images have rich texture (not uniform/simple patterns)
Key Differences¶
Aspect |
D4 Hashes |
BoVWExtractor |
|---|---|---|
Exact duplicates |
Yes (via xxhash) |
No (embeddings are approximate) |
90° rotation detection |
Yes (D4 symmetry) |
Yes |
Diagonal rotation detection |
No |
Yes (any angle) |
Semantic similarity |
No |
Yes |
Speed |
Fast |
Slower |
Training required |
No |
Yes (builds vocabulary) |
Works on uniform images |
Yes |
No (needs texture for SIFT) |
Summary¶
The key difference demonstrated in this notebook is diagonal rotation handling:
D4 hashes can only detect rotations at 0°, 90°, 180°, 270° (plus flips)
BoVW/SIFT can detect rotations at any angle because SIFT features are inherently rotation invariant
If your dataset may contain images rotated at arbitrary angles, BoVWExtractor is the better choice despite being slower.