Image statistical analysis¶
The image statistics features summarize every image in a dataset so that you
can get a big-picture view of the dataset and its underlying distribution.
The calculate() function, called with ImageStats flags, produces the
data distribution that the Outliers class uses to identify outliers.
What are the statistical analysis categories¶
DataEval provides four main categories of statistics for analyzing image datasets,
controlled via ImageStats flags:
PIXEL: Pixel-level statistics (mean, std, variance, skewness, kurtosis, entropy, etc.)
VISUAL: Visual quality statistics (brightness, contrast, darkness, sharpness, percentiles)
DIMENSION: Dimension-based statistics (width, height, channels, size, aspect ratio, etc.)
HASH: Hash-based statistics for duplicate detection (xxhash, pchash)
The information below includes what each category provides and the statistical metrics that are available in each.
PIXEL Statistics¶
The PIXEL flag group calculates pixel-level statistics for each image.
These statistics analyze the raw pixel value distribution and can be computed
per-channel using the per_channel=True parameter with calculate().
Available individual pixel statistics:
| Flag | Description |
|---|---|
| PIXEL_MEAN | Average pixel value across the entire image |
| PIXEL_STD | Standard deviation of pixel values across the entire image |
| PIXEL_VAR | Variance of pixel values across the entire image |
| PIXEL_SKEW | Skewness: measure of the asymmetry of the pixel value distribution (0 for a symmetric distribution such as the normal) |
| PIXEL_KURTOSIS | Kurtosis: measure of how heavy the tails of the pixel value distribution are compared with a normal distribution |
| PIXEL_ENTROPY | Shannon entropy based on the histogram, \(-\sum p \log p\) |
| PIXEL_MISSING | Pixels missing a value, as a percentage of total pixels |
| PIXEL_ZEROS | Pixels with a zero value, as a percentage of total pixels |
| PIXEL_HISTOGRAM | Scales pixel values to the range 0-1 and counts them in 256 bins |
Convenience sub-groups:
PIXEL_BASIC: Mean, std, var
PIXEL_DISTRIBUTION: Skew, kurtosis, entropy, histogram
These statistics can be used in conjunction with the Outliers class to determine
if there are any issues with any of the images in the dataset.
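To make the table concrete, the following is a minimal NumPy sketch of how such pixel statistics could be computed for a single image. It is not DataEval's implementation: the function name `pixel_stats`, the use of excess kurtosis, the natural logarithm for entropy, and the assumption that values are already scaled to [0, 1] are all choices made for this illustration.

```python
import numpy as np


def pixel_stats(image: np.ndarray, bins: int = 256) -> dict:
    """Illustrative pixel statistics for one image with values in [0, 1].

    Hypothetical helper for this sketch, not part of DataEval.
    """
    values = image.astype(np.float64).ravel()
    valid = values[~np.isnan(values)]

    mean = valid.mean()
    std = valid.std()
    centered = valid - mean

    # Standardized third and fourth moments; undefined for a constant image.
    skew = np.mean(centered**3) / std**3 if std > 0 else float("nan")
    # Excess kurtosis (0 for a normal distribution) is an assumption here.
    kurtosis = np.mean(centered**4) / std**4 - 3.0 if std > 0 else float("nan")

    # Histogram over [0, 1], then Shannon entropy over the non-empty bins.
    hist, _ = np.histogram(valid, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    entropy = float(-np.sum(p * np.log(p)))

    return {
        "mean": float(mean),
        "std": float(std),
        "var": float(valid.var()),
        "skew": float(skew),
        "kurtosis": float(kurtosis),
        "entropy": entropy,
        "missing": float(np.isnan(values).mean()),  # fraction of NaN pixels
        "zeros": float(np.mean(values == 0)),       # fraction of zero pixels
        "histogram": hist,
    }
```

A per-channel breakdown in the same spirit would apply the helper to each channel of a (C, H, W) array, for example `[pixel_stats(channel) for channel in image]`.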
VISUAL Statistics¶
The VISUAL flag group calculates visual quality statistics for each individual image:
| Flag | Description |
|---|---|
| VISUAL_BRIGHTNESS | Brightness measure (25th percentile) |
| VISUAL_SHARPNESS | Sharpness measure using 3x3 edge filter |
| VISUAL_CONTRAST | Contrast measure: (max value - min value) / mean value |
| VISUAL_DARKNESS | Darkness measure (75th percentile) |
| VISUAL_PERCENTILES | The 0, 25, 50, 75, and 100 percentile values of the pixel distribution |
Convenience sub-group:
VISUAL_BASIC- Brightness, contrast, sharpness
These statistics can be used in conjunction with the Outliers class to determine
if there are any issues with any of the images in the dataset.
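The definitions in the table can be sketched in a few lines of NumPy. This is an illustration only, not DataEval's code: `visual_stats` is a hypothetical helper, and the sharpness measure here (standard deviation of a 3x3 Laplacian-style response on the channel-averaged image) is one plausible reading of "3x3 edge filter", chosen as an assumption.

```python
import numpy as np


def visual_stats(image: np.ndarray) -> dict:
    """Illustrative visual statistics for one image in (C, H, W) layout.

    Hypothetical helper for this sketch, not part of DataEval.
    """
    values = image.astype(np.float64)
    percentiles = np.percentile(values, [0, 25, 50, 75, 100])

    mean = values.mean()
    contrast = (values.max() - values.min()) / mean if mean != 0 else float("nan")

    # Assumed sharpness measure: spread of a 3x3 Laplacian-style edge response,
    # evaluated on the interior pixels of the channel-averaged image.
    gray = values.mean(axis=0)
    edges = (
        4 * gray[1:-1, 1:-1]
        - gray[:-2, 1:-1]
        - gray[2:, 1:-1]
        - gray[1:-1, :-2]
        - gray[1:-1, 2:]
    )
    sharpness = float(edges.std()) if edges.size else float("nan")

    return {
        "brightness": float(percentiles[1]),  # 25th percentile
        "darkness": float(percentiles[3]),    # 75th percentile
        "contrast": float(contrast),
        "sharpness": sharpness,
        "percentiles": percentiles,
    }
```

With these definitions, an image whose 25th percentile is low has many dark pixels, and an image whose 75th percentile is low is dark almost everywhere, which is why both percentiles are useful as outlier signals.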
DIMENSION Statistics¶
The DIMENSION flag group calculates dimension-based statistics for each individual image or bounding box:
| Flag | Description |
|---|---|
| DIMENSION_CHANNELS | Number of color channels in the image |
| DIMENSION_HEIGHT | Height of the image or bounding box in pixels |
| DIMENSION_WIDTH | Width of the image or bounding box in pixels |
| DIMENSION_SIZE | Area of the image or bounding box in pixels |
| DIMENSION_ASPECT_RATIO | Width divided by height |
| DIMENSION_DEPTH | Automatic calculation of the bit depth based on max and min values |
| DIMENSION_OFFSET_X | The x value (in pixels) of the top left corner of the bounding box |
| DIMENSION_OFFSET_Y | The y value (in pixels) of the top left corner of the bounding box |
| DIMENSION_CENTER | The x and y value (in pixels) of the center of the image or bounding box |
| DIMENSION_DISTANCE_CENTER | Distance between the center of the image and the center of the bounding box |
| DIMENSION_DISTANCE_EDGE | Distance from the bounding box to the nearest image edge |
| DIMENSION_INVALID_BOX | Whether the box is out of bounds or has no area |
Convenience sub-groups:
DIMENSION_BASIC: Width, height, channels
DIMENSION_OFFSET: Offset X and Y
DIMENSION_POSITION: Center, distance to center, distance to edge
Images are expected in CxHxW format, which is used to populate the width, height, and channels metrics.
These statistics can be used in conjunction with the Outliers class to determine
if there are any issues with any of the images in the dataset.
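As a rough illustration of the bounding box metrics, the sketch below derives them from an image shape in (C, H, W) order and a single box. The helper name `box_dimension_stats` and the `(x0, y0, x1, y1)` pixel box format are assumptions made for this example; DataEval's own box handling may differ.

```python
import math


def box_dimension_stats(image_shape, box):
    """Illustrative dimension statistics for one box in a (C, H, W) image.

    Hypothetical helper; `box` is assumed to be (x0, y0, x1, y1) in pixels.
    """
    channels, img_h, img_w = image_shape
    x0, y0, x1, y1 = box
    width, height = x1 - x0, y1 - y0

    # A box is invalid if it has no area or extends past the image bounds.
    invalid = (
        width <= 0 or height <= 0
        or x0 < 0 or y0 < 0 or x1 > img_w or y1 > img_h
    )

    center = ((x0 + x1) / 2, (y0 + y1) / 2)
    img_center = (img_w / 2, img_h / 2)

    return {
        "channels": channels,
        "width": width,
        "height": height,
        "size": width * height,
        "aspect_ratio": width / height if height > 0 else float("nan"),
        "offset_x": x0,
        "offset_y": y0,
        "center": center,
        "distance_center": math.dist(center, img_center),
        # Smallest gap between any box side and the matching image side.
        "distance_edge": min(x0, y0, img_w - x1, img_h - y1),
        "invalid_box": invalid,
    }
```

Boxes with extreme aspect ratios, a very small size, or a negative distance to the edge are typical candidates for annotation errors.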
HASH Statistics¶
The HASH flag group calculates hash values for duplicate detection:
| Flag | Description |
|---|---|
| HASH_XXHASH | xxHash for exact image matching |
| HASH_PCHASH | Perceptual hash for near-duplicate detection |
These hashes can be used in conjunction with the Duplicates class to identify duplicate images.
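The difference between exact and perceptual hashing can be illustrated with standard-library and NumPy stand-ins. This sketch deliberately swaps in other techniques: SHA-256 from `hashlib` replaces xxHash, and a simple average hash replaces the perceptual hash, so none of the function names below correspond to DataEval's API.

```python
import hashlib
from collections import defaultdict

import numpy as np


def exact_hash(image: np.ndarray) -> str:
    """Digest of shape, dtype, and raw bytes; identical arrays share a digest."""
    h = hashlib.sha256()  # stand-in for xxHash in this sketch
    h.update(str((image.shape, image.dtype.str)).encode())
    h.update(np.ascontiguousarray(image).tobytes())
    return h.hexdigest()


def average_hash(image: np.ndarray, grid: int = 8) -> int:
    """Coarse fingerprint: threshold a grid x grid thumbnail at its own mean.

    Assumes a (C, H, W) image with H and W of at least `grid` pixels.
    """
    gray = image.astype(np.float64).mean(axis=0)
    h, w = gray.shape
    # Crop so both sides divide evenly, then average each block.
    gray = gray[: h - h % grid, : w - w % grid]
    blocks = gray.reshape(grid, gray.shape[0] // grid, grid, gray.shape[1] // grid)
    thumb = blocks.mean(axis=(1, 3))
    bits = (thumb > thumb.mean()).ravel()
    return int("".join("1" if b else "0" for b in bits), 2)


def group_duplicates(images, key) -> list[list[int]]:
    """Group image indices that share the same key; keep groups of 2 or more."""
    groups = defaultdict(list)
    for i, img in enumerate(images):
        groups[key(img)].append(i)
    return [idx for idx in groups.values() if len(idx) > 1]
```

Grouping by `exact_hash` only finds byte-identical copies, while grouping by `average_hash` also catches images that differ by a uniform brightness change, at the cost of occasional false matches.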
When to use calculate with ImageStats¶
The calculate() function is automatically called when using Outliers.evaluate on data.
Therefore, you don’t usually need to call calculate() directly.
However, there are a few scenarios where using calculate() independently is beneficial:
When multiple sets of data as well as the combined set are to be analyzed, it can be easier to run calculate() on each individual set of data and then pass the outputs to the Outliers class in each of the desired data combinations for analysis.
When comparing the resulting data distribution between two or more datasets to determine how similar the datasets are.
When you need specific statistics for custom analysis or visualization.
When using the calculate_ratios() function to compute ratios between bounding box statistics and image statistics.
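For the dataset-comparison scenario, one simple way to quantify how similar two distributions of a statistic are is the two-sample Kolmogorov-Smirnov statistic. The sketch below is a plain NumPy version offered as an assumption about what such a comparison might look like; `ks_statistic` is a hypothetical helper and is not provided by DataEval.

```python
import numpy as np


def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Largest gap between the empirical CDFs of two samples.

    Returns 0.0 for identical samples and 1.0 for samples that do not overlap.
    """
    a, b = np.sort(np.asarray(a)), np.sort(np.asarray(b))
    grid = np.concatenate([a, b])
    # Fraction of each sample that is <= every observed value.
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))
```

The inputs would be the per-image arrays of one statistic taken from two separate calculate() outputs; a value close to 0 suggests the two datasets are similarly distributed for that statistic.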
Example usage¶
Example code for calculating all statistics for images:
# Import the calculate function and ImageStats flags
from dataeval.core import calculate
from dataeval.flags import ImageStats
from torchvision.datasets import VOCDetection
from torchvision.transforms import v2
# Loading in the PASCAL VOC 2011 dataset for this example
to_tensor = v2.ToImage()
ds = VOCDetection(
    "./data",
    year="2011",
    image_set="train",
    download=True,
    transform=to_tensor,
)
# Calculate all statistics for the images
# Note: Images should be in (C,H,W) format
result = calculate(ds, stats=ImageStats.ALL)
# Access the computed statistics
print(f"Processed {result['image_count']} images")
print(f"Available statistics: {list(result['stats'].keys())}")
Example code for calculating specific statistics:
from dataeval.core import calculate
from dataeval.flags import ImageStats
# Calculate only pixel and visual statistics
result = calculate(
    ds,
    stats=ImageStats.PIXEL | ImageStats.VISUAL,
)
# Calculate only basic pixel statistics with per-channel breakdown
result = calculate(
    ds,
    stats=ImageStats.PIXEL_BASIC,
    per_channel=True,
)
# Calculate dimension statistics for both full images and bounding boxes
result = calculate(
    ds,
    stats=ImageStats.DIMENSION,
    per_image=True,
    per_box=True,
)
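The `|` operator in the calls above combines flags bitwise. The following sketch uses Python's standard `enum.Flag` with a made-up `MiniStats` enum to show how individual flags, group aliases, and membership tests behave in general; it is not the definition of ImageStats, whose members and internals may differ.

```python
from enum import Flag, auto


class MiniStats(Flag):
    """Hypothetical stand-in for a statistics flag enum (not DataEval's)."""

    PIXEL_MEAN = auto()
    PIXEL_STD = auto()
    PIXEL_VAR = auto()
    VISUAL_BRIGHTNESS = auto()
    VISUAL_CONTRAST = auto()

    # Group aliases are the bitwise OR of their members.
    PIXEL_BASIC = PIXEL_MEAN | PIXEL_STD | PIXEL_VAR
    VISUAL_BASIC = VISUAL_BRIGHTNESS | VISUAL_CONTRAST


# Select a whole group plus one individual statistic.
selected = MiniStats.PIXEL_BASIC | MiniStats.VISUAL_CONTRAST

# Membership tests show which statistics the selection would request.
wants_mean = MiniStats.PIXEL_MEAN in selected
wants_brightness = MiniStats.VISUAL_BRIGHTNESS in selected
```

Selecting the narrowest combination of flags keeps the computation limited to the statistics that are actually needed.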
Analyzing the results¶
The calculate function returns a CalculationResult dictionary containing:
source_index: Sequence of SourceIndex objects tracking which image, box, and channel each statistic corresponds to
object_count: Number of objects (bounding boxes) per image
invalid_box_count: Number of invalid boxes per image
image_count: Total number of images processed
stats: Dictionary mapping statistic names to NumPy arrays of computed values
You can analyze the distribution of statistics to identify potential issues:
import numpy as np
# Get mean pixel values across all images
mean_values = result['stats']['mean']
# Identify outliers (values beyond 3 standard deviations)
mean_std = np.std(mean_values)
mean_avg = np.mean(mean_values)
outliers = np.where(np.abs(mean_values - mean_avg) > 3 * mean_std)[0]
print(f"Found {len(outliers)} outlier images based on mean pixel value")
When analyzing distributions, look for:
Uniform distribution: Check if any areas are significantly shorter or taller than the rest
Normal distribution: Look at the edges of the bell curve for raised values or gaps
Per-channel analysis: Compare shapes across channels to detect processing errors or channel bias
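The visual checks above can be approximated programmatically. The sketch below bins one statistic and reports bins that are much taller than the rest (spikes) and empty bins between populated ones (gaps). The helper `unusual_bins` and its threshold of three times the median bin count are assumptions for illustration, not a DataEval feature.

```python
import numpy as np


def unusual_bins(values: np.ndarray, bins: int = 20, factor: float = 3.0):
    """Return (edges, counts, spikes, gaps) for a histogram of `values`.

    A bin is a spike if its count exceeds `factor` times the median count.
    A bin is a gap if it is empty; because the histogram spans min..max,
    the first and last bins are never empty, so every gap is interior.
    """
    counts, edges = np.histogram(values, bins=bins)
    median = np.median(counts)
    spikes = np.flatnonzero(counts > factor * median)
    gaps = np.flatnonzero(counts == 0)
    return edges, counts, spikes, gaps
```

Running the same check on each channel's values separately and comparing which bins are flagged is one way to spot channel bias or a processing error that affects only one channel.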
Using with Outliers and Duplicates¶
The statistics from calculate() are used internally by the Outliers and Duplicates classes:
from dataeval import Outliers, Duplicates
# Outliers automatically calls calculate with appropriate stats
outliers = Outliers()
outlier_results = outliers.evaluate(ds)
# Duplicates uses hash statistics
duplicates = Duplicates()
duplicate_results = duplicates.evaluate(ds)
Performance Overview¶
The following performance data was collected using both small images (CIFAR-10, 3x32x32) and medium images (VOCDetection2012, ~3x375x500) across different computational configurations.
Statistics Categories Benchmarked¶
DIMENSION: Image dimension analysis
HASH: Hash-based similarity detection
VISUAL: Visual properties analysis (brightness, contrast, etc.)
PIXEL: Pixel-level statistical analysis (mean, std, histograms)
ALL: Combined pixel, visual, and dimension statistics
Per-channel mode: Per-channel analysis with additional overhead
Small Images Performance (CIFAR-10)¶
The following chart shows execution times for processing CIFAR-10 images with 16 processes across different dataset sizes (10K, 30K, 50K images).
Key observations:
Excellent linear scaling with image count for most statistics
DIMENSION and HASH show the best performance and efficiency
VISUAL provides good performance for comprehensive visual analysis
PIXEL has moderate computational cost for detailed pixel analysis
Per-channel mode shows expected overhead for per-channel breakdowns
Medium Images Performance (VOC Detection 2012)¶
Performance characteristics change significantly with larger images, as shown below for processing the VOCDetection (2012) dataset at different dataset sizes (1K, 3K, 5K images).
Notable differences:
HASH and DIMENSION remain relatively efficient regardless of image size
VISUAL maintains good performance characteristics across image sizes
PIXEL shows higher computational cost with larger images due to increased pixel data
Per-channel mode demonstrates significant overhead scaling with image complexity
Process Scaling Analysis¶
Small Images Process Scaling¶
Medium Images Process Scaling¶
Performance Recommendations¶
Based on the benchmark results:
For fast dataset profiling: Use DIMENSION | HASH for rapid analysis
For visual quality assessment: VISUAL provides good performance-to-insight ratio
For detailed analysis: PIXEL offers comprehensive metrics with moderate overhead
For complete analysis: ALL combines all statistics efficiently
Memory-constrained environments: Avoid per_channel=True for large datasets
Process scaling: Multi-processing (configured via dataeval.config) provides optimal performance for most workloads
Key Performance Insights¶
Diminishing returns: Increasing process count offers diminishing returns
Statistic selection: Choose the minimal set of statistics needed for your analysis using specific ImageStats flags
Per-channel overhead: Only use per_channel=True when channel-specific insights are required
Linear scaling: All statistics scale linearly with image count, size, and process count
Technical Notes¶
Benchmarks conducted using multiprocessing with shared memory optimization
Times measured include I/O overhead and result aggregation
Process scaling shows diminishing returns beyond optimal core count
Memory usage scales proportionally with per-channel analysis depth
Tests performed on an Intel Core i9-14900HX w/ 64GB DDR5 on Windows 11/Ubuntu 22.04 (WSL2) with dataset loaded on local storage
Performance applies to the calculate() function with various ImageStats flag combinations