Image Statistical Analysis

The image statistics functions help you understand a dataset. They give a big-picture view of the dataset and its underlying distribution, and they create the data distributions that the Outliers class uses to identify outliers.

What are the statistical analysis functions

There are seven different DataEval stat functions for analyzing a dataset:

  • boxratiostats

  • dimensionstats

  • hashstats

  • imagestats

  • labelstats

  • pixelstats

  • visualstats

The information below includes what each function does and the statistical analysis metrics that are included in each function.

boxratiostats

The boxratiostats() function calculates the ratio of the bounding box outputs to the image outputs. This function can be used with the output from:

  • dimensionstats()

  • pixelstats()

  • visualstats()

This function requires both a bounding box output and an image output from one of the stat functions listed above.

This function can be used in conjunction with the Outliers class to determine if there are issues with any of the images in the dataset.
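As a rough illustration of the ratio idea, and not DataEval's implementation, the sketch below divides each bounding box metric by the matching image metric. The dictionary layout, the sample values and the `ratio_stats` helper are all hypothetical.

```python
def ratio_stats(box_stats, image_stats):
    """Divide each box metric by the matching image metric (hypothetical helper)."""
    ratios = {}
    for name, box_value in box_stats.items():
        image_value = image_stats[name]
        # Guard against division by zero, e.g. an all-black image with mean 0
        ratios[name] = box_value / image_value if image_value else float("nan")
    return ratios


# Made-up values for one image and one bounding box inside it
image = {"width": 640, "height": 480, "size": 640 * 480, "mean": 0.5}
box = {"width": 160, "height": 120, "size": 160 * 120, "mean": 0.25}

ratios = ratio_stats(box, image)
print(ratios)
```

A ratio near 1 means the box looks much like the whole image for that metric, while a very small or very large ratio is the kind of value that may be worth a closer look.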

imagestats

The imagestats() function provides an easy way to run pixelstats and visualstats on the images of a dataset, or over each channel of those images.

dimensionstats

The dimensionstats() function is an aggregate metric that calculates various dimension-based statistics for each individual image:

Metric        Description
channels      Number of color channels in the image
height        Height of the image in pixels
width         Width of the image in pixels
size          Area of the image in pixels
aspect_ratio  Width of the image divided by the height
depth         Automatic calculation of the bit depth of the image based on max and min image values
left          The x value (in pixels) of the top left corner of the image or bounding box
top           The y value (in pixels) of the top left corner of the image or bounding box
center        The x and y value (in pixels) of the center of the image or bounding box
distance      The distance between the center of the image and the center of the bounding box
invalid_box   Whether the box is out of bounds or has no area

This function expects the image in a CxHxW format and uses this standard format to populate the width, height and channels metrics.

This function can be used in conjunction with the Outliers class to determine if there are any issues with any of the images in the dataset.
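To show why the CxHxW ordering matters, here is a minimal sketch that derives a few of the metrics above from a shape tuple. The `dimension_summary` helper is hypothetical and is not part of DataEval; depth and the bounding box metrics are left out.

```python
def dimension_summary(shape):
    """Derive basic dimension metrics from a (C, H, W) shape tuple."""
    channels, height, width = shape
    return {
        "channels": channels,
        "height": height,
        "width": width,
        "size": height * width,          # area in pixels
        "aspect_ratio": width / height,  # width divided by height
    }


# A 3-channel image that is 480 pixels tall and 640 pixels wide
summary = dimension_summary((3, 480, 640))
print(summary)

# The same image passed in HxWxC order is silently misread
misread = dimension_summary((480, 640, 3))
print(misread["channels"], misread["width"])
```

In the second call the image appears to have 480 channels and a width of 3 pixels, which is why it helps to convert images to CxHxW before computing the statistics.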

hashstats

The hashstats() function is an aggregate metric that calculates various hash values for each individual image:

Metric  Description
xxhash  Hash used for exact image matching
pchash  Perceptual hash used for near-duplicate image matching

This function can be used in conjunction with the Duplicates class in order to identify duplicate images.
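The idea behind exact matching can be sketched with the standard library alone. In this sketch `hashlib.sha256` stands in for xxhash, and `group_exact_duplicates` is a hypothetical helper rather than the Duplicates class.

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(images):
    """Group image indices whose raw bytes hash to the same digest."""
    groups = defaultdict(list)
    for index, pixels in enumerate(images):
        # hashlib.sha256 stands in for xxhash here; both map identical bytes
        # to identical digests, which is all exact matching needs
        digest = hashlib.sha256(bytes(pixels)).hexdigest()
        groups[digest].append(index)
    # Keep only digests shared by more than one image
    return [indices for indices in groups.values() if len(indices) > 1]


# Four tiny "images" as flat lists of 8-bit pixel values; 0 and 2 are identical
images = [
    [0, 10, 20, 30],
    [0, 10, 20, 31],
    [0, 10, 20, 30],
    [255, 255, 255, 255],
]

duplicates = group_exact_duplicates(images)
print(duplicates)
```

Image 1 differs from image 0 by a single pixel value and is not grouped with it. Catching that kind of near match is the job of a perceptual hash such as pchash.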

labelstats

The labelstats() function provides summary statistics across classes and labels:

Metric                   Description
label_counts_per_class   Total number of labels for each class
label_counts_per_image   Total number of labels for each image
image_counts_per_label   Total number of images for each label
image_indices_per_label  Dictionary mapping each label to the indices of the images that contain it
image_count              Total number of images
label_count              Total number of labels
class_count              Total number of classes
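To make these summaries concrete, the sketch below computes similar counts by hand for three made-up images. It uses only the standard library; the variable names mirror the metric names for readability, but the data layout is an assumption and this is not how labelstats is implemented.

```python
from collections import Counter, defaultdict

# Hypothetical labels for three images; each inner list holds one image's labels
labels_per_image = [
    ["cat", "dog"],
    ["dog", "dog"],
    ["cat"],
]

label_counts_per_class = Counter()
image_indices_per_label = defaultdict(list)

for image_index, labels in enumerate(labels_per_image):
    # Every label occurrence counts toward its class
    label_counts_per_class.update(labels)
    # Each image is recorded at most once per label it contains
    for label in sorted(set(labels)):
        image_indices_per_label[label].append(image_index)

label_counts_per_image = [len(labels) for labels in labels_per_image]
image_counts_per_label = {k: len(v) for k, v in image_indices_per_label.items()}

image_count = len(labels_per_image)
label_count = sum(label_counts_per_image)
class_count = len(label_counts_per_class)

print(dict(label_counts_per_class), image_counts_per_label)
```

Here "dog" has three labels but appears in only two images, which shows how the per-class label count and the per-label image count can differ.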

pixelstats

The pixelstats() function is an aggregate metric that calculates standard descriptive statistics about pixel values for each individual image:

Metric     Description
mean       Average pixel value across the entire image
std        Standard deviation of pixel values across the entire image
var        Variance of pixel values across the entire image
skew       Measure of the asymmetry of the pixel value distribution
kurtosis   Measure of how heavy the tails of the pixel value distribution are
entropy    Shannon entropy based on the histogram, \(-\sum p \log p\)
missing    Percentage of pixels that are missing a value
zeros      Percentage of pixels that have a zero value
histogram  Histogram of pixel values scaled to the range 0 to 1 and binned into 256 bins

This function can be used in conjunction with the Outliers class to determine if there are any issues with any of the images in the dataset.
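For readers who want to see the arithmetic, here is a minimal standard-library sketch of several of these metrics on six made-up pixel values. Population variance, excess kurtosis and the natural logarithm are conventions assumed for this sketch; DataEval may use different ones.

```python
import math

# Six made-up 8-bit pixel values standing in for one flattened image
pixels = [0, 0, 100, 100, 200, 200]
n = len(pixels)

mean = sum(pixels) / n
var = sum((p - mean) ** 2 for p in pixels) / n  # population variance
std = math.sqrt(var)

# Standardized third and fourth central moments
skew = sum((p - mean) ** 3 for p in pixels) / n / std**3
kurtosis = sum((p - mean) ** 4 for p in pixels) / n / std**4 - 3  # excess kurtosis

# Histogram: scale values to [0, 1], then bin into 256 equal-width bins
histogram = [0] * 256
for p in pixels:
    scaled = p / 255
    histogram[min(int(scaled * 256), 255)] += 1

# Shannon entropy of the histogram (natural log assumed here)
probs = [count / n for count in histogram if count]
entropy = -sum(q * math.log(q) for q in probs)

# Percentage of zero-valued pixels
zeros = 100 * pixels.count(0) / n

print(mean, round(std, 2), skew, round(kurtosis, 2), round(entropy, 3), round(zeros, 1))
```

The sample values are symmetric around 100, so the skew is zero, and the negative excess kurtosis reflects a distribution with lighter tails than a normal distribution.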

visualstats

The visualstats() function is an aggregate metric that calculates visual quality statistics for each individual image:

Metric       Description
brightness   The value of the 25th percentile
sharpness    The standard deviation of a 3x3 edge filter
contrast     (max value - min value) / mean value
darkness     The value of the 75th percentile
percentiles  The 0, 25, 50, 75, and 100 percentile values of the pixel distribution

This function can be used in conjunction with the Outliers class to determine if there are any issues with any of the images in the dataset.
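The percentile-based metrics can be sketched in a few lines. The interpolation method and the use of the mean pixel value in the contrast formula are assumptions of this sketch, `percentile` is a hypothetical helper, and sharpness is omitted because it needs a 2D edge filter.

```python
def percentile(sorted_values, q):
    """Percentile with linear interpolation between the two nearest ranks."""
    rank = q / 100 * (len(sorted_values) - 1)
    lower = int(rank)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = rank - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


# Five made-up pixel values standing in for one flattened image
pixels = sorted([30, 10, 50, 20, 40])

percentiles = [percentile(pixels, q) for q in (0, 25, 50, 75, 100)]
brightness = percentiles[1]  # value of the 25th percentile
darkness = percentiles[3]    # value of the 75th percentile

mean_value = sum(pixels) / len(pixels)
contrast = (percentiles[4] - percentiles[0]) / mean_value  # (max - min) / mean

print(percentiles, brightness, darkness, round(contrast, 3))
```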

When to use the statistical analysis functions

The functions are called automatically when you run Outliers.evaluate on data, so you usually don’t need to call them yourself. However, a few scenarios lend themselves to using the functions independently:

  • When multiple sets of data, as well as the combined set, are to be analyzed, it can be easier to run the stat functions on each individual set and then pass the outputs to the Outliers class in each of the desired combinations.

  • When comparing the resulting data distribution between two or more datasets to determine how similar the datasets are.

  • When visualizing the resulting data distribution after using one or more of the stat functions.

Example visualizing the resulting data distribution

The output of each function has a plot function that plots a histogram of the results, with two exceptions. The hashstats results are grouped lists and have no visualization function, and the labelstats results can be visualized with the to_table function instead.

Example code and result for the imagestats function for images:

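# Load the statistic metric from DataEval
from dataeval.metrics.stats import imagestats

# Load the dataset class and image transforms from torchvision
from torchvision.datasets import VOCDetection
from torchvision.transforms import v2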

# Loading in the PASCAL VOC 2011 dataset for this example
to_tensor = v2.ToImage()
ds = VOCDetection(
    "./data",
    year="2011",
    image_set="train",
    download=True,
    transform=to_tensor,
)

# This stat function takes about 1-3 minutes to run depending on your hardware

# Calculate the imagestats for the images
# Note: the stat function expects the images as a dataset
#       with images in the (C,H,W) format
stats = imagestats(ds)

# Visualize the results
stats.plot(log=True)

[Figure: histograms of each imagestats metric across the images in the dataset]

Example code and result for the imagestats function for channels per image:

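# Load the statistic metric from DataEval
from dataeval.metrics.stats import imagestats

# Load the dataset class and image transforms from torchvision
from torchvision.datasets import VOCDetection
from torchvision.transforms import v2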

# Loading in the PASCAL VOC 2011 dataset for this example
to_tensor = v2.ToImage()
ds = VOCDetection(
    "./data",
    year="2011",
    image_set="train",
    download=True,
    transform=to_tensor,
)

# This stat function takes about 1-3 minutes to run depending on your hardware

# Calculate the imagestats per channel for the images
# Note: the stat function expects the images as a dataset
#       with images in the (C,H,W) format
ch_stats = imagestats(ds, per_channel=True)

# Visualize the results
ch_stats.plot(log=False, channel_limit=3)

[Figure: per-channel histograms of each imagestats metric]

Using the visualization for a quick analysis

Visualizing the distribution of values for each metric allows one to quickly inspect the metrics for unusual distributions. In general, each metric should follow either a normal distribution or a uniform distribution.

With a uniform distribution, look for regions of the plot that are much shorter or much taller than the rest.

With a normal distribution, look at the edges of the bell curve to see whether the counts rise again near the edges of the plot, or whether there are gaps between the outermost values and the next values in.
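The visual check for values sitting far out in the tail of a bell curve has a simple numerical counterpart. The sketch below uses a plain z-score with an arbitrary threshold of 2.5 on made-up per-image values; it is a generic illustration and not a description of what the Outliers class does internally.

```python
from statistics import mean, pstdev

# Made-up per-image values for one metric; the last image sits far from the rest
values = [0.48, 0.50, 0.51, 0.49, 0.52, 0.50, 0.47, 0.53, 0.50, 0.95]

center = mean(values)
spread = pstdev(values)  # population standard deviation
threshold = 2.5          # arbitrary cutoff chosen for this sketch

# Distance of each value from the mean, in standard deviations
z_scores = [(v - center) / spread for v in values]
flagged = [i for i, z in enumerate(z_scores) if abs(z) > threshold]

print(flagged)
```

Only the last image is flagged. In a histogram it would show up as an isolated bar separated from the main bell by a gap.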

When analyzing the visualizations by channel, focus less on the overall shape of each plot and more on how the shape compares across the individual channels. Check whether the same shape holds for every channel or whether there are large differences between channels. Discrepancies across channels can help detect image processing errors and channel bias.
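A cross-channel comparison can also be done numerically. The sketch below compares the average of a per-channel metric across three channels and flags any channel that deviates strongly from the others. The sample values, the channel names and the 30% threshold are all made up for illustration.

```python
from statistics import mean

# Hypothetical per-image mean pixel values, one list per channel
per_channel_means = {
    "red": [0.45, 0.50, 0.55],
    "green": [0.44, 0.50, 0.56],
    "blue": [0.20, 0.25, 0.30],
}

# Average each channel over all images, then average the channels together
channel_averages = {name: mean(vals) for name, vals in per_channel_means.items()}
overall = mean(channel_averages.values())

# Flag channels whose average is more than 30% away from the overall average
flagged = [
    name
    for name, avg in channel_averages.items()
    if abs(avg - overall) / overall > 0.30
]

print(flagged)
```

Here the blue channel is noticeably darker than the other two, the kind of discrepancy that could point to a processing error or channel bias.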

For example, below is a quick analysis of the above example plots.

Regarding the imagestats plot, there are a few key insights:

  1. The channels metric has only one value, 3, which is interesting because some of the images in the dataset are greyscale, and greyscale images usually have only 1 channel.

  2. The entropy, zeros, kurtosis, and contrast metrics are single-tailed, and each has a long tail. Images whose values fall at the far end of a tail are potentially problematic.

  3. Size, aspect ratio, variance, skew, brightness and darkness have skewed or off-center distributions, which is another sign of potentially problematic images.

  4. Mean, standard deviation and sharpness appear to be normally distributed with no extended tail, which is a good sign.

While these insights don’t identify the exact images that may be problematic, they highlight where to focus further analysis.

Regarding the per-channel plot, the only insight is that there is very little difference across the channels for each metric. Therefore, there are no additional concerns beyond those from the imagestats plot.