Image Statistical Analysis

The image statistics functions help you understand a dataset. They give a big-picture view of the dataset and its underlying distribution, and they create the data distribution that the Outliers class uses to identify outliers.

What are the statistical analysis functions

There are seven different DataEval stat functions for analyzing a dataset:

boxratiostats
dimensionstats
hashstats
imagestats
labelstats
pixelstats
visualstats

The information below includes what each function does and the statistical analysis metrics that are included in each function.

boxratiostats

The boxratiostats() function calculates the ratio of the bounding box outputs to the image outputs. It operates on the output of other stat functions and requires both a bounding box output and an image output from those functions.

This function can be used in conjunction with the Outliers class to determine if there are issues with any of the images in the dataset.
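
As a rough illustration of the idea (not DataEval's implementation), the sketch below divides each per-box value by the matching value of the image that contains the box. The function name `box_to_image_ratio`, the plain-dictionary inputs, and the `image_index` key are all hypothetical stand-ins for the real stat outputs.

```python
def box_to_image_ratio(box_stats, image_stats):
    """Divide each box metric by the same metric of the image the box belongs to.

    Both arguments are hypothetical stand-ins for stat outputs: dictionaries
    mapping a metric name to a list of values. box_stats also carries an
    "image_index" list that says which image each box came from.
    """
    ratios = {}
    for name, box_values in box_stats.items():
        if name == "image_index":
            continue
        ratios[name] = [
            value / image_stats[name][img]
            for value, img in zip(box_values, box_stats["image_index"])
        ]
    return ratios


# Two images; the first has one box and the second has two boxes
image_stats = {"width": [100, 200], "height": [50, 100]}
box_stats = {
    "image_index": [0, 1, 1],
    "width": [50, 50, 200],
    "height": [25, 100, 10],
}
ratios = box_to_image_ratio(box_stats, image_stats)
print(ratios)
```

A ratio close to 1 means the box covers nearly the whole image along that dimension, while a very small ratio points to a tiny box.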

imagestats

The imagestats() function provides an easy way to run pixelstats and visualstats either on the images of a dataset or on each channel of those images.

dimensionstats

The dimensionstats() function is an aggregate metric that calculates various dimension-based statistics for each individual image:

Metric	Description
channels	Number of color channels in the image
height	Height of the image in pixels
width	Width of the image in pixels
size	Area of the image in pixels
aspect_ratio	Width of the image divided by the height
depth	Automatic calculation of the bit depth of the image based on max and min image values
left	The x value (in pixels) of the top left corner of the image or bounding box
top	The y value (in pixels) of the top left corner of the image or bounding box
center	The x and y value (in pixels) of the center of the image or bounding box
distance	The distance between the center of the image and the center of the bounding box
invalid_box	Whether the box is out of bounds or has no area

This function expects the image in a CxHxW format and uses this standard format to populate the width, height and channels metrics.

This function can be used in conjunction with the Outliers class to determine if there are any issues with any of the images in the dataset.
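
To make the CxHxW convention concrete, here is a minimal sketch that derives a few of the metrics in the table from an image stored as nested lists. The function `dimension_metrics` is a hypothetical helper written for this illustration, not part of DataEval, and it covers only the metrics that follow directly from the image shape.

```python
def dimension_metrics(image):
    """Return basic dimension metrics for an image stored as nested lists in C x H x W order."""
    channels = len(image)
    height = len(image[0])
    width = len(image[0][0])
    return {
        "channels": channels,
        "height": height,
        "width": width,
        "size": height * width,          # area in pixels
        "aspect_ratio": width / height,  # width divided by height
    }


# A 3-channel image that is 2 pixels high and 4 pixels wide
image = [[[0] * 4 for _ in range(2)] for _ in range(3)]
metrics = dimension_metrics(image)
print(metrics)
```

If the image were stored as HxWxC instead, the same code would silently report the height as the channel count, which is why the expected axis order matters.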

hashstats

The hashstats() function is an aggregate metric that calculates various hash values for each individual image:

Metric	Description
xxhash	exact image matching
pchash	perceptual hash based near image matching

This function can be used in conjunction with the Duplicates class in order to identify duplicate images.
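
The sketch below shows how exact-match hashes can be used to group duplicate images. It is an illustration only: it uses SHA-256 from the Python standard library as a stand-in for xxhash, the function `exact_duplicate_groups` is hypothetical, and near-duplicate detection with a perceptual hash is not covered.

```python
import hashlib
from collections import defaultdict


def exact_duplicate_groups(images):
    """Group the indices of images whose pixel bytes are identical.

    Each image is assumed to be a flat list of integers in the range 0-255.
    """
    groups = defaultdict(list)
    for index, image in enumerate(images):
        # Identical pixel data always produces an identical digest
        digest = hashlib.sha256(bytes(image)).hexdigest()
        groups[digest].append(index)
    # Only hashes shared by more than one image indicate duplicates
    return [indices for indices in groups.values() if len(indices) > 1]


images = [[0, 10, 255], [1, 2, 3], [0, 10, 255], [3, 2, 1]]
duplicates = exact_duplicate_groups(images)
print(duplicates)
```

Changing a single pixel value changes the digest completely, so this approach only finds byte-for-byte copies.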

labelstats

The labelstats() function provides summary statistics across classes and labels:

Metric	Description
label_counts_per_class	Total number of labels for each class
label_counts_per_image	Total number of labels for each image
image_counts_per_label	Total number of images for each label
image_indices_per_label	Dictionary mapping each label to the indices of the images that contain it
image_count	Total number of images
label_count	Total number of labels
class_count	Total number of classes
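
The metrics in this table can be sketched with the standard library alone. The helper `label_summary` below is hypothetical and assumes the labels arrive as one list of class names per image; it is meant to show what each metric counts, not how labelstats() is implemented.

```python
from collections import Counter, defaultdict


def label_summary(labels_per_image):
    """Summarize labels given one list of class names per image."""
    label_counts_per_class = Counter()
    image_indices_per_label = defaultdict(list)
    for index, labels in enumerate(labels_per_image):
        label_counts_per_class.update(labels)
        # Record each image once per label, even if the label repeats
        for label in sorted(set(labels)):
            image_indices_per_label[label].append(index)
    return {
        "label_counts_per_class": dict(label_counts_per_class),
        "label_counts_per_image": [len(labels) for labels in labels_per_image],
        "image_counts_per_label": {
            label: len(indices) for label, indices in image_indices_per_label.items()
        },
        "image_indices_per_label": dict(image_indices_per_label),
        "image_count": len(labels_per_image),
        "label_count": sum(label_counts_per_class.values()),
        "class_count": len(label_counts_per_class),
    }


labels = [["cat", "dog", "dog"], ["dog"], ["bird", "cat"]]
summary = label_summary(labels)
print(summary)
```

Note the difference between the two per-label counts: "dog" appears three times as a label but in only two images.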

pixelstats

The pixelstats() function is an aggregate metric that calculates standard descriptive statistics about pixel values for each individual image:

Metric	Description
mean	average pixel value across entire image
std	standard deviation of pixel values across entire image
var	variance of pixel values across entire image
skew	measure of the asymmetry of the pixel value distribution
kurtosis	measure of how heavy the tails of the pixel value distribution are
entropy	Shannon entropy based on the histogram, \(-\sum p \log p\)
missing	number of pixels missing a value, as a percentage of total pixels
zeros	number of pixels with a zero value, as a percentage of total pixels
histogram	pixel values scaled to the range 0-1 and binned into 256 bins

This function can be used in conjunction with the Outliers class to determine if there are any issues with any of the images in the dataset.
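
For intuition, the following sketch computes several of these metrics for a single image in plain Python. It rests on assumptions that the text above does not confirm: pixel values are already scaled to the range 0-1, the variance is the population variance, the entropy uses a base-2 logarithm, and zeros is returned as a fraction rather than a percentage. The function `pixel_metrics` is hypothetical.

```python
import math


def pixel_metrics(pixels, bins=256):
    """Compute simple pixel statistics for a flat list of values scaled to [0, 1]."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n  # population variance (assumption)

    # Bin the values; a value of exactly 1.0 falls into the last bin
    histogram = [0] * bins
    for p in pixels:
        histogram[min(int(p * bins), bins - 1)] += 1

    # Shannon entropy of the histogram, -sum(p * log2(p)) over non-empty bins
    probabilities = [count / n for count in histogram if count > 0]
    entropy = -sum(p * math.log2(p) for p in probabilities)

    return {
        "mean": mean,
        "std": math.sqrt(var),
        "var": var,
        "entropy": entropy,
        "zeros": pixels.count(0) / n,  # fraction of pixels equal to zero
        "histogram": histogram,
    }


pixels = [0.0, 0.0, 0.5, 1.0]
result = pixel_metrics(pixels)
print(result["mean"], result["var"], result["entropy"], result["zeros"])
```

An image with a single constant value has an entropy of 0, while an image whose values spread evenly over all 256 bins reaches the maximum of 8 bits under the base-2 assumption.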

visualstats

The visualstats() function is an aggregate metric that calculates visual quality statistics for each individual image:

Metric	Description
brightness	The value of the 25th percentile
sharpness	The standard deviation of a 3x3 edge filter
contrast	(max value - min value) / mean value
darkness	The value of the 75th percentile
percentiles	The 0, 25, 50, 75, and 100 percentile values of the pixel distribution

This function can be used in conjunction with the Outliers class to determine if there are any issues with any of the images in the dataset.
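
The percentile-based metrics in the table can be sketched as follows. This is an illustration under stated assumptions: the percentiles use linear interpolation, the mean pixel value is non-zero, and sharpness is left out because it needs an edge filter. Both `percentile` and `visual_metrics` are hypothetical helpers, not DataEval functions.

```python
def percentile(sorted_values, q):
    """Linear-interpolation percentile of pre-sorted values, with q in [0, 100]."""
    position = (len(sorted_values) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def visual_metrics(pixels):
    """Compute percentile-based visual metrics for a flat list of pixel values."""
    values = sorted(pixels)
    percentiles = [percentile(values, q) for q in (0, 25, 50, 75, 100)]
    mean = sum(values) / len(values)  # assumed to be non-zero
    return {
        "percentiles": percentiles,
        "brightness": percentiles[1],  # 25th percentile, as in the table
        "darkness": percentiles[3],    # 75th percentile, as in the table
        "contrast": (percentiles[4] - percentiles[0]) / mean,
    }


pixels = [0.0, 0.25, 0.5, 0.75, 1.0]
visual = visual_metrics(pixels)
print(visual)
```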

When to use the statistical analysis functions

The functions are automatically called when using Outliers.evaluate on data. Therefore, the functions themselves don’t usually need to be called on the data. However, there are a few scenarios that lend themselves to using the functions independently:

When multiple sets of data, as well as their combination, are to be analyzed, it can be easier to run the stat functions on each set individually and then pass the outputs to the Outliers class in each of the desired combinations.
When comparing the resulting data distribution between two or more datasets to determine how similar the datasets are.
When visualizing the resulting data distribution after using one or more of the stat functions.
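
One simple way to compare the distribution of a metric between two datasets is the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical cumulative distributions. The sketch below is a generic technique chosen for illustration and is not something DataEval is stated to provide; the function `ks_statistic` and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest gap between the empirical CDFs of two samples.

    Returns 0.0 for identical samples and 1.0 for samples that do not overlap.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for value in sorted(set(a) | set(b)):
        # Fraction of each sample that is less than or equal to value
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical per-image mean values from three datasets
mean_a = [0.40, 0.45, 0.50, 0.55]
mean_b = [0.40, 0.45, 0.50, 0.55]
mean_c = [0.80, 0.85, 0.90, 0.95]

same = ks_statistic(mean_a, mean_b)
different = ks_statistic(mean_a, mean_c)
print(same, different)
```

Values near 0 suggest the two datasets are similar for that metric, and values near 1 suggest they barely overlap.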

Example visualizing the resulting data distribution

The output of each function has a plot function that draws a histogram of the results. There are two exceptions: the hashstats output, whose results are grouped lists, has no visualization function, and the labelstats output is visualized with the to_table function instead.

Example code and result for the imagestats function for images:

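# Load the statistic metric from DataEval
from dataeval.metrics.stats import imagestats

# torchvision provides the dataset loader and the image transform used below
from torchvision.datasets import VOCDetection
from torchvision.transforms import v2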

# Loading in the PASCAL VOC 2011 dataset for this example
to_tensor = v2.ToImage()
ds = VOCDetection(
    "./data",
    year="2011",
    image_set="train",
    download=True,
    transform=to_tensor,
)

# This stat function takes about 1-3 minutes to run depending on your hardware

# Calculate the imagestats for the images
# Note: the stat function expects the images as a dataset
#       with images in the (C,H,W) format
stats = imagestats(ds)

# Visualize the results
stats.plot(log=True)

Example code and result for the imagestats function for channels per image:

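# Load the statistic metric from DataEval
from dataeval.metrics.stats import imagestats

# torchvision provides the dataset loader and the image transform used below
from torchvision.datasets import VOCDetection
from torchvision.transforms import v2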

# Loading in the PASCAL VOC 2011 dataset for this example
to_tensor = v2.ToImage()
ds = VOCDetection(
    "./data",
    year="2011",
    image_set="train",
    download=True,
    transform=to_tensor,
)

# This stat function takes about 1-3 minutes to run depending on your hardware

# Calculate the imagestats per channel for the images
# Note: the stat function expects the images as a dataset
#       with images in the (C,H,W) format
ch_stats = imagestats(ds, per_channel=True)

# Visualize the results
ch_stats.plot(log=False, channel_limit=3)

Using the visualization for a quick analysis

Visualizing the distribution of values for each metric allows one to quickly inspect the metrics for unusual distributions. In general, each metric should follow either a normal distribution or a uniform distribution.

With a uniform distribution, you want to notice if any of the plots have areas that are a lot shorter or a lot taller than the rest of the values.

With a normal distribution, look at the edges of the bell curve to see whether the values near the edges of the plot rise or whether there are gaps between the edge values and the next value in.
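
The visual check for isolated values at the edges of a distribution can be approximated numerically. The sketch below uses the common interquartile-range rule, which is a generic technique and not necessarily the method the Outliers class applies. The function `iqr_outlier_indices`, the factor of 1.5, and the sample values are assumptions made for this illustration.

```python
import statistics


def iqr_outlier_indices(values, factor=1.5):
    """Return the indices of values that fall outside the interquartile-range fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - factor * spread, q3 + factor * spread
    return [index for index, value in enumerate(values) if value < low or value > high]


# Hypothetical sharpness values; the last one sits far out in the tail
sharpness = [10, 11, 12, 12, 13, 14, 40]
flagged = iqr_outlier_indices(sharpness)
print(flagged)
```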

When analyzing the visualizations by channel, the overall shape of each plot matters less than how that shape compares across the individual channels. You want to see whether the same shape holds across each channel or whether there are large differences between the channels. This is important because discrepancies across channels can help detect image processing errors and channel bias.
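
A much-simplified numeric version of this comparison is to compute one statistic per channel and look at the spread between channels. The sketch below does this for the mean of a single image stored as nested lists in CxHxW order. The helpers `channel_means` and `max_channel_gap` are hypothetical, and comparing full per-channel distributions across a dataset would carry more information than this single number.

```python
def channel_means(image):
    """Mean pixel value of each channel for an image stored in C x H x W order."""
    means = []
    for channel in image:
        flat = [pixel for row in channel for pixel in row]
        means.append(sum(flat) / len(flat))
    return means


def max_channel_gap(image):
    """Difference between the largest and smallest channel mean."""
    means = channel_means(image)
    return max(means) - min(means)


# Two tiny 3-channel images, each 1 pixel high and 2 pixels wide
balanced = [[[0.5, 0.5]], [[0.5, 0.5]], [[0.5, 0.5]]]
tinted = [[[0.75, 0.75]], [[0.25, 0.25]], [[0.25, 0.25]]]

balanced_gap = max_channel_gap(balanced)
tinted_gap = max_channel_gap(tinted)
print(balanced_gap, tinted_gap)
```

A gap near zero means the channels agree, while a large gap, as in the tinted image, hints at a color cast or a channel-specific processing problem.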

For example, below is a quick analysis of the above example plots.

Regarding the imagestats plot, there are a few key insights:

The channels metric has only one value, 3, which is interesting since some of the images in the dataset are greyscale, and greyscale images usually have only 1 channel.
The entropy, zeros, kurtosis, and contrast metrics are single-tailed with a long tail, which indicates that the images whose values lie at the far end of the tail are potentially problematic.
Size, aspect ratio, variance, skew, brightness, and darkness have skewed or off-center distributions, which is another sign of potentially problematic images.
Mean, standard deviation, and sharpness appear to follow a normal distribution with no extended tail, which is a good sign.

While these insights don’t identify the exact images that may be problematic, they highlight where to focus further analysis.

Regarding the per-channel plot, the only insight is that there is very little difference across the channels for each metric. Therefore, there are no additional concerns beyond those from the imagestats plot.