Outliers¶
What is it¶
The Outliers class is a data cleaning class used to identify images that are reducing the quality or integrity of the dataset. It identifies the images by calculating the statistical outliers of a dataset using various statistical tests applied to each metric from the imagestats function.
Because the statistical tests rely on a data distribution, the results are based on the data evaluated. This means that changing the data either through adding to it or removing some of it, will result in different results.
When to use it¶
The Outliers class should be used anytime someone is working with a new dataset. Whether you are tasked with trying to find the right dataset to train a model or with verifying someone else’s stuff, the Outliers class helps you gain an understanding of the data.
The Data Cleaning Guide shows how the Outliers class can be used in conjunction with other DataEval classes to explore and clean a dataset.
Theory behind it¶
The Outliers class relies upon the stat functions to create a data distribution for specific metrics. The class then analyzes the distribution through a statistical test.
There are 3 different statistical tests that the Outliers class can use for detecting abnormal images:
zscore
modzscore
iqr
The z score method is based on
determining if the distance between a data point and the dataset mean is above
a specified threshold. The default threshold value for zscore is 3. The
equation for zscore is:
The modified z score method is
based on determining if the distance between a data point and the dataset
median is above a specified threshold. The default threshold value for
modzscore is 3.5. The equation for modzscore is:
where \(MAD\) is the median absolute deviation.
The interquartile range
method is based on determining if the distance between a data point and the
75th quartile or a data point and the 25th quartile is greater than the
difference between the 75th and 25th quartile multiplied by a specified
threshold. The default threshold value for iqr is 1.5. The equation for iqr
is:
where distance is the greater of \(25th\ quartile - value\) or \(value - 75th\ quartile\)
Class Options¶
The Outliers API will give more information on how to use the
functionality. The user has the option to specify:
the statistical method used by the class
the threshold the Outliers should compare with for identifying outliers
For more information on the algorithms/methods used to create the data distribution, see the Image Stats concept page.
Class Output¶
The Outliers class outputs a results dictionary with the keys being the images which were extreme in at least one metric as measured by the chosen statistical test. The value for each image in the results dictionary is a dictionary with the metric and the image’s value in the given metric as the key-value pair.
The Outliers class does not identify the reason that the image is flagged. The functionality of mapping issues to specific aspects of the image is outside the scope of the Outliers class.