Outlier Detection Tutorial
Problem Statement
For most computer vision tasks, such as image classification and object detection, outliers can provide insight into operational drift or training problems. One way to identify them is through the reconstruction error of an autoencoder: images that the model reconstructs poorly are likely to differ from the data it was trained on.
To help with this, DAML has an outlier detector that allows a user to identify potential outliers.
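As a rough illustration of the reconstruction-error idea, the sketch below scores each image by the mean squared difference between the image and its reconstruction. The helper name `reconstruction_error` and the stand-in reconstructions are hypothetical, and whether DAML's detectors use exactly this error measure is an assumption.

```python
import numpy as np


def reconstruction_error(images, reconstructions):
    """Return one mean squared error per image (hypothetical helper)."""
    diff = np.asarray(images, dtype=np.float32) - np.asarray(reconstructions, dtype=np.float32)
    # Average over every axis except the first (batch) axis.
    axes = tuple(range(1, diff.ndim))
    return np.mean(diff**2, axis=axes)


rng = np.random.default_rng(0)
batch = rng.random((4, 28, 28, 1)).astype(np.float32)

# Stand-in for an autoencoder: the first three images are reconstructed
# perfectly, the last one is reconstructed as a blank image.
recon = batch.copy()
recon[3] = 0.0

errors = reconstruction_error(batch, recon)
print(errors)  # the last image has the largest error
```

An image with a high score relative to the rest of the set is a candidate outlier.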
When to use
Use the AEOutlier class and similar detectors when you want to find the individual images in a dataset that differ most from the rest of the provided set.
What you will need
A training image dataset with the approximate percentage of known outliers.
A test image dataset to evaluate for outliers.
Setting up
Let’s import the libraries needed to set up a minimal working example.
import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
from daml.metrics.outlier import AEOutlier, VAEGMMOutlier
from daml.models.tensorflow import AE, VAEGMM, create_model
tf.random.set_seed(108)
tf.keras.utils.set_random_seed(408)
Load the data
We will use the MNIST dataset from TensorFlow Datasets for this tutorial on outlier detection.
# Load the MNIST dataset from TensorFlow Datasets
(images, ds_info) = tfds.load(
    "mnist",
    split="train[:2000]",
    with_info=True,
)  # type: ignore
images = images.shuffle(images.cardinality())
tfds.visualization.show_examples(images, ds_info)
images = np.array([i["image"] for i in images], dtype=np.float32) / 255.0
input_shape = images[0].shape
Initialize the model
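Now, let’s look at how to use DAML’s outlier detection methods.
We will set up two detectors from our Alibi Detect provider: one based on a simple autoencoder (AE) and one based on a variational autoencoding Gaussian mixture model (VAEGMM).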
detectors = [
    AEOutlier(create_model(AE, input_shape)),
    VAEGMMOutlier(create_model(VAEGMM, input_shape)),
]
Train the model
Next, we train each detector on the dataset. Increasing the number of epochs may give better results. We set threshold_perc=99 so that the most extreme 1% of the training data is flagged as outliers.
for detector in detectors:
    print(f"Training {detector.__class__.__name__}...")
    detector.fit(images, threshold_perc=99, epochs=20, verbose=False)
Training AEOutlier...
Training VAEGMMOutlier...
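To make the 1% figure concrete, here is a minimal sketch of how a percentile threshold turns per-image scores into outlier flags. It uses random numbers as stand-in scores and assumes that the threshold is taken as a percentile of the training scores; the detectors' internal computation may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in outlier scores for 2000 training images (not real detector output).
train_scores = rng.random(2000)

# With threshold_perc=99, the threshold sits at the 99th percentile of the scores.
threshold_perc = 99
threshold = np.percentile(train_scores, threshold_perc)

# Any score above the threshold is flagged, which is about 1% of the training set.
flagged = train_scores > threshold
print(flagged.sum(), flagged.mean())
```

The same threshold is then applied to new data, so a dataset that differs from the training set can have a much larger flagged fraction.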
Test for outliers
We have trained our detector on a dataset of digits.
What happens when we give it corrupted images of digits (which we expect to be “outliers”)?
corr_images, ds_info = tfds.load(
    "mnist_corrupted/translate",
    split="train[:2000]",
    with_info=True,
)  # type: ignore
corr_images = corr_images.shuffle(corr_images.cardinality())
tfds.visualization.show_examples(corr_images, ds_info)
corr_images = np.array([i["image"] for i in corr_images], dtype=np.float32) / 255.0
# corr_images = corr_images.ravel().reshape((corr_images.shape[0], -1))
print(corr_images.shape)
(2000, 28, 28, 1)
Now we evaluate both datasets using the trained detectors. Each value below is the fraction of images that a detector flags as outliers.
[(type(detector).__name__, np.mean(detector.predict(images)["is_outlier"])) for detector in detectors]
[('AEOutlier', 0.01), ('VAEGMMOutlier', 0.0115)]
[(type(detector).__name__, np.mean(detector.predict(corr_images)["is_outlier"])) for detector in detectors]
[('AEOutlier', 0.995), ('VAEGMMOutlier', 0.007)]
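Beyond the overall fraction, it is often useful to know which images were flagged. The sketch below assumes that the "is_outlier" entry is an array of 0/1 values with one entry per image, which is consistent with how it is averaged above. The helper name `summarize_flags` and the example array are hypothetical.

```python
import numpy as np


def summarize_flags(is_outlier):
    """Return the flagged fraction and the indices of flagged images (hypothetical helper)."""
    flags = np.asarray(is_outlier).astype(bool)
    return {"rate": float(flags.mean()), "indices": np.flatnonzero(flags)}


# Stand-in for a detector's "is_outlier" output on eight images.
example = np.array([0, 1, 0, 0, 1, 0, 0, 0])
summary = summarize_flags(example)
print(summary["rate"], summary["indices"])
```

The returned indices can be used to pull the flagged images out of the original array for visual inspection.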
Results
We can see that the autoencoder-based outlier detector identified most of the translated images (99.5%) as outliers, while the VAEGMM-based detector flagged only 0.7% of them and so was largely insensitive to the translation.
Different outlier detectors work better under different conditions, so choose one based on the kinds of perturbation you need to detect.