dataeval.quality.Outliers

class dataeval.quality.Outliers(flags=ImageStats.DIMENSION | ImageStats.PIXEL | ImageStats.VISUAL, outlier_method='modzscore', outlier_threshold=None)

Calculates statistical outliers of a dataset using various statistical tests applied to each image.

Parameters:
flags : ImageStats, default ImageStats.DIMENSION | ImageStats.PIXEL | ImageStats.VISUAL

Statistics to compute for outlier detection

outlier_method : ["modzscore" | "zscore" | "iqr"], optional - default "modzscore"

Statistical method used to identify outliers

outlier_threshold : float, optional - default None

Threshold value for the given outlier_method, above which data is considered an outlier. Uses the method-specific default if None.

stats

Statistics computed during the last evaluate() call. Contains dimension, pixel, and/or visual statistics based on the flags.

Type:

CalculationResult

flags

Statistics to compute for outlier detection

Type:

ImageStats

outlier_method

Statistical method used to identify outliers

Type:

Literal["zscore", "modzscore", "iqr"]

outlier_threshold

Threshold value for the outlier method

Type:

float | None

See also

Duplicates

Notes

There are 3 different statistical methods:

  • zscore

  • modzscore

  • iqr

The z score method is based on the difference between the data point and the mean of the data. The default threshold value for zscore is 3.
Z score = \(|x_i - \mu| / \sigma\), where \(\mu\) is the mean and \(\sigma\) is the standard deviation
The modified z score method is based on the difference between the data point and the median of the data. The default threshold value for modzscore is 3.5.
Modified z score = \(0.6745 \cdot |x_i - \tilde{x}| / MAD\), where \(\tilde{x}\) is the median and \(MAD\) is the median absolute deviation
The interquartile range method is based on how far the data point lies outside the quartiles, relative to the difference between the 75th and 25th percentiles (\(Q_3\) and \(Q_1\)). The default threshold value for iqr is 1.5.
Interquartile range distance = \(threshold \cdot (Q_3 - Q_1)\); in the standard form of this test, a value more than this distance below \(Q_1\) or above \(Q_3\) is an outlier
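The three tests above can be sketched in plain Python. This is an illustrative sketch only, not dataeval's implementation: the function names are hypothetical, and the use of the population standard deviation, the inclusive quartile interpolation, and the guard for zero spread are all assumptions made for the example.

```python
from statistics import mean, median, pstdev, quantiles


def zscore_mask(values, threshold=3.0):
    """Flag values whose |x - mean| / std exceeds the threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:  # assumption: constant data has no outliers
        return [False] * len(values)
    return [abs(x - mu) / sigma > threshold for x in values]


def modzscore_mask(values, threshold=3.5):
    """Flag values whose 0.6745 * |x - median| / MAD exceeds the threshold."""
    med = median(values)
    abs_dev = [abs(x - med) for x in values]
    mad = median(abs_dev)  # median absolute deviation
    if mad == 0:  # assumption: skip the test when MAD is zero
        return [False] * len(values)
    return [0.6745 * d / mad > threshold for d in abs_dev]


def iqr_mask(values, threshold=1.5):
    """Flag values beyond threshold * (Q3 - Q1) outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    distance = threshold * (q3 - q1)
    return [x < q1 - distance or x > q3 + distance for x in values]


values = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 30.0]
print(zscore_mask(values))     # no value is flagged in this small sample
print(modzscore_mask(values))  # only the last value (30.0) is flagged
print(iqr_mask(values))        # only the last value (30.0) is flagged
```

In this seven-point sample the z score of 30.0 is roughly 2.44, below the default of 3, because the extreme value inflates both the mean and the standard deviation. The median-based tests are less affected by it.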

Examples

Initialize the Outliers class:

>>> outliers = Outliers()

Specifying an outlier method:

>>> outliers = Outliers(outlier_method="iqr")

Specifying an outlier method and threshold:

>>> outliers = Outliers(outlier_method="zscore", outlier_threshold=3.5)

evaluate(data, *, per_image=True, per_target=True)

Returns indices of Outliers with the issues identified for each.

Computes statistical outliers by calculating dimension, pixel, and/or visual statistics for the dataset, then applying the configured outlier detection method. Stores computed statistics in the stats attribute.

Parameters:
data : Dataset[ArrayLike] or Dataset[tuple[ArrayLike, Any, Any]]

Dataset of images in array format. Can be image-only dataset or dataset with additional tuple elements (labels, metadata). Images should be in standard array format (C, H, W).

per_image : bool, default True

Whether to compute statistics for full items (images/videos). When True, item-level outliers will be detected.

per_target : bool, default True

Whether to compute statistics for individual targets/detections. When True and targets are present, target-level outliers will be detected. Has no effect for datasets without targets.

Returns:

Output class containing the indices of outliers and a dictionary showing the issues and calculated values for the given index.

Return type:

OutliersOutput

Examples

Basic outlier detection:

>>> outliers = Outliers(outlier_method="zscore", outlier_threshold=3.5)
>>> results = outliers.evaluate(outlier_images)
>>> results.issues
shape: (9, 3)
┌─────────┬─────────────┬──────────────┐
│ item_id ┆ metric_name ┆ metric_value │
│ ---     ┆ ---         ┆ ---          │
│ i64     ┆ cat         ┆ f64          │
╞═════════╪═════════════╪══════════════╡
│ 10      ┆ contrast    ┆ 1.25         │
│ 10      ┆ entropy     ┆ 0.212769     │
│ 10      ┆ zeros       ┆ 0.054932     │
│ 12      ┆ contrast    ┆ 1.25         │
│ 12      ┆ entropy     ┆ 0.212769     │
│ 12      ┆ sharpness   ┆ 1.509766     │
│ 12      ┆ std         ┆ 0.00536      │
│ 12      ┆ var         ┆ 0.000029     │
│ 12      ┆ zeros       ┆ 0.054932     │
└─────────┴─────────────┴──────────────┘

Access computed statistics for reuse:

>>> saved_stats = outliers.stats

from_clusters(embeddings, cluster_result, threshold=None)

Find outliers using cluster-based adaptive distance detection.

Identifies outliers based on their distance from cluster centers in embedding space. Points that are unusually far from their nearest cluster center are flagged as outliers. This method is particularly effective for finding semantic or visual outliers in image embeddings.

Parameters:
embeddings : ArrayND[float]

The embedding vectors used for clustering, shape (n_samples, n_features). Should be the same embeddings passed to the cluster() function.

cluster_result : ClusterResult

Clustering results from the cluster() function, containing cluster assignments and related metadata.

threshold : float, optional - default None

Number of standard deviations beyond cluster mean to use for outlier threshold. When None, the default of 2.5 standard deviations applies. Higher values are more permissive (fewer outliers), lower values are stricter (more outliers). Typical range: 1.5-3.5.

Returns:

Output containing outlier indices and their issue details. Each outlier includes:

  • 'cluster_distance': the distance from the cluster mean

  • 'std_devs': the number of standard deviations from the mean

Return type:

OutliersOutput[IndexIssueMap]

See also

dataeval.core.cluster

Function to compute clusters from embeddings

dataeval.core.compute_cluster_stats

Computes statistics for adaptive detection

from_stats

Find outliers from pre-computed image statistics

evaluate

Find outliers by computing statistics from images

Notes

This method uses adaptive distance-based outlier detection that accounts for varying cluster densities. It significantly reduces false outliers compared to using HDBSCAN’s binary -1 labels, especially for image embeddings with varying density distributions.

The threshold parameter allows experimentation with different sensitivity levels without recomputing clusters. Recommended values:

  • 1.5-2.0: Very strict (many outliers)

  • 2.5: Balanced (default)

  • 3.0-3.5: Permissive (fewer outliers)
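The idea of distance-based detection with a standard-deviation threshold can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm dataeval uses: the function name, the use of the member mean as the cluster center, and the per-cluster mean and standard deviation of member distances are all hypothetical choices. The label -1 is treated as noise, following the HDBSCAN convention mentioned above.

```python
from math import dist
from statistics import mean, pstdev


def cluster_distance_outliers(embeddings, labels, threshold=2.5):
    """Flag points far from their nearest cluster center (hypothetical helper)."""
    # Collect the members of each cluster; label -1 (noise) belongs to none.
    members = {}
    for point, label in zip(embeddings, labels):
        if label >= 0:
            members.setdefault(label, []).append(point)

    # Assumption: a cluster center is the mean of its members.
    centers = {
        label: [mean(column) for column in zip(*points)]
        for label, points in members.items()
    }

    # Mean and standard deviation of member-to-center distances per cluster.
    spread = {}
    for label, points in members.items():
        distances = [dist(p, centers[label]) for p in points]
        spread[label] = (mean(distances), pstdev(distances))

    issues = {}
    for index, point in enumerate(embeddings):
        nearest = min(centers, key=lambda c: dist(point, centers[c]))
        distance = dist(point, centers[nearest])
        mu, sigma = spread[nearest]
        if sigma == 0:  # assumption: skip clusters with no spread
            continue
        std_devs = (distance - mu) / sigma
        if std_devs > threshold:
            issues[index] = {"cluster_distance": distance, "std_devs": std_devs}
    return issues


points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (1.0, 1.0), (10.0, 10.0)]
labels = [0, 0, 0, 0, 0, -1]
issues = cluster_distance_outliers(points, labels)
print(sorted(issues))  # only index 5, the noise point far from the cluster
```

Because the threshold is applied after the distances are computed, trying a different value only requires calling the function again with the same labels, which mirrors the note above about not recomputing clusters.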

from_stats(stats: dataeval.core._calculate.CalculationResult) OutliersOutput[polars.DataFrame]
from_stats(stats: collections.abc.Sequence[dataeval.core._calculate.CalculationResult]) OutliersOutput[list[polars.DataFrame]]

Returns indices of Outliers with the issues identified for each.

Parameters:
stats : CalculationResult | Sequence[CalculationResult]

The output(s) from calculate() with ImageStats.DIMENSION, PIXEL, or VISUAL flags

Returns:

Output class containing a DataFrame of outlier issues with columns:

  • item_id: int - Index of the outlier image

  • target_id: int | None - Index of the target within the image (None for image-level outliers)

  • metric_name: str - Name of the metric that flagged this image/target

  • metric_value: float - Value of the metric for this image/target

Return type:

OutliersOutput

Example

Evaluate the dataset:

>>> outliers = Outliers(outlier_method="zscore", outlier_threshold=3.5)
>>> results = outliers.from_stats([stats1, stats2])
>>> len(results)
2
>>> results.issues[0]
shape: (6, 3)
┌─────────┬─────────────┬──────────────┐
│ item_id ┆ metric_name ┆ metric_value │
│ ---     ┆ ---         ┆ ---          │
│ i64     ┆ cat         ┆ f64          │
╞═════════╪═════════════╪══════════════╡
│ 10      ┆ entropy     ┆ 0.212769     │
│ 10      ┆ zeros       ┆ 0.054932     │
│ 12      ┆ entropy     ┆ 0.212769     │
│ 12      ┆ std         ┆ 0.00536      │
│ 12      ┆ var         ┆ 0.000029     │
│ 12      ┆ zeros       ┆ 0.054932     │
└─────────┴─────────────┴──────────────┘
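Because the issues table holds one row per flagged metric, the same item can appear several times. A minimal sketch of collapsing such rows into one entry per item is shown below. The rows are typed in by hand from the example output above rather than read from a real OutliersOutput, and the helper name is hypothetical.

```python
from collections import defaultdict

# Rows copied from the example output: (item_id, metric_name, metric_value).
rows = [
    (10, "entropy", 0.212769),
    (10, "zeros", 0.054932),
    (12, "entropy", 0.212769),
    (12, "std", 0.00536),
    (12, "var", 0.000029),
    (12, "zeros", 0.054932),
]


def group_issues_by_item(rows):
    """Map each item_id to a dict of the metrics that flagged it."""
    grouped = defaultdict(dict)
    for item_id, metric_name, metric_value in rows:
        grouped[item_id][metric_name] = metric_value
    return dict(grouped)


by_item = group_issues_by_item(rows)
for item_id, metrics in by_item.items():
    print(item_id, sorted(metrics))
```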