dataeval.quality.Outliers¶

class dataeval.quality.Outliers(flags=None, outlier_method=None, outlier_threshold=None, cluster_threshold=None, cluster_algorithm=None, n_clusters=None, config=None, feature_extractor=None)¶

Calculates statistical outliers of a dataset using various statistical tests applied to each image.

Supports two complementary detection methods:

Image statistics-based: Computes pixel-level statistics (brightness, contrast, etc.) and flags images with unusual values using statistical methods (zscore, modzscore, iqr).
Cluster-based: Uses embeddings from a neural network to cluster images and identifies outliers based on distance from cluster centers in embedding space.

Both methods can be used together or independently based on the flags parameter.

Parameters:¶

flags : ImageStats, default ImageStats.DIMENSION | ImageStats.PIXEL | ImageStats.VISUAL¶: Statistics to compute for image statistics-based outlier detection. Set to ImageStats.NONE to skip image statistics and use only cluster-based detection (requires feature_extractor).
outlier_method : {"modzscore", "zscore", "iqr"}, default "modzscore"¶: Statistical method used to identify outliers from image statistics.
outlier_threshold : float, optional¶: Threshold value for the given outlier_method, above which data is considered an outlier. Uses method-specific default if None.
feature_extractor : FeatureExtractor, optional¶: Feature extractor for cluster-based outlier detection. When provided, embeddings are extracted and clustered to find semantic/visual outliers in embedding space. Common extractors include Embeddings.
cluster_threshold : float, default 2.5¶: Number of standard deviations from cluster center beyond which a point is considered an outlier. Only used when feature_extractor is provided. Higher values are more permissive (fewer outliers).
cluster_algorithm : {"kmeans", "hdbscan"}, default "hdbscan"¶: Clustering algorithm for cluster-based detection.
n_clusters : int, optional¶: Expected number of clusters. For HDBSCAN, this is a hint that adjusts min_cluster_size. For KMeans, this is the exact number of clusters.
config : Outliers.Config or None, default None¶: Optional configuration object with default parameters. Parameters specified directly in __init__ will override config defaults.

stats¶

Statistics computed during the last evaluate() call. Contains dimension, pixel, and/or visual statistics based on the flags.

Type:¶: CalculationResult

flags¶

Statistics to compute for outlier detection.

Type:¶: ImageStats

outlier_method¶

Statistical method used to identify outliers.

Type:¶: Literal["zscore", "modzscore", "iqr"]

outlier_threshold¶

Threshold value for the outlier method.

Type:¶: float | None

feature_extractor¶

Feature extractor for cluster-based detection.

Type:¶: FeatureExtractor | None

cluster_threshold¶

Threshold for cluster-based outlier detection.

Type:¶: float

cluster_algorithm¶

Clustering algorithm to use.

Type:¶: Literal["kmeans", "hdbscan"]

n_clusters¶

Expected number of clusters.

Type:¶: int | None

See also

Duplicates

Notes

Image Statistics Methods:

zscore: Based on difference from mean. Default threshold: 3. Z score = \(|x_i - \mu| / \sigma\)
modzscore: Based on difference from median (robust to outliers). Default threshold: 3.5. Modified z score = \(0.6745 \cdot |x_i - \tilde{x}| / \mathrm{MAD}\), where \(\tilde{x}\) is the median and MAD is the median absolute deviation
iqr: Based on interquartile range. Default threshold: 1.5. Outliers are outside \([Q_1 - 1.5 \cdot \mathrm{IQR}, Q_3 + 1.5 \cdot \mathrm{IQR}]\)
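
The three rules above can be sketched with the standard library alone. This is an illustration of the formulas, not the library's implementation: the helper names, the use of the population standard deviation, the quartile convention of statistics.quantiles, and the sample brightness values are all assumptions made for the example.

```python
import statistics


def zscore(values):
    """|x_i - mean| / std for each value (population std is an assumption)."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [abs(x - mu) / sigma for x in values]


def modzscore(values):
    """0.6745 * |x_i - median| / MAD for each value."""
    med = statistics.median(values)
    mad = statistics.median([abs(x - med) for x in values])
    return [0.6745 * abs(x - med) / mad for x in values]


def iqr_flags(values, k=1.5):
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [x < q1 - k * iqr or x > q3 + k * iqr for x in values]


# Hypothetical per-image brightness values; the last image is nearly white.
brightness = [0.40, 0.42, 0.45, 0.47, 0.50, 0.52, 0.55, 0.98]

# Apply each rule with the default threshold listed above.
z_flagged = [i for i, s in enumerate(zscore(brightness)) if s > 3.0]
modz_flagged = [i for i, s in enumerate(modzscore(brightness)) if s > 3.5]
iqr_flagged = [i for i, f in enumerate(iqr_flags(brightness)) if f]

print(z_flagged, modz_flagged, iqr_flagged)
```

On this small sample the modified z-score and IQR rules flag the last image, while the plain z-score does not: the extreme value inflates both the mean and the standard deviation, which is the sensitivity that the median-based default is meant to avoid.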

Cluster-based Detection:

Uses adaptive distance-based detection that accounts for varying cluster densities. Points are flagged as outliers if their distance from the nearest cluster center exceeds cluster_threshold standard deviations from the cluster’s mean distance.
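
As a rough sketch of the idea described above, the following standard-library code scores each point by how many standard deviations its distance to the nearest cluster center lies above that cluster's mean distance. The function name, the assumption that cluster centers are already known, the use of the population standard deviation, and the 2D sample points are all hypothetical; the library's actual computation may differ.

```python
import math
import statistics


def cluster_distance_scores(points, centers):
    """Return, per point, std devs above its cluster's mean center distance."""
    # Assign every point to its nearest center and record that distance.
    nearest = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        idx = min(range(len(centers)), key=dists.__getitem__)
        nearest.append((idx, dists[idx]))

    # Standardize each distance against the other members of the same cluster,
    # so tight and loose clusters each get their own scale.
    scores = []
    for idx, d in nearest:
        members = [dist for i, dist in nearest if i == idx]
        mean = statistics.fmean(members)
        std = statistics.pstdev(members)
        scores.append((d - mean) / std if std > 0 else 0.0)
    return scores


# Hypothetical 2D "embeddings": a ring around (0, 0), one stray point,
# and a tighter group around (10, 10).
points = [
    (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0),
    (1.1, 0.0), (0.0, 1.1), (-1.1, 0.0), (0.0, -1.1),
    (0.9, 0.0), (0.0, 0.9), (-0.9, 0.0),
    (4.0, 0.0),  # stray point, index 11
    (10.0, 10.4), (10.6, 10.0), (9.5, 10.0), (10.0, 9.5),
]
centers = [(0.0, 0.0), (10.0, 10.0)]

cluster_threshold = 2.5
scores = cluster_distance_scores(points, centers)
flagged = [i for i, s in enumerate(scores) if s > cluster_threshold]
print(flagged)
```

Because each cluster is standardized separately, the members of the tight cluster around (10, 10) are judged against their own spread rather than against the wider ring, and only the stray point is flagged here.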

Examples

Basic image statistics-based outlier detection:

>>> outliers = Outliers()
>>> result = outliers.evaluate(dataset)

Specifying an outlier method:

>>> outliers = Outliers(outlier_method="iqr")

Cluster-based detection with embeddings:

>>> from dataeval import Embeddings
>>> from dataeval.flags import ImageStats
>>> extractor = Embeddings(encoder=encoder)
>>> outliers = Outliers(flags=ImageStats.NONE, feature_extractor=extractor)
>>> result = outliers.evaluate(train_ds)  # Reports only the cluster_distance metric

Using configuration:

>>> config = Outliers.Config(outlier_method="zscore", outlier_threshold=2.5)
>>> outliers = Outliers(config=config)

evaluate(data, *, per_image=True, per_target=True)¶

Returns the indices of outliers along with the issues identified for each.

Computes outliers using image statistics and/or cluster-based detection, depending on configuration. When both methods are enabled, results are combined into a single DataFrame.

Parameters:¶

data : Dataset[ArrayLike] or Dataset[tuple[ArrayLike, Any, Any]]¶: Dataset of images in array format. Can be image-only dataset or dataset with additional tuple elements (labels, metadata). Images should be in standard array format (C, H, W).
per_image : bool, default True¶: Whether to compute statistics for full items (images/videos). When True, item-level outliers will be detected.
per_target : bool, default True¶: Whether to compute statistics for individual targets/detections. When True and targets are present, target-level outliers will be detected. Has no effect for datasets without targets or for cluster-based detection.

Returns:¶

Output class containing a DataFrame of outlier issues with columns:

item_id: int - Index of the outlier image
target_id: int | None - Index of the target within the image (None for image-level outliers, omitted if all are image-level)
metric_name: str - Name of the metric that flagged this image/target. Includes “cluster_distance” when feature_extractor is provided.
metric_value: float - Value of the metric for this image/target. For cluster_distance, this is the number of std devs from cluster mean.

Return type:¶

OutliersOutput

Raises:¶

ValueError – If flags is ImageStats.NONE and no feature_extractor is provided, or if both per_image and per_target are False.

Examples

Basic outlier detection:

>>> outliers = Outliers(outlier_method="zscore", outlier_threshold=2.5)
>>> results = outliers.evaluate(images)
>>> results.issues.head(10)
shape: (10, 3)
┌─────────┬─────────────┬──────────────┐
│ item_id ┆ metric_name ┆ metric_value │
│ ---     ┆ ---         ┆ ---          │
│ i64     ┆ cat         ┆ f64          │
╞═════════╪═════════════╪══════════════╡
│ 7       ┆ brightness  ┆ 0.97998      │
│ 7       ┆ contrast    ┆ 0.0          │
│ 7       ┆ darkness    ┆ 0.97998      │
│ 7       ┆ entropy     ┆ 0.0          │
│ 7       ┆ mean        ┆ 0.97998      │
│ 7       ┆ sharpness   ┆ 0.0          │
│ 7       ┆ std         ┆ 0.0          │
│ 7       ┆ var         ┆ 0.0          │
│ 8       ┆ skew        ┆ 0.062317     │
│ 11      ┆ brightness  ┆ 0.97998      │
└─────────┴─────────────┴──────────────┘

Cluster-based detection with embeddings:

>>> from dataeval import Embeddings
>>> from dataeval.flags import ImageStats
>>> extractor = Embeddings(encoder=encoder)
>>> outliers = Outliers(flags=ImageStats.NONE, feature_extractor=extractor)
>>> results = outliers.evaluate(train_ds)

from_clusters(embeddings, cluster_result, threshold=None)¶

Find outliers using cluster-based adaptive distance detection.

Identifies outliers based on their distance from cluster centers in embedding space. Points that are unusually far from their nearest cluster center are flagged as outliers. This method is particularly effective for finding semantic or visual outliers in image embeddings.

Parameters:¶

embeddings : ArrayND[float]¶: The embedding vectors used for clustering, shape (n_samples, n_features). Should be the same embeddings passed to the cluster() function.
cluster_result : ClusterResult¶: Clustering results from the cluster() function, containing cluster assignments and related metadata.
threshold : float, default=2.5¶: Number of standard deviations beyond cluster mean to use for outlier threshold. Higher values are more permissive (fewer outliers), lower values are stricter (more outliers). Typical range: 1.5-3.5.

Returns:¶

Output containing outlier indices and their issue details. Each outlier includes:

‘cluster_distance’: the distance from the cluster mean
‘std_devs’: the number of standard deviations from the mean

Return type:¶

OutliersOutput[IndexIssueMap]

See also

dataeval.core.cluster: Function to compute clusters from embeddings
dataeval.core.compute_cluster_stats: Computes statistics for adaptive detection
from_stats: Find outliers from pre-computed image statistics
evaluate: Find outliers by computing statistics from images

Notes

This method uses adaptive distance-based outlier detection that accounts for varying cluster densities. It significantly reduces false outliers compared to using HDBSCAN’s binary -1 labels, especially for image embeddings with varying density distributions.

The threshold parameter allows experimentation with different sensitivity levels without recomputing clusters. Recommended values:

1.5-2.0: Very strict (many outliers)
2.5: Balanced (default)
3.0-3.5: Permissive (fewer outliers)
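
The effect of sweeping the threshold can be illustrated without any clustering at all, since only the comparison against already-computed standard-deviation scores changes. The scores and the helper below are hypothetical; with the library itself this would presumably correspond to calling from_clusters repeatedly with the same embeddings and cluster_result and a different threshold each time.

```python
# Hypothetical std_devs scores, as if computed once from a single clustering.
std_devs = [0.2, 0.8, 1.6, 1.9, 2.2, 2.7, 3.1, 3.8]


def flagged_at(scores, threshold):
    """Indices whose score exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


# Count flagged points at several sensitivity levels; the scores are reused.
counts = {t: len(flagged_at(std_devs, t)) for t in (1.5, 2.0, 2.5, 3.0, 3.5)}
print(counts)
```

The number of flagged points shrinks as the threshold rises, which matches the strict-to-permissive ordering of the recommended values.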

from_stats(stats: dataeval.core._calculate.CalculationResult) → OutliersOutput[polars.DataFrame]¶

from_stats(stats: collections.abc.Sequence[dataeval.core._calculate.CalculationResult]) → OutliersOutput[list[polars.DataFrame]]

Returns the indices of outliers along with the issues identified for each.

Parameters:¶

stats : CalculationResult | Sequence[CalculationResult]¶: The output(s) from calculate() with ImageStats.DIMENSION, PIXEL, or VISUAL flags

Returns:¶

Output class containing a DataFrame of outlier issues with columns:

item_id: int - Index of the outlier image
target_id: int | None - Index of the target within the image (None for image-level outliers)
metric_name: str - Name of the metric that flagged this image/target
metric_value: float - Value of the metric for this image/target

Return type:¶

OutliersOutput

Example

Evaluate the dataset using pre-computed stats:

>>> from dataeval.core import calculate
>>> from dataeval.flags import ImageStats
>>> stats = calculate(images, stats=ImageStats.PIXEL)
>>> outliers = Outliers(outlier_method="zscore", outlier_threshold=2.5)
>>> results = outliers.from_stats(stats)
>>> results.issues.head(10)
shape: (10, 3)
┌─────────┬─────────────┬──────────────┐
│ item_id ┆ metric_name ┆ metric_value │
│ ---     ┆ ---         ┆ ---          │
│ i64     ┆ cat         ┆ f64          │
╞═════════╪═════════════╪══════════════╡
│ 7       ┆ entropy     ┆ 0.0          │
│ 7       ┆ mean        ┆ 0.97998      │
│ 7       ┆ std         ┆ 0.0          │
│ 7       ┆ var         ┆ 0.0          │
│ 8       ┆ skew        ┆ 0.062317     │
│ 11      ┆ entropy     ┆ 0.0          │
│ 11      ┆ mean        ┆ 0.97998      │
│ 11      ┆ std         ┆ 0.0          │
│ 11      ┆ var         ┆ 0.0          │
│ 18      ┆ entropy     ┆ 0.0          │
└─────────┴─────────────┴──────────────┘

Classes¶

Config: Configuration for the Outliers detector.