dataeval.quality.Outliers
-
class dataeval.quality.Outliers(flags=None, outlier_threshold=None, extractor=None, batch_size=None, cluster_algorithm=None, cluster_threshold=None, n_clusters=None, config=None)
Computes statistical outliers of a dataset using various statistical tests applied to each image.
Supports two complementary detection methods:
- Image statistics-based: Computes pixel-level statistics (brightness, contrast, etc.) and flags images with unusual values using configurable Threshold objects.
- Cluster-based: Uses embeddings from a neural network to cluster images and identifies outliers based on distance from cluster centers in embedding space.
Both methods can be used together or independently based on the flags parameter.
- Parameters:
- flags : ImageStats, default ImageStats.DIMENSION | ImageStats.PIXEL | ImageStats.VISUAL
Statistics to compute for image statistics-based outlier detection. Set to ImageStats.NONE to skip image statistics and use only cluster-based detection (requires extractor).
- outlier_threshold : ThresholdLike, dict, or None, default None
Threshold configuration for image statistics-based outlier detection.
  - None: uses AdaptiveThreshold() with default multiplier (3.5), which computes both z-score and modified z-score bounds and takes the wider (more lenient) bound on each side.
  - float: symmetric multiplier for the default method (modified z-score via resolve_threshold)
  - str: named threshold type (e.g., "zscore", "iqr", "adaptive") with defaults
  - tuple[float | None, float | None]: asymmetric (lower, upper) multipliers
  - tuple[str, ThresholdBounds]: named threshold with bounds, e.g. ("zscore", 2.5) or ("iqr", (1.0, 3.0))
  - Threshold: a fully configured threshold (e.g., ZScoreThreshold, IQRThreshold, ConstantThreshold, AdaptiveThreshold)
  - Mapping[str, ThresholdLike]: per-metric thresholds keyed by metric name. Metrics not in the dict use the default (AdaptiveThreshold()).
- extractor : FeatureExtractor, optional
Feature extractor for cluster-based outlier detection. When provided, embeddings are extracted and clustered to find semantic/visual outliers in embedding space.
- cluster_threshold : ThresholdLike or None, default None
Threshold configuration for cluster-based outlier detection. When None, defaults to ZScoreThreshold(upper_multiplier=2.5). Accepts the same formats as outlier_threshold. Only used when extractor is provided.
- cluster_algorithm : {"kmeans", "hdbscan"}, default "hdbscan"
Clustering algorithm for cluster-based detection.
- n_clusters : int, optional
Expected number of clusters. For HDBSCAN, this is a hint that adjusts min_cluster_size. For KMeans, this is the exact number of clusters.
- config : Outliers.Config or None, default None
Optional configuration object with default parameters. Parameters specified directly in __init__ will override config defaults.
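The override rule for config can be pictured with a small stand-alone sketch. Everything in it is hypothetical: SketchConfig and SketchDetector are illustrative stand-ins, not the library's classes, and treating None as "not specified" is an assumption suggested by the None defaults in the signature.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    """Hypothetical stand-in for Outliers.Config; the field set is illustrative."""

    outlier_threshold: object = None
    cluster_threshold: object = None
    n_clusters: object = None


class SketchDetector:
    """Hypothetical detector showing how explicit arguments could win over config."""

    def __init__(self, outlier_threshold=None, cluster_threshold=None, n_clusters=None, config=None):
        base = config if config is not None else SketchConfig()
        explicit = {
            "outlier_threshold": outlier_threshold,
            "cluster_threshold": cluster_threshold,
            "n_clusters": n_clusters,
        }
        for name, value in explicit.items():
            # Assumption: None means "not specified", so fall back to the config value.
            setattr(self, name, value if value is not None else getattr(base, name))


config = SketchConfig(outlier_threshold=2.5, n_clusters=8)
detector = SketchDetector(outlier_threshold=("zscore", 3.0), config=config)
```

Here the explicit outlier_threshold replaces the config value, while n_clusters is taken from the config because it was not passed directly.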
- stats
Statistics computed during the last evaluate() call. Contains dimension, pixel, and/or visual statistics based on the flags.
- Type:
- outlier_threshold
Threshold configuration for outlier detection.
- Type:
ThresholdLike | Mapping[str, ThresholdLike] | None
- extractor
Feature extractor for cluster-based detection.
- Type:
FeatureExtractor | None
See also
Notes
Threshold Methods:
- AdaptiveThreshold (default): Uses tail-weighted Double-MAD (separate MAD for data below and above the median) with automatic multiplier scaling for heavy tails to produce asymmetric bounds. Default multiplier: 3.0.
- ModifiedZScoreThreshold: Based on median absolute deviation. Default multiplier: 3.5. Modified z score = \(0.6745 \cdot |x_i - \tilde{x}| / \mathrm{MAD}\)
- ZScoreThreshold: Based on standard deviation from mean. Default multiplier: 3. Z score = \(|x_i - \mu| / \sigma\)
- IQRThreshold: Based on interquartile range. Default multiplier: 1.5. Outliers are outside \([Q_1 - 1.5 \cdot IQR, Q_3 + 1.5 \cdot IQR]\)
- ConstantThreshold: Hard lower/upper bounds (data-independent).
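The z-score, modified z-score, and IQR rules listed above can be sketched in plain Python to show how each one turns a multiplier into lower/upper bounds. This is a minimal illustration, not the library's implementation: the function names are hypothetical, and the use of the population standard deviation and inclusive quartiles are assumptions. The brightness values are made up.

```python
import statistics


def zscore_bounds(values, multiplier=3.0):
    """Bounds at mean +/- multiplier * standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return mean - multiplier * std, mean + multiplier * std


def modified_zscore_bounds(values, multiplier=3.5):
    """Bounds where 0.6745 * |x - median| / MAD equals the multiplier."""
    median = statistics.median(values)
    mad = statistics.median(abs(x - median) for x in values)
    half_width = multiplier * mad / 0.6745
    return median - half_width, median + half_width


def iqr_bounds(values, multiplier=1.5):
    """Bounds at Q1 - multiplier * IQR and Q3 + multiplier * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


def flag_outliers(values, bounds):
    """Return the indices of values that fall outside (lower, upper)."""
    lower, upper = bounds
    return [i for i, x in enumerate(values) if x < lower or x > upper]


# Hypothetical per-image brightness values; the last one is unusually high.
brightness = [0.40, 0.42, 0.45, 0.47, 0.50, 0.52, 0.55, 0.98]

zscore_flags = flag_outliers(brightness, zscore_bounds(brightness))
modified_flags = flag_outliers(brightness, modified_zscore_bounds(brightness))
iqr_flags = flag_outliers(brightness, iqr_bounds(brightness))
```

On this sample the mean-based z-score at multiplier 3 flags nothing, because the extreme value inflates the standard deviation, while the median-based and quartile-based rules both flag index 7.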
All threshold types support asymmetric lower/upper multipliers via lower_multiplier and upper_multiplier parameters.
Cluster-based Detection:
Uses adaptive distance-based detection that accounts for varying cluster densities. A Threshold is applied per-cluster to the distance distribution (default: ZScoreThreshold(upper_multiplier=2.5)), and points whose distance exceeds the upper bound are flagged as outliers.
Examples
Basic image statistics-based outlier detection (default threshold):
>>> outliers = Outliers()
>>> result = outliers.evaluate(dataset)
Using a specific threshold method:
>>> from dataeval.utils.thresholds import ZScoreThreshold
>>> outliers = Outliers(outlier_threshold=ZScoreThreshold(2.5))
Asymmetric thresholds (stricter on lower, lenient on upper):
>>> from dataeval.utils.thresholds import IQRThreshold
>>> outliers = Outliers(outlier_threshold=IQRThreshold(lower_multiplier=1.0, upper_multiplier=3.0))
Hard bounds:
>>> from dataeval.utils.thresholds import ConstantThreshold
>>> outliers = Outliers(outlier_threshold=ConstantThreshold(lower=0.1, upper=0.9))
Named threshold type with bounds (no need to import threshold classes):
>>> outliers = Outliers(outlier_threshold="iqr")
>>> outliers = Outliers(outlier_threshold=("zscore", 2.5))
>>> outliers = Outliers(outlier_threshold=("modzscore", (1.0, 3.0)))
Named threshold type with bounds and limits (no need to import threshold classes):
>>> outliers = Outliers(outlier_threshold=("zscore", 4.0, (0.0, 1.0)))
Per-metric thresholds:
>>> outliers = Outliers(outlier_threshold={"mean": 2.0, "brightness": ("zscore", 2.0)})
Cluster-based detection with embeddings:
>>> from dataeval.extractors import FlattenExtractor
>>> from dataeval.flags import ImageStats
>>> outliers = Outliers(flags=ImageStats.NONE, extractor=FlattenExtractor(), cluster_threshold=2.0)
>>> result = outliers.evaluate(train_ds)  # Only cluster_distance metric
Using configuration:
>>> config = Outliers.Config(outlier_threshold=2.5)
>>> outliers = Outliers(config=config)
-
evaluate(data: _DatasetInput, *, per_image: bool = True, per_target: False = ..., per_class: bool = False, metadata: dataeval.protocols.MetadataLike | None = None) -> SingleOutliersOutput
-
evaluate(data: _DatasetInput, *, per_image: bool = True, per_target: True, per_class: bool = False, metadata: dataeval.protocols.MetadataLike | None = None) -> SingleTargetOutliersOutput
-
evaluate(data: _DatasetInput, *other: _DatasetInput, per_image: bool = True, per_target: False = ..., per_class: bool = False, metadata: dataeval.protocols.MetadataLike | None = None) -> MultiOutliersOutput
-
evaluate(data: _DatasetInput, *other: _DatasetInput, per_image: bool = True, per_target: True, per_class: bool = False, metadata: dataeval.protocols.MetadataLike | None = None) -> MultiTargetOutliersOutput
Return indices of Outliers with the issues identified for each.
Computes outliers using image statistics and/or cluster-based detection, depending on configuration. When both methods are enabled, results are combined into a single DataFrame. Supports single or multiple datasets.
- Parameters:
- data : Dataset
A dataset of images.
- *other : Dataset
Additional datasets for cross-dataset outlier detection.
- per_image : bool, default True
Whether to compute statistics for full items (images/videos). When True, item-level outliers will be detected.
- per_target : bool, default False
Whether to compute statistics for individual targets/detections. When True, the .outliers accessor uses SourceIndex keys; when False, it uses plain int item indices. Has no effect for datasets without targets or for cluster-based detection.
- per_class : bool, default False
Whether to compute outlier thresholds within each class separately, rather than globally across the entire dataset. When True, metadata must be provided. Only applies to image statistics-based detection, not cluster-based detection.
- metadata : MetadataLike or None, default None
Metadata object containing class labels. Required when per_class=True.
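The difference between global and per-class thresholds can be illustrated with a small stand-alone sketch. It is not the library's code: outlier_indices and modified_zscore_bounds are hypothetical helpers, the modified z-score rule is chosen only for illustration, and the brightness values and class labels are made up.

```python
import statistics
from collections import defaultdict


def modified_zscore_bounds(values, multiplier=3.5):
    """Median +/- multiplier * MAD / 0.6745."""
    median = statistics.median(values)
    mad = statistics.median(abs(x - median) for x in values)
    half_width = multiplier * mad / 0.6745
    return median - half_width, median + half_width


def outlier_indices(values, class_labels=None):
    """Flag values globally, or within each class when labels are given."""
    groups = defaultdict(list)
    for index in range(len(values)):
        key = None if class_labels is None else class_labels[index]
        groups[key].append(index)

    flagged = []
    for indices in groups.values():
        # Bounds are computed only from the members of this group.
        lower, upper = modified_zscore_bounds([values[i] for i in indices])
        flagged += [i for i in indices if not lower <= values[i] <= upper]
    return sorted(flagged)


# Hypothetical brightness values: five "day" images, then five "night" images.
brightness = [0.78, 0.80, 0.82, 0.79, 0.81, 0.10, 0.12, 0.11, 0.09, 0.50]
labels = ["day"] * 5 + ["night"] * 5

global_flags = outlier_indices(brightness)
per_class_flags = outlier_indices(brightness, labels)
```

The night image with brightness 0.50 (index 9) sits comfortably inside the global range, so the global pass flags nothing, but it is far from the other night images and is flagged once thresholds are computed per class.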
- Returns:
Output class containing a DataFrame of outlier issues with columns:
  - item_index: int - Index of the outlier item
  - target_index: int | None - Index of the target within the item (None for item-level outliers, omitted if all are item-level)
  - metric_name: str - Name of the metric that flagged this item/target. Includes "cluster_distance" when extractor is provided.
  - metric_value: float - Value of the metric for this item/target. For cluster_distance, this is the number of std devs from cluster mean.
For multi-dataset input, includes a dataset_index column.
- Return type:
SingleOutliersOutput or MultiOutliersOutput
- Raises:
ValueError – If flags is ImageStats.NONE and no extractor is provided. If both per_image and per_target are False. If per_class is True and metadata is None.
Examples
Basic outlier detection:
>>> outliers = Outliers(outlier_threshold=2.5)
>>> results = outliers.evaluate(images)
>>> results.head(6)
shape: (6, 3)
┌────────────┬─────────────┬──────────────┐
│ item_index ┆ metric_name ┆ metric_value │
│ ---        ┆ ---         ┆ ---          │
│ i64        ┆ cat         ┆ f64          │
╞════════════╪═════════════╪══════════════╡
│ 0          ┆ zeros       ┆ 0.000081     │
│ 2          ┆ zeros       ┆ 0.000081     │
│ 7          ┆ brightness  ┆ 0.98         │
│ 7          ┆ contrast    ┆ 0.0          │
│ 7          ┆ darkness    ┆ 0.98         │
│ 7          ┆ entropy     ┆ 0.0          │
└────────────┴─────────────┴──────────────┘
Evaluate two or more datasets (cross-dataset detection):
>>> outliers = Outliers()
>>> results = outliers.evaluate(train_ds, test_ds)
>>> results = outliers.evaluate(train_ds_area1, train_ds_area2, train_ds_area3, test_ds)  # or more
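Because the result is a long-format table with one row per (item, metric) pair, it is often convenient to regroup it by item. The sketch below does this in plain Python on rows copied from the basic example's output; it deliberately avoids the DataFrame API, so it shows the reshaping idea only and is not the output object's interface.

```python
from collections import defaultdict

# Rows copied from the basic example output: (item_index, metric_name, metric_value).
rows = [
    (0, "zeros", 0.000081),
    (2, "zeros", 0.000081),
    (7, "brightness", 0.98),
    (7, "contrast", 0.0),
    (7, "darkness", 0.98),
    (7, "entropy", 0.0),
]

# Collect every flagged metric under the item it belongs to.
issues_by_item = defaultdict(dict)
for item_index, metric_name, metric_value in rows:
    issues_by_item[item_index][metric_name] = metric_value

# One possible triage heuristic: look first at the item flagged by the most metrics.
most_flagged = max(issues_by_item, key=lambda item: len(issues_by_item[item]))
```

In these rows item 7 is flagged by four metrics, so it would be the first candidate for manual review under this heuristic.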
-
from_clusters(embeddings, cluster_result, cluster_threshold=None)
Find outliers using cluster-based adaptive distance detection.
Identifies outliers based on their distance from cluster centers in embedding space. Points that are unusually far from their nearest cluster center are flagged as outliers. This method is particularly effective for finding semantic or visual outliers in image embeddings.
- Parameters:
- embeddings : ArrayND[float]
The embedding vectors used for clustering, shape (n_samples, n_features). Should be the same embeddings passed to the cluster() function.
- cluster_result : ClusterResult
Clustering results from the cluster() function, containing cluster assignments and related metadata.
- cluster_threshold : ThresholdLike or None, default None
Threshold configuration for cluster-based outlier detection. Accepts the same formats as outlier_threshold. When None, uses the detector's configured cluster_threshold.
- Returns:
Output containing outlier indices and their issue details. Each outlier includes:
  - 'cluster_distance': the distance from the cluster mean
  - 'std_devs': the number of standard deviations from the mean
- Return type:
OutliersOutput[IndexIssueMap]
See also
dataeval.core.cluster: Function to compute clusters from embeddings
dataeval.core.compute_cluster_stats: Computes statistics for adaptive detection
from_stats: Find outliers from pre-computed image statistics
evaluate: Find outliers by computing statistics from images
Notes
This method uses adaptive distance-based outlier detection that accounts for varying cluster densities. It significantly reduces false outliers compared to using HDBSCAN’s binary -1 labels, especially for image embeddings with varying density distributions.
The threshold parameter allows experimentation with different sensitivity levels and methods without recomputing clusters.
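The per-cluster idea described in these notes can be sketched without any clustering library. The code below is an illustration under stated assumptions, not the library's algorithm: cluster_distance_outliers is a hypothetical helper, cluster centers are taken as the coordinate-wise mean of each cluster's members, distances are Euclidean, and labels are assumed to be given already. The output keys mirror the 'cluster_distance' and 'std_devs' fields named in the Returns section.

```python
import math
import statistics


def cluster_distance_outliers(embeddings, labels, upper_multiplier=2.5):
    """Flag points unusually far from their own cluster's center."""
    # Group point indices by cluster label.
    members = {}
    for index, label in enumerate(labels):
        members.setdefault(label, []).append(index)

    flagged = {}
    for indices in members.values():
        # Cluster center: coordinate-wise mean of the cluster's embeddings.
        dims = zip(*(embeddings[i] for i in indices))
        center = [statistics.fmean(dim) for dim in dims]
        distances = [math.dist(embeddings[i], center) for i in indices]

        # The z-score is computed against this cluster's own distance distribution.
        mean = statistics.fmean(distances)
        std = statistics.pstdev(distances)
        if std == 0:
            continue  # all members are equidistant, so nothing stands out
        for i, distance in zip(indices, distances):
            std_devs = (distance - mean) / std
            if std_devs > upper_multiplier:
                flagged[i] = {"cluster_distance": distance, "std_devs": std_devs}
    return flagged


# Hypothetical 2-D embeddings: a tight cluster with one stray point, plus a small cluster.
embeddings = [[0.0, 0.0]] * 15 + [[16.0, 0.0]]
embeddings += [[10.0, 10.0], [10.0, 12.0], [12.0, 10.0], [13.0, 13.0]]
labels = [0] * 16 + [1] * 4

flagged = cluster_distance_outliers(embeddings, labels)
```

Only index 15, the stray point of the first cluster, is flagged. Because each cluster is judged against its own distance distribution, a loose cluster does not cause a tight one to be over-flagged, and the multiplier can be changed without recomputing the labels.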
-
from_stats(stats: dataeval.core.StatsResult, *, per_image: bool = True, per_target: False = ...) -> SingleOutliersOutput
-
from_stats(stats: dataeval.core.StatsResult, *, per_image: bool = True, per_target: True) -> SingleTargetOutliersOutput
-
from_stats(stats: collections.abc.Sequence[dataeval.core.StatsResult], *, per_image: bool = True, per_target: False = ...) -> MultiOutliersOutput
-
from_stats(stats: collections.abc.Sequence[dataeval.core.StatsResult], *, per_image: bool = True, per_target: True) -> MultiTargetOutliersOutput
Return indices of Outliers with the issues identified for each.
- Parameters:
- stats : StatsResult | Sequence[StatsResult]
The output(s) from compute_stats() with ImageStats.DIMENSION, PIXEL, or VISUAL flags.
- per_image : bool, default True
Whether to include item-level (image) outliers in the results.
- per_target : bool, default False
Whether to include target-level outliers in the results. When True, the .outliers accessor uses SourceIndex keys; when False, it uses plain int item indices.
- Returns:
Output class containing a DataFrame of outlier issues with columns:
  - item_index: int - Index of the outlier item
  - target_index: int | None - Index of the target within the item (None for item-level outliers)
  - metric_name: str - Name of the metric that flagged this item/target
  - metric_value: float - Value of the metric for this item/target
For multiple datasets, a dataset_index column identifies the originating dataset and item_index values are local to each dataset.
- Return type:
Example
Evaluate the dataset using pre-computed stats:
>>> from dataeval.core import compute_stats
>>> from dataeval.flags import ImageStats
>>> from dataeval.utils.thresholds import ZScoreThreshold
>>> stats = compute_stats(images, stats=ImageStats.PIXEL)
>>> outliers = Outliers(outlier_threshold=ZScoreThreshold(2.5))
>>> results = outliers.from_stats(stats)
>>> results.head(10)
shape: (10, 3)
┌────────────┬─────────────┬──────────────┐
│ item_index ┆ metric_name ┆ metric_value │
│ ---        ┆ ---         ┆ ---          │
│ i64        ┆ cat         ┆ f64          │
╞════════════╪═════════════╪══════════════╡
│ 7          ┆ entropy     ┆ 0.0          │
│ 7          ┆ mean        ┆ 0.98         │
│ 7          ┆ std         ┆ 0.0          │
│ 7          ┆ var         ┆ 0.0          │
│ 8          ┆ skew        ┆ 0.062311     │
│ 11         ┆ entropy     ┆ 0.0          │
│ 11         ┆ mean        ┆ 0.98         │
│ 11         ┆ std         ┆ 0.0          │
│ 11         ┆ var         ┆ 0.0          │
│ 18         ┆ entropy     ┆ 0.0          │
└────────────┴─────────────┴──────────────┘
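With several datasets, item_index values are local to each dataset, so a (dataset_index, item_index) pair is needed to identify an image. If a single running index over all datasets is wanted, for example to index into a concatenated array, the pair can be converted as in the sketch below. The helper name and the dataset sizes are hypothetical; this is a convenience pattern, not part of the library.

```python
from itertools import accumulate


def to_global_index(dataset_sizes, dataset_index, item_index):
    """Map a dataset-local item index to a position in the concatenated datasets."""
    if not 0 <= item_index < dataset_sizes[dataset_index]:
        raise IndexError("item_index is out of range for the given dataset")
    # offsets[k] is the number of items in all datasets before dataset k.
    offsets = [0, *accumulate(dataset_sizes)]
    return offsets[dataset_index] + item_index


# Hypothetical sizes of three datasets passed together.
sizes = [100, 50, 25]
position = to_global_index(sizes, 1, 7)
```

Item 7 of the second dataset follows the 100 items of the first, so it lands at position 107 of the concatenation.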
Classes
Outliers.Config: Configuration for Outliers detector.