dataeval.quality.Outliers
- class dataeval.quality.Outliers(flags=ImageStats.DIMENSION | ImageStats.PIXEL | ImageStats.VISUAL, outlier_method='modzscore', outlier_threshold=None)
Calculates statistical outliers of a dataset using various statistical tests applied to each image.
- Parameters:
- flags : ImageStats, default ImageStats.DIMENSION | ImageStats.PIXEL | ImageStats.VISUAL
Statistics to compute for outlier detection
- outlier_method : ["modzscore" | "zscore" | "iqr"], optional - default "modzscore"
Statistical method used to identify outliers
- outlier_threshold : float, optional - default None
Threshold value for the given outlier_method, above which data is considered an outlier. Uses the method-specific default if None.
- stats
Statistics computed during the last evaluate() call. Contains dimension, pixel, and/or visual statistics based on the flags.
- Type:
CalculationResult
- outlier_method
Statistical method used to identify outliers
- Type:
Literal["zscore", "modzscore", "iqr"]
See also
Notes
There are 3 different statistical methods:
zscore
modzscore
iqr
The z score method is based on the difference between the data point and the mean of the data. The default threshold value for zscore is 3.
Z score = \(|x_i - \mu| / \sigma\)
The modified z score method is based on the difference between the data point and the median of the data. The default threshold value for modzscore is 3.5.
Modified z score = \(0.6745 * |x_i - \tilde{x}| / MAD\), where \(\tilde{x}\) is the median of the data and \(MAD\) is the median absolute deviation
The interquartile range method is based on the difference between the data point and the interquartile range, which is the difference between the 75th and 25th percentiles (the third and first quartiles). The default threshold value for iqr is 1.5.
Interquartile range = \(threshold * (Q_3 - Q_1)\)
Examples
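As a first illustration, the three tests described in the Notes can be sketched in plain Python. This is not dataeval's implementation: the function name `flag_outliers`, the use of the population standard deviation, the quartile interpolation, and the reading of the iqr rule as "more than threshold * (Q_3 - Q_1) outside the quartiles" are all assumptions made for the sketch. Only the formulas and default thresholds come from the Notes.

```python
import statistics

# Default thresholds as stated in the Notes.
DEFAULT_THRESHOLDS = {"zscore": 3.0, "modzscore": 3.5, "iqr": 1.5}


def flag_outliers(values, method="modzscore", threshold=None):
    """Return one boolean per value, True where the value is an outlier.

    Hypothetical helper for illustration only; not part of dataeval.
    """
    if threshold is None:
        threshold = DEFAULT_THRESHOLDS[method]

    if method == "zscore":
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)  # assumption: population std
        if std == 0:
            return [False] * len(values)
        return [abs(x - mean) / std > threshold for x in values]

    if method == "modzscore":
        median = statistics.median(values)
        mad = statistics.median([abs(x - median) for x in values])
        if mad == 0:
            return [False] * len(values)
        return [0.6745 * abs(x - median) / mad > threshold for x in values]

    if method == "iqr":
        # assumption: inclusive quartile interpolation
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        margin = threshold * (q3 - q1)
        return [x < q1 - margin or x > q3 + margin for x in values]

    raise ValueError(f"unknown method: {method}")


values = [10, 11, 12, 11, 10, 12, 11, 50]
by_method = {m: flag_outliers(values, m) for m in DEFAULT_THRESHOLDS}
```

On this small sample the median-based and quartile-based tests flag the value 50, while the mean-based z score does not, because the extreme value inflates the mean and standard deviation it is measured against.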
Initialize the Outliers class:
>>> outliers = Outliers()
Specifying an outlier method:
>>> outliers = Outliers(outlier_method="iqr")
Specifying an outlier method and threshold:
>>> outliers = Outliers(outlier_method="zscore", outlier_threshold=3.5)
- evaluate(data, *, per_image=True, per_target=True)
Returns indices of Outliers with the issues identified for each.
Computes statistical outliers by calculating dimension, pixel, and/or visual statistics for the dataset, then applying the configured outlier detection method. Stores computed statistics in the stats attribute.
- Parameters:
- data : Dataset[ArrayLike] or Dataset[tuple[ArrayLike, Any, Any]]
Dataset of images in array format. Can be image-only dataset or dataset with additional tuple elements (labels, metadata). Images should be in standard array format (C, H, W).
- per_image : bool, default True
Whether to compute statistics for full items (images/videos). When True, item-level outliers will be detected.
- per_target : bool, default True
Whether to compute statistics for individual targets/detections. When True and targets are present, target-level outliers will be detected. Has no effect for datasets without targets.
- Returns:
Output class containing the indices of outliers and a dictionary showing the issues and calculated values for the given index.
- Return type:
Examples
Basic outlier detection:
>>> outliers = Outliers(outlier_method="zscore", outlier_threshold=3.5)
>>> results = outliers.evaluate(outlier_images)
>>> results.issues
shape: (9, 3)
┌─────────┬─────────────┬──────────────┐
│ item_id ┆ metric_name ┆ metric_value │
│ ---     ┆ ---         ┆ ---          │
│ i64     ┆ cat         ┆ f64          │
╞═════════╪═════════════╪══════════════╡
│ 10      ┆ contrast    ┆ 1.25         │
│ 10      ┆ entropy     ┆ 0.212769     │
│ 10      ┆ zeros       ┆ 0.054932     │
│ 12      ┆ contrast    ┆ 1.25         │
│ 12      ┆ entropy     ┆ 0.212769     │
│ 12      ┆ sharpness   ┆ 1.509766     │
│ 12      ┆ std         ┆ 0.00536      │
│ 12      ┆ var         ┆ 0.000029     │
│ 12      ┆ zeros       ┆ 0.054932     │
└─────────┴─────────────┴──────────────┘
Access computed statistics for reuse:
>>> saved_stats = outliers.stats
- from_clusters(embeddings, cluster_result, threshold=None)
Find outliers using cluster-based adaptive distance detection.
Identifies outliers based on their distance from cluster centers in embedding space. Points that are unusually far from their nearest cluster center are flagged as outliers. This method is particularly effective for finding semantic or visual outliers in image embeddings.
- Parameters:
- embeddings : ArrayND[float]
The embedding vectors used for clustering, shape (n_samples, n_features). Should be the same embeddings passed to the cluster() function.
- cluster_result : ClusterResult
Clustering results from the cluster() function, containing cluster assignments and related metadata.
- threshold : float, default=2.5
Number of standard deviations beyond cluster mean to use for outlier threshold. Higher values are more permissive (fewer outliers), lower values are stricter (more outliers). Typical range: 1.5-3.5.
- Returns:
Output containing outlier indices and their issue details. Each outlier includes:
- 'cluster_distance': the distance from the cluster mean
- 'std_devs': the number of standard deviations from the mean
- Return type:
OutliersOutput[IndexIssueMap]
See also
dataeval.core.cluster: Function to compute clusters from embeddings
dataeval.core.compute_cluster_stats: Computes statistics for adaptive detection
from_stats: Find outliers from pre-computed image statistics
evaluate: Find outliers by computing statistics from images
Notes
This method uses adaptive distance-based outlier detection that accounts for varying cluster densities. It significantly reduces false outliers compared to using HDBSCAN’s binary -1 labels, especially for image embeddings with varying density distributions.
The threshold parameter allows experimentation with different sensitivity levels without recomputing clusters. Recommended values:
- 1.5-2.0: Very strict (many outliers)
- 2.5: Balanced (default)
- 3.0-3.5: Permissive (fewer outliers)
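The general idea of a per-cluster distance threshold can be sketched as follows. This is a simplified illustration, not the library's algorithm: the function name `cluster_distance_outliers`, the plain list of labels standing in for a ClusterResult, the use of each cluster's member mean as its center, and the cutoff of mean distance plus threshold standard deviations are all assumptions.

```python
import math
import statistics
from collections import defaultdict


def cluster_distance_outliers(embeddings, labels, threshold=2.5):
    """Flag points unusually far from the center of their own cluster.

    Hypothetical helper: `labels` is a plain list of cluster ids, one per
    embedding. Returns {index: {"cluster_distance": ..., "std_devs": ...}}.
    """
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    outliers = {}
    for idxs in members.values():
        dim = len(embeddings[idxs[0]])
        # assumption: the cluster center is the mean of its members
        center = [
            statistics.fmean([embeddings[i][d] for i in idxs])
            for d in range(dim)
        ]
        dists = [math.dist(embeddings[i], center) for i in idxs]
        mean_d = statistics.fmean(dists)
        std_d = statistics.pstdev(dists)
        if std_d == 0:
            continue  # every member is equally far; nothing stands out
        for i, d in zip(idxs, dists):
            std_devs = (d - mean_d) / std_d
            if std_devs > threshold:
                outliers[i] = {"cluster_distance": d, "std_devs": std_devs}
    return outliers


# Nine identical points and one distant point, all in a single cluster.
points = [[0.0, 0.0]] * 9 + [[10.0, 0.0]]
labels = [0] * 10
found = cluster_distance_outliers(points, labels)
strict = cluster_distance_outliers(points, labels, threshold=3.5)
```

Because the distances are computed once and only the comparison depends on `threshold`, changing the sensitivity does not require recomputing clusters, which mirrors the behavior described in the Notes.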
- from_stats(stats: dataeval.core._calculate.CalculationResult) -> OutliersOutput[polars.DataFrame]
- from_stats(stats: collections.abc.Sequence[dataeval.core._calculate.CalculationResult]) -> OutliersOutput[list[polars.DataFrame]]
Returns indices of Outliers with the issues identified for each.
- Parameters:
- stats : CalculationResult | Sequence[CalculationResult]
The output(s) from calculate() with ImageStats.DIMENSION, PIXEL, or VISUAL flags
- Returns:
Output class containing a DataFrame of outlier issues with columns:
- item_id: int - Index of the outlier image
- target_id: int | None - Index of the target within the image (None for image-level outliers)
- metric_name: str - Name of the metric that flagged this image/target
- metric_value: float - Value of the metric for this image/target
- Return type:
Example
Evaluate the dataset:
>>> outliers = Outliers(outlier_method="zscore", outlier_threshold=3.5)
>>> results = outliers.from_stats([stats1, stats2])
>>> len(results)
2
>>> results.issues[0]
shape: (6, 3)
┌─────────┬─────────────┬──────────────┐
│ item_id ┆ metric_name ┆ metric_value │
│ ---     ┆ ---         ┆ ---          │
│ i64     ┆ cat         ┆ f64          │
╞═════════╪═════════════╪══════════════╡
│ 10      ┆ entropy     ┆ 0.212769     │
│ 10      ┆ zeros       ┆ 0.054932     │
│ 12      ┆ entropy     ┆ 0.212769     │
│ 12      ┆ std         ┆ 0.00536      │
│ 12      ┆ var         ┆ 0.000029     │
│ 12      ┆ zeros       ┆ 0.054932     │
└─────────┴─────────────┴──────────────┘
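A long-format issues table like the one above is often easier to review once the rows are grouped per image. The sketch below does this in plain Python on rows transcribed from the example output; `metrics_by_item` is a hypothetical helper, not part of dataeval, and it ignores the target_id column, so it only suits image-level results.

```python
from collections import defaultdict


def metrics_by_item(rows):
    """Group issue rows into {item_id: {metric_name: metric_value}}.

    Hypothetical helper; each row is a dict keyed by the issue columns.
    """
    grouped = defaultdict(dict)
    for row in rows:
        grouped[row["item_id"]][row["metric_name"]] = row["metric_value"]
    return dict(grouped)


# Rows copied from the example output above.
rows = [
    {"item_id": 10, "metric_name": "entropy", "metric_value": 0.212769},
    {"item_id": 10, "metric_name": "zeros", "metric_value": 0.054932},
    {"item_id": 12, "metric_name": "entropy", "metric_value": 0.212769},
    {"item_id": 12, "metric_name": "std", "metric_value": 0.00536},
    {"item_id": 12, "metric_name": "var", "metric_value": 0.000029},
    {"item_id": 12, "metric_name": "zeros", "metric_value": 0.054932},
]
summary = metrics_by_item(rows)
```

Assuming each entry of `results.issues` is a polars DataFrame, as the from_stats signature suggests, `results.issues[0].iter_rows(named=True)` should yield rows in the shape this helper expects.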