dataeval.shift.OODKNeighbors

class dataeval.shift.OODKNeighbors(k=None, distance_metric=None, threshold_perc=None, extractor=None, config=None)

K-Nearest Neighbors Out-of-Distribution detector.

Scores each sample by the average distance to its k nearest neighbors in the reference (in-distribution) set, measured in embedding space. The larger that average distance, the more likely the sample is to be OOD.
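The scoring idea described above can be sketched in plain NumPy. This is a minimal illustration and not dataeval's implementation: the helper name `knn_mean_distance` is hypothetical, and the brute-force distance matrix stands in for whatever index the library builds internally.

```python
import numpy as np


def knn_mean_distance(reference, queries, k=10, metric="cosine"):
    """Mean distance from each query row to its k nearest reference rows."""
    reference = np.asarray(reference, dtype=np.float64)
    queries = np.asarray(queries, dtype=np.float64)
    if metric == "cosine":
        # Cosine distance = 1 - cosine similarity of L2-normalised vectors.
        ref_unit = reference / np.linalg.norm(reference, axis=1, keepdims=True)
        qry_unit = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        dists = 1.0 - qry_unit @ ref_unit.T
    elif metric == "euclidean":
        diffs = queries[:, None, :] - reference[None, :, :]
        dists = np.linalg.norm(diffs, axis=2)
    else:
        raise ValueError(f"unsupported metric: {metric!r}")
    # Keep the k smallest distances per query; their order does not matter.
    nearest = np.partition(dists, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)


rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 16))
near = rng.normal(size=(5, 16))           # drawn like the reference set
far = rng.normal(loc=6.0, size=(5, 16))   # shifted away from it

near_scores = knn_mean_distance(reference, near, k=10, metric="euclidean")
far_scores = knn_mean_distance(reference, far, k=10, metric="euclidean")
```

In this toy setup the shifted samples receive larger scores than the samples drawn from the reference distribution, which is the behaviour the detector relies on.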

Based on the methodology from: “Back to the Basics: Revisiting Out-of-Distribution Detection Baselines” (Kuan & Mueller, 2022)

As referenced in: “Safe AI for coral reefs: Benchmarking out-of-distribution detection algorithms for coral reef image surveys”

Parameters:

k : int, default 10: Number of nearest neighbors to consider.
distance_metric : "cosine" | "euclidean", default "cosine": Distance metric to use.
threshold_perc : float or None, default None: Percentage of reference data considered normal (0-100). Higher values raise the threshold, so fewer samples are flagged as OOD. If None, uses config.threshold_perc (default 95.0).
extractor : FeatureExtractor or None, default None: Feature extractor for transforming input data before scoring. When provided, raw data is passed through the extractor in both fit() and score()/predict(). When None, data is used as-is (must be array-like embeddings).
config : OODKNeighbors.Config or None, default None: Optional configuration object with default parameters. Parameters specified directly in __init__ override the config defaults.
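The override rule for config can be pictured as "an explicit argument wins, otherwise fall back to the config value". The sketch below assumes that reading; `SketchConfig` and `resolve` are hypothetical names that are not part of dataeval, and the defaults simply mirror the values documented above.

```python
from dataclasses import dataclass


@dataclass
class SketchConfig:
    # Hypothetical stand-in for OODKNeighbors.Config; defaults as documented above.
    k: int = 10
    distance_metric: str = "cosine"
    threshold_perc: float = 95.0


def resolve(explicit, fallback):
    """Return the explicit argument unless it is None."""
    return fallback if explicit is None else explicit


config = SketchConfig(k=15)

# Argument left as None: the config value is used.
k = resolve(None, config.k)
# Argument passed explicitly: it takes precedence over the config value.
threshold_perc = resolve(99.0, config.threshold_perc)
```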

Examples

>>> from dataeval.shift import OODKNeighbors
>>> import numpy as np
>>>
>>> # Create reference embeddings (in-distribution)
>>> ref_embeddings = np.random.randn(100, 128).astype(np.float32)
>>>
>>> # Fit the detector
>>> detector = OODKNeighbors(k=10, distance_metric="cosine", threshold_perc=95.0)
>>> detector.fit(ref_embeddings)
OODKNeighbors(k=10, distance_metric='cosine', threshold_perc=95.0, extractor=None, fitted=True)
>>>
>>> # Score new samples
>>> test_embeddings = np.random.randn(20, 128).astype(np.float32)
>>> scores = detector.score(test_embeddings)
>>> predictions = detector.predict(test_embeddings)

Using configuration:

>>> config = OODKNeighbors.Config(k=15, distance_metric="euclidean", threshold_perc=99.0)
>>> detector = OODKNeighbors(config=config)
>>> detector.fit(ref_embeddings)
OODKNeighbors(k=15, distance_metric='euclidean', threshold_perc=99.0, extractor=None, fitted=True)

fit(reference_data)

Fit the detector using reference (in-distribution) data.

Builds a k-NN index for efficient nearest neighbor search and computes reference scores for automatic thresholding.

Parameters:

reference_data : Any: Reference (in-distribution) data. When an extractor is configured, this can be any data type accepted by the extractor. Otherwise, must be array-like embeddings.

Returns:

The fitted detector (for method chaining).

Return type:

Self
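As a rough picture of the "reference scores for automatic thresholding" that fit() computes, the sketch below scores every reference row against the other reference rows and takes a percentile of the result. It is an assumption-laden illustration: the helper `reference_scores` is hypothetical, it uses brute-force Euclidean distances, and excluding each row's zero distance to itself is an assumption that this page does not confirm.

```python
import numpy as np


def reference_scores(reference, k):
    """Mean Euclidean distance from each reference row to its k nearest other rows."""
    reference = np.asarray(reference, dtype=np.float64)
    diffs = reference[:, None, :] - reference[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    # Assumption: ignore each row's zero distance to itself.
    np.fill_diagonal(dists, np.inf)
    nearest = np.partition(dists, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)


rng = np.random.default_rng(0)
reference = rng.normal(size=(100, 8))

ref_scores = reference_scores(reference, k=10)
# With threshold_perc=95.0, 95% of the reference scores fall at or below the threshold.
threshold = np.percentile(ref_scores, 95.0)
```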

predict(data, batch_size=None, ood_type='instance')

Predict whether instances are out-of-distribution.

Parameters:

data : ArrayLike: Input data for OOD prediction.
batch_size : int or None, default None: Number of instances to process per batch (only used by some detectors). When None, uses the global batch size from get_batch_size().
ood_type : "feature" | "instance", default "instance": Predict OOD at the "feature" or "instance" level.

Returns:

Predictions including is_ood boolean array and OOD scores.

Return type:

OODOutput
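How a percentile threshold turns scores into is_ood flags can be illustrated with stand-in numbers. The arrays below are synthetic rather than real detector output, and the strict `>` comparison is an assumption about how ties are handled.

```python
import numpy as np

rng = np.random.default_rng(0)
ref_scores = rng.gamma(shape=2.0, scale=1.0, size=1000)  # stand-in reference scores
test_scores = np.array([0.5, 2.0, 50.0])                 # stand-in test scores

# A higher percentage of "normal" reference data gives a higher threshold.
threshold_95 = np.percentile(ref_scores, 95.0)
threshold_99 = np.percentile(ref_scores, 99.0)

# Assumption: a sample is flagged when its score exceeds the threshold.
is_ood_95 = test_scores > threshold_95
is_ood_99 = test_scores > threshold_99
```

Raising the percentage from 95 to 99 can only shrink the set of flagged samples, which is what "more permissive" means for threshold_perc.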

score(data, batch_size=None)

Compute out-of-distribution scores for a given dataset.

Parameters:

data : ArrayLike: Input data to score.
batch_size : int or None, default None: Number of instances to process per batch (only used by some detectors). When None, uses the global batch size from get_batch_size().

Returns:

Instance-level (and optionally feature-level) OOD scores. Higher scores indicate samples more likely to be OOD.

Return type:

OODScoreOutput

property reference_embeddings : numpy.typing.NDArray[numpy.float32]

Reference embeddings stored by the scorer.

Classes

`Config`	Configuration for OODKNeighbors detector.