dataeval.shift.OODKNeighbors

class dataeval.shift.OODKNeighbors(k=None, distance_metric=None, config=None)

K-Nearest Neighbors Out-of-Distribution detector.

Uses the average distance to the k nearest neighbors in embedding space to detect out-of-distribution (OOD) samples. Samples with a larger average distance to their k nearest neighbors in the reference (in-distribution) set are considered more likely to be OOD.
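The scoring idea described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `knn_mean_distance` is hypothetical, and it uses a brute-force distance matrix rather than a k-NN index.

```python
import numpy as np


def knn_mean_distance(reference, queries, k=10, metric="cosine"):
    """Hypothetical helper: mean distance from each query row to its k nearest reference rows."""
    reference = np.asarray(reference, dtype=np.float64)
    queries = np.asarray(queries, dtype=np.float64)
    if metric == "cosine":
        # Cosine distance is 1 minus the cosine similarity of L2-normalized rows.
        ref_n = reference / np.linalg.norm(reference, axis=1, keepdims=True)
        qry_n = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        dists = 1.0 - qry_n @ ref_n.T
    elif metric == "euclidean":
        # Brute-force pairwise Euclidean distances, shape (n_queries, n_reference).
        diff = queries[:, None, :] - reference[None, :, :]
        dists = np.linalg.norm(diff, axis=2)
    else:
        raise ValueError(f"unsupported metric: {metric!r}")
    # np.partition places the k smallest distances of each row in the first k columns.
    nearest = np.partition(dists, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)


rng = np.random.default_rng(0)
reference = rng.normal(size=(200, 16))
in_dist = rng.normal(size=(20, 16))
shifted = rng.normal(loc=5.0, size=(20, 16))  # far from the reference cloud

in_scores = knn_mean_distance(reference, in_dist, k=10, metric="euclidean")
shifted_scores = knn_mean_distance(reference, shifted, k=10, metric="euclidean")
cosine_scores = knn_mean_distance(reference, in_dist, k=10, metric="cosine")
```

In this sketch the shifted samples receive larger scores than the in-distribution samples, which is the behavior the detector relies on.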

Based on the methodology from: “Back to the Basics: Revisiting Out-of-Distribution Detection Baselines” (Kuan & Mueller, 2022)

As referenced in: “Safe AI for coral reefs: Benchmarking out-of-distribution detection algorithms for coral reef image surveys”

Parameters:

k : int, default 10: Number of nearest neighbors to consider
distance_metric : "cosine" | "euclidean", default "cosine": Distance metric to use
config : OODKNeighbors.Config or None, default None: Optional configuration object with default parameters. Parameters specified directly in __init__ will override config defaults.
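The override rule for `config` can be illustrated with a small stand-alone sketch. The names `SketchConfig` and `resolve` are hypothetical and only mirror the documented behavior (explicit arguments win, otherwise the config value is used, otherwise the documented default); the library's actual resolution logic may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a detector config with the documented defaults."""
    k: int = 10
    distance_metric: str = "cosine"
    threshold_perc: float = 95.0


def resolve(k: Optional[int] = None,
            distance_metric: Optional[str] = None,
            config: Optional[SketchConfig] = None) -> SketchConfig:
    """Merge explicit arguments over config values; None means 'not specified'."""
    base = config if config is not None else SketchConfig()
    return SketchConfig(
        k=k if k is not None else base.k,
        distance_metric=distance_metric if distance_metric is not None else base.distance_metric,
        threshold_perc=base.threshold_perc,
    )


# k is given explicitly, so it overrides the config; distance_metric comes from the config.
resolved = resolve(k=5, config=SketchConfig(k=15, distance_metric="euclidean"))
defaults = resolve()
```

Using None as the signature default is what lets the constructor tell "not specified" apart from an explicit value.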

Examples

>>> from dataeval.shift import OODKNeighbors
>>> import numpy as np
>>>
>>> # Create reference embeddings (in-distribution)
>>> ref_embeddings = np.random.randn(100, 128).astype(np.float32)
>>>
>>> # Fit the detector
>>> detector = OODKNeighbors(k=10, distance_metric="cosine")
>>> detector.fit(ref_embeddings, threshold_perc=95.0)
>>>
>>> # Score new samples
>>> test_embeddings = np.random.randn(20, 128).astype(np.float32)
>>> scores = detector.score(test_embeddings)
>>> predictions = detector.predict(test_embeddings)

Using configuration:

>>> config = OODKNeighbors.Config(k=15, distance_metric="euclidean", threshold_perc=99.0)
>>> detector = OODKNeighbors(config=config)
>>> detector.fit(ref_embeddings)  # Uses config.threshold_perc

fit(embeddings, threshold_perc=None)

Fit the detector using reference (in-distribution) embeddings.

Builds a k-NN index for efficient nearest neighbor search and computes reference scores for automatic thresholding.

Parameters:

embeddings : Array: Reference (in-distribution) embeddings
threshold_perc : float or None, default None: Percentage of reference data treated as in-distribution (0-100). Higher values raise the threshold, so fewer samples are flagged as OOD. If None, uses config.threshold_perc (default 95.0).
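As a rough illustration of how a percentage of "normal" reference data can translate into a score threshold, the sketch below takes a percentile of the reference scores. It assumes the threshold is the `threshold_perc`-th percentile and that samples scoring strictly above it are flagged; both details are assumptions rather than confirmed library behavior, and `percentile_threshold` is a hypothetical helper.

```python
import numpy as np


def percentile_threshold(reference_scores, threshold_perc=95.0):
    """Hypothetical helper: score below which threshold_perc percent of reference scores fall."""
    return float(np.percentile(reference_scores, threshold_perc))


# Toy reference scores 1.0, 2.0, ..., 100.0.
reference_scores = np.arange(1.0, 101.0)
threshold = percentile_threshold(reference_scores, 95.0)
stricter = percentile_threshold(reference_scores, 99.0)

# Flag new scores that exceed the threshold (assumed comparison).
new_scores = np.array([50.0, 99.0])
flags = new_scores > threshold
```

Raising `threshold_perc` from 95 to 99 moves the threshold up, so a new sample needs a larger score before it is flagged.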

predict(x, batch_size=int(1e10), ood_type='instance')

Predict whether instances are out-of-distribution.

Parameters:

x : ArrayLike: Input data for OOD prediction.
batch_size : int, default 1e10: Number of instances to process per batch (only used by some detectors).
ood_type : "feature" | "instance", default "instance": Predict OOD at the "feature" or "instance" level.

Returns:

Predictions including is_ood boolean array and OOD scores.

Return type:

OODOutput

score(x, batch_size=int(1e10))

Compute out-of-distribution scores for a given dataset.

Parameters:

x : ArrayLike: Input data to score.
batch_size : int, default 1e10: Number of instances to process per batch (only used by some detectors).

Returns:

Instance-level (and optionally feature-level) OOD scores. Higher scores indicate samples more likely to be OOD.

Return type:

OODScoreOutput

property reference_embeddings : numpy.typing.NDArray[numpy.float32]

Reference embeddings stored by the scorer.

Classes

`Config`	Configuration for OODKNeighbors detector.