dataeval.shift.OODKNeighbors

class dataeval.shift.OODKNeighbors(k=None, distance_metric=None, config=None)

K-Nearest Neighbors Out-of-Distribution detector.

Scores each sample by its average distance to its k nearest neighbors in the reference (in-distribution) embedding set. Samples with larger average distances are considered more likely to be out-of-distribution (OOD).
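The scoring idea can be sketched in plain NumPy. This is a minimal illustration of a mean k-nearest-neighbor distance, not the library's implementation: the helper name `knn_ood_scores`, the brute-force distance matrix, and the synthetic clusters are all assumptions made for the example.

```python
import numpy as np


def knn_ood_scores(reference, queries, k=10):
    """Mean cosine distance from each query row to its k nearest reference rows.

    Hypothetical helper for illustration only; not part of dataeval.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    ref = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    qry = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    # Pairwise cosine distance, shape (n_queries, n_reference).
    dist = 1.0 - qry @ ref.T
    # Keep the k smallest distances per query (unordered), then average them.
    nearest = np.partition(dist, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)


rng = np.random.default_rng(0)
reference = rng.normal(loc=3.0, size=(200, 16))  # in-distribution cluster
near = rng.normal(loc=3.0, size=(5, 16))         # drawn from the same cluster
far = rng.normal(loc=-3.0, size=(5, 16))         # drawn from a shifted cluster

near_scores = knn_ood_scores(reference, near)
far_scores = knn_ood_scores(reference, far)
```

In this toy setup the samples from the shifted cluster receive larger scores than the samples from the reference cluster, which is the behavior the detector relies on.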

Based on the methodology from: “Back to the Basics: Revisiting Out-of-Distribution Detection Baselines” (Kuan & Mueller, 2022)

As referenced in: “Safe AI for coral reefs: Benchmarking out-of-distribution detection algorithms for coral reef image surveys”

Parameters:
k : int, default 10

Number of nearest neighbors to consider

distance_metric : "cosine" | "euclidean", default "cosine"

Distance metric to use

config : OODKNeighbors.Config or None, default None

Optional configuration object with default parameters. Parameters specified directly in __init__ will override config defaults.
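The choice of distance_metric matters when embedding magnitudes vary. The sketch below uses two hypothetical helper functions (not dataeval APIs) to show that cosine distance depends only on direction, while Euclidean distance also reflects magnitude.

```python
import numpy as np


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between a and b."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to vector magnitude."""
    return np.linalg.norm(a - b)


a = np.array([1.0, 2.0, 3.0])
scaled = 10.0 * a  # same direction, ten times the magnitude

cos_d = cosine_distance(a, scaled)    # approximately zero
euc_d = euclidean_distance(a, scaled)  # grows with the difference in magnitude
```

If the embeddings come from a model whose output norms are not meaningful, cosine distance is often the more natural choice; if magnitude carries information, Euclidean distance preserves it.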

Examples

>>> from dataeval.shift import OODKNeighbors
>>> import numpy as np
>>>
>>> # Create reference embeddings (in-distribution)
>>> ref_embeddings = np.random.randn(100, 128).astype(np.float32)
>>>
>>> # Fit the detector
>>> detector = OODKNeighbors(k=10, distance_metric="cosine")
>>> detector.fit(ref_embeddings, threshold_perc=95.0)
>>>
>>> # Score new samples
>>> test_embeddings = np.random.randn(20, 128).astype(np.float32)
>>> scores = detector.score(test_embeddings)
>>> predictions = detector.predict(test_embeddings)

Using configuration:

>>> config = OODKNeighbors.Config(k=15, distance_metric="euclidean", threshold_perc=99.0)
>>> detector = OODKNeighbors(config=config)
>>> detector.fit(ref_embeddings)  # Uses config.threshold_perc

fit(embeddings, threshold_perc=None)

Fit the detector using reference (in-distribution) embeddings.

Builds a k-NN index for efficient nearest neighbor search and computes reference scores for automatic thresholding.

Parameters:
embeddings : Array

Reference (in-distribution) embeddings

threshold_perc : float or None, default None

Percentage of reference data considered normal (0-100). Higher values result in more permissive thresholds. If None, uses config.threshold_perc (default 95.0).
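One way to read threshold_perc is as a percentile cut on the reference scores. The sketch below assumes the threshold is the given percentile of the reference scores and that samples scoring above it are flagged; the exact computation inside the library may differ, and the gamma-distributed scores are only a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the scores a detector would compute on the reference set.
ref_scores = rng.gamma(shape=2.0, scale=1.0, size=1000)

threshold_perc = 95.0
# Assumed rule: the threshold is the threshold_perc-th percentile of reference scores.
threshold = np.percentile(ref_scores, threshold_perc)

# Fraction of reference samples that would be flagged under this rule.
flagged_fraction = np.mean(ref_scores > threshold)

# A higher percentage yields a higher, more permissive threshold.
stricter = np.percentile(ref_scores, 90.0)
looser = np.percentile(ref_scores, 99.0)
```

Under this reading, a value of 95.0 treats roughly 5% of the reference data as beyond the threshold, and raising the value flags fewer samples.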

predict(x, batch_size=int(1e10), ood_type='instance')

Predict whether instances are out-of-distribution.

Parameters:
x : ArrayLike

Input data for OOD prediction.

batch_size : int, default 1e10

Number of instances to process per batch (only used by some detectors).

ood_type : "feature" | "instance", default "instance"

Predict OOD at the "feature" or "instance" level.

Returns:

Predictions, including the is_ood boolean array and the OOD scores.

Return type:

OODOutput

score(x, batch_size=int(1e10))

Compute out-of-distribution scores for a given dataset.

Parameters:
x : ArrayLike

Input data to score.

batch_size : int, default 1e10

Number of instances to process per batch (only used by some detectors).

Returns:

Instance-level (and optionally feature-level) OOD scores. Higher scores indicate samples more likely to be OOD.

Return type:

OODScoreOutput
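For detectors that do use batch_size, the general pattern is to score consecutive slices of the input and concatenate the results. The following sketch shows that pattern with hypothetical helpers (`score_in_batches` and `toy_score` are not dataeval functions); whether OODKNeighbors itself batches its computation is not stated here.

```python
import numpy as np


def score_in_batches(score_fn, x, batch_size):
    """Apply score_fn to consecutive row slices of x and concatenate the results."""
    chunks = [
        score_fn(x[start:start + batch_size])
        for start in range(0, len(x), batch_size)
    ]
    return np.concatenate(chunks)


def toy_score(batch):
    # Hypothetical per-row score: the L2 norm of each row.
    return np.linalg.norm(batch, axis=1)


x = np.arange(35, dtype=np.float64).reshape(7, 5)
batched = score_in_batches(toy_score, x, batch_size=3)
unbatched = toy_score(x)
```

Because each row is scored independently, the batched and unbatched results match; batching only limits how many rows are held in memory at once.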

property reference_embeddings : numpy.typing.NDArray[numpy.float32]

Reference embeddings stored by the scorer.

Classes

Config

Configuration for OODKNeighbors detector.