dataeval.shift.OODKNeighbors

class dataeval.shift.OODKNeighbors(k=None, distance_metric=None, threshold_perc=None, extractor=None, config=None)

K-Nearest Neighbors Out-of-Distribution detector.

Uses the average distance to the k nearest neighbors in embedding space to detect OOD samples. A sample whose average distance to its k nearest neighbors in the reference (in-distribution) set is larger is considered more likely to be OOD.
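The scoring idea described above can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation: the helper names (cosine_distance, euclidean_distance, knn_score) are hypothetical, and the brute-force search stands in for whatever index the detector actually builds.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Return the straight-line distance between two vectors."""
    return math.dist(a, b)


def knn_score(sample, reference, k, distance=cosine_distance):
    """Average distance from `sample` to its k nearest reference vectors.

    Hypothetical helper: a brute-force search used only for illustration.
    """
    if not 1 <= k <= len(reference):
        raise ValueError("k must be between 1 and the reference set size")
    distances = sorted(distance(sample, ref) for ref in reference)
    return sum(distances[:k]) / k


# Toy 2-D "embeddings" that all point roughly along the x-axis.
reference = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [1.0, 0.1]]

near = knn_score([0.95, 0.05], reference, k=2)  # similar direction
far = knn_score([-1.0, 0.5], reference, k=2)    # very different direction
```

In this toy setup the second sample receives the larger score, which matches the statement that larger average distances indicate samples more likely to be OOD.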

Based on the methodology from: “Back to the Basics: Revisiting Out-of-Distribution Detection Baselines” (Kuan & Mueller, 2022)

As referenced in: “Safe AI for coral reefs: Benchmarking out-of-distribution detection algorithms for coral reef image surveys”

Parameters:
k : int, default 10

Number of nearest neighbors to average over when scoring a sample.

distance_metric : "cosine" | "euclidean", default "cosine"

Distance metric used for the nearest neighbor search.

threshold_perc : float or None, default None

Percentage of reference data considered normal (0-100). Higher values result in more permissive thresholds. If None, uses config.threshold_perc (default 95.0).

extractor : FeatureExtractor or None, default None

Feature extractor for transforming input data before scoring. When provided, raw data is passed through the extractor in both fit() and score()/predict(). When None, data is used as-is (must be array-like embeddings).

config : OODKNeighbors.Config or None, default None

Optional configuration object with default parameters. Parameters specified directly in __init__ will override config defaults.
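The override rule for config can be pictured as follows. This is a minimal sketch, assuming that an argument left as None falls back to the config value; SketchConfig and resolve are hypothetical stand-ins, not part of dataeval, and the default values are taken from the parameter descriptions above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for OODKNeighbors.Config."""

    k: int = 10
    distance_metric: str = "cosine"
    threshold_perc: float = 95.0


def resolve(
    k: Optional[int] = None,
    distance_metric: Optional[str] = None,
    threshold_perc: Optional[float] = None,
    config: Optional[SketchConfig] = None,
) -> SketchConfig:
    """Merge explicit arguments with config defaults.

    Assumption: any argument that is not None takes precedence over
    the corresponding config field.
    """
    base = config if config is not None else SketchConfig()
    return SketchConfig(
        k=base.k if k is None else k,
        distance_metric=(
            base.distance_metric if distance_metric is None else distance_metric
        ),
        threshold_perc=(
            base.threshold_perc if threshold_perc is None else threshold_perc
        ),
    )


custom = SketchConfig(k=15, distance_metric="euclidean", threshold_perc=99.0)
defaults_only = resolve()                 # every field from the defaults
overridden = resolve(k=5, config=custom)  # k overridden, the rest from config
```

Here overridden keeps the "euclidean" metric and the 99.0 threshold from custom but uses k=5, because only k was passed explicitly.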

Examples

>>> from dataeval.shift import OODKNeighbors
>>> import numpy as np
>>>
>>> # Create reference embeddings (in-distribution)
>>> ref_embeddings = np.random.randn(100, 128).astype(np.float32)
>>>
>>> # Fit the detector
>>> detector = OODKNeighbors(k=10, distance_metric="cosine", threshold_perc=95.0)
>>> detector.fit(ref_embeddings)
OODKNeighbors(k=10, distance_metric='cosine', threshold_perc=95.0, extractor=None, fitted=True)
>>>
>>> # Score new samples
>>> test_embeddings = np.random.randn(20, 128).astype(np.float32)
>>> scores = detector.score(test_embeddings)
>>> predictions = detector.predict(test_embeddings)

Using configuration:

>>> config = OODKNeighbors.Config(k=15, distance_metric="euclidean", threshold_perc=99.0)
>>> detector = OODKNeighbors(config=config)
>>> detector.fit(ref_embeddings)
OODKNeighbors(k=15, distance_metric='euclidean', threshold_perc=99.0, extractor=None, fitted=True)

fit(reference_data)

Fit the detector using reference (in-distribution) data.

Builds a k-NN index for efficient nearest neighbor search and computes reference scores for automatic thresholding.

Parameters:
reference_data : Any

Reference (in-distribution) data. When an extractor is configured, this can be any data type accepted by the extractor. Otherwise, must be array-like embeddings.

Returns:

The fitted detector (for method chaining).

Return type:

Self
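One way to picture the automatic thresholding step is as a percentile over the reference scores. The sketch below is an assumption about the general technique, not a description of the library's internals: percentile and flag_ood are hypothetical helpers, the reference scores are made-up numbers, and the strict "greater than" comparison is a choice made for illustration.

```python
def percentile(values, perc):
    """Linear-interpolation percentile of `values` for perc in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * perc / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    frac = rank - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


def flag_ood(scores, threshold):
    """Flag scores strictly above the threshold (assumed comparison)."""
    return [score > threshold for score in scores]


# Made-up average k-NN distances for ten reference samples.
reference_scores = [0.10, 0.12, 0.11, 0.15, 0.30, 0.13, 0.14, 0.12, 0.11, 0.16]

# threshold_perc=95.0 treats roughly 95% of the reference scores as normal.
threshold = percentile(reference_scores, 95.0)

# A score typical of the reference set, and one far outside it.
is_ood = flag_ood([0.12, 0.45], threshold)
```

Raising the percentile moves the threshold up, so fewer samples are flagged; this is the sense in which higher threshold_perc values are more permissive.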

predict(data, batch_size=None, ood_type='instance')

Predict whether instances are out-of-distribution.

Parameters:
data : ArrayLike

Input data for OOD prediction.

batch_size : int or None, default None

Number of instances to process per batch (only used by some detectors). When None, uses the global batch size from get_batch_size().

ood_type : "feature" | "instance", default "instance"

Predict OOD at the "feature" or "instance" level.

Returns:

Predictions including is_ood boolean array and OOD scores.

Return type:

OODOutput

score(data, batch_size=None)

Compute out-of-distribution scores for a given dataset.

Parameters:
data : ArrayLike

Input data to score.

batch_size : int or None, default None

Number of instances to process per batch (only used by some detectors). When None, uses the global batch size from get_batch_size().

Returns:

Instance-level (and optionally feature-level) OOD scores. Higher scores indicate samples more likely to be OOD.

Return type:

OODScoreOutput

property reference_embeddings : numpy.typing.NDArray[numpy.float32]

Reference embeddings stored by the scorer.

Classes

Config

Configuration for OODKNeighbors detector.