dataeval.detectors.ood.OOD_KNN

class dataeval.detectors.ood.OOD_KNN(k=10, distance_metric='cosine')

K-Nearest Neighbors Out-of-Distribution detector.

Uses the average distance (cosine by default, or Euclidean) to the k nearest neighbors in embedding space to detect OOD samples. Samples with larger average distances to their k nearest neighbors in the reference (in-distribution) set are considered more likely to be OOD.

Based on the methodology from: “Back to the Basics: Revisiting Out-of-Distribution Detection Baselines” (Kuan & Mueller, 2022)

As referenced in: “Safe AI for coral reefs: Benchmarking out-of-distribution detection algorithms for coral reef image surveys”
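The scoring idea described above can be sketched in plain NumPy. This is an illustrative brute-force version, not DataEval's implementation: the function name `knn_ood_scores` and the toy data are hypothetical, and it assumes all vectors are non-zero and that `k` does not exceed the number of reference points.

```python
import numpy as np

def knn_ood_scores(reference, queries, k=10, metric="cosine"):
    """Average distance from each query to its k nearest reference points."""
    reference = np.asarray(reference, dtype=float)
    queries = np.asarray(queries, dtype=float)
    if metric == "cosine":
        # Cosine distance = 1 - cosine similarity of L2-normalized vectors.
        ref_n = reference / np.linalg.norm(reference, axis=1, keepdims=True)
        qry_n = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        dists = 1.0 - qry_n @ ref_n.T
    elif metric == "euclidean":
        diff = queries[:, None, :] - reference[None, :, :]
        dists = np.linalg.norm(diff, axis=2)
    else:
        raise ValueError(f"unsupported metric: {metric!r}")
    # Keep the k smallest distances per query; their order does not matter.
    nearest = np.partition(dists, k - 1, axis=1)[:, :k]
    return nearest.mean(axis=1)

rng = np.random.default_rng(0)
reference = rng.normal(loc=1.0, scale=0.1, size=(200, 8))  # tight cluster
in_dist = rng.normal(loc=1.0, scale=0.1, size=(5, 8))      # same cluster
far_away = rng.normal(loc=-1.0, scale=0.1, size=(5, 8))    # opposite direction

scores_in = knn_ood_scores(reference, in_dist, k=10)
scores_out = knn_ood_scores(reference, far_away, k=10)
```

In this toy setup the points drawn from the opposite direction receive larger scores than the points drawn from the reference cluster, which is the behavior the detector relies on. A real index-based search would avoid computing the full distance matrix.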

Parameters:
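k : int, default 10

Number of nearest neighbors used to compute the average distance.

distance_metric : Literal['cosine', 'euclidean'], default 'cosine'

Distance metric used for the nearest neighbor search.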

fit_embeddings(embeddings, threshold_perc=95.0)

Fit the detector using reference (in-distribution) embeddings.

Builds a k-NN index for efficient nearest neighbor search and computes reference scores for automatic thresholding.

Parameters:
embeddings : dataeval.data.Embeddings

Reference embeddings from in-distribution data

threshold_perc : float, default 95.0

Percentage of reference data considered normal. Used to set the OOD threshold automatically from the reference scores.

Return type:

None
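The automatic thresholding described for fit_embeddings can be illustrated with a small NumPy sketch. It is not DataEval's implementation: it assumes the threshold is the `threshold_perc`-th percentile of the reference scores and that scores strictly above it are flagged as OOD. The arrays `reference_scores` and `new_scores` are made-up stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the OOD scores of the reference set (hypothetical values).
reference_scores = rng.gamma(shape=2.0, scale=0.05, size=1000)

threshold_perc = 95.0
# Assumed rule: the threshold is the given percentile of the reference scores.
threshold = np.percentile(reference_scores, threshold_perc)

# Fraction of reference scores at or below the threshold (about 95%).
frac_normal = np.mean(reference_scores <= threshold)

# Scores for new samples are compared against the same threshold.
new_scores = np.array([0.01, 0.10, 5.0])
is_ood = new_scores > threshold
```

Under this assumption, a higher `threshold_perc` treats more of the reference data as normal and flags fewer new samples.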

predict(X, batch_size=int(10000000000.0), ood_type='instance')

Predict whether instances are out of distribution or not.

Parameters:
X : ArrayLike

Input data for out-of-distribution prediction.

batch_size : int, default 1e10

Number of instances to process in each batch.

ood_type : "feature" | "instance", default "instance"

Predict out-of-distribution at the ‘feature’ or ‘instance’ level.

Raises:

ValueError – If the input data X is not in the unit interval [0, 1].

Returns:

Dictionary containing the outlier predictions for the selected level and the OOD scores for the data, including both ‘instance’ and ‘feature’ (if present) level scores.

Return type:

dataeval.outputs.OODOutput

score(X, batch_size=int(10000000000.0))

Compute the out of distribution scores for a given dataset.

Parameters:
X : ArrayLike

Input data to score.

batch_size : int, default 1e10

Number of instances to process in each batch. Use a smaller batch size if your dataset is large or if you encounter memory issues.

Raises:

ValueError – If the input data X is not in the unit interval [0, 1].

Returns:

An object containing the instance-level and feature-level OOD scores.

Return type:

dataeval.outputs.OODScoreOutput
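The effect of batch_size can be pictured with a generic batching helper. This is a sketch under stated assumptions, not how DataEval batches internally: `score_in_batches` and `toy_score` are hypothetical names, and the toy scorer (distance from the origin) merely stands in for a per-instance OOD score.

```python
import numpy as np

def score_in_batches(score_fn, X, batch_size):
    """Apply score_fn to consecutive slices of X and concatenate the results."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    X = np.asarray(X)
    parts = [score_fn(X[i:i + batch_size]) for i in range(0, len(X), batch_size)]
    return np.concatenate(parts)

def toy_score(batch):
    # Hypothetical per-instance score: Euclidean distance from the origin.
    return np.linalg.norm(batch, axis=1)

X = np.arange(50, dtype=float).reshape(25, 2)
batched = score_in_batches(toy_score, X, batch_size=4)
full = toy_score(X)
```

Because each instance is scored independently of the others in its batch, a smaller batch size lowers peak memory use without changing the scores.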