dataeval.shift.OODDomainClassifier

class dataeval.shift.OODDomainClassifier(n_folds=None, n_repeats=None, n_std=None, hyperparameters=None, config=None)

Domain Classifier based Out-of-Distribution detector.

Uses a LightGBM classifier’s ability to distinguish test samples from reference samples as an OOD signal. Samples that a classifier can easily identify as “not reference” are likely OOD.

During fit(), establishes a null distribution of per-point class-1 prediction rates by running repeated k-fold CV on internal splits of the reference data. The threshold is set as mean + n_std * std of this null distribution.

During predict()/score(), treats test data as class 1 and reference as class 0, runs repeated k-fold CV, and returns per-point class-1 rates. Points with rates exceeding the threshold are flagged OOD.

Parameters:
n_folds : int, default 5

Number of cross-validation folds per repeat.

n_repeats : int, default 5

Number of times to repeat the k-fold split.

n_std : float, default 2.0

Number of standard deviations above the null mean for threshold.

hyperparameters : dict or None, default None

LightGBM hyperparameters.

config : OODDomainClassifier.Config or None, default None

Optional configuration object.

Examples

>>> ref = np.random.randn(200, 8).astype(np.float32)
>>> test = np.random.randn(50, 8).astype(np.float32) + 3
>>> detector = OODDomainClassifier(n_folds=3, n_repeats=3)
>>> detector.fit(ref)
>>> predictions = detector.predict(test)
fit(x_ref, threshold_perc=None)

Fit the detector using reference (in-distribution) data.

Computes a null distribution of class-1 prediction rates by splitting the reference data internally (half as pseudo-class-0, half as pseudo-class-1) and running repeated k-fold CV. The OOD threshold is derived from this null distribution.

Parameters:
x_ref : ArrayLike

Reference (in-distribution) data.

threshold_perc : float or None, default None

Percentage of reference data considered normal (0-100). If None, uses config.threshold_perc.

predict(x, batch_size=int(10000000000.0), ood_type='instance')

Predict whether instances are out of distribution.

Parameters:
x : ArrayLike

Input data for OOD prediction.

batch_size : int, default 1e10

Number of instances to process per batch (only used by some detectors).

ood_type : "feature" | "instance", default "instance"

Predict OOD at the "feature" or "instance" level.

Returns:

Predictions including is_ood boolean array and OOD scores.

Return type:

OODOutput

score(x, batch_size=int(10000000000.0))

Compute out of distribution scores for a given dataset.

Parameters:
x : ArrayLike

Input data to score.

batch_size : int, default 1e10

Number of instances to process per batch (only used by some detectors).

Returns:

Instance-level (and optionally feature-level) OOD scores. Higher scores indicate samples more likely to be OOD.

Return type:

OODScoreOutput

Classes

Config

Configuration for OODDomainClassifier.