dataeval.shift.OODDomainClassifier

class dataeval.shift.OODDomainClassifier(n_folds=None, n_repeats=None, n_std=None, hyperparameters=None, threshold_perc=None, extractor=None, config=None)

Domain-classifier-based out-of-distribution (OOD) detector.

Uses a LightGBM classifier’s ability to distinguish test samples from reference samples as an OOD signal. Samples that a classifier can easily identify as “not reference” are likely OOD.

During fit(), establishes a null distribution of per-point class-1 prediction rates by running repeated k-fold CV on internal splits of the reference data. The threshold is set as mean + n_std * std of this null distribution.
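
The threshold rule above can be illustrated with a small, standard-library-only sketch. The function name `null_threshold`, the example rates, and the use of the population standard deviation are assumptions made for illustration; the library may compute the standard deviation differently.

```python
import statistics


def null_threshold(null_rates, n_std=2.0):
    """Return mean + n_std * std of the null class-1 rates (hypothetical helper)."""
    mu = statistics.fmean(null_rates)
    # Population standard deviation; whether the library uses the population
    # or the sample estimate is an assumption here.
    sigma = statistics.pstdev(null_rates)
    return mu + n_std * sigma


# Made-up null rates standing in for the per-point rates gathered during fit().
null_rates = [0.4, 0.5, 0.6, 0.5]
threshold = null_threshold(null_rates, n_std=2.0)
print(round(threshold, 4))
```

A larger n_std moves the threshold further above the null mean, so fewer points are flagged.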

During predict()/score(), treats test data as class 1 and reference as class 0, runs repeated k-fold CV, and returns per-point class-1 rates. Points with rates exceeding the threshold are flagged OOD.
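
As a rough sketch of the procedure described above, the following pure-Python code labels reference rows as class 0 and test rows as class 1, runs repeated stratified k-fold CV, and records how often each held-out test point is predicted as class 1. A nearest-centroid rule stands in for the LightGBM classifier to keep the example dependency-free, and `class1_rates` and its helpers are hypothetical names. It assumes each class has at least two rows; the library's fold construction and rate definition may differ.

```python
import random


def centroid(rows):
    """Column-wise mean of a non-empty list of equal-length rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class1_rates(reference, test, n_folds=5, n_repeats=5, seed=0):
    """Fraction of repeats in which each held-out test row is predicted class 1."""
    rng = random.Random(seed)
    rows = list(reference) + list(test)
    n_ref = len(reference)
    idx0 = list(range(n_ref))             # class 0: reference rows
    idx1 = list(range(n_ref, len(rows)))  # class 1: test rows
    hits = [0] * len(rows)
    for _ in range(n_repeats):
        rng.shuffle(idx0)
        rng.shuffle(idx1)
        # Stratified folds: deal each class out across the folds.
        folds = [idx0[k::n_folds] + idx1[k::n_folds] for k in range(n_folds)]
        for held_out in folds:
            held = set(held_out)
            c0 = centroid([rows[i] for i in idx0 if i not in held])
            c1 = centroid([rows[i] for i in idx1 if i not in held])
            for i in held_out:
                # Stand-in classifier: predict class 1 if closer to its centroid.
                if sq_dist(rows[i], c1) < sq_dist(rows[i], c0):
                    hits[i] += 1
    return [h / n_repeats for h in hits[n_ref:]]


data_rng = random.Random(1)
reference = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(40)]
shifted = [[data_rng.gauss(6, 1), data_rng.gauss(6, 1)] for _ in range(10)]
rates = class1_rates(reference, shifted, n_folds=5, n_repeats=3)
```

Because the shifted points are easy to separate from the reference points, their class-1 rates should sit near 1; test points drawn from the reference distribution would instead be expected to score near chance.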

Parameters:
n_folds : int, default 5

Number of cross-validation folds per repeat.

n_repeats : int, default 5

Number of times to repeat the k-fold split.

n_std : float, default 2.0

Number of standard deviations above the mean of the null distribution at which the OOD threshold is set.

hyperparameters : dict or None, default None

Hyperparameters for the LightGBM classifier.

threshold_perc : float or None, default None

Percentage of reference data considered normal (0-100). If None, uses config.threshold_perc (default 95.0).

extractor : FeatureExtractor or None, default None

Feature extractor for transforming input data before scoring. When provided, raw data is passed through the extractor in both fit() and score()/predict(). When None, data is used as-is (must be array-like embeddings).

config : OODDomainClassifier.Config or None, default None

Optional configuration object.

Examples

>>> import numpy as np
>>> from dataeval.shift import OODDomainClassifier
>>> ref = np.random.randn(200, 8).astype(np.float32)
>>> test = np.random.randn(50, 8).astype(np.float32) + 3
>>> detector = OODDomainClassifier(n_folds=3, n_repeats=3)
>>> detector.fit(ref)
OODDomainClassifier(n_folds=3, n_repeats=3, n_std=2.0, threshold_perc=95.0, hyperparameters=None, extractor=None, fitted=True)
>>> predictions = detector.predict(test)

fit(reference_data)

Fit the detector using reference (in-distribution) data.

Computes a null distribution of class-1 prediction rates by splitting the reference data internally (half as pseudo-class-0, half as pseudo-class-1) and running repeated k-fold CV. The OOD threshold is derived from this null distribution.
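
The internal split described above might look like the following minimal sketch. The helper name `split_reference` and the use of a seeded random shuffle are assumptions; the source does not say how the halves are chosen.

```python
import random


def split_reference(n_samples, seed=0):
    """Randomly split reference indices into two pseudo-classes (hypothetical helper)."""
    rng = random.Random(seed)
    order = list(range(n_samples))
    rng.shuffle(order)
    half = n_samples // 2
    # First half plays the role of class 0, the remainder class 1.
    return order[:half], order[half:]


pseudo0, pseudo1 = split_reference(200)
```

Both halves come from the same distribution, so a classifier trained to separate them should perform near chance, which is what makes the resulting class-1 rates usable as a null distribution.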

Parameters:
reference_data : Any

Reference (in-distribution) data. When an extractor is configured, this can be any data type accepted by the extractor. Otherwise, must be array-like embeddings.

Returns:

The fitted detector (for method chaining).

Return type:

Self

predict(data, batch_size=None, ood_type='instance')

Predict whether instances are out of distribution.

Parameters:
data : ArrayLike

Input data for OOD prediction.

batch_size : int or None, default None

Number of instances to process per batch (only used by some detectors). When None, uses the global batch size from get_batch_size().

ood_type : "feature" | "instance", default "instance"

Predict OOD at the "feature" or "instance" level.

Returns:

Predictions including is_ood boolean array and OOD scores.

Return type:

OODOutput
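
The relationship between the scores and the is_ood array can be sketched as a strict comparison against the fitted threshold, following the class description ("rates exceeding the threshold are flagged OOD"). The function name `flag_ood`, the scores, and the threshold value below are made up for illustration and do not use the library's OODOutput type.

```python
def flag_ood(scores, threshold):
    """Mark scores strictly above the threshold as OOD (hypothetical helper)."""
    return [s > threshold for s in scores]


# Made-up per-instance scores and threshold.
scores = [0.10, 0.55, 0.93, 0.64]
is_ood = flag_ood(scores, threshold=0.64)
print(is_ood)  # [False, False, True, False]
```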

score(data, batch_size=None)

Compute out-of-distribution scores for a given dataset.

Parameters:
data : ArrayLike

Input data to score.

batch_size : int or None, default None

Number of instances to process per batch (only used by some detectors). When None, uses the global batch size from get_batch_size().

Returns:

Instance-level (and optionally feature-level) OOD scores. Higher scores indicate samples more likely to be OOD.

Return type:

OODScoreOutput

Classes

Config

Configuration for OODDomainClassifier.