dataeval.shift.OODDomainClassifier
- class dataeval.shift.OODDomainClassifier(n_folds=None, n_repeats=None, n_std=None, hyperparameters=None, threshold_perc=None, extractor=None, config=None)
Domain-classifier-based out-of-distribution (OOD) detector.
Uses a LightGBM classifier's ability to distinguish test samples from reference samples as an OOD signal. Samples that the classifier can easily identify as "not reference" are likely OOD.
During fit(), the detector establishes a null distribution of per-point class-1 prediction rates by running repeated k-fold CV on internal splits of the reference data. The threshold is set as mean + n_std * std of this null distribution.
During predict()/score(), the detector treats test data as class 1 and reference data as class 0, runs repeated k-fold CV, and returns per-point class-1 rates. Points with rates exceeding the threshold are flagged as OOD.
Note: By default, this detector uses the n_std-based threshold for predictions. If a value for threshold_perc is provided (either directly or via config), it uses percentile-based thresholding from reference scores instead.
- Parameters:
- n_folds : int, default 5
Number of cross-validation folds per repeat.
- n_repeats : int, default 5
Number of times to repeat the k-fold split.
- n_std : float, default 2.0
Number of standard deviations above the null mean used for the threshold. Used when threshold_perc is not explicitly set.
- hyperparameters : dict or None, default None
LightGBM hyperparameters.
- threshold_perc : float or None, default None
Percentage of reference data considered normal (0-100). If provided, overrides n_std and switches to percentile-based thresholding.
- extractor : FeatureExtractor or None, default None
Feature extractor for transforming input data before scoring. When provided, raw data is passed through the extractor in both fit() and score()/predict(). When None, data is used as-is (must be array-like embeddings).
- config : OODDomainClassifier.Config or None, default None
Optional configuration object.
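The two thresholding modes described above (n_std and threshold_perc) can be sketched in plain NumPy. This is a minimal illustration, not the library's implementation: ood_threshold is a hypothetical helper, the sample rates are made up, and the use of the population standard deviation (NumPy's default) and of np.percentile are assumptions.

```python
import numpy as np


def ood_threshold(null_rates, n_std=2.0, threshold_perc=None):
    """Derive an OOD threshold from reference (null) class-1 rates.

    Hypothetical helper for illustration only; not part of dataeval.
    """
    null_rates = np.asarray(null_rates, dtype=float)
    if threshold_perc is not None:
        # Percentile rule: threshold_perc percent of reference scores count as normal.
        return float(np.percentile(null_rates, threshold_perc))
    # Default rule: mean + n_std * std of the null distribution
    # (population std assumed here).
    return float(null_rates.mean() + n_std * null_rates.std())


# Made-up class-1 rates, for illustration only.
null_rates = np.array([0.40, 0.50, 0.60, 0.50, 0.50])
test_rates = np.array([0.45, 0.95])

threshold = ood_threshold(null_rates, n_std=2.0)
# Points whose rate exceeds the threshold are flagged as OOD.
is_ood = test_rates > threshold
```

Passing threshold_perc to the helper takes the percentile branch and ignores n_std, which mirrors the override behavior described for the threshold_perc parameter.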
Examples
>>> import numpy as np
>>> from dataeval.shift import OODDomainClassifier
>>> ref = np.random.randn(200, 8).astype(np.float32)
>>> test = np.random.randn(50, 8).astype(np.float32) + 3
>>> detector = OODDomainClassifier(n_folds=3, n_repeats=3)
>>> detector.fit(ref)
OODDomainClassifier(n_folds=3, n_repeats=3, n_std=2.0, threshold_perc=None, hyperparameters=None, extractor=None, fitted=True)
>>> predictions = detector.predict(test)
- fit(reference_data)
Fit the detector using reference (in-distribution) data.
Computes a null distribution of class-1 prediction rates by splitting the reference data internally (half as pseudo-class-0, half as pseudo-class-1) and running repeated k-fold CV. The OOD threshold is derived from this null distribution.
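The procedure above can be sketched with NumPy alone. This is a rough illustration under stated assumptions, not the detector's code: a nearest-centroid rule stands in for the LightGBM classifier, the folds are not stratified, and the helper names (nearest_centroid_predict, class1_rates) are hypothetical. Keeping only the rates of pseudo-class-1 points for the null is also an assumption, chosen to mirror how test points are treated as class 1 at predict time.

```python
import numpy as np


def nearest_centroid_predict(X_train, y_train, X_test):
    """Stand-in classifier (NOT LightGBM): label by the closer class centroid."""
    c0 = X_train[y_train == 0].mean(axis=0)
    c1 = X_train[y_train == 1].mean(axis=0)
    d0 = np.linalg.norm(X_test - c0, axis=1)
    d1 = np.linalg.norm(X_test - c1, axis=1)
    return (d1 < d0).astype(int)


def class1_rates(X, y, n_folds=5, n_repeats=5, seed=0):
    """Per-point rate of held-out class-1 predictions over repeated k-fold CV."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(len(X))
    for _ in range(n_repeats):
        order = rng.permutation(len(X))
        for held_out in np.array_split(order, n_folds):
            train = np.setdiff1d(order, held_out)
            votes[held_out] += nearest_centroid_predict(X[train], y[train], X[held_out])
    # Each point is held out exactly once per repeat.
    return votes / n_repeats


rng = np.random.default_rng(0)
reference = rng.standard_normal((200, 8))

# Null split: half of the reference is relabeled as pseudo-class 1.
pseudo = np.zeros(len(reference), dtype=int)
pseudo[rng.permutation(len(reference))[:100]] = 1
null_rates = class1_rates(reference, pseudo, n_folds=3, n_repeats=3)[pseudo == 1]

# Same routine at predict time: reference is class 0, test data is class 1.
test = rng.standard_normal((50, 8)) + 3
X = np.vstack([reference, test])
y = np.concatenate([np.zeros(200, dtype=int), np.ones(50, dtype=int)])
test_rates = class1_rates(X, y, n_folds=3, n_repeats=3)[y == 1]
```

On the null split the classifier has nothing real to separate, so the pseudo-class-1 rates stay well below those of the shifted test points, which the stand-in classifier identifies almost every time.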
- predict(data, batch_size=None, ood_type='instance')
Predict whether instances are out of distribution.
- Parameters:
- data : ArrayLike
Input data for OOD prediction.
- batch_size : int or None, default None
Number of instances to process per batch (only used by some detectors). When None, uses the global batch size from get_batch_size().
- ood_type : "feature" | "instance", default "instance"
Predict OOD at the "feature" or "instance" level.
- Returns:
Predictions including an is_ood boolean array and OOD scores.
- Return type:
- score(data, batch_size=None)
Compute out-of-distribution scores for a given dataset.
- Parameters:
- data : ArrayLike
Input data to score.
- batch_size : int or None, default None
Number of instances to process per batch (only used by some detectors). When None, uses the global batch size from get_batch_size().
- Returns:
Instance-level (and optionally feature-level) OOD scores. Higher scores indicate samples more likely to be OOD.
- Return type:
Classes
Configuration for OODDomainClassifier.