dataeval.protocols.EvaluationStrategy

class dataeval.protocols.EvaluationStrategy

Protocol defining the interface for evaluating a trained model.

Implementations must provide an evaluate method with the signature documented below. The protocol relies on structural typing, so no explicit inheritance is required.

The @runtime_checkable decorator allows isinstance() checks if needed, though structural typing works without it at type-check time.
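As a rough illustration of the runtime check, the sketch below uses only the standard library. EvaluationStrategyLike is a hypothetical stand-in that mirrors the documented evaluate() interface; it is not the library's class. ConstantEvaluation and NotAStrategy are likewise invented for this example.

```python
from typing import Any, Mapping, Protocol, runtime_checkable


@runtime_checkable
class EvaluationStrategyLike(Protocol):
    """Hypothetical stand-in mirroring the documented evaluate() interface."""

    def evaluate(self, model: Any, dataset: Any) -> Mapping[str, Any]: ...


class ConstantEvaluation:
    """Does not inherit from the protocol; it only provides evaluate()."""

    def evaluate(self, model: Any, dataset: Any) -> Mapping[str, Any]:
        return {"accuracy": 0.95}


class NotAStrategy:
    """Has no evaluate() method, so the structural check fails."""


# isinstance() succeeds purely because an evaluate attribute is present.
is_strategy = isinstance(ConstantEvaluation(), EvaluationStrategyLike)
is_not_strategy = isinstance(NotAStrategy(), EvaluationStrategyLike)
```

Note that an isinstance() check against a runtime-checkable protocol only verifies that the required members exist. It does not validate parameter or return types, so a static type checker remains the more thorough option.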

Examples

Creating a custom evaluation strategy:

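>>> from collections.abc import Mapping
>>> import numpy as np
>>> import torch
>>> from dataeval.protocols import Dataset  # assumed import location for Dataset
>>> class MyEvaluation: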
...     def __init__(self, batch_size: int, metrics: list[str]):
...         self.batch_size = batch_size
...         self.metrics = metrics
...
...     def evaluate(self, model: torch.nn.Module, dataset: Dataset) -> Mapping[str, float | np.ndarray]:
...         # Custom evaluation implementation
...         model.eval()
...         with torch.no_grad():
...             # Compute metrics
...             ...
...         return {"accuracy": 0.95, "f1": 0.93}

evaluate(model, dataset)

Evaluate the model on the dataset and return performance metrics.

Parameters:
model : nn.Module

The trained model to evaluate

dataset : Dataset[T]

The dataset to evaluate on (typically a test/validation set)

Returns:

Mapping of metric names to values. Each value is either:

- A scalar (float) for single-class metrics
- An array (np.ndarray) for per-class or per-sample metrics

Examples:

- {"accuracy": 0.95}  # Single metric
- {"accuracy": 0.95, "precision": 0.93, "recall": 0.94}  # Multiple metrics
- {"accuracy": np.array([0.9, 0.85, 0.92])}  # Per-class metrics

Return type:

Mapping[str, float | ArrayLike]

Notes

Implementations should:

- Set model to eval mode if needed
- Return consistent metric names across calls
- Handle both single-class and multi-class scenarios
- Use the entire dataset (unlike training, which uses subsets)
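One way to follow the recommendations above is sketched below. It is dependency-free and illustrative only: PerClassAccuracyEvaluation, threshold_model, and toy_dataset are hypothetical names. The sketch assumes the model is a plain callable returning an integer class label, and that the dataset is a sequence of (sample, label) pairs. A plain list stands in for the per-class array.

```python
from typing import Any, Callable, Mapping, Sequence


class PerClassAccuracyEvaluation:
    """Hypothetical strategy reporting overall and per-class accuracy."""

    def __init__(self, num_classes: int) -> None:
        self.num_classes = num_classes

    def evaluate(
        self,
        model: Callable[[Any], int],
        dataset: Sequence[tuple[Any, int]],
    ) -> Mapping[str, Any]:
        # Switch to eval mode only if the model exposes such a method.
        if hasattr(model, "eval"):
            model.eval()

        correct = [0] * self.num_classes
        total = [0] * self.num_classes
        # Iterate over the entire dataset rather than a subset.
        for sample, label in dataset:
            total[label] += 1
            if model(sample) == label:
                correct[label] += 1

        n = sum(total)
        # Classes with no samples report 0.0 instead of dividing by zero.
        per_class = [c / t if t else 0.0 for c, t in zip(correct, total)]
        # The same two keys are returned on every call.
        return {
            "accuracy": sum(correct) / n if n else 0.0,
            "per_class_accuracy": per_class,
        }


def threshold_model(x: float) -> int:
    """Toy binary classifier: predicts class 1 when x >= 0.5."""
    return int(x >= 0.5)


toy_dataset = [(0.1, 0), (0.4, 0), (0.6, 0), (0.9, 1)]
metrics = PerClassAccuracyEvaluation(num_classes=2).evaluate(threshold_model, toy_dataset)
```

Returning a fixed set of keys, even when a class has no samples, keeps results comparable across calls. A real implementation would more likely batch the data and return np.ndarray values for the per-class entries.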