dataeval.data.split_dataset

dataeval.data.split_dataset(dataset, num_folds=1, stratify=False, split_on=None, test_frac=0.0, val_frac=0.0)

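Splits a dataset into one or more [train, val] folds, with an optional held-out test set. Returns a dataclass containing the training and validation indices for each fold and, if requested, the test indices.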

Parameters:

dataset : AnnotatedDataset or Metadata: Dataset to split.
num_folds : int, default 1: Number of [train, val] folds. If equal to 1, val_frac must be greater than 0.0.
stratify : bool, default False: If True, the dataset is split so that the class distribution of the entire dataset is preserved within each [train, val] partition. This is generally recommended.
split_on : list or None, default None: Keys of the metadata dictionary on which to group the dataset. A grouped partition is divided so that no group appears in both the training and validation sets. Choose split_on groups to mitigate validation bias.
test_frac : float, default 0.0: Fraction of the data to hold out as an optional test set.
val_frac : float, default 0.0: Fraction of the training data to set aside for validation when a single [train, val] split is desired.
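
To make the stratify option concrete, the following is a minimal pure-Python sketch of a stratified single [train, val] split. It is an illustration of the idea only, not dataeval's implementation; the function name `stratified_split`, its `seed` argument, and the toy labels are assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_frac, seed=0):
    """Return (train_idx, val_idx) with roughly equal class proportions in each.

    Hypothetical helper for illustration; not part of dataeval.
    """
    rng = random.Random(seed)

    # Collect the sample indices that belong to each class.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Take the same fraction from every class.
        n_val = round(len(members) * val_frac)
        val_idx.extend(members[:n_val])
        train_idx.extend(members[n_val:])
    return sorted(train_idx), sorted(val_idx)


labels = ["cat"] * 80 + ["dog"] * 20
train_idx, val_idx = stratified_split(labels, val_frac=0.25)
val_cats = sum(labels[i] == "cat" for i in val_idx)
val_dogs = sum(labels[i] == "dog" for i in val_idx)
print(len(train_idx), len(val_idx), val_cats, val_dogs)  # prints: 75 25 20 5
```

The validation set keeps the 80/20 class ratio of the full toy dataset, which is the property the stratify option describes.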

Returns:

split_defs – Output class containing the training and validation indices for each fold, plus the test indices if a test set was requested.

Return type:

SplitDatasetOutput
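
As a rough picture of what "indices for each fold and optional test indices" can look like, here is a self-contained sketch for the multi-fold case (num_folds greater than 1). The classes `FoldSketch` and `SplitSketch`, their field names, and the function `k_fold_sketch` are hypothetical stand-ins invented for this example; this section does not list the fields of SplitDatasetOutput, so do not read them as the real attribute names.

```python
from dataclasses import dataclass, field


@dataclass
class FoldSketch:  # hypothetical stand-in, not a dataeval class
    train: list[int]
    val: list[int]


@dataclass
class SplitSketch:  # hypothetical stand-in for the real output class
    test: list[int]
    folds: list[FoldSketch] = field(default_factory=list)


def k_fold_sketch(n_samples, num_folds, n_test=0):
    """Hold out the last n_test indices, then build num_folds [train, val] folds."""
    indices = list(range(n_samples))
    test = indices[n_samples - n_test:]
    pool = indices[:n_samples - n_test]

    folds = []
    for k in range(num_folds):
        # Every num_folds-th index, offset by k, forms this fold's validation set.
        val = pool[k::num_folds]
        val_set = set(val)
        train = [i for i in pool if i not in val_set]
        folds.append(FoldSketch(train=train, val=val))
    return SplitSketch(test=test, folds=folds)


split = k_fold_sketch(n_samples=10, num_folds=3, n_test=1)
for fold in split.folds:
    print(fold.train, fold.val)
print(split.test)
```

Each non-test index lands in exactly one validation set across the folds, and the test indices never appear in any fold.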

Notes

When groups and/or stratification are specified, the actual test and validation ratios can deviate from the requested fractions, because stratification and grouping take priority over the percentages.
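
The effect described in this note can be reproduced with a small pure-Python sketch of a grouped split. It is an assumption-laden illustration, not dataeval's algorithm: `grouped_split` is a hypothetical helper that moves whole groups into validation, smallest first, until the requested size is reached.

```python
def grouped_split(groups, val_frac):
    """Assign whole groups to validation until the target size is reached.

    Hypothetical helper for illustration; not part of dataeval.
    """
    members = {}
    for idx, group in enumerate(groups):
        members.setdefault(group, []).append(idx)

    target = val_frac * len(groups)
    train_idx, val_idx = [], []
    # Smallest groups first, ties broken by name, so the result is deterministic.
    for group in sorted(members, key=lambda g: (len(members[g]), g)):
        if len(val_idx) < target:
            val_idx.extend(members[group])  # a group is never split
        else:
            train_idx.extend(members[group])
    return sorted(train_idx), sorted(val_idx)


groups = ["site_a"] * 4 + ["site_b"] * 3 + ["site_c"] * 3
train_idx, val_idx = grouped_split(groups, val_frac=0.2)
actual_frac = len(val_idx) / len(groups)
print(val_idx, actual_frac)  # prints: [4, 5, 6] 0.3
```

The request was for 20% validation data, but the smallest group already covers 30% of the samples. Because no group may be split, the realized fraction is 0.3 rather than 0.2.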