dataeval.data.split_dataset

dataeval.data.split_dataset(dataset, num_folds=1, stratify=False, split_on=None, test_frac=0.0, val_frac=0.0)

Splits a dataset into training and validation folds, with an optional held-out test set. Returns a dataclass containing the train and validation indices for each fold.

Parameters:
dataset : AnnotatedDataset or Metadata

Dataset to split.

num_folds : int, default 1

Number of [train, val] folds. If equal to 1, val_frac must be greater than 0.0.

stratify : bool, default False

If True, the dataset is split so that each [train, val] partition preserves the class distribution of the entire dataset. This is generally recommended.

split_on : list or None, default None

Keys of the metadata dictionary by which to group the dataset. A grouped partition is divided so that no group appears in both the training and validation sets. Choose split_on keys so that related samples stay in the same group, which mitigates validation bias.

test_frac : float, default 0.0

Fraction of data to optionally hold out as a test set.

val_frac : float, default 0.0

Fraction of training data to set aside for validation when a single [train, val] split is desired.
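
To illustrate what stratify and val_frac mean for a single [train, val] split, here is a minimal standard-library sketch. It is not the dataeval implementation: the helper stratified_train_val, its seed argument, and the per-class rounding rule are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def stratified_train_val(labels, val_frac, seed=0):
    """Return (train, val) index lists with per-class proportions preserved.

    Hypothetical helper for illustration; not part of dataeval.
    """
    if not 0.0 < val_frac < 1.0:
        raise ValueError("val_frac must be in (0, 1) for a single split")

    # Bucket sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction from every class (assumed rounding rule).
        n_val = round(len(indices) * val_frac)
        val.extend(indices[:n_val])
        train.extend(indices[n_val:])
    return sorted(train), sorted(val)


# An 80/20 class imbalance is carried into both partitions.
labels = ["cat"] * 80 + ["dog"] * 20
train, val = stratified_train_val(labels, val_frac=0.25)
```

In this sketch the validation set holds 20 "cat" and 5 "dog" samples, the same 4:1 ratio as the full dataset.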

Returns:

split_defs – Output class containing, for each fold, the indices of the training and validation data, plus optional test indices

Return type:

SplitDatasetOutput
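
The shape of such a result (one train/val index pair per fold, plus optional test indices) can be sketched with plain dataclasses. Everything named here is hypothetical: FoldSketch, SplitSketch, kfold_sketch, and the field names folds, train, val, and test are assumptions for illustration and are not confirmed attribute names of SplitDatasetOutput. The round-robin fold assignment is likewise only a stand-in for whatever the library does.

```python
from dataclasses import dataclass, field


@dataclass
class FoldSketch:
    """One [train, val] pair of index lists (hypothetical)."""
    train: list[int]
    val: list[int]


@dataclass
class SplitSketch:
    """Per-fold indices plus optional held-out test indices (hypothetical)."""
    folds: list[FoldSketch]
    test: list[int] = field(default_factory=list)


def kfold_sketch(n_samples, num_folds, n_test=0):
    """Hold out the last n_test indices, then rotate the rest through val."""
    indices = list(range(n_samples))
    test = indices[n_samples - n_test:]
    rest = indices[:n_samples - n_test]
    folds = []
    for k in range(num_folds):
        val = rest[k::num_folds]  # every num_folds-th index, starting at k
        val_set = set(val)
        train = [i for i in rest if i not in val_set]
        folds.append(FoldSketch(train=train, val=val))
    return SplitSketch(folds=folds, test=test)


splits = kfold_sketch(n_samples=10, num_folds=3, n_test=1)
for fold in splits.folds:
    print(len(fold.train), len(fold.val))
```

Each non-test index lands in the validation set of exactly one fold, and the test indices never appear in any fold.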

Notes

When groups and/or stratification are specified, the actual test and validation ratios can differ from the requested fractions, because stratification and grouping take priority over the fractions.
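
A small sketch shows why grouping can shift the ratios. It uses a deliberately simple greedy assignment, which is an assumption for illustration and not the algorithm dataeval uses; grouped_train_val and the patient-style group labels are hypothetical.

```python
def grouped_train_val(groups, val_frac):
    """Assign whole groups to val until the requested fraction is reached.

    Hypothetical greedy helper for illustration; not part of dataeval.
    """
    # Collect indices per group, preserving first-seen order.
    members = {}
    for idx, group in enumerate(groups):
        members.setdefault(group, []).append(idx)

    target = val_frac * len(groups)
    train, val = [], []
    for indices in members.values():
        # A group moves as a unit, so val may overshoot the target.
        if len(val) < target:
            val.extend(indices)
        else:
            train.extend(indices)
    return train, val


# Three groups with 5, 3, and 2 samples each.
groups = ["p1"] * 5 + ["p2"] * 3 + ["p3"] * 2
train, val = grouped_train_val(groups, val_frac=0.2)
actual_frac = len(val) / len(groups)
```

Here the requested validation fraction is 0.2, but because group "p1" cannot be divided, the validation set ends up with half of the samples.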