dataeval.core.parity

dataeval.core.parity(binned_data: numpy.typing.NDArray[numpy.intp], class_labels: numpy.typing.NDArray[numpy.intp], *, return_insufficient_data: False = False) → tuple[numpy.typing.NDArray[numpy.float64], numpy.typing.NDArray[numpy.float64]]

dataeval.core.parity(binned_data: numpy.typing.NDArray[numpy.intp], class_labels: numpy.typing.NDArray[numpy.intp], *, return_insufficient_data: True) → tuple[numpy.typing.NDArray[numpy.float64], numpy.typing.NDArray[numpy.float64], collections.abc.Mapping[int, collections.abc.Mapping[int, collections.abc.Mapping[int, int]]]]

Calculate chi-square statistics to assess the linear relationship between multiple factors and class labels.

This function computes the chi-square statistic for each metadata factor to determine if there is a significant relationship between the factor values and class labels. The chi-square statistic is only valid for linear relationships. If non-linear relationships exist, use balance.

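Parameters:

binned_data : NDArray[np.intp]: Binned metadata factor values
class_labels : NDArray[np.intp]: Observed class labels
return_insufficient_data : bool, default False: If True, also return a nested mapping describing where data was insufficient, as the third element of the returned tuple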
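As an illustration of what binned inputs might look like, the sketch below bins one continuous factor with `np.digitize` and stacks it with a discrete factor. The factor names, the values, the bin edges, and the (num_samples, num_factors) layout are all assumptions made for the example; the reference above does not state the expected shape.

```python
import numpy as np

# Hypothetical raw metadata for 8 samples (illustrative values only).
brightness = np.array([0.05, 0.20, 0.35, 0.50, 0.65, 0.80, 0.90, 0.99])
camera_id = np.array([0, 1, 0, 1, 2, 2, 0, 1])

# Two interior edges split brightness into 3 bins:
# x < 0.33 -> 0, 0.33 <= x < 0.66 -> 1, x >= 0.66 -> 2.
edges = np.array([0.33, 0.66])
brightness_bins = np.digitize(brightness, edges)

# Assumed layout: one row per sample, one column per factor.
binned_data = np.stack([brightness_bins, camera_id], axis=1).astype(np.intp)
class_labels = np.array([0, 0, 1, 1, 0, 1, 1, 0], dtype=np.intp)

print(binned_data.shape, binned_data.dtype == np.intp)
```

Arrays prepared this way match the annotated dtypes of `binned_data` and `class_labels`; whether `parity` expects exactly this layout should be checked against the library itself.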

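Returns:

Two arrays of length num_factors. The i-th element of the first array is the chi-square score, and the i-th element of the second is the p-value, for the relationship between factor i and the class labels in the dataset. When return_insufficient_data is True, a nested mapping is returned as a third element.

Return type:

tuple[NDArray[np.float64], NDArray[np.float64]], or tuple[NDArray[np.float64], NDArray[np.float64], Mapping[int, Mapping[int, Mapping[int, int]]]] when return_insufficient_data is True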

Raises:

Warning – If any cell in the contingency matrix has a nonzero count below 5, a warning is issued because such small counts can make the chi-square calculation inaccurate. Ensure that each label co-occurs with each factor value either 0 times or at least 5 times.
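The low-count condition can be checked ahead of time on a contingency matrix. The following is a minimal sketch, not the library's own check: `low_count_cells` and its `min_count` parameter are hypothetical names, and the example table is made up.

```python
import numpy as np

def low_count_cells(table, min_count=5):
    """Return (row, col, count) for every cell with 0 < count < min_count."""
    table = np.asarray(table)
    # Zero cells are acceptable; only small nonzero counts are flagged.
    rows, cols = np.nonzero((table > 0) & (table < min_count))
    return [(int(r), int(c), int(table[r, c])) for r, c in zip(rows, cols)]

# Rows are factor values, columns are class labels (illustrative counts).
table = np.array([[12, 0],
                  [3, 9],
                  [7, 1]])
flagged = low_count_cells(table)
print(flagged)  # [(1, 0, 3), (2, 1, 1)]
```

Cells flagged this way are candidates for coarser binning or for collecting more samples before relying on the scores.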

Notes

A high score with a low p-value suggests that a metadata factor is strongly correlated with a class label.
The function creates a contingency matrix for each factor, where each entry represents the frequency of a specific factor value co-occurring with a particular class label.
Rows containing only zeros in the contingency matrix are removed before performing the chi-square test to prevent errors in the calculation.
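The steps in these notes can be sketched for a single factor as follows. This is an illustrative, uncorrected Pearson chi-square computation under stated assumptions, not the library's implementation: the helper names are hypothetical, and every class label is assumed to occur at least once so that no expected count is zero.

```python
import numpy as np

def contingency_matrix(factor_bins, class_labels):
    """Count co-occurrences: rows are factor values, columns are class labels."""
    n_values = int(factor_bins.max()) + 1
    n_classes = int(class_labels.max()) + 1
    table = np.zeros((n_values, n_classes), dtype=np.int64)
    # np.add.at accumulates correctly when the same (row, col) pair repeats.
    np.add.at(table, (factor_bins, class_labels), 1)
    return table

def chi_square_statistic(table):
    """Pearson chi-square statistic after dropping all-zero rows."""
    table = table[table.sum(axis=1) > 0].astype(np.float64)
    row_totals = table.sum(axis=1, keepdims=True)
    col_totals = table.sum(axis=0, keepdims=True)
    # Expected counts under independence of factor value and class label.
    expected = row_totals * col_totals / table.sum()
    return float(((table - expected) ** 2 / expected).sum())

# Factor value 1 never occurs, so its row is all zeros and gets removed.
factor_bins = np.array([0, 0, 0, 0, 2, 2, 2, 2])
class_labels = np.array([0, 0, 0, 1, 1, 1, 1, 0])
table = contingency_matrix(factor_bins, class_labels)
stat = chi_square_statistic(table)
print(table.tolist(), stat)  # [[3, 1], [0, 0], [1, 3]] 2.0
```

Without the row removal, the all-zero row would produce expected counts of zero and a division by zero. A p-value additionally requires the chi-square distribution with the appropriate degrees of freedom; `scipy.stats.chi2_contingency` is one existing routine that returns both, though the source does not confirm which routine `parity` uses.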