dataeval.metrics.bias.parity

dataeval.metrics.bias.parity(metadata)

Calculate chi-square statistics to assess the linear relationship between multiple factors and class labels.

This function computes the chi-square statistic for each metadata factor to determine if there is a significant relationship between the factor values and class labels. The chi-square statistic is only valid for linear relationships. If non-linear relationships exist, use balance.

Parameters:

metadata (Metadata) – Preprocessed metadata from preprocess()

Returns:

Arrays of length num_factors whose i-th elements are the chi-square score and p-value for the relationship between factor i and the class labels in the dataset.

Return type:

ParityOutput[NDArray[np.float64]]

Raises:

Warning – If any cell in the contingency matrix has a count greater than 0 but less than 5, a warning is issued because such small counts can make the chi-square calculation inaccurate. Ideally, each label co-occurs with each factor value either 0 times or at least 5 times.
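The low-count condition can be illustrated with a small sketch. The contingency matrix here is hypothetical, and the nested-dictionary layout only mirrors the shape of insufficient_data in the example output; how the library builds that field internally is an assumption.

```python
import numpy as np

# Hypothetical contingency matrix: rows are factor values, columns are class labels.
contingency = np.array([
    [12, 0, 7],
    [4, 9, 3],
    [6, 5, 10],
])

# Cells that are non-zero but below 5 are the problematic ones.
# A count of exactly 0 is acceptable and is not flagged.
low = (contingency > 0) & (contingency < 5)

# Collect the flagged cells as {factor_value: {label: count}}.
insufficient = {}
for r, c in zip(*np.nonzero(low)):
    insufficient.setdefault(int(r), {})[int(c)] = int(contingency[r, c])

print(insufficient)
```

Only the second row is reported, because its counts of 4 and 3 fall below 5 while the single zero cell in the first row is ignored.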

Note

A high score with a low p-value suggests that a metadata factor is strongly correlated with a class label.
The function creates a contingency matrix for each factor, where each entry represents the frequency of a specific factor value co-occurring with a particular class label.
Rows containing only zeros in the contingency matrix are removed before performing the chi-square test to prevent errors in the calculation.
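The steps described in this note can be sketched for a single factor as follows. This is a minimal illustration that assumes SciPy's chi2_contingency as the test routine and uses made-up data; it is not the library's actual implementation.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical toy data: one discrete factor value and one class label per sample.
factor = np.array([0, 0, 0, 1, 1, 1, 3, 3, 3, 3, 0, 1])
labels = np.array([0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1])

# Contingency matrix: entry (v, c) counts how often factor value v
# co-occurs with class label c.
n_values = factor.max() + 1
n_classes = labels.max() + 1
contingency = np.zeros((n_values, n_classes), dtype=int)
np.add.at(contingency, (factor, labels), 1)

# Factor value 2 never occurs, so its row is all zeros. Such rows produce
# zero expected frequencies, which chi2_contingency rejects, so drop them.
contingency = contingency[contingency.any(axis=1)]

# Chi-square test of the factor against the class labels.
chi2, p_value, dof, expected = chi2_contingency(contingency)
print(contingency.shape, dof, chi2, p_value)
```

Repeating this per factor and stacking the resulting statistics and p-values would give arrays of length num_factors, matching the shape of the documented return value.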

See also

balance

Examples

Randomly create some “continuous” and categorical variables using np.random.default_rng:

>>> import numpy as np
>>> from dataeval.metrics.bias import parity
>>> from dataeval.utils.metadata import preprocess
>>> rng = np.random.default_rng(175)
>>> labels = rng.choice(["doctor", "artist", "teacher"], (100))
>>> metadata_dict = {
...         "age": list(rng.choice([25, 30, 35, 45], (100))),
...         "income": list(rng.choice([50000, 65000, 80000], (100))),
...         "gender": list(rng.choice(["M", "F"], (100))),
... }
>>> continuous_factor_bincounts = {"age": 4, "income": 3}
>>> metadata = preprocess(metadata_dict, labels, continuous_factor_bincounts)
>>> parity(metadata)
ParityOutput(score=array([7.35731943, 5.46711299, 0.51506212]), p_value=array([0.28906231, 0.24263543, 0.77295762]), metadata_names=['age', 'income', 'gender'], insufficient_data={'age': {3: {'artist': 4}, 4: {'artist': 4, 'teacher': 3}}, 'income': {1: {'artist': 3}}})
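One way to read this result is to pair each factor name with its p-value and compare against a significance threshold. The sketch below copies the arrays from the output above into plain variables so it runs without dataeval; the threshold alpha = 0.05 is an assumed convention and not something the library prescribes.

```python
import numpy as np

# Values copied from the ParityOutput shown above.
metadata_names = ["age", "income", "gender"]
score = np.array([7.35731943, 5.46711299, 0.51506212])
p_value = np.array([0.28906231, 0.24263543, 0.77295762])

alpha = 0.05  # assumed significance threshold, not part of the library

# Factors whose relationship with the class labels is significant at alpha.
flagged = [name for name, p in zip(metadata_names, p_value) if p < alpha]

# Factors ordered from smallest to largest p-value.
ranked = [metadata_names[i] for i in np.argsort(p_value)]

print(flagged)
print(ranked)
```

No factor falls below the threshold here, which is expected because the labels and factors were drawn independently at random.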