parity

Calculate chi-square statistics to assess the relationship between multiple factors and class labels.

This function computes the chi-square statistic for each metadata factor to determine if there is a significant relationship between the factor values and class labels. The function handles both categorical and discretized continuous factors.

Parameters:

data_factors (Mapping[str, ArrayLike]) – The dataset factors, which are per-image attributes including class label and metadata. Each key of data_factors is a factor name, and its value is the array of per-image values for that factor.
continuous_factor_bincounts (Dict[str, int] | None, default None) – A dictionary specifying the number of bins for discretizing the continuous factors. The keys should correspond to the names of continuous factors in data_factors, and the values should be the number of bins to use for discretization. If not provided, no discretization is applied.

Returns:

Arrays of length num_factors whose i-th elements are the chi-square score and p-value for the relationship between factor i and the class labels in the dataset.

Return type:

ParityOutput[NDArray[np.float64]]

Raises:

Warning – If any cell in the contingency matrix has a value greater than 0 and less than 5, a warning is issued because such small counts can lead to inaccurate chi-square calculations. It is recommended to ensure that each label co-occurs with each factor value either 0 times or at least 5 times. Alternatively, continuous-valued factors can be digitized into fewer bins.

Notes

Each key of the continuous_factor_bincounts dictionary must occur as a key in data_factors.
A high score with a low p-value suggests that a metadata factor is strongly associated with the class labels.
The function creates a contingency matrix for each factor, where each entry represents the frequency of a specific factor value co-occurring with a particular class label.
Rows containing only zeros in the contingency matrix are removed before performing the chi-square test to prevent errors in the calculation.
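
The steps described in these notes can be sketched roughly as follows. This is a minimal illustration, not the library's actual implementation: the helper names discretize, contingency_matrix, and chi_square_for_factor are hypothetical, and the use of scipy.stats.chi2_contingency and equal-width bins is an assumption made for the sketch.

```python
import warnings

import numpy as np
from scipy.stats import chi2_contingency


def discretize(values, num_bins):
    """Bin a continuous factor into integer codes 0..num_bins-1 (hypothetical helper)."""
    values = np.asarray(values, dtype=float).ravel()
    # Assumption: equal-width bins spanning the observed range.
    edges = np.histogram_bin_edges(values, bins=num_bins)
    # Use interior edges only so the maximum value lands in the last bin.
    return np.digitize(values, edges[1:-1])


def contingency_matrix(factor_codes, label_codes):
    """Count how often each factor code co-occurs with each class label code."""
    table = np.zeros((factor_codes.max() + 1, label_codes.max() + 1), dtype=int)
    np.add.at(table, (factor_codes, label_codes), 1)
    return table


def chi_square_for_factor(factor_codes, class_labels):
    """Chi-square score and p-value for one factor (illustrative only)."""
    # Map arbitrary labels to codes 0..k-1 so they can index columns.
    _, label_codes = np.unique(np.asarray(class_labels).ravel(), return_inverse=True)
    table = contingency_matrix(np.asarray(factor_codes).ravel(), label_codes)

    # Drop all-zero rows (for example, empty bins); they would produce
    # zero expected frequencies, which chi2_contingency rejects.
    table = table[table.sum(axis=1) > 0]

    # Warn about sparse cells: counts greater than 0 but less than 5.
    if np.any((table > 0) & (table < 5)):
        warnings.warn("Some cells have counts below 5; chi-square may be inaccurate.")

    statistic, p_value, _, _ = chi2_contingency(table)
    return float(statistic), float(p_value)


# Toy data: the factor value fully determines the label.
ages = [0.0] * 10 + [10.0] * 10
labels = [0] * 10 + [1] * 10
codes = discretize(ages, num_bins=4)  # bins 1 and 2 stay empty
score, p_value = chi_square_for_factor(codes, labels)
```

In this toy input, two of the four bins are empty, so the contingency matrix has two all-zero rows that are removed before the test. The remaining cells are either 0 or 10, so no warning is issued.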

Examples

Randomly create some “continuous” and categorical variables using a generator from np.random.default_rng (np_random_gen below)

>>> data_factors = {
...     "age": np_random_gen.choice([25, 30, 35, 45], (100)),
...     "income": np_random_gen.choice([50000, 65000, 80000], (100)),
...     "gender": np_random_gen.choice(["M", "F"], (100)),
...     "class": np_random_gen.choice([0, 1, 2], (100)),
... }
>>> continuous_factor_bincounts = {"age": 4, "income": 3}
>>> parity(data_factors, continuous_factor_bincounts)
ParityOutput(score=array([2.82329785, 1.60625584, 1.38377236]), p_value=array([0.83067563, 0.80766733, 0.5006309 ]))
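
One way to read this result is to compare each p-value against a significance level. The sketch below copies the arrays from the output above into plain NumPy arrays; the factor order (age, income, gender, with "class" excluded) and the 0.05 threshold are assumptions made for illustration, not something this page specifies.

```python
import numpy as np

# Values copied from the example output above.
scores = np.array([2.82329785, 1.60625584, 1.38377236])
p_values = np.array([0.83067563, 0.80766733, 0.5006309])

# Assumption: results follow the key order of data_factors, with "class" excluded.
factor_names = ["age", "income", "gender"]
alpha = 0.05  # conventional significance level, chosen for illustration

# Keep the factors whose p-value falls below the threshold.
flagged = [name for name, p in zip(factor_names, p_values) if p < alpha]
print(flagged)  # prints [] because every p-value exceeds alpha
```

No factor is flagged here, which is what one would expect when the factors and class labels are drawn independently at random.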