parity

dataeval.metrics.bias.parity(class_labels: ArrayLike, metadata: Mapping[str, ArrayLike], continuous_factor_bincounts: Mapping[str, int] | None = None) → ParityOutput[ndarray[Any, dtype[float64]]]

Calculate chi-square statistics to assess the relationship between multiple factors and class labels.

This function computes the chi-square statistic for each metadata factor to determine if there is a significant relationship between the factor values and class labels. The function handles both categorical and discretized continuous factors.

Parameters:

class_labels (ArrayLike) – The class label of each image
metadata (Mapping[str, ArrayLike]) – The dataset factors, which are per-image metadata attributes. Each key of metadata is a factor name, and its value holds the per-image values of that factor.
continuous_factor_bincounts (Mapping[str, int] | None, default None) – A dictionary specifying the number of bins for discretizing the continuous factors. The keys should correspond to the names of continuous factors in metadata, and the values should be the number of bins to use for discretization. If not provided, no discretization is applied.
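To illustrate what discretizing a continuous factor into a fixed number of bins can look like, here is a minimal sketch using equal-width bins. The helper equal_width_bins is hypothetical and is not part of dataeval; the binning strategy the library actually uses is not stated here and may differ.

```python
def equal_width_bins(values, num_bins):
    """Assign each value to one of num_bins equal-width bins (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into a single bin.
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # The maximum value would land in bin num_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


ages = [25, 30, 35, 45]
incomes = [50000, 65000, 80000]
age_bins = equal_width_bins(ages, 4)
income_bins = equal_width_bins(incomes, 3)
```

Using fewer bins merges neighbouring values, which raises the count in each contingency cell.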

Returns:

Arrays of length num_factors (score and p_value) whose i-th elements hold the chi-square score and the p-value for the relationship between factor i and the class labels in the dataset.

Return type:

ParityOutput[NDArray[np.float64]]

Raises:

Warning – If any cell in the contingency matrix has a value strictly between 0 and 5, a warning is issued because such small counts can make the chi-square calculation inaccurate. It is recommended to ensure that each label co-occurs with each factor value either 0 times or at least 5 times. Alternatively, continuous-valued factors can be digitized into fewer bins.
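The condition described above can be checked on any contingency matrix before running the test. The following is only a sketch: the helper sparse_cells and the example counts are hypothetical, and the library's own check and warning text may differ.

```python
import warnings


def sparse_cells(matrix, minimum=5):
    """Return (row, column, count) for every cell with 0 < count < minimum."""
    return [
        (i, j, count)
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
        if 0 < count < minimum
    ]


# Hypothetical counts: rows are factor values, columns are class labels.
matrix = [[12, 0, 7], [3, 9, 5]]
flagged = sparse_cells(matrix)
if flagged:
    warnings.warn(f"{len(flagged)} contingency cell(s) have counts between 0 and 5")
```

Cells equal to 0 or at least 5 are not flagged, which matches the recommendation above.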

Note

Each key of the continuous_factor_bincounts dictionary must occur as a key in metadata.
A high score with a low p-value suggests that a metadata factor is strongly associated with the class labels.
The function creates a contingency matrix for each factor, where each entry represents the frequency of a specific factor value co-occurring with a particular class label.
Rows containing only zeros in the contingency matrix are removed before performing the chi-square test to prevent errors in the calculation.
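The steps in the notes above can be sketched in plain Python. This is an illustration only, not the library's implementation: both helper functions are hypothetical, rows are assumed to be factor values and columns class labels, and every column is assumed to have at least one non-zero count. The p-value would then follow from the chi-square distribution with the returned degrees of freedom.

```python
from collections import Counter


def contingency_matrix(factor_values, class_labels, factor_levels, class_levels):
    """Count how often each factor level co-occurs with each class label."""
    counts = Counter(zip(factor_values, class_labels))
    return [[counts[(f, c)] for c in class_levels] for f in factor_levels]


def chi_square_statistic(matrix):
    """Return the Pearson chi-square statistic and its degrees of freedom."""
    # Drop all-zero rows: their expected counts are zero and would cause
    # a division by zero below.
    rows = [row for row in matrix if any(row)]
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    total = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(rows, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(rows) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical toy data: factor level 2 (for example, an empty bin) never occurs.
factor = [0, 0, 0, 0, 1, 1, 1, 1]
labels = [0, 0, 1, 1, 0, 0, 1, 1]
matrix = contingency_matrix(factor, labels, factor_levels=[0, 1, 2], class_levels=[0, 1])
statistic, dof = chi_square_statistic(matrix)
```

In this toy data the factor is independent of the label, so the statistic is zero; a table such as [[4, 0], [0, 4]], where the factor determines the label, yields a large statistic instead.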

Examples

Randomly create some “continuous” and categorical variables using a generator obtained from np.random.default_rng (named np_random_gen here):

>>> labels = np_random_gen.choice([0, 1, 2], (100))
>>> metadata = {
...     "age": np_random_gen.choice([25, 30, 35, 45], (100)),
...     "income": np_random_gen.choice([50000, 65000, 80000], (100)),
...     "gender": np_random_gen.choice(["M", "F"], (100)),
... }
>>> continuous_factor_bincounts = {"age": 4, "income": 3}
>>> parity(labels, metadata, continuous_factor_bincounts)
ParityOutput(score=array([7.35731943, 5.46711299, 0.51506212]), p_value=array([0.28906231, 0.24263543, 0.77295762]), metadata_names=['age', 'income', 'gender'])
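One way to read this output is to compare each p-value against a significance level. The sketch below copies the values printed above into plain lists so that it runs without dataeval; the 0.05 threshold is an assumed convention, not something the function prescribes.

```python
# Values copied from the ParityOutput shown above.
score = [7.35731943, 5.46711299, 0.51506212]
p_value = [0.28906231, 0.24263543, 0.77295762]
metadata_names = ["age", "income", "gender"]

alpha = 0.05  # assumed significance level

# Factors whose relationship with the class labels is significant at alpha.
significant = [name for name, p in zip(metadata_names, p_value) if p < alpha]

# Factors ordered from the highest chi-square score to the lowest.
ranked = [name for _, name in sorted(zip(score, metadata_names), reverse=True)]
```

Here no factor falls below the threshold, which is expected because the labels and factors were drawn independently at random.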