parity

dataeval.metrics.bias.parity(class_labels: ArrayLike, data_factors: Mapping[str, ArrayLike], continuous_factor_bincounts: Mapping[str, int] | None = None) → ParityOutput[ndarray[Any, dtype[float64]]]

Calculate chi-square statistics to assess the relationship between multiple factors and class labels.

This function computes the chi-square statistic for each metadata factor to determine if there is a significant relationship between the factor values and class labels. The function handles both categorical and discretized continuous factors.

Parameters:
  • class_labels (ArrayLike) – The class label of each image, one entry per image

  • data_factors (Mapping[str, ArrayLike]) – The dataset factors, which are per-image metadata attributes. Each key of data_factors is a factor name, and its value holds the per-image values of that factor.

  • continuous_factor_bincounts (Mapping[str, int] | None, default None) – A dictionary specifying the number of bins for discretizing the continuous factors. The keys should correspond to the names of continuous factors in data_factors, and the values should be the number of bins to use for discretization. If not provided, no discretization is applied.
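
The discretization step can be pictured with a minimal sketch. It assumes equal-width bins built with NumPy, which may differ from the binning that parity applies internally; the helper name discretize is hypothetical and not part of dataeval.

```python
import numpy as np


def discretize(values, bins):
    """Map continuous values to integer bin indices 0..bins-1 (equal-width bins).

    Hypothetical helper for illustration only; not part of dataeval.
    """
    values = np.asarray(values, dtype=float)
    # bins + 1 equally spaced edges spanning [min, max]
    edges = np.histogram_bin_edges(values, bins=bins)
    # Compare against the interior edges only, so the maximum lands in the last bin
    return np.digitize(values, edges[1:-1])


ages = [25, 30, 35, 45, 30, 25]
binned = discretize(ages, 4)
print(binned)  # → [0 1 2 3 1 0]
```

Fewer bins give coarser groups, which raises the count in each cell of the contingency matrix.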

Returns:

Two arrays of length num_factors, score and p_value, whose ith elements are the chi-square statistic and the p-value for the relationship between factor i and the class labels in the dataset.

Return type:

ParityOutput[NDArray[np.float64]]

Raises:

Warning – If any cell in the contingency matrix has a value greater than 0 and less than 5, a warning is issued because such low counts can make the chi-square calculation inaccurate. Ensure that each label co-occurs with each factor value either 0 times or at least 5 times. Alternatively, digitize continuous-valued factors into fewer bins.
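
The condition described above can be checked on a contingency table ahead of time. The following is a minimal sketch under the assumption that the table is a 2-D array of counts; has_small_cells is a hypothetical helper, not a dataeval function.

```python
import numpy as np


def has_small_cells(contingency, threshold=5):
    """Return True if any cell count is greater than 0 and less than threshold.

    Hypothetical helper for illustration only; not part of dataeval.
    """
    table = np.asarray(contingency)
    return bool(np.any((table > 0) & (table < threshold)))


ok_table = [[0, 10], [7, 5]]    # every count is 0 or at least 5
bad_table = [[3, 10], [7, 5]]   # the count 3 falls in the problematic range

print(has_small_cells(ok_table))   # → False
print(has_small_cells(bad_table))  # → True
```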

Note

  • Each key of the continuous_factor_bincounts dictionary must occur as a key in data_factors.

  • A high score with a low p-value suggests that a metadata factor is strongly associated with the class labels.

  • The function creates a contingency matrix for each factor, where each entry represents the frequency of a specific factor value co-occurring with a particular class label.

  • Rows containing only zeros in the contingency matrix are removed before performing the chi-square test to prevent errors in the calculation.
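
The steps in these notes can be sketched for a single factor as follows. This is an illustration under stated assumptions, not the library's implementation: it places factor values on rows and class labels on columns, and it uses scipy.stats.chi2_contingency for the test. The function name chi_square_for_factor is hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency


def chi_square_for_factor(factor_values, class_labels):
    """Chi-square statistic and p-value for one factor against the class labels.

    Hypothetical helper for illustration only; not part of dataeval.
    """
    # Encode factor values and labels as integer codes 0..k-1
    _, f_idx = np.unique(np.asarray(factor_values), return_inverse=True)
    _, l_idx = np.unique(np.asarray(class_labels), return_inverse=True)

    # Contingency matrix: rows are factor values, columns are class labels
    table = np.zeros((f_idx.max() + 1, l_idx.max() + 1), dtype=int)
    np.add.at(table, (f_idx, l_idx), 1)

    # Drop all-zero rows. With this construction every row already has a count;
    # the filter matters when the table is built over a fixed set of values.
    table = table[table.sum(axis=1) > 0]

    chi2, p_value, _dof, _expected = chi2_contingency(table)
    return float(chi2), float(p_value)


labels = [0] * 10 + [1] * 10 + [0] * 10 + [1] * 10
independent = ["a"] * 20 + ["b"] * 20   # evenly spread over both labels
dependent = ["x" if lab == 0 else "y" for lab in labels]  # mirrors the labels

chi2_ind, p_ind = chi_square_for_factor(independent, labels)
chi2_dep, p_dep = chi_square_for_factor(dependent, labels)
print(chi2_ind, p_ind)   # statistic near 0, p-value near 1
print(p_dep < 0.05)      # → True
```

Note that chi2_contingency applies a continuity correction to 2x2 tables by default, so its values may not match those returned by parity.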

Examples

Randomly create some “continuous” and categorical variables. Here np_random_gen is a generator created with np.random.default_rng, so the exact values shown depend on its seed.

>>> labels = np_random_gen.choice([0, 1, 2], (100))
>>> data_factors = {
...     "age": np_random_gen.choice([25, 30, 35, 45], (100)),
...     "income": np_random_gen.choice([50000, 65000, 80000], (100)),
...     "gender": np_random_gen.choice(["M", "F"], (100)),
... }
>>> continuous_factor_bincounts = {"age": 4, "income": 3}
>>> parity(labels, data_factors, continuous_factor_bincounts)
ParityOutput(score=array([7.35731943, 5.46711299, 0.51506212]), p_value=array([0.28906231, 0.24263543, 0.77295762]))
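
To read this result, each score and p-value can be paired with its factor name. The sketch below copies the values printed above into plain lists and assumes that the output order follows the key order of data_factors; the 0.05 significance threshold is also an assumption chosen for illustration.

```python
# Values copied from the example output above
factor_names = ["age", "income", "gender"]  # assumed to match data_factors key order
scores = [7.35731943, 5.46711299, 0.51506212]
p_values = [0.28906231, 0.24263543, 0.77295762]

alpha = 0.05  # assumed significance threshold

# Collect the factors whose p-value falls below the threshold
flagged = [name for name, p in zip(factor_names, p_values) if p < alpha]

for name, score, p in zip(factor_names, scores, p_values):
    print(f"{name}: score={score:.3f}, p_value={p:.3f}")

print(flagged)  # → []
```

No factor is flagged here, which is expected because the labels and factors were drawn independently at random.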