parity
- dataeval.metrics.bias.parity(data_factors: Mapping[str, _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes]], continuous_factor_bincounts: dict[str, int] | None = None) → ParityOutput[ndarray[Any, dtype[float64]]]
Calculate chi-square statistics to assess the relationship between multiple factors and class labels.
This function computes the chi-square statistic for each metadata factor to determine if there is a significant relationship between the factor values and class labels. The function handles both categorical and discretized continuous factors.
- Parameters:
data_factors (Mapping[str, ArrayLike]) – The dataset factors, which are per-image attributes including class label and metadata. Each key of data_factors is a factor name, and its value holds the per-image values of that factor.
continuous_factor_bincounts (Dict[str, int] | None, default None) – A dictionary specifying the number of bins for discretizing the continuous factors. The keys should correspond to the names of continuous factors in data_factors, and the values should be the number of bins to use for discretization. If not provided, no discretization is applied.
- Returns:
Arrays of length (num_factors) whose (i)th element corresponds to the chi-square score and p-value for the relationship between factor i and the class labels in the dataset.
- Return type:
ParityOutput[NDArray[np.float64]]
- Raises:
Warning – If any cell in the contingency matrix has a value greater than 0 and less than 5, a warning is issued because such small counts can lead to inaccurate chi-square calculations. It is recommended to ensure that each label co-occurs with each factor value either 0 times or at least 5 times. Alternatively, continuous-valued factors can be digitized into fewer bins.
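The low-count condition described above can be illustrated with a short sketch. This is not dataeval's implementation; the helper name `warn_on_sparse_cells`, its `min_count` parameter, and the warning text are hypothetical and exist only for illustration.

```python
import warnings

import numpy as np


def warn_on_sparse_cells(table, factor_name, min_count=5):
    """Warn if any contingency cell is nonzero but below min_count.

    Hypothetical helper for illustration; not part of dataeval.
    Returns the number of offending cells.
    """
    table = np.asarray(table)
    # Cells that are exactly 0 are acceptable; counts of 1..min_count-1 are not.
    sparse = (table > 0) & (table < min_count)
    n_sparse = int(sparse.sum())
    if n_sparse:
        warnings.warn(
            f"Factor {factor_name!r} has {n_sparse} contingency cell(s) with "
            f"counts between 1 and {min_count - 1}; chi-square results may be "
            "inaccurate.",
            UserWarning,
        )
    return n_sparse
```

A table such as `[[0, 3], [7, 5]]` has one offending cell (the count of 3), so the helper would emit a single warning for it.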
Notes
Each key of the continuous_factor_bincounts dictionary must occur as a key in data_factors.
A high score with a low p-value suggests that a metadata factor is strongly correlated with a class label.
The function creates a contingency matrix for each factor, where each entry represents the frequency of a specific factor value co-occurring with a particular class label.
Rows containing only zeros in the contingency matrix are removed before performing the chi-square test to prevent errors in the calculation.
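The steps in these notes can be sketched as follows. This is a minimal illustration that assumes SciPy's `scipy.stats.chi2_contingency` for the test itself; it is not necessarily how dataeval computes its scores, and the helper names `contingency_matrix` and `chi_square` are hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency


def contingency_matrix(factor_values, labels):
    """Count how often each factor value co-occurs with each class label.

    Rows index the unique factor values, columns the unique labels.
    """
    _, f_idx = np.unique(np.asarray(factor_values), return_inverse=True)
    _, l_idx = np.unique(np.asarray(labels), return_inverse=True)
    table = np.zeros((f_idx.max() + 1, l_idx.max() + 1), dtype=int)
    # Accumulate one count per sample at (factor value, label).
    np.add.at(table, (f_idx, l_idx), 1)
    return table


def chi_square(table):
    """Return (statistic, p_value) after dropping all-zero rows."""
    table = np.asarray(table)
    # An all-zero row yields zero expected frequencies, which makes
    # chi2_contingency raise a ValueError, so such rows are removed first.
    table = table[table.sum(axis=1) > 0]
    # Note: SciPy applies Yates' correction by default for 2x2 tables.
    statistic, p_value, _dof, _expected = chi2_contingency(table)
    return float(statistic), float(p_value)
```

For instance, `chi_square([[10, 20], [0, 0], [20, 10]])` gives the same result as running the test on `[[10, 20], [20, 10]]`, because the empty middle row is discarded.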
Examples
Randomly creating some “continuous” and categorical variables using np.random.default_rng
>>> data_factors = {
...     "age": np_random_gen.choice([25, 30, 35, 45], (100)),
...     "income": np_random_gen.choice([50000, 65000, 80000], (100)),
...     "gender": np_random_gen.choice(["M", "F"], (100)),
...     "class": np_random_gen.choice([0, 1, 2], (100)),
... }
>>> continuous_factor_bincounts = {"age": 4, "income": 3}
>>> parity(data_factors, continuous_factor_bincounts)
ParityOutput(score=array([2.82329785, 1.60625584, 1.38377236]), p_value=array([0.83067563, 0.80766733, 0.5006309 ]))
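To make the role of continuous_factor_bincounts more concrete, here is a sketch of how a bin count might turn a continuous factor into categories. It assumes equal-width bins built with NumPy; the binning strategy dataeval actually uses is not stated here, and the helper name `discretize` is hypothetical.

```python
import numpy as np


def discretize(values, num_bins):
    """Map continuous values to integer bin indices in [0, num_bins - 1].

    Assumes equal-width bins spanning the observed range (an illustrative
    choice, not necessarily what dataeval does).
    """
    values = np.asarray(values, dtype=float)
    edges = np.histogram_bin_edges(values, bins=num_bins)
    # Digitize against interior edges only, so the minimum lands in bin 0
    # and the maximum lands in the last bin rather than an overflow bin.
    return np.digitize(values, edges[1:-1])
```

With four bins, the ages `[25, 30, 35, 45]` are split on the interior edges 30, 35, and 40, so each of the four values falls into its own bin.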