dataeval.core.parity
- dataeval.core.parity(factor_data, class_labels)
Calculate statistical parity using Bias-Corrected Cramér’s V.
This function measures the association between metadata factors and class labels to identify potential bias or spurious correlations. It assumes an equal distribution of metadata factors within the dataset.
The calculation uses the G-test (Log-Likelihood Ratio) for the statistical test and applies the Bergsma (2013) bias correction to the Cramér’s V statistic. This correction provides a more accurate estimate of association strength than standard Cramér’s V, particularly for finite samples or large contingency tables.
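For reference, the quantities named above can be written out as follows. This is a sketch based on Bergsma (2013), which states the correction in terms of Pearson's chi-squared statistic; substituting the G statistic for it is an assumption about how this function combines the two, not something the text above confirms. Here $O_{ij}$ are the observed counts of an $r \times k$ contingency table with $n$ samples, and $E_{ij} = O_{i\cdot} O_{\cdot j} / n$ are the expected counts under independence.

```latex
\begin{align}
G &= 2 \sum_{i,j} O_{ij} \ln \frac{O_{ij}}{E_{ij}}, &
\hat{\varphi}^2 &= \frac{G}{n} \\
\tilde{\varphi}^2 &= \max\!\left(0,\ \hat{\varphi}^2 - \frac{(k-1)(r-1)}{n-1}\right) \\
\tilde{k} &= k - \frac{(k-1)^2}{n-1}, &
\tilde{r} &= r - \frac{(r-1)^2}{n-1} \\
\tilde{V} &= \sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\ \tilde{r}-1)}}
\end{align}
```

The subtraction in the second line removes the expected value of $\hat{\varphi}^2$ under independence, which is why the corrected statistic stays near 0 for unrelated variables even when the table is large relative to the sample.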
- Parameters:
- Returns:
A dictionary containing:
scores: NDArray[np.float64] - Array of bias-corrected Cramér’s V statistics ranging from 0 (independence) to 1 (perfect association).
p_values: NDArray[np.float64] - Array of p-values from the G-test. Low p-values (< 0.05) indicate a statistically significant association between the factor and the class labels.
insufficient_data: Mapping[int, Mapping[int, Mapping[int, int]]] - Nested dictionary flagging specific combinations with low sample counts (< 5).
Sample structure: {factor_index: {factor_category: {class_label: count}}}.
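A minimal sketch of reading the returned dictionary, assuming only the three keys and the nesting described above. The `result` value is a hand-written stand-in with illustrative numbers, not output from dataeval, and `flag_factors` is a hypothetical helper.

```python
import numpy as np

# Hand-written stand-in shaped like the documented return value.
# The numbers are illustrative only and were not produced by dataeval.
result = {
    "scores": np.array([0.04, 0.62]),
    "p_values": np.array([0.71, 0.001]),
    # {factor_index: {factor_category: {class_label: count}}}
    "insufficient_data": {1: {2: {0: 3}}},
}


def flag_factors(result, alpha=0.05):
    """Hypothetical helper: one summary row per metadata factor."""
    rows = []
    for i, (score, p) in enumerate(zip(result["scores"], result["p_values"])):
        # Flatten the nested mapping into (category, class_label, count) tuples.
        low_cells = [
            (category, label, count)
            for category, by_class in result["insufficient_data"].get(i, {}).items()
            for label, count in by_class.items()
        ]
        rows.append(
            {
                "factor": i,
                "score": float(score),
                "significant": bool(p < alpha),
                "low_count_cells": low_cells,
            }
        )
    return rows


report = flag_factors(result)
```

A factor that is both significant and listed in `insufficient_data` deserves extra care, since sparse cells make the test less reliable.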
- Return type:
Notes
Interpretation:
- 0.0 - 0.1: Negligible association (High Parity)
- 0.1 - 0.3: Weak association
- 0.3 - 0.5: Moderate association
- > 0.5: Strong association (Potential Bias)
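The bands above can be applied programmatically. The sketch below uses a hypothetical helper, `interpret_parity_score`, that is not part of dataeval. The bands share their endpoints, so placing 0.1 and 0.3 in the higher band is an assumption; 0.5 is treated as moderate because the strong band is stated as "> 0.5".

```python
def interpret_parity_score(score: float) -> str:
    """Hypothetical helper mapping a bias-corrected Cramér's V to a label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score < 0.1:
        return "negligible"
    if score < 0.3:  # assumption: 0.1 itself counts as weak
        return "weak"
    if score <= 0.5:  # assumption: 0.3 itself counts as moderate
        return "moderate"
    return "strong"


labels = [interpret_parity_score(s) for s in (0.05, 0.2, 0.5, 0.8)]
```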
Methodology:
1. Constructs a contingency matrix for each factor against class labels.
2. Identifies and flags cells with counts < 5 (insufficient data).
3. Removes rows with zero sums to prevent calculation errors.
4. Performs a G-test (Log-Likelihood Ratio) instead of Pearson’s Chi-Squared.
5. Computes Cramér’s V with Bergsma’s bias correction.
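The five steps can be sketched for a single factor with NumPy and SciPy. This is an illustration under stated assumptions, not dataeval's implementation: `bias_corrected_cramers_v` is a hypothetical name, the G statistic is assumed to replace chi-squared in the V formula, Yates' continuity correction is disabled, and the result is clipped to 1 because G can exceed the maximum of Pearson's statistic.

```python
import numpy as np
from scipy.stats import chi2_contingency


def bias_corrected_cramers_v(factor, labels, min_count=5):
    """Hypothetical single-factor sketch of the listed steps."""
    factor = np.asarray(factor).ravel()
    labels = np.asarray(labels).ravel()

    # Step 1: contingency matrix (rows: factor categories, columns: classes).
    f_vals, f_idx = np.unique(factor, return_inverse=True)
    c_vals, c_idx = np.unique(labels, return_inverse=True)
    table = np.zeros((f_vals.size, c_vals.size), dtype=np.int64)
    np.add.at(table, (f_idx, c_idx), 1)

    # Step 2: record cells whose count is below min_count.
    low = {
        (f_vals[i].item(), c_vals[j].item()): int(table[i, j])
        for i, j in zip(*np.nonzero(table < min_count))
    }

    # Step 3: drop empty rows (a no-op here, since rows come from np.unique).
    table = table[table.sum(axis=1) > 0]

    # Step 4: G-test via the log-likelihood ratio statistic.
    g, p, _, _ = chi2_contingency(
        table, correction=False, lambda_="log-likelihood"
    )

    # Step 5: Bergsma's bias correction, with G assumed in place of chi-squared.
    n = table.sum()
    r, k = table.shape
    phi2 = max(0.0, g / n - (k - 1) * (r - 1) / (n - 1))
    k_t = k - (k - 1) ** 2 / (n - 1)
    r_t = r - (r - 1) ** 2 / (n - 1)
    denom = min(k_t - 1, r_t - 1)
    v = float(np.sqrt(phi2 / denom)) if denom > 0 else 0.0
    return min(v, 1.0), float(p), low


# Factor independent of the label: every cell holds 25 samples.
v_ind, p_ind, low_ind = bias_corrected_cramers_v([0, 1] * 50, [0] * 50 + [1] * 50)
# Factor identical to the label: the off-diagonal cells are empty.
v_dep, p_dep, low_dep = bias_corrected_cramers_v([0] * 50 + [1] * 50, [0] * 50 + [1] * 50)
```

In the second call the two empty off-diagonal cells are reported as low-count, which shows that a strong association and sparse cells often appear together.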
References
Bergsma, W. (2013). A bias-correction for Cramér’s V and Tschuprow’s T. Journal of the Korean Statistical Society, 42(3), 323-328.
See also
balance