dataeval.bias.Parity
class dataeval.bias.Parity(score_threshold=0.3, p_value_threshold=0.05)
Calculate statistical parity using Bias-Corrected Cramér’s V.
This class measures the association between metadata factors and class labels to identify potential bias or spurious correlations. It assumes an equal distribution of metadata factors within the dataset.
The calculation uses the G-test (Log-Likelihood Ratio) for the statistical test and applies the Bergsma (2013) bias correction to the Cramér’s V statistic. This correction provides a more accurate estimate of association strength than standard Cramér’s V, particularly for finite samples or large contingency tables.
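As a rough illustration of the two quantities named above, the following standard-library sketch computes a G statistic and Bergsma's bias-corrected Cramér’s V for a small contingency table. The function names are hypothetical and this is not dataeval's implementation. In particular, the sketch feeds Pearson's chi-squared into the correction, as in Bergsma (2013); whether the library uses that or the G statistic for the score is not stated here.

```python
import math


def _expected(table):
    """Expected counts under independence: row_sum * col_sum / n."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    return [[rs * cs / n for cs in col_sums] for rs in row_sums]


def g_statistic(table):
    """Log-likelihood ratio statistic: G = 2 * sum(O * ln(O / E))."""
    expected = _expected(table)
    total = 0.0
    for obs_row, exp_row in zip(table, expected):
        for obs, exp in zip(obs_row, exp_row):
            if obs > 0:  # a zero cell contributes nothing (0 * ln 0 := 0)
                total += obs * math.log(obs / exp)
    return 2.0 * total


def pearson_chi2(table):
    """Pearson statistic: sum((O - E)^2 / E), skipping cells with E == 0."""
    expected = _expected(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
        if exp > 0
    )


def bias_corrected_cramers_v(table):
    """Bergsma (2013) correction applied to Cramér's V (assumes n > 1)."""
    n = sum(sum(row) for row in table)
    r, k = len(table), len(table[0])
    phi2 = pearson_chi2(table) / n
    # Subtract the expected value of phi^2 under independence, floored at 0.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    # Shrink the table dimensions by the same finite-sample logic.
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    return math.sqrt(phi2_corr / denom) if denom > 0 else 0.0
```

For a table whose rows are proportional, such as `[[10, 10], [10, 10]]`, both statistics are zero and the corrected score is 0.0. A p-value for the G statistic would come from a chi-squared distribution with (r - 1)(k - 1) degrees of freedom, for example via `scipy.stats.chi2.sf`.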
- Parameters:
- score_threshold : float, default 0.3
Threshold for identifying highly correlated factors. Factors with Cramér’s V above this threshold and p-value below p_value_threshold are considered highly correlated with class labels.
- p_value_threshold : float, default 0.05
P-value threshold for statistical significance. Only factors with p-value below this threshold are considered for correlation flagging.
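The way the two thresholds combine can be sketched as a single predicate. This is a minimal illustration with a hypothetical helper name, assuming strict inequalities for "above" and "below"; it is not the library's code.

```python
def is_correlated(score, p_value, score_threshold=0.3, p_value_threshold=0.05):
    """Flag a factor only when it is both strongly and significantly associated.

    Hypothetical helper: strict comparisons are an assumption based on the
    wording "above this threshold" and "below this threshold".
    """
    return score > score_threshold and p_value < p_value_threshold


# A significant but weak association is not flagged with the defaults.
weak_but_significant = is_correlated(score=0.291057, p_value=0.0055)

# A strong and significant association is flagged.
strong_and_significant = is_correlated(score=0.445379, p_value=4.8290e-8)
```

Raising `score_threshold` or lowering `p_value_threshold` makes the flag more conservative, since both conditions must hold.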
Notes
Interpretation:
- 0.0 - 0.1: Negligible association (High Parity)
- 0.1 - 0.3: Weak association
- 0.3 - 0.5: Moderate association
- > 0.5: Strong association (Potential Bias)
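The interpretation bands above can be turned into a small lookup for reporting. The helper below is hypothetical, and because the listed ranges share their endpoints, the handling of exact boundary values (0.1 and 0.3 go to the higher band, 0.5 stays "moderate") is an assumption.

```python
def interpret_score(score):
    """Map a Cramér's V score to the qualitative bands listed above.

    Hypothetical helper; boundary handling is an assumption.
    """
    if score < 0.1:
        return "negligible"
    if score < 0.3:
        return "weak"
    if score <= 0.5:  # "> 0.5" is the only band stated with a strict bound
        return "moderate"
    return "strong"


labels = {name: interpret_score(v) for name, v in
          {"age": 0.445379, "gender": 0.291057, "income": 0.568195}.items()}
```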
Methodology:
1. Constructs a contingency matrix for each factor against class labels.
2. Identifies and flags cells with counts < 5 (insufficient data).
3. Removes rows with zero sums to prevent calculation errors.
4. Performs a G-test (Log-Likelihood Ratio) instead of Pearson’s Chi-Squared.
5. Computes Cramér’s V with Bergsma’s bias correction.
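The first three methodology steps can be sketched with the standard library alone. All function names here are hypothetical, and two details are assumptions rather than documented behavior: factor levels are placed on rows with class labels on columns, and zero-count cells are included among the cells flagged as below the minimum count.

```python
from collections import Counter


def contingency_table(factor_values, class_labels, levels=None):
    """Step 1: count co-occurrences of each factor level with each class."""
    levels = sorted(set(factor_values)) if levels is None else list(levels)
    classes = sorted(set(class_labels))
    counts = Counter(zip(factor_values, class_labels))
    table = [[counts[(lv, c)] for c in classes] for lv in levels]
    return levels, classes, table


def low_count_cells(levels, classes, table, min_count=5):
    """Step 2: list (level, class, count) for cells below min_count."""
    return [
        (lv, c, table[i][j])
        for i, lv in enumerate(levels)
        for j, c in enumerate(classes)
        if table[i][j] < min_count
    ]


def drop_empty_rows(levels, table):
    """Step 3: remove rows whose counts sum to zero."""
    kept = [(lv, row) for lv, row in zip(levels, table) if sum(row) > 0]
    return [lv for lv, _ in kept], [row for _, row in kept]


# Toy data: "X" is a defined level (e.g. a bin) that never occurs.
factor = ["M"] * 6 + ["F"] * 3 + ["M"] * 2
labels = ["a"] * 6 + ["b"] * 3 + ["b"] * 2
levels, classes, table = contingency_table(factor, labels, levels=["F", "M", "X"])
flagged = low_count_cells(levels, classes, table)
kept_levels, cleaned = drop_empty_rows(levels, table)
```

The cleaned table is what a G-test and the bias-corrected score (steps 4 and 5) would then operate on. An all-zero row would otherwise produce zero expected counts and a division by zero.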
References
Bergsma, W. (2013). A bias-correction for Cramér’s V and Tschuprow’s T. Journal of the Korean Statistical Society, 42(3), 323-328.
Examples
Initialize the Parity class:
>>> parity = Parity()
Specifying custom thresholds:
>>> parity = Parity(score_threshold=0.4, p_value_threshold=0.01)
output = parity(metadata.binned_data, metadata.class_labels.tolist())
- evaluate(data)
Calculate parity statistics for the dataset: a G-test p-value and a bias-corrected Cramér’s V score for each metadata factor.
- Parameters:
- data : AnnotatedDataset[Any] or Metadata
Either an annotated dataset (which will be converted to Metadata) or preprocessed Metadata directly.
- Returns:
DataFrame containing score, p_value, and correlation flags for each factor, along with insufficient data details.
- Return type:
Examples
Randomly creating some “continuous” and categorical variables using np.random.default_rng:
>>> metadata = generate_random_metadata(
...     labels=["doctor", "artist", "teacher"],
...     factors={"age": [25, 30, 35, 45], "income": [50000, 65000, 80000], "gender": ["M", "F"]},
...     length=100,
...     random_seed=175,
... )
>>> parity = Parity()
>>> result = parity.evaluate(metadata)
>>> result.factors
shape: (3, 5)
┌─────────────┬──────────┬────────────┬───────────────┬───────────────────────┐
│ factor_name ┆ score    ┆ p_value    ┆ is_correlated ┆ has_insufficient_data │
│ ---         ┆ ---      ┆ ---        ┆ ---           ┆ ---                   │
│ cat         ┆ f64      ┆ f64        ┆ bool          ┆ bool                  │
╞═════════════╪══════════╪════════════╪═══════════════╪═══════════════════════╡
│ age         ┆ 0.445379 ┆ 4.8290e-8  ┆ true          ┆ true                  │
│ gender      ┆ 0.291057 ┆ 0.0055     ┆ false         ┆ false                 │
│ income      ┆ 0.568195 ┆ 8.4062e-14 ┆ true          ┆ true                  │
└─────────────┴──────────┴────────────┴───────────────┴───────────────────────┘