Parity#

What is parity?#

Parity is a means of assessing fairness in machine learning by testing for statistical independence between metadata factors and class labels in a dataset. This assessment helps a user understand potential sources of bias before a model is trained on the dataset and inadvertently learns spurious correlations. In the ideal case with zero bias, the probability of observing a class label is independent of the value of any metadata factor.
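
The independence condition can be illustrated with a minimal sketch that compares the marginal probability of each label with its probability conditioned on each value of a metadata factor. The toy records and the helper names `label_rates` and `conditional_label_rates` are assumptions made for illustration; they are not part of any library.

```python
from collections import Counter, defaultdict

def label_rates(labels):
    """Return the marginal probability of each class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}

def conditional_label_rates(labels, factor):
    """Return P(label | factor value) for every observed factor value."""
    groups = defaultdict(list)
    for label, value in zip(labels, factor):
        groups[value].append(label)
    return {value: label_rates(group) for value, group in groups.items()}

# Hypothetical toy data: class labels and one metadata factor (time of day).
labels = ["cat", "dog", "cat", "dog", "cat", "dog", "cat", "dog"]
time_of_day = ["day", "day", "night", "night", "day", "day", "night", "night"]

marginal = label_rates(labels)
conditional = conditional_label_rates(labels, time_of_day)

# Largest absolute gap between P(label | value) and P(label).
# A gap of 0.0 means the label distribution is identical for every factor value.
max_gap = max(
    abs(rates.get(label, 0.0) - p)
    for rates in conditional.values()
    for label, p in marginal.items()
)
print(marginal)
print(conditional)
print(max_gap)
```

In this toy dataset the labels are split evenly within both "day" and "night", so the conditional rates match the marginal rates and the gap is zero. Real data rarely match exactly, which is why a statistical test is normally used to decide whether a gap is meaningful.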

Why is statistical independence important for metadata?#

A model trained on a dataset must avoid learning unintended bias. A common way in which bias manifests is when class labels are not statistically independent from metadata attributes. For example, consider a scenario where a user wants to train a model to classify images as cats or as dogs. Suppose that, in this dataset, all dog pictures were taken in Washington, and all cat pictures were taken in Arizona. A model could learn this spurious correlation, and could classify an image as a cat or dog by inspecting the location information, rather than by inspecting features of cats and dogs. Thus, a picture of a cat taken in Washington could be misclassified as a dog. Early detection and mitigation of metadata bias is critical for training unbiased and reliable models.
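
The cat and dog scenario can be checked numerically. The sketch below uses Pearson's chi-squared statistic and Cramér's V, which are one common way to quantify dependence between two categorical variables; the text above does not prescribe a particular test, so this choice is an assumption. The data and the function names `chi_square_statistic` and `cramers_v` are hypothetical.

```python
import math
from collections import Counter

def chi_square_statistic(labels, factor):
    """Pearson chi-squared statistic for the label-by-factor contingency table."""
    n = len(labels)
    observed = Counter(zip(labels, factor))
    label_totals = Counter(labels)
    factor_totals = Counter(factor)
    stat = 0.0
    for label, label_n in label_totals.items():
        for value, value_n in factor_totals.items():
            # Count expected in this cell if label and factor were independent.
            expected = label_n * value_n / n
            stat += (observed[(label, value)] - expected) ** 2 / expected
    return stat

def cramers_v(labels, factor):
    """Cramér's V effect size in [0, 1]; 0 means no association."""
    n = len(labels)
    k = min(len(set(labels)), len(set(factor))) - 1
    if k == 0:
        return 0.0
    return math.sqrt(chi_square_statistic(labels, factor) / (n * k))

# Hypothetical dataset mirroring the example: every dog photo was taken in
# Washington and every cat photo was taken in Arizona.
labels = ["dog"] * 50 + ["cat"] * 50
biased_location = ["Washington"] * 50 + ["Arizona"] * 50

# Hypothetical balanced alternative: each class is split evenly across states.
balanced_location = ["Washington", "Arizona"] * 50

biased_stat = chi_square_statistic(labels, biased_location)
biased_v = cramers_v(labels, biased_location)
balanced_stat = chi_square_statistic(labels, balanced_location)

print(biased_stat, biased_v, balanced_stat)
```

With the fully confounded locations, Cramér's V reaches its maximum of 1, meaning location alone determines the label. With the balanced locations the statistic is 0. A library routine such as `scipy.stats.chi2_contingency` could be used instead to obtain a p-value as well.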

Why use parity over other statistical methods?#

Parity measures bias in the dataset itself, before a model is trained, which allows faster iteration when developing unbiased ML pipelines. Several other methods commonly used to assess bias and fairness, such as error rate balance, test-fairness, positive/negative class balance, and equal-confusion fairness, are calculated from model predictions and probabilities. Those methods can therefore be evaluated only after a model has been trained.

What can be done with the parity information?#

If all metadata factors are independent of the class labels, a model trained on the dataset is less likely to overfit to spurious correlations.

If a metadata factor is not independent from class labels, then a model trained on the dataset could exhibit unintended bias. In this case, action is recommended. Actions include, but are not limited to:

Collecting or generating additional training data that has consistent label distributions across all values of the metadata factor.
Identifying how the spurious correlation manifests in the embeddings in a model, and subtracting out the bias in latent space.
Assigning weights to the loss function that de-emphasize samples that exhibit spurious correlations.
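
The loss-weighting action listed above can be sketched with a simple reweighing scheme: each sample receives the weight P(label) * P(value) / P(label, value), so that over-represented label and metadata combinations count for less. This is one possible scheme chosen for illustration, not necessarily the method the authors have in mind, and the data and the function name `reweighing_weights` are hypothetical.

```python
from collections import Counter

def reweighing_weights(labels, factor):
    """Per-sample weights w = P(label) * P(value) / P(label, value)."""
    n = len(labels)
    label_totals = Counter(labels)
    factor_totals = Counter(factor)
    joint = Counter(zip(labels, factor))
    return [
        label_totals[y] * factor_totals[m] / (n * joint[(y, m)])
        for y, m in zip(labels, factor)
    ]

# Hypothetical skewed data: dogs mostly in Washington, cats mostly in Arizona.
labels = ["dog"] * 40 + ["dog"] * 10 + ["cat"] * 10 + ["cat"] * 40
location = (
    ["Washington"] * 40 + ["Arizona"] * 10 + ["Washington"] * 10 + ["Arizona"] * 40
)

weights = reweighing_weights(labels, location)

# Weighted count of each (label, location) cell after reweighing.
weighted = Counter()
for y, m, w in zip(labels, location, weights):
    weighted[(y, m)] += w

print(dict(weighted))
```

After reweighing, every label and location combination carries the same weighted count, so the weighted label distribution no longer depends on location. The weights could then be supplied as per-sample loss weights during training. This approach cannot help when a combination never occurs in the data, as in the fully confounded cat and dog example, because there are no samples to up-weight; collecting additional data is then the more suitable action.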
