diversity

Diversity and classwise diversity measure the evenness or uniformity of metadata factors either over the entire dataset or by class. Diversity indices may indicate which intrinsic or extrinsic metadata factors are sampled disproportionately to others.

dataeval.metrics.bias.diversity(metadata: MetadataOutput, method: Literal['simpson', 'shannon'] = 'simpson') → DiversityOutput

Compute diversity and classwise diversity for discrete/categorical variables and, through standard histogram binning, for continuous variables.

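By default (method="simpson"), diversity is defined as a normalized form of the inverse Simpson diversity index. With method="shannon", a Shannon diversity index is computed instead.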

  • diversity = 1 implies that samples are evenly distributed across a particular factor.

  • diversity = 0 implies that all samples belong to one category/bin.
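The two endpoints above can be checked with a small sketch. It assumes one common normalization of the inverse Simpson index, (1/Σp² − 1)/(k − 1), where p are the category frequencies and k is the number of observed categories. The helper name normalized_inverse_simpson is hypothetical and is not part of DataEval; the library's exact implementation may differ.

```python
from collections import Counter


def normalized_inverse_simpson(values):
    """Normalized inverse Simpson diversity of a sequence of discrete labels.

    Hypothetical helper for illustration only; assumes normalization by the
    number of observed categories.
    """
    counts = Counter(values)
    k = len(counts)  # number of observed categories
    if k <= 1:
        # A single category (or no data) is treated as zero diversity.
        return 0.0
    n = sum(counts.values())
    # Sum of squared relative frequencies (the Simpson concentration).
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    inverse_simpson = 1.0 / sum_sq  # ranges from 1 to k
    # Rescale from [1, k] to [0, 1].
    return (inverse_simpson - 1.0) / (k - 1)


even = normalized_inverse_simpson(["a", "b", "c"] * 4)
single = normalized_inverse_simpson(["a"] * 5)
skewed = normalized_inverse_simpson(["a", "a", "a", "b"])
```

Under this assumption, an even split gives a value of 1 (up to floating-point rounding), a single category gives 0, and the skewed sample falls in between at roughly 0.6.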

Parameters:

metadata (MetadataOutput) – Output after running metadata_preprocessing

method ({'simpson', 'shannon'}, default 'simpson') – The diversity index to compute

Note

  • The general diversity expression of order q is undefined for q=1, but it approaches the Shannon entropy in the limit q → 1.

  • If there is only one category, the diversity index takes a value of 0.
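As a rough illustration of a Shannon-based index, the sketch below divides the Shannon entropy of the category frequencies by log(k), its maximum for k observed categories, so the result lies in [0, 1]. This normalization is an assumption made for illustration, and normalized_shannon is a hypothetical helper rather than a DataEval function. The entropy term is the same quantity scipy.stats.entropy computes with the natural logarithm, written out here with the standard library only.

```python
import math
from collections import Counter


def normalized_shannon(values):
    """Shannon entropy of discrete labels, divided by its maximum log(k).

    Hypothetical helper for illustration only.
    """
    counts = Counter(values)
    k = len(counts)  # number of observed categories
    if k <= 1:
        # With only one category the index is defined as 0.
        return 0.0
    n = sum(counts.values())
    # Shannon entropy in nats: -sum(p * ln p) over observed categories.
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    # log(k) is the entropy of a perfectly even split over k categories.
    return entropy / math.log(k)


even = normalized_shannon(["a", "b", "c"] * 4)
single = normalized_shannon(["a"] * 5)
skewed = normalized_shannon(["a", "a", "a", "b"])
```

The explicit k <= 1 branch mirrors the note above: without it, the division by log(1) = 0 would fail for a single category.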

Returns:

Diversity index for each metadata factor, and classwise diversity as an array of shape [n_class x n_factor]

Return type:

DiversityOutput
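To make the [n_class x n_factor] layout concrete, here is a minimal sketch that computes a diversity value for each factor within each class. Everything in it is hypothetical: the helper names, the toy data, and the choice to normalize by the categories observed inside each class (the library may instead normalize by the bins found over the whole dataset).

```python
from collections import Counter, defaultdict


def simpson_evenness(values):
    """Normalized inverse Simpson index over observed categories (illustrative)."""
    counts = Counter(values)
    k = len(counts)
    if k <= 1:
        return 0.0
    n = sum(counts.values())
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return (1.0 / sum_sq - 1.0) / (k - 1)


def classwise_diversity(class_labels, factors):
    """Return one row per class (sorted by label) and one column per factor."""
    # Collect the sample indices that belong to each class.
    indices_by_class = defaultdict(list)
    for i, label in enumerate(class_labels):
        indices_by_class[label].append(i)
    # For each class, score every factor using only that class's samples.
    return [
        [simpson_evenness([column[i] for i in idx]) for column in factors.values()]
        for _, idx in sorted(indices_by_class.items())
    ]


# Toy data: 8 samples, 2 classes, 2 metadata factors.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
factors = {
    "time_of_day": ["day", "night", "day", "night", "day", "day", "day", "day"],
    "sensor": ["a", "b", "c", "a", "a", "a", "b", "b"],
}
table = classwise_diversity(labels, factors)  # 2 rows (classes) x 2 columns (factors)
```

In this toy data, class 1 contains only "day" samples, so its time_of_day entry is 0, while class 0 is evenly split between "day" and "night" and scores 1 (up to rounding).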

Example

Compute the Simpson diversity index of metadata and class labels:

>>> div_simp = diversity(metadata, method="simpson")
>>> div_simp.diversity_index
array([0.72413793, 0.88636364, 0.72413793])
>>> div_simp.classwise
array([[0.69230769, 0.68965517],
       [0.5       , 0.8       ]])

Compute the Shannon diversity index of metadata and class labels:

>>> div_shan = diversity(metadata, method="shannon")
>>> div_shan.diversity_index
array([0.8812909 , 0.96748876, 0.8812909 ])
>>> div_shan.classwise
array([[0.91651644, 0.86312057],
       [0.68260619, 0.91829583]])

See also

scipy.stats.entropy