dataeval.metrics.bias.diversity

dataeval.metrics.bias.diversity(metadata, method='simpson')

Compute diversity and classwise diversity for discrete/categorical variables and, through standard histogram binning, for continuous variables.

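With the default method='simpson', diversity is defined as a normalized form of the inverse Simpson diversity index. With method='shannon', the Shannon diversity index is used instead.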

  • diversity = 1 implies that samples are evenly distributed across a particular factor.

  • diversity = 0 implies that all samples belong to one category/bin.
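The two endpoints above can be illustrated with a minimal pure-Python sketch. It assumes the normalization (1/Σp² − 1)/(k − 1), where k is the number of observed categories. That formula is consistent with the example values on this page, but it is not stated in the text, so treat it as an assumption. The helper name `simpson_diversity` is hypothetical and is not part of the dataeval API.

```python
from collections import Counter


def simpson_diversity(values):
    """Normalized inverse Simpson index for a sequence of categorical values.

    Hypothetical helper for illustration; the normalization is an assumption.
    """
    counts = Counter(values)
    k = len(counts)  # number of observed categories
    if k <= 1:
        return 0.0  # a single category carries no diversity
    n = sum(counts.values())
    # Probability that two draws (with replacement) fall in the same category.
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    # The inverse Simpson index lies in [1, k]; rescale it to [0, 1].
    return (1.0 / sum_sq - 1.0) / (k - 1)


even = simpson_diversity(["a", "b", "c", "d"])    # evenly distributed
skewed = simpson_diversity(["a", "b", "b", "b"])  # proportions 0.25 / 0.75
single = simpson_diversity(["a", "a", "a"])       # one category only
```

Under this assumption, `even` is 1, `single` is 0, and `skewed` falls in between at about 0.6.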

Parameters:
Return type:

DiversityOutput

Note

  • The expression, viewed as a diversity of order q (the inverse Simpson index is the q=2 case), is undefined for q=1, but it approaches the Shannon entropy in the limit.

  • If there is only one category, the diversity index takes a value of 0.
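The q=1 limit mentioned in the note can be checked numerically. The sketch below assumes that "the expression" refers to the standard diversity of order q (Hill number), (Σ pᵢ^q)^(1/(1−q)). Under that definition it is the logarithm of the diversity that approaches the Shannon entropy as q approaches 1. The function names are hypothetical and are not part of dataeval.

```python
import math


def hill_number(p, q):
    """Diversity of order q (Hill number) for proportions p, with q != 1."""
    if q == 1:
        raise ValueError("undefined at q=1; use the Shannon limit instead")
    return sum(pi ** q for pi in p if pi > 0) ** (1.0 / (1.0 - q))


def shannon_entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


p = [0.25, 0.75]
inverse_simpson = hill_number(p, 2)              # q=2 gives 1 / sum(p_i^2)
near_limit = math.log(hill_number(p, 1 + 1e-6))  # q just above 1
entropy = shannon_entropy(p)
```

Here `near_limit` and `entropy` agree to several decimal places, while `inverse_simpson` is the unnormalized inverse Simpson index for the same proportions.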

Returns:

Diversity index for each factor in the metadata, and classwise diversity as an array of shape [n_class x n_factor]

Example

Compute the Simpson diversity index of metadata and class labels:

>>> div_simp = diversity(metadata, method="simpson")
>>> div_simp.diversity_index
array([0.6       , 0.80882353, 1.        , 0.8       ])
>>> div_simp.classwise
array([[0.5       , 0.8       , 0.8       ],
       [0.63043478, 0.97560976, 0.52830189]])

Compute the Shannon diversity index of metadata and class labels:

>>> div_shan = diversity(metadata, method="shannon")
>>> div_shan.diversity_index
array([0.81127812, 0.9426312 , 1.        , 0.91829583])
>>> div_shan.classwise
array([[0.68260619, 0.91829583, 0.91829583],
       [0.81443569, 0.99107606, 0.76420451]])
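For intuition about the Shannon values, here is a minimal sketch that assumes the index is the Shannon entropy divided by log(k), with k the number of observed categories. This normalization is an assumption, not something the page states, and the data behind the example above is not shown. The helper `shannon_diversity` is hypothetical.

```python
import math
from collections import Counter


def shannon_diversity(values):
    """Shannon entropy normalized by its maximum, log(k).

    Hypothetical helper for illustration; the normalization is an assumption.
    """
    counts = Counter(values)
    k = len(counts)  # number of observed categories
    if k <= 1:
        return 0.0  # a single category carries no diversity
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)


two_bins = shannon_diversity([0, 1, 1, 1])  # proportions 0.25 / 0.75
thirds = shannon_diversity([0, 1, 1])       # proportions 1/3 and 2/3
```

With these inputs, `two_bins` is about 0.8113 and `thirds` is about 0.9183, which match values that appear in the output above. This suggests, but does not confirm, that the library uses a similar normalization.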

See also

scipy.stats.entropy