dataeval.metrics.bias.diversity

dataeval.metrics.bias.diversity(metadata, method='simpson')

Compute diversity and classwise diversity for discrete/categorical variables and, through standard histogram binning, for continuous variables.

The method specified defines diversity as either the inverse Simpson diversity index, linearly rescaled to the unit interval, or the normalized form of the Shannon entropy.

diversity = 1 implies that samples are evenly distributed across a particular factor.

diversity = 0 implies that all samples belong to one category/bin.
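The two definitions above can be illustrated with a minimal, standard-library sketch. This is not the DataEval implementation. The helper names `simpson_diversity` and `shannon_diversity` are hypothetical, the rescaling `(1 / sum(p**2) - 1) / (k - 1)` is one common way to map the inverse Simpson index from [1, k] to the unit interval, and `k` is assumed here to be the number of observed categories.

```python
import math
from collections import Counter


def _proportions(values):
    """Relative frequency of each observed category (hypothetical helper)."""
    counts = Counter(values)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def simpson_diversity(values):
    """Inverse Simpson index, rescaled linearly from [1, k] to [0, 1]."""
    p = _proportions(values)
    k = len(p)
    if k <= 1:
        # A single category (or no data) is treated as zero diversity.
        return 0.0
    inv_simpson = 1.0 / sum(pi * pi for pi in p)
    return (inv_simpson - 1.0) / (k - 1)


def shannon_diversity(values):
    """Shannon entropy divided by its maximum value, log(k)."""
    p = _proportions(values)
    k = len(p)
    if k <= 1:
        return 0.0
    entropy = -sum(pi * math.log(pi) for pi in p)
    return entropy / math.log(k)
```

Under these assumptions, an evenly split factor such as `["M", "F"] * 50` scores 1.0 with either function, and a factor where every sample shares one value scores 0.0.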

Parameters:
metadata : Metadata

Preprocessed metadata

method : "simpson" or "shannon", default "simpson"

The methodology used for defining diversity

Returns:

Diversity index for each metadata factor, and classwise diversity of shape [n_class x n_factor]

Return type:

DiversityOutput

Note

  • The general diversity expression of order q is undefined for q=1, but it approaches the Shannon entropy in the limit q → 1.

  • If there is only one category, the diversity index takes a value of 0.
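This page does not define q. One plausible reading, stated here as an assumption, is that q is the order of the Rényi entropy (the logarithm of the Hill number), with p_i the proportion of samples in category i and k the number of categories. Under that reading, the limit in the first note and the link to the Simpson index can be written as:

```latex
H_q = \frac{1}{1-q} \ln \sum_{i=1}^{k} p_i^{q}, \qquad q \neq 1

\lim_{q \to 1} H_q = -\sum_{i=1}^{k} p_i \ln p_i

H_2 = -\ln \sum_{i=1}^{k} p_i^{2} = \ln \frac{1}{\sum_{i=1}^{k} p_i^{2}}
```

The q = 2 case is the logarithm of the inverse Simpson index, and the q → 1 limit is the Shannon entropy, which matches the two `method` options.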

Example

Compute the diversity index of metadata and class labels

>>> metadata = generate_random_metadata(
...     labels=["doctor", "artist", "teacher"],
...     factors={
...         "age": [25, 30, 35, 45],
...         "income": [50000, 65000, 80000],
...         "gender": ["M", "F"]},
...     length=100,
...     random_seed=175)
>>> div_simp = diversity(metadata, method="simpson")
>>> div_simp.diversity_index
array([0.938, 0.944, 0.888, 0.987])
>>> div_simp.classwise
array([[0.964, 0.858, 0.973],
       [0.747, 0.727, 0.997],
       [0.829, 0.915, 0.965]])

Compute the Shannon diversity index of metadata and class labels

>>> div_shan = diversity(metadata, method="shannon")
>>> div_shan.diversity_index
array([0.981, 0.983, 0.962, 0.995])
>>> div_shan.classwise
array([[0.99 , 0.948, 0.99 ],
       [0.921, 0.878, 0.999],
       [0.939, 0.972, 0.987]])

See also

scipy.stats.entropy