dataeval.metrics.bias.diversity
dataeval.metrics.bias.diversity(metadata, method='simpson')
Compute diversity and classwise diversity for discrete/categorical variables and, through standard histogram binning, for continuous variables.
The method specified defines diversity as the inverse Simpson diversity index linearly rescaled to the unit interval, or the normalized form of the Shannon entropy.
diversity = 1 implies that samples are evenly distributed across a particular factor.
diversity = 0 implies that all samples belong to one category/bin.
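The two definitions above can be illustrated with a minimal, standard-library sketch. It is not the dataeval implementation: the helper names (`category_proportions`, `simpson_diversity`, `shannon_diversity`) are hypothetical, and it assumes that the number of categories k is the number of distinct values actually observed, which may differ from how the library counts bins.

```python
import math
from collections import Counter


def category_proportions(values):
    """Return the fraction of samples falling in each observed category."""
    counts = Counter(values)
    total = sum(counts.values())
    return [c / total for c in counts.values()]


def simpson_diversity(values):
    """Inverse Simpson index, linearly rescaled from [1, k] to [0, 1]."""
    p = category_proportions(values)
    k = len(p)
    if k <= 1:
        # Zero or one category: no diversity by convention.
        return 0.0
    inverse_simpson = 1.0 / sum(pi * pi for pi in p)
    return (inverse_simpson - 1.0) / (k - 1)


def shannon_diversity(values):
    """Shannon entropy divided by its maximum value, log(k)."""
    p = category_proportions(values)
    k = len(p)
    if k <= 1:
        return 0.0
    entropy = -sum(pi * math.log(pi) for pi in p)
    return entropy / math.log(k)


even = ["M", "F"] * 50            # evenly split factor
skewed = ["M"] * 90 + ["F"] * 10  # mostly one category
single = ["M"] * 100              # only one category

for name, sample in [("even", even), ("skewed", skewed), ("single", single)]:
    print(name, simpson_diversity(sample), shannon_diversity(sample))
```

Under these assumptions, the evenly split sample scores 1 on both indices, the single-category sample scores 0, and the skewed sample falls in between.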
- Parameters:
- Returns:
Diversity index per column of self.data or each factor in self.names, and classwise diversity [n_class x n_factor]
- Return type:
Note
The general diversity expression of order q is undefined for q=1, but it approaches the Shannon entropy in the limit.
If there is only one category, the diversity index takes a value of 0.
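The note does not say which expression q belongs to. As a hedged illustration, the sketch below assumes it is the order of the Rényi entropy, log(sum(p_i^q)) / (1 - q), whose exponential at q=2 is the inverse Simpson index. The function names and the example proportions are hypothetical and are not part of dataeval.

```python
import math


def shannon_entropy(p):
    """Shannon entropy (natural log) of a list of proportions."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def renyi_entropy(p, q):
    """Renyi entropy of order q; the formula divides by (1 - q)."""
    if q == 1:
        raise ValueError("order q=1 is undefined; use the Shannon entropy limit")
    return math.log(sum(pi ** q for pi in p if pi > 0)) / (1.0 - q)


p = [0.5, 0.3, 0.2]  # illustrative proportions, not taken from the example data

# The gap to the Shannon entropy shrinks as q approaches 1.
for q in (1.1, 1.01, 1.001):
    gap = abs(renyi_entropy(p, q) - shannon_entropy(p))
    print(q, gap)

# At q=2 the exponential of the Renyi entropy equals the inverse Simpson index.
inverse_simpson = 1.0 / sum(pi * pi for pi in p)
print(math.exp(renyi_entropy(p, 2)), inverse_simpson)
```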
Example
Compute the diversity index of metadata and class labels
>>> metadata = generate_random_metadata(
...     labels=["doctor", "artist", "teacher"],
...     factors={
...         "age": [25, 30, 35, 45],
...         "income": [50000, 65000, 80000],
...         "gender": ["M", "F"]},
...     length=100,
...     random_seed=175)
>>> div_simp = diversity(metadata, method="simpson")
>>> div_simp.diversity_index
array([0.938, 0.944, 0.888, 0.987])
>>> div_simp.classwise
array([[0.964, 0.858, 0.973],
       [0.747, 0.727, 0.997],
       [0.829, 0.915, 0.965]])

Compute Shannon diversity index of metadata and class labels

>>> div_shan = diversity(metadata, method="shannon")
>>> div_shan.diversity_index
array([0.981, 0.983, 0.962, 0.995])
>>> div_shan.classwise
array([[0.99 , 0.948, 0.99 ],
       [0.921, 0.878, 0.999],
       [0.939, 0.972, 0.987]])

See also
scipy.stats.entropy