diversity
Diversity and classwise diversity measure the evenness or uniformity of metadata factors either over the entire dataset or by class. Diversity indices may indicate which intrinsic or extrinsic metadata factors are sampled disproportionately to others.
- dataeval.metrics.bias.diversity(class_labels: Sequence[int], metadata: list[dict], method: Literal['shannon', 'simpson'] = 'simpson') → DiversityOutput
Compute diversity and classwise diversity for discrete/categorical variables and, through standard histogram binning, for continuous variables.
We define diversity as a normalized form of the inverse Simpson diversity index.
diversity = 1 implies that samples are evenly distributed across a particular factor.
diversity = 0 implies that all samples belong to one category/bin.
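A minimal sketch of the two normalized indices, assuming the Simpson variant rescales the inverse Simpson index 1/Σp² from its range [1, N] to [0, 1] and the Shannon variant divides the entropy by log N, where N is the number of categories or bins. The function names `simpson_diversity` and `shannon_diversity` are hypothetical and are not part of dataeval; the library's own normalization may differ in detail.

```python
import numpy as np


def simpson_diversity(counts):
    """Normalized inverse Simpson index of per-bin counts (illustrative only)."""
    counts = np.asarray(counts, dtype=float)
    n_bins = counts.size
    if n_bins < 2:
        # Assumption: the (N - 1) normalization is undefined for a single bin.
        raise ValueError("need at least two categories/bins")
    p = counts / counts.sum()
    inv_simpson = 1.0 / np.sum(p**2)  # lies between 1 and n_bins
    return float((inv_simpson - 1.0) / (n_bins - 1.0))


def shannon_diversity(counts):
    """Shannon entropy of per-bin counts divided by log(N) (illustrative only)."""
    counts = np.asarray(counts, dtype=float)
    n_bins = counts.size
    if n_bins < 2:
        raise ValueError("need at least two categories/bins")
    p = counts / counts.sum()
    p = p[p > 0]  # empty bins contribute nothing to the entropy
    entropy = -np.sum(p * np.log(p))
    return float(entropy / np.log(n_bins))
```

With these definitions, evenly filled bins such as `[5, 5, 5, 5]` give a value of 1, and counts concentrated in a single bin such as `[10, 0, 0]` give a value of 0.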
- Parameters:
class_labels (Sequence[int]) – List of class labels for each image
metadata (List[Dict]) – List of metadata factors for each image
method (Literal["shannon", "simpson"], default "simpson") – Indicates which diversity index should be computed
Notes
For continuous variables, histogram bins are chosen automatically. See numpy.histogram for details.
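The binning step can be illustrated with numpy.histogram directly. This sketch assumes `bins="auto"`, which is one of several automatic rules numpy offers; the source does not say which rule dataeval uses. The sample data and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical continuous metadata factor (e.g. a brightness value per image).
rng = np.random.default_rng(0)
values = rng.normal(size=200)

# Assumption: bins="auto" stands in for whatever automatic rule the library uses.
counts, edges = np.histogram(values, bins="auto")

# Once binned, the factor is treated like a categorical one: each bin is a category.
p = counts / counts.sum()
n_bins = counts.size

# Normalized inverse Simpson index over the bins (illustrative normalization).
diversity_value = (1.0 / np.sum(p**2) - 1.0) / (n_bins - 1.0)
print(n_bins, diversity_value)
```

The resulting value depends on the binning rule, so a diversity score for a continuous factor is only comparable across datasets when the bins are chosen in the same way.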
The general expression for a diversity index of order q is undefined for q=1, but it approaches the Shannon entropy in the limit q → 1.
If there is only one category, the diversity index takes a value of 1 = 1/N = 1/1. Entropy will take a value of 0.
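The notes above can be checked numerically. The sketch below assumes that q refers to the order of the Hill number (Σ pᵢ^q)^(1/(1−q)), whose logarithm is the Rényi entropy; the source does not define q, so this reading is an assumption. The helper `hill_number` is hypothetical and not part of dataeval.

```python
import numpy as np


def hill_number(p, q):
    """Hill number of order q for a probability vector p (illustrative only)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(q, 1.0):
        # The formula divides by (1 - q), so q = 1 is handled by its limit:
        # the exponential of the Shannon entropy.
        return float(np.exp(-np.sum(p * np.log(p))))
    return float(np.sum(p**q) ** (1.0 / (1.0 - q)))


p = np.array([0.5, 0.3, 0.2])
shannon_entropy = -np.sum(p * np.log(p))

# q = 2 reproduces the inverse Simpson index 1 / sum(p**2).
inverse_simpson = hill_number(p, 2)

# Near q = 1 the logarithm of the Hill number approaches the Shannon entropy.
near_limit = np.log(hill_number(p, 1.0001))

# A single category gives an index of 1 and an entropy of 0.
single_category = hill_number([1.0], 2)
```

Under this reading, the inverse Simpson index corresponds to q = 2, and the Shannon option corresponds to the limiting case rather than to a direct evaluation at q = 1.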
- Returns:
Diversity index for each metadata factor, and classwise diversity with shape [n_class x n_factor]
- Return type:
DiversityOutput
Example
Compute Simpson diversity index of metadata and class labels
>>> div_simp = diversity(class_labels, metadata, method="simpson")
>>> div_simp.diversity_index
array([0.18103448, 0.18103448, 0.88636364])
>>> div_simp.classwise
array([[0.17241379, 0.39473684],
       [0.2       , 0.2       ]])
Compute Shannon diversity index of metadata and class labels
>>> div_shan = diversity(class_labels, metadata, method="shannon")
>>> div_shan.diversity_index
array([0.37955133, 0.37955133, 0.96748876])
>>> div_shan.classwise
array([[0.43156028, 0.83224889],
       [0.57938016, 0.57938016]])
See also
numpy.histogram