diversity

Diversity and classwise diversity measure the evenness or uniformity of metadata factors either over the entire dataset or by class. Diversity indices may indicate which intrinsic or extrinsic metadata factors are sampled disproportionately to others.

dataeval.metrics.bias.diversity(class_labels: ArrayLike, metadata: Mapping[str, ArrayLike], continuous_factor_bincounts: Mapping[str, int] | None = None, method: Literal['simpson', 'shannon'] = 'simpson') -> DiversityOutput

Compute diversity and classwise diversity for discrete/categorical variables and, through standard histogram binning, for continuous variables.

With the default method ("simpson"), we define diversity as a normalized form of the inverse Simpson diversity index.

  • diversity = 1 implies that samples are evenly distributed across a particular factor

  • diversity = 0 implies that all samples belong to one category/bin
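The two endpoints above can be checked with a minimal sketch. The normalizations used here, (D - 1) / (N - 1) for the inverse Simpson index D over N categories and H / log(N) for the Shannon entropy H, are assumptions chosen because they map an even distribution to 1 and a single occupied category to 0; they are not confirmed to be the library's exact formulas. The function names are hypothetical and not part of dataeval.

```python
import math


def inverse_simpson(counts):
    """Inverse Simpson index 1 / sum(p_i^2) from per-category counts."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts)


def simpson_diversity(counts):
    """Rescale the inverse Simpson index from [1, N] to [0, 1].

    Assumed normalization: (D - 1) / (N - 1), with N = len(counts).
    """
    n_bins = len(counts)
    if n_bins < 2:
        raise ValueError("normalization needs at least two categories/bins")
    return (inverse_simpson(counts) - 1.0) / (n_bins - 1)


def shannon_diversity(counts):
    """Shannon entropy divided by its maximum value log(N).

    Assumed normalization; empty categories contribute nothing to the entropy.
    """
    n_bins = len(counts)
    if n_bins < 2:
        raise ValueError("normalization needs at least two categories/bins")
    total = sum(counts)
    entropy = sum((c / total) * math.log(total / c) for c in counts if c > 0)
    return entropy / math.log(n_bins)


even = [5, 5, 5, 5]        # samples spread evenly over four categories
collapsed = [20, 0, 0, 0]  # every sample in the first category

print(simpson_diversity(even), simpson_diversity(collapsed))
print(shannon_diversity(even), shannon_diversity(collapsed))
```

Both functions take per-category counts rather than raw values, so that empty categories still count toward N.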

Parameters:
  • class_labels (ArrayLike) – List of class labels for each image

  • metadata (Mapping[str, ArrayLike]) – Dict mapping each metadata factor name to a list of that factor's values, one per image

  • continuous_factor_bincounts (Mapping[str, int] or None, default None) – The factors in metadata that have continuous values, each mapped to the number of bins to discretize its values into. All factors are treated as having discrete values unless they are specified as keys in this dictionary. Each key of this dictionary must occur as a key in metadata.

  • method ({"simpson", "shannon"}, default "simpson") – Indicates which diversity index should be computed
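As an illustration of how these parameters fit together, the sketch below builds hypothetical inputs and discretizes the one continuous factor with numpy.histogram. The factor names and values are invented for illustration, the helper bin_continuous is not part of dataeval, and equal-width binning is an assumption about how a bin count might be applied, not a statement of the library's internals.

```python
import numpy as np

# Hypothetical inputs: one entry per image.
class_labels = [0, 0, 1, 1, 1, 0]
metadata = {
    "time_of_day": ["day", "night", "day", "day", "night", "day"],  # discrete
    "altitude": [110.0, 250.0, 95.0, 400.0, 320.0, 180.0],          # continuous
}

# Only "altitude" is declared continuous; request 3 bins for it.
continuous_factor_bincounts = {"altitude": 3}

# Every key here must also be a key of metadata.
assert set(continuous_factor_bincounts) <= set(metadata)


def bin_continuous(values, n_bins):
    """Return per-bin counts and the bin index of each value (equal-width bins)."""
    values = np.asarray(values, dtype=float)
    counts, edges = np.histogram(values, bins=n_bins)
    # Digitize against the interior edges so the maximum lands in the last bin,
    # matching numpy.histogram's closed right-most bin.
    bin_ids = np.digitize(values, edges[1:-1])
    return counts, bin_ids


counts, bin_ids = bin_continuous(
    metadata["altitude"], continuous_factor_bincounts["altitude"]
)
print(counts)   # number of images per altitude bin
print(bin_ids)  # bin index assigned to each image
```

Once binned, a continuous factor can be treated like any other discrete factor when computing a diversity index.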

Note

  • For continuous variables, histogram bins are chosen automatically. See numpy.histogram for details.

  • The general diversity expression of order q is undefined for q=1, but it approaches the Shannon entropy in the limit.

  • If there is only one category, the diversity index takes a value of 1 = 1/N = 1/1. Entropy will take a value of 0.

Returns:

Diversity index for each factor, and classwise diversity with shape [n_class x n_factor]

Return type:

DiversityOutput

Example

Compute Simpson diversity index of metadata and class labels

>>> div_simp = diversity(class_labels, metadata, continuous_factor_bincounts, method="simpson")
>>> div_simp.diversity_index
array([0.72413793, 0.72413793, 0.88636364])
>>> div_simp.classwise
array([[0.68965517, 0.69230769],
       [0.8       , 1.        ]])

Compute Shannon diversity index of metadata and class labels

>>> div_shan = diversity(class_labels, metadata, continuous_factor_bincounts, method="shannon")
>>> div_shan.diversity_index
array([0.8812909 , 0.8812909 , 0.96748876])
>>> div_shan.classwise
array([[0.86312057, 0.91651644],
       [0.91829583, 1.        ]])

See also

numpy.histogram