dataeval.bias.Diversity

class dataeval.bias.Diversity(method='simpson', threshold=0.5)

Computes diversity and classwise diversity for discrete/categorical variables and, through standard histogram binning, for continuous variables.
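As a rough illustration of what histogram binning of a continuous factor can look like, the sketch below assigns each value to one of a fixed number of equal-width bins. The helper name bin_continuous, the equal-width strategy, and the default of four bins are assumptions made for this example; the binning that the library actually applies may differ.

```python
def bin_continuous(values, num_bins=4):
    """Map each continuous value to an equal-width bin index (hypothetical helper)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into a single bin.
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # The maximum value would land on the right edge; clamp it into the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


ages = [0.0, 1.0, 2.0, 3.0, 4.0]
bin_indices = bin_continuous(ages, num_bins=4)
```

Once a continuous factor has been reduced to bin indices like these, it can be treated the same way as a categorical factor when measuring diversity.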

The method parameter selects how diversity is defined: either the inverse Simpson diversity index linearly rescaled to the unit interval, or the normalized form of the Shannon entropy.

  • diversity = 1 implies that samples are evenly distributed across a particular factor.

  • diversity = 0 implies that all samples belong to one category/bin.
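The two definitions can be sketched in plain Python as follows. This is a minimal illustration, not the library's implementation: the function names are hypothetical, and the number of categories k is assumed to be the number of distinct values observed in the input.

```python
import math
from collections import Counter


def simpson_diversity(values):
    """Inverse Simpson index, linearly rescaled from [1, k] to [0, 1]."""
    counts = Counter(values)
    k = len(counts)  # assumption: k is the number of observed categories
    if k <= 1:
        return 0.0  # a single category is defined as zero diversity
    n = sum(counts.values())
    inverse_simpson = 1.0 / sum((c / n) ** 2 for c in counts.values())
    return (inverse_simpson - 1.0) / (k - 1)


def shannon_diversity(values):
    """Shannon entropy divided by its maximum value, log(k)."""
    counts = Counter(values)
    k = len(counts)  # assumption: k is the number of observed categories
    if k <= 1:
        return 0.0
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)


even = ["M", "F", "M", "F"]        # evenly distributed: both measures give 1
skewed = ["M"] * 9 + ["F"]         # mostly one category: both measures are low
single = ["M"] * 5                 # one category only: both measures give 0
```

Under these assumptions both measures reach 1 for an even split and 0 when every sample shares one category, matching the two limiting cases described above.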

Identifies factors with low diversity based on a threshold.

Parameters:
method : "simpson" or "shannon", default "simpson"

The methodology used for defining diversity

threshold : float, default 0.5

Threshold for identifying low diversity. Factors with diversity values at or below this threshold are flagged as having low diversity.

metadata

Preprocessed metadata from the last evaluate() call.

Type:

Metadata

method

The methodology used for defining diversity

Type:

Literal["simpson", "shannon"]

threshold

Threshold for identifying low diversity factors

Type:

float

Notes

  • The generalized diversity expression of order q is undefined for q=1, but it approaches the Shannon entropy in the limit.

  • If there is only one category, the diversity index takes a value of 0.

  • Factors with diversity values <= threshold represent low diversity and are flagged.
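The flagging rule in the last note can be sketched as a simple comparison. The helper name flag_low_diversity is hypothetical; the scores used here are the "doctor" rows from the classwise output in the evaluate() example.

```python
def flag_low_diversity(scores, threshold=0.5):
    """Return True for every factor whose diversity is at or below the threshold."""
    return {name: value <= threshold for name, value in scores.items()}


doctor_scores = {"age": 0.619268, "gender": 0.832507, "income": 0.269775}
flags = flag_low_diversity(doctor_scores, threshold=0.5)
# flags == {"age": False, "gender": False, "income": True}
```

Because the comparison is inclusive, a factor whose diversity equals the threshold exactly is also flagged.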

Examples

Initialize the Diversity class:

>>> diversity = Diversity()

Specifying custom method and threshold:

>>> diversity = Diversity(method="shannon", threshold=0.6)

See also

scipy.stats.entropy

evaluate(data)

Compute diversity and classwise diversity for the dataset.

Parameters:
data : AnnotatedDataset[Any] or Metadata

Either an annotated dataset (which will be converted to Metadata) or preprocessed Metadata directly.

Returns:

Two DataFrames containing diversity scores and low diversity flags:

  • factors: Factor-level diversity scores

  • classwise: Class-factor-level diversity scores

Return type:

DiversityOutput
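To illustrate the difference between factor-level and class-factor-level scores, the sketch below groups the values of one factor by class label and scores each group separately. It is an assumption-laden illustration rather than the library's code: the function names are hypothetical, the rescaled inverse Simpson index is used as the measure, and each class is normalized by the number of categories observed within that class, which may not match how evaluate() normalizes.

```python
from collections import Counter, defaultdict


def rescaled_inverse_simpson(values):
    """Inverse Simpson index rescaled to [0, 1]; 0 when only one category occurs."""
    counts = Counter(values)
    k = len(counts)  # assumption: categories observed within this group
    if k <= 1:
        return 0.0
    n = sum(counts.values())
    inverse_simpson = 1.0 / sum((c / n) ** 2 for c in counts.values())
    return (inverse_simpson - 1.0) / (k - 1)


def classwise_diversity(labels, factor_values):
    """Score one factor separately within each class label."""
    by_class = defaultdict(list)
    for label, value in zip(labels, factor_values):
        by_class[label].append(value)
    return {label: rescaled_inverse_simpson(vals) for label, vals in by_class.items()}


labels = ["doctor", "doctor", "artist", "artist"]
gender = ["M", "F", "M", "M"]
per_class = classwise_diversity(labels, gender)
overall = rescaled_inverse_simpson(gender)
```

In this toy input the factor is evenly split among doctors but constant among artists, so the per-class scores diverge even though a single factor-level score would hide that difference.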

Example

Compute the diversity index of metadata and class labels:

>>> metadata = generate_random_metadata(
...     labels=["doctor", "artist", "teacher"],
...     factors={"age": [25, 30, 35, 45], "income": [50000, 65000, 80000], "gender": ["M", "F"]},
...     length=100,
...     random_seed=175,
... )
>>> diversity = Diversity(method="simpson", threshold=0.5)
>>> result = diversity.evaluate(metadata)
>>> result.factors
shape: (3, 3)
┌─────────────┬─────────────────┬──────────────────┐
│ factor_name ┆ diversity_value ┆ is_low_diversity │
│ ---         ┆ ---             ┆ ---              │
│ cat         ┆ f64             ┆ bool             │
╞═════════════╪═════════════════╪══════════════════╡
│ age         ┆ 0.907669        ┆ false            │
│ gender      ┆ 0.992826        ┆ false            │
│ income      ┆ 0.954334        ┆ false            │
└─────────────┴─────────────────┴──────────────────┘
>>> result.classwise
shape: (9, 4)
┌────────────┬─────────────┬─────────────────┬──────────────────┐
│ class_name ┆ factor_name ┆ diversity_value ┆ is_low_diversity │
│ ---        ┆ ---         ┆ ---             ┆ ---              │
│ cat        ┆ cat         ┆ f64             ┆ bool             │
╞════════════╪═════════════╪═════════════════╪══════════════════╡
│ doctor     ┆ age         ┆ 0.619268        ┆ false            │
│ doctor     ┆ gender      ┆ 0.832507        ┆ false            │
│ doctor     ┆ income      ┆ 0.269775        ┆ true             │
│ artist     ┆ age         ┆ 0.556777        ┆ false            │
│ artist     ┆ gender      ┆ 0.715294        ┆ false            │
│ artist     ┆ income      ┆ 0.334096        ┆ true             │
│ teacher    ┆ age         ┆ 0.477477        ┆ true             │
│ teacher    ┆ gender      ┆ 0.86722         ┆ false            │
│ teacher    ┆ income      ┆ 0.703209        ┆ false            │
└────────────┴─────────────┴─────────────────┴──────────────────┘