dataeval.bias.Diversity

class dataeval.bias.Diversity(method=None, threshold=None, config=None)

Computes diversity and classwise diversity for discrete/categorical variables and, through standard histogram binning, for continuous variables.

Depending on the method specified, diversity is defined either as the inverse Simpson diversity index linearly rescaled to the unit interval [0, 1] or as the normalized form of the Shannon entropy.

  • diversity = 1 implies that samples are evenly distributed across a particular factor.

  • diversity = 0 implies that all samples belong to one category/bin.
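The two definitions above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function names are hypothetical, and it assumes the number of categories K is the number of distinct values observed, which this page does not confirm.

```python
import math
from collections import Counter


def simpson_diversity(values):
    """Inverse Simpson index, linearly rescaled from [1, K] to [0, 1].

    Assumption: K is the number of distinct observed values.
    """
    counts = Counter(values)
    k = len(counts)
    if k <= 1:
        # A single category is defined as zero diversity.
        return 0.0
    n = sum(counts.values())
    inverse_simpson = 1.0 / sum((c / n) ** 2 for c in counts.values())
    return (inverse_simpson - 1.0) / (k - 1)


def shannon_diversity(values):
    """Shannon entropy divided by its maximum value, log(K)."""
    counts = Counter(values)
    k = len(counts)
    if k <= 1:
        return 0.0
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)


# Hypothetical factor values for illustration only.
even = ["day", "night", "dusk", "dawn"] * 5
skewed = ["day"] * 17 + ["night", "dusk", "dawn"]
single = ["day"] * 20

for name, sample in [("even", even), ("skewed", skewed), ("single", single)]:
    print(name, simpson_diversity(sample), shannon_diversity(sample))
```

An evenly distributed factor scores 1 under both definitions, a factor with a single category scores 0, and a skewed factor falls in between.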

Identifies factors with low diversity based on a threshold.

Parameters:
method : "simpson" or "shannon", default "simpson"

The methodology used for defining diversity. When “simpson” is used, the index is linearly rescaled so that 1.0 represents maximum diversity (even distribution) and 0.0 represents minimum diversity (all samples in one bin).

threshold : float, default 0.5

Threshold for identifying low diversity. Factors with diversity values at or below this threshold are flagged as having low diversity.
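The flagging rule is inclusive: a value exactly equal to the threshold is flagged. A minimal sketch of that comparison, using a hypothetical helper and the "person" scores from the classwise example output on this page plus one made-up boundary value:

```python
def flag_low_diversity(scores, threshold=0.5):
    """Return True for every factor whose diversity is at or below the threshold."""
    return {name: value <= threshold for name, value in scores.items()}


scores = {
    "angle": 0.888889,
    "id": 0.293564,
    "location": 0.836257,
    "boundary": 0.5,  # hypothetical factor sitting exactly on the threshold
}
flags = flag_low_diversity(scores)
print(flags)
```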

metadata

Preprocessed metadata from the last evaluate() call.

Type:

MetadataLike

method

The methodology used for defining diversity.

Type:

Literal[“simpson”, “shannon”]

threshold

Threshold for identifying low diversity factors.

Type:

float

See also

scipy.stats.entropy

Notes

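  • Here q denotes the order of the generalized diversity index (the Simpson index corresponds to q=2). The expression is undefined for q=1, but it approaches the Shannon entropy in the limit.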

  • If there is only one category, the diversity index takes a value of 0.

  • Factors with diversity values <= threshold represent low diversity and are flagged.
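The first note does not say which expression q parameterizes. One plausible reading, assumed here and not confirmed by this page, is the order-q generalized (Rényi) entropy, whose exponential at q=2 equals the inverse Simpson index. Under that assumption, the limiting behavior can be checked numerically:

```python
import math


def renyi_entropy(p, q):
    """Renyi entropy of order q for a probability vector p (undefined at q == 1)."""
    return math.log(sum(pi ** q for pi in p if pi > 0)) / (1.0 - q)


def shannon_entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


p = [0.5, 0.3, 0.2]  # hypothetical category proportions
h = shannon_entropy(p)

# The gap to the Shannon entropy shrinks as q approaches 1 from above.
gaps = [abs(renyi_entropy(p, q) - h) for q in (1.1, 1.01, 1.001)]
print(gaps)

# At q = 2 the exponential of the Renyi entropy is the inverse Simpson index.
inverse_simpson = 1.0 / sum(pi ** 2 for pi in p)
print(math.exp(renyi_entropy(p, 2)), inverse_simpson)
```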

Examples

Initialize the Diversity class:

>>> diversity = Diversity()

Specifying custom method and threshold:

>>> diversity = Diversity(method="shannon", threshold=0.6)

Using configuration:

>>> config = Diversity.Config(method="shannon", threshold=0.6)
>>> diversity = Diversity(config=config)

evaluate(data)

Compute diversity and classwise diversity for the dataset.

Parameters:
data : AnnotatedDataset[Any] or MetadataLike

Either an annotated dataset (which will be converted to Metadata) or any object implementing the MetadataLike protocol.

Returns:

Two DataFrames containing diversity scores and low diversity flags:

  • factors: Factor-level diversity scores

  • classwise: Class-factor-level diversity scores

Return type:

DiversityOutput

Example

Compute the diversity index of metadata and class labels:

>>> from dataeval import Metadata
>>> metadata = Metadata(dataset)
>>> diversity = Diversity(method="simpson", threshold=0.5)
>>> result = diversity.evaluate(metadata)
>>> result.factors
shape: (6, 3)
┌─────────────┬─────────────────┬──────────────────┐
│ factor_name ┆ diversity_value ┆ is_low_diversity │
│ ---         ┆ ---             ┆ ---              │
│ cat         ┆ f64             ┆ bool             │
╞═════════════╪═════════════════╪══════════════════╡
│ class_label ┆ 0.983706        ┆ false            │
│ angle       ┆ 0.99896         ┆ false            │
│ id          ┆ 0.832298        ┆ false            │
│ location    ┆ 0.949711        ┆ false            │
│ time_of_day ┆ 0.916342        ┆ false            │
│ weather     ┆ 0.992751        ┆ false            │
└─────────────┴─────────────────┴──────────────────┘
>>> result.classwise
shape: (20, 4)
┌────────────┬─────────────┬─────────────────┬──────────────────┐
│ class_name ┆ factor_name ┆ diversity_value ┆ is_low_diversity │
│ ---        ┆ ---         ┆ ---             ┆ ---              │
│ cat        ┆ cat         ┆ f64             ┆ bool             │
╞════════════╪═════════════╪═════════════════╪══════════════════╡
│ person     ┆ angle       ┆ 0.888889        ┆ false            │
│ person     ┆ id          ┆ 0.293564        ┆ true             │
│ person     ┆ location    ┆ 0.836257        ┆ false            │
│ person     ┆ time_of_day ┆ 0.924528        ┆ false            │
│ person     ┆ weather     ┆ 0.833333        ┆ false            │
│ …          ┆ …           ┆ …               ┆ …                │
│ plane      ┆ angle       ┆ 0.987755        ┆ false            │
│ plane      ┆ id          ┆ 0.430427        ┆ true             │
│ plane      ┆ location    ┆ 0.938918        ┆ false            │
│ plane      ┆ time_of_day ┆ 0.84058         ┆ false            │
│ plane      ┆ weather     ┆ 0.987755        ┆ false            │
└────────────┴─────────────┴─────────────────┴──────────────────┘

Classes

Config

Configuration for Diversity evaluator.