dataeval.bias.Diversity¶
- class dataeval.bias.Diversity(method=None, threshold=None, config=None)¶
Computes diversity and classwise diversity for discrete/categorical variables and, through standard histogram binning, for continuous variables.
Depending on the method specified, diversity is defined either as the inverse Simpson diversity index linearly rescaled to the unit interval [0, 1], or as the normalized form of the Shannon entropy.
diversity = 1 implies that samples are evenly distributed across a particular factor.
diversity = 0 implies that all samples belong to one category/bin.
Identifies factors with low diversity based on a threshold.
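The two definitions above can be illustrated with a minimal, standard-library sketch. This is not the library's implementation: the helper names (`equal_width_bins`, `simpson_diversity`, `shannon_diversity`) are hypothetical, equal-width binning is assumed as one form of "standard histogram binning", and the rescaling uses the number of observed categories, which may differ from how DataEval counts bins.

```python
import math
from collections import Counter


def equal_width_bins(values, num_bins):
    """Assign each continuous value to one of num_bins equal-width bins (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: everything falls into a single bin.
        return [0] * len(values)
    width = (hi - lo) / num_bins
    # Clamp so the maximum value lands in the last bin rather than one past it.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


def simpson_diversity(values):
    """Inverse Simpson index, linearly rescaled from [1, k] to [0, 1]."""
    counts = Counter(values)
    k = len(counts)  # assumption: k is the number of observed categories
    if k <= 1:
        return 0.0  # a single category has zero diversity
    n = sum(counts.values())
    inverse_simpson = 1.0 / sum((c / n) ** 2 for c in counts.values())
    return (inverse_simpson - 1.0) / (k - 1)


def shannon_diversity(values):
    """Shannon entropy divided by its maximum value, log(k)."""
    counts = Counter(values)
    k = len(counts)
    if k <= 1:
        return 0.0
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)


even = ["day", "night", "day", "night"]
skewed = ["day"] * 9 + ["night"]
print(simpson_diversity(even), shannon_diversity(even))
print(simpson_diversity(skewed), shannon_diversity(skewed))
print(simpson_diversity(equal_width_bins([0.0, 1.0, 2.0, 3.0], 2)))
```

An evenly split factor scores at or near 1 under both measures, while a factor dominated by one category scores close to 0.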
- Parameters:¶
- method : "simpson" or "shannon", default "simpson"¶
The methodology used for defining diversity. When “simpson” is used, the index is linearly rescaled so that 1.0 represents maximum diversity (even distribution) and 0.0 represents minimum diversity (all samples in one bin).
- threshold : float, default 0.5¶
Threshold for identifying low diversity. Factors with diversity values at or below this threshold are flagged as having low diversity.
Notes
The general order-q diversity expression is undefined for q=1, but it approaches the Shannon entropy in the limit q → 1.
If there is only one category, the diversity index takes a value of 0.
Factors with diversity values <= threshold represent low diversity and are flagged.
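The flagging rule can be sketched in a few lines. The helper `flag_low_diversity` is hypothetical and only mirrors the documented `<=` comparison. The scores are the "person" rows from the classwise example output on this page.

```python
def flag_low_diversity(scores, threshold=0.5):
    """Map each factor name to True when its score is at or below the threshold."""
    return {name: value <= threshold for name, value in scores.items()}


# Classwise scores for the "person" class, copied from the example output.
person_scores = {
    "angle": 0.888889,
    "id": 0.293564,
    "location": 0.836257,
    "time_of_day": 0.924528,
    "weather": 0.833333,
}

flags = flag_low_diversity(person_scores, threshold=0.5)
low = [name for name, is_low in flags.items() if is_low]
print(low)  # ['id']
```

Because the comparison is inclusive, a score exactly equal to the threshold is also flagged.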
Examples
Initialize the Diversity class:
>>> from dataeval.bias import Diversity
>>> diversity = Diversity()
Specifying custom method and threshold:
>>> diversity = Diversity(method="shannon", threshold=0.6)
Using configuration:
>>> config = Diversity.Config(method="shannon", threshold=0.6)
>>> diversity = Diversity(config=config)
See also
scipy.stats.entropy
- evaluate(data)¶
Compute diversity and classwise diversity for the dataset.
- Parameters:¶
- data : AnnotatedDataset[Any] or MetadataLike¶
Either an annotated dataset (which will be converted to Metadata) or any object implementing the MetadataLike protocol.
- Returns:¶
Two DataFrames containing diversity scores and low diversity flags:
- factors: Factor-level diversity scores
- classwise: Class-factor-level diversity scores
- Return type:¶
Example
Compute the diversity index of metadata and class labels:
>>> from dataeval import Metadata
>>> metadata = Metadata(dataset)
>>> diversity = Diversity(method="simpson", threshold=0.5)
>>> result = diversity.evaluate(metadata)
>>> result.factors
shape: (6, 3)
┌─────────────┬─────────────────┬──────────────────┐
│ factor_name ┆ diversity_value ┆ is_low_diversity │
│ ---         ┆ ---             ┆ ---              │
│ cat         ┆ f64             ┆ bool             │
╞═════════════╪═════════════════╪══════════════════╡
│ class_label ┆ 0.983706        ┆ false            │
│ angle       ┆ 0.99896         ┆ false            │
│ id          ┆ 0.832298        ┆ false            │
│ location    ┆ 0.949711        ┆ false            │
│ time_of_day ┆ 0.916342        ┆ false            │
│ weather     ┆ 0.992751        ┆ false            │
└─────────────┴─────────────────┴──────────────────┘
>>> result.classwise
shape: (20, 4)
┌────────────┬─────────────┬─────────────────┬──────────────────┐
│ class_name ┆ factor_name ┆ diversity_value ┆ is_low_diversity │
│ ---        ┆ ---         ┆ ---             ┆ ---              │
│ cat        ┆ cat         ┆ f64             ┆ bool             │
╞════════════╪═════════════╪═════════════════╪══════════════════╡
│ person     ┆ angle       ┆ 0.888889        ┆ false            │
│ person     ┆ id          ┆ 0.293564        ┆ true             │
│ person     ┆ location    ┆ 0.836257        ┆ false            │
│ person     ┆ time_of_day ┆ 0.924528        ┆ false            │
│ person     ┆ weather     ┆ 0.833333        ┆ false            │
│ …          ┆ …           ┆ …               ┆ …                │
│ plane      ┆ angle       ┆ 0.987755        ┆ false            │
│ plane      ┆ id          ┆ 0.430427        ┆ true             │
│ plane      ┆ location    ┆ 0.938918        ┆ false            │
│ plane      ┆ time_of_day ┆ 0.84058         ┆ false            │
│ plane      ┆ weather     ┆ 0.987755        ┆ false            │
└────────────┴─────────────┴─────────────────┴──────────────────┘
Classes¶
Configuration for Diversity evaluator.