dataeval.scope.Representation

class dataeval.scope.Representation(ontology, *, expected=None, config=None)

Evaluate a dataset’s coverage of an ontology and prioritize what to collect.

Resolves the dataset’s class labels against the ontology, compares the observed distribution to an expected one, and returns a RepresentationOutput worklist of the leaf species to acquire or augment. The default expectation is a uniform distribution over leaf species. Pass expected to assert a minimum share (a fraction of the whole dataset) for specific classes you know to be rare; that share replaces the uniform target for those classes and is also validated as an assertion.

Parameters:
ontology : Ontology

Ontology whose leaf species define the label space to cover.

expected : Mapping[str, float] or None, default None

Maps each class name to its minimum expected share of the dataset (a fraction in [0, 1]). Named classes use this floor as their target in place of the uniform share and are validated in RepresentationOutput.violations; unnamed classes keep the uniform target. None means a uniform expectation for every leaf.

config : Representation.Config or None, default None

Optional configuration object; parameters passed directly to __init__ override its values.

See also

dataeval.core.label_coverage

The observation-only coverage facts this builds on.

dataeval.core.label_reconciliation

Resolve labels against an ontology.

Notes

Targets are rounded to the nearest whole label. A class named in expected that does not resolve to exactly one concept is ignored; resolve such names upstream before passing them in.
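The target and floor rules described above can be illustrated with a small standalone sketch. It is not DataEval's implementation: the helpers sketch_targets and sketch_violations, the label counts, and the reading of shortfall as floor minus actual share are all assumptions made for illustration. Python's built-in round() rounds exact ties to the nearest even integer, which may differ from the library's rounding on ties.

```python
from typing import Mapping, Optional


def sketch_targets(
    counts: Mapping[str, int],
    expected: Optional[Mapping[str, float]] = None,
) -> dict:
    """Hypothetical per-leaf targets: floor share if named, else uniform share."""
    expected = expected or {}
    total = sum(counts.values())
    uniform_share = 1.0 / len(counts)  # assumes every key in counts is a leaf
    targets = {}
    for label in counts:
        share = expected.get(label, uniform_share)
        # Rounded to a whole number of labels (ties go to the even integer).
        targets[label] = round(share * total)
    return targets


def sketch_violations(
    counts: Mapping[str, int],
    expected: Mapping[str, float],
) -> list:
    """Hypothetical floor check: rows for named classes below their floor."""
    total = sum(counts.values())  # assumes a non-empty dataset
    rows = []
    for label, floor in expected.items():
        actual = counts.get(label, 0) / total
        if actual < floor:
            rows.append(
                {"label": label, "floor": floor, "actual": actual,
                 "shortfall": floor - actual}  # assumed meaning of shortfall
            )
    return rows


# Made-up counts for the three leaves of the ontology in the example below.
counts = {"cat": 60, "dog": 38, "owl": 2}
targets = sketch_targets(counts, expected={"owl": 0.05})
violations = sketch_violations(counts, {"owl": 0.05})
print(targets)  # {'cat': 33, 'dog': 33, 'owl': 5}
```

With 100 labels in total, the unnamed leaves keep the uniform target of 33, while owl is sized to its 5% floor (5 labels) and is flagged because its actual share is 2%.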

Examples

>>> from dataeval import Ontology
>>> from dataeval.scope import Representation
>>> ontology = Ontology.from_hierarchy({"animal": {"mammal": ["cat", "dog"], "bird": ["owl"]}})
>>> result = Representation(ontology).evaluate(dataset)
>>> result.columns
['concept', 'label', 'parent', 'action', 'count', 'target', 'deficit']

Assert that a known-rare class need only make up 5% of the dataset:

>>> result = Representation(ontology, expected={"owl": 0.05}).evaluate(dataset)
>>> result.violations.columns
['concept', 'label', 'floor', 'actual', 'shortfall']

evaluate(data)

Evaluate a dataset’s coverage of the ontology.

Parameters:
data : AnnotatedDataset or Metadata

The dataset (or its Metadata) to evaluate. Class labels and the index2label mapping are read from it; raw label counts are derived via label_stats().

Returns:

The collection worklist (acquire / augment rows) with leaf_coverage, total_deficit, violations, and dark_branches.

Return type:

RepresentationOutput

Examples

>>> ontology = Ontology.from_hierarchy({
...     "vehicle": {"land": ["car", "bike"], "water": ["boat"], "air": ["plane"]}
... })
>>> evaluator = Representation(ontology)
>>> result = evaluator.evaluate(dataset)
>>> result.data()
shape: (2, 7)
┌─────────┬───────┬────────┬─────────┬───────┬────────┬─────────┐
│ concept ┆ label ┆ parent ┆ action  ┆ count ┆ target ┆ deficit │
│ ---     ┆ ---   ┆ ---    ┆ ---     ┆ ---   ┆ ---    ┆ ---     │
│ str     ┆ str   ┆ str    ┆ str     ┆ i64   ┆ i64    ┆ i64     │
╞═════════╪═══════╪════════╪═════════╪═══════╪════════╪═════════╡
│ bike    ┆ bike  ┆ land   ┆ acquire ┆ 0     ┆ 23     ┆ 23      │
│ boat    ┆ boat  ┆ water  ┆ augment ┆ 22    ┆ 23     ┆ 1       │
└─────────┴───────┴────────┴─────────┴───────┴────────┴─────────┘
>>> result.total_deficit
24
>>> result.leaf_coverage
0.75
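The numbers in the example above can be reproduced by hand under a few assumptions. In this sketch, the counts for car and plane are made up (the example does not show them), and the rules used (acquire for a leaf with no labels, augment for a leaf below target, leaf_coverage as the fraction of leaves with at least one label) are inferred from the output rather than taken from the library's source. The helper sketch_worklist is hypothetical.

```python
def sketch_worklist(counts):
    """Hypothetical worklist under a uniform expectation over leaves."""
    total = sum(counts.values())
    target = round(total / len(counts))  # uniform target, whole labels
    rows = []
    for label, count in counts.items():
        deficit = target - count
        if deficit > 0:  # only under-represented leaves appear in the worklist
            rows.append(
                {
                    "label": label,
                    # Assumed rule: nothing observed -> acquire, else augment.
                    "action": "acquire" if count == 0 else "augment",
                    "count": count,
                    "target": target,
                    "deficit": deficit,
                }
            )
    return rows


# Assumed counts: only bike=0 and boat=22 come from the example output.
counts = {"car": 35, "bike": 0, "boat": 22, "plane": 35}
rows = sketch_worklist(counts)
total_deficit = sum(row["deficit"] for row in rows)
leaf_coverage = sum(c > 0 for c in counts.values()) / len(counts)

print([(r["label"], r["action"], r["deficit"]) for r in rows])
# [('bike', 'acquire', 23), ('boat', 'augment', 1)]
print(total_deficit, leaf_coverage)  # 24 0.75
```

With 92 labels over four leaves, the uniform target is 23 per leaf, which gives the two worklist rows, the total deficit of 24, and a leaf coverage of 3 out of 4.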

Classes

Config

Configuration for the Representation evaluator.