dataeval.scope.Coverage

class dataeval.scope.Coverage(extractor=None, *, batch_size=None, method=None, num_observations=None, percent=None, min_class_samples=None, isotropy_min_samples=None, near_duplicate_factor=None, config=None)

Evaluate a dataset’s embedding-space coverage and per-class variety.

Computes the global coverage radius once over the full set of image embeddings and reports each class’s share of the globally under-sampled samples. It then describes each class with three complementary, class-local variety signals that label counts cannot give:

  • dispersion — magnitude of spread (mean distance to centroid, relative to a typical class); low means clustered even if numerous.

  • isotropy — shape of that spread (in how many independent directions the class varies); low means it varies along few axes even when dispersion is high.

  • near_duplicate_fraction — redundancy (share of the class in unusually tight nearest-neighbor pairs); high means repeated / near-identical samples.
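The difference between dispersion and isotropy can be illustrated with a small, dependency-free sketch. It is not the library's implementation: the participation ratio of the covariance spectrum is swapped in as a stand-in effective-dimension measure, it is normalized by the embedding dimension rather than by the spanned subspace, and both helper names are hypothetical.

```python
import math
from statistics import mean


def mean_centroid_distance(points):
    """Raw spread: mean Euclidean distance from each point to the centroid."""
    dim = len(points[0])
    centroid = [mean(p[d] for p in points) for d in range(dim)]
    return mean(math.dist(p, centroid) for p in points)


def participation_ratio(points):
    """Effective dimension: (sum of eigenvalues)^2 / (sum of squared eigenvalues).

    For a symmetric covariance matrix C this equals trace(C)^2 / ||C||_F^2,
    so no eigendecomposition is needed.
    """
    dim = len(points[0])
    n = len(points)
    mu = [mean(p[d] for p in points) for d in range(dim)]
    cov = [
        [sum((p[a] - mu[a]) * (p[b] - mu[b]) for p in points) / (n - 1)
         for b in range(dim)]
        for a in range(dim)
    ]
    trace = sum(cov[i][i] for i in range(dim))
    frobenius_sq = sum(v * v for row in cov for v in row)
    return trace * trace / frobenius_sq


# Toy 2-D "embeddings": one class spread evenly, one stretched along a line.
round_cloud = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
line_cloud = [(1.0, 1.0), (-1.0, -1.0), (2.0, 2.0), (-2.0, -2.0)]

round_spread = mean_centroid_distance(round_cloud)
line_spread = mean_centroid_distance(line_cloud)

# Assumption: normalize by the embedding dimension (2 here).
round_isotropy = participation_ratio(round_cloud) / 2  # close to 1.0
line_isotropy = participation_ratio(line_cloud) / 2    # close to 0.5
```

The line-shaped class spreads further from its centroid than the round one, yet its isotropy is half as large, which is the "varies along few axes even when dispersion is high" case.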

Parameters:
extractor : FeatureExtractor or None, default None

Feature extractor used to compute embeddings from a dataset. Optional only when pre-computed embeddings are passed to evaluate().

batch_size : int or None, default None

Batch size for embedding computation. When None, uses the global batch size.

method : {"naive", "adaptive"}, default "adaptive"

Coverage radius method: a fixed analytic radius (naive) or a data-adaptive cutoff that flags the percent fraction of most sparsely neighbored samples (adaptive).

num_observations : int, default 20

Neighbors required for a sample to be considered covered (20-50 is typical).

percent : float, default 0.01

Fraction of samples to flag as uncovered (adaptive method only).

min_class_samples : int, default 20

A class needs at least this many samples for its per-class signals to be assessed; smaller classes are reported with assessable=False and null dispersion / isotropy / near_duplicate_fraction.

isotropy_min_samples : int or None, default None

A class needs at least this many samples for its isotropy to be reported (the effective-dimension estimate is degenerate when samples do not exceed dimensions). When None, defaults to one more than the embedding dimensionality.

near_duplicate_factor : float, default 0.5

A nearest-neighbor pair counts as a near-duplicate when its distance is below this fraction of the class’s median nearest-neighbor distance (scale-free).

config : Coverage.Config or None, default None

Optional configuration object; parameters passed directly to __init__ override its values.
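How method="adaptive", num_observations, and percent interact can be sketched in plain Python. The sketch assumes, without confirmation from this page, that a sample's critical-value radius is its distance to its num_observations-th nearest neighbor and that the percent fraction of samples with the largest radii is flagged. adaptive_uncovered is a hypothetical helper, not dataeval.core.coverage_adaptive.

```python
import math


def adaptive_uncovered(points, num_observations, percent):
    """Return (uncovered indices, per-sample radii) for a small point set.

    Assumed reading of the adaptive method:
    radius[i] = distance from sample i to its num_observations-th neighbor,
    and the `percent` fraction of samples with the largest radii is flagged.
    Requires num_observations < len(points).
    """
    radii = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        radii.append(dists[num_observations - 1])

    n_flag = int(len(points) * percent)
    # Largest radius first: the most sparsely neighbored samples.
    order = sorted(range(len(points)), key=lambda i: radii[i], reverse=True)
    return order[:n_flag], radii


# A tight 3x3 grid plus one isolated point (index 9).
points = [(x * 0.1, y * 0.1) for x in range(3) for y in range(3)]
points.append((5.0, 5.0))

uncovered, radii = adaptive_uncovered(points, num_observations=2, percent=0.1)
```

With ten samples and percent=0.1, exactly one sample is flagged, and it is the isolated point. Raising num_observations makes the radius less sensitive to a single close neighbor, which is presumably why values in the 20-50 range are suggested for real datasets.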

See also

dataeval.scope.Representation

The label-space counterpart (ontology coverage).

dataeval.scope.Prioritize

Rank individual samples for labeling in embedding space.

dataeval.core.coverage_adaptive

The underlying sample-level coverage computation.

Notes

dispersion is the class’s mean distance-to-centroid divided by the median of that same measure across assessable classes: ~1 is a typical class, < 1 means the class is unusually clustered (low variety / near-duplicate), > 1 means it is more spread out than its peers. Normalizing by the peer median (rather than the global spread) keeps the scale meaningful even when classes are well-separated in embedding space.

isotropy is the class’s effective dimensionality relative to the subspace it spans (via completeness()): ~1 means it varies evenly in every direction it occupies, low means its variation collapses onto a few axes. It is orthogonal to dispersion, which only measures how far the class spreads.

near_duplicate_fraction is the share of within-class nearest-neighbor pairs closer than near_duplicate_factor × the class median, surfacing repeated / near-identical samples that inflate counts without adding variety.

Embeddings are auto-rescaled to the unit interval for the coverage computation.

This evaluator assumes one embedding per label (image classification). For object detection, wrap the dataset with DetectionCrops to present its boxes as an image-classification dataset (one crop per detection, aligned 1:1 with the labels) and evaluate that, or supply detection-level embeddings you have computed yourself.
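The near_duplicate_fraction definition above can be sketched for a single class in a few lines of plain Python. This is an illustrative reading of the description, not the library's code, and the function name is hypothetical.

```python
import math
from statistics import median


def near_duplicate_fraction(points, factor=0.5):
    """Share of samples whose nearest-neighbor distance is unusually small.

    A sample counts as a near-duplicate when its nearest-neighbor distance
    is below `factor` times the class's median nearest-neighbor distance.
    """
    nearest = [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    threshold = factor * median(nearest)
    return sum(d < threshold for d in nearest) / len(nearest)


# Five evenly spaced samples plus one near-copy of the last sample.
class_points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0),
                (4.0, 0.0), (4.01, 0.0)]

fraction = near_duplicate_fraction(class_points, factor=0.5)
```

Because the threshold is a fraction of the class's own median nearest-neighbor distance, rescaling every coordinate by a constant leaves the result unchanged, which is what makes the measure scale-free.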

Examples

>>> from dataeval.scope import Coverage
>>> evaluator = Coverage(extractor, num_observations=20, min_class_samples=20)
>>> result = evaluator.evaluate(dataset)
>>> result.data()  # per-class breakdown, lowest-dispersion classes first
>>> result.uncovered_indices  # individual samples in sparse regions

Pass pre-computed embeddings to skip extraction:

>>> result = evaluator.evaluate(dataset, embeddings=embeddings)

The per-class dispersion column is the signal raw label counts cannot give: two classes with identical counts can differ sharply when one’s examples are near-duplicates (dispersion well below 1).
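A toy calculation shows why classes with identical counts can receive very different dispersion values. The sketch follows the definition given in the Notes (mean distance-to-centroid divided by the median of that measure across classes); the helper names and the toy data are hypothetical, and the real evaluator operates on rescaled embeddings of assessable classes only.

```python
import math
from statistics import mean, median


def mean_centroid_distance(points):
    """Mean Euclidean distance from each point to the class centroid."""
    dim = len(points[0])
    centroid = [mean(p[d] for p in points) for d in range(dim)]
    return mean(math.dist(p, centroid) for p in points)


def relative_dispersion(classes):
    """Divide each class's raw spread by the median spread across classes."""
    raw = {name: mean_centroid_distance(pts) for name, pts in classes.items()}
    peer_median = median(raw.values())
    return {name: value / peer_median for name, value in raw.items()}


# Three classes with the same count (4) but very different spread.
classes = {
    "tight": [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)],
    "typical": [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)],
    "wide": [(20.0, 20.0), (24.0, 20.0), (20.0, 24.0), (24.0, 24.0)],
}

dispersion = relative_dispersion(classes)
```

Here "typical" sits at the peer median and so scores about 1, "tight" falls well below 1, and "wide" lands well above it, even though all three classes have the same number of samples and are far apart from one another.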

evaluate(dataset, embeddings=None)

Evaluate a dataset’s embedding-space coverage, broken down by class.

Parameters:
dataset : AnnotatedDataset or Metadata

The dataset to evaluate. Class labels are read from it; embeddings are computed from it via the configured extractor unless provided directly.

embeddings : Array or None, default None

Pre-computed embeddings, one per label. When omitted, an extractor must be configured and embeddings are computed from dataset.

Returns:

The per-class breakdown (count / uncovered_fraction / dispersion / assessable) with sample-level uncovered_indices, coverage_radius, and critical_value_radii.

Return type:

CoverageOutput

Examples

>>> evaluator = Coverage(crop_extractor, min_class_samples=5)
>>> result = evaluator.evaluate(cropped_dataset)

data() is the per-class breakdown, sorted with the lowest-dispersion (least visually varied) classes first. car here has plenty of crops but the least spread — the signal raw counts cannot give:

>>> result.data().select("class", "count", "uncovered", "dispersion")
shape: (4, 4)
┌────────┬───────┬───────────┬────────────┐
│ class  ┆ count ┆ uncovered ┆ dispersion │
│ ---    ┆ ---   ┆ ---       ┆ ---        │
│ str    ┆ i64   ┆ i64       ┆ f64        │
╞════════╪═══════╪═══════════╪════════════╡
│ car    ┆ 24    ┆ 0         ┆ 0.395929   │
│ boat   ┆ 22    ┆ 0         ┆ 0.805483   │
│ plane  ┆ 27    ┆ 0         ┆ 1.194517   │
│ person ┆ 20    ┆ 1         ┆ 1.488841   │
└────────┴───────┴───────────┴────────────┘

Classes

Config

Configuration for the Coverage evaluator.