dataeval.scope.Coverage
-
class dataeval.scope.Coverage(extractor=None, *, batch_size=None, method=None, num_observations=None, percent=None, min_class_samples=None, isotropy_min_samples=None, near_duplicate_factor=None, config=None)
Evaluate a dataset’s embedding-space coverage and per-class variety.
Computes the global coverage radius once over the full set of image embeddings, which gives each class’s share of globally under-sampled samples, then describes each class with three complementary, class-local variety signals that label counts cannot give:
dispersion — magnitude of spread (mean distance to centroid, relative to a typical class); low means clustered even if numerous.
isotropy — shape of that spread (in how many independent directions the class varies); low means it varies along few axes even when dispersion is high.
near_duplicate_fraction — redundancy (share of the class in unusually tight nearest-neighbor pairs); high means repeated / near-identical samples.
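To make the three signals above concrete, here is a minimal NumPy sketch on synthetic data. It follows the definitions given in the Notes but is not the library's implementation: the helper names are hypothetical, the isotropy stand-in uses a participation ratio divided by the ambient dimension (an assumption; the library relies on completeness() relative to the spanned subspace), and pairwise distances are computed by brute force.

```python
import numpy as np


def mean_centroid_distance(x: np.ndarray) -> float:
    """Mean Euclidean distance from each sample to the class centroid."""
    return float(np.linalg.norm(x - x.mean(axis=0), axis=1).mean())


def isotropy_proxy(x: np.ndarray) -> float:
    """Participation ratio of the covariance spectrum, scaled to (0, 1].

    Assumption: a simplified stand-in for the library's isotropy estimate.
    """
    ev = np.clip(np.linalg.eigvalsh(np.cov(x, rowvar=False)), 0.0, None)
    effective_dim = ev.sum() ** 2 / (ev**2).sum()
    return float(effective_dim / x.shape[1])


def near_duplicate_fraction(x: np.ndarray, factor: float = 0.5) -> float:
    """Share of samples whose nearest-neighbor distance is below
    factor * (median nearest-neighbor distance of the class)."""
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # ignore self-distances
    nn = d.min(axis=1)
    return float((nn < factor * np.median(nn)).mean())


rng = np.random.default_rng(0)
spread = rng.normal(size=(200, 4))            # varied in every direction
tight = 0.1 * rng.normal(size=(200, 4))       # same shape, much smaller spread
line = 0.01 * rng.normal(size=(200, 4))       # varies almost only along axis 0
line[:, 0] += rng.normal(size=200)
dupes = np.vstack([spread, spread[:30] + 1e-3 * rng.normal(size=(30, 4))])

classes = {"spread": spread, "tight": tight, "line": line, "dupes": dupes}

# Dispersion: raw spread normalized by the median across peer classes.
raw = {name: mean_centroid_distance(x) for name, x in classes.items()}
peer_median = float(np.median(list(raw.values())))
dispersion = {name: value / peer_median for name, value in raw.items()}

for name, x in classes.items():
    print(
        f"{name:>6}  dispersion={dispersion[name]:.2f}  "
        f"isotropy~{isotropy_proxy(x):.2f}  "
        f"near_dup={near_duplicate_fraction(x):.2f}"
    )
```

In this toy setup the tight class lands well below 1 on dispersion, the line class scores low on the isotropy proxy despite a sizeable spread, and only the dupes class shows a large near-duplicate share, which is the sense in which the three signals are complementary.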
- Parameters:
- extractor : FeatureExtractor or None, default None
Feature extractor used to compute embeddings from a dataset. Optional only when pre-computed embeddings are passed to evaluate().
- batch_size : int or None, default None
Batch size for embedding computation. When None, uses the global batch size.
- method : {"naive", "adaptive"}, default "adaptive"
Coverage radius method: fixed analytic radius (naive) or data-adaptive cutoff on the percent most sparsely-neighbored samples (adaptive).
- num_observations : int, default 20
Neighbors required for a sample to be considered covered (20-50 is typical).
- percent : float, default 0.01
Fraction of samples to flag as uncovered (adaptive method only).
- min_class_samples : int, default 20
A class needs at least this many samples for its per-class signals to be assessed; smaller classes are reported with assessable=False and null dispersion / isotropy / near_duplicate_fraction.
- isotropy_min_samples : int or None, default None
A class needs at least this many samples for its isotropy to be reported (the effective-dimension estimate is degenerate when samples do not exceed dimensions). When None, defaults to one more than the embedding dimensionality.
- near_duplicate_factor : float, default 0.5
A nearest-neighbor pair counts as a near-duplicate when its distance is below this fraction of the class’s median nearest-neighbor distance (scale-free).
- config : Coverage.Config or None, default None
Optional configuration object; parameters passed directly to __init__ override its values.
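The interplay of num_observations and percent in the adaptive method can be illustrated with a small NumPy sketch. This is an assumption-laden illustration, not the code behind dataeval.core.coverage_adaptive: adaptive_coverage is a hypothetical helper, the unit-interval rescaling is assumed to be a global min-max, and distances are computed by brute force.

```python
import numpy as np


def adaptive_coverage(embeddings, num_observations=20, percent=0.01):
    """Hypothetical helper: flag the `percent` most sparsely-neighbored samples."""
    # Rescale to the unit interval (assumed here to be a global min-max).
    lo, hi = embeddings.min(), embeddings.max()
    x = (embeddings - lo) / (hi - lo)

    # Brute-force pairwise Euclidean distances; fine for a few thousand samples.
    dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)

    # After sorting each row, column 0 is the self-distance (0) and column k
    # is the distance to the k-th nearest neighbor.
    radii = np.sort(dists, axis=1)[:, num_observations]

    # Flag the samples with the largest radii, always at least one.
    n_flag = max(int(len(x) * percent), 1)
    uncovered = np.sort(np.argsort(radii)[::-1][:n_flag])
    return uncovered, radii


rng = np.random.default_rng(0)
cluster = rng.normal(size=(300, 2))
outliers = np.array([[10.0, 10.0], [-10.0, 10.0], [10.0, -10.0]])
embeddings = np.vstack([cluster, outliers])
labels = np.array(["a"] * 150 + ["b"] * 153)  # the outliers belong to class "b"

uncovered, radii = adaptive_coverage(embeddings, num_observations=20, percent=0.01)

# Per-class share of the globally uncovered samples.
mask = np.zeros(len(embeddings), dtype=bool)
mask[uncovered] = True
share = {str(c): float(mask[labels == c].mean()) for c in np.unique(labels)}

print(uncovered)  # indices of the three isolated points
print(share)
```

Because the cutoff is computed once over all embeddings, a class only accumulates uncovered samples when its members sit in globally sparse regions, which is why the radius is described as global while the other signals are class-local.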
See also
dataeval.scope.Representation
The label-space counterpart (ontology coverage).
dataeval.scope.Prioritize
Rank individual samples for labeling in embedding space.
dataeval.core.coverage_adaptive
The underlying sample-level coverage computation.
Notes
dispersion is the class’s mean distance-to-centroid divided by the median of that same measure across assessable classes: ~1 is a typical class, < 1 means the class is unusually clustered (low variety / near-duplicate), > 1 means it is more spread out than its peers. Normalizing by the peer median (rather than the global spread) keeps the scale meaningful even when classes are well-separated in embedding space.

isotropy is the class’s effective dimensionality relative to the subspace it spans (via completeness()): ~1 means it varies evenly in every direction it occupies, low means its variation collapses onto a few axes. It is orthogonal to dispersion, which only measures how far the class spreads.

near_duplicate_fraction is the share of within-class nearest-neighbor pairs closer than near_duplicate_factor × the class median, surfacing repeated / near-identical samples that inflate counts without adding variety.

Embeddings are auto-rescaled to the unit interval for the coverage computation.

This evaluator assumes one embedding per label (image classification). For object detection, wrap the dataset with DetectionCrops to present its boxes as an image-classification dataset (one crop per detection, aligned 1:1 with the labels) and evaluate that, or supply detection-level embeddings you have computed yourself.

Examples
>>> from dataeval.scope import Coverage
>>> evaluator = Coverage(extractor, num_observations=20, min_class_samples=20)
>>> result = evaluator.evaluate(dataset)
>>> result.data()  # per-class breakdown, lowest-dispersion classes first
>>> result.uncovered_indices  # individual samples in sparse regions

Pass pre-computed embeddings to skip extraction:
>>> result = evaluator.evaluate(dataset, embeddings=embeddings)

The per-class dispersion column is the signal raw label counts cannot give: two classes with identical counts can differ sharply when one’s examples are near-duplicates (dispersion well below 1).
-
evaluate(dataset, embeddings=None)
Evaluate a dataset’s embedding-space coverage, broken down by class.
- Parameters:
- dataset : AnnotatedDataset or Metadata
The dataset to evaluate. Class labels are read from it; embeddings are computed from it via the configured extractor unless provided directly.
- embeddings : Array or None, default None
Pre-computed embeddings, one per label. When omitted, an extractor must be configured and embeddings are computed from dataset.
- Returns:
The per-class breakdown (count / uncovered_fraction / dispersion / assessable) with sample-level uncovered_indices, coverage_radius, and critical_value_radii.
- Return type:
Examples
>>> evaluator = Coverage(crop_extractor, min_class_samples=5)
>>> result = evaluator.evaluate(cropped_dataset)

data() is the per-class breakdown, sorted with the lowest-dispersion (least visually varied) classes first. car here has plenty of crops but the least spread (the signal raw counts cannot give):

>>> result.data().select("class", "count", "uncovered", "dispersion")
shape: (4, 4)
┌────────┬───────┬───────────┬────────────┐
│ class  ┆ count ┆ uncovered ┆ dispersion │
│ ---    ┆ ---   ┆ ---       ┆ ---        │
│ str    ┆ i64   ┆ i64       ┆ f64        │
╞════════╪═══════╪═══════════╪════════════╡
│ car    ┆ 24    ┆ 0         ┆ 0.395929   │
│ boat   ┆ 22    ┆ 0         ┆ 0.805483   │
│ plane  ┆ 27    ┆ 0         ┆ 1.194517   │
│ person ┆ 20    ┆ 1         ┆ 1.488841   │
└────────┴───────┴───────────┴────────────┘
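As a quick sanity check on how to read this table, the dispersion values can be compared against the peer-median normalization described in the Notes. The sketch below copies the four rows from the example output into plain Python; the 0.5 cutoff used to flag low-variety classes is purely illustrative and not a library default.

```python
from statistics import median

# (class, count, uncovered, dispersion) rows copied from the example output.
rows = [
    ("car", 24, 0, 0.395929),
    ("boat", 22, 0, 0.805483),
    ("plane", 27, 0, 1.194517),
    ("person", 20, 1, 1.488841),
]

# With an even number of classes the median is the mean of the middle two
# values, so a peer-median-normalized column should have a median of about 1.
peer_median = median(d for _, _, _, d in rows)
print(round(peer_median, 6))

# Illustrative threshold (an assumption): classes far below the typical spread.
LOW_DISPERSION = 0.5
low_variety = [name for name, _, _, d in rows if d < LOW_DISPERSION]
print(low_variety)
```

The middle two values (boat and plane) average to 1, consistent with dispersion being expressed relative to a typical class, and car is the only class that falls below the illustrative cutoff even though its count is the second highest.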
Classes
Configuration for the |