dataeval.Metadata

class dataeval.Metadata(dataset, *, continuous_factor_bins=None, auto_bin_method='uniform_width', exclude=None, include=None)

Collection of binned metadata using Polars DataFrames.

Processes dataset metadata by automatically binning continuous factors and digitizing categorical factors for analysis and visualization workflows.

Parameters:
dataset : ImageClassificationDataset or ObjectDetectionDataset

Dataset that provides original targets and metadata for processing.

continuous_factor_bins : Mapping[str, int | Sequence[float]] | None, default None

Mapping from continuous factor names to bin counts or explicit bin edges. When None, uses automatic discretization.

auto_bin_method : Literal["uniform_width", "uniform_count", "clusters"], default "uniform_width"

Binning strategy for continuous factors without explicit bins. Default “uniform_width” provides intuitive equal-width intervals for most distributions.

exclude : Sequence[str] | None, default None

Factor names to exclude from processing. Cannot be combined with the include parameter. When None, no factors are excluded.

include : Sequence[str] | None, default None

Factor names to include in processing. Cannot be combined with the exclude parameter. When None, processes all available factors.

Raises:

ValueError – When both exclude and include parameters are specified simultaneously.
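The interaction between include and exclude can be pictured with a minimal sketch. The helper `resolve_factor_filter` below is hypothetical and not part of dataeval; it only mirrors the documented rules that the two parameters are mutually exclusive and that None leaves all factors in place. Treating any non-None value (including an empty sequence) as "specified" is an assumption.

```python
from typing import Optional, Sequence


def resolve_factor_filter(
    available: Sequence[str],
    include: Optional[Sequence[str]] = None,
    exclude: Optional[Sequence[str]] = None,
) -> list:
    """Hypothetical helper: return the factor names that would be processed."""
    if include is not None and exclude is not None:
        # Documented behavior: specifying both is an error.
        raise ValueError("Specify either include or exclude, not both.")
    if include is not None:
        wanted = set(include)
        return [name for name in available if name in wanted]
    if exclude is not None:
        unwanted = set(exclude)
        return [name for name in available if name not in unwanted]
    return list(available)


factors = ["temp", "time", "loc"]
print(resolve_factor_filter(factors, include=["time"]))  # ['time']
print(resolve_factor_filter(factors, exclude=["time"]))  # ['temp', 'loc']
```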

add_factors(factors, level='auto')

Add additional factors to metadata collection.

Extend the current metadata with new factors at either image or target level. For image-level factors, values are stored only in image-level rows. For target-level factors, values are stored only in target-level rows.

Parameters:
factors : Mapping[str, _1DArray[Any]]

Mapping of factor names to their values. Factor length must match the specified level (image count or target count).

level : {"image", "target", "auto"}, default="auto"

Level at which to store the factors:

- “image”: Array length must match image count, stored in image-level rows only
- “target”: Array length must match target count, stored in target-level rows only
- “auto”: Automatically infers level based on array length

Raises:

ValueError – When factor lengths do not match the specified level’s dimensions.

Return type:

None

Examples

>>> metadata = Metadata(od_dataset)
>>> # Add image-level factors (e.g., from imagestats)
>>> image_factors = {
...     "brightness": [0.2, 0.8, 0.5],  # One per image
...     "contrast": [1.1, 0.9, 1.0],
... }
>>> metadata.add_factors(image_factors, level="image")
>>>
>>> # Add target-level factors (e.g., detection confidence scores)
>>> target_factors = {
...     "iou": [0.85, 0.92, 0.78, 0.88, 0.91],  # One per target/detection
... }
>>> metadata.add_factors(target_factors, level="target")
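The "auto" level can be thought of as a length comparison. The function `infer_level` below is a hypothetical illustration, not the library's implementation; in particular, raising when the image count and target count are equal is an assumption about how an ambiguous length might be handled.

```python
def infer_level(n_values: int, image_count: int, target_count: int) -> str:
    """Hypothetical helper: guess the factor level from the array length."""
    matches_image = n_values == image_count
    matches_target = n_values == target_count
    if matches_image and matches_target:
        # Assumption: equal counts are ambiguous, so require an explicit level.
        raise ValueError("Length matches both levels; pass level explicitly.")
    if matches_image:
        return "image"
    if matches_target:
        return "target"
    raise ValueError(
        f"Length {n_values} matches neither image count {image_count} "
        f"nor target count {target_count}."
    )


# Counts taken from the example above: 3 images and 5 targets.
print(infer_level(3, image_count=3, target_count=5))  # image
print(infer_level(5, image_count=3, target_count=5))  # target
```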
calculate_distance(other)

Measures the feature-wise distance between two continuous metadata distributions and computes a p-value to evaluate its significance.

Uses the Earth Mover’s Distance and the Kolmogorov-Smirnov two-sample test, each computed feature-wise.

Parameters:
other : Metadata

Metadata instance containing the continuous factor names and values to compare against.

Returns:

A mapping whose keys are metadata feature names and whose values hold the Kolmogorov-Smirnov two-sample test results (statistic, location, p_value), in the sense of scipy.stats.ks_2samp, together with the Earth Mover’s Distance (dist).

Return type:

MetadataDistanceOutput

See also

Earth Mover’s Distance, Kolmogorov-Smirnov two-sample test

Notes

This method applies only to continuous factors.

Examples

>>> output = metadata1.calculate_distance(metadata2)
>>> list(output)
['time', 'altitude']
>>> output["time"]
{'statistic': 1.0, 'location': 0.44354838709677413, 'dist': 2.6999999999999997, 'p_value': 0.0}
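For intuition about the two quantities reported per feature, here is a pure-Python sketch of the Kolmogorov-Smirnov statistic and the one-dimensional Earth Mover’s Distance, both derived from empirical CDFs. The function names are hypothetical and this is not how dataeval computes its output; the p-value is omitted. In practice, scipy.stats.ks_2samp and scipy.stats.wasserstein_distance provide tested implementations.

```python
from bisect import bisect_right


def _ecdf(sorted_sample: list, x: float) -> float:
    """Fraction of the sample that is less than or equal to x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def ks_statistic(a: list, b: list) -> float:
    """Largest absolute gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    return max(abs(_ecdf(sa, x) - _ecdf(sb, x)) for x in sa + sb)


def earth_movers_distance(a: list, b: list) -> float:
    """Area between the two empirical CDFs (1-D Wasserstein-1 distance)."""
    sa, sb = sorted(a), sorted(b)
    points = sorted(sa + sb)
    total = 0.0
    # The CDF gap is constant between consecutive sample points.
    for left, right in zip(points, points[1:]):
        total += abs(_ecdf(sa, left) - _ecdf(sb, left)) * (right - left)
    return total


# Hypothetical values for one continuous factor in two datasets.
altitude_a = [0.0, 1.0, 2.0]
altitude_b = [3.0, 4.0, 5.0]
print(ks_statistic(altitude_a, altitude_b))  # 1.0, the samples do not overlap
print(earth_movers_distance(altitude_a, altitude_b))
```

A statistic of 1.0, as in the "time" example above, means the two empirical distributions are fully separated at some threshold.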
filter_by_factor(condition)

Filters metadata factors by factor name or FactorInfo.

Parameters:
condition : Callable[[str, FactorInfo], bool]

Predicate that receives a factor name and its FactorInfo and returns True to include that factor in the output.

Returns:

Array with shape (n_samples, n_factors) where the factors are filtered by the user provided condition.

Return type:

NDArray[np.float64]
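The shape of a condition callable can be sketched without the library. `FactorInfoStub` and `filter_columns` below are hypothetical stand-ins: the real FactorInfo field names are not shown in this page, so `factor_type`, `is_binned`, and `is_digitized` are assumptions chosen only to illustrate a predicate over (name, info) pairs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FactorInfoStub:
    """Hypothetical stand-in for FactorInfo; the real field names may differ."""

    factor_type: str  # assumed labels such as "categorical" or "continuous"
    is_binned: bool = False
    is_digitized: bool = False


def filter_columns(
    rows: list,
    info: dict,
    condition: Callable[[str, FactorInfoStub], bool],
) -> list:
    """Keep only the columns whose (name, info) pair satisfies the condition."""
    keep = [i for i, (name, fi) in enumerate(info.items()) if condition(name, fi)]
    return [[row[i] for i in keep] for row in rows]


info = {
    "temp": FactorInfoStub("continuous", is_binned=True),
    "time": FactorInfoStub("categorical", is_digitized=True),
    "loc": FactorInfoStub("categorical", is_digitized=True),
}
# One row per sample, one column per factor, in the same order as info.
rows = [[72.5, 0.0, 2.0], [65.3, 1.0, 0.0], [68.1, 2.0, 1.0]]

# Select continuous factors only.
continuous_only = filter_columns(
    rows, info, lambda name, fi: fi.factor_type == "continuous"
)
print(continuous_only)  # [[72.5], [65.3], [68.1]]
```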

get_image_factors(image_idx)

Get all factors for a specific image.

Parameters:
image_idx : int

Index of the image to retrieve factors for

Returns:

Dictionary mapping factor names to their values for the specified image

Return type:

dict[str, Any]

Examples

>>> factors = metadata.get_image_factors(0)
>>> factors["temp"]
72.5
>>> factors["time"]
'morning'
>>> factors["loc"]
'urban'
get_target_factors(image_idx, target_idx)

Get all factors for a specific target within an image.

Parameters:
image_idx : int

Index of the image containing the target

target_idx : int

Index of the target within the image (0-indexed per image)

Returns:

Dictionary mapping factor names to their values for the specified target

Return type:

dict[str, Any]

Examples

>>> factors = metadata.get_target_factors(0, 1)
>>> factors["image_index"]
0
>>> factors["target_index"]
1
>>> factors["class_label"]
1
has_targets()

Check if the source dataset has targets.

Returns:

True if dataset contains targets, False for classification datasets.

Return type:

bool

property auto_bin_method : 'uniform_width' | 'uniform_count' | 'clusters'

Automatic binning strategy for continuous factors.

Returns:

Current method used for automatic discretization of continuous factors that lack explicit bin specifications.

Return type:

{“uniform_width”, “uniform_count”, “clusters”}
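To show how "uniform_width" and "uniform_count" can differ on skewed data, the sketch below builds bin edges both ways in plain Python. The helper names are hypothetical and the edge conventions are assumptions; the library's actual binning may place edges differently. The "clusters" strategy is not sketched here.

```python
from bisect import bisect_right


def uniform_width_edges(values: list, n_bins: int) -> list:
    """Edges that split the value range into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(n_bins + 1)]


def uniform_count_edges(values: list, n_bins: int) -> list:
    """Edges placed at order statistics so each bin holds a similar count."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / n_bins)] for i in range(n_bins + 1)]


def digitize(values: list, edges: list) -> list:
    """Map each value to a bin index using the interior edges."""
    interior = edges[1:-1]
    return [bisect_right(interior, v) for v in values]


# A single outlier stretches the range of this hypothetical factor.
values = [1.0, 2.0, 3.0, 4.0, 100.0]
print(digitize(values, uniform_width_edges(values, 2)))  # [0, 0, 0, 0, 1]
print(digitize(values, uniform_count_edges(values, 2)))  # [0, 0, 1, 1, 1]
```

Equal-width bins keep interval sizes easy to read but can leave most samples in one bin when outliers are present; equal-count bins balance occupancy at the cost of uneven interval widths.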

property binned_data : numpy.typing.NDArray[numpy.int64]

Factor data with continuous values discretized into bins.

Access fully processed factor data where both categorical and continuous factors are converted to integer bin indices.

Returns:

Array with shape (n_samples, n_factors) containing binned integer data ready for categorical analysis algorithms. Returns empty array when no factors are available. For OD datasets, returns only target-level rows to align with class_labels.

Return type:

NDArray[np.int64]

Notes

This property triggers factor binning analysis on first access. Use this for algorithms requiring purely discrete input data.

For object detection datasets, this returns target-level data only to ensure alignment with class_labels (one row per detection).

property class_labels : numpy.typing.NDArray[numpy.intp]

Target class labels as integer indices.

Returns:

Array of class indices corresponding to dataset targets. For object detection datasets, contains one label per detection.

Return type:

NDArray[np.intp]

Notes

This property triggers dataset structure analysis on first access. Use index2label property to get human-readable label names.

property continuous_factor_bins : collections.abc.Mapping[str, int | collections.abc.Sequence[float]]

Binning configuration for continuous factors.

Returns:

Mapping of factor names to either the number of bins (int) or explicit bin edges (sequence of floats).

Return type:

Mapping[str, int | Sequence[float]]
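The two forms a mapping value can take, a bin count or explicit bin edges, might be interpreted as in the sketch below. `apply_bin_spec` is a hypothetical helper, and reading an integer as "that many equal-width bins" is an assumption made for illustration only.

```python
from bisect import bisect_right
from typing import Mapping, Sequence, Union

BinSpec = Union[int, Sequence[float]]


def apply_bin_spec(values: Sequence[float], spec: BinSpec) -> list:
    """Hypothetical helper: turn values into bin indices from one spec entry."""
    if isinstance(spec, int):
        # Assumption: an integer means that many equal-width bins.
        lo, hi = min(values), max(values)
        edges = [lo + i * (hi - lo) / spec for i in range(spec + 1)]
    else:
        edges = list(spec)
    # Only interior edges separate bins; the outer edges bound the range.
    interior = edges[1:-1]
    return [bisect_right(interior, v) for v in values]


continuous_factor_bins: Mapping[str, BinSpec] = {
    "temp": [60.0, 66.0, 70.0, 80.0],  # explicit edges, giving 3 bins
    "altitude": 2,  # bin count
}

temp = [72.5, 65.3, 68.1]
altitude = [100.0, 250.0, 400.0]  # hypothetical values
print(apply_bin_spec(temp, continuous_factor_bins["temp"]))  # [2, 0, 1]
print(apply_bin_spec(altitude, continuous_factor_bins["altitude"]))  # [0, 1, 1]
```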

property dataframe : polars.DataFrame

Processed DataFrame containing both image-level and target-level rows.

Access the main data structure with both image-level metadata and target-level information (class labels, scores, bounding boxes). Use image_data or target_data properties to filter to specific row types.

Returns:

DataFrame with columns for image_index, target_index, class_label, scores, bounding boxes (when applicable), and all processed metadata factors. Rows where target_index is None contain image-level data. Rows where target_index is an integer contain target/detection-level data.

Return type:

pl.DataFrame

Notes

This property triggers dataset structure analysis on first access. Factor binning occurs automatically when accessing factor-related data.

For Object Detection datasets, the dataframe now contains:

- Image-level rows (target_index=None): One per image with image-level factors
- Target-level rows (target_index=0,1,2…): One per detection with detection data

See also

image_data

Filter to image-level rows only

target_data

Filter to target-level rows only
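The split between image-level and target-level rows can be pictured with plain dictionaries standing in for DataFrame rows. The values below are hypothetical and only three of the documented columns are shown; this is an illustration of the row layout, not of the Polars API.

```python
# Each dict mimics one row; target_index is None on image-level rows.
rows = [
    {"image_index": 0, "target_index": None, "class_label": None},
    {"image_index": 0, "target_index": 0, "class_label": 0},
    {"image_index": 0, "target_index": 1, "class_label": 1},
    {"image_index": 1, "target_index": None, "class_label": None},
    {"image_index": 1, "target_index": 0, "class_label": 1},
]

# Equivalent in spirit to the image_data and target_data properties.
image_rows = [r for r in rows if r["target_index"] is None]
target_rows = [r for r in rows if r["target_index"] is not None]

print(len(image_rows), len(target_rows))  # 2 3
```

On an actual polars.DataFrame the same selection would typically be written with `pl.col("target_index").is_null()` and `is_not_null()` inside `filter`.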

property dropped_factors : collections.abc.Mapping[str, collections.abc.Sequence[str]]

Factors removed during preprocessing with removal reasons.

Returns:

Mapping of dropped factor names to lists of reasons why they were excluded from the final dataset.

Return type:

Mapping[str, Sequence[str]]

Notes

This property triggers dataset structure analysis on first access. Common removal reasons include incompatible data types, excessive missing values, or insufficient variation.

property exclude : set[str]

Factor names excluded from metadata processing.

Returns:

Set of factor names that are filtered out during processing. Empty set when no exclusions are active.

Return type:

set[str]

property factor_data : numpy.typing.NDArray[Any]

Raw factor values before binning or digitization.

Access unprocessed factor data in its original numeric form before any categorical encoding or binning transformations are applied.

Returns:

Array with shape (n_samples, n_factors) containing original factor values. Returns empty array when no factors are available. For OD datasets, returns only target-level rows to align with class_labels.

Return type:

NDArray[Any]

Notes

Use this for algorithms that can work with mixed data types or when you need access to original continuous values. For analysis-ready numeric data, use binned_data.

For object detection datasets, this returns target-level data only to ensure alignment with class_labels (one row per detection).

property factor_info : collections.abc.Mapping[str, FactorInfo]

Type information and processing status for each factor.

Returns:

Mapping of factor names to FactorInfo objects containing data type classification and processing flags (binned, digitized).

Return type:

Mapping[str, FactorInfo]

Notes

This property triggers factor binning analysis on first access. Only includes factors that survived preprocessing and filtering.

property factor_names : collections.abc.Sequence[str]

Names of all processed metadata factors.

Returns:

List of factor names that passed filtering and preprocessing steps. Order matches columns in factor_data and binned_data.

Return type:

Sequence[str]

Notes

This property triggers dataset structure analysis on first access. Factor names respect include/exclude filtering settings.

property image_data : polars.DataFrame

Dataframe containing only image-level rows.

Returns a view of the metadata dataframe filtered to rows where target_index is None, containing one row per image with image-level factors.

Returns:

Dataframe with image-level metadata. For Object Detection datasets, this provides per-image analysis without target-level duplication.

Return type:

pl.DataFrame

Notes

This property triggers dataset structure analysis on first access. Image-level factors are stored only in these rows to avoid duplication.

Examples

>>> metadata.image_data
shape: (3, 8)
┌─────────────┬──────────────┬─────────────┬───────────┬───────────┬──────┬───────────┬──────────┐
│ image_index ┆ target_index ┆ class_label ┆ score     ┆ box       ┆ temp ┆ time      ┆ loc      │
│ ---         ┆ ---          ┆ ---         ┆ ---       ┆ ---       ┆ ---  ┆ ---       ┆ ---      │
│ i64         ┆ i64          ┆ i64         ┆ list[f64] ┆ list[f64] ┆ f64  ┆ str       ┆ str      │
╞═════════════╪══════════════╪═════════════╪═══════════╪═══════════╪══════╪═══════════╪══════════╡
│ 0           ┆ null         ┆ null        ┆ null      ┆ null      ┆ 72.5 ┆ morning   ┆ urban    │
│ 1           ┆ null         ┆ null        ┆ null      ┆ null      ┆ 65.3 ┆ afternoon ┆ rural    │
│ 2           ┆ null         ┆ null        ┆ null      ┆ null      ┆ 68.1 ┆ evening   ┆ suburban │
└─────────────┴──────────────┴─────────────┴───────────┴───────────┴──────┴───────────┴──────────┘
property include : set[str]

Factor names included in metadata processing.

Returns:

Set of factor names that are processed during analysis. Empty set when no inclusion filter is active.

Return type:

set[str]

property item_count : int

Total number of items in the dataset.

Returns:

Count of unique items in the source dataset, regardless of how many targets/detections each item contains.

Return type:

int

property item_indices : numpy.typing.NDArray[numpy.intp]

Dataset indices linking targets back to their source items.

Returns:

Array mapping each target/detection back to its source item index in the original dataset. Essential for object detection datasets where multiple detections come from a single item.

Return type:

NDArray[np.intp]

Notes

This property triggers dataset structure analysis on first access.
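As an illustration of how item_indices pairs with class_labels, the sketch below groups per-detection labels by source item. The lists are hypothetical plain-Python stand-ins for the arrays, with values chosen to match a dataset of 2 images holding 2 and 3 detections.

```python
from collections import Counter

# One entry per detection: which item it came from and its class label.
item_indices = [0, 0, 1, 1, 1]
class_labels = [0, 1, 1, 2, 0]

# Number of detections contributed by each item.
targets_per_item = Counter(item_indices)

# Class labels grouped by source item.
labels_by_item = {}
for item, label in zip(item_indices, class_labels):
    labels_by_item.setdefault(item, []).append(label)

print(dict(targets_per_item))  # {0: 2, 1: 3}
print(labels_by_item)  # {0: [0, 1], 1: [1, 2, 0]}
```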

property raw : collections.abc.Sequence[collections.abc.Mapping[str, Any]]

Original metadata dictionaries extracted from the dataset.

Access the unprocessed metadata as it was provided in the original dataset before any binning, filtering, or transformation operations.

Returns:

List of metadata dictionaries, one per dataset item, containing the original key-value pairs as provided in the source data

Return type:

Sequence[Mapping[str, Any]]

Notes

This property triggers dataset structure analysis on first access.

property target_data : polars.DataFrame

Dataframe containing only target-level rows.

Returns a view of the metadata dataframe filtered to rows where target_index is not None, containing target/detection-level data.

Returns:

Dataframe with target-level metadata. Each row represents a single target or detection with its associated class, score, and bounding box information.

Return type:

pl.DataFrame

Notes

This property triggers dataset structure analysis on first access. This is similar to the legacy behavior where only target-level rows existed, but now image-level metadata is stored separately in image_data.

Examples

>>> metadata.target_data
shape: (5, 8)
┌─────────────┬──────────────┬─────────────┬──────────────┬─────────────┬──────┬───────────┬───────┐
│ image_index ┆ target_index ┆ class_label ┆ score        ┆ box         ┆ temp ┆ time      ┆ loc   │
│ ---         ┆ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---  ┆ ---       ┆ ---   │
│ i64         ┆ i64          ┆ i64         ┆ list[f64]    ┆ list[f64]   ┆ f64  ┆ str       ┆ str   │
╞═════════════╪══════════════╪═════════════╪══════════════╪═════════════╪══════╪═══════════╪═══════╡
│ 0           ┆ 0            ┆ 0           ┆ [1.0, 0.0,   ┆ [10.0,      ┆ 72.5 ┆ morning   ┆ urban │
│             ┆              ┆             ┆ 0.0]         ┆ 10.0, …     ┆      ┆           ┆       │
│             ┆              ┆             ┆              ┆ 20.0]       ┆      ┆           ┆       │
│ 0           ┆ 1            ┆ 1           ┆ [0.0, 1.0,   ┆ [30.0,      ┆ 72.5 ┆ morning   ┆ urban │
│             ┆              ┆             ┆ 0.0]         ┆ 30.0, …     ┆      ┆           ┆       │
│             ┆              ┆             ┆              ┆ 40.0]       ┆      ┆           ┆       │
│ 1           ┆ 0            ┆ 1           ┆ [0.0, 1.0,   ┆ [5.0, 5.0,  ┆ 65.3 ┆ afternoon ┆ rural │
│             ┆              ┆             ┆ 0.0]         ┆ … 15.0]     ┆      ┆           ┆       │
│ 1           ┆ 1            ┆ 2           ┆ [0.0, 0.0,   ┆ [25.0,      ┆ 65.3 ┆ afternoon ┆ rural │
│             ┆              ┆             ┆ 1.0]         ┆ 25.0, …     ┆      ┆           ┆       │
│             ┆              ┆             ┆              ┆ 35.0]       ┆      ┆           ┆       │
│ 1           ┆ 2            ┆ 0           ┆ [1.0, 0.0,   ┆ [45.0,      ┆ 65.3 ┆ afternoon ┆ rural │
│             ┆              ┆             ┆ 0.0]         ┆ 45.0, …     ┆      ┆           ┆       │
│             ┆              ┆             ┆              ┆ 55.0]       ┆      ┆           ┆       │
└─────────────┴──────────────┴─────────────┴──────────────┴─────────────┴──────┴───────────┴───────┘
property target_factors_only : bool

Whether only target-level factors are included, with image-level factors left out.

Returns:

True if image-level factors are excluded, False if included (default).

Return type:

bool