dataeval.data.Metadata

class dataeval.data.Metadata(dataset, *, continuous_factor_bins=None, auto_bin_method='uniform_width', exclude=None, include=None)

Collection of binned metadata using Polars DataFrames.

Processes dataset metadata by automatically binning continuous factors and digitizing categorical factors for analysis and visualization workflows.

Parameters:
dataset : ImageClassificationDataset or ObjectDetectionDataset

Dataset that provides original targets and metadata for processing.

continuous_factor_bins : Mapping[str, int | Sequence[float]] | None, default None

Mapping from continuous factor names to bin counts or explicit bin edges. When None, uses automatic discretization.

auto_bin_method : Literal["uniform_width", "uniform_count", "clusters"], default "uniform_width"

Binning strategy for continuous factors without explicit bins. Default "uniform_width" provides intuitive equal-width intervals for most distributions.

exclude : Sequence[str] | None, default None

Factor names to exclude from processing. Cannot be combined with the include parameter. When None, processes all available factors.

include : Sequence[str] | None, default None

Factor names to include in processing. Cannot be combined with the exclude parameter. When None, processes all available factors.

Raises:

ValueError – When both exclude and include parameters are specified simultaneously.
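
The include/exclude rule above can be sketched in plain Python. This is only an illustration of the documented behavior, not the library's implementation: `select_factors` is a hypothetical helper, and treating "specified" as "not None" is an assumption.

```python
from collections.abc import Sequence
from typing import Optional


def select_factors(
    available: Sequence[str],
    include: Optional[Sequence[str]] = None,
    exclude: Optional[Sequence[str]] = None,
) -> list[str]:
    """Hypothetical helper: apply include/exclude filtering to factor names."""
    if include is not None and exclude is not None:
        # Mirrors the documented rule: the two filters are mutually exclusive.
        raise ValueError("Specify either 'include' or 'exclude', not both.")
    if include is not None:
        keep = set(include)
        return [name for name in available if name in keep]
    if exclude is not None:
        drop = set(exclude)
        return [name for name in available if name not in drop]
    # Neither filter given: every available factor is processed.
    return list(available)


factors = ["time", "altitude", "camera_id"]
kept_by_exclude = select_factors(factors, exclude=["camera_id"])
kept_by_include = select_factors(factors, include=["time"])
```

Either filter alone narrows the factor list; passing both raises ValueError, as the Raises entry describes.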

add_factors(factors)

Add additional factors to metadata collection.

Extend the current metadata with new factors, automatically handling length validation and integration with existing data structures.

Parameters:
factors : Mapping[str, Array | Sequence[Any]]

Dictionary mapping factor names to their values. The length of each factor must match either the number of images or the number of detections in the dataset.

Raises:

ValueError – When factor lengths do not match dataset dimensions.

Return type:

None
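
The length rule above can be sketched as a standalone check. `check_factor_lengths` and its "image"/"detection" labels are hypothetical names used for illustration; the library's internal validation may differ.

```python
from collections.abc import Mapping, Sequence
from typing import Any


def check_factor_lengths(
    factors: Mapping[str, Sequence[Any]],
    image_count: int,
    detection_count: int,
) -> dict[str, str]:
    """Hypothetical check: classify each factor as image- or detection-level."""
    levels: dict[str, str] = {}
    for name, values in factors.items():
        if len(values) == image_count:
            levels[name] = "image"
        elif len(values) == detection_count:
            levels[name] = "detection"
        else:
            # Neither dimension matches, so the factor cannot be aligned.
            raise ValueError(
                f"Factor '{name}' has length {len(values)}; expected "
                f"{image_count} (images) or {detection_count} (detections)."
            )
    return levels


# Stand-in dataset with 3 images and 5 detections.
levels = check_factor_lengths(
    {"brightness": [0.2, 0.8, 0.5], "occluded": [0, 1, 0, 0, 1]},
    image_count=3,
    detection_count=5,
)
```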

Examples

>>> metadata = Metadata(dataset)
>>> new_factors = {
...     "brightness": [0.2, 0.8, 0.5, 0.3, 0.4, 0.1, 0.3, 0.2],
...     "contrast": [1.1, 0.9, 1.0, 0.8, 1.2, 1.0, 0.7, 1.3],
... }
>>> metadata.add_factors(new_factors)

calculate_distance(other)

Measures the feature-wise distance between two continuous metadata distributions and computes a p-value to evaluate its significance.

Uses the Earth Mover’s Distance and the two-sample Kolmogorov-Smirnov test, each computed feature-wise.

Parameters:
other : Metadata

Metadata instance containing the continuous factor names and values to compare against.

Returns:

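A mapping whose keys are metadata feature names and whose values are MetadataDistanceValues objects. Each value holds the Earth Mover’s Distance (dist) together with the statistic, location, and p-value of the two-sample Kolmogorov-Smirnov test (see scipy.stats.ks_2samp).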

Return type:

MetadataDistanceOutput

See also

Earth Mover’s Distance, Kolmogorov-Smirnov test

Note

This method applies only to continuous factors.
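
To make the two measures concrete, here is a minimal standard-library sketch for a single continuous factor. It is not the library's implementation: the Earth Mover’s Distance is shown only in its simplified equal-sample-size form, the p-value is omitted, and the function names are hypothetical. In practice, scipy.stats.wasserstein_distance and scipy.stats.ks_2samp provide general versions of these computations.

```python
from bisect import bisect_right
from collections.abc import Sequence


def emd_equal_size(a: Sequence[float], b: Sequence[float]) -> float:
    """1-D Earth Mover's Distance for two samples of equal size."""
    if len(a) != len(b):
        raise ValueError("This simplified form needs equal-size samples.")
    # With equal weights, optimal transport pairs the sorted values.
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest gap between the two empirical CDFs."""
    sa, sb = sorted(a), sorted(b)
    # The gap can only change at observed points, so check each one.
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in sa + sb
    )


time_a = [0.0, 1.0, 2.0, 3.0]
time_b = [2.0, 3.0, 4.0, 5.0]
dist = emd_equal_size(time_a, time_b)   # 2.0: every value shifts by 2
stat = ks_statistic(time_a, time_b)     # 0.5: half of each sample overlaps
```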

Examples

>>> output = metadata1.calculate_distance(metadata2)
>>> list(output)
['time', 'altitude']
>>> output["time"]
MetadataDistanceValues(statistic=1.0, location=0.44354838709677413, dist=2.7, pvalue=0.0)

filter_by_factor(condition)

Filters metadata factors by factor name or FactorInfo.

Parameters:
condition : Callable[[str, FactorInfo], bool]

Predicate that receives a factor's name and its FactorInfo and returns True to keep that factor in the output.

Returns:

Array with shape (n_samples, n_factors), where the factors are filtered by the user-provided condition.

Return type:

NDArray[np.float64]
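
The shape of the condition callable can be illustrated with stand-in data. `FactorInfoStub` and its `factor_type` attribute are hypothetical; the real FactorInfo attribute names may differ, and `filter_columns` is only a sketch of the column selection, not the library's code.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FactorInfoStub:
    """Hypothetical stand-in for FactorInfo; real attribute names may differ."""

    factor_type: str  # e.g. "continuous" or "categorical"


def filter_columns(
    rows: list[list[float]],
    names: list[str],
    info: dict[str, FactorInfoStub],
    condition: Callable[[str, FactorInfoStub], bool],
) -> list[list[float]]:
    """Keep only the columns whose (name, info) pair satisfies the condition."""
    keep = [i for i, name in enumerate(names) if condition(name, info[name])]
    return [[row[i] for i in keep] for row in rows]


names = ["time", "altitude", "camera_id"]
info = {
    "time": FactorInfoStub("continuous"),
    "altitude": FactorInfoStub("continuous"),
    "camera_id": FactorInfoStub("categorical"),
}
rows = [[0.1, 120.0, 2.0], [0.7, 95.0, 0.0]]

# The condition sees both the factor name and its info object.
continuous_only = filter_columns(
    rows, names, info, lambda name, fi: fi.factor_type == "continuous"
)
```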

property auto_bin_method : 'uniform_width' | 'uniform_count' | 'clusters'

Automatic binning strategy for continuous factors.

Returns:

Current method used for automatic discretization of continuous factors that lack explicit bin specifications.

Return type:

{"uniform_width", "uniform_count", "clusters"}
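
The difference between the first two strategies can be sketched as follows. This reflects the usual meaning of equal-width and equal-count binning and is an assumption about what the method names denote, not the library's code; the "clusters" strategy is not shown.

```python
from collections.abc import Sequence


def bin_uniform_width(values: Sequence[float], n_bins: int) -> list[int]:
    """Split the value range into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # constant factor: a single bin
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def bin_uniform_count(values: Sequence[float], n_bins: int) -> list[int]:
    """Assign bins by rank so each bin holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        # Ties are split by rank in this sketch.
        bins[idx] = rank * n_bins // len(values)
    return bins


values = [1.0, 2.0, 3.0, 4.0, 100.0, 200.0]
width_bins = bin_uniform_width(values, 2)   # [0, 0, 0, 0, 0, 1]
count_bins = bin_uniform_count(values, 2)   # [0, 0, 0, 1, 1, 1]
```

With skewed data such as this, equal-width bins put nearly every value in the first bin, while equal-count bins stay balanced.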

property binned_data : numpy.typing.NDArray[numpy.int64]

Factor data with continuous values discretized into bins.

Access fully processed factor data where both categorical and continuous factors are converted to integer bin indices.

Returns:

Array with shape (n_samples, n_factors) containing binned integer data ready for categorical analysis algorithms. Returns empty array when no factors are available.

Return type:

NDArray[np.int64]

Notes

This property triggers factor binning analysis on first access. Use this for algorithms requiring purely discrete input data.
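
For categorical factors, conversion to integer indices can be pictured as a lookup table from each distinct value to a code. The sketch below assigns codes in sorted order, which is an assumption; the library may order codes differently.

```python
from collections.abc import Hashable, Sequence


def digitize_categories(
    values: Sequence[Hashable],
) -> tuple[list[int], dict[Hashable, int]]:
    """Map each distinct category to an integer code (sorted order assumed)."""
    codes = {value: code for code, value in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes


time_of_day = ["day", "night", "day", "dusk"]
digitized, codes = digitize_categories(time_of_day)
# codes     -> {"day": 0, "dusk": 1, "night": 2}
# digitized -> [0, 2, 0, 1]
```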

property class_labels : numpy.typing.NDArray[numpy.intp]

Target class labels as integer indices.

Returns:

Array of class indices corresponding to dataset targets. For object detection datasets, contains one label per detection.

Return type:

NDArray[np.intp]

Notes

This property triggers dataset structure analysis on first access. Use class_names property to get human-readable label names.

property class_names : collections.abc.Sequence[str]

Human-readable names corresponding to class labels.

Returns:

List of class names where index corresponds to class label value. Derived from dataset metadata or auto-generated from label indices.

Return type:

Sequence[str]

Notes

This property triggers dataset structure analysis on first access.
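
Because each label value indexes into the list of names, the two properties combine naturally. The lists below are stand-ins for `metadata.class_labels` and `metadata.class_names`, used so the snippet runs without a dataset.

```python
from collections import Counter

class_labels = [0, 2, 1, 0, 0]         # stand-in for metadata.class_labels
class_names = ["cat", "dog", "bird"]   # stand-in for metadata.class_names

# Translate integer labels to names, then count targets per class.
counts = Counter(class_names[label] for label in class_labels)
```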

property continuous_factor_bins : collections.abc.Mapping[str, int | collections.abc.Sequence[float]]

Binning configuration for continuous factors.

Returns:

Dictionary mapping factor names to either the number of bins (int) or explicit bin edges (sequence of floats).

Return type:

Mapping[str, int | Sequence[float]]
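
The mapping accepts both forms side by side. The sketch below shows such a mapping and how explicit edges could be turned into bin indices. The left-closed intervals and the clamping of out-of-range values are choices made for this illustration, not documented library behavior, and `bin_with_edges` is a hypothetical helper.

```python
from bisect import bisect_right
from collections.abc import Sequence

# An int requests that many bins; a sequence gives explicit bin edges.
continuous_factor_bins = {
    "time": 5,
    "altitude": [0.0, 100.0, 500.0, 1000.0],
}


def bin_with_edges(values: Sequence[float], edges: Sequence[float]) -> list[int]:
    """Assign each value to the interval [edges[i], edges[i + 1]) containing it."""
    last_bin = len(edges) - 2
    # Values outside the edges are clamped into the first or last bin.
    return [min(max(bisect_right(edges, v) - 1, 0), last_bin) for v in values]


altitude = [50.0, 100.0, 750.0, 1000.0]
altitude_bins = bin_with_edges(altitude, continuous_factor_bins["altitude"])
# -> [0, 1, 2, 2]
```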

property dataframe : polars.DataFrame

Processed DataFrame containing targets and metadata factors.

Access the main data structure with target information (class labels, scores, bounding boxes) and processed metadata factors ready for analysis.

Returns:

DataFrame with columns for image indices, class labels, scores, bounding boxes (when applicable), and all processed metadata factors.

Return type:

pl.DataFrame

Notes

This property triggers dataset structure analysis on first access. Factor binning occurs automatically when accessing factor-related data.

property dropped_factors : collections.abc.Mapping[str, collections.abc.Sequence[str]]

Factors removed during preprocessing with removal reasons.

Returns:

Dictionary mapping dropped factor names to lists of reasons why they were excluded from the final dataset.

Return type:

Mapping[str, Sequence[str]]

Notes

This property triggers dataset structure analysis on first access. Common removal reasons include incompatible data types, excessive missing values, or insufficient variation.

property exclude : set[str]

Factor names excluded from metadata processing.

Returns:

Set of factor names that are filtered out during processing. Empty set when no exclusions are active.

Return type:

set[str]

property factor_data : numpy.typing.NDArray[Any]

Raw factor values before binning or digitization.

Access unprocessed factor data in its original numeric form before any categorical encoding or binning transformations are applied.

Returns:

Array with shape (n_samples, n_factors) containing original factor values. Returns empty array when no factors are available.

Return type:

NDArray[Any]

Notes

Use this for algorithms that can work with mixed data types or when you need access to original continuous values. For analysis-ready numeric data, use binned_data or numeric_data instead.

property factor_info : collections.abc.Mapping[str, FactorInfo]

Type information and processing status for each factor.

Returns:

Dictionary mapping factor names to FactorInfo objects containing data type classification and processing flags (binned, digitized).

Return type:

Mapping[str, FactorInfo]

Notes

This property triggers factor binning analysis on first access. Only includes factors that survived preprocessing and filtering.

property factor_names : collections.abc.Sequence[str]

Names of all processed metadata factors.

Returns:

List of factor names that passed filtering and preprocessing steps. Order matches columns in factor_data, numeric_data, and binned_data.

Return type:

Sequence[str]

Notes

This property triggers dataset structure analysis on first access. Factor names respect include/exclude filtering settings.

property image_count : int

Total number of images in the dataset.

Returns:

Count of unique images in the source dataset, regardless of how many targets/detections each image contains.

Return type:

int

property image_indices : numpy.typing.NDArray[numpy.intp]

Dataset indices linking targets back to source images.

Returns:

Array mapping each target/detection back to its source image index in the original dataset. Essential for object detection datasets where multiple detections come from single images.

Return type:

NDArray[np.intp]

Notes

This property triggers dataset structure analysis on first access.
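
A common use of this mapping is to regroup per-detection values by source image. The lists below are stand-ins for `metadata.image_indices` and `metadata.class_labels` in an object detection setting.

```python
from collections import defaultdict

image_indices = [0, 0, 1, 2, 2, 2]   # stand-in for metadata.image_indices
class_labels = [1, 1, 0, 2, 0, 2]    # stand-in for metadata.class_labels

# Collect the labels of all detections that belong to each image.
labels_per_image: dict[int, list[int]] = defaultdict(list)
for image_idx, label in zip(image_indices, class_labels):
    labels_per_image[image_idx].append(label)

detections_per_image = {idx: len(lbls) for idx, lbls in labels_per_image.items()}
```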

property include : set[str]

Factor names included in metadata processing.

Returns:

Set of factor names that are processed during analysis. Empty set when no inclusion filter is active.

Return type:

set[str]

property raw : collections.abc.Sequence[collections.abc.Mapping[str, Any]]

Original metadata dictionaries extracted from the dataset.

Access the unprocessed metadata as it was provided in the original dataset before any binning, filtering, or transformation operations.

Returns:

List of metadata dictionaries, one per dataset item, containing the original key-value pairs as provided in the source data.

Return type:

Sequence[Mapping[str, Any]]

Notes

This property triggers dataset structure analysis on first access.
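
Since the raw form is one dictionary per item, a quick way to inspect it is to pivot it into one list per key. The `raw` list below is a stand-in for `metadata.raw`, and filling missing keys with None is a choice made for this sketch.

```python
raw = [   # stand-in for metadata.raw
    {"time": 1.2, "camera": "a"},
    {"time": 3.4},
    {"time": 0.7, "camera": "b"},
]

# Gather every key that appears in any item, then build one column per key.
keys = sorted({key for item in raw for key in item})
columns = {key: [item.get(key) for item in raw] for key in keys}
```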