dataeval.data.Metadata¶

class dataeval.data.Metadata(dataset, *, continuous_factor_bins=None, auto_bin_method='uniform_width', exclude=None, include=None)¶

Collection of binned metadata using Polars DataFrames.

Processes dataset metadata by automatically binning continuous factors and digitizing categorical factors for analysis and visualization workflows.

Parameters:¶

dataset : ImageClassificationDataset or ObjectDetectionDataset¶: Dataset that provides original targets and metadata for processing.
continuous_factor_bins : Mapping[str, int | Sequence[float]] | None, default None¶: Mapping from continuous factor names to bin counts or explicit bin edges. When None, uses automatic discretization.
auto_bin_method : Literal["uniform_width", "uniform_count", "clusters"], default "uniform_width"¶: Binning strategy applied to continuous factors that have no entry in continuous_factor_bins. The default, "uniform_width", splits each factor's range into equal-width intervals.
exclude : Sequence[str] | None, default None¶: Factor names to exclude from processing. Cannot be combined with the include parameter. When None, no factors are excluded.
include : Sequence[str] | None, default None¶: Factor names to include in processing. Cannot be combined with the exclude parameter. When None, all available factors are processed.

Raises:¶

ValueError – When both exclude and include parameters are specified simultaneously.
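
As an illustration of how the "uniform_width" and "uniform_count" strategies named above differ, here is a minimal pure-Python sketch. It is not dataeval's implementation; the helper names `uniform_width_edges` and `uniform_count_edges` are hypothetical, and the library may compute its edges differently.

```python
def uniform_width_edges(values, n_bins):
    """Edges that split [min, max] into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(n_bins + 1)]


def uniform_count_edges(values, n_bins):
    """Edges taken at evenly spaced ranks, so bins hold similar numbers of values."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[(i * last) // n_bins] for i in range(n_bins + 1)]


# A skewed factor: one outlier stretches the range.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]

width_edges = uniform_width_edges(values, 2)  # midpoint of the range is the split
count_edges = uniform_count_edges(values, 2)  # median of the values is the split
```

With skewed data, equal-width bins leave almost every value in the first bin, while equal-count bins split at the median. That trade-off is one reason to pass explicit edges through continuous_factor_bins for factors with outliers.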

add_factors(factors)¶

Add factors to the metadata collection.

Extends the current metadata with new factors, validating their lengths and integrating them with the existing data structures.

Parameters:¶

factors : Mapping[str, Array | Sequence[Any]]¶: Dictionary mapping factor names to their values. Factor length must match either the number of images or number of detections in the dataset.

Raises:¶

ValueError – When factor lengths do not match dataset dimensions.

Return type:¶

None

Examples

>>> metadata = Metadata(dataset)
>>> new_factors = {
...     "brightness": [0.2, 0.8, 0.5, 0.3, 0.4, 0.1, 0.3, 0.2],
...     "contrast": [1.1, 0.9, 1.0, 0.8, 1.2, 1.0, 0.7, 1.3],
... }
>>> metadata.add_factors(new_factors)
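
The length rule described under Parameters can be sketched as a standalone check. This is an assumption about the kind of validation involved, not the library's code; `check_factor_lengths`, `image_count`, and `detection_count` are hypothetical names used only for illustration.

```python
def check_factor_lengths(factors, image_count, detection_count):
    """Raise ValueError unless every factor has one value per image or per detection."""
    allowed = {image_count, detection_count}
    for name, values in factors.items():
        if len(values) not in allowed:
            raise ValueError(
                f"Factor {name!r} has {len(values)} values; "
                f"expected one of {sorted(allowed)}"
            )


# Eight images with twenty detections in total (illustrative numbers).
good = {"brightness": [0.2, 0.8, 0.5, 0.3, 0.4, 0.1, 0.3, 0.2]}
check_factor_lengths(good, image_count=8, detection_count=20)  # passes silently

try:
    check_factor_lengths({"contrast": [1.1, 0.9, 1.0]}, image_count=8, detection_count=20)
    rejected = False
except ValueError:
    rejected = True  # three values match neither 8 images nor 20 detections
```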

calculate_distance(other)¶

Measures the feature-wise distance between two continuous metadata distributions and computes a p-value to evaluate its significance.

Uses the Earth Mover’s Distance and the Kolmogorov-Smirnov two-sample test, applied to each feature independently.

Parameters:¶

other : Metadata¶: Metadata instance whose continuous factor names and values are compared against this one.

Returns:¶

A mapping whose keys are metadata feature names and whose values are MetadataDistanceValues objects holding the Kolmogorov-Smirnov statistic, its location, the Earth Mover’s Distance (dist), and the p-value.

Return type:¶

MetadataDistanceOutput

See also

Earth Mover’s Distance, Kolmogorov-Smirnov two-sample test

Note

This method applies only to continuous factors.

Examples

>>> output = metadata1.calculate_distance(metadata2)
>>> list(output)
['time', 'altitude']
>>> output["time"]
MetadataDistanceValues(statistic=1.0, location=0.44354838709677413, dist=2.7, pvalue=0.0)
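
To make the two quantities concrete, the following pure-Python sketch computes the Kolmogorov-Smirnov statistic and the one-dimensional Earth Mover's Distance for a single feature from the empirical CDFs of two samples. It is a textbook formulation written for illustration, with the hypothetical helper names `ecdf` and `ks_and_emd`; it does not compute a p-value, and the library's values (for example any normalization of `location`) may be derived differently.

```python
from bisect import bisect_right


def ecdf(sorted_sample, x):
    """Fraction of the sample that is <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def ks_and_emd(a, b):
    """Return (KS statistic, Earth Mover's Distance) for two 1-D samples."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    grid = sorted(set(a_sorted) | set(b_sorted))
    ks, emd = 0.0, 0.0
    for i, x in enumerate(grid):
        gap = abs(ecdf(a_sorted, x) - ecdf(b_sorted, x))
        ks = max(ks, gap)  # largest vertical gap between the CDFs
        if i + 1 < len(grid):
            emd += gap * (grid[i + 1] - x)  # area between the CDFs
    return ks, emd


# Two samples with no overlap: the CDFs are maximally separated.
ks, emd = ks_and_emd([0, 1, 2, 3], [10, 11, 12, 13])
```

For these two samples the KS statistic is 1.0 and the Earth Mover's Distance is 10.0, which is the shift between the two groups.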

filter_by_factor(condition)¶

Filters metadata factors by factor name or FactorInfo.

Parameters:¶

condition : Callable[[str, FactorInfo], bool]¶: Predicate that receives a factor name and its FactorInfo and returns True when the factor should be included in the output.

Returns:¶

Array with shape (n_samples, n_factors) where the factors are filtered by the user provided condition.

Return type:¶

NDArray[np.float64]
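
The shape of a condition callable can be sketched without the library. In the example below, `FactorInfoStub` is a hypothetical stand-in for FactorInfo, and its `factor_type` field is an assumed name; check the real FactorInfo attributes before writing a condition against them. `select_columns` is likewise only an illustration of column filtering.

```python
from dataclasses import dataclass


@dataclass
class FactorInfoStub:
    # Hypothetical stand-in; the real FactorInfo fields may be named differently.
    factor_type: str


def select_columns(rows, factor_info, condition):
    """Keep only the columns whose (name, info) pair satisfies the condition."""
    keep = [
        i for i, (name, info) in enumerate(factor_info.items())
        if condition(name, info)
    ]
    return [[row[i] for i in keep] for row in rows]


factor_info = {
    "time": FactorInfoStub("continuous"),
    "weather": FactorInfoStub("categorical"),
    "altitude": FactorInfoStub("continuous"),
}
rows = [[1.5, 0, 300.0], [2.5, 1, 450.0]]  # shape (n_samples, n_factors)

# The condition receives the factor name and its info object.
continuous_only = select_columns(
    rows, factor_info, lambda name, info: info.factor_type == "continuous"
)
by_name = select_columns(rows, factor_info, lambda name, info: name == "weather")
```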

property auto_bin_method : 'uniform_width' | 'uniform_count' | 'clusters'¶

Automatic binning strategy for continuous factors.

Returns:¶: Current method used for automatic discretization of continuous factors that lack explicit bin specifications.
Return type:¶: {“uniform_width”, “uniform_count”, “clusters”}

property binned_data : numpy.typing.NDArray[numpy.int64]¶

Factor data with continuous values discretized into bins.

Access fully processed factor data where both categorical and continuous factors are converted to integer bin indices.

Returns:¶: Array with shape (n_samples, n_factors) containing binned integer data ready for categorical analysis algorithms. Returns empty array when no factors are available.
Return type:¶: NDArray[np.int64]

Notes

This property triggers factor binning analysis on first access. Use this for algorithms requiring purely discrete input data.
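
As a rough picture of what "integer bin indices" means for both factor kinds, this sketch digitizes a categorical factor and bins a continuous one in plain Python. The helpers `digitize_categories` and `bin_continuous` are hypothetical, and the ordering of category codes and the handling of edge values are assumptions rather than documented library behavior.

```python
from bisect import bisect_right


def digitize_categories(values):
    """Replace each category with its index in the sorted list of unique categories."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    return [index[v] for v in values], categories


def bin_continuous(values, edges):
    """Map each value to the index of the interval it falls in.

    Only interior edges are searched, so values at or beyond the outer
    edges land in the first or last bin.
    """
    interior = edges[1:-1]
    return [bisect_right(interior, v) for v in values]


weather = ["sunny", "rain", "sunny", "fog"]
altitude = [5.0, 10.0, 25.0, 30.0]

weather_codes, weather_names = digitize_categories(weather)
altitude_bins = bin_continuous(altitude, edges=[0, 10, 20, 30])

# One row per sample, one integer column per factor.
binned_rows = list(zip(weather_codes, altitude_bins))
```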

property class_labels : numpy.typing.NDArray[numpy.intp]¶

Target class labels as integer indices.

Returns:¶: Array of class indices corresponding to dataset targets. For object detection datasets, contains one label per detection.
Return type:¶: NDArray[np.intp]

Notes

This property triggers dataset structure analysis on first access. Use class_names property to get human-readable label names.

property class_names : collections.abc.Sequence[str]¶

Human-readable names corresponding to class labels.

Returns:¶: List of class names where index corresponds to class label value. Derived from dataset metadata or auto-generated from label indices.
Return type:¶: Sequence[str]

Notes

This property triggers dataset structure analysis on first access.

property continuous_factor_bins : collections.abc.Mapping[str, int | collections.abc.Sequence[float]]¶

Binning configuration for continuous factors.

Returns:¶: Dictionary mapping factor names to either the number of bins (int) or explicit bin edges (sequence of floats).
Return type:¶: Mapping[str, int | Sequence[float]]

property dataframe : polars.DataFrame¶

Processed DataFrame containing targets and metadata factors.

Access the main data structure with target information (class labels, scores, bounding boxes) and processed metadata factors ready for analysis.

Returns:¶: DataFrame with columns for image indices, class labels, scores, bounding boxes (when applicable), and all processed metadata factors.
Return type:¶: pl.DataFrame

Notes

This property triggers dataset structure analysis on first access. Factor binning occurs automatically when accessing factor-related data.

property dropped_factors : collections.abc.Mapping[str, collections.abc.Sequence[str]]¶

Factors removed during preprocessing with removal reasons.

Returns:¶: Dictionary mapping dropped factor names to lists of reasons why they were excluded from the final dataset.
Return type:¶: Mapping[str, Sequence[str]]

Notes

This property triggers dataset structure analysis on first access. Common removal reasons include incompatible data types, excessive missing values, or insufficient variation.

property exclude : set[str]¶

Factor names excluded from metadata processing.

Returns:¶: Set of factor names that are filtered out during processing. Empty set when no exclusions are active.
Return type:¶: set[str]

property factor_data : numpy.typing.NDArray[Any]¶

Raw factor values before binning or digitization.

Access unprocessed factor data in its original numeric form before any categorical encoding or binning transformations are applied.

Returns:¶: Array with shape (n_samples, n_factors) containing original factor values. Returns empty array when no factors are available.
Return type:¶: NDArray[Any]

Notes

Use this for algorithms that can work with mixed data types or when you need access to original continuous values. For analysis-ready numeric data, use binned_data or numeric_data instead.

property factor_info : collections.abc.Mapping[str, FactorInfo]¶

Type information and processing status for each factor.

Returns:¶: Dictionary mapping factor names to FactorInfo objects containing data type classification and processing flags (binned, digitized).
Return type:¶: Mapping[str, FactorInfo]

Notes

This property triggers factor binning analysis on first access. Only includes factors that survived preprocessing and filtering.

property factor_names : collections.abc.Sequence[str]¶

Names of all processed metadata factors.

Returns:¶: List of factor names that passed filtering and preprocessing steps. Order matches columns in factor_data, numeric_data, and binned_data.
Return type:¶: Sequence[str]

Notes

This property triggers dataset structure analysis on first access. Factor names respect include/exclude filtering settings.

property image_count : int¶

Total number of images in the dataset.

Returns:¶: Count of unique images in the source dataset, regardless of how many targets/detections each image contains.
Return type:¶: int

property image_indices : numpy.typing.NDArray[numpy.intp]¶

Dataset indices linking targets back to source images.

Returns:¶: Array mapping each target/detection back to its source image index in the original dataset. Essential for object detection datasets where multiple detections come from single images.
Return type:¶: NDArray[np.intp]

Notes

This property triggers dataset structure analysis on first access.
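
A small sketch of how such an index array can be used, with made-up values standing in for the property's output: broadcasting a per-image factor to every detection and counting detections per image.

```python
from collections import Counter

# Illustrative stand-in for image_indices: six detections across three images.
image_indices = [0, 0, 1, 2, 2, 2]

# A factor recorded once per image.
brightness_per_image = [0.2, 0.8, 0.5]

# Look up each detection's source image to get one value per detection.
brightness_per_detection = [brightness_per_image[i] for i in image_indices]

# Number of detections contributed by each image.
detections_per_image = Counter(image_indices)
```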

property include : set[str]¶

Factor names included in metadata processing.

Returns:¶: Set of factor names that are processed during analysis. Empty set when no inclusion filter is active.
Return type:¶: set[str]

property raw : collections.abc.Sequence[collections.abc.Mapping[str, Any]]¶

Original metadata dictionaries extracted from the dataset.

Access the unprocessed metadata as it was provided in the original dataset before any binning, filtering, or transformation operations.

Returns:¶: List of metadata dictionaries, one per dataset item, containing the original key-value pairs as provided in the source data.
Return type:¶: Sequence[Mapping[str, Any]]

Notes

This property triggers dataset structure analysis on first access.