dataeval.Metadata¶
- class dataeval.Metadata(dataset, *, continuous_factor_bins=None, auto_bin_method='uniform_width', exclude=None, include=None)¶
Collection of binned metadata using Polars DataFrames.
Processes dataset metadata by automatically binning continuous factors and digitizing categorical factors for analysis and visualization workflows.
- Parameters:¶
- dataset : ImageClassificationDataset or ObjectDetectionDataset¶
Dataset that provides original targets and metadata for processing.
- continuous_factor_bins : Mapping[str, int | Sequence[float]] | None, default None¶
Mapping from continuous factor names to bin counts or explicit bin edges. When None, uses automatic discretization.
- auto_bin_method : Literal["uniform_width", "uniform_count", "clusters"], default "uniform_width"¶
Binning strategy for continuous factors without explicit bins. Default “uniform_width” provides intuitive equal-width intervals for most distributions.
- exclude : Sequence[str] | None, default None¶
Factor names to exclude from processing. Cannot be combined with the include parameter. When None, processes all available factors.
- include : Sequence[str] | None, default None¶
Factor names to include in processing. Cannot be combined with the exclude parameter. When None, processes all available factors.
- Raises:¶
ValueError – When both exclude and include parameters are specified simultaneously.
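The difference between the "uniform_width" and "uniform_count" strategies can be illustrated with a small pure-Python sketch. This is not DataEval's implementation: the helper names `uniform_width_edges`, `uniform_count_edges`, and `digitize` are hypothetical, the way edges are chosen is an assumption, and the "clusters" strategy is not covered.

```python
from bisect import bisect_right
from typing import Sequence


def uniform_width_edges(values: Sequence[float], n_bins: int) -> list[float]:
    """Interior edges that split [min, max] into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + step * i for i in range(1, n_bins)]


def uniform_count_edges(values: Sequence[float], n_bins: int) -> list[float]:
    """Interior edges chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * i) // n_bins] for i in range(1, n_bins)]


def digitize(values: Sequence[float], interior_edges: Sequence[float]) -> list[int]:
    """Map each value to the integer index of the bin it falls into."""
    return [bisect_right(interior_edges, v) for v in values]


# Hypothetical temperature factor with one outlier.
temps = [60.0, 61.0, 62.0, 63.0, 64.0, 100.0]

# Equal-width bins: the outlier stretches the range, so it sits alone in bin 1.
width_bins = digitize(temps, uniform_width_edges(temps, 2))  # [0, 0, 0, 0, 0, 1]

# Equal-count bins: each bin receives three values.
count_bins = digitize(temps, uniform_count_edges(temps, 2))  # [0, 0, 0, 1, 1, 1]
```

Equal-width bins are easy to interpret but can leave bins nearly empty on skewed data, whereas equal-count bins stay balanced at the cost of uneven interval widths.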
- add_factors(factors, level='auto')¶
Add additional factors to metadata collection.
Extend the current metadata with new factors at either image or target level. For image-level factors, values are stored only in image-level rows. For target-level factors, values are stored only in target-level rows.
- Parameters:¶
- factors : Mapping[str, _1DArray[Any]]¶
Mapping of factor names to their values. Factor length must match the specified level (image count or target count).
- level : {"image", "target", "auto"}, default="auto"¶
Level at which to store the factors:
- “image”: Array length must match image count, stored in image-level rows only
- “target”: Array length must match target count, stored in target-level rows only
- “auto”: Automatically infers level based on array length
- Raises:¶
ValueError – When factor lengths do not match the specified level’s dimensions.
- Return type:¶
None
Examples
>>> metadata = Metadata(od_dataset)
>>> # Add image-level factors (e.g., from imagestats)
>>> image_factors = {
...     "brightness": [0.2, 0.8, 0.5],  # One per image
...     "contrast": [1.1, 0.9, 1.0],
... }
>>> metadata.add_factors(image_factors, level="image")
>>>
>>> # Add target-level factors (e.g., detection confidence scores)
>>> target_factors = {
...     "iou": [0.85, 0.92, 0.78, 0.88, 0.91],  # One per target/detection
... }
>>> metadata.add_factors(target_factors, level="target")
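The length-based inference described for level="auto" can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `infer_level` is a hypothetical helper, and raising an error when the image and target counts coincide is a choice made for this sketch only.

```python
from typing import Literal, Sequence


def infer_level(
    values: Sequence[object], image_count: int, target_count: int
) -> Literal["image", "target"]:
    """Guess the storage level of a factor from the length of its value array."""
    n = len(values)
    # Assumption for this sketch: refuse to guess when both counts are equal.
    if image_count == target_count and n == image_count:
        raise ValueError("Ambiguous length; pass level='image' or level='target'.")
    if n == image_count:
        return "image"
    if n == target_count:
        return "target"
    raise ValueError(
        f"Length {n} matches neither image count {image_count} "
        f"nor target count {target_count}."
    )


# Three images and five targets, matching the lengths used in the example above.
brightness_level = infer_level([0.2, 0.8, 0.5], image_count=3, target_count=5)
iou_level = infer_level([0.85, 0.92, 0.78, 0.88, 0.91], image_count=3, target_count=5)
```

Passing an explicit level avoids any ambiguity when a dataset happens to have exactly one target per image.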
- calculate_distance(other)¶
Measures the feature-wise distance between two continuous metadata distributions and computes a p-value to evaluate its significance.
Uses the Earth Mover’s Distance and the Kolmogorov-Smirnov two-sample test, featurewise.
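- Parameters:¶
- other : Metadata¶
Metadata instance whose continuous factors are compared against those of this instance.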
- Returns:¶
A mapping with keys corresponding to metadata feature names, and values that are KstestResult objects, as defined by scipy.stats.ks_2samp.
- Return type:¶
MetadataDistanceOutput
See also
Earth Mover’s Distance, Kolmogorov-Smirnov two-sample test
Notes
This function applies only to continuous data.
Examples
>>> output = metadata1.calculate_distance(metadata2)
>>> list(output)
['time', 'altitude']
>>> output["time"]
{'statistic': 1.0, 'location': 0.44354838709677413, 'dist': 2.6999999999999997, 'p_value': 0.0}
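To make the two quantities concrete, here is a minimal pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic and the one-dimensional Earth Mover’s Distance for a single feature. It is not the library's implementation: `ecdf`, `ks_statistic`, and `emd_1d` are hypothetical helper names, the p-value computation is omitted, and the sample values are made up.

```python
from bisect import bisect_right
from typing import Sequence


def ecdf(sorted_sample: Sequence[float], x: float) -> float:
    """Empirical CDF: fraction of sample values less than or equal to x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest absolute gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    pooled = sorted(set(a_sorted) | set(b_sorted))
    return max(abs(ecdf(a_sorted, x) - ecdf(b_sorted, x)) for x in pooled)


def emd_1d(a: Sequence[float], b: Sequence[float]) -> float:
    """1-D Earth Mover's Distance: area between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    pooled = sorted(set(a_sorted) | set(b_sorted))
    total = 0.0
    # Both CDFs are constant between consecutive pooled values.
    for left, right in zip(pooled, pooled[1:]):
        gap = abs(ecdf(a_sorted, left) - ecdf(b_sorted, left))
        total += gap * (right - left)
    return total


# Two non-overlapping samples of a hypothetical continuous factor.
sample_a = [1.0, 2.0, 3.0]
sample_b = [4.0, 5.0, 6.0]

ks = ks_statistic(sample_a, sample_b)  # 1.0: the samples do not overlap
emd = emd_1d(sample_a, sample_b)       # about 3.0: each value shifts by 3
```

A KS statistic of 1.0 means the samples are completely separated, while the Earth Mover’s Distance additionally reflects how far apart they are in the units of the feature.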
- get_image_factors(image_idx)¶
Get all factors for a specific image.
- Parameters:¶
- image_idx : int¶
Index of the image to retrieve factors for
- Returns:¶
Dictionary mapping factor names to their values for the specified image
- Return type:¶
dict[str, Any]
Examples
>>> factors = metadata.get_image_factors(0)
>>> factors["temp"]
72.5
>>> factors["time"]
'morning'
>>> factors["loc"]
'urban'
- get_target_factors(image_idx, target_idx)¶
Get all factors for a specific target within an image.
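- Parameters:¶
- image_idx : int¶
Index of the image that contains the target
- target_idx : int¶
Index of the target within the image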
- Returns:¶
Dictionary mapping factor names to their values for the specified target
- Return type:¶
dict[str, Any]
Examples
>>> factors = metadata.get_target_factors(0, 1)
>>> factors["image_index"]
0
>>> factors["target_index"]
1
>>> factors["class_label"]
1
- has_targets()¶
Check if the source dataset has targets.
- property auto_bin_method : 'uniform_width' | 'uniform_count' | 'clusters'¶
Automatic binning strategy for continuous factors.
- property binned_data : numpy.typing.NDArray[numpy.int64]¶
Factor data with continuous values discretized into bins.
Access fully processed factor data where both categorical and continuous factors are converted to integer bin indices.
- Returns:¶
Array with shape (n_samples, n_factors) containing binned integer data ready for categorical analysis algorithms. Returns empty array when no factors are available. For OD datasets, returns only target-level rows to align with class_labels.
- Return type:¶
NDArray[np.int64]
Notes
This property triggers factor binning analysis on first access. Use this for algorithms requiring purely discrete input data.
For object detection datasets, this returns target-level data only to ensure alignment with class_labels (one row per detection).
- property class_labels : numpy.typing.NDArray[numpy.intp]¶
Target class labels as integer indices.
- Returns:¶
Array of class indices corresponding to dataset targets. For object detection datasets, contains one label per detection.
- Return type:¶
NDArray[np.intp]
Notes
This property triggers dataset structure analysis on first access. Use the index2label property to get human-readable label names.
- property continuous_factor_bins : collections.abc.Mapping[str, int | collections.abc.Sequence[float]]¶
Binning configuration for continuous factors.
- property dataframe : polars.DataFrame¶
Processed DataFrame containing both image-level and target-level rows.
Access the main data structure with both image-level metadata and target-level information (class labels, scores, bounding boxes). Use image_data or target_data properties to filter to specific row types.
- Returns:¶
DataFrame with columns for image_index, target_index, class_label, scores, bounding boxes (when applicable), and all processed metadata factors. Rows where target_index is None contain image-level data. Rows where target_index is an integer contain target/detection-level data.
- Return type:¶
pl.DataFrame
Notes
This property triggers dataset structure analysis on first access. Factor binning occurs automatically when accessing factor-related data.
For Object Detection datasets, the dataframe now contains:
- Image-level rows (target_index=None): One per image with image-level factors
- Target-level rows (target_index=0,1,2…): One per detection with detection data
See also
image_data : Filter to image-level rows only
target_data : Filter to target-level rows only
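The row layout described above, where a null target_index marks an image-level row, can be sketched with plain dictionaries. This is a simplified illustration rather than the library's code: `split_rows` is a hypothetical helper and the rows carry only three of the documented columns.

```python
from typing import Any

Row = dict[str, Any]

# Hypothetical combined table: two images, three detections.
rows: list[Row] = [
    {"image_index": 0, "target_index": None, "class_label": None},
    {"image_index": 0, "target_index": 0, "class_label": 0},
    {"image_index": 0, "target_index": 1, "class_label": 1},
    {"image_index": 1, "target_index": None, "class_label": None},
    {"image_index": 1, "target_index": 0, "class_label": 2},
]


def split_rows(table: list[Row]) -> tuple[list[Row], list[Row]]:
    """Separate image-level rows (target_index is None) from target-level rows."""
    image_rows = [r for r in table if r["target_index"] is None]
    target_rows = [r for r in table if r["target_index"] is not None]
    return image_rows, target_rows


image_rows, target_rows = split_rows(rows)
```

On an actual Polars DataFrame, the equivalent filters would typically be written as `df.filter(pl.col("target_index").is_null())` and `df.filter(pl.col("target_index").is_not_null())`.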
- property dropped_factors : collections.abc.Mapping[str, collections.abc.Sequence[str]]¶
Factors removed during preprocessing with removal reasons.
- Returns:¶
Mapping of dropped factor names to lists of reasons why they were excluded from the final dataset.
- Return type:¶
Mapping[str, Sequence[str]]
Notes
This property triggers dataset structure analysis on first access. Common removal reasons include incompatible data types, excessive missing values, or insufficient variation.
- property exclude : set[str]¶
Factor names excluded from metadata processing.
- property factor_data : numpy.typing.NDArray[Any]¶
Raw factor values before binning or digitization.
Access unprocessed factor data in its original numeric form before any categorical encoding or binning transformations are applied.
- Returns:¶
Array with shape (n_samples, n_factors) containing original factor values. Returns empty array when no factors are available. For OD datasets, returns only target-level rows to align with class_labels.
- Return type:¶
NDArray[Any]
Notes
Use this for algorithms that can work with mixed data types or when you need access to original continuous values. For analysis-ready numeric data, use binned_data.
For object detection datasets, this returns target-level data only to ensure alignment with class_labels (one row per detection).
- property factor_info : collections.abc.Mapping[str, FactorInfo]¶
Type information and processing status for each factor.
- Returns:¶
Mapping of factor names to FactorInfo objects containing data type classification and processing flags (binned, digitized).
- Return type:¶
Mapping[str, FactorInfo]
Notes
This property triggers factor binning analysis on first access. Only includes factors that survived preprocessing and filtering.
- property factor_names : collections.abc.Sequence[str]¶
Names of all processed metadata factors.
- Returns:¶
List of factor names that passed filtering and preprocessing steps. Order matches columns in factor_data and binned_data.
- Return type:¶
Sequence[str]
Notes
This property triggers dataset structure analysis on first access. Factor names respect include/exclude filtering settings.
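How include and exclude might narrow the list of factor names can be sketched as below. This is an assumption-laden illustration: `select_factors` is a hypothetical helper, not part of the library, and it only mirrors the documented rule that the two options are mutually exclusive.

```python
from typing import Optional, Sequence


def select_factors(
    available: Sequence[str],
    include: Optional[Sequence[str]] = None,
    exclude: Optional[Sequence[str]] = None,
) -> list[str]:
    """Filter factor names, preserving their original order."""
    if include is not None and exclude is not None:
        raise ValueError("Specify either include or exclude, not both.")
    if include is not None:
        wanted = set(include)
        return [name for name in available if name in wanted]
    if exclude is not None:
        unwanted = set(exclude)
        return [name for name in available if name not in unwanted]
    return list(available)


# Factor names borrowed from the examples on this page.
all_factors = ["temp", "time", "loc"]
only_temp = select_factors(all_factors, include=["temp"])
without_loc = select_factors(all_factors, exclude=["loc"])
```

Keeping the original order matters because the names are documented to line up with the columns of factor_data and binned_data.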
- property image_data : polars.DataFrame¶
Dataframe containing only image-level rows.
Returns a view of the metadata dataframe filtered to rows where target_index is None, containing one row per image with image-level factors.
- Returns:¶
Dataframe with image-level metadata. For Object Detection datasets, this provides per-image analysis without target-level duplication.
- Return type:¶
pl.DataFrame
Notes
This property triggers dataset structure analysis on first access. Image-level factors are stored only in these rows to avoid duplication.
Examples
>>> metadata.image_data
shape: (3, 8)
┌─────────────┬──────────────┬─────────────┬───────────┬───────────┬──────┬───────────┬──────────┐
│ image_index ┆ target_index ┆ class_label ┆ score     ┆ box       ┆ temp ┆ time      ┆ loc      │
│ ---         ┆ ---          ┆ ---         ┆ ---       ┆ ---       ┆ ---  ┆ ---       ┆ ---      │
│ i64         ┆ i64          ┆ i64         ┆ list[f64] ┆ list[f64] ┆ f64  ┆ str       ┆ str      │
╞═════════════╪══════════════╪═════════════╪═══════════╪═══════════╪══════╪═══════════╪══════════╡
│ 0           ┆ null         ┆ null        ┆ null      ┆ null      ┆ 72.5 ┆ morning   ┆ urban    │
│ 1           ┆ null         ┆ null        ┆ null      ┆ null      ┆ 65.3 ┆ afternoon ┆ rural    │
│ 2           ┆ null         ┆ null        ┆ null      ┆ null      ┆ 68.1 ┆ evening   ┆ suburban │
└─────────────┴──────────────┴─────────────┴───────────┴───────────┴──────┴───────────┴──────────┘
- property include : set[str]¶
Factor names included in metadata processing.
- property item_count : int¶
Total number of items in the dataset.
- property item_indices : numpy.typing.NDArray[numpy.intp]¶
Dataset indices linking targets back to source item.
- Returns:¶
Array mapping each target/detection back to its source item index in the original dataset. Essential for object detection datasets where multiple detections come from a single item.
- Return type:¶
NDArray[np.intp]
Notes
This property triggers dataset structure analysis on first access.
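As an illustration of how such an index array can be used, the sketch below groups class labels by source item. The arrays are plain lists standing in for the item_indices and class_labels arrays, with values that mirror the five target rows shown in the target_data example; `labels_per_item` is a hypothetical helper.

```python
from collections import Counter, defaultdict
from typing import Sequence

# One entry per target: which item it came from and its class label.
item_indices = [0, 0, 1, 1, 1]
class_labels = [0, 1, 1, 2, 0]


def labels_per_item(
    items: Sequence[int], labels: Sequence[int]
) -> dict[int, dict[int, int]]:
    """Count how often each class label occurs within each source item."""
    grouped: defaultdict[int, Counter] = defaultdict(Counter)
    for item, label in zip(items, labels):
        grouped[item][label] += 1
    return {item: dict(counts) for item, counts in grouped.items()}


counts = labels_per_item(item_indices, class_labels)
# counts == {0: {0: 1, 1: 1}, 1: {1: 1, 2: 1, 0: 1}}
```

Because both arrays hold one entry per target, they can be zipped directly without any further alignment.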
- property raw : collections.abc.Sequence[collections.abc.Mapping[str, Any]]¶
Original metadata dictionaries extracted from the dataset.
Access the unprocessed metadata as it was provided in the original dataset before any binning, filtering, or transformation operations.
- Returns:¶
List of metadata dictionaries, one per dataset item, containing the original key-value pairs as provided in the source data
- Return type:¶
Sequence[Mapping[str, Any]]
Notes
This property triggers dataset structure analysis on first access.
- property target_data : polars.DataFrame¶
Dataframe containing only target-level rows.
Returns a view of the metadata dataframe filtered to rows where target_index is not None, containing target/detection-level data.
- Returns:¶
Dataframe with target-level metadata. Each row represents a single target or detection with its associated class, score, and bounding box information.
- Return type:¶
pl.DataFrame
Notes
This property triggers dataset structure analysis on first access. This is similar to the legacy behavior where only target-level rows existed, but now image-level metadata is stored separately in image_data.
Examples
>>> metadata.target_data
shape: (5, 8)
┌─────────────┬──────────────┬─────────────┬──────────────┬─────────────┬──────┬───────────┬───────┐
│ image_index ┆ target_index ┆ class_label ┆ score        ┆ box         ┆ temp ┆ time      ┆ loc   │
│ ---         ┆ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---  ┆ ---       ┆ ---   │
│ i64         ┆ i64          ┆ i64         ┆ list[f64]    ┆ list[f64]   ┆ f64  ┆ str       ┆ str   │
╞═════════════╪══════════════╪═════════════╪══════════════╪═════════════╪══════╪═══════════╪═══════╡
│ 0           ┆ 0            ┆ 0           ┆ [1.0, 0.0,   ┆ [10.0,      ┆ 72.5 ┆ morning   ┆ urban │
│             ┆              ┆             ┆ 0.0]         ┆ 10.0, …     ┆      ┆           ┆       │
│             ┆              ┆             ┆              ┆ 20.0]       ┆      ┆           ┆       │
│ 0           ┆ 1            ┆ 1           ┆ [0.0, 1.0,   ┆ [30.0,      ┆ 72.5 ┆ morning   ┆ urban │
│             ┆              ┆             ┆ 0.0]         ┆ 30.0, …     ┆      ┆           ┆       │
│             ┆              ┆             ┆              ┆ 40.0]       ┆      ┆           ┆       │
│ 1           ┆ 0            ┆ 1           ┆ [0.0, 1.0,   ┆ [5.0, 5.0,  ┆ 65.3 ┆ afternoon ┆ rural │
│             ┆              ┆             ┆ 0.0]         ┆ … 15.0]     ┆      ┆           ┆       │
│ 1           ┆ 1            ┆ 2           ┆ [0.0, 0.0,   ┆ [25.0,      ┆ 65.3 ┆ afternoon ┆ rural │
│             ┆              ┆             ┆ 1.0]         ┆ 25.0, …     ┆      ┆           ┆       │
│             ┆              ┆             ┆              ┆ 35.0]       ┆      ┆           ┆       │
│ 1           ┆ 2            ┆ 0           ┆ [1.0, 0.0,   ┆ [45.0,      ┆ 65.3 ┆ afternoon ┆ rural │
│             ┆              ┆             ┆ 0.0]         ┆ 45.0, …     ┆      ┆           ┆       │
│             ┆              ┆             ┆              ┆ 55.0]       ┆      ┆           ┆       │
└─────────────┴──────────────┴─────────────┴──────────────┴─────────────┴──────┴───────────┴───────┘