dataeval.data.Metadata¶
- class dataeval.data.Metadata(dataset, *, continuous_factor_bins=None, auto_bin_method='uniform_width', exclude=None, include=None)¶
Collection of binned metadata using Polars DataFrames.
Processes dataset metadata by automatically binning continuous factors and digitizing categorical factors for analysis and visualization workflows.
- Parameters:¶
- dataset : ImageClassificationDataset or ObjectDetectionDataset¶
Dataset that provides original targets and metadata for processing.
- continuous_factor_bins : Mapping[str, int | Sequence[float]] | None, default None¶
Mapping from continuous factor names to bin counts or explicit bin edges. When None, uses automatic discretization.
- auto_bin_method : Literal["uniform_width", "uniform_count", "clusters"], default "uniform_width"¶
Binning strategy for continuous factors without explicit bins. Default “uniform_width” provides intuitive equal-width intervals for most distributions.
- exclude : Sequence[str] | None, default None¶
Factor names to exclude from processing. Cannot be combined with the include parameter. When None, processes all available factors.
- include : Sequence[str] | None, default None¶
Factor names to include in processing. Cannot be combined with the exclude parameter. When None, processes all available factors.
- Raises:¶
ValueError – When both exclude and include parameters are specified simultaneously.
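As a rough illustration of what the "uniform_width" and "uniform_count" strategies mean, the sketch below bins a made-up factor with plain NumPy. It is not the library's implementation: the sample values, the bin count, and the helper `to_bins` are assumptions chosen for the example, and the "clusters" strategy is not shown.

```python
import numpy as np

# Hypothetical continuous factor values (not taken from any real dataset).
values = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 100.0])
n_bins = 4

# Equal-width bins: edges are evenly spaced between the minimum and maximum.
width_edges = np.linspace(values.min(), values.max(), n_bins + 1)

# Equal-count bins: edges sit at evenly spaced quantiles of the data.
count_edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))


def to_bins(x: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Map each value to a bin index between 0 and len(edges) - 2."""
    # Only the interior edges are used, so the minimum lands in the first
    # bin and the maximum lands in the last bin.
    return np.digitize(x, edges[1:-1])


width_bins = to_bins(values, width_edges)
count_bins = to_bins(values, count_edges)

print("equal width:", width_bins)
print("equal count:", count_bins)
```

With skewed data like this, equal-width binning puts almost every sample in the first bin, while equal-count binning spreads the samples evenly. That trade-off is the main reason to override the default for long-tailed factors.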
- add_factors(factors)¶
Add additional factors to metadata collection.
Extend the current metadata with new factors, automatically handling length validation and integration with existing data structures.
- Parameters:¶
- Raises:¶
ValueError – When factor lengths do not match dataset dimensions.
- Return type:¶
None
Examples
>>> metadata = Metadata(dataset)
>>> new_factors = {
...     "brightness": [0.2, 0.8, 0.5, 0.3, 0.4, 0.1, 0.3, 0.2],
...     "contrast": [1.1, 0.9, 1.0, 0.8, 1.2, 1.0, 0.7, 1.3],
... }
>>> metadata.add_factors(new_factors)
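The length validation described above can be pictured with a small standalone check. This is only a sketch: `check_factor_lengths` is a hypothetical helper, not part of dataeval, and the expected length of 8 is assumed from the example values rather than taken from a real dataset.

```python
from collections.abc import Mapping, Sequence
from typing import Any


def check_factor_lengths(factors: Mapping[str, Sequence[Any]], expected: int) -> None:
    """Raise ValueError if any factor does not have exactly `expected` entries."""
    bad = {name: len(vals) for name, vals in factors.items() if len(vals) != expected}
    if bad:
        raise ValueError(f"Factor lengths {bad} do not match expected length {expected}")


new_factors = {
    "brightness": [0.2, 0.8, 0.5, 0.3, 0.4, 0.1, 0.3, 0.2],
    "contrast": [1.1, 0.9, 1.0, 0.8, 1.2, 1.0, 0.7, 1.3],
}

# Both factors have 8 entries, so this call returns without raising.
check_factor_lengths(new_factors, expected=8)
```

Running a check like this before calling add_factors gives a clearer error message when one factor list is shorter or longer than the others.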
- calculate_distance(other)¶
Measures the feature-wise distance between two continuous metadata distributions and computes a p-value to evaluate its significance.
Uses the Earth Mover’s Distance and the Kolmogorov-Smirnov two-sample test, featurewise.
- Parameters:¶
- Returns:¶
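A mapping with keys corresponding to metadata feature names, and values that are MetadataDistanceValues objects with the fields statistic, location, dist, and pvalue, as shown in the example below. The statistic and p-value come from the two-sample Kolmogorov-Smirnov test (see scipy.stats.ks_2samp).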
- Return type:¶
See also
Earth Mover's Distance, Kolmogorov-Smirnov test
Note
This function applies only to continuous data.
Examples
>>> output = metadata1.calculate_distance(metadata2)
>>> list(output)
['time', 'altitude']
>>> output["time"]
MetadataDistanceValues(statistic=1.0, location=0.44354838709677413, dist=2.7, pvalue=0.0)
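To make the two underlying measures concrete, the sketch below computes them for a single factor directly with SciPy. The sample arrays are invented, and this is not how calculate_distance is implemented internally: any normalization or per-feature bookkeeping the library applies is not reproduced here.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

# Hypothetical values of one continuous factor in two datasets.
time_a = np.array([1.0, 2.0, 3.0, 4.0])
time_b = np.array([5.0, 6.0, 7.0, 8.0])

# Two-sample Kolmogorov-Smirnov test: largest gap between the empirical CDFs.
ks = ks_2samp(time_a, time_b)

# Earth Mover's (Wasserstein-1) distance between the two samples.
emd = wasserstein_distance(time_a, time_b)

print("KS statistic:", ks.statistic)
print("p-value:", ks.pvalue)
print("Earth Mover's Distance:", emd)
```

Because the two samples do not overlap at all, the KS statistic reaches its maximum of 1, which matches the statistic=1.0 case in the example output above.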
- property auto_bin_method : 'uniform_width' | 'uniform_count' | 'clusters'¶
Automatic binning strategy for continuous factors.
- property binned_data : numpy.typing.NDArray[numpy.int64]¶
Factor data with continuous values discretized into bins.
Access fully processed factor data where both categorical and continuous factors are converted to integer bin indices.
- Returns:¶
Array with shape (n_samples, n_factors) containing binned integer data ready for categorical analysis algorithms. Returns empty array when no factors are available.
- Return type:¶
NDArray[np.int64]
Notes
This property triggers factor binning analysis on first access. Use this for algorithms requiring purely discrete input data.
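A minimal sketch of how mixed factors can end up as one integer array of shape (n_samples, n_factors): a categorical factor is digitized into integer codes and a continuous factor is binned, then the two are stacked column-wise. The factor names, values, and bin edges are made up, and the library's own encoding may differ in detail.

```python
import numpy as np

# Hypothetical categorical and continuous factors for five samples.
weather = np.array(["rain", "sun", "rain", "fog", "sun"])
altitude = np.array([120.0, 450.0, 300.0, 90.0, 800.0])

# Digitize the categorical factor: each distinct value gets an integer code.
categories, weather_codes = np.unique(weather, return_inverse=True)

# Bin the continuous factor with explicit interior edges (an assumption).
altitude_bins = np.digitize(altitude, [200.0, 500.0])

# Stack into a single (n_samples, n_factors) integer array.
binned = np.column_stack([weather_codes, altitude_bins]).astype(np.int64)

print(categories)
print(binned)
```

Indexing `categories` with a column of codes recovers the original string values, which is useful when labelling plots built from the discrete data.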
- property class_labels : numpy.typing.NDArray[numpy.intp]¶
Target class labels as integer indices.
- Returns:¶
Array of class indices corresponding to dataset targets. For object detection datasets, contains one label per detection.
- Return type:¶
NDArray[np.intp]
Notes
This property triggers dataset structure analysis on first access. Use class_names property to get human-readable label names.
- property class_names : collections.abc.Sequence[str]¶
Human-readable names corresponding to class labels.
- Returns:¶
List of class names where index corresponds to class label value. Derived from dataset metadata or auto-generated from label indices.
- Return type:¶
Sequence[str]
Notes
This property triggers dataset structure analysis on first access.
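Since the position in class_names corresponds to the label value, turning integer labels into readable names is a plain index lookup. The arrays below are stand-ins for the class_labels and class_names properties, not values from a real dataset.

```python
import numpy as np

# Stand-ins for metadata.class_names and metadata.class_labels.
class_names = ["cat", "dog", "bird"]
class_labels = np.array([2, 0, 0, 1], dtype=np.intp)

# Look up the name for each label by using the label as a list index.
readable = [class_names[label] for label in class_labels]

print(readable)
```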
- property continuous_factor_bins : collections.abc.Mapping[str, int | collections.abc.Sequence[float]]¶
Binning configuration for continuous factors.
- property dataframe : polars.DataFrame¶
Processed DataFrame containing targets and metadata factors.
Access the main data structure with target information (class labels, scores, bounding boxes) and processed metadata factors ready for analysis.
- Returns:¶
DataFrame with columns for image indices, class labels, scores, bounding boxes (when applicable), and all processed metadata factors.
- Return type:¶
pl.DataFrame
Notes
This property triggers dataset structure analysis on first access. Factor binning occurs automatically when accessing factor-related data.
- property dropped_factors : collections.abc.Mapping[str, collections.abc.Sequence[str]]¶
Factors removed during preprocessing with removal reasons.
- Returns:¶
Dictionary mapping dropped factor names to lists of reasons why they were excluded from the final dataset.
- Return type:¶
Mapping[str, Sequence[str]]
Notes
This property triggers dataset structure analysis on first access. Common removal reasons include incompatible data types, excessive missing values, or insufficient variation.
- property exclude : set[str]¶
Factor names excluded from metadata processing.
- property factor_data : numpy.typing.NDArray[Any]¶
Raw factor values before binning or digitization.
Access unprocessed factor data in its original numeric form before any categorical encoding or binning transformations are applied.
- Returns:¶
Array with shape (n_samples, n_factors) containing original factor values. Returns empty array when no factors are available.
- Return type:¶
NDArray[Any]
Notes
Use this for algorithms that can work with mixed data types or when you need access to original continuous values. For analysis-ready numeric data, use binned_data or numeric_data instead.
- property factor_info : collections.abc.Mapping[str, FactorInfo]¶
Type information and processing status for each factor.
- Returns:¶
Dictionary mapping factor names to FactorInfo objects containing data type classification and processing flags (binned, digitized).
- Return type:¶
Mapping[str, FactorInfo]
Notes
This property triggers factor binning analysis on first access. Only includes factors that survived preprocessing and filtering.
- property factor_names : collections.abc.Sequence[str]¶
Names of all processed metadata factors.
- Returns:¶
List of factor names that passed filtering and preprocessing steps. Order matches columns in factor_data, numeric_data, and binned_data.
- Return type:¶
Sequence[str]
Notes
This property triggers dataset structure analysis on first access. Factor names respect include/exclude filtering settings.
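Because the order of factor_names matches the column order of the data arrays, a factor's column can be found by its position in the name list. The names and array below are hypothetical stand-ins for the factor_names and binned_data properties.

```python
import numpy as np

# Stand-ins for metadata.factor_names and metadata.binned_data.
factor_names = ["time", "altitude"]
binned = np.array([[0, 2], [1, 0], [1, 1]], dtype=np.int64)

# Find the column that holds the "altitude" factor and slice it out.
column = factor_names.index("altitude")
altitude_bins = binned[:, column]

print(altitude_bins)
```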
- property image_count : int¶
Total number of images in the dataset.
- property image_indices : numpy.typing.NDArray[numpy.intp]¶
Dataset indices linking targets back to source images.
- Returns:¶
Array mapping each target or detection back to its source image index in the original dataset. Essential for object detection datasets, where multiple detections can come from a single image.
- Return type:¶
NDArray[np.intp]
Notes
This property triggers dataset structure analysis on first access.
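One common use of this mapping is counting detections per image. The sketch below uses invented stand-ins for the image_indices and image_count properties and relies only on NumPy.

```python
import numpy as np

# Stand-ins for metadata.image_indices and metadata.image_count.
image_indices = np.array([0, 0, 1, 3, 3, 3], dtype=np.intp)
image_count = 4

# Count how many targets point back to each image; minlength keeps a slot
# for images that have no detections at all.
per_image = np.bincount(image_indices, minlength=image_count)

# Images that never appear in image_indices have a count of zero.
images_without_targets = np.flatnonzero(per_image == 0)

print(per_image)
print(images_without_targets)
```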
- property include : set[str]¶
Factor names included in metadata processing.
- property raw : collections.abc.Sequence[collections.abc.Mapping[str, Any]]¶
Original metadata dictionaries extracted from the dataset.
Access the unprocessed metadata as it was provided in the original dataset before any binning, filtering, or transformation operations.
- Returns:¶
List of metadata dictionaries, one per dataset item, containing the original key-value pairs as provided in the source data.
- Return type:¶
Sequence[Mapping[str, Any]]
Notes
This property triggers dataset structure analysis on first access.