dataeval.Metadata¶
- class dataeval.Metadata(dataset=None, *, continuous_factor_bins=None, auto_bin_method='uniform_width', exclude=None, include=None)¶
Collection of binned metadata using Polars DataFrames.
Processes dataset metadata by automatically binning continuous factors and digitizing categorical factors for analysis and visualization workflows.
This class also implements the FeatureExtractor protocol, allowing it to be used directly with drift detectors that accept feature extractors.
- Parameters:¶
- dataset : ImageClassificationDataset, ObjectDetectionDataset, or None, default None¶
Dataset that provides original targets and metadata for processing. When None, creates an unbound instance that can be used as a reusable feature extractor. Use bind() to attach a dataset later, or pass data directly to __call__().
- continuous_factor_bins : Mapping[str, int | Sequence[float]] | None, default None¶
Mapping from continuous factor names to bin counts or explicit bin edges. When None, uses automatic discretization.
- auto_bin_method : Literal["uniform_width", "uniform_count", "clusters"], default "uniform_width"¶
Binning strategy for continuous factors without explicit bins. Default “uniform_width” provides intuitive equal-width intervals for most distributions.
- exclude : Sequence[str] | None, default None¶
Factor names to exclude from processing. Cannot be used with include parameter. When None, processes all available factors.
- include : Sequence[str] | None, default None¶
Factor names to include in processing. Cannot be used with exclude parameter. When None, processes all available factors.
- Raises:¶
ValueError – When both exclude and include parameters are specified simultaneously.
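The two accepted forms of a continuous_factor_bins value (a bin count or explicit bin edges) can be illustrated with a small standalone sketch. This is not dataeval's implementation: resolve_edges and digitize are hypothetical helpers, the brightness values are made up, and only the equal-width idea behind "uniform_width" is shown. Clamping values that fall on or past the outer edges is an assumption of this sketch.

```python
from bisect import bisect_right
from typing import Sequence, Union


def resolve_edges(values: Sequence[float], bins: Union[int, Sequence[float]]) -> list[float]:
    """Return bin edges: equal-width edges for an int, or the given edges as-is."""
    if isinstance(bins, int):
        lo, hi = min(values), max(values)
        width = (hi - lo) / bins
        return [lo + i * width for i in range(bins + 1)]
    return list(bins)


def digitize(values: Sequence[float], edges: Sequence[float]) -> list[int]:
    """Map each value to a 0-based bin index, clamping to the first and last bin."""
    last_bin = len(edges) - 2
    return [min(max(bisect_right(edges, v) - 1, 0), last_bin) for v in values]


brightness = [0.05, 0.10, 0.40, 0.45, 0.90, 1.00]  # made-up factor values

# An int requests that many equal-width bins over the observed range.
edges_from_count = resolve_edges(brightness, 2)
# A sequence is used directly as the bin edges.
edges_explicit = resolve_edges(brightness, [0.0, 0.2, 0.5, 1.0])

print(digitize(brightness, edges_from_count))  # [0, 0, 0, 0, 1, 1]
print(digitize(brightness, edges_explicit))    # [0, 0, 1, 1, 2, 2]
```

Explicit edges are useful when bins should stay comparable across datasets, since count-based edges depend on the range observed in each dataset.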
Example
Using as a feature extractor with drift detection:
>>> from dataeval import Metadata
>>> from dataeval.shift import DriftUnivariate
>>>
>>> # Create reusable extractor (no dataset bound)
>>> extractor = Metadata(continuous_factor_bins={"brightness": 10})
>>>
>>> # Use with drift detector
>>> drift = DriftUnivariate(extractor=extractor).fit(train_dataset)
>>> result = drift.predict(test_dataset)
Using with a bound dataset:
>>> # Create with dataset bound
>>> metadata = Metadata(train_dataset, continuous_factor_bins={"brightness": 10})
>>> train_factors = metadata()  # Extract from bound dataset
>>> test_factors = metadata(test_dataset)  # Extract from new dataset
- add_factors(factors, level='auto')¶
Add additional factors to metadata collection.
Extend the current metadata with new factors at either image or target level. For image-level factors, values are stored only in image-level rows. For target-level factors, values are stored only in target-level rows.
- Parameters:¶
- factors : Mapping[str, _1DArray[Any]]¶
Mapping of factor names to their values. Factor length must match the specified level (image count or target count).
- level : {"image", "target", "auto"}, default="auto"¶
Level at which to store the factors:
  - “image”: Array length must match image count, stored in image-level rows only
  - “target”: Array length must match target count, stored in target-level rows only
  - “auto”: Automatically infers level based on array length
- Raises:¶
ValueError – When factor lengths do not match the specified level’s dimensions.
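As a rough illustration of what length-based inference for level="auto" could look like, the sketch below compares factor lengths with the image and target counts. The helper infer_level is hypothetical and is not part of dataeval. In particular, raising an error when both counts match is an assumption of this sketch; the source does not say how the library resolves that case.

```python
from typing import Literal, Mapping, Sequence


def infer_level(
    factors: Mapping[str, Sequence[object]],
    image_count: int,
    target_count: int,
) -> Literal["image", "target"]:
    """Pick a level by comparing every factor's length with the two counts."""
    lengths = {len(values) for values in factors.values()}
    if len(lengths) != 1:
        raise ValueError(f"Factors have inconsistent lengths: {sorted(lengths)}")
    (length,) = lengths
    if image_count == target_count == length:
        # Assumption of this sketch: refuse to guess when both levels fit.
        raise ValueError("Length matches both counts; pass level explicitly.")
    if length == image_count:
        return "image"
    if length == target_count:
        return "target"
    raise ValueError(
        f"Length {length} matches neither image count {image_count} "
        f"nor target count {target_count}."
    )


print(infer_level({"brightness": [0.1] * 50}, image_count=50, target_count=93))  # image
print(infer_level({"iou": [0.5] * 93}, image_count=50, target_count=93))         # target
```

Passing level="image" or level="target" explicitly avoids any ambiguity when a dataset happens to have exactly one target per image.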
Examples
>>> import numpy as np
>>> metadata = Metadata(dataset)
>>> # Add image-level factors (e.g., from imagestats)
>>> image_factors = {
...     "brightness": np.random.rand(50),  # One per image
...     "contrast": np.random.rand(50),  # One per image
... }
>>> metadata.add_factors(image_factors, level="image")
>>>
>>> # Add target-level factors (e.g., detection confidence scores)
>>> target_factors = {
...     "iou": np.random.rand(93),  # One per target/detection
... }
>>> metadata.add_factors(target_factors, level="target")
- bind(dataset)¶
Bind this instance to a dataset.
Attaches a dataset to this Metadata instance for metadata extraction. Any previously processed metadata is cleared.
- Parameters:¶
- dataset : ImageClassificationDataset or ObjectDetectionDataset¶
Dataset to bind for metadata extraction.
- Returns:¶
Returns self for method chaining.
- Return type:¶
Self
Example
>>> from dataeval import Metadata
>>>
>>> extractor = Metadata(continuous_factor_bins={"brightness": 10})
>>> _ = extractor.bind(train_dataset)
- filter_by_factor_type(factor_type)¶
Filter metadata factors by factor type.
- get_image_factors(image_idx)¶
Get all factors for a specific image.
- Parameters:¶
- image_idx : int¶
Index of the image to retrieve factors for
- Returns:¶
Dictionary mapping factor names to their values for the specified image
- Return type:¶
dict[str, Any]
Examples
>>> metadata = Metadata(dataset)
>>> factors = metadata.get_image_factors(0)
>>> factors["time_of_day"]
'dawn'
>>> factors["weather"]
'rainy'
>>> factors["location"]
'suburban'
- get_target_factors(image_idx, target_idx)¶
Get all factors for a specific target within an image.
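- Parameters:¶
- image_idx : int¶
Index of the image containing the target
- target_idx : int¶
Index of the target within the image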
- Returns:¶
Dictionary mapping factor names to their values for the specified target
- Return type:¶
dict[str, Any]
Examples
>>> metadata = Metadata(dataset)
>>> factors = metadata.get_target_factors(1, 1)
>>> factors["item_index"]
1
>>> factors["target_index"]
1
>>> factors["class_label"]
2
- has_targets()¶
Check if the source dataset has targets.
- new(dataset)¶
Create new Metadata instance with a different dataset.
Generate a new Metadata object using the same configuration but with a different dataset.
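The difference between bind() (attach a dataset to the same instance and return it) and new() (return a separate instance with the same configuration) can be sketched with a toy class. BinCounter below is a hypothetical stand-in written for this illustration; it mirrors only the pattern described here and says nothing about how Metadata is implemented internally.

```python
from typing import Optional, Sequence


class BinCounter:
    """Hypothetical stand-in that mirrors the bind()/new() pattern; not dataeval code."""

    def __init__(self, dataset: Optional[Sequence[float]] = None, *, bins: int = 10) -> None:
        self.bins = bins
        self._dataset = dataset
        self._cache: Optional[list[int]] = None  # placeholder for processed results

    @property
    def is_bound(self) -> bool:
        return self._dataset is not None

    def bind(self, dataset: Sequence[float]) -> "BinCounter":
        """Attach a dataset in place, drop cached results, and return self."""
        self._dataset = dataset
        self._cache = None
        return self

    def new(self, dataset: Sequence[float]) -> "BinCounter":
        """Return a separate instance that reuses this configuration."""
        return BinCounter(dataset, bins=self.bins)


unbound = BinCounter(bins=4)
train = unbound.bind([0.1, 0.2, 0.3])  # same object, now bound
test = train.new([0.7, 0.8])           # different object, same bins

print(train is unbound, test is train, test.bins)  # True False 4
```

In this pattern, new() is the safer choice when the original instance is still needed, because bind() replaces the attached dataset and clears previously processed results.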
- property auto_bin_method : 'uniform_width' | 'uniform_count' | 'clusters'¶
Automatic binning strategy for continuous factors.
- property class_labels : numpy.typing.NDArray[numpy.intp]¶
Target class labels as integer indices.
- Returns:¶
Array of class indices corresponding to dataset targets. For object detection datasets, contains one label per detection.
- Return type:¶
NDArray[np.intp]
Notes
This property triggers dataset structure analysis on first access. Use index2label property to get human-readable label names.
- property continuous_factor_bins : collections.abc.Mapping[str, int | collections.abc.Sequence[float]]¶
Binning configuration for continuous factors.
- property dataframe : polars.DataFrame¶
Processed DataFrame containing both image-level and target-level rows.
Access the main data structure with both image-level metadata and target-level information (class labels, scores, bounding boxes). Use image_data or target_data properties to filter to specific row types.
- Returns:¶
DataFrame with columns for item_index, target_index, class_label, scores, bounding boxes (when applicable), and all processed metadata factors. Rows where target_index is None contain datum-level data. Rows where target_index is an integer contain target/detection-level data.
- Return type:¶
pl.DataFrame
Notes
This property triggers dataset structure analysis on first access. Factor binning occurs automatically when accessing factor-related data.
For Object Detection datasets, the dataframe now contains:
  - Image-level rows (target_index=None): One per image with image-level factors
  - Target-level rows (target_index=0,1,2…): One per detection with detection data
See also
image_data: Filter to image-level rows only
target_data: Filter to target-level rows only
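The mixed row layout described above can be pictured with plain dictionaries. The rows below are made up for illustration (the factor name "weather" and all values are assumptions, as is the choice to leave class_label empty on image-level rows); the real object is a Polars DataFrame, and this sketch only shows the filtering rule that separates the two row types.

```python
# Hypothetical rows for two images; names and values are illustrative only.
rows = [
    {"item_index": 0, "target_index": None, "class_label": None, "weather": "rainy"},
    {"item_index": 0, "target_index": 0, "class_label": 2, "weather": None},
    {"item_index": 0, "target_index": 1, "class_label": 1, "weather": None},
    {"item_index": 1, "target_index": None, "class_label": None, "weather": "clear"},
    {"item_index": 1, "target_index": 0, "class_label": 3, "weather": None},
]

# Image-level rows carry no target index; target-level rows carry an integer one.
image_rows = [row for row in rows if row["target_index"] is None]
target_rows = [row for row in rows if row["target_index"] is not None]

print(len(image_rows), len(target_rows))  # 2 3
```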
- property dropped_factors : collections.abc.Mapping[str, collections.abc.Sequence[str]]¶
Factors removed during preprocessing with removal reasons.
- Returns:¶
Mapping of dropped factor names to lists of reasons why they were excluded from the final dataset.
- Return type:¶
Mapping[str, Sequence[str]]
Notes
This property triggers dataset structure analysis on first access. Common removal reasons include incompatible data types, excessive missing values, or insufficient variation.
- property exclude : set[str]¶
Factor names excluded from metadata processing.
- property factor_data : numpy.typing.NDArray[numpy.int64]¶
Factor data with continuous values discretized into bins.
Access fully processed factor data where both categorical and continuous factors are converted to integer bin indices.
- Returns:¶
Array with shape (n_samples, n_factors) containing binned integer data ready for categorical analysis algorithms. Returns empty array when no factors are available. For OD datasets, returns only target-level rows to align with class_labels.
- Return type:¶
NDArray[np.int64]
Notes
This property triggers factor binning analysis on first access. Use this for algorithms requiring purely discrete input data.
For object detection datasets, this returns target-level data only to ensure alignment with class_labels (one row per detection).
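To make the idea of digitizing categorical factors concrete, the following sketch encodes string values as integer indices and stacks the columns into an (n_samples, n_factors) layout. The helper digitize_categories is hypothetical, and assigning codes in sorted order is an assumption of this sketch; the source does not state which ordering the library uses.

```python
from typing import Sequence


def digitize_categories(values: Sequence[str]) -> tuple[list[int], list[str]]:
    """Encode labels as indices into the sorted list of unique labels."""
    categories = sorted(set(values))
    index_of = {name: i for i, name in enumerate(categories)}
    return [index_of[v] for v in values], categories


weather = ["rainy", "rainy", "clear", "rainy", "clear"]
time_of_day = ["dawn", "day", "dawn", "dusk", "dusk"]

weather_codes, weather_names = digitize_categories(weather)
time_codes, time_names = digitize_categories(time_of_day)

# Stack per-factor columns into rows, giving an (n_samples, n_factors) layout.
factor_rows = [list(row) for row in zip(weather_codes, time_codes)]

print(weather_names, time_names)  # ['clear', 'rainy'] ['dawn', 'day', 'dusk']
print(factor_rows)                # [[1, 0], [1, 1], [0, 0], [1, 2], [0, 2]]
```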
- property factor_info : collections.abc.Mapping[str, FactorInfo]¶
Type information and processing status for each factor.
- Returns:¶
Mapping of factor names to FactorInfo objects containing data type classification and processing flags (binned, digitized).
- Return type:¶
Mapping[str, FactorInfo]
Notes
This property triggers factor binning analysis on first access. Only includes factors that survived preprocessing and filtering.
- property factor_names : collections.abc.Sequence[str]¶
Names of all processed metadata factors.
- Returns:¶
List of factor names that passed filtering and preprocessing steps. Order matches columns in factor_data and raw_data.
- Return type:¶
Sequence[str]
Notes
This property triggers dataset structure analysis on first access. Factor names respect include/exclude filtering settings.
- property image_data : polars.DataFrame¶
Dataframe containing only image-level rows.
Returns a view of the metadata dataframe filtered to rows where target_index is None, containing one row per image with image-level factors.
- Returns:¶
Dataframe with image-level metadata. For Object Detection datasets, this provides per-image analysis without target-level duplication.
- Return type:¶
pl.DataFrame
Notes
This property triggers dataset structure analysis on first access. Image-level factors are stored only in these rows to avoid duplication.
Examples
>>> metadata = Metadata(dataset)
>>> metadata.image_data.select("item_index", "time_of_day", "weather", "location").head(5)
shape: (5, 4)
┌────────────┬─────────────┬─────────┬──────────┐
│ item_index ┆ time_of_day ┆ weather ┆ location │
│ ---        ┆ ---         ┆ ---     ┆ ---      │
│ i64        ┆ str         ┆ str     ┆ str      │
╞════════════╪═════════════╪═════════╪══════════╡
│ 0          ┆ dawn        ┆ rainy   ┆ suburban │
│ 1          ┆ day         ┆ rainy   ┆ rural    │
│ 2          ┆ dawn        ┆ clear   ┆ maritime │
│ 3          ┆ dusk        ┆ rainy   ┆ maritime │
│ 4          ┆ dusk        ┆ clear   ┆ suburban │
└────────────┴─────────────┴─────────┴──────────┘
- property include : set[str]¶
Factor names included in metadata processing.
- property is_bound : bool¶
Whether this instance is bound to a dataset.
- property is_discrete : collections.abc.Sequence[bool]¶
Whether each factor is discrete (True) or continuous (False).
- Returns:¶
Boolean sequence with length equal to factor_names, where True indicates a discrete factor (categorical or discrete numeric) and False indicates a continuous factor.
- Return type:¶
Sequence[bool]
Notes
This property is part of the Metadata interface and aligns with scientific computing conventions where discrete factors are treated differently from continuous ones in statistical analyses.
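Because is_discrete has the same length and order as factor_names, the two sequences can be zipped to split factors by kind. The names and flags below are made up for illustration and do not come from a real dataset.

```python
factor_names = ["weather", "time_of_day", "brightness"]  # hypothetical names
is_discrete = [True, True, False]  # same length and order as factor_names

# Pair each name with its flag, then split into the two groups.
discrete = [name for name, flag in zip(factor_names, is_discrete) if flag]
continuous = [name for name, flag in zip(factor_names, is_discrete) if not flag]

print(discrete)    # ['weather', 'time_of_day']
print(continuous)  # ['brightness']
```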
- property item_count : int¶
Total number of items in the dataset.
- property item_indices : numpy.typing.NDArray[numpy.intp]¶
Dataset indices linking targets back to source item.
- Returns:¶
Array mapping each target/detection back to its source item index in the original dataset. Essential for object detection datasets where multiple detections come from a single item.
- Return type:¶
NDArray[np.intp]
Notes
This property triggers dataset structure analysis on first access.
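A short sketch of how a per-detection index array links detections back to their source items. The values reuse the first five rows of the target_data example in this reference (item_index and class_label columns), written as plain lists rather than NumPy arrays to keep the sketch dependency-free.

```python
from collections import Counter

# One entry per detection, taken from the first five target_data example rows.
item_indices = [0, 1, 1, 1, 2]
class_labels = [0, 3, 2, 1, 1]

# Number of detections that came from each source item.
detections_per_item = Counter(item_indices)

# Group class labels by the item they came from.
labels_by_item: dict[int, list[int]] = {}
for item, label in zip(item_indices, class_labels):
    labels_by_item.setdefault(item, []).append(label)

print(dict(detections_per_item))  # {0: 1, 1: 3, 2: 1}
print(labels_by_item)             # {0: [0], 1: [3, 2, 1], 2: [1]}
```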
- property ndim : int¶
Number of dimensions of the binned metadata array.
- Returns:¶
Number of dimensions.
- Return type:¶
int
- Raises:¶
NotFittedError – If no dataset is bound.
- property raw : collections.abc.Sequence[collections.abc.Mapping[str, Any]]¶
Original metadata dictionaries extracted from the dataset.
Access the unprocessed metadata as it was provided in the original dataset before any binning, filtering, or transformation operations.
- Returns:¶
List of metadata dictionaries, one per dataset item, containing the original key-value pairs as provided in the source data
- Return type:¶
Sequence[Mapping[str, Any]]
Notes
This property triggers dataset structure analysis on first access.
- property raw_data : numpy.typing.NDArray[Any]¶
Raw factor values before binning or digitization.
Access unprocessed factor data in its original numeric form before any categorical encoding or binning transformations are applied.
- Returns:¶
Array with shape (n_samples, n_factors) containing original factor values. Returns empty array when no factors are available. For OD datasets, returns only target-level rows to align with class_labels.
- Return type:¶
NDArray[Any]
Notes
Use this for algorithms that can work with mixed data types or when you need access to original continuous values. For analysis-ready numeric data, use factor_data.
For object detection datasets, this returns target-level data only to ensure alignment with class_labels (one row per detection).
- property shape : tuple[int, Ellipsis]¶
Shape of the binned metadata array.
- Returns:¶
Shape of the binned metadata as (n_samples, n_factors).
- Return type:¶
tuple[int, …]
- Raises:¶
NotFittedError – If no dataset is bound.
- property target_data : polars.DataFrame¶
Dataframe containing only target-level rows.
Returns a view of the metadata dataframe filtered to rows where target_index is not None, containing target/detection-level data.
- Returns:¶
Dataframe with target-level metadata. Each row represents a single target or detection with its associated class, score, and bounding box information.
- Return type:¶
pl.DataFrame
Notes
This property triggers dataset structure analysis on first access. This is similar to the legacy behavior where only target-level rows existed, but now image-level metadata is stored separately in image_data.
Examples
>>> metadata = Metadata(dataset)
>>> metadata.target_data.select("item_index", "target_index", "class_label").head(5)
shape: (5, 3)
┌────────────┬──────────────┬─────────────┐
│ item_index ┆ target_index ┆ class_label │
│ ---        ┆ ---          ┆ ---         │
│ i64        ┆ i64          ┆ i64         │
╞════════════╪══════════════╪═════════════╡
│ 0          ┆ 0            ┆ 0           │
│ 1          ┆ 0            ┆ 3           │
│ 1          ┆ 1            ┆ 2           │
│ 1          ┆ 2            ┆ 1           │
│ 2          ┆ 0            ┆ 1           │
└────────────┴──────────────┴─────────────┘