dataeval.Metadata

class dataeval.Metadata(dataset=None, *, continuous_factor_bins=None, auto_bin_method='uniform_width', exclude=None, include=None)

Collection of binned metadata using Polars DataFrames.

Processes dataset metadata by automatically binning continuous factors and digitizing categorical factors for analysis and visualization workflows.

This class also implements the FeatureExtractor protocol, allowing it to be used directly with drift detectors that accept feature extractors.

Parameters:
dataset : ImageClassificationDataset, ObjectDetectionDataset, or None, default None

Dataset that provides original targets and metadata for processing. When None, creates an unbound instance that can be used as a reusable feature extractor. Use bind() to attach a dataset later, or pass data directly to __call__().

continuous_factor_bins : Mapping[str, int | Sequence[float]] | None, default None

Mapping from continuous factor names to bin counts or explicit bin edges. When None, uses automatic discretization.

auto_bin_method : Literal["uniform_width", "uniform_count", "clusters"], default "uniform_width"

Binning strategy for continuous factors without explicit bins. Default “uniform_width” provides intuitive equal-width intervals for most distributions.

exclude : Sequence[str] | None, default None

Factor names to exclude from processing. Cannot be combined with the include parameter. When None, all available factors are processed.

include : Sequence[str] | None, default None

Factor names to include in processing. Cannot be combined with the exclude parameter. When None, all available factors are processed.

Raises:

ValueError – When both exclude and include parameters are specified simultaneously.
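The mutual exclusion between include and exclude can be pictured with a minimal sketch. The helper `select_factors` below is hypothetical and written only for illustration; it is not part of dataeval, and the exact condition under which the library raises may differ.

```python
def select_factors(available, include=None, exclude=None):
    """Return the factor names that survive include/exclude filtering."""
    # Assumption: supplying both filters at once is treated as an error.
    if include is not None and exclude is not None:
        raise ValueError("include and exclude cannot both be specified")
    if include is not None:
        wanted = set(include)
        return [name for name in available if name in wanted]
    if exclude is not None:
        unwanted = set(exclude)
        return [name for name in available if name not in unwanted]
    # No filter: every available factor is processed.
    return list(available)


available = ["time_of_day", "weather", "location"]
only_weather = select_factors(available, include=["weather"])
without_weather = select_factors(available, exclude=["weather"])
```

The original order of the factor names is preserved in both cases, which keeps downstream column order predictable.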

Example

Using as a feature extractor with drift detection:

>>> from dataeval import Metadata
>>> from dataeval.shift import DriftUnivariate
>>>
>>> # Create reusable extractor (no dataset bound)
>>> extractor = Metadata(continuous_factor_bins={"brightness": 10})
>>>
>>> # Use with drift detector
>>> drift = DriftUnivariate(extractor=extractor).fit(train_dataset)
>>> result = drift.predict(test_dataset)

Using with a bound dataset:

>>> # Create with dataset bound
>>> metadata = Metadata(train_dataset, continuous_factor_bins={"brightness": 10})
>>> train_factors = metadata()  # Extract from bound dataset
>>> test_factors = metadata(test_dataset)  # Extract from new dataset

add_factors(factors, level='auto')

Add additional factors to metadata collection.

Extend the current metadata with new factors at either image or target level. For image-level factors, values are stored only in image-level rows. For target-level factors, values are stored only in target-level rows.

Parameters:
factors : Mapping[str, _1DArray[Any]]

Mapping of factor names to their values. Factor length must match the specified level (image count or target count).

level : {"image", "target", "auto"}, default="auto"

Level at which to store the factors:

- “image”: Array length must match image count, stored in image-level rows only
- “target”: Array length must match target count, stored in target-level rows only
- “auto”: Automatically infers level based on array length

Raises:

ValueError – When factor lengths do not match the specified level’s dimensions.
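One way to picture the "auto" level is as a comparison of the array length against the image and target counts. The function `infer_level` below is a hypothetical sketch, not dataeval's implementation. In particular, checking the image count first (so it wins when both counts are equal) is an assumption.

```python
def infer_level(n_values, image_count, target_count):
    """Guess whether a factor array is image-level or target-level."""
    # Assumption: image-level is checked first, so it wins on a tie.
    if n_values == image_count:
        return "image"
    if n_values == target_count:
        return "target"
    raise ValueError(
        f"factor has {n_values} values; expected {image_count} (images) "
        f"or {target_count} (targets)"
    )


# Same counts as the add_factors example: 50 images, 93 detections.
brightness_level = infer_level(50, image_count=50, target_count=93)
iou_level = infer_level(93, image_count=50, target_count=93)
```

Passing level="image" or level="target" explicitly avoids any ambiguity when the two counts happen to coincide.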

Examples

>>> import numpy as np
>>> from dataeval import Metadata
>>>
>>> metadata = Metadata(dataset)
>>> # Add image-level factors (e.g., from imagestats)
>>> image_factors = {
...     "brightness": np.random.rand(50),  # One per image
...     "contrast": np.random.rand(50),  # One per image
... }
>>> metadata.add_factors(image_factors, level="image")
>>>
>>> # Add target-level factors (e.g., detection confidence scores)
>>> target_factors = {
...     "iou": np.random.rand(93),  # One per target/detection
... }
>>> metadata.add_factors(target_factors, level="target")

bind(dataset)

Bind this instance to a dataset.

Attaches a dataset to this Metadata instance for metadata extraction. Any previously processed metadata is cleared.

Parameters:
dataset : ImageClassificationDataset or ObjectDetectionDataset

Dataset to bind for metadata extraction.

Returns:

Returns self for method chaining.

Return type:

Self
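The bound and unbound usage patterns can be sketched with a toy class. `ExtractorSketch` is a hypothetical stand-in written for illustration only. The `RuntimeError` and the placeholder "features" are assumptions and do not reflect what Metadata actually returns or raises.

```python
class ExtractorSketch:
    """Toy stand-in for an object that can be used bound or unbound."""

    def __init__(self, dataset=None):
        self._dataset = dataset

    @property
    def is_bound(self):
        return self._dataset is not None

    def bind(self, dataset):
        # Replace the dataset and return self so calls can be chained.
        self._dataset = dataset
        return self

    def __call__(self, data=None):
        # Explicit data takes priority; otherwise use the bound dataset.
        source = data if data is not None else self._dataset
        if source is None:
            raise RuntimeError("no dataset bound and no data passed")
        return [len(item) for item in source]  # placeholder "features"


extractor = ExtractorSketch()
bound_before = extractor.is_bound
features = extractor.bind(["ab", "cde"])()  # chained bind, then extract
other = extractor(["z"])  # extract from data passed directly
```

Returning self from bind is what makes the chained call on the second-to-last line possible.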

Example

>>> from dataeval import Metadata
>>>
>>> extractor = Metadata(continuous_factor_bins={"brightness": 10})
>>> _ = extractor.bind(train_dataset)

filter_by_factor(condition)

Filter metadata factors by factor name or FactorInfo.

Parameters:
condition : Callable[[str, FactorInfo], bool]

Predicate called with each factor's name and FactorInfo. Factors for which it returns True are included in the output.

Returns:

Array with shape (n_samples, n_factors) where the factors are filtered by the user provided condition.

Return type:

NDArray[np.float64]
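The predicate-based column filtering can be illustrated with plain Python. `FactorInfoSketch` and `filter_columns` are hypothetical names, and the `factor_type` field is an assumption about what a FactorInfo-like object might expose. The sketch only shows the idea of keeping the columns whose (name, info) pair satisfies the condition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactorInfoSketch:
    """Hypothetical stand-in for FactorInfo; the field name is an assumption."""

    factor_type: str  # "categorical", "discrete" or "continuous"


def filter_columns(rows, factor_info, condition):
    """Keep only the columns whose (name, info) pair satisfies condition."""
    keep = [
        col
        for col, (name, info) in enumerate(factor_info.items())
        if condition(name, info)
    ]
    return [[row[col] for col in keep] for row in rows]


factor_info = {
    "weather": FactorInfoSketch("categorical"),
    "brightness": FactorInfoSketch("continuous"),
    "contrast": FactorInfoSketch("continuous"),
}
rows = [[0, 0.2, 0.7], [1, 0.9, 0.4]]  # shape (n_samples=2, n_factors=3)

continuous_only = filter_columns(
    rows, factor_info, lambda name, info: info.factor_type == "continuous"
)
by_name = filter_columns(rows, factor_info, lambda name, info: name.startswith("b"))
```

Filtering by factor type is then just the special case where the condition compares the type and ignores the name.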

filter_by_factor_type(factor_type)

Filter metadata factors by factor type.

Parameters:
factor_type : "categorical", "discrete" or "continuous"

The factor type to include in the output.

Returns:

Array with shape (n_samples, n_factors) where the factors are filtered by the user provided factor type.

Return type:

NDArray[np.float64]

get_image_factors(image_idx)

Get all factors for a specific image.

Parameters:
image_idx : int

Index of the image to retrieve factors for

Returns:

Dictionary mapping factor names to their values for the specified image

Return type:

dict[str, Any]

Examples

>>> metadata = Metadata(dataset)
>>> factors = metadata.get_image_factors(0)
>>> factors["time_of_day"]
'dawn'
>>> factors["weather"]
'rainy'
>>> factors["location"]
'suburban'

get_target_factors(image_idx, target_idx)

Get all factors for a specific target within an image.

Parameters:
image_idx : int

Index of the image containing the target

target_idx : int

Index of the target within the image (0-indexed per image)

Returns:

Dictionary mapping factor names to their values for the specified target

Return type:

dict[str, Any]

Examples

>>> metadata = Metadata(dataset)
>>> factors = metadata.get_target_factors(1, 1)
>>> factors["item_index"]
1
>>> factors["target_index"]
1
>>> factors["class_label"]
2

has_targets()

Check if the source dataset has targets.

Returns:

True if dataset contains targets, False for classification datasets.

Return type:

bool

new(dataset)

Create new Metadata instance with a different dataset.

Generate a new Metadata object using the same configuration but with a different dataset.

Parameters:
dataset : ImageClassificationDataset or ObjectDetectionDataset

Dataset that provides metadata for the new Metadata instance.

Returns:

New Metadata object configured identically to the current instance.

Return type:

Metadata

property auto_bin_method : 'uniform_width' | 'uniform_count' | 'clusters'

Automatic binning strategy for continuous factors.

Returns:

Current method used for automatic discretization of continuous factors that lack explicit bin specifications.

Return type:

{“uniform_width”, “uniform_count”, “clusters”}
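The difference between equal-width and equal-count binning is easiest to see on skewed data. The helpers below are a minimal standard-library sketch, not dataeval's implementation. Their names, the use of interior edges only, and the tie-handling are assumptions. The "clusters" method is not covered.

```python
from bisect import bisect_right


def uniform_width_edges(values, n_bins):
    """Interior edges that split [min, max] into n_bins equal-width intervals."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bins
    return [lo + step * i for i in range(1, n_bins)]


def uniform_count_edges(values, n_bins):
    """Interior edges placed so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]


def digitize(values, edges):
    """Map each value to the index of the bin it falls in."""
    return [bisect_right(edges, v) for v in values]


# One outlier (10.0) stretches the range of an otherwise compact factor.
brightness = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 10.0]
width_bins = digitize(brightness, uniform_width_edges(brightness, 2))
count_bins = digitize(brightness, uniform_count_edges(brightness, 2))
```

With two equal-width bins the outlier ends up alone in the upper bin, while two equal-count bins split the values four and four. This is the usual trade-off between interpretable intervals and balanced bin populations.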

property class_labels : numpy.typing.NDArray[numpy.intp]

Target class labels as integer indices.

Returns:

Array of class indices corresponding to dataset targets. For object detection datasets, contains one label per detection.

Return type:

NDArray[np.intp]

Notes

This property triggers dataset structure analysis on first access. Use index2label property to get human-readable label names.

property continuous_factor_bins : collections.abc.Mapping[str, int | collections.abc.Sequence[float]]

Binning configuration for continuous factors.

Returns:

Mapping of factor names to either the number of bins (int) or explicit bin edges (sequence of floats).

Return type:

Mapping[str, int | Sequence[float]]

property dataframe : polars.DataFrame

Processed DataFrame containing both image-level and target-level rows.

Access the main data structure with both image-level metadata and target-level information (class labels, scores, bounding boxes). Use image_data or target_data properties to filter to specific row types.

Returns:

DataFrame with columns for item_index, target_index, class_label, scores, bounding boxes (when applicable), and all processed metadata factors. Rows where target_index is None contain image-level data. Rows where target_index is an integer contain target/detection-level data.

Return type:

pl.DataFrame

Notes

This property triggers dataset structure analysis on first access. Factor binning occurs automatically when accessing factor-related data.

For Object Detection datasets, the dataframe contains:

- Image-level rows (target_index=None): One per image with image-level factors
- Target-level rows (target_index=0,1,2…): One per detection with detection data

See also

image_data

Filter to image-level rows only

target_data

Filter to target-level rows only
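The mixed row layout can be sketched with a list of dictionaries in place of a Polars DataFrame. The rows below are made up for illustration, and `find_target` is a hypothetical helper. Only the column names item_index, target_index and class_label come from this page.

```python
# Made-up rows for two images; target_index=None marks an image-level row.
rows = [
    {"item_index": 0, "target_index": None, "weather": "rainy"},
    {"item_index": 0, "target_index": 0, "class_label": 0},
    {"item_index": 1, "target_index": None, "weather": "clear"},
    {"item_index": 1, "target_index": 0, "class_label": 3},
    {"item_index": 1, "target_index": 1, "class_label": 2},
]

# The two filtered views described above.
image_rows = [r for r in rows if r["target_index"] is None]
target_rows = [r for r in rows if r["target_index"] is not None]


def find_target(rows, image_idx, target_idx):
    """Return one target row, addressed by image and per-image target index."""
    for row in rows:
        if row["item_index"] == image_idx and row["target_index"] == target_idx:
            return row
    raise IndexError(f"no target {target_idx} in image {image_idx}")


second_target = find_target(rows, image_idx=1, target_idx=1)
```

Because target_index restarts at 0 for every image, a target is only identified by the pair (item_index, target_index), never by target_index alone.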

property dropped_factors : collections.abc.Mapping[str, collections.abc.Sequence[str]]

Factors removed during preprocessing with removal reasons.

Returns:

Mapping of dropped factor names to lists of reasons why they were excluded from the final dataset.

Return type:

Mapping[str, Sequence[str]]

Notes

This property triggers dataset structure analysis on first access. Common removal reasons include incompatible data types, excessive missing values, or insufficient variation.

property exclude : set[str]

Factor names excluded from metadata processing.

Returns:

Set of factor names that are filtered out during processing. Empty set when no exclusions are active.

Return type:

set[str]

property factor_data : numpy.typing.NDArray[numpy.int64]

Factor data with continuous values discretized into bins.

Access fully processed factor data where both categorical and continuous factors are converted to integer bin indices.

Returns:

Array with shape (n_samples, n_factors) containing binned integer data ready for categorical analysis algorithms. Returns empty array when no factors are available. For OD datasets, returns only target-level rows to align with class_labels.

Return type:

NDArray[np.int64]

Notes

This property triggers factor binning analysis on first access. Use this for algorithms requiring purely discrete input data.

For object detection datasets, this returns target-level data only to ensure alignment with class_labels (one row per detection).

property factor_info : collections.abc.Mapping[str, FactorInfo]

Type information and processing status for each factor.

Returns:

Mapping of factor names to FactorInfo objects containing data type classification and processing flags (binned, digitized).

Return type:

Mapping[str, FactorInfo]

Notes

This property triggers factor binning analysis on first access. Only includes factors that survived preprocessing and filtering.

property factor_names : collections.abc.Sequence[str]

Names of all processed metadata factors.

Returns:

List of factor names that passed filtering and preprocessing steps. Order matches the columns of factor_data and raw_data.

Return type:

Sequence[str]

Notes

This property triggers dataset structure analysis on first access. Factor names respect include/exclude filtering settings.

property image_data : polars.DataFrame

Dataframe containing only image-level rows.

Returns a view of the metadata dataframe filtered to rows where target_index is None, containing one row per image with image-level factors.

Returns:

Dataframe with image-level metadata. For Object Detection datasets, this provides per-image analysis without target-level duplication.

Return type:

pl.DataFrame

Notes

This property triggers dataset structure analysis on first access. Image-level factors are stored only in these rows to avoid duplication.

Examples

>>> metadata = Metadata(dataset)
>>> metadata.image_data.select("item_index", "time_of_day", "weather", "location").head(5)
shape: (5, 4)
┌────────────┬─────────────┬─────────┬──────────┐
│ item_index ┆ time_of_day ┆ weather ┆ location │
│ ---        ┆ ---         ┆ ---     ┆ ---      │
│ i64        ┆ str         ┆ str     ┆ str      │
╞════════════╪═════════════╪═════════╪══════════╡
│ 0          ┆ dawn        ┆ rainy   ┆ suburban │
│ 1          ┆ day         ┆ rainy   ┆ rural    │
│ 2          ┆ dawn        ┆ clear   ┆ maritime │
│ 3          ┆ dusk        ┆ rainy   ┆ maritime │
│ 4          ┆ dusk        ┆ clear   ┆ suburban │
└────────────┴─────────────┴─────────┴──────────┘

property include : set[str]

Factor names included in metadata processing.

Returns:

Set of factor names that are processed during analysis. Empty set when no inclusion filter is active.

Return type:

set[str]

property is_bound : bool

Whether this instance is bound to a dataset.

Returns:

True if a dataset is bound, False otherwise.

Return type:

bool

property is_discrete : collections.abc.Sequence[bool]

Whether each factor is discrete (True) or continuous (False).

Returns:

Boolean sequence with length equal to factor_names, where True indicates a discrete factor (categorical or discrete numeric) and False indicates a continuous factor.

Return type:

Sequence[bool]

Notes

This property is part of the Metadata interface and follows the convention in statistical analysis of treating discrete factors differently from continuous ones.
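Since the boolean sequence is aligned with factor_names, it can be used as a mask to split factors into two groups. The values below are made up for illustration, and the snippet is a plain-Python sketch of that alignment rather than a call into dataeval.

```python
# Hypothetical values; in practice both would come from a Metadata instance.
factor_names = ["weather", "brightness", "location"]
is_discrete = [True, False, True]

# zip pairs each name with its flag because the two sequences share an order.
discrete = [name for name, flag in zip(factor_names, is_discrete) if flag]
continuous = [name for name, flag in zip(factor_names, is_discrete) if not flag]
```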

property item_count : int

Total number of items in the dataset.

Returns:

Count of unique items in the source dataset, regardless of how many targets/detections each item contains.

Return type:

int

property item_indices : numpy.typing.NDArray[numpy.intp]

Dataset indices linking targets back to their source items.

Returns:

Array mapping each target/detection back to its source item index in the original dataset. Essential for object detection datasets where multiple detections come from a single item.

Return type:

NDArray[np.intp]

Notes

This property triggers dataset structure analysis on first access.
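A short sketch of how such an index array might be used to regroup per-target values by source item. The arrays are plain lists with made-up values that mirror the target_data example on this page. They are not produced by dataeval here.

```python
from collections import defaultdict

item_indices = [0, 1, 1, 1, 2]  # one entry per target: which item it came from
class_labels = [0, 3, 2, 1, 1]  # aligned with item_indices, one label per target

# Collect the labels of all targets that belong to the same item.
grouped = defaultdict(list)
for item, label in zip(item_indices, class_labels):
    grouped[item].append(label)
labels_per_item = dict(grouped)

# Number of targets found in each item.
targets_per_item = {item: len(labels) for item, labels in labels_per_item.items()}
```

The same grouping pattern applies to any other per-target array that shares this ordering.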

property ndim : int

Number of dimensions of the binned metadata array.

Returns:

Number of dimensions.

Return type:

int

Raises:

NotFittedError – If no dataset is bound.

property raw : collections.abc.Sequence[collections.abc.Mapping[str, Any]]

Original metadata dictionaries extracted from the dataset.

Access the unprocessed metadata as it was provided in the original dataset before any binning, filtering, or transformation operations.

Returns:

List of metadata dictionaries, one per dataset item, containing the original key-value pairs as provided in the source data

Return type:

Sequence[Mapping[str, Any]]

Notes

This property triggers dataset structure analysis on first access.

property raw_data : numpy.typing.NDArray[Any]

Raw factor values before binning or digitization.

Access unprocessed factor data in its original numeric form before any categorical encoding or binning transformations are applied.

Returns:

Array with shape (n_samples, n_factors) containing original factor values. Returns empty array when no factors are available. For OD datasets, returns only target-level rows to align with class_labels.

Return type:

NDArray[Any]

Notes

Use this for algorithms that can work with mixed data types or when you need access to original continuous values. For analysis-ready numeric data, use factor_data.

For object detection datasets, this returns target-level data only to ensure alignment with class_labels (one row per detection).

property shape : tuple[int, ...]

Shape of the binned metadata array.

Returns:

Shape of the binned metadata as (n_samples, n_factors).

Return type:

tuple[int, ...]

Raises:

NotFittedError – If no dataset is bound.

property target_data : polars.DataFrame

Dataframe containing only target-level rows.

Returns a view of the metadata dataframe filtered to rows where target_index is not None, containing target/detection-level data.

Returns:

Dataframe with target-level metadata. Each row represents a single target or detection with its associated class, score, and bounding box information.

Return type:

pl.DataFrame

Notes

This property triggers dataset structure analysis on first access. This is similar to the legacy behavior where only target-level rows existed, but now image-level metadata is stored separately in image_data.

Examples

>>> metadata = Metadata(dataset)
>>> metadata.target_data.select("item_index", "target_index", "class_label").head(5)
shape: (5, 3)
┌────────────┬──────────────┬─────────────┐
│ item_index ┆ target_index ┆ class_label │
│ ---        ┆ ---          ┆ ---         │
│ i64        ┆ i64          ┆ i64         │
╞════════════╪══════════════╪═════════════╡
│ 0          ┆ 0            ┆ 0           │
│ 1          ┆ 0            ┆ 3           │
│ 1          ┆ 1            ┆ 2           │
│ 1          ┆ 2            ┆ 1           │
│ 2          ┆ 0            ┆ 1           │
└────────────┴──────────────┴─────────────┘

property target_factors_only : bool

Whether the factors are restricted to target-level factors only.

Returns:

True if image-level factors are excluded, False if included (default).

Return type:

bool