How to build a MetadataLike object from a DataFrame
Problem statement
DataEval’s bias evaluators (Balance, Diversity,
Parity) do not need your images - they only need your
factors and labels. They accept any object that satisfies the
MetadataLike protocol, which is just four properties:
- factor_names - the names of the metadata factors
- factor_data - a (n_samples, n_factors) integer array of discretized factor values (continuous factors must be pre-binned to integers)
- class_labels - one label per sample
- is_discrete - a flag per factor (discrete vs continuous), same length as factor_names
and two optional properties: index2label (class-name mapping) and
item_indices (which source image each label came from).
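To make the shape of the protocol concrete, here is a minimal sketch of a structurally similar protocol written with typing.Protocol. The name MetadataLikeSketch, the TinyMetadata class, and the type annotations are assumptions for illustration only; the authoritative definition is dataeval.protocols.MetadataLike, which may differ in detail.

```python
from typing import Protocol, Sequence, runtime_checkable

import numpy as np


@runtime_checkable
class MetadataLikeSketch(Protocol):
    """Hypothetical stand-in listing only the four required properties."""

    @property
    def factor_names(self) -> Sequence[str]: ...

    @property
    def factor_data(self) -> np.ndarray: ...

    @property
    def class_labels(self) -> np.ndarray: ...

    @property
    def is_discrete(self) -> Sequence[bool]: ...


class TinyMetadata:
    """Smallest object exposing the four required attributes."""

    factor_names = ["weather"]
    factor_data = np.array([[0], [1], [0]])  # (n_samples=3, n_factors=1)
    class_labels = np.array([2, 0, 1])       # one label per sample
    is_discrete = [True]                     # one flag per factor


matches = isinstance(TinyMetadata(), MetadataLikeSketch)
rejects = isinstance(object(), MetadataLikeSketch)
print(matches, rejects)  # True False
```

A runtime isinstance check against a protocol only confirms that the attributes exist; it does not verify shapes or dtypes.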
When your factors are already tabular, you can implement MetadataLike directly
from a DataFrame and run bias analysis without loading a single image. This
is lighter than wrapping a full dataset (see the related how-tos) and is all you
need when you only care about labels and metadata.
You will build one reusable adapter and apply it to an image classification catalog and an object detection catalog.
When to use
Use this when your labels and metadata factors live in a table and you want to
run bias/diversity/parity analysis on them - without decoding images or building
a full AnnotatedDataset.
What you will need
- A table of factors and labels (here, pandas DataFrames)
- A Python environment with the following packages installed:
  - dataeval
  - pandas
Getting started
First, import the libraries needed for the example.
import numpy as np
import pandas as pd
from dataeval.bias import Balance
from dataeval.protocols import MetadataLike
Write the adapter
The adapter discretizes the factor columns - the one preprocessing step
MetadataLike requires - and exposes the four properties of the protocol plus
the two optional ones:
- Discrete factors (categorical or already-integer) are integer-encoded with pandas.factorize.
- Continuous factors are binned into integer codes with pandas.cut; you choose the number of bins per factor, which controls how finely the range is grouped.
Only the columns you pass become factors - any other column in the DataFrame is
ignored. item_indices maps each row back to a source image: for a classification
table (one row per image) the mapping is 1:1; for an object detection table (one
row per box) several rows share an image.
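The two discretization steps can be tried in isolation before reading the adapter. This is a small sketch on made-up values (the weather and altitude series below are illustrative, not taken from the catalogs in this guide):

```python
import numpy as np
import pandas as pd

weather = pd.Series(["clear", "rainy", "clear", "foggy"])
altitude = pd.Series([50.0, 60.0, 90.0, 110.0, 149.0, 150.0])

# factorize assigns integer codes in order of first appearance
weather_codes, weather_values = pd.factorize(weather)

# cut with an integer `bins` splits the observed range into equal-width bins;
# labels=False returns the integer bin index instead of interval labels
altitude_codes = np.asarray(pd.cut(altitude, bins=4, labels=False), dtype=np.int64)

print(weather_codes.tolist())   # [0, 1, 0, 2]
print(list(weather_values))     # ['clear', 'rainy', 'foggy']
print(altitude_codes.tolist())  # [0, 0, 1, 2, 3, 3]
```

Missing values are not handled here: pandas.factorize codes them as -1 and pandas.cut leaves them as NaN, which cannot be cast to an integer array, so clean or impute them before discretizing.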
class DataFrameMetadata:
    """A lightweight object implementing the MetadataLike protocol.

    Builds the arrays the bias evaluators need directly from DataFrame columns:
    discrete factors are integer-encoded and continuous factors are binned, so
    ``factor_data`` is fully discretized.
    """

    def __init__(
        self,
        dataframe: pd.DataFrame,
        label_col: str,
        discrete_factors: list[str] | None = None,
        continuous_factors: dict[str, int] | None = None,
        index2label: dict[int, str] | None = None,
        item_index_col: str | None = None,
    ) -> None:
        discrete_factors = discrete_factors or []
        continuous_factors = continuous_factors or {}
        columns: list[np.ndarray] = []
        self._factor_names: list[str] = []
        self._is_discrete: list[bool] = []

        # Discrete factors: map each distinct value to an integer code
        for name in discrete_factors:
            columns.append(pd.factorize(dataframe[name])[0].astype(np.int64))
            self._factor_names.append(name)
            self._is_discrete.append(True)

        # Continuous factors: bin into the requested number of integer bins
        for name, n_bins in continuous_factors.items():
            binned = pd.cut(dataframe[name], bins=n_bins, labels=False)
            columns.append(np.asarray(binned, dtype=np.int64))
            self._factor_names.append(name)
            self._is_discrete.append(False)

        self._factor_data = (
            np.stack(columns, axis=1)
            if columns
            else np.empty((len(dataframe), 0), dtype=np.int64)
        )
        self._class_labels = dataframe[label_col].to_numpy(dtype=np.intp)
        self._index2label = index2label or {}

        # Each label's source image. Default (no column) is the 1:1 case where
        # every row is its own image - correct for classification catalogs.
        self._item_indices = (
            dataframe[item_index_col].to_numpy(dtype=np.intp)
            if item_index_col is not None
            else np.arange(len(dataframe), dtype=np.intp)
        )

    @property
    def factor_names(self) -> list[str]:
        return self._factor_names

    @property
    def factor_data(self) -> np.ndarray:
        return self._factor_data

    @property
    def is_discrete(self) -> list[bool]:
        return self._is_discrete

    @property
    def class_labels(self) -> np.ndarray:
        return self._class_labels

    @property
    def index2label(self) -> dict[int, str]:
        return self._index2label

    @property
    def item_indices(self) -> np.ndarray:
        return self._item_indices
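Since the protocol is structural, nothing stops an adapter from returning arrays whose shapes disagree. The following sketch shows one way to check the invariants listed in the problem statement. The helper name check_metadata_shapes is hypothetical and not part of DataEval; it works on any object exposing the four required attributes.

```python
from types import SimpleNamespace

import numpy as np


def check_metadata_shapes(meta) -> None:
    """Hypothetical helper: raise ValueError if the arrays disagree in shape."""
    factor_data = np.asarray(meta.factor_data)
    class_labels = np.asarray(meta.class_labels)
    n_factors = len(meta.factor_names)

    if factor_data.ndim != 2 or factor_data.shape[1] != n_factors:
        raise ValueError("factor_data must be (n_samples, n_factors)")
    if factor_data.shape[0] != class_labels.shape[0]:
        raise ValueError("factor_data and class_labels need one row per sample")
    if len(meta.is_discrete) != n_factors:
        raise ValueError("is_discrete must have one flag per factor")
    if not np.issubdtype(factor_data.dtype, np.integer):
        raise ValueError("factor_data must hold integer codes")


# Stand-in object with consistent shapes (illustrative values only)
good = SimpleNamespace(
    factor_names=["weather", "altitude_m"],
    factor_data=np.array([[0, 1], [1, 3], [2, 0]], dtype=np.int64),
    class_labels=np.array([0, 2, 1]),
    is_discrete=[True, False],
)
check_metadata_shapes(good)  # passes silently

# Same object, but with too few is_discrete flags
bad = SimpleNamespace(**{**vars(good), "is_discrete": [True]})
try:
    check_metadata_shapes(bad)
except ValueError as err:
    print(err)  # is_discrete must have one flag per factor
```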
From an image classification DataFrame
A classification catalog has one row per image: the label plus any metadata
factors. Here weather is categorical (discrete) and altitude_m is continuous.
rng = np.random.default_rng(0)
ic_index2label = {0: "cat", 1: "dog", 2: "bird"}
weather_options = ["clear", "rainy", "foggy"]
ic_catalog = pd.DataFrame({
    "label": rng.integers(0, 3, size=90),
    "weather": rng.choice(weather_options, size=90),
    "altitude_m": rng.uniform(50, 150, size=90),
})
ic_catalog.head()
| | label | weather | altitude_m |
|---|---|---|---|
| 0 | 2 | clear | 142.742393 |
| 1 | 1 | foggy | 146.792619 |
| 2 | 1 | clear | 51.470630 |
| 3 | 0 | clear | 136.364009 |
| 4 | 0 | foggy | 148.119504 |
Build the adapter, binning the continuous altitude_m factor into 4 bins. No
item_index_col is needed because each row is already its own image.
ic_meta = DataFrameMetadata(
    ic_catalog,
    label_col="label",
    discrete_factors=["weather"],
    continuous_factors={"altitude_m": 4},
    index2label=ic_index2label,
)
print(f"Is a MetadataLike: {isinstance(ic_meta, MetadataLike)}")
print(f"factor_names: {ic_meta.factor_names}")
print(f"is_discrete: {ic_meta.is_discrete}")
print(f"factor_data shape: {ic_meta.factor_data.shape}")
print(f"class_labels shape: {ic_meta.class_labels.shape}")
Is a MetadataLike: True
factor_names: ['weather', 'altitude_m']
is_discrete: [True, False]
factor_data shape: (90, 2)
class_labels shape: (90,)
Run a bias evaluator on it. Balance measures the normalized mutual
information between each factor and the class labels.
ic_balance = Balance().evaluate(ic_meta)
print(ic_balance.balance)
shape: (3, 2)
┌─────────────┬──────────┐
│ factor_name ┆ mi_value │
│ --- ┆ --- │
│ cat ┆ f64 │
╞═════════════╪══════════╡
│ class_label ┆ 1.0 │
│ altitude_m ┆ 0.0 │
│ weather ┆ 0.006788 │
└─────────────┴──────────┘
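For intuition about what a normalized mutual information score means, here is a rough NumPy sketch on two discrete arrays. It is not DataEval's implementation: the function names are hypothetical, and dividing by the mean of the two entropies is an assumed normalization that may differ from the one Balance uses.

```python
import numpy as np


def entropy(codes: np.ndarray) -> float:
    """Shannon entropy (in nats) of a 1-D array of discrete codes."""
    _, counts = np.unique(codes, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())


def normalized_mi(x, y) -> float:
    """MI(x, y) divided by the mean of H(x) and H(y) (assumed normalization)."""
    _, xi = np.unique(np.ravel(x), return_inverse=True)
    _, yi = np.unique(np.ravel(y), return_inverse=True)
    # Encode each (x, y) pair as a single code to get the joint entropy
    joint = xi * (yi.max() + 1) + yi
    h_x, h_y, h_xy = entropy(xi), entropy(yi), entropy(joint)
    mean_entropy = 0.5 * (h_x + h_y)
    if mean_entropy == 0.0:
        return 0.0  # both variables are constant
    return (h_x + h_y - h_xy) / mean_entropy


labels = np.array([0, 0, 1, 1, 2, 2])
unrelated = np.array([0, 1, 0, 1, 0, 1])  # every value appears once per label

nmi_same = normalized_mi(labels, labels.copy())
nmi_unrelated = normalized_mi(labels, unrelated)
print(round(nmi_same, 6), round(abs(nmi_unrelated), 6))  # 1.0 0.0
```

A variable compared with itself scores 1, consistent with the class_label row in the output above, while a factor that is distributed identically within every class scores 0.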
From an object detection DataFrame
An object detection catalog in long format has one row per box. Each row
carries its image’s factors, so factor_data and class_labels are naturally
per-detection. The extra step is item_indices, which records the image each
box belongs to.
od_index2label = {0: "person", 1: "car", 2: "bicycle"}
rows = []
for image_index in range(40):
    weather = weather_options[image_index % 3]
    altitude = float(50 + image_index)
    # A variable number of boxes per image - hence one row per box
    for _ in range(int(rng.integers(1, 5))):
        rows.append({
            "image_index": image_index,  # which image this box came from
            "label": int(rng.integers(0, 3)),
            "weather": weather,  # image-level factor, repeated per box
            "altitude_m": altitude,  # image-level factor, repeated per box
        })
od_catalog = pd.DataFrame(rows)
od_catalog.head()
| | image_index | label | weather | altitude_m |
|---|---|---|---|---|
| 0 | 0 | 2 | clear | 50.0 |
| 1 | 0 | 2 | clear | 50.0 |
| 2 | 0 | 0 | clear | 50.0 |
| 3 | 1 | 0 | rainy | 51.0 |
| 4 | 1 | 0 | rainy | 51.0 |
Build the adapter, this time passing item_index_col so each detection maps
back to its source image.
od_meta = DataFrameMetadata(
    od_catalog,
    label_col="label",
    discrete_factors=["weather"],
    continuous_factors={"altitude_m": 4},
    index2label=od_index2label,
    item_index_col="image_index",
)
print(f"Is a MetadataLike: {isinstance(od_meta, MetadataLike)}")
print(f"Total detections: {len(od_meta.class_labels)}")
print(f"Unique source images: {len(np.unique(od_meta.item_indices))}")
print(f"factor_data shape: {od_meta.factor_data.shape}")
Is a MetadataLike: True
Total detections: 104
Unique source images: 40
factor_data shape: (104, 2)
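In a long-format table, image-level factors are repeated on every box row, so it can be worth confirming that they really are constant within each image before trusting the analysis. The sketch below uses a small made-up catalog (not the od_catalog built above) and plain pandas group-by operations:

```python
import pandas as pd

# Hypothetical long-format catalog: one row per box
catalog = pd.DataFrame({
    "image_index": [0, 0, 0, 1, 1, 2],
    "label": [2, 2, 0, 0, 1, 1],
    "weather": ["clear", "clear", "clear", "rainy", "rainy", "foggy"],
})

# Number of boxes contributed by each image
boxes_per_image = catalog.groupby("image_index").size()

# An image-level factor should take exactly one value within each image
weather_is_consistent = bool(
    catalog.groupby("image_index")["weather"].nunique().eq(1).all()
)

print(boxes_per_image.to_dict())  # {0: 3, 1: 2, 2: 1}
print(weather_is_consistent)      # True
```

Images with many boxes contribute more rows, so they carry more weight in any per-detection statistic.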
The same evaluator works unchanged - it operates on the per-detection factors and labels.
od_balance = Balance().evaluate(od_meta)
print(od_balance.balance)
shape: (3, 2)
┌─────────────┬──────────┐
│ factor_name ┆ mi_value │
│ --- ┆ --- │
│ cat ┆ f64 │
╞═════════════╪══════════╡
│ class_label ┆ 1.0 │
│ altitude_m ┆ 0.0 │
│ weather ┆ 0.011423 │
└─────────────┴──────────┘
That is all it takes: one adapter that discretizes factor columns turns either
kind of tabular catalog into a MetadataLike ready for Balance,
Diversity, and Parity - no images required.