Metadata

What is it?

Metadata is data that describes the images and/or objects in a dataset.

Metadata can include details such as the source of the data, feature descriptions, timestamps, labels, sensor types, or even the preprocessing steps applied to the dataset. It is crucial for understanding the context and provenance of a dataset, ensuring reproducibility, and identifying potential biases or limitations.

Leveraging metadata can improve model interpretability, support feature engineering, and improve overall model performance by giving algorithms access to context-specific information.

Metadata is especially valuable when monitoring and auditing pipelines, where it helps ensure that the system aligns with real-world requirements and ethical considerations.

How to analyze it

For both model and dataset development, it is important to understand the correlations that underlie the dataset.

Often, opportunities for data collection are sparse, available only in non-operational locations and conditions, with limited target diversity, etc. A model trained on such datasets could learn to use secondary information to perform the primary learning task. This reduces the model’s ability to generalize to new domains and can cause it to behave unexpectedly when presented with new data.

DataEval provides the balance(), diversity(), and parity() metrics for identifying relationships between dataset factors and class labels a priori. A test and evaluation (T&E) engineer or model developer should then use that information to design tests for model generalization, or data augmentation, to mitigate the opportunity for shortcut learning or sampling imbalance.
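To make "a relationship between a dataset factor and class labels" concrete, the sketch below computes the mutual information between one discrete metadata factor and the labels using only the Python standard library. It does not call DataEval and is not claimed to match how balance(), diversity(), or parity() are implemented. The function name `mutual_information` and the toy sensor and time-of-day values are assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(factor, labels):
    """Mutual information (in bits) between two equal-length discrete sequences.

    Hypothetical helper for illustration; not part of DataEval.
    """
    if len(factor) != len(labels):
        raise ValueError("factor and labels must have the same length")
    n = len(labels)
    joint = Counter(zip(factor, labels))   # counts of (factor value, label) pairs
    factor_counts = Counter(factor)        # marginal counts of the factor
    label_counts = Counter(labels)         # marginal counts of the labels
    mi = 0.0
    for (f, y), count in joint.items():
        p_joint = count / n
        p_independent = (factor_counts[f] / n) * (label_counts[y] / n)
        mi += p_joint * math.log2(p_joint / p_independent)
    return mi


# Toy data (made up): the sensor type is perfectly tied to the class label,
# while the time of day is spread evenly across both classes.
labels = ["truck", "truck", "car", "car"]
sensor = ["ir", "ir", "eo", "eo"]
time_of_day = ["day", "night", "day", "night"]

mi_sensor = mutual_information(sensor, labels)     # 1 bit: sensor reveals the label
mi_time = mutual_information(time_of_day, labels)  # 0 bits: no information shared
```

A value near zero suggests the factor carries little information about the label. A value near the entropy of the labels (1 bit for two balanced classes) flags a factor that a model could use as a shortcut.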

To use DataEval’s bias metrics, the user must supply their metadata in a DataEval-specific format. For this reason, DataEval provides a preprocess() function that takes user metadata and converts it into that format. Each bias metric takes the output of preprocess() as its input.
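The format DataEval expects and the signature of preprocess() are not described here, so the sketch below does not try to reproduce either. It only illustrates, in plain Python, the kind of reshaping a preprocessing step like this commonly involves: turning per-image records into one column per factor and binning a continuous value into discrete categories. The helpers `to_columns` and `bin_values`, the field names, and the bin edges are all hypothetical.

```python
def to_columns(records):
    """Turn a list of per-image dicts into one list per metadata factor.

    Hypothetical helper; missing fields become None.
    """
    factors = sorted({key for record in records for key in record})
    return {name: [record.get(name) for record in records] for name in factors}


def bin_values(values, edges):
    """Replace each number with the index of the interval it falls in.

    With edges [300, 600]: below 300 -> 0, from 300 up to 600 -> 1, 600 and up -> 2.
    """
    return [sum(value >= edge for edge in edges) for value in values]


# Made-up metadata for three images.
records = [
    {"sensor": "eo", "altitude_m": 120.0},
    {"sensor": "ir", "altitude_m": 450.0},
    {"sensor": "eo", "altitude_m": 980.0},
]

columns = to_columns(records)
columns["altitude_bin"] = bin_values(columns["altitude_m"], edges=[300.0, 600.0])
```

Discretizing continuous factors matters because measures of association between a factor and class labels are usually computed over categories rather than raw numeric values.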

Why is it important?

Statistical independence is important when evaluating metadata because a model trained on a dataset must avoid learning unintended bias. Bias commonly manifests when class labels are not statistically independent of metadata attributes.

For example, consider a scenario where a user wants to train a model to classify images as cats or as dogs. Suppose that, in this dataset, all dog pictures were taken in Washington and all cat pictures were taken in Arizona. A model could learn this spurious correlation and classify an image as a cat or dog by inspecting the location information rather than features of cats and dogs. A picture of a cat taken in Washington would then be misclassified as a dog.
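The dependence in this example can be checked before any training by building a contingency table of location against label. The sketch below is a standard-library illustration that computes Pearson's chi-square statistic and Cramér's V for the cat and dog scenario. It is not DataEval's implementation, and the sample counts (50 images per class) and function names are assumptions.

```python
import math
from collections import Counter


def chi_square_statistic(factor, labels):
    """Pearson chi-square statistic for the factor-by-label contingency table."""
    n = len(labels)
    observed = Counter(zip(factor, labels))  # missing cells count as 0
    factor_totals = Counter(factor)
    label_totals = Counter(labels)
    chi2 = 0.0
    for f, f_total in factor_totals.items():
        for y, y_total in label_totals.items():
            expected = f_total * y_total / n  # count expected under independence
            chi2 += (observed[(f, y)] - expected) ** 2 / expected
    return chi2


def cramers_v(factor, labels):
    """Cramér's V: chi-square rescaled to [0, 1]; 0 means no association."""
    n = len(labels)
    k = min(len(set(factor)), len(set(labels))) - 1
    if k == 0:
        return 0.0  # a constant factor or label carries no association
    return math.sqrt(chi_square_statistic(factor, labels) / (n * k))


labels = ["dog"] * 50 + ["cat"] * 50

# Every dog photographed in Washington, every cat in Arizona.
biased_location = ["Washington"] * 50 + ["Arizona"] * 50
# Each class split evenly between the two states.
mixed_location = ["Washington", "Arizona"] * 50

v_biased = cramers_v(biased_location, labels)  # 1.0: location determines the label
v_mixed = cramers_v(mixed_location, labels)    # 0.0: location says nothing
```

In the biased dataset the location alone determines the label, which is exactly the situation in which a model can succeed on training data without learning anything about cats or dogs.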

Early detection and mitigation of metadata bias is critical for training unbiased and reliable models.