How-to Guides

These guides help you accomplish specific tasks with DataEval. Each one addresses a practical problem and walks you through the solution step by step.

Besides reading them in the documentation, you can open these notebooks in Google Colab and run them interactively.

The guides are organized by where they fall in the machine learning life cycle:

Configuration
Data Engineering
Model Development
Monitoring

See Running notebooks locally at the bottom of this page for how to generate runnable .ipynb files in your local checkout.

Configuration

These guides provide quick examples of how to configure DataEval for your environment.

How to configure global DataEval defaults	Configure global DataEval defaults
How to configure logging with DataEval	Configure logging with DataEval

Data Engineering

These guides cover tasks related to preparing, cleaning, exploring, and curating datasets for machine learning.

How to encode images with ONNX models	Encode image embeddings with an ONNX model
Embed object detection crops and visualize class clusters	Embed object detection box crops and visualize class clusters
How to run clustering analysis	Identify outliers and anomalies with clustering algorithms
How to identify duplicates	Identify and remove duplicates from a PyTorch Dataset
How to visualize cleaning issues	Find negatively impactful images in multiple backgrounds
How to specify custom statistics on object detection datasets	Customize calculation of image stats on an object detection dataset
How to add intrinsic factors to Metadata	Apply DataEval’s statistical outputs to its `Metadata` object for bias analysis
How to detect undersampled data subsets	Detect undersampled subsets of datasets
How to wrap a DataFrame-backed image classification dataset	Wrap a pandas DataFrame catalog as a DataEval dataset
How to wrap a DataFrame-backed object detection dataset	Wrap a long-format DataFrame of bounding boxes as an object detection dataset
How to build a MetadataLike object from a DataFrame	Build a MetadataLike object from a DataFrame for bias analysis without loading images
How to delay image loading until needed	Defer image decoding to speed up metadata-only analysis
How to reconcile labels against an ontology	Validate dataset class names against an ontology and recover their hierarchy
How to align two label spaces	Align two label vocabularies into typed correspondences and a carry-over class remapping
How to conform and merge datasets with different label vocabularies	Conform an incoming dataset to a reference vocabulary and merge the two

Model Development

These guides cover tasks related to assessing data feasibility and sufficiency for model training.

How to determine image classification feasibility	Calculate feasibility of performance requirements on different datasets using Bayes Error Rate (BER)
How to measure dataset sufficiency for image classification	Determine the amount of data needed to meet image classification performance requirements

Monitoring

These guides cover tasks related to comparing datasets and detecting distribution shifts in deployed systems.

How to measure train and test dataset divergence	Display data distributions between two datasets
How to measure label independence	Compare label distributions between two datasets
How to detect uncertainty drift with a MAITE model	Detect distribution shift from a MAITE model’s prediction uncertainty, with no custom decoding

Running notebooks locally

The notebook sources live as py:percent scripts in docs/source/notebooks/. To get runnable .ipynb files for local editing, choose one of:

With nox (recommended): nox -s docsync, which syncs the .py/.ipynb pairs in both directions.
With jupytext directly: jupytext --to notebook docs/source/notebooks/*.py

The generated .ipynb files are gitignored, so edits stay local to your checkout.
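
If the shell glob in the jupytext command is not available (for example on some Windows shells), the same conversion can be driven from Python. The following is a minimal sketch, not part of DataEval: the helper names plan_conversions and convert_all are hypothetical, and it assumes the jupytext command is installed and on your PATH. It pairs each script with the .ipynb file that jupytext writes next to it, then builds one `jupytext --to notebook` command per script.

```python
from pathlib import Path
import shutil
import subprocess


def plan_conversions(notebook_dir: Path) -> list[tuple[Path, Path]]:
    """Pair each .py script with the .ipynb path jupytext writes beside it.

    Hypothetical helper for illustration; not a DataEval or jupytext API.
    """
    scripts = sorted(notebook_dir.glob("*.py"))
    return [(script, script.with_suffix(".ipynb")) for script in scripts]


def convert_all(notebook_dir: Path, dry_run: bool = True) -> list[list[str]]:
    """Build one `jupytext --to notebook` command per script.

    With dry_run=True (the default) the commands are only returned, not run.
    """
    commands = [
        ["jupytext", "--to", "notebook", str(source)]
        for source, _ in plan_conversions(notebook_dir)
    ]
    if not dry_run:
        # Assumption: jupytext was installed into the active environment.
        if shutil.which("jupytext") is None:
            raise RuntimeError("jupytext is not on PATH")
        for command in commands:
            subprocess.run(command, check=True)
    return commands
```

Calling convert_all(Path("docs/source/notebooks")) from the repository root returns the planned commands for inspection; passing dry_run=False runs them. Unlike the nox session, this sketch converts in one direction only, from .py to .ipynb.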