How-to Guides
These guides help you accomplish specific tasks with DataEval. Each one addresses a practical problem and walks you through the solution step by step.
In addition to reading them in our documentation, you can open these notebooks in Google Colab and run them interactively.
The guides are organized by where they fall in the machine learning life cycle.
See Running notebooks locally at the bottom of this page for how to generate runnable .ipynb files in your local checkout.
Configuration
These guides provide quick examples of how to configure DataEval for your environment.
Configure global DataEval defaults
Configure logging with DataEval
Data Engineering
These guides cover tasks related to preparing, cleaning, exploring, and curating datasets for machine learning.
Encode image embeddings with an ONNX model
Embed object detection box crops and visualize class clusters
Identify outliers and anomalies with clustering algorithms
Identify and remove duplicates from a PyTorch Dataset
Find negatively impactful images in multiple backgrounds
How to specify custom statistics on object detection datasets: Customize calculation of image stats on an object detection dataset
Apply DataEval’s statistical outputs to DataEval’s
Detect undersampled subsets of datasets
Wrap a pandas DataFrame catalog as a DataEval dataset
Wrap a long-format DataFrame of bounding boxes as an object detection dataset
Build a MetadataLike object from a DataFrame for bias analysis without loading images
Defer image decoding to speed up metadata-only analysis
Validate dataset class names against an ontology and recover their hierarchy
Align two label vocabularies into typed correspondences and a carry-over class remapping
How to conform and merge datasets with different label vocabularies: Conform an incoming dataset to a reference vocabulary and merge the two
Model Development
These guides cover tasks related to assessing data feasibility and sufficiency for model training.
Calculate feasibility of performance requirements on different datasets using Bayes Error Rate (BER)
Determine the amount of data needed to meet image classification performance requirements
Monitoring
These guides cover tasks related to comparing datasets and detecting distribution shifts in deployed systems.
Display data distributions between 2 datasets
Compare label distributions between 2 datasets
Detect distribution shift from a MAITE model’s prediction uncertainty, with no custom decoding
Running notebooks locally
The notebook sources live as py:percent scripts in
docs/source/notebooks/. To get runnable .ipynb files for local
editing, choose one of:
With nox (recommended), which syncs the .py/.ipynb pairs bidirectionally:
nox -s docsync
With jupytext directly:
jupytext --to notebook docs/source/notebooks/*.py
The generated .ipynb files are gitignored, so edits stay local to
your checkout.
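For readers unfamiliar with the py:percent layout, the sketch below illustrates how such a script divides into notebook cells. It is a simplified illustration using only the standard library, not jupytext's implementation: the names `CELL_MARKER`, `split_percent_cells`, and `EXAMPLE` are hypothetical, and it assumes only bare `# %%` and `# %% [markdown]` markers, ignoring the cell titles and metadata that the real format allows.

```python
import re

# Assumption: only bare "# %%" and "# %% [markdown]" / "# %% [raw]" markers
# are recognized. Real py:percent markers may also carry titles and metadata.
CELL_MARKER = re.compile(r"^# %%(?:\s+\[(?P<kind>markdown|raw)\])?\s*$")


def split_percent_cells(source):
    """Split a py:percent script into a list of minimal cell dictionaries."""
    cells = []
    for line in source.splitlines():
        match = CELL_MARKER.match(line)
        if match:
            # A marker line starts a new cell; unmarked cells are code cells.
            cells.append({"cell_type": match.group("kind") or "code", "source": []})
        elif cells:
            cells[-1]["source"].append(line)

    for cell in cells:
        if cell["cell_type"] == "markdown":
            # Markdown cells are stored as comments, so drop the leading "# ".
            cell["source"] = [re.sub(r"^# ?", "", text) for text in cell["source"]]
        # Remove trailing blank lines that only separate cells in the script.
        while cell["source"] and not cell["source"][-1].strip():
            cell["source"].pop()
        cell["source"] = "\n".join(cell["source"])
    return cells


# Hypothetical two-cell script: one markdown cell followed by one code cell.
EXAMPLE = """# %% [markdown]
# # Title
# Some text.

# %%
x = 1 + 1
print(x)
"""

cells = split_percent_cells(EXAMPLE)
for cell in cells:
    print(cell["cell_type"], repr(cell["source"]))
```

Because the sources are plain Python files with comment markers, they diff cleanly in version control, which is consistent with keeping the generated .ipynb files out of the repository.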