How-to Guides

These guides help you accomplish specific tasks with DataEval. Each one addresses a practical problem and walks you through the solution step by step.

Each guide is a notebook that you can read in the documentation or open in Google Colab to run interactively.

The guides are organized by where they fall in the machine learning life cycle:

  1. Configuration
  2. Data Engineering
  3. Model Development
  4. Monitoring

Configuration

These guides provide quick examples of how to configure DataEval for your environment.

How to configure global hardware configuration defaults in DataEval

Configure global hardware settings used in DataEval

How to configure logging with DataEval

Configure logging with DataEval
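
As a rough illustration of what logging configuration can look like, here is a minimal sketch that uses only Python's standard `logging` module. It assumes DataEval emits its records through a logger named `dataeval`; that name and the `enable_debug_logging` helper are assumptions made for this example, so check the linked guide for the supported approach.

```python
import logging

# Assumption: the library logs through a standard-library logger with this
# name. Verify the actual name in the linked guide.
LOGGER_NAME = "dataeval"


def enable_debug_logging(name: str = LOGGER_NAME) -> logging.Logger:
    """Lower the named logger to DEBUG and attach one stream handler."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    # Only add a handler if none is attached, so repeated calls do not
    # produce duplicate log lines.
    if not any(isinstance(h, logging.StreamHandler) for h in logger.handlers):
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = enable_debug_logging()
```

Configuring the named logger rather than the root logger keeps other libraries' log output unchanged.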

Data Engineering

These guides cover tasks related to preparing, cleaning, exploring, and curating datasets for machine learning.

How to encode images with ONNX models

Encode image embeddings with an ONNX model

How to run clustering analysis

Identify outliers and anomalies with clustering algorithms

How to identify duplicates

Identify and remove duplicates from a PyTorch Dataset
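
To make the idea concrete, the sketch below finds exact duplicates by hashing raw bytes with the standard library. It is not DataEval's API or the method used in the linked guide: the function names are hypothetical, the input is assumed to be a sequence of `bytes` objects (for example, encoded image files), and near duplicates are out of scope.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(items):
    """Return groups of indices whose items are byte-for-byte identical."""
    groups = defaultdict(list)
    for index, raw in enumerate(items):
        # Identical byte strings always produce the same SHA-256 digest.
        groups[hashlib.sha256(raw).hexdigest()].append(index)
    return [indices for indices in groups.values() if len(indices) > 1]


def indices_to_keep(items):
    """Keep the first item of every duplicate group and all unique items."""
    dropped = {i for group in find_exact_duplicates(items) for i in group[1:]}
    return [i for i in range(len(items)) if i not in dropped]


# Hypothetical stand-ins for encoded image bytes.
images = [b"img-a", b"img-b", b"img-a", b"img-c", b"img-b"]
duplicate_groups = find_exact_duplicates(images)
kept = indices_to_keep(images)
```

The kept indices could then be used to build a filtered view of a dataset, for example with `torch.utils.data.Subset`.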

How to visualize cleaning issues

Find negatively impactful images in multiple backgrounds

How to specify custom statistics on object detection datasets

Customize calculation of image stats on an object detection dataset

How to add intrinsic factors to Metadata

Add DataEval’s statistical outputs to its Metadata object for bias analysis

How to detect undersampled data subsets

Detect undersampled subsets of datasets

Model Development

These guides cover tasks related to assessing data feasibility and sufficiency for model training.

How to determine image classification feasibility

Calculate feasibility of performance requirements on different datasets using Bayes Error Rate (BER)

How to measure dataset sufficiency for image classification

Determine the amount of data needed to meet image classification performance requirements

Monitoring

These guides cover tasks related to comparing datasets and detecting distribution shifts in deployed systems.

How to measure train and test dataset divergence

Display data distributions between two datasets

How to measure label independence

Compare label distributions between two datasets
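
One simple way to compare two label distributions is the total variation distance between their normalized label frequencies. The sketch below uses that measure purely for illustration. It is a swapped-in technique, not necessarily the test applied in the linked guide, and the function names are hypothetical.

```python
from collections import Counter


def label_distribution(labels):
    """Map each label to its relative frequency in a non-empty sequence."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def total_variation_distance(labels_a, labels_b):
    """Return a value in [0, 1]; 0 means identical label distributions."""
    p = label_distribution(labels_a)
    q = label_distribution(labels_b)
    # Labels missing from one dataset contribute with frequency 0 there.
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


# Hypothetical label lists for a training and a test split.
train_labels = ["cat", "cat", "dog", "dog"]
test_labels = ["cat", "dog", "dog", "dog"]
distance = total_variation_distance(train_labels, test_labels)
```

A distance near 0 suggests the two splits share a similar label balance, while a distance near 1 means they have almost no label mass in common.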
