How-to Guides

Warning

The how-to guides are a work in progress and are expected to change substantially.

These guides demonstrate DataEval's more in-depth features and customizations for advanced users.

In addition to viewing them in our documentation, you can open these notebooks in Google Colab and run them interactively.

General Usage

These guides provide quick examples of how to configure DataEval for your environment.

Configuring hardware: PyTorch devices and CPU processes	Configure global hardware settings used in DataEval
Configuring Python Logging with DataEval	Configure logging with DataEval
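
As a rough illustration of what the logging guide covers, the sketch below uses only Python's standard `logging` module. It assumes that DataEval emits its messages through a logger named `dataeval`, which follows the common `logging.getLogger(__name__)` convention but is not confirmed here. The helper name `enable_library_logging` is hypothetical and not part of DataEval; refer to the linked guide for the supported configuration.

```python
import logging


def enable_library_logging(logger_name="dataeval", level=logging.DEBUG, stream=None):
    """Attach a stream handler to the named logger and return that logger.

    The default logger name "dataeval" is an assumption, not a documented API.
    """
    logger = logging.getLogger(logger_name)
    logger.setLevel(level)

    # StreamHandler writes to sys.stderr when stream is None.
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = enable_library_logging(level=logging.INFO)
    log.info("logging is configured")
```

Calling the helper more than once adds a handler each time, so in a notebook it is best invoked a single time per session.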

Detectors

These tools identify or detect issues within a dataset. The guides below show how to apply them to common problems in ML.

How to run clustering analysis	Identify outliers and anomalies with clustering algorithms
How to identify duplicates	Identify and remove duplicates from a PyTorch Dataset
How to visualize cleaning issues	Find negatively impactful images in multiple backgrounds
How to specify custom statistics on object detection datasets	Customize calculation of image stats on an object detection dataset

Metrics

Metrics are a set of tools that measure and analyze data. The guides below show best practices when solving common ML problems.

How to determine image classification feasibility	Calculate feasibility of performance requirements on different datasets using Bayes Error Rate (BER)
How to measure train and test dataset divergence	Display data distributions between 2 datasets
How to measure label independence	Compare label distributions between 2 datasets
How to detect undersampled data subsets	Detect undersampled subsets of datasets
How to add intrinsic factors to Metadata	Apply DataEval’s statistical outputs to DataEval’s `Metadata` object for bias analysis

Workflows

Workflows are end-to-end processes that detect, measure, and analyze data against requirements. The guides below help you solve common problems found across machine learning tasks.

How to measure dataset sufficiency for image classification	Determine the amount of data needed to meet image classification performance requirements