Getting Started¶
DataEval helps you evaluate image datasets for quality, bias, scope, distribution shift, and performance limits. It implements Modular AI Trustworthy Engineering (MAITE)-compliant metrics that integrate with the broader Joint AI T&E Infrastructure Capability (JATIC) suite of tools.
Note
DataEval imposes no restrictions on image type. It accepts any image modality (RGB, IR, EO, multispectral, greyscale, and others) at any bit depth (8-bit, 16-bit, 32-bit, etc.) and channel count (1+).
Important
Some DataEval functions and classes apply only to image classification tasks, while others apply only to object detection tasks. For more information regarding when to use each function or class see the Functional Overivew page for details.
Step 1: Install DataEval¶
DataEval requires Python 3.10 or higher. It has been tested on Ubuntu and Windows. macOS users may encounter platform-specific issues; report these via the issue tracker.
DataEval can be installed via pip from PyPI:
pip install dataeval
DataEval can be installed via conda from
conda-forge:
conda install -c conda-forge dataeval
See also
For details on optional extras, installing from source, or developer setup, see Installation.
Step 2: Prepare your dataset¶
DataEval has two input paths depending on which part of the library you are using.
dataeval.core provides stateless functions that operate directly on NumPy
arrays — embeddings, labels, image hashes, and statistics. No dataset object is
required. Call these functions with arrays and get results back directly. Examples
include compute_stats(), label_errors(), divergence_mst(),
and ber_knn().
dataeval.quality, dataeval.bias, dataeval.scope, dataeval.shift, and dataeval.performance
provide stateful evaluator classes (Duplicates, Outliers,
Prioritize, Balance, drift detectors, and so on). These
accept MAITE-compliant
datasets, Metadata or Embeddings depending on the evaluator.
If your data is not yet in MAITE format, the sections below show what is required and how to wrap a common format, for both image classification and object detection tasks.
Image classification dataset¶
A MAITE-compliant image classification dataset implements __len__ and
__getitem__, where each item is a tuple of (image, label, metadata).
Images must be NumPy arrays of shape (C, H, W). Labels must be one-hot
encoded arrays of shape (num_classes,). Metadata must be a DatumMetadata
object with at minimum an id field.
import maite.protocols as mp
import maite.protocols.image_classification as ic
import numpy as np
class MyImageClassificationDataset(ic.Dataset):
metadata: mp.DatasetMetadata
def __init__(self, images: list[np.ndarray], labels: list[int], num_classes: int) -> None:
# images: list of np.ndarray, each shape (C, H, W)
# labels: list of int (class indices)
self._images = images
self._labels = labels
self._num_classes = num_classes
self.metadata = mp.DatasetMetadata(
id="my_image_classification_dataset",
index2label={i: f"class_{i}" for i in np.unique(labels)}, # example mapping
)
def __len__(self) -> int:
return len(self._images)
def __getitem__(self, idx: int) -> tuple[ic.InputType, ic.TargetType, ic.DatumMetadataType]:
return (
self._images[idx], # np.ndarray (C, H, W)
np.eye(self._num_classes, dtype=np.float32)[self._labels[idx]], # np.ndarray (num_classes,)
ic.DatumMetadataType(id=idx),
)
Object detection dataset¶
A MAITE-compliant object detection dataset follows the same three-tuple
structure, but the label element is replaced by a detection target object
carrying per-box labels, bounding boxes, and scores. Bounding boxes use
(x0, y0, x1, y1) format. Labels and scores are per-box, not per-image.
import maite.protocols as mp
import maite.protocols.object_detection as od
import numpy as np
class DetectionTarget(od.TargetType):
"""Holds per-box labels, boxes, and one-hot scores for one image."""
def __init__(self, labels: list[int], boxes: list[list[float]], num_classes: int):
# labels: list of int, one per box
# boxes: list of [x0, y0, x1, y1], one per box
self._labels = labels
self._boxes = boxes
self._scores = np.eye(num_classes)[labels]
@property
def labels(self) -> mp.ArrayLike:
return self._labels
@property
def boxes(self) -> mp.ArrayLike:
return self._boxes
@property
def scores(self) -> mp.ArrayLike:
return self._scores
class MyObjectDetectionDataset(od.Dataset):
def __init__(
self, images: list[np.ndarray], labels: list[list[int]], boxes: list[list[list[float]]], num_classes: int
) -> None:
# images: list of np.ndarray, each shape (C, H, W)
# labels: list of list[int] — per-box class indices, one list per image
# boxes: list of list[[x0,y0,x1,y1]] — one list per image
self._images = images
self._labels = labels
self._boxes = boxes
self._num_classes = num_classes
self.metadata = mp.DatasetMetadata(
id="my_object_detection_dataset",
index2label={i: f"class_{i}" for i in np.unique(labels)}, # example mapping
)
def __len__(self) -> int:
return len(self._images)
def __getitem__(self, idx: int) -> tuple[od.InputType, od.TargetType, od.DatumMetadataType]:
return (
self._images[idx], # np.ndarray (C, H, W)
DetectionTarget(self._labels[idx], self._boxes[idx], self._num_classes),
od.DatumMetadataType(id=idx),
)
Wrapping a PyTorch dataset¶
If your data is in a PyTorch Dataset, wrap it to conform to the MAITE
protocol. Note that torchvision tensors are (C, H, W) which is the
supported format by DataEval.
import maite.protocols as mp
import maite.protocols.image_classification as ic
import numpy as np
import torch
from torchvision import transforms
from torchvision.datasets import CIFAR10
tv_cifar10 = CIFAR10(root="./data", train=True, download=True, transform=transforms.ToTensor())
class MyCIFAR10Wrapper(ic.Dataset):
def __init__(self, source: CIFAR10) -> None:
self._source = source
self.metadata = mp.DatasetMetadata(
id="tv_cifar10",
index2label={
0: "airplane",
1: "automobile",
2: "bird",
3: "cat",
4: "deer",
5: "dog",
6: "frog",
7: "horse",
8: "ship",
9: "truck",
},
)
def __len__(self) -> int:
return len(tv_cifar10)
def __getitem__(self, idx: int) -> tuple[ic.InputType, ic.TargetType, ic.DatumMetadataType]:
tv_datum: tuple[torch.Tensor, int] = tv_cifar10[idx]
image = tv_datum[0].numpy()
label = np.eye(10, dtype=np.float32)[tv_datum[1]] # Convert label to one-hot encoding
return image, label, mp.DatumMetadata(id=idx)
dataset: ic.Dataset = MyCIFAR10Wrapper(tv_cifar10)
Step 3: Run your first evaluation¶
The example below uses Duplicates from dataeval.quality to detect
near-duplicate images by finding groups of embeddings that are similar in
embedding space. Duplicates inflate benchmark scores and cause models to overfit
to repeated collection events rather than generalizing to new conditions.
from torch.nn import Flatten
from dataeval.extractors import TorchExtractor
from dataeval.flags import ImageStats
from dataeval.quality import Duplicates
# Configure a feature extractor using a pre-trained PyTorch model.
# Here we use a simple Flatten layer for demonstration, but in practice
# you would use a more powerful model like a pre-trained ResNet or ViT.
extractor = TorchExtractor(Flatten())
# Find near-duplicates using only embedding-based clustering.
# An aggressive cluster_threshold of 1.5 should produce detections
# of near duplicates even with a simple Flatten extractor.
evaluator = Duplicates(
flags=ImageStats.NONE,
cluster_algorithm="hdbscan",
cluster_threshold=1.5,
extractor=extractor,
batch_size=64,
)
result = evaluator.evaluate(dataset)
# Near duplicates are grouped into sets of indices that are within
# the specified cluster_threshold in embedding space.
print(result)
shape: (3, 5)
┌──────────┬───────┬──────────┬────────────────┬─────────────┐
│ group_id ┆ level ┆ dup_type ┆ item_indices ┆ methods │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ list[i64] ┆ list[str] │
╞══════════╪═══════╪══════════╪════════════════╪═════════════╡
│ 0 ┆ item ┆ near ┆ [18586, 39942] ┆ ["cluster"] │
│ 1 ┆ item ┆ near ┆ [23157, 31426] ┆ ["cluster"] │
│ 2 ┆ item ┆ near ┆ [32024, 49135] ┆ ["cluster"] │
└──────────┴───────┴──────────┴────────────────┴─────────────┘
A result with many large groups is a signal that your dataset contains repeated collection events. Before training, remove all but one sample from each group. See the deduplication how-to guide for a complete walkthrough, including how to choose which sample to keep.
Where to go next¶
Not sure what to evaluate first? Use the Which tool should I use? guide to find the right evaluator for your situation.
Know which tool to use, then check out the Functional Overview for a quick-reference table of each algorithm’s inputs, outputs, and task applicability.
If you prefer to learn by doing, start with the data cleaning tutorial. It walks through the most common first-pass analysis tasks — duplicates, outliers, and image quality — using a realistic dataset.