DataEval for Testing and Evaluation Engineers

An AI test and evaluation (T&E) engineer is responsible for the independent and rigorous testing of AI systems:

  • to verify that the AI systems meet requirements,

  • to validate that the AI systems are suitable for their intended use, and

  • to identify any potential risks or limitations before deployment.

T&E engineers typically operate as a third party, separate from the ML engineers and data scientists, to provide an unbiased assessment of the AI system’s performance and safety.

A T&E engineer’s workflow is often centered around formal testing events where they execute a test plan, analyze the results, and generate a final report with findings and recommendations. DataEval provides a suite of tools that are critical for many of these activities, especially those related to ensuring the quality and operational relevance of the data used for testing.

[Flowchart: Scope and Objectives, Data Engineering, Model Development, Deployment, Monitoring, and Analysis shown as interconnected stages, with Analysis feeding back into Scope and Objectives.]

Key T&E engineer tasks and relevant DataEval functions

The following sections pair common T&E engineer tasks with the DataEval tools that support them.

Ensure test data quality and annotation integrity

Perform a thorough analysis of the test data to identify and flag quality issues, such as blurry or corrupt images, and annotation errors like misaligned bounding boxes, incorrect class labels, or inconsistent labeling standards.

Use DataEval’s labelstats() function to obtain label distributions and counts, and DataEval’s imagestats() function to identify loading or annotation errors. DataEval also includes an Outliers class and a Duplicates class to identify anomalous and redundant images.
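The two checks described above, label counts and duplicate detection, can be illustrated with a minimal standard-library sketch. It does not use or reproduce DataEval's API: the helper names label_counts and find_exact_duplicates and the toy data are hypothetical, and it only catches byte-identical images, whereas a dedicated duplicate detector may also flag near duplicates.

```python
import hashlib
from collections import Counter


def label_counts(labels):
    """Return per-class counts for a list of class labels."""
    return Counter(labels)


def find_exact_duplicates(images):
    """Group the indices of byte-identical images by hashing their raw bytes."""
    groups = {}
    for index, raw in enumerate(images):
        digest = hashlib.sha256(raw).hexdigest()
        groups.setdefault(digest, []).append(index)
    # Keep only hashes that were seen more than once.
    return [indices for indices in groups.values() if len(indices) > 1]


# Hypothetical toy data: raw image bytes and one class label per image.
images = [b"\x00\x01\x02", b"\x10\x11\x12", b"\x00\x01\x02", b"\x20\x21\x22"]
labels = ["car", "truck", "car", "car"]

counts = label_counts(labels)
duplicates = find_exact_duplicates(images)
print(counts)
print(duplicates)
```

A heavily skewed count or an unexpected duplicate group is a prompt to inspect the flagged images by hand before the test event.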

Validate test data is operationally relevant

Scrutinize the test dataset to ensure it contains images that accurately represent the target operational conditions, including sensor types, camera angles, weather, lighting, and environments.

For datasets that contain acquisition conditions as metadata, DataEval has a set of bias metrics, balance() and diversity(), that can assist in determining relevant conditions. It also provides completeness() and coverage() metrics to assess how representative the dataset is.
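As an illustration of what a diversity-style check on acquisition metadata measures, the sketch below computes a normalized Shannon entropy for one categorical factor. This is one common diversity measure, chosen here as an assumption; it is not presented as what diversity() computes internally, and the function name and metadata values are hypothetical.

```python
import math
from collections import Counter


def normalized_shannon_entropy(values):
    """Shannon entropy of a categorical factor, scaled to [0, 1].

    1.0 means every observed category is equally represented;
    values near 0.0 mean a single category dominates.
    """
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


# Hypothetical acquisition metadata, one entry per test image.
weather = ["clear"] * 90 + ["rain"] * 8 + ["fog"] * 2
time_of_day = ["day"] * 50 + ["night"] * 50

print(round(normalized_shannon_entropy(weather), 3))
print(round(normalized_shannon_entropy(time_of_day), 3))
```

In this toy data the weather factor scores far lower than the time-of-day factor, which signals that rain and fog are thinly represented relative to clear conditions.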

Evaluate performance on critical data subgroups

Measure and compare model performance on specific, operationally relevant subgroups of the image data (e.g., performance on small objects, low-light images, rainy conditions, or partially occluded targets).

DataEval’s Sufficiency class allows the user to compare the performance of multiple models, including current performance and predicted performance at different amounts of data, along with the predicted model saturation point. DataEval also has a Select class that allows the user to create subsets of the data based on user-defined selection criteria.
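Once a dataset has been split into subgroups, the comparison itself is simple bookkeeping. The sketch below computes accuracy per subgroup from a list of prediction records. It is independent of DataEval: the record layout, the "lighting" key, and the accuracy_by_subgroup helper are hypothetical, and accuracy stands in for whatever metric the test plan specifies.

```python
from collections import defaultdict


def accuracy_by_subgroup(records, key):
    """Return {subgroup value: accuracy} for records grouped by records[i][key]."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        group = record[key]
        total[group] += 1
        correct[group] += int(record["predicted"] == record["actual"])
    return {group: correct[group] / total[group] for group in total}


# Hypothetical per-image results with one metadata field.
records = [
    {"lighting": "day", "actual": "car", "predicted": "car"},
    {"lighting": "day", "actual": "truck", "predicted": "truck"},
    {"lighting": "day", "actual": "car", "predicted": "car"},
    {"lighting": "day", "actual": "truck", "predicted": "car"},
    {"lighting": "night", "actual": "car", "predicted": "truck"},
    {"lighting": "night", "actual": "car", "predicted": "car"},
    {"lighting": "night", "actual": "truck", "predicted": "car"},
    {"lighting": "night", "actual": "truck", "predicted": "truck"},
]

by_lighting = accuracy_by_subgroup(records, "lighting")
print(by_lighting)
```

Reporting the subgroup sizes alongside each score helps a reader judge whether a gap between subgroups reflects a real weakness or a small sample.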

Perform error analysis to identify systemic weaknesses

Conduct a deep dive into the model’s failures (false positives, false negatives, misclassifications). Visualize these errors to identify patterns, such as the model consistently confusing two similar-looking objects or failing to detect objects at a distance.

By combining multiple DataEval functions – Select class, imagestats(), labelstats(), cluster(), balance(), and diversity() – model failures can be investigated at the image level.
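A first pass at spotting systematic confusions can be as simple as counting which (actual, predicted) pairs occur most often among the errors. The sketch below is a standard-library illustration only. The top_confusions helper and the labels are hypothetical, and it covers classification errors, not detection-specific failures such as missed boxes.

```python
from collections import Counter


def top_confusions(actual, predicted, n=3):
    """Return the n most frequent (actual, predicted) pairs among misclassifications."""
    pairs = Counter((a, p) for a, p in zip(actual, predicted) if a != p)
    return pairs.most_common(n)


def error_indices(actual, predicted, pair):
    """Return the indices of images whose (actual, predicted) labels match pair."""
    return [i for i, ap in enumerate(zip(actual, predicted)) if ap == pair]


# Hypothetical ground-truth and predicted class labels, one per image.
actual = ["car", "truck", "truck", "bus", "truck", "car", "bus"]
predicted = ["car", "car", "car", "truck", "truck", "car", "bus"]

confusions = top_confusions(actual, predicted)
worst_pair = confusions[0][0]
print(confusions)
print(error_indices(actual, predicted, worst_pair))
```

The returned indices identify the specific images behind the most common confusion, which is the subset worth examining with image statistics and metadata.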

Explore unknown risks and potential failure modes

Proactively search for unexpected failure modes by testing the system against visual edge cases, anomalies, and adversarial attacks (e.g., adversarial patches that can make an object invisible to the detector).

While DataEval does not address adversarial robustness or natural robustness, it does contain an Outliers class to identify visual anomalies and a cluster() function, which can help identify edge cases.
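To make the idea of a visual anomaly concrete, the sketch below flags images whose per-image statistic sits far from the rest, using the modified z-score (based on the median and the median absolute deviation). The method, the 3.5 threshold, the helper name, and the brightness values are all assumptions made for illustration; this is not a description of how the Outliers class works.

```python
import statistics


def modified_zscore_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold.

    The modified z-score uses the median and the median absolute deviation
    (MAD), so a few extreme values do not distort the baseline.
    """
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        return []  # No spread around the median; nothing can be scored.
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - median) / mad > threshold
    ]


# Hypothetical mean brightness per image, scaled to [0, 1].
brightness = [0.52, 0.48, 0.50, 0.51, 0.49, 0.53, 0.47, 0.02, 0.50, 0.98]

outliers = modified_zscore_outliers(brightness)
print(outliers)
```

Flagged images are candidates for review, not confirmed defects: a nearly black or nearly white frame may be a corrupt file or a legitimate edge case worth keeping in the test set.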

Monitor model performance

Implement monitoring to track operational metrics (latency, throughput) and to detect data drift. Analyze why a model’s performance is decaying in production by comparing the distribution of image statistics (or embeddings) between the new data and the training data, then propose a retraining or calibration strategy.

DataEval has a set of drift and out-of-distribution (OOD) detection functions, along with divergence() and label_parity(), to identify differences between operational and training distributions of both images and labels.
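The comparison of distributions described above can be sketched with a two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. This is one standard choice for a single scalar feature, used here as an assumption; it does not claim to reflect how DataEval's drift detectors are implemented, and the function name and sample values are hypothetical.

```python
import bisect


def ks_statistic(reference, operational):
    """Two-sample Kolmogorov-Smirnov statistic for one scalar feature.

    Returns a value in [0, 1]: 0 means the empirical distributions match,
    1 means the two samples do not overlap at all.
    """
    ref = sorted(reference)
    ops = sorted(operational)
    largest_gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for x in ref + ops:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_ops = bisect.bisect_right(ops, x) / len(ops)
        largest_gap = max(largest_gap, abs(cdf_ref - cdf_ops))
    return largest_gap


# Hypothetical per-image brightness for training data and two operational batches.
training = [1.0, 2.0, 3.0, 4.0]
batch_similar = [1.0, 2.0, 3.0, 4.0]
batch_shifted = [3.0, 4.0, 5.0, 6.0]

print(ks_statistic(training, batch_similar))
print(ks_statistic(training, batch_shifted))
```

In practice the statistic would be paired with a significance test or a threshold agreed in the test plan, and computed per feature or on embeddings instead of a single hand-picked statistic.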