DataEval for Data Scientists
A data scientist is focused on exploration and analysis to extract actionable insights from data. While data scientists share with ML engineers the goal of building effective models, their role tends to be more exploratory and research-oriented. They are often involved in the early stages of a project:
defining the problem,
understanding the data’s potential and limitations, and
experimenting with various modeling approaches.
They are deeply involved in understanding the data, formulating hypotheses, and using statistical methods to test those hypotheses. For a data scientist, data is not just an input to a model; it is the object of study itself.
A data scientist’s workflow is centered on data preparation. DataEval provides tools that fit the exploratory nature of this role, helping the data scientist quickly understand, clean, and prepare data for modeling.
Key data scientist tasks and relevant DataEval functions
The following sections pair common data scientist tasks with the DataEval tools that can be used to accomplish them.
Perform initial data profiling
Compute statistics on image properties like brightness, contrast, sharpness, and color distributions. For object detection, analyze the distributions of bounding box sizes, aspect ratios, and locations.
Use DataEval’s imagestats() to compute these statistics
for both whole images and any bounding boxes.
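To make the profiled quantities concrete, the following is a minimal standard-library sketch of two of them: brightness and contrast of a grayscale image, and the aspect ratio of a bounding box. It does not call DataEval. The function names, the definitions chosen (mean and standard deviation of intensity), and the toy values are assumptions made for illustration, and imagestats() may define these properties differently.

```python
from statistics import mean, pstdev

def brightness_and_contrast(image):
    """Return (mean intensity, population std of intensity) for a 2-D grayscale image."""
    pixels = [p for row in image for p in row]
    return mean(pixels), pstdev(pixels)

def box_aspect_ratio(box):
    """Width divided by height for a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) / (y_max - y_min)

# Toy 2x2 grayscale image with intensities in [0, 255] (made-up values).
image = [[0, 255], [255, 0]]
brightness, contrast = brightness_and_contrast(image)

# Hypothetical box that is 40 pixels wide and 20 pixels tall.
ratio = box_aspect_ratio((10, 20, 50, 40))
```

Collecting these values for every image and box gives the distributions that the later tasks (outlier detection, subset comparison, drift monitoring) operate on.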
Identify data quality issues
Systematically scan for problems like corrupt or unreadable image files, incorrect or missing labels, inconsistent annotation formats (e.g., COCO vs. YOLO), and misaligned bounding boxes.
Use DataEval’s labelstats() to obtain label distributions
and counts, and DataEval’s Dataset class to identify any loading or annotation
errors. DataEval also includes an Outliers class and a Duplicates
class to identify anomalous and redundant images.
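As an illustration of what duplicate and outlier checks do, here is a small standard-library sketch. It is not the Outliers or Duplicates implementation: it assumes byte-identical files count as duplicates (found with a SHA-256 digest) and flags outliers in a single per-image statistic with the modified z-score based on the median absolute deviation. The helper names, the 3.5 threshold, and the sample values are illustrative choices.

```python
import hashlib
from collections import defaultdict
from statistics import median

def exact_duplicates(images):
    """Group indices of byte-identical images using a SHA-256 digest."""
    groups = defaultdict(list)
    for index, data in enumerate(images):
        groups[hashlib.sha256(data).hexdigest()].append(index)
    return [ids for ids in groups.values() if len(ids) > 1]

def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:  # all values (nearly) identical: nothing to flag
        return []
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Made-up raw image bytes; items 0 and 2 are identical.
raw_images = [b"\x00\x01", b"\x02\x03", b"\x00\x01"]
duplicate_groups = exact_duplicates(raw_images)

# Made-up per-image mean brightness; the last image is far brighter.
brightness_values = [100, 102, 98, 101, 99, 250]
outlier_indices = mad_outliers(brightness_values)
```

A digest only catches exact copies. Resized or re-encoded near duplicates would need a perceptual hash or embedding distance, which this sketch omits.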
Discover underlying data structures and patterns
Use visualization techniques to review random samples of images. Apply clustering on image embeddings (e.g., from a pre-trained model) to discover natural groupings of scenes or objects that may not be captured by the labels.
Use DataEval’s cluster() to group the images.
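The idea of grouping embeddings can be sketched with plain k-means, shown here using only the standard library. This is a swapped-in technique for illustration and makes no claim about the algorithm behind cluster(). The 2-D "embeddings", the initial centroids, and the function names are all made up.

```python
from math import dist

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: dist(p, centroids[k]))
            for p in points]

def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    centroids = list(initial_centroids)
    for _ in range(iterations):
        labels = assign(points, centroids)
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[k] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assign(points, centroids), centroids

# Toy 2-D embeddings forming two well-separated groups (made-up values).
embeddings = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
              (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
cluster_labels, centers = kmeans(embeddings, [embeddings[0], embeddings[3]])
```

Comparing the resulting cluster labels against the annotated class labels is one way to spot groupings the labels do not capture.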
Perform statistical tests on image properties
Apply formal statistical tests to validate hypotheses about differences in image characteristics between data subsets (e.g., comparing the average bounding box size in ‘day’ vs. ‘night’ images).
Use DataEval’s Select class to create different subsets of the dataset
that can then be compared using the results of DataEval’s imagestats()
function.
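Once a statistic has been collected for two subsets, a formal test can be as simple as the permutation test sketched below. This is a standard-library illustration rather than a DataEval feature: the function name, the choice of a two-sided difference in means, and the normalized bounding-box areas for the ‘day’ and ‘night’ subsets are assumptions.

```python
import random
from statistics import mean

def permutation_test(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in means; returns a p-value."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_permutations + 1)

# Made-up normalized bounding-box areas for two subsets.
day_areas = [0.021, 0.025, 0.019, 0.023, 0.022]
night_areas = [0.035, 0.031, 0.038, 0.033, 0.036]
p_value = permutation_test(day_areas, night_areas)
```

A permutation test makes no normality assumption, which suits skewed image statistics, at the cost of more computation than a t-test.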
Quantify bias and representativeness
Use quantitative metrics to check image metadata such as class balance, background diversity, lighting conditions, and camera angles for potential biases, and to measure how well the dataset covers the operational domain.
DataEval has a set of bias metrics – balance(), diversity(), and
parity() – to identify potential shortcuts based on the metadata. It
also contains completeness() and coverage() to determine the
representativeness of the dataset.
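Two quantities commonly used for this kind of analysis are the normalized Shannon entropy of a single factor (how evenly its values are spread) and the mutual information between a metadata factor and the class label (how much the factor gives away about the label). The sketch below computes both with the standard library. The function names and the toy labels are hypothetical, and the exact formulas used by balance(), diversity(), and parity() may differ.

```python
from collections import Counter
from math import log

def normalized_entropy(values):
    """Shannon entropy divided by its maximum; 1.0 means perfectly even."""
    counts = Counter(values)
    if len(counts) < 2:
        return 0.0
    n = len(values)
    entropy = -sum(c / n * log(c / n) for c in counts.values())
    return entropy / log(len(counts))

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(c / n * log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# Made-up class labels and two made-up metadata factors.
classes = ["car", "car", "truck", "truck"]
time_of_day = ["day", "day", "night", "night"]   # perfectly tied to the class
camera = ["front", "rear", "front", "rear"]      # independent of the class

evenness = normalized_entropy(classes)
mi_time = mutual_information(classes, time_of_day)
mi_camera = mutual_information(classes, camera)
```

Here time_of_day is a potential shortcut: a model could predict the class from lighting alone, which the high mutual information exposes.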
Determine problem feasibility
Analyze the cleaned dataset to determine whether it is adequate given the problem requirements and complexity.
DataEval’s ber() and uap() functions calculate the upper performance
bound given the specific dataset. This allows different datasets to be compared
to determine the best dataset for the problem.
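One classical way to reason about such a bound is through the nearest-neighbor error: for two classes, the Cover-Hart result states that asymptotically the Bayes error rate R* satisfies R* ≤ R_NN ≤ 2R*(1 − R*), which rearranges to a lower bound on R*. The sketch below estimates R_NN by leave-one-out on a toy dataset and applies that inequality. It is an illustration of the concept only, not the method used by ber() or uap(). The function names and data are made up, and with finite samples the results are rough estimates.

```python
from math import dist, sqrt

def loo_1nn_error(points, labels):
    """Leave-one-out error rate of a 1-nearest-neighbor classifier."""
    errors = 0
    for i, p in enumerate(points):
        nearest = min((j for j in range(len(points)) if j != i),
                      key=lambda j: dist(p, points[j]))
        errors += labels[nearest] != labels[i]
    return errors / len(points)

def ber_bounds_binary(nn_error):
    """(lower, upper) estimate of the two-class Bayes error from the 1-NN error."""
    clipped = min(nn_error, 0.5)  # the two-class Bayes error cannot exceed 0.5
    lower = (1 - sqrt(1 - 2 * clipped)) / 2
    return lower, clipped

# Toy 1-D features; the last point of class 1 sits inside class 0's region.
features = [(0.0,), (0.1,), (0.2,), (0.3,), (1.0,), (1.1,), (1.2,), (0.25,)]
targets = [0, 0, 0, 0, 1, 1, 1, 1]

nn_error = loo_1nn_error(features, targets)
ber_lower, ber_upper = ber_bounds_binary(nn_error)
best_accuracy_estimate = 1 - ber_lower
```

The quantity 1 − ber_lower is the estimated ceiling on achievable accuracy for these features, which is what makes datasets comparable before any model is trained.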
Create dataset splits
Analyze the dataset to create training, validation, and testing subsets. Ensure that each split adequately represents the target operational environment and that there are no correlations between the splits.
Datasets can be split using DataEval’s split_dataset(), which has options that
enable the user to split the data based on metadata. DataEval’s bias functions,
balance() and diversity(), can help identify when there may be spurious
correlations between the splits.
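The core of a representative split is stratification: sampling each class (or metadata group) separately so that every split keeps the same proportions. A minimal standard-library sketch follows. It does not use split_dataset(), and the function name, parameters, and toy labels are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Made-up labels: 8 cats and 4 dogs.
sample_labels = ["cat"] * 8 + ["dog"] * 4
train_idx, test_idx = stratified_split(sample_labels)
test_labels = [sample_labels[i] for i in test_idx]
```

The same function stratifies on metadata if a metadata value (for example, a lighting condition) is passed in place of the class label.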
Build and evaluate models
Train standard models to establish a performance baseline, then train experimental and more complex models to systematically evaluate model architectures against it.
While DataEval does not assist in building and training ML models, it
does contain Sufficiency, which allows the user to compare the performance
of multiple models, including current performance and predicted performance
at different amounts of data, along with the predicted model saturation point.
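Predicting performance at larger dataset sizes usually means fitting a curve to measured learning-curve points and extrapolating. The sketch below assumes the error follows a pure power law, error(n) = a · n^(−b), and fits it by least squares in log-log space. This functional form, the function names, and the synthetic measurements are assumptions, and nothing here reflects how Sufficiency is implemented. In particular, a pure power law has no asymptote, so it cannot by itself yield a saturation point.

```python
from math import exp, log

def fit_power_law(sizes, errors):
    """Fit error = a * n**(-b) by linear least squares on (log n, log error)."""
    xs = [log(n) for n in sizes]
    ys = [log(e) for e in errors]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return exp(intercept), -slope

def predict_error(a, b, n):
    """Extrapolated error at training-set size n."""
    return a * n ** -b

# Synthetic learning curve generated from a = 2.0, b = 0.5 (not real results).
train_sizes = [100, 200, 400, 800]
measured_errors = [2.0 * n ** -0.5 for n in train_sizes]

a_hat, b_hat = fit_power_law(train_sizes, measured_errors)
error_at_1600 = predict_error(a_hat, b_hat, 1600)
```

Fitting one curve per candidate model makes it possible to compare them at a data budget that has not yet been collected.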
Analyze and interpret model errors
Go beyond top-line metrics to perform detailed error analysis. Visualize the false positives and false negatives to understand why the model is failing (e.g., it confuses similar objects, fails on small objects, or struggles in low light).
Combining multiple DataEval tools (the Select class, imagestats(),
labelstats(), cluster(), balance(), and diversity())
allows false positives and false negatives to be analyzed further.
Monitor model performance
Implement monitoring to track operational metrics (latency, throughput) and to detect data drift. Analyze why a model’s performance is decaying in production by comparing the distribution of image statistics (or embeddings) between the new data and the training data, then propose a retraining or calibration strategy.
DataEval has a set of drift
and out-of-distribution (OOD) detection functions, along with
divergence() and label_parity(), to identify differences between operational
and training distributions of both images and labels.
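Comparing the distribution of an image statistic between training and operational data can be illustrated with the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distribution functions. The standard-library sketch below is one possible drift signal, not a description of DataEval's detectors. The function name and the brightness values are made up.

```python
from bisect import bisect_right

def ks_statistic(reference, production):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    ref, prod = sorted(reference), sorted(production)
    return max(
        abs(bisect_right(ref, v) / len(ref) - bisect_right(prod, v) / len(prod))
        for v in ref + prod
    )

# Made-up normalized mean brightness per image.
train_brightness = [0.40, 0.42, 0.45, 0.47, 0.50]
field_brightness = [0.70, 0.72, 0.75, 0.78, 0.80]  # clearly brighter scenes

drift_score = ks_statistic(train_brightness, field_brightness)
no_drift_score = ks_statistic(train_brightness, train_brightness)
```

A score near 0 indicates matching distributions and a score near 1 indicates almost no overlap. Turning the score into an alert requires a threshold or p-value chosen for the sample sizes involved.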