Dataset prioritization¶
This page describes dataset prioritization, including the theory and application of the tools, how to use them, and why to use them.
What is dataset prioritization?¶
Dataset prioritization—sometimes referred to as pruning—is the process of ranking instances in a dataset to focus training on a balanced representative subset of data, select new data for training that is least redundant with existing data and, under the right circumstances, to improve model performance by training on a reduced dataset. Dataset prioritization is most important in problems where the dataset is either too large to manage or in problems with high natural redundancy such as full motion video data or overhead imagery with abundant and redundant background samples. DataEval’s approach to prioritization consolidates multiple state-of-the-art approaches for scoring the difficulty, or conversely the representativeness, of individual instances and multiple policies for prioritizing the instance scores.
Why should you prune your dataset?¶
Prioritizing your dataset can lead to faster development cycles, better performing models, and better resource efficiency. Most real-world datasets intrinsically contain redundancy—e.g. targets measured on similar backgrounds or many redundant looks at a single target or class—and models trained with this natural redundancy may require more computational resources and ultimately underperform models trained with a more balanced dataset. Prioritization mitigates redundancy and balances datasets for model development.
Prioritized subsets of data can benefit model developers and engineers in the following ways:
Faster Development Cycles Training on a prototypical subset of data enables model developers and T&E engineers to iterate more quickly, saving time and both physical and human resources.
Better Performing Models Operational datasets are not naturally balanced, and when models are trained with imbalanced data, models tend to underperform, particularly in new domains or locations. A prioritized subset of data will retain essential (prototypical) samples but include samples that are challenging for a model while reducing amount of data needed to train and iterate during model development. Sample selection policies can target an even distribution of classes, while respecting priority, or balance across sample difficulty. Pruning a highly-redundant dataset can in some cases improve model performance on a held-out test set compared to a model trained on a larger but imbalanced dataset.
Resource Efficiency DataEval deprioritizes redundant data samples. Overly-redundant instances in a dataset may implicitly emphasize certain target types or presentations, leading to a biased model. Furthermore, training on data that provides negligible additional information content wastes computational and storage resources as well as developer time.
Prioritization provides a mechanism to understand semantic similarity of new data compared an existing dataset. When faced with newly collected unlabeled data, subject matter experts can use prioritization to identify which samples will contribute most to the existing dataset prior to labeling, saving valuable time and human resources needed for manual labeling.
When should you prune and prioritize your dataset?¶
Typical times to prioritize and prune datasets are
During initial dataset development: Prioritizing data during initial dataset development is used to form a core set of data for training or Test and Evaluation (T&E). When it is appropriate to subsample a dataset, prioritization identifies a representative subset that is used for rapid iterative model development. Prioritization is conducted during Exploratory Data Analysis (EDA) stages after initial cleaning and
Outliers.After collecting new data: Prioritization tools should be used when incorporating new data into an existing dataset, using prioritization scores to identify which of the unlabeled data contribute most to the existing dataset.
When datasets contain known redundancy: Operational datasets often contain obvious redundancy and imbalance—e.g. oversampled secondary/background classes, undersampled high-value targets, etc. When this is the case, dataset prioritization is often beneficial and may improve model performance by mitigating redundancy-induced dataset imbalance.
Theory behind dataset prioritization¶
DataEval’s approach to prioritization and pruning involves several stages including dataset embedding (dimensionality reduction), sample scoring and prioritization according to a selection policy. Below we discuss each stage and permutations of options that are available to customize the procedure to new datasets.
First, however, it is important to contextualize some of the discussion about dataset pruning in general. We have found the literature to be occasionally contradictory, where claims reported by one author cannot be reproduced by another (or us!). The point of this disclaimer is to emphasize that performance of dataset prioritization strongly depends on the dataset itself, and no single algorithm is universally optimal for all datasets. Pruning with image datasets is also further complicated by a mismatch between the space in which we prune and the space over which the resulting model operates. Training on a subset of data induces a different manifold from what the data were pruned on. Or, in other words, we prune with one projection and perform inference with another. Below we try to provide some intuition for using the prioritization tools successfully.
Embedding image data¶
When conducting dataset analysis, a meaningful representation of the data is often critical for generating meaningful results. For dataset pruning, we are interested in measuring redundancy, so as long as the embedding approach captures meaningful information—e.g. texture information, background properties, semantic information, etc. as is relevant to the problem—pruning algorithms will in turn provide meaningful results.
Choosing a model and task for training the embedding model are important to consider before pruning. For reasonably balanced datasets, self-supervised embeddings are likely to be adequate and do not require labels for training. However, if the initial dataset is highly imbalanced—measured by class frequencies or target size relative to image size—self-supervised methods may not be appropriate. They are likely to encode spatial or frequency biases into the projection.
Supervised models will better embed datasets than unsupervised models if target classes of interest are either rare in frequency or small in size. In this case, supervised training will guide the model to focus on the salient properties of the image—e.g. small-target signatures whose contribution would otherwise be overwhelmed by abundant background. The challenge, at least conceptually, with using a supervised model to embed data is that the problem becomes circular. The initial dataset is used to train a model, the dataset is prioritized using the trained embedder, and then another model (potentially the same architecture) is trained on a subset of the initial dataset. It is perfectly reasonable to iteratively prioritize a dataset in this way given appropriate resources, and the initial effort of doing so could enable and benefit future iteration with compact datasets.
Object detection datasets can present embedding challenges, especially if objects are small relative to chip size or image size. This is a case of spatial imbalance, discussed above, where the salient content of the image is spatially underrepresented. Using object detection labels to form or weight embeddings can mitigate spatial imbalance; however, a straightforward unsupervised embedder is likely to still capture redundancy in background types or other dominant features.
Finally, it is reasonable to use pre-trained models to embed datasets depending on the domain of the data and availability of models. For instance, a model trained on Imagenet is appropriate for other natural image datasets, but it might not be appropriate for sonar data. A key factor when deciding whether to use a pre-trained model is whether the target dataset is out-of-distribution with respect to the data used to train the model. It is difficult to predict how a model will embed out-of-distribution data, so exercise caution when embedding with pre-trained models.
Sample scoring¶
Before selecting or ranking instances of a dataset, prioritization algorithms begin by scoring samples using some measure of prototypicality or task difficulty. There are many scoring strategies available, and DataEval focuses on those that only require access to the dataset and optionally to labels. Other methods in recent literature require access to models, gradients, or scores throughout the training pipeline requiring deeper integration with the model training pipeline than is typically allocated for dataset analysis. Also note that none of the core scoring algorithms used in DataEval require labels or annotations, making them suitable for prioritizing unlabeled data as well.
K-nearest neighbor scoring¶
Our approach to scoring samples in latent space according to k-NN distances is
related to its role in coverage(). The score assigned to each instance
is the k-NN distance for fixed \(k\), e.g. \(k =50\). In densely sampled regions of
latent space, k-NN distances will be small, indicating a degree of redundancy.
Samples with large k-NN distances correspond to regions that are less densely
sampled and likely to be less prototypical. Put another way, challenging samples
(with respect to the discrimination task) are more likely to occur in
undercovered regions of the latent space manifold, while relatively ‘easy’
samples are likely to occur in densely sampled regions of the manifold.
DataEval applies ranking policies directly to the k-NN distance statistic.
K-means scoring¶
Another approach to sample scoring commonly used in the literature—e.g.
[1],
[2]—is to cluster the data and use
inter- and intra-cluster distances to compute scores. The simplest approach is
to score each sample according to its distance to cluster centroid. Not unlike
k-NN scoring, prototypical samples will live close to the cluster centroid,
regardless of what labels would be assigned to the sample or cluster, and
challenging samples near the decision boundary are likely to be far from the
cluster centroid. In DataEval, the kmeans_distance scoring approach directly
uses distance to cluster centroid as the prioritization statistic.
The kmeans_complexity is based on [2] and
uses the product of average inter-cluster distance and intra-cluster distance of
the nearest 20 clusters. The product of these two average distances is a
complexity score assigned to each cluster. Sampling from clusters in proportion
to cluster complexity scores, we sort the dataset. This approach begins to blur
the lines between sample scoring and re-ranking, but it aims to address
exacerbation of class imbalance without explicitly using class labels.
Optional use of class labels in prioritization policies can more directly
address issues of class imbalance.
Prioritization policies¶
With a set of scored samples, nominally sorted by prototypicality, we may use
prioritization policies to produce a balanced and effective dataset for model
development and T&E. The most straightforward policies, articulated and
discussed in
[1],
are easy_first and hard_first. While the names are somewhat reductive,
these polices lead to samples sorted by score in increasing and decreasing
order, respectively. The student-teacher models in [1] may inform policy
selection, but the core algorithm favored by the authors on real datasets is
k-means clustering with hard_first prioritization. The easy_first policy is
generally preferred for very small datasets where a model needs to learn the
essential concepts, while the hard_first policy tends to include data needed
to sample the decision boundary for higher-performing models. For moderate
sized datasets and anything short of very aggressive pruning, we have found that
hard_first leads to better performing models than easy_first. However,
with image datasets, we found that these two policies often did not lead to
improved performance over random pruning.
DataEval also includes a stratified re-ranking policy, described in
[3]. This approach bins the range of
scores and draws samples uniformly from each bin, leading to a mixture of
challenging and prototypical samples. This approach avoids oversampling dense
regions of score-space—e.g. many samples with very small scores and high
redundancy. In our testing, we found that this approach leads to downstream
model performance improvement when compared with random decimation, something we
could not consistently demonstrate with hard_first and easy_first.
Finally, DataEval provides a class_balance policy that samples evenly from
each class and by priority within each class, requiring that the user provide
class labels. This approach was used in
[1]
after the authors found that pruning exacerbates class imbalance. Especially
when the initial dataset is extremely unbalanced, this level of direct control
over class balance can be beneficial.
Default settings¶
There are many permutations of options available, but we recommend starting with
the stratified policy and either knn or kmeans_distance scoring. In our
testing, over several datasets, neither consistently outperformed the other, and
they both tended to outperform random decimation. On a tabular dataset derived
from satellite imagery, knn scoring with hard_first led to significant
performance improvements over the unpruned dataset; however, on relatively clean
image datasets (CIFAR, Imagenet) stratified re-ranking was necessary to see
improvement over random decimation.
Related concepts¶
Dataset pruning is closely related to various forms of data cleaning in DataEval
but aims to shape and balance the distribution of samples in a subset of data.
Low-level data cleaning identifies and measures invalid
or unexpected low-level image or label properties, while prioritization operates
in a latent (feature) space, focusing on semantic redundancy. The
clusterer() measures and identifies near duplicates and outlier samples
in the dataset, similarly to prioritization, but is not focused on sampling the
dataset. coverage() is closely related to prioritization and
characterizes regions of the latent space (notionally, concepts) that are
undersampled, while prioritization provides a recommended ranking of samples to
include in a representative subset or to incorporate into an existing dataset.
Prioritization tools aim to provide recommendations for both initial and ongoing
dataset curation.
More broadly, in AI/ML, prioritization is related to
Coreset selection, dataset pruning
Curriculum development in active learning
Loss weighting strategies that emphasize challenging samples, e.g. hard negative mining, focal loss, etc.
References and related work¶
Sorscher, B., Geirhos, R., Shekhar, S., Ganguli, S., & Morcos, A. (2022). Beyond neural scaling laws: beating power law scaling via data pruning. Advances in Neural Information Processing Systems, 35, 19523-19536.
This paper discusses k-means pruning and compares to many other pruning approaches, some of which require access to the model-training pipeline. We attempted to reproduce the results of Fig. 5c but were unable to outperform random pruning on CIFAR with pretrained embeddings. See Fig. 4 in Ref. [4] for corroboration of this assertion about k-means scoring by itself.
Sorscher et al. also introduce a theoretical framework for understanding the optimal tradeoff between selecting prototypical samples and more challenging samples.
Abbas, A., Rusak, E., Tirumala, K., Brendel, W., Chaudhuri, K., & Morcos, A. S. (2024). Effective pruning of web-scale datasets based on complexity of concept clusters. arXiv preprint arXiv:2401.04578.
This paper introduces the idea of cluster complexity (augments k-means scoring) to optimize coverage over the dataset and indirectly mitigate class imbalance.
Zheng, H., Liu, R., Lai, F., & Prakash, A. (2022). Coverage-centric coreset selection for high pruning rates. arXiv preprint arXiv:2210.15809.
This paper introduces stratified resampling (as it is called in DataEval). Stratified re-ranking led to improvement even on well-balanced datasets, e.g. CIFAR10. However, there is some difficulty in interpreting and directly comparing results to other papers, e.g. [1], because Zheng et al. pre-prune by a variable amount that is treated as a hyperparameter. This may be a reasonable approach if test performance can be repeatedly estimated, but oftentimes it is not.
Maharana, A., Yadav, P., & Bansal, M. (2023). D2 pruning: Message passing for balancing diversity and difficulty in data pruning. arXiv preprint arXiv:2310.07931.