How to conform and merge datasets with different label vocabularies

Problem statement

You have a curated reference dataset, and a second incoming dataset arrives labeled under a different scheme: different class names, different (or reordered) integer indices, and an overlapping-but-not-identical set of classes. You cannot simply concatenate them — index 0 might mean "sedan" in one and "submarine" in the other, and one dataset may contain classes the other has never seen.

This guide conforms the incoming dataset to your reference vocabulary (an Ontology) and then merges the two into a single dataset you can analyze together. It builds on ontology alignment: alignment establishes how the labels correspond; Conform with Relabel applies that to the data, and merge_datasets() combines the conformed results.

When to use

  • combining annotation sources that name or index their classes differently

  • bringing an incoming dataset into a fixed reference taxonomy before analysis

  • reconciling a partially-overlapping class set against a reference vocabulary

What you will need

  1. A reference Ontology (the target vocabulary).

  2. Two datasets to combine.

  3. dataeval installed (pip install dataeval).

from collections import Counter
from collections.abc import Iterable, Mapping

import numpy as np

from dataeval import Ontology
from dataeval.core import label_alignment
from dataeval.data import Conform, Relabel, merge_datasets
from dataeval.protocols import DatasetMetadata

A tiny dataset for the example

Conforming and merging act purely on labels and index2label, so the images are irrelevant here — we use a small in-memory dataset with blank images to keep the example fast and self-contained. A real AnnotatedDataset (e.g. a maite-datasets loader or your own) works identically.

class ToyDataset:
    """A minimal image-classification dataset: one-hot targets + an index2label."""

    def __init__(self, dataset_id: str, labels: Iterable[int], index2label: Mapping[int, str]) -> None:
        self._labels = list(labels)
        self._index2label = dict(index2label)
        self.metadata = DatasetMetadata(id=dataset_id, index2label=self._index2label)

    def __len__(self) -> int:
        return len(self._labels)

    def __getitem__(self, index: int):
        onehot = np.zeros(len(self._index2label), dtype=np.float32)
        onehot[self._labels[index]] = 1.0
        return np.zeros((3, 8, 8), dtype=np.float32), onehot, {"id": index}


def labels_from_counts(counts: dict[str, int], index2label: dict[int, str]) -> list[int]:
    """Expand ``{class_name: count}`` into a per-image list of label indices."""
    name_to_index = {name: index for index, name in index2label.items()}
    return [name_to_index[name] for name, count in counts.items() for _ in range(count)]

1. A reference vocabulary

The reference vocabulary is an Ontology — the label space everything will be expressed in. Here is a taxonomy of vessels grouped by domain (air / land / water). Note it has no undersea branch: "submarine" is therefore out-of-vocabulary and will be dropped when conforming.

reference_ontology = Ontology.from_hierarchy({
    "aircraft": {"airliner": None, "fighter jet": None},
    "land vehicle": {"sedan": None, "pickup truck": None},
    "watercraft": {"frigate": None, "cargo ship": None},
})
print(reference_ontology)
print("reference vocabulary:", {i: reference_ontology.concept(c).label for i, c in enumerate(reference_ontology.ids)})
Ontology(9 concepts, 3 roots, 6 leaves, 0 external)
reference vocabulary: {0: 'aircraft', 1: 'airliner', 2: 'fighter jet', 3: 'land vehicle', 4: 'sedan', 5: 'pickup truck', 6: 'watercraft', 7: 'frigate', 8: 'cargo ship'}

2. The reference dataset, conformed to the vocabulary

Our reference data uses its own label ordering. We align its class names to the ontology and conform it. Every class is in the vocabulary, so this is lossless — it just re-expresses the labels in the reference’s index space.

REFERENCE_VOCAB = {0: "sedan", 1: "pickup truck", 2: "frigate", 3: "cargo ship"}
reference_raw = ToyDataset(
    "reference",
    labels_from_counts({"sedan": 6, "pickup truck": 4, "frigate": 5, "cargo ship": 3}, REFERENCE_VOCAB),
    REFERENCE_VOCAB,
)

reference_alignment = label_alignment(reference_raw.metadata.get("index2label", {}).values(), reference_ontology)
reference = Conform(reference_raw, [Relabel(reference_alignment["class_remap"], reference_ontology)])

print("reference images:", len(reference))
print("reference now uses the vocabulary:", reference.metadata.get("index2label"))
reference images: 18
reference now uses the vocabulary: {0: 'aircraft', 1: 'airliner', 2: 'fighter jet', 3: 'land vehicle', 4: 'sedan', 5: 'pickup truck', 6: 'watercraft', 7: 'frigate', 8: 'cargo ship'}

3. An incoming dataset in its own label scheme

The incoming data uses a different, reordered index2label (note submarine is index 0 here), and an overlapping-but-different class set: sedan and frigate overlap the reference, fighter jet is new (but in the vocabulary), and submarine is not in the reference vocabulary at all.

INCOMING_VOCAB = {0: "submarine", 1: "frigate", 2: "sedan", 3: "fighter jet"}
incoming = ToyDataset(
    "incoming",
    labels_from_counts({"submarine": 4, "frigate": 3, "sedan": 5, "fighter jet": 2}, INCOMING_VOCAB),
    INCOMING_VOCAB,
)
print("incoming vocabulary:", incoming.metadata.get("index2label"))
incoming vocabulary: {0: 'submarine', 1: 'frigate', 2: 'sedan', 3: 'fighter jet'}

4. Align the incoming labels to the reference vocabulary

label_alignment() relates the incoming class names to the reference ontology. Matching is by name, so the reordered indices are irrelevant — sedan, frigate, and fighter jet map by equivalence, while submarine is out-of-vocabulary.

incoming_alignment = label_alignment(incoming.metadata.get("index2label", {}).values(), reference_ontology)
for c in incoming_alignment["correspondences"]:
    print(f"  {c.source:>12}  {c.relation:<11} -> {c.target}")
print("out-of-vocabulary:", incoming_alignment["unaligned_source"])
       frigate  equivalent  -> frigate
         sedan  equivalent  -> sedan
   fighter jet  equivalent  -> fighter jet
out-of-vocabulary: ('submarine',)

5. Conform the incoming dataset

Relabel applies the alignment: it drops the out-of-vocabulary submarine images, rewrites the remaining labels into the reference index space, and replaces the dataset’s index2label with the reference vocabulary.

relabel = Relabel(incoming_alignment["class_remap"], reference_ontology)
incoming_conformed = Conform(incoming, [relabel])

print("kept", len(incoming_conformed), "of", len(incoming), "images")
print("dropped (out-of-vocabulary):", dict(relabel.dropped))
print(
    "incoming now uses the reference vocabulary:",
    incoming_conformed.metadata.get("index2label") == reference.metadata.get("index2label"),
)
kept 10 of 14 images
dropped (out-of-vocabulary): {0: 'submarine'}
incoming now uses the reference vocabulary: True

6. Merge

Both datasets now share one index2label, so merge_datasets() can combine them into a single dataset view. The per-class counts show the union: sedan and frigate come from both datasets, pickup truck/cargo ship from the reference, and fighter jet from the incoming data.

merged = merge_datasets(reference, incoming_conformed)
print("merged images:", len(merged))

index2label = merged.metadata.get("index2label", {})
counts = Counter(index2label[int(np.argmax(datum[1]))] for datum in merged)
print("per-class counts:", dict(counts))
merged images: 28
per-class counts: {'sedan': 11, 'pickup truck': 4, 'frigate': 8, 'cargo ship': 3, 'fighter jet': 2}

Conforming first is what makes the merge sound. Merging the datasets before conforming fails, because their label vocabularies do not line up:

try:
    merge_datasets(reference, incoming)
except ValueError as error:
    print("without conforming:", str(error).splitlines()[0])
without conforming: merge_datasets requires all datasets to share the same 'index2label'. Conform them to a common vocabulary first (see dataeval.data.Conform / Relabel).

Summary

  • A reference Ontology defines the shared target vocabulary.

  • label_alignment() relates each dataset’s class names to that vocabulary (by name, so reordered indices don’t matter); Conform with Relabel applies it — rewriting labels, resizing the label space, and dropping out-of-vocabulary classes.

  • Once datasets share an index2label, merge_datasets() combines them into one dataset; it refuses datasets whose vocabularies differ, so conforming is a prerequisite, not an afterthought.

  • Conform is a general seam: Relabel is the first conformer, with metadata- and value-conforming operations to follow.