How to align two label spaces¶
Problem statement¶
Two datasets rarely name things the same way. One annotates "motorbike" where
another writes "motorcycle"; one has a single "car" class where another
distinguishes "sedan" from "suv". Before annotations from different sources
can be compared, graded against one another, or combined, their labels must be
related: which concept here corresponds to which concept there, and how.
Establishing those correspondences is ontology alignment. DataEval performs
it with label_alignment(), which maps a source vocabulary onto a target
(reference) Ontology and reports, for each source class, the typed
correspondence it found — and whether the source can be expressed in the target
without loss.
When to use¶
Use this workflow when you have two label vocabularies and want to:
find which source classes are equivalent to a target class (a safe rename)
find which are narrower (a fine source class that can be safely coarsened up to a more general target) or broader (a coarse source class that spans several finer target classes — a granularity mismatch)
get a class_remap (
source -> target) for carrying source labels into the target vocabulary, and a mergeability verdict on how completely that can be done
Aligning a structureless source — a bare list of class names — against the
target is exactly label reconciliation
(label_reconciliation()) plus structural inference; label_alignment
generalizes it.
What you will need¶
A source vocabulary (a list of class names, or an
Ontology).A target/reference
Ontology.Nothing else for the exact/structural core — it has no extra dependencies. The optional fuzzy-matching recipe in section 4 uses
rapidfuzz(pip install rapidfuzz).
from dataeval import Ontology
from dataeval.core import label_alignment
1. A reference vocabulary¶
Alignment maps a source onto a target — the reference vocabulary everything is
expressed in. Here is a small two-level reference taxonomy, built dependency-free
with Ontology.from_hierarchy().
reference = Ontology.from_hierarchy({
"vehicle": {"car": {"sedan": None, "suv": None}, "truck": None},
"animal": {"dog": None, "cat": None},
})
print(reference)
Ontology(8 concepts, 2 roots, 5 leaves, 0 external)
2. Align a list of class names¶
The simplest source is another dataset’s class names. Pass them to label_alignment()
with the reference. The result is a LabelAlignmentResult TypedDict.
result = label_alignment(["sedan", "truck", "spaceship"], reference)
print("keys:", list(result))
keys: ['correspondences', 'unaligned_source', 'unaligned_target', 'class_remap', 'mergeability']
Each accepted mapping is a Correspondence — a typed
⟨source, target, relation, confidence⟩ tuple, with a note of the matcher that
produced it:
for c in result["correspondences"]:
print(f" {c.source:>10} {c.relation:<10} {c.target:<8} ({c.confidence:.2f}, {c.matcher})")
sedan equivalent sedan (1.00, exact)
truck equivalent truck (1.00, exact)
class_remap is the actionable artifact: the source -> target rewrite for the
correspondences that license one (equivalence and coarsening). unaligned_source
and unaligned_target are read open-world — an unaligned concept is
out-of-vocabulary with respect to the other side, not invalid.
print("class_remap: ", dict(result["class_remap"]))
print("unaligned_source:", result["unaligned_source"])
print("mergeability: ", result["mergeability"])
class_remap: {'sedan': 'sedan', 'truck': 'truck'}
unaligned_source: ('spaceship',)
mergeability: partial
"spaceship" is out-of-vocabulary, so the source is only partially
expressible in the reference. On a structureless source this matches
label_reconciliation() exactly — the equivalence class_remap is
the matched set:
from dataeval.core import label_reconciliation
rec = label_reconciliation(["sedan", "truck", "spaceship"], reference)
print("reconciliation matched:", dict(rec["matched"]))
reconciliation matched: {'sedan': 'sedan', 'truck': 'truck'}
3. Reasoning across granularity¶
The point of carrying a relation on each correspondence is that the relation, not just the pairing, says what the data permits. The asymmetry between coarsening and splitting is the heart of it.
Source coarser than target → broader (a granularity mismatch)¶
A model that only predicts "car" cannot be scored against the reference’s
"sedan"/"suv" ground truth without acknowledging that "car" spans both.
label_alignment flags this with broader correspondences (diagnostics — they are
deliberately excluded from class_remap, because splitting a coarse label into
finer ones is not licensed by the relation alone).
coarse = label_alignment(["car"], reference)
for c in coarse["correspondences"]:
print(f" {c.source} {c.relation:<10} {c.target}")
print("class_remap:", dict(coarse["class_remap"]), "| mergeability:", coarse["mergeability"])
car equivalent car
car broader sedan
car broader suv
class_remap: {'car': 'car'} | mergeability: lossless
Source finer than target → narrower (safe coarsening)¶
The reverse direction is safe: every sedan is a car, so a fine source class can
always be coarsened up to a more general target. When the source carries its own
hierarchy, label_alignment propagates an equivalence anchor down to its descendants. Here
the reference is coarse (car is a leaf) and the source is fine.
coarse_reference = Ontology.from_hierarchy({"vehicle": {"car": None, "truck": None}})
fine_source = Ontology.from_hierarchy({"car": {"sedan": None}})
result = label_alignment(fine_source, coarse_reference)
for c in result["correspondences"]:
print(f" {c.source:>6} {c.relation:<10} {c.target}")
print("class_remap:", dict(result["class_remap"]), "| mergeability:", result["mergeability"])
car equivalent car
sedan narrower car
class_remap: {'car': 'car', 'sedan': 'car'} | mergeability: lossy
"sedan" is coarsened to "car" — valid, but now "sedan" and "car" both map
to "car" and can no longer be told apart, so the source is only lossily
expressible. (A source is lossless only when its class_remap is injective.)
4. Bring your own matcher: fuzzy name matching¶
Exact matching anchors on labels, synonyms, and ids; typos and word-order
variants slip past it. The matchers= argument is the extension seam: any
object implementing the Matcher protocol can propose additional
correspondences for the source concepts the exact pass left unanchored.
DataEval ships the protocol and the dependency-free engine, not specific matchers
— string similarity, embeddings, and instance overlap are different, tunable
strategies. Here is a compact fuzzy matcher built on
rapidfuzz (pip install rapidfuzz)
that you can adapt or swap out.
from collections.abc import Iterable
from rapidfuzz import fuzz, process
from dataeval.types import Correspondence, OntologyConcept
class FuzzyMatcher:
"""A Matcher proposing equivalences by fuzzy string similarity."""
def __init__(self, threshold: float = 0.85):
self.threshold = threshold
# A matcher only needs to iterate concepts (id / label / synonyms) — not the
# full Ontology — so it accepts any iterable of OntologyConcept.
def __call__(self, source: Iterable[OntologyConcept], target: Iterable[OntologyConcept]) -> list[Correspondence]:
# case-folded target label/synonym -> concept id
names = {n.casefold(): c.id for c in target for n in (c.label, *c.synonyms)}
proposals = []
for concept in source:
match = process.extractOne(
concept.label.casefold(),
list(names),
scorer=fuzz.token_sort_ratio,
score_cutoff=self.threshold * 100,
)
if match is not None:
proposals.append(
Correspondence(
source=concept.id,
target=names[match[0]],
relation="equivalent",
confidence=match[1] / 100,
matcher="fuzzy",
)
)
return proposals
It satisfies the Matcher protocol structurally — no inheritance needed:
from dataeval.protocols import Matcher
print("is a Matcher:", isinstance(FuzzyMatcher(), Matcher))
is a Matcher: True
Pass it via matchers= to catch the near-miss names the exact pass missed:
messy = ["motorcyle", "sedans", "truck pickup", "stop sign"] # 3 near-misses, 1 OOV
catalog = Ontology.from_hierarchy({"vehicle": {"motorcycle": None, "sedan": None, "pickup truck": None}})
fuzzy = label_alignment(messy, catalog, matchers=[FuzzyMatcher(threshold=0.85)])
for c in fuzzy["correspondences"]:
print(f" {c.source:>12} -> {c.target:<14} ({c.confidence:.2f}, {c.matcher})")
print("unaligned_source:", fuzzy["unaligned_source"])
motorcyle -> motorcycle (0.95, fuzzy)
sedans -> sedan (0.91, fuzzy)
truck pickup -> pickup truck (1.00, fuzzy)
unaligned_source: ('stop sign',)
A proposal below the acceptance threshold is withheld rather than guessed —
alignment favors precision over recall, leaving "stop sign" unaligned for
inspection. The same seam accepts an embedding- or instance-based matcher with no
change to label_alignment.
5. Aligning more than two sources¶
To relate several sources, align each one to a single shared reference
(pivot) vocabulary rather than every pair. Correspondences between sources are
then read off through the pivot, and the per-source class_remaps express them all in
one label space.
sources = {
"dataset_a": ["sedan", "truck"],
"dataset_b": ["suv", "dog"],
}
for name, labels in sources.items():
r = label_alignment(labels, reference)
print(f"{name:>10}: class_remap={dict(r['class_remap'])} mergeability={r['mergeability']}")
dataset_a: class_remap={'sedan': 'sedan', 'truck': 'truck'} mergeability=lossless
dataset_b: class_remap={'suv': 'suv', 'dog': 'dog'} mergeability=lossless
Summary¶
label_alignment()maps a source vocabulary (a list of names or anOntology) onto a targetOntology, returning aLabelAlignmentResultof typedCorrespondenceobjects.Relations carry meaning:
equivalent(rename) andnarrower(safe coarsening) populateclass_remap;broaderflags a granularity mismatch and is excluded.mergeabilitysummarizes how completely the source is expressible in the target —lossless,lossy(specificity collapses), orpartial(unaligned classes).The
matchers=seam accepts anyMatcher(e.g. the fuzzy recipe above) to catch near-miss names; below-threshold proposals are withheld, not guessed.Align several sources to one shared pivot ontology to express them in a common label space.
Related concepts¶
Ontology — the taxonomic model alignment operates on and the reconciliation it generalizes, plus the alignment theory: correspondences, relations, matchers, mergeability, and the common cut.
Distribution Shift and Divergence — the distributional differences between sources that remain after their label spaces are aligned.