How to reconcile labels against an ontology
Problem statement
A dataset’s class names rarely live in isolation — they belong to a domain taxonomy. “sedan” and “pickup truck” are both land vehicles; “fighter jet” is an aircraft. Knowing this hierarchy lets you sanity-check labels (does every class actually exist in the reference vocabulary?) and reason about relationships between them (which classes are siblings, which subsume others).
DataEval represents a taxonomy with the Ontology class — a small,
in-memory, strongly-typed graph of concepts — and reconciles a dataset’s class
names against it with label_reconciliation().
When to use
Use this workflow when you have a set of class names (e.g.
index2label.values()) and a reference ontology, and you want to:
check which class names map to known concepts (and which are unmatched or ambiguous)
recover each class’s place in the hierarchy (its is-a path to the root)
understand pairwise relationships (ancestor / descendant / sibling) between your classes
This is exact reconciliation (matching on preferred labels, synonyms, and ids). Fuzzy / semantic normalization of messy labels is a separate, future capability — here, an “unmatched” label simply means “not found in the ontology,” not “invalid.”
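To make "exact reconciliation" concrete, here is a minimal sketch of a three-way match report built from a plain lookup table. It is not DataEval's implementation: the helper names (`build_index`, `exact_match_report`) and the toy vocabulary are hypothetical, and case-folding the names is an assumption made for this illustration.

```python
from collections import defaultdict


def build_index(concepts):
    """Map each case-folded id, label, and synonym to the concept ids that use it."""
    index = defaultdict(set)
    for concept_id, (label, synonyms) in concepts.items():
        for name in (concept_id, label, *synonyms):
            index[name.casefold()].add(concept_id)
    return index


def exact_match_report(class_names, concepts):
    """Split class names into matched, unmatched, and ambiguous by exact lookup."""
    index = build_index(concepts)
    matched, unmatched, ambiguous = {}, [], {}
    for name in class_names:
        hits = sorted(index.get(name.casefold(), ()))
        if len(hits) == 1:
            matched[name] = hits[0]      # exactly one concept uses this name
        elif hits:
            ambiguous[name] = hits       # several concepts share this name
        else:
            unmatched.append(name)       # not found; says nothing about validity
    return matched, unmatched, ambiguous


# Hypothetical toy vocabulary: id -> (preferred label, synonyms)
toy_concepts = {
    "cv:FighterJet": ("Fighter Jet", ("F-16", "Viper")),
    "cv:Snake": ("Snake", ("Viper",)),
    "cv:Sedan": ("Sedan", ()),
}
matched, unmatched, ambiguous = exact_match_report(
    ["sedan", "F-16", "Viper", "rowboat"], toy_concepts
)
```

In this toy vocabulary "Viper" is a synonym of two concepts, so it lands in `ambiguous` rather than being resolved arbitrarily, while "rowboat" is simply absent.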
What you will need
A set of class names to reconcile.
A reference ontology.
A Python environment with dataeval[ontology] installed.
Note
Parsing a reference ontology from an RDF/OWL/JSON-LD source requires the
ontology extra, which pulls in rdflib. Building an ontology in
memory needs no extra dependencies.
Getting started
Import the pieces you need.
from dataeval import Ontology
from dataeval.core import label_reconciliation
from dataeval.types import OntologyConcept
1. Build an ontology
The simplest, dependency-free way to define a taxonomy is from a plain nested
dictionary with Ontology.from_hierarchy(). Mapping values may be
None (a leaf), a list of child labels, or a further nested mapping.
ontology = Ontology.from_hierarchy({
    "vehicle": {
        "land vehicle": {"sedan": None, "pickup truck": None},
        "watercraft": {"frigate": None, "cargo ship": None},
        "aircraft": {"airliner": None, "fighter jet": None},
    },
})
print(ontology)
Ontology(10 concepts, 1 roots, 6 leaves, 0 external)
The repr summarizes the structure: total concepts, roots (top-level
concepts), leaves (most specific), and external references (more on those below).
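The counts in that summary can be checked by hand. The sketch below flattens a nested hierarchy (using the None, list, and nested-mapping value forms described above) into child/parent pairs and counts concepts, roots, and leaves. `flatten_hierarchy` is a hypothetical helper written for this illustration, not part of DataEval, and it assumes every concept appears under a single parent.

```python
def flatten_hierarchy(tree, parent=None, edges=None):
    """Collect (concept, parent) pairs from a nested mapping / list / None hierarchy."""
    if edges is None:
        edges = []
    if tree is None:                      # a leaf: nothing below it
        return edges
    if isinstance(tree, dict):            # nested mapping: recurse into each child
        for label, children in tree.items():
            edges.append((label, parent))
            flatten_hierarchy(children, label, edges)
    else:                                 # a list of leaf labels
        for label in tree:
            edges.append((label, parent))
    return edges


hierarchy = {
    "vehicle": {
        "land vehicle": ["sedan", "pickup truck"],            # list form
        "watercraft": {"frigate": None, "cargo ship": None},  # mapping form
        "aircraft": {"airliner": None, "fighter jet": None},
    },
}

edges = flatten_hierarchy(hierarchy)
concepts = [child for child, _ in edges]
roots = [child for child, parent in edges if parent is None]
parents = {parent for _, parent in edges if parent is not None}
leaves = [name for name in concepts if name not in parents]

print(len(concepts), len(roots), len(leaves))  # 10 1 6
```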
2. Reconcile a dataset’s class names
Pass your class names — typically index2label.values() — to
label_reconciliation(). Here the label set includes a class that isn’t in
the ontology.
index2label = {0: "sedan", 1: "pickup truck", 2: "fighter jet", 3: "rowboat"}
result = label_reconciliation(index2label.values(), ontology)
The return value is a LabelReconciliationResult — a TypedDict whose
keys fall into two groups: a match report (matched, unmatched,
ambiguous) and the recovered hierarchy of the matched classes
(ancestor_paths, external_ancestors, induced_edges, relations). The rest
of this section walks through each.
print("keys:", list(result))
keys: ['matched', 'unmatched', 'ambiguous', 'ancestor_paths', 'external_ancestors', 'induced_edges', 'relations']
Start with the match report — which class names resolved to a concept, and which did not:
print("matched: ", result["matched"])
print("unmatched:", result["unmatched"])
print("ambiguous:", result["ambiguous"])
matched: {'sedan': 'sedan', 'pickup truck': 'pickup truck', 'fighter jet': 'fighter jet'}
unmatched: ['rowboat']
ambiguous: {}
"rowboat" is flagged as unmatched — it isn’t a concept in this ontology.
The remaining classes resolved to concepts. The result also recovers hierarchy
information for the matched classes.
# Each matched class's is-a path, from nearest parent up to the root
for name, path in result["ancestor_paths"].items():
    print(f"{name:>14} <- {' < '.join(path)}")
         sedan <- land vehicle < vehicle
  pickup truck <- land vehicle < vehicle
   fighter jet <- aircraft < vehicle
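Conceptually, an is-a path is just a walk up parent links. A minimal sketch, assuming each concept has at most one parent and the graph has no cycles (`ancestor_path` and the `parent_of` table are hypothetical, written for this illustration):

```python
def ancestor_path(concept, parent_of):
    """Follow single-parent links upward, nearest parent first, ending at the root."""
    path = []
    current = parent_of.get(concept)
    while current is not None:
        path.append(current)
        current = parent_of.get(current)
    return path


# Hypothetical child -> parent table mirroring part of the vehicle taxonomy
parent_of = {
    "vehicle": None,
    "land vehicle": "vehicle",
    "aircraft": "vehicle",
    "sedan": "land vehicle",
    "fighter jet": "aircraft",
}

sedan_path = ancestor_path("sedan", parent_of)
root_path = ancestor_path("vehicle", parent_of)
```

A root has an empty path, which is why the walk starts from the concept's parent rather than the concept itself.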
Pairwise relations describe how matched classes relate to one another
(ancestor, descendant, sibling, or unrelated):
print("sedan vs pickup truck:", result["relations"][("sedan", "pickup truck")])
print("sedan vs fighter jet: ", result["relations"][("sedan", "fighter jet")])
sedan vs pickup truck: sibling
sedan vs fighter jet: unrelated
When your label set includes classes at different levels of the hierarchy,
induced_edges gives the minimal is-a tree connecting just those classes
(intermediate concepts are collapsed):
label_reconciliation(["vehicle", "land vehicle", "sedan"], ontology)["induced_edges"]
[('vehicle', 'land vehicle'), ('land vehicle', 'sedan')]
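One way to picture "collapsing" intermediate concepts: link each selected class to its nearest ancestor that is also selected. The sketch below is an illustration under that assumption, not DataEval's algorithm; the `induced_edges` function here and the single-parent `parent_of` table are hypothetical.

```python
def induced_edges(selected, parent_of):
    """Link each selected class to its nearest selected ancestor, skipping the rest."""
    chosen = set(selected)
    edges = []
    for name in selected:
        current = parent_of.get(name)
        # Climb past ancestors that are not part of the label set
        while current is not None and current not in chosen:
            current = parent_of.get(current)
        if current is not None:
            edges.append((current, name))  # (parent, child) orientation
    return edges


# Hypothetical child -> parent table for one branch of the vehicle taxonomy
parent_of = {"vehicle": None, "land vehicle": "vehicle", "sedan": "land vehicle"}

full_chain = induced_edges(["vehicle", "land vehicle", "sedan"], parent_of)
collapsed = induced_edges(["vehicle", "sedan"], parent_of)
```

When "land vehicle" is left out of the label set, the sketch connects "vehicle" directly to "sedan", which is what collapsing an intermediate concept means here.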
3. Richer ontologies from OWL / RDF / JSON-LD
Real ontologies usually ship as standards-based OWL/RDF/JSON-LD files with
preferred labels, synonyms, and definitions. Parse already-in-memory content
with Ontology.from_rdf() (this requires the dataeval[ontology] extra).
DataEval does not read files itself — load the bytes/text however you like
(here, a small inline JSON-LD document) and pass them in.
JSONLD = """
{
  "@context": {
    "owl": "http://www.w3.org/2002/07/owl#",
    "subClassOf": {"@id": "http://www.w3.org/2000/01/rdf-schema#subClassOf", "@type": "@id"},
    "prefLabel": {"@id": "http://www.w3.org/2004/02/skos/core#prefLabel"},
    "altLabel": {"@id": "http://www.w3.org/2004/02/skos/core#altLabel"},
    "definition": {"@id": "http://www.w3.org/2004/02/skos/core#definition"},
    "cv": "http://example.org/cv#"
  },
  "@graph": [
    {"@id": "cv:Aircraft", "@type": "owl:Class", "prefLabel": "Aircraft"},
    {"@id": "cv:FighterJet", "@type": "owl:Class", "subClassOf": "cv:Aircraft",
     "prefLabel": "Fighter Jet",
     "altLabel": ["F-16", "Viper"],
     "definition": "A fast, maneuverable military aircraft."}
  ]
}
"""
owl_ontology = Ontology.from_rdf(JSONLD, format="json-ld")
print(owl_ontology)
Ontology(2 concepts, 1 roots, 1 leaves, 0 external)
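Because the parser takes content rather than a path, reading the file is left to you. A minimal sketch of that step using only the standard library follows; `load_ontology_text` and the file name `cv.jsonld` are hypothetical, and the final call is shown only as a comment because it needs the dataeval[ontology] extra.

```python
import tempfile
from pathlib import Path


def load_ontology_text(path, encoding="utf-8"):
    """Read a serialized ontology file into a string for a parser to consume."""
    return Path(path).read_text(encoding=encoding)


# Demonstrate with a throwaway file; "cv.jsonld" is a hypothetical file name.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "cv.jsonld"
    source.write_text('{"@id": "http://example.org/cv#Aircraft"}', encoding="utf-8")
    text = load_ontology_text(source)

# The text could then be passed to the parser in the same way as the inline document:
# owl_ontology = Ontology.from_rdf(text, format="json-ld")
```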
Concepts are identified by their IRI, while labels and synonyms are used for
matching. find resolves a name (case-insensitively) across preferred labels,
synonyms, and exact ids — so an annotator’s "F-16" resolves to the canonical
Fighter Jet concept:
print("find('F-16'):", owl_ontology.find("F-16"))
concept = owl_ontology.concept("http://example.org/cv#FighterJet")
print("label: ", concept.label)
print("synonyms: ", concept.synonyms)
print("definition:", concept.definition)
find('F-16'): ('http://example.org/cv#FighterJet',)
label: Fighter Jet
synonyms: ('F-16', 'Viper')
definition: A fast, maneuverable military aircraft.
4. Incomplete (subset) ontologies
Ontologies are frequently distributed as subsets, where a concept’s parent is
referenced but not itself included. DataEval keeps these as external
references rather than failing — they still participate in hierarchy queries, and
label_reconciliation() reports where a class’s is-a path is truncated via
external_ancestors.
subset = Ontology([
    # 'warship' is referenced as a parent but never defined in this subset
    OntologyConcept(id="frigate", label="Frigate", parents=("warship",)),
    OntologyConcept(id="sedan", label="Sedan"),
])
print("external_ids:", subset.external_ids)
subset_result = label_reconciliation(["Frigate", "Sedan"], subset)
print("external_ancestors:", subset_result["external_ancestors"])
external_ids: ('warship',)
external_ancestors: {'Frigate': ['warship']}
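As a rough mental model for this output, an external reference is any parent id that some concept points to but that has no definition of its own. The sketch below illustrates that idea on a plain mapping of id to parent ids; the function names mirror the result keys for readability but are hypothetical stand-ins, not DataEval's code.

```python
def external_ids(concepts):
    """Return parent ids that are referenced but never defined, in first-seen order."""
    seen = []
    for parents in concepts.values():
        for parent in parents:
            if parent not in concepts and parent not in seen:
                seen.append(parent)
    return tuple(seen)


def external_ancestors(concept_id, concepts):
    """List the undefined ancestors at which a concept's is-a path is cut off."""
    found, stack, visited = [], [concept_id], set()
    while stack:
        current = stack.pop()
        for parent in concepts.get(current, ()):
            if parent in visited:
                continue
            visited.add(parent)
            if parent in concepts:
                stack.append(parent)   # defined: keep climbing
            else:
                found.append(parent)   # undefined: the path is truncated here
    return found


# Hypothetical subset: id -> parent ids ('warship' is never defined)
toy_subset = {"frigate": ("warship",), "sedan": ()}

missing = external_ids(toy_subset)
frigate_cut = external_ancestors("frigate", toy_subset)
sedan_cut = external_ancestors("sedan", toy_subset)
```

The DataEval output shown before this sketch reads the same way: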
"Frigate" matched, but its hierarchy is unresolved above warship — useful
for deciding whether the subset is sufficient or the full ontology is needed.
"Sedan" is fully rooted, so it is absent from external_ancestors.
5. Exploring the hierarchy
The Ontology object exposes dependency-free graph queries for ad-hoc
exploration:
print("ancestors(sedan): ", ontology.ancestors("sedan"))
print("siblings(sedan): ", ontology.siblings("sedan"))
print("descendants(vehicle):", ontology.descendants("vehicle"))
print("depth_of(sedan): ", ontology.depth_of("sedan"))
print("leaves: ", ontology.leaves)
ancestors(sedan): ('land vehicle', 'vehicle')
siblings(sedan): ('pickup truck',)
descendants(vehicle): ('land vehicle', 'watercraft', 'aircraft', 'sedan', 'pickup truck', 'frigate', 'cargo ship', 'airliner', 'fighter jet')
depth_of(sedan): 2
leaves: ('sedan', 'pickup truck', 'frigate', 'cargo ship', 'airliner', 'fighter jet')
Extract a focused sub-ontology rooted at any concept with subtree (parent
links pointing outside the subtree are pruned, so the concept becomes a root):
watercraft = ontology.subtree("watercraft")
print(repr(watercraft))
print("ids:", sorted(watercraft.ids))
Ontology(3 concepts, 1 roots, 2 leaves, 0 external)
ids: ['cargo ship', 'frigate', 'watercraft']
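The pruning behaviour can be illustrated without the library. This is a minimal sketch over a plain mapping of id to parent ids: it keeps a concept and everything below it, then drops parent links that leave the kept set. The standalone `subtree` function and the `parents_of` table are hypothetical and only approximate what the method is described as doing.

```python
from collections import deque


def subtree(root, parents_of):
    """Keep `root` and its descendants; drop parent links that leave the subtree."""
    # Invert the parent links so the graph can be walked downward
    children_of = {}
    for child, parents in parents_of.items():
        for parent in parents:
            children_of.setdefault(parent, []).append(child)

    keep = {root}
    queue = deque([root])
    while queue:                          # breadth-first walk from the new root
        current = queue.popleft()
        for child in children_of.get(current, []):
            if child not in keep:
                keep.add(child)
                queue.append(child)

    # Parents outside the kept set are pruned, so `root` ends up with none
    return {c: tuple(p for p in parents_of[c] if p in keep) for c in sorted(keep)}


# Hypothetical id -> parent ids table for part of the vehicle taxonomy
parents_of = {
    "vehicle": (),
    "land vehicle": ("vehicle",),
    "watercraft": ("vehicle",),
    "sedan": ("land vehicle",),
    "frigate": ("watercraft",),
    "cargo ship": ("watercraft",),
}

watercraft_only = subtree("watercraft", parents_of)
```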
Summary
Ontology.from_hierarchy() builds a taxonomy from plain Python with no extra dependencies; Ontology.from_rdf() parses standards-based OWL/RDF/JSON-LD (with the dataeval[ontology] extra).
label_reconciliation() returns a LabelReconciliationResult reporting matched / unmatched / ambiguous classes plus hierarchy (ancestor_paths, induced_edges, relations) and flags truncated hierarchies via external_ancestors.
The Ontology object supports graph queries (ancestors, descendants, siblings, depth_of, subtree, …) for exploration.
Reconciliation here is exact (labels, synonyms, ids). Fuzzy and semantic
normalization of messy raw labels is planned future work — an unmatched
label means “not present in this ontology,” and is the natural input to such a
normalization step.
Related concepts
Ontology — what an ontology is, the vocabulary it uses, and how reconciliation and conformance are defined.
Data Integrity — the other label-quality checks reconciliation sits alongside.