How to reconcile labels against an ontology

Problem statement

A dataset’s class names rarely live in isolation — they belong to a domain taxonomy. “sedan” and “pickup truck” are both land vehicles; “fighter jet” is an aircraft. Knowing this hierarchy lets you sanity-check labels (does every class actually exist in the reference vocabulary?) and reason about relationships between them (which classes are siblings, which subsume others).

DataEval represents a taxonomy with the Ontology class, a small, in-memory, strongly typed graph of concepts, and reconciles a dataset’s class names against it with label_reconciliation().

When to use

Use this workflow when you have a set of class names (e.g. index2label.values()) and a reference ontology, and you want to:

  • check which class names map to known concepts (and which are unmatched or ambiguous)

  • recover each class’s place in the hierarchy (its is-a path to the root)

  • understand pairwise relationships (ancestor / descendant / sibling) between your classes

This is exact reconciliation (matching on preferred labels, synonyms, and ids). Fuzzy / semantic normalization of messy labels is a separate, future capability — here, an “unmatched” label simply means “not found in the ontology,” not “invalid.”

What you will need

  1. A set of class names to reconcile.

  2. A reference ontology.

  3. A Python environment with dataeval[ontology] installed.

Note

Parsing a reference ontology from an RDF/OWL/JSON-LD source requires the ontology extra, which pulls in rdflib. Building an ontology in memory needs no extra dependencies.

Getting started

Import the pieces you need.

from dataeval import Ontology
from dataeval.core import label_reconciliation
from dataeval.types import OntologyConcept

1. Build an ontology

The simplest, dependency-free way to define a taxonomy is from a plain nested dictionary with Ontology.from_hierarchy(). Mapping values may be None (a leaf), a list of child labels, or a further nested mapping.

ontology = Ontology.from_hierarchy({
    "vehicle": {
        "land vehicle": {"sedan": None, "pickup truck": None},
        "watercraft": {"frigate": None, "cargo ship": None},
        "aircraft": {"airliner": None, "fighter jet": None},
    },
})
print(ontology)
Ontology(10 concepts, 1 roots, 6 leaves, 0 external)

The repr summarizes the structure: total concepts, roots (top-level concepts), leaves (most specific), and external references (more on those below).
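To see where those counts come from, the nested mapping can be flattened by hand. The following is a minimal plain-Python sketch, not DataEval's implementation: the flatten_hierarchy helper is hypothetical, and it assumes every concept has at most one parent. It accepts the same three value shapes described above (None, a list of child labels, or a nested mapping).

```python
from collections.abc import Mapping


def flatten_hierarchy(tree, parent=None, parents=None):
    """Walk a nested hierarchy and record each concept's parent (None for a root)."""
    if parents is None:
        parents = {}
    if isinstance(tree, Mapping):
        items = tree.items()
    else:
        # A list of child labels: every entry is treated as a leaf.
        items = ((label, None) for label in (tree or ()))
    for label, children in items:
        parents[label] = parent
        if children:
            flatten_hierarchy(children, label, parents)
    return parents


hierarchy = {
    "vehicle": {
        "land vehicle": {"sedan": None, "pickup truck": None},
        "watercraft": ["frigate", "cargo ship"],  # list form of two leaves
        "aircraft": {"airliner": None, "fighter jet": None},
    },
}

parents = flatten_hierarchy(hierarchy)
roots = [c for c, p in parents.items() if p is None]
with_children = {p for p in parents.values() if p is not None}
leaves = [c for c in parents if c not in with_children]

print(len(parents), len(roots), len(leaves))  # 10 1 6
```

A root is any concept with no parent, and a leaf is any concept that never appears as a parent, which is why this hierarchy has 1 root and 6 leaves.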

2. Reconcile a dataset’s class names

Pass your class names — typically index2label.values() — to label_reconciliation(). Here the label set includes a class that isn’t in the ontology.

index2label = {0: "sedan", 1: "pickup truck", 2: "fighter jet", 3: "rowboat"}

result = label_reconciliation(index2label.values(), ontology)

The return value is a LabelReconciliationResult — a TypedDict whose keys fall into two groups: a match report (matched, unmatched, ambiguous) and the recovered hierarchy of the matched classes (ancestor_paths, external_ancestors, induced_edges, relations). The rest of this section walks through each.

print("keys:", list(result))
keys: ['matched', 'unmatched', 'ambiguous', 'ancestor_paths', 'external_ancestors', 'induced_edges', 'relations']

Start with the match report — which class names resolved to a concept, and which did not:

print("matched:  ", result["matched"])
print("unmatched:", result["unmatched"])
print("ambiguous:", result["ambiguous"])
matched:   {'sedan': 'sedan', 'pickup truck': 'pickup truck', 'fighter jet': 'fighter jet'}
unmatched: ['rowboat']
ambiguous: {}

"rowboat" is flagged as unmatched — it isn’t a concept in this ontology. The remaining classes resolved to concepts. The result also recovers hierarchy information for the matched classes.

# Each matched class's is-a path, from nearest parent up to the root
for name, path in result["ancestor_paths"].items():
    print(f"{name:>14}  <-  {' < '.join(path)}")
         sedan  <-  land vehicle < vehicle
  pickup truck  <-  land vehicle < vehicle
   fighter jet  <-  aircraft < vehicle

Pairwise relations describe how matched classes relate to one another (ancestor, descendant, sibling, or unrelated):

print("sedan vs pickup truck:", result["relations"][("sedan", "pickup truck")])
print("sedan vs fighter jet: ", result["relations"][("sedan", "fighter jet")])
sedan vs pickup truck: sibling
sedan vs fighter jet:  unrelated
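The four relation labels can be pictured with a small parent map. This is an illustrative plain-Python sketch rather than the library's code: the parents dictionary and the ancestors and relation helpers are hypothetical. It assumes that "sibling" means sharing a direct parent, and that relation(a, b) is read as "a is the ... of b".

```python
# Child -> parent map for part of the example taxonomy (None marks the root).
parents = {
    "vehicle": None,
    "land vehicle": "vehicle",
    "aircraft": "vehicle",
    "sedan": "land vehicle",
    "pickup truck": "land vehicle",
    "fighter jet": "aircraft",
}


def ancestors(concept):
    """Return the is-a path from the nearest parent up to the root."""
    path = []
    current = parents[concept]
    while current is not None:
        path.append(current)
        current = parents[current]
    return path


def relation(a, b):
    """Classify how concept a relates to concept b."""
    if a in ancestors(b):
        return "ancestor"
    if b in ancestors(a):
        return "descendant"
    if parents[a] is not None and parents[a] == parents[b]:
        return "sibling"
    return "unrelated"


print(relation("sedan", "pickup truck"))  # sibling
print(relation("sedan", "fighter jet"))   # unrelated
print(relation("vehicle", "sedan"))       # ancestor
```

Under these assumptions, two classes that share only a more distant ancestor (here, vehicle) are unrelated, not siblings.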

When your label set includes classes at different levels of the hierarchy, induced_edges gives the minimal is-a tree connecting just those classes (intermediate concepts are collapsed):

label_reconciliation(["vehicle", "land vehicle", "sedan"], ontology)["induced_edges"]
[('vehicle', 'land vehicle'), ('land vehicle', 'sedan')]
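The collapsing of intermediate concepts is easier to see when a level is missing from the label set. The sketch below is a plain-Python approximation, not DataEval's implementation: the induced_edges helper and the parents map are hypothetical, and a single-parent tree is assumed. Each selected class is linked to its nearest selected ancestor.

```python
# Child -> parent map for part of the example taxonomy (None marks the root).
parents = {
    "vehicle": None,
    "land vehicle": "vehicle",
    "aircraft": "vehicle",
    "sedan": "land vehicle",
    "fighter jet": "aircraft",
}


def induced_edges(selected, parents):
    """Link each selected concept to its nearest selected ancestor."""
    chosen = set(selected)
    edges = []
    for concept in selected:
        current = parents.get(concept)
        # Skip over ancestors that are not part of the selected set.
        while current is not None and current not in chosen:
            current = parents.get(current)
        if current is not None:
            edges.append((current, concept))
    return edges


print(induced_edges(["vehicle", "land vehicle", "sedan"], parents))
# [('vehicle', 'land vehicle'), ('land vehicle', 'sedan')]

print(induced_edges(["vehicle", "sedan"], parents))
# [('vehicle', 'sedan')]
```

In the second call, "land vehicle" is not among the selected classes, so the edge runs directly from "vehicle" to "sedan".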

3. Richer ontologies from OWL / RDF / JSON-LD

Real ontologies usually ship as standards-based OWL/RDF/JSON-LD files with preferred labels, synonyms, and definitions. Parse already-in-memory content with Ontology.from_rdf() (this requires the dataeval[ontology] extra). DataEval does not read files itself — load the bytes/text however you like (here, a small inline JSON-LD document) and pass them in.

JSONLD = """
{
  "@context": {
    "owl": "http://www.w3.org/2002/07/owl#",
    "subClassOf": {"@id": "http://www.w3.org/2000/01/rdf-schema#subClassOf", "@type": "@id"},
    "prefLabel": {"@id": "http://www.w3.org/2004/02/skos/core#prefLabel"},
    "altLabel": {"@id": "http://www.w3.org/2004/02/skos/core#altLabel"},
    "definition": {"@id": "http://www.w3.org/2004/02/skos/core#definition"},
    "cv": "http://example.org/cv#"
  },
  "@graph": [
    {"@id": "cv:Aircraft", "@type": "owl:Class", "prefLabel": "Aircraft"},
    {"@id": "cv:FighterJet", "@type": "owl:Class", "subClassOf": "cv:Aircraft",
     "prefLabel": "Fighter Jet",
     "altLabel": ["F-16", "Viper"],
     "definition": "A fast, maneuverable military aircraft."}
  ]
}
"""

owl_ontology = Ontology.from_rdf(JSONLD, format="json-ld")
print(owl_ontology)
Ontology(2 concepts, 1 roots, 1 leaves, 0 external)

Concepts are identified by their IRI, while labels and synonyms are used for matching. find resolves a name (case-insensitively) across preferred labels, synonyms, and exact ids — so an annotator’s "F-16" resolves to the canonical Fighter Jet concept:

print("find('F-16'):", owl_ontology.find("F-16"))

concept = owl_ontology.concept("http://example.org/cv#FighterJet")
print("label:     ", concept.label)
print("synonyms:  ", concept.synonyms)
print("definition:", concept.definition)
find('F-16'): ('http://example.org/cv#FighterJet',)
label:      Fighter Jet
synonyms:   ('F-16', 'Viper')
definition: A fast, maneuverable military aircraft.
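The lookup rules described in this step (case-insensitive matching across ids, preferred labels, and synonyms) can be pictured as a small name index. The sketch below is plain Python, not DataEval's implementation. The concept records and the build_index and reconcile helpers are hypothetical, and so is the rule that a name hitting more than one concept counts as ambiguous.

```python
from collections import defaultdict

# Hypothetical concept records: id -> (preferred label, synonyms).
concepts = {
    "sedan": ("Sedan", ()),
    "pickup_truck": ("Pickup Truck", ("pickup",)),
    "fighter_jet": ("Fighter Jet", ("F-16", "Viper")),
    "viper_car": ("Dodge Viper", ("Viper",)),
}


def build_index(concepts):
    """Map every case-folded id, label, and synonym to the concept ids it names."""
    index = defaultdict(set)
    for concept_id, (label, synonyms) in concepts.items():
        for name in (concept_id, label, *synonyms):
            index[name.casefold()].add(concept_id)
    return index


def reconcile(names, concepts):
    """Split names into matched, unmatched, and ambiguous by exact lookup."""
    index = build_index(concepts)
    matched, unmatched, ambiguous = {}, [], {}
    for name in names:
        hits = sorted(index.get(name.casefold(), ()))
        if len(hits) == 1:
            matched[name] = hits[0]
        elif hits:
            ambiguous[name] = hits
        else:
            unmatched.append(name)
    return matched, unmatched, ambiguous


matched, unmatched, ambiguous = reconcile(["sedan", "f-16", "Viper", "rowboat"], concepts)
print(matched)    # {'sedan': 'sedan', 'f-16': 'fighter_jet'}
print(unmatched)  # ['rowboat']
print(ambiguous)  # {'Viper': ['fighter_jet', 'viper_car']}
```

Because the index is exact, a typo such as "fightr jet" would land in unmatched. Handling that case belongs to fuzzy normalization, which is outside the scope of this workflow.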

4. Incomplete (subset) ontologies

Ontologies are frequently distributed as subsets, where a concept’s parent is referenced but not itself included. DataEval keeps these as external references rather than failing — they still participate in hierarchy queries, and label_reconciliation() reports where a class’s is-a path is truncated via external_ancestors.
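As a rough mental model, an external reference is any parent id that is mentioned but never defined. The sketch below illustrates that idea in plain Python. It is not the library's implementation, and the concepts mapping and both helper functions are hypothetical.

```python
# Hypothetical subset: concept id -> tuple of parent ids.
concepts = {
    "frigate": ("warship",),  # 'warship' is referenced but not defined here
    "sedan": (),
}


def external_ids(concepts):
    """Parent ids that are referenced somewhere but have no definition."""
    referenced = {p for parent_ids in concepts.values() for p in parent_ids}
    return sorted(referenced - set(concepts))


def external_ancestors(concept, concepts):
    """Undefined ancestors reached by walking up from a concept."""
    found, seen = [], set()
    stack = list(concepts[concept])
    while stack:
        parent = stack.pop()
        if parent in seen:
            continue
        seen.add(parent)
        if parent in concepts:
            stack.extend(concepts[parent])  # keep climbing through defined parents
        else:
            found.append(parent)  # the path is truncated here
    return found


print(external_ids(concepts))                   # ['warship']
print(external_ancestors("frigate", concepts))  # ['warship']
print(external_ancestors("sedan", concepts))    # []
```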

subset = Ontology([
    # 'warship' is referenced as a parent but never defined in this subset
    OntologyConcept(id="frigate", label="Frigate", parents=("warship",)),
    OntologyConcept(id="sedan", label="Sedan"),
])
print("external_ids:", subset.external_ids)

subset_result = label_reconciliation(["Frigate", "Sedan"], subset)
print("external_ancestors:", subset_result["external_ancestors"])
external_ids: ('warship',)
external_ancestors: {'Frigate': ['warship']}

"Frigate" matched, but its hierarchy is unresolved above warship — useful for deciding whether the subset is sufficient or the full ontology is needed. "Sedan" is fully rooted, so it is absent from external_ancestors.

5. Exploring the hierarchy

The Ontology object exposes dependency-free graph queries for ad-hoc exploration:

print("ancestors(sedan): ", ontology.ancestors("sedan"))
print("siblings(sedan):  ", ontology.siblings("sedan"))
print("descendants(vehicle):", ontology.descendants("vehicle"))
print("depth_of(sedan):  ", ontology.depth_of("sedan"))
print("leaves:           ", ontology.leaves)
ancestors(sedan):  ('land vehicle', 'vehicle')
siblings(sedan):   ('pickup truck',)
descendants(vehicle): ('land vehicle', 'watercraft', 'aircraft', 'sedan', 'pickup truck', 'frigate', 'cargo ship', 'airliner', 'fighter jet')
depth_of(sedan):   2
leaves:            ('sedan', 'pickup truck', 'frigate', 'cargo ship', 'airliner', 'fighter jet')

Extract a focused sub-ontology rooted at any concept with subtree (parent links pointing outside the subtree are pruned, so the concept becomes a root):

watercraft = ontology.subtree("watercraft")
print(repr(watercraft))
print("ids:", sorted(watercraft.ids))
Ontology(3 concepts, 1 roots, 2 leaves, 0 external)
ids: ['cargo ship', 'frigate', 'watercraft']
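The pruning behavior can be sketched without the library. The version below is a hypothetical plain-Python helper, not DataEval's subtree. It assumes a single-parent tree, collects everything reachable below the chosen concept, and drops that concept's upward link so it becomes a root.

```python
from collections import deque

# Child -> parent map for the example taxonomy (None marks the root).
parents = {
    "vehicle": None,
    "land vehicle": "vehicle",
    "watercraft": "vehicle",
    "aircraft": "vehicle",
    "sedan": "land vehicle",
    "pickup truck": "land vehicle",
    "frigate": "watercraft",
    "cargo ship": "watercraft",
    "airliner": "aircraft",
    "fighter jet": "aircraft",
}


def subtree(root, parents):
    """Return a child -> parent map restricted to root and its descendants."""
    children = {}
    for concept, parent in parents.items():
        children.setdefault(parent, []).append(concept)

    kept = {root: None}  # prune the link that pointed above the new root
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            kept[child] = node
            queue.append(child)
    return kept


watercraft_parents = subtree("watercraft", parents)
print(sorted(watercraft_parents))        # ['cargo ship', 'frigate', 'watercraft']
print(watercraft_parents["watercraft"])  # None
```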

Summary

  • Ontology.from_hierarchy() builds a taxonomy from plain Python with no extra dependencies; Ontology.from_rdf() parses standards-based OWL/RDF/JSON-LD (with the dataeval[ontology] extra).

  • label_reconciliation() returns a LabelReconciliationResult reporting matched / unmatched / ambiguous classes plus hierarchy (ancestor_paths, induced_edges, relations) and flags truncated hierarchies via external_ancestors.

  • The Ontology object supports graph queries (ancestors, descendants, siblings, depth_of, subtree, …) for exploration.

Reconciliation here is exact (labels, synonyms, ids). Fuzzy and semantic normalization of messy raw labels is planned future work — an unmatched label means “not present in this ontology,” and is the natural input to such a normalization step.

See also

  • Ontology: what an ontology is, the vocabulary it uses, and how reconciliation and conformance are defined.

  • Data Integrity: the other label-quality checks reconciliation sits alongside.