The Psychology of Ontology Harmonization
In which we look into fundamental bottlenecks in the scalability of wide-spectrum data integration and ask which research areas remain largely unexplored 17 years later.
I wrote this post on April 27, 2004, and while the chatter around RDF, XML, schemas and ontologies has died down considerably, the fundamental difficulties I observed back then around large-scale data integration appear to remain valid to this day.
It also shines a light on research into the alignment between independently emergent latent spaces in language models, which (despite amazing recent advances in neural methods) is still largely unexplored territory.
The semantic web, catchy name aside, is supposed to be about increasing the amount of information that machines can process with objective and repeatable results.
This is not that different from what computers have been doing since they were invented, but one key difference is that information is now shaped as a network, where it is much more likely that the information producer is not its only consumer.
This might look similar to what the database world calls data warehouses: aggregations of heterogeneous databases that can be cross-queried and mined for properties that emerge out of the merging. But this is just a superficial comparison: interoperability of reasoning requires not only the notion of the value of a particular piece of information, but its kind, its type, and, more importantly, the relationships that this class of information has with other classes.
People normally find it hard to understand why interoperability requires formalization of semantics; so much so, for example, that the entire XML-oriented tribe and the Web Services stack of protocols and specifications treat structural validity as a substitute for semantics. Those who have really had to make data interoperate in diversely constrained environments know that implicit semantics thru structural validity is not enough.
So, what do we mean by formal semantics?
In both the database and the markup world, the term used for the description of the structural model is "schema":
A database schema indicates the "structure" of the database, sort of the equivalent of the shape of a container, while the data fills it.
A markup schema (DTD, XML Schema, RelaxNG) indicates the structural properties of a markup tree, again describing its shape and defining which tree instances can be considered valid with respect to it.
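To make the limits of structural validity concrete, here is a minimal sketch (in Python with lxml, a tooling choice of mine rather than anything implied by the text above): two documents with completely different meanings validate equally well against the same schema, because the schema constrains shape only.

```python
import io
from lxml import etree

# A schema (here a DTD) constrains structure only: element names,
# nesting and order. It says nothing about what the values mean.
dtd = etree.DTD(io.StringIO("""
<!ELEMENT record (field, field)>
<!ELEMENT field (#PCDATA)>
"""))

# Structurally identical, semantically unrelated documents:
# one pairs an author with a title, the other a disease with a dosage.
doc_a = etree.fromstring("<record><field>Melville</field><field>Moby-Dick</field></record>")
doc_b = etree.fromstring("<record><field>Influenza</field><field>200mg</field></record>")

print(dtd.validate(doc_a))  # True
print(dtd.validate(doc_b))  # True
```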
The semantic web is based on the RDF model, which describes a directed pseudo-graph, the most general graph model, able to describe both the relational model (the one on which databases are built) and the hierarchical model (the one on which markup languages are built).
At the foundation level, the semantic web approach is to start by describing the data model in such a way that it can match all the data models out there.
After that, it uses URIs as unique identifiers for all the entities that make up the graph (nodes and arcs).
RDF is almost always confused with RDF/XML, which is merely one of the possible syntaxes that can be used to serialize an RDF model (in this case, as an XML document). Given that it is normally hard to map a graph onto a tree (or onto a set of relational tables), the RDF/XML syntax looks awkward and utterly verbose to XML people, who then fail to understand the underlying benefits of RDF.
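To illustrate the distinction between model and syntax, here is a short sketch (using the rdflib Python library and a hypothetical example.org namespace, neither of which the original text implies): the graph stays the same, only the serialization changes.

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")  # hypothetical namespace

g = Graph()
# Each triple is one arc of the directed pseudo-graph:
# (subject node) --[predicate arc]--> (object node)
g.add((EX.book1, EX.author, EX.person1))
g.add((EX.person1, EX.name, Literal("Ada Lovelace")))

# The same model, written down two different ways:
print(g.serialize(format="xml"))     # RDF/XML: verbose, tree-shaped
print(g.serialize(format="turtle"))  # Turtle: close to the raw triples
```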
Having the ability to describe, serialize and parse a directed pseudo-graph is a very useful thing, but only when the need for such a complex data structure is recognized.
It must also be understood that such graphs are normally very hard to understand and visualize, because they tend to make explicit all the features that are normally implicit in the other models. This is, in fact, the value of the approach: it reduces implicitness to a mathematical, axiomatic minimum (the graph itself), allowing all information and its relationships to be made explicit and identifiable.
Unfortunately, this is far from enough to achieve semantic interoperability, because even if those nodes and arcs are uniquely identifiable, two independent sources of information will hardly ever identify the same "concept" with the same identifier.
There is a lot of philosophical debate on whether a "concept" can be given an identifier at all. It is a very complex subject, and it borders on losing any scientific meaning, so I will just say that a "concept" should not be considered a collection of characters that make up a word or a sentence in a particular natural language, but merely a uniquely identifiable entity.
Experiments in language learning and the cognitive sciences show that the human brain seems able to capture generalizations with less information than information theory would require (Chomsky showed this by studying the language-learning ability of children, who were able to identify the meaning of terms they had never been exposed to). One possible resolution of this apparent contradiction is the existence of innate archetypes in our cognitive abilities.
This is all good in theory, but in practice the conceptual models used to describe reality in information systems are normally associated with concepts that are named and identified by a word. "author", "title" and "date of birth" are normally found as the 'atomic semantics' of metadata schemata, but they are very far from being abstract enough to be considered 'innate' in a cognitive sense.
This results in psychological dissonance when one has to project her own mental model of the world against somebody else's. This dissonance normally generates friction and the tendency to find a counter-argument, a single exception that would make the model not fit the particular purpose, thus lending economic credibility to the mostly-irrational need to achieve closure and completeness in one's description of the world and, therefore, to fork or modify the effort.
This "babel tendency" is very unlikely to go away, because even if the set of innate cognitive processing functions is finite and likely to be statistically fairly homogeneous (I know no evident reason of why statistical differentiation of cognitive abilities between individuals should be different than, for example, ethical traits), the small differences between individuals are highly regarded, both from an individualistic, social and evolutionary perspective.
This tendency is reflected in the observation that, even if uniquely identifiable, the entities in the graph might never be given the same identifier by two different individuals or groups.
The relationship of identifier equality could connect two otherwise disjoint graphs, but in real life this is very hard to achieve unless supervision and control are applied to the various graphs under merge. While feasible in a small and labor-rich environment, this approach cannot work on a global scale.
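In RDF terms, the merge happens only when somebody explicitly asserts each equality. A minimal sketch (rdflib again, with hypothetical namespaces and identifiers of my own invention):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL

A = Namespace("http://archive-a.example/")  # hypothetical source A
B = Namespace("http://archive-b.example/")  # hypothetical source B

g = Graph()
# Two sources describe the same person under different identifiers;
# until equality is asserted, their subgraphs remain disjoint.
g.add((A.person42, A.name, Literal("H. Melville")))
g.add((B.author7, B.fullName, Literal("Herman Melville")))

# The bridge is a human (or supervised) judgment, made one pair at a
# time, which is exactly why this cannot scale globally:
g.add((A.person42, OWL.sameAs, B.author7))
```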
In order to get out of this identity impasse, the solution has always been to create indirect bridges between two nodes, by introducing intermediate nodes that are general and recognized enough for both sides to connect to. This has led to the creation of ontologies, which are lists of concepts defined in a particular application domain together with the relationships that connect them.
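A sketch of what such a bridge can look like (same rdflib setup, with a hypothetical shared upper ontology): neither side adopts the other's vocabulary; both map their own terms upward to the more abstract concept.

```python
from rdflib import Graph, Namespace
from rdflib.namespace import RDFS

A = Namespace("http://archive-a.example/")       # hypothetical schema A
B = Namespace("http://archive-b.example/")       # hypothetical schema B
UPPER = Namespace("http://upper.example/onto/")  # hypothetical shared ontology

g = Graph()
# Both local properties are declared specializations of the abstract
# concept; a query phrased against UPPER.contributor now covers both
# sources, without either side renaming anything.
g.add((A.author, RDFS.subPropertyOf, UPPER.contributor))
g.add((B.creator, RDFS.subPropertyOf, UPPER.contributor))
```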
The socio-economical benefit of this approach is that instead of forcing everyone to use a single schema, the various instances are mapped to a more abstract conceptual model, and bridges that connect the two worlds are found indirectly thru this higher-abstraction layer. While impeccable from a theoretical point of view, this approach seems to assume that the higher the abstraction, the lower the chance of disagreement and the higher the objectivity in understanding the different models.
My personal experience seems to suggest the exact opposite: the higher the abstraction, the lower the objectivity with which people can argue and, therefore, come to an agreement.
I've spoken to people who spent several years of their lives coming up with an ontology, and their perception is that, over time, the complexity these models need in order to cover a particular domain saturates; it does not continue to grow.
This is a basic but vital assumption for this entire approach to work: if ontologies grow linearly with the amount of information they can describe, the ontology creation/maintenance process simply won't scale.
So, assuming that this is not the case and that ontologies grow less than linearly with the amount of schemas they can cover, the next problem is that unless we reach the level of granularity and abstraction of innate human cognitive processes, no ontology will have the required abstraction to be conclusive per se, and each will have to go thru the same process of "bridging thru abstraction" or "man-made mapping".
While the babel effect is hardly going to go away at the ontological level either, it is safe to say that since ontologies require much more work than simple schemas, the population of such ontologies is very unlikely to grow too big, which keeps the man-made mapping procedure within the scope of socio-economic feasibility.
This process of linking two different ontologies is called, rather diplomatically, harmonization. Harmonizing two ontologies is normally a very hard problem, not only conceptually but also politically and psychologically.
The usefulness and intellectual satisfaction of harmonization (the bridging of two worlds) normally clashes with the tendency to believe that one side is the core while the other is the extension. As normally happens with intellectual creations that lack an objective measure agreed upon by the parties, both tend to see their own work as the core and the other's as the extension.
While interesting from a socio-dynamic perspective, from a purely mathematical perspective which ontology sits at the core is utterly irrelevant, and the distinction blurs even more the more ontologies get harmonized.
Observing the harmonization process between Harmony/ABC and CIDOC/CRM, perhaps the most interesting fact is that the parties were able to come to an agreement on all but two classes, namely "abc:agent" and "crm:actor". This seems to reflect the different mental models of the world that the parties had assumed, which, rather significantly, seem traceable back to the problem of the quantum vs. classical observer, a philosophical debate physicists have had since the introduction of quantum physics.
It is evocative to think of the introduction of the "quark" (a word James Joyce invented in Finnegans Wake) to describe the components of subatomic particles as a way to sidestep the babel problem of getting everyone to agree on the projection of a particular word onto people's mental models.
I can't help thinking about how the principles of Latent Semantic Analysis show that a simple linear-algebra matrix decomposition can generate an equivalent information space that requires fewer dimensions to be described, and achieves the synonym-extraction abilities of a 6-year-old child.
It seems that the algorithm is able to capture the semantics that are latent in the various words and extract the concept that has been encoded in them.
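For the curious, here is a toy rendition of the idea (plain numpy, and a term-document matrix I made up for illustration): a rank-k truncation of the matrix places words that share contexts close together, even when they never co-occur directly.

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents.
# "car" and "automobile" never appear in the same document, but both
# co-occur with "engine"; likewise for the flower/petal/stem cluster.
terms = ["car", "automobile", "engine", "flower", "petal", "stem"]
X = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 0],  # flower
    [0, 0, 0, 1],  # petal
    [0, 0, 1, 1],  # stem
], dtype=float)

# Truncated SVD: keep only the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
term_vectors = U[:, :k] * s[:k]  # terms projected into the latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# In the latent space the synonyms end up (nearly) collinear, while
# unrelated terms stay orthogonal -- without any explicit dictionary.
print(cosine(term_vectors[0], term_vectors[1]))  # car vs automobile: ~1.0
print(cosine(term_vectors[0], term_vectors[3]))  # car vs flower:     ~0.0
```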
Maybe applying LSA-like spectral decompositions to graphs will lead to the creation of these unnamed concepts? Is this the direction that leads to the "holy grail" connection between symbolic and sub-symbolic AI?