The Epistemology of Truth
In which we revisit a post about the need for nuance in data modeling that aged well a full decade later, and comment on how much harder these issues are to ignore today.
I wrote this post on Jan 9th, 2011, more than a decade ago. At that time, I was working on making the Knowledge Graph useful for Google. I'm quoting it here in its entirety and adding commentary at the end.
Every person who deals with data and data integration, especially at large scale and across a wide spectrum, sooner or later needs to make a choice: deciding whether truth is discovered or invented.
Sure, there are various shades between those options, but either you believe in a metaphysical reality of absolute truth that you just have to find a way to discover, or you don't, and what's left is human creation, social contracts, collective subjectivity distilled by hundreds of years of interaction.
Deciding where you stand on this choice dramatically influences how you think, how you work, what you want to build, whom you like to work with, and which efforts you feel drawn to and want to be associated with.
What is also surprising about this choice is how easy it is to make: just like with religion, you either have faith in something you can only aspire to know, or you don't. And neither side can really understand how the other fails to see what's so obvious to them.
This debate about invention vs. discovery, objectivity vs. subjectivity, physical vs. metaphysical, embodied vs. abstract has raged for thousands of years and takes many forms, but what's striking is how divisive it is and how incredibly persistent over time, like a sort of endemic memetic infection (a philosophical cold, if you will).
What interests me about this debate is its dramatic impact on knowledge representation technologies.
The question of truth seems easy enough at first, but it gets tricky very quickly. Let me show you.
Let's start with an apparently bland and obvious statement:
Washington, D.C. is the capital of the United States of America
True or false? Most people would say true without even pausing to think, but people who have dealt with knowledge modeling problems will probably ask "when?" When is this statement supposed to be true: now, or at some other time? So, for example:
Washington, D.C. is the capital of the United States of America in 1799.
True or false? It gets trickier. One has to define what "capital" means and know enough history to understand that the newly formed government of the United States of America was actually assembling in Philadelphia that year. But one could very well claim that Washington, D.C. was being built and was therefore already the capital, even if nobody was living there yet.
As you can see, something as factual, benign, and obvious as knowledge that every elementary school kid knows by heart can immediately become tricky to model symbolically in a way that encompasses disagreements and captures all the nuances of interpretation.
But there are cases where one statement rings more clearly true or false than another:
Washington, D.C. was the capital of the Roman Empire.
given that the Roman Empire ended before Washington ever existed, it is operationally safe to assume this one is false. And yet there are statements whose truth is unknown or unknowable:
Washington, D.C. is the capital of the United States of America in 4323.
and statements (/me winks at Gödel) whose validity we are certain can't be known:
This statement is false.
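Taken together, these examples already defeat a plain boolean. As a minimal sketch (the schema and names are mine, purely illustrative, not any real system's), here is what a statement has to carry to even pose the questions above: a temporal qualifier and a truth status that admits disputed, unknown, and undecidable alongside true and false.

```python
from dataclasses import dataclass
from enum import Enum, auto

# A purely illustrative sketch (my names, not any real system's
# schema): a statement is not a bare boolean; it carries a temporal
# qualifier and a truth status richer than true/false.

class Truth(Enum):
    TRUE = auto()
    FALSE = auto()
    DISPUTED = auto()     # interpretations disagree (the 1799 case)
    UNKNOWN = auto()      # not knowable yet (the year 4323 case)
    UNDECIDABLE = auto()  # provably unknowable ("This statement is false.")

@dataclass
class Statement:
    subject: str
    predicate: str
    obj: str
    year: int | None = None  # the "when?" qualifier
    status: Truth = Truth.UNKNOWN

statements = [
    Statement("Washington, D.C.", "capital of", "USA", 2011, Truth.TRUE),
    Statement("Washington, D.C.", "capital of", "USA", 1799, Truth.DISPUTED),
    Statement("Washington, D.C.", "capital of", "Roman Empire", None, Truth.FALSE),
    Statement("Washington, D.C.", "capital of", "USA", 4323, Truth.UNKNOWN),
]
```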
Unfortunately, when you model knowledge and try to condense it into a form that is mechanically digestible and symbolically consistent, finding a sustainable operational definition of truth becomes unavoidable.
This is where the epistemological debate on truth actually manages to enter the picture: if you think that truth is discovered, you won't accept compromises; you want a process that finds it. But since a metaphysically existing truth is probably never reachable, it is extremely difficult to embrace this philosophy and still obtain an actionable definition of truth.
On the other hand, for those who believe that truth is merely distilled subjectivity and a series of ever-evolving collective social contracts, one solution is to avoid considering truth in terms of absolutes and instead think of statements as 'true enough'. Examples of 'true enough' are "true enough to be understood and agreed upon by a sufficiently large number of people", "true enough to pass the scrutiny of a large enough number of people considered experts in the domain", "true enough to survive years of public edits in a highly visible, publicly edited wiki", etc.
This 'true enough' modus operandi makes it possible to build a knowledge representation that is useful for certain tasks: one that models the world and answers questions in a way that rings true to its users often enough to build trust in all the other results they cannot judge directly.
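As a sketch of what such an operational definition might look like in practice (the signals and thresholds below are my own, chosen only to illustrate the idea), 'true enough' becomes a predicate over social signals rather than a metaphysical judgment:

```python
# A minimal sketch of "true enough" as an operational predicate.
# The signals and thresholds are mine, purely illustrative; a real
# system would tune and combine them very differently.

def true_enough(endorsements: int = 0,
                expert_endorsements: int = 0,
                years_surviving_public_edits: float = 0.0) -> bool:
    """Accept a statement when any one social signal clears its bar."""
    return (endorsements >= 1_000_000              # broad agreement
            or expert_endorsements >= 3            # expert scrutiny
            or years_surviving_public_edits >= 2)  # wiki survival

# "Washington, D.C. is the capital of the USA" passes on sheer agreement.
print(true_enough(endorsements=10_000_000))  # True
```

Each threshold is an editorial choice, which is exactly the point: acceptance here is a social contract, not a discovery.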
The operational definition of truth as the byproduct of emergent coordination has the huge benefit of being actionable, but it has the equally huge problem of degrading rapidly as the number of people who have to share the same knowledge and understanding grows. This is simply a result of the combination of the finite resources available to people (energy, time, communication bandwidth) and the inherent variability of the world.
While this is so obvious as to be almost tautological, it is nevertheless the biggest problem with knowledge representation: the number of assertions that can be considered true, independently and without coordination, decreases rapidly with the number of people who need to naturally resonate with their truth (if each person independently accepted an assertion with probability 99%, the odds of a thousand people all accepting it would be roughly 0.004%). Even if we find a way to exclude the fraction of the population made of contrarians or the plainly ignorant, the trend remains the same: the larger the population, the harder it is for everyone to agree on anything.
Yet, such eigenvectors of knowledge are incredibly valuable, as they form the conduit, the skeleton, upon which the rest of discourse and data modeling can take place; the plumbing on top of which integration can flow and the exchange of knowledge over information can happen.
Alphabets and natural languages are examples of such sets of culturally established eigenvectors of knowledge. They are not modeled strictly and centrally; rules and patterns emerge from them, and the core that many of us share when we know a language and how to encode it for transport (characters or phonemes) is what allows my brain to inject knowledge all the way into yours right this very moment (which is nothing short of incredible, if you stop and think about it).
There is an implicit assumption out there, shared by many who work to build large-scale knowledge representation systems, that it is possible to bypass this problem by focusing on modeling only "factual" information. That is, information that descends directly from facts and therefore would be true in both views of the world, emergent and metaphysical.
Unfortunately, as I showed above, it's relatively easy to blur the factual quality of an assertion by simply considering time, the world's inherent variability, larger contexts and cultural biases of interpretation.
So, if we can't even model truth, what hope do we have of modeling human knowledge in symbolic representations well enough that computers can operate on them and interrogate them on our behalf, generating results that human users would find useful?
Personally, I think we need to learn from the evolution of natural languages and stop thinking in terms of 'true/false' or 'correct/incorrect' and focus on practical usefulness instead.
Unfortunately, this is easy to say but incredibly hard to execute, as engineers and ontologists alike have a natural tendency to abhor the incredible amount of doubt and uncertainty that such systems would have to model alongside the information they carry in order to follow the principles I outlined above.
Also, users won't like receiving answers that come with associated degrees of belief (often modeled as probabilities, as in Bayesian networks).
Think about it: if you asked such a system "Is Elvis Presley dead?" and the answer you got was "yes (98%) / no (2%)", would you be satisfied? Would you use the system again, or would you prefer a system that told you "yes", hid all the doubt from you, and made your life easier?
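As a concrete sketch of the two interface philosophies (names and numbers are mine, purely illustrative), the same underlying beliefs can be surfaced honestly or collapsed into one comfortable answer:

```python
from dataclasses import dataclass

# An illustrative sketch; these names and numbers are mine, not any
# real system's API.

@dataclass
class Belief:
    answer: str
    probability: float  # degree of belief in [0, 1]

# The system's actual state of belief for "Is Elvis Presley dead?"
beliefs = [Belief("yes", 0.98), Belief("no", 0.02)]

def honest(beliefs: list) -> str:
    """Surface the full distribution, doubts included."""
    return " / ".join(f"{b.answer} ({b.probability:.0%})" for b in beliefs)

def comfortable(beliefs: list, cutoff: float = 0.95) -> str:
    """Hide the doubt: one hard answer once belief clears a cutoff."""
    best = max(beliefs, key=lambda b: b.probability)
    return best.answer if best.probability >= cutoff else "it's complicated"

print(honest(beliefs))       # yes (98%) / no (2%)
print(comfortable(beliefs))  # yes
```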
Last but not least, people in this field hate to talk about this problem, because they feel it brings too much doubt into the operational nature of what they're trying to accomplish and undermines the validity of their work so far. It's the elephant in the room that everybody wants to avoid, hoping that with enough data collected and usefulness provided it will go away, or at least that it won't bother their corner.
I'm more and more convinced that this problem can't be ignored without sacrificing the entire feasibility of useful knowledge representation, and even more convinced that the social processes needed to acquire, curate, and maintain the data over time need to be at the very core of the design of such systems, not a secondary afterthought used to reduce data maintenance costs.
Unfortunately, even the most successful examples of such systems are still very far from that... but it's an exciting problem to have nonetheless.
At the time I wrote this, it was easy for my colleagues to dismiss my worries as "good problems to have". Today, the debate around the political power of content moderation on information dissemination platforms signals that it has become too dangerous to do so.
Unfortunately, most data modeling systems and information discovery tools still lack the ability to model and surface uncertainty, degrees of belief, and disagreement, and to distinguish "evidence of absence" from "absence of evidence".
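As a sketch of that last distinction (the names are mine, illustrative only): under an open-world assumption, a statement missing from the store is not the same as a statement asserted to be false, and a system that can't say "unknown" is forced to conflate the two.

```python
from enum import Enum

# A hedged sketch (my names, not a real system's API) of the
# difference between evidence of absence and absence of evidence.

class Verdict(Enum):
    TRUE = "asserted true"
    FALSE = "asserted false (evidence of absence)"
    UNKNOWN = "not asserted (absence of evidence)"

kb = {
    ("Elvis Presley", "is alive"): False,        # evidence he is not
    ("Washington, D.C.", "is US capital"): True,
}

def lookup(subject: str, predicate: str) -> Verdict:
    if (subject, predicate) not in kb:
        return Verdict.UNKNOWN  # open world: missing is not false
    return Verdict.TRUE if kb[(subject, predicate)] else Verdict.FALSE

print(lookup("Elvis Presley", "is alive"))    # Verdict.FALSE
print(lookup("Atlantis", "is a real place"))  # Verdict.UNKNOWN
```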
The hope, it seems to me, is that "AI + contractors will solve content moderation", and this feels naive given that the problem is a byproduct of the interaction of billions of general intelligences, each far more sophisticated than the ones we hope to build to make sense of it.
What if we started helping people make sense of an increasingly complex world and built tools to help them make decisions under uncertainty, instead of finding ways to hide it from them?