In text mining applications, we often work with some form of raw input (web pages, web sites, emails, etc.) and attempt to organize it in terms of the concepts that are mentioned or introduced in the documents.
This process of interpretation can take the form of 'normalization' or 'canonicalization', in which many expressions are associated with a single expression that serves as an exemplar of the set. This happens, for example, when we map 'Barack Obama', 'President Obama', etc. to the unique string 'President Barack Obama'. This is convenient when we want to retrieve all documents about the president.
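To make that concrete, here is a minimal sketch of this kind of normalization in Python; the alias table and the canonical string are invented for illustration, not drawn from any particular system.

```python
# A minimal alias table: many surface forms map to a single exemplar string.
# The aliases and the canonical form are invented for illustration.
CANONICAL = {
    "barack obama": "President Barack Obama",
    "president obama": "President Barack Obama",
    "obama": "President Barack Obama",
}

def normalize(mention):
    """Map a raw mention to its canonical form, or return it unchanged."""
    return CANONICAL.get(mention.strip().lower(), mention)

print(normalize("President Obama"))  # -> President Barack Obama
```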
In this process, we are associating elements within the same language (language in the sense of sets of symbols and the rules that govern their legal generation).
Another approach is to map (or associate) the terms in the original document with some structured record. For example, we might interpret the phrase 'Starbucks' as relating to a record of key-value pairs {name=starbucks, address=123 main street, ...}. In this case, the structure of the record has a semantics (or model) other than that of the original document. In other words, we are mapping from one language to another.
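A sketch of this second style of interpretation might look like the following; the record fields and values are again assumptions made purely for illustration.

```python
# A toy 'knowledge base': a mention is interpreted as a structured record
# whose fields have their own semantics. Field names and values are invented.
RECORDS = {
    "starbucks": {
        "name": "starbucks",
        "address": "123 main street",
        "phone": "555-0100",
    },
}

def interpret(mention):
    """Map a phrase in the document language to a record in the target language."""
    return RECORDS.get(mention.strip().lower())

print(interpret("Starbucks"))  # -> {'name': 'starbucks', 'address': '123 main street', ...}
```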
Of course, what we really want to do is denote the thing in the real world. It is, however, impossible to represent this, as all we can do is shuffle bits around inside the computer. We can't attach a label to the real world and somehow transcend the reality/representation barrier. However, we can start to look at the modeling process with some pragmatics.
The model of the 'semantic' or 'knowledge' language governs a space of statements that can be made. These statements may or may not be true when compared with the real world, but they should at least aspire to a system of behaviours that does mirror the real world. For example, if two entities have the same address, then they should be located in the same place. If two entities have the same phone number, then one should connect with the same agent when dialing that number. A set of statements describing a restaurant can include a statement about an entity described as a chef, just as a restaurant 'has' a chef. A set of statements describing a clinic should not have any associated statements about chefs, just as clinics don't have chefs.
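One way to picture this space of licensed statements is as a small schema that says which relations a given kind of entity may participate in. The entity types and relation names below are assumptions chosen only to illustrate the idea.

```python
# A toy model: which kinds of statements are licensed for which kind of entity.
# The entity types and relation names are invented for illustration.
ALLOWED = {
    "restaurant": {"has_chef", "has_address", "has_phone"},
    "clinic": {"has_physician", "has_address", "has_phone"},
}

def is_licensed(entity_type, relation):
    """A statement is admissible only if the model allows that relation for the type."""
    return relation in ALLOWED.get(entity_type, set())

print(is_licensed("restaurant", "has_chef"))  # True: restaurants have chefs
print(is_licensed("clinic", "has_chef"))      # False: clinics don't have chefs
```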
It is convenient to associate some proxy identifier with each set of statements about a real world entity. We can then use that identifier to 'mean' the real world entity. When we do this we have to be careful of two things. Firstly, the statements may change to better reflect the 'truth': the identifier stays the same, but the phone number statement changes. Secondly, we mustn't confuse this proxy behaviour with the real world entity itself. For example, I may have two proxy identifiers for sets of statements that I later realise are statements about the same real world entity. I will then have to make some sort of statement about the relationship between the identifiers (whatever id1 indicates, id2 indicates).
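A minimal sketch of this proxy-and-merge idea, with invented identifiers, fields, and values, might look like this; the 'same_as' link plays the role of the statement relating the two identifiers.

```python
# Statements are grouped under proxy identifiers; a 'same_as' link records the
# later discovery that two proxies denote one real world entity.
# Identifiers, fields, and values are invented for illustration.
statements = {
    "id1": {"name": "Joe's Diner", "phone": "555-0199"},
    "id2": {"name": "Joe's Diner", "address": "42 Elm Street"},
}
same_as = {"id2": "id1"}  # whatever id2 indicates, id1 indicates

def resolve(identifier):
    """Follow same_as links and merge every statement made about the one entity."""
    merged = {}
    while identifier is not None:
        merged.update(statements.get(identifier, {}))
        identifier = same_as.get(identifier)
    return merged

print(resolve("id2"))  # statements recorded under id2 and id1, combined
```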
I tend to think of the world generatively. The world generates documents that somehow imperfectly reflect its nature. Text miners are in the business of synthesizing this secondary representation.
As somebody who has done a lot of statistics before, I found this very refreshing. This is a fantastic way to explain to somebody philosophically how one will be able to make sense of raw data and turn it into something that is meaningful.
Did I understand your article correctly?
Posted by: Joey Carlisle | July 31, 2011 at 05:18 PM
Nice essay on the technology necessary for knowledge representation in a text-resource-based and NLP-ridden environment!
For digging deeper into the ideas discussed in your last paragraphs, I recommend reading about the Topic Maps standard.
Topic Maps aim to be a rather "pragmatic philosophical technology", as you call it, and they are a bit different from what currently buzzes as the "Semantic Web". Especially interesting is their use of identifiers and their identity-merging concept based on those identifiers. Also worth a glance is their approach of modelling statements (with fixed semantics) entirely in terms of "topics": digital proxies for real world entities, concepts, and other constructs of thought and reality.
Posted by: efi | August 01, 2011 at 09:38 AM
Great essay! Topic maps weren't the first effort to address the issues you raise, but (personally biased as a topic map standard editor) I too would recommend reading about them. My blog, Another Word For It, http://tm.durusau.net, focuses on topic maps and semantic diversity.
The one point about subjects that I would make explicit is that the ways we talk about subjects are also composed of subjects. That is, if I am identified by name = "Patrick Durusau", then "name" is just as much a subject as the subject we are trying to identify. That realization makes it easier to create mappings between formats/data structures (which are composed of subjects). Or, more bluntly, no format or data structure is final or universal.
Posted by: www.google.com/accounts/o8/id?id=AItOawnCYF4H7VHqH2Bl7SZt1Ev7rDFCr0Vx4f4 | August 04, 2011 at 03:18 PM