[This post is an attempt to jot down some thoughts that really deserve a longer discussion.]
A simple form of text mining is one in which the object data is taken to be some derivative of the plain character-based encoding of the corpus, and the results are observations made about that data. For example, I could take a corpus and count the frequency of character sequences that are bounded by spaces or punctuation.
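As a concrete sketch of this first, character-level kind of analysis (in Python, using only the standard library; the toy corpus is mine, invented for illustration):

```python
# Count character sequences bounded by spaces or punctuation.
# No deeper model is involved: we never leave the space of characters.
import re
from collections import Counter

corpus = "The cats sat. The cat sat on the mat, and the mats stayed put."

# Tokens are maximal runs of word characters; everything else is a boundary.
tokens = re.findall(r"\w+", corpus.lower())
frequencies = Counter(tokens)

for token, count in frequencies.most_common(5):
    print(token, count)
```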
A more interesting form of analysis is one which infers a new layer of data from the original. For example, we might take the lemmas of the words in the corpus and count those (the lemma is the root form; cf. the monotonic process of stemming, in which some part of the token is simply removed). In the first example, the object data is a structure of token instances and the results can be thought of as types with associated metadata (counts and rank). In the second example, we first need to populate a new space of objects (that of lemmas) before associating metadata with that population. This inference requires a model: in this case a model of lemmas and morphology, and implicitly a notion of performance which ultimately renders the lemmas in context as words and then as strings of characters.
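A minimal sketch of the lemma-counting case, assuming the NLTK library and its WordNet data are available (the token/part-of-speech pairs below are made up for illustration). The point is that the lemmatizer plus the part-of-speech information is the model doing the inference:

```python
# Count lemmas rather than surface tokens. The lemmatizer is the "model"
# that maps inflected forms back to a root form.
from collections import Counter

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # the lemmatizer relies on WordNet data

lemmatizer = WordNetLemmatizer()

# (token, part-of-speech) pairs: the POS tag is part of the model too;
# without it the lemmatizer cannot tell "saw" the noun from "saw" the verb.
tagged = [("cats", "n"), ("sat", "v"), ("cat", "n"), ("mats", "n"), ("sitting", "v")]

lemma_counts = Counter(lemmatizer.lemmatize(tok, pos) for tok, pos in tagged)
print(lemma_counts)  # e.g. Counter({'cat': 2, 'sit': 2, 'mat': 1})
```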
The distinction being made here is that in some forms of analysis, no transition is made from one space to another - we stay in the simple space of characters. In other forms of analysis, we make explicit transitions from one level of data (e.g. characters) to another (e.g. a model of language involving lemmas).
The transition from one domain to another permits the application of increasingly powerful forms of inference and mining. Once in the world of lemmas, we can apply typological/ontological knowledge and so on; this can't be done with strings of characters in any sound manner.
Orthogonal to the question of representation and inference is the nature of the statements that can be made from observations of the data. Consider clustering. I could represent documents in some vector space and then, via some dimension reduction, present them to the user in a two-dimensional visualization. The user, upon seeing the visualization, might then observe that there are 'natural' clusters of documents. In this case it is the user that is making a statement (here is cluster A and here is cluster B).
On the other hand, an algorithm that not only represents the documents in some space but also makes assertions as to the composition of clusters is doing something more interesting than the former approach.
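To make the contrast concrete, here is a hedged sketch using scikit-learn (my choice of toolkit, nothing more; the toy documents are invented). The SVD projection only hands the user two-dimensional coordinates to look at, whereas the k-means step has the algorithm itself assert a cluster label for every document:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "stocks fell as markets reacted to interest rates",
    "the central bank raised interest rates again",
    "the team won the match in extra time",
    "a late goal decided the championship match",
]

# Represent the documents in a (high-dimensional) vector space.
vectors = TfidfVectorizer().fit_transform(documents)

# Dimension reduction for visualization: coordinates only, no assertions.
coords_2d = TruncatedSVD(n_components=2).fit_transform(vectors)
print(coords_2d)

# Clustering: the algorithm itself asserts which documents belong together.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # e.g. [0 0 1 1], finance vs. sport
```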
The question in the industrial context is: to what extent do you develop these two dimensions in order to support a certain analytical need?
In this exact spirit, here's a shameless plug, but with a bit of a twist to it:
Steven Keith, Owen Kaser, Daniel Lemire, Analyzing Large Collections of Electronic Text Using OLAP, APICS 2005, Wolfville, Canada, October 2005.
http://www.daniel-lemire.com/fr/abstracts/APICS2005.html
Posted by: Daniel Lemire | April 27, 2006 at 09:42 AM