April 25, 2006

Making Statements in Text Analysis: Prologue

[This post is an attempt to jot down some thoughts that really deserve a longer discussion.]

A simple form of text mining is one in which the object data is considered to be some derivative of the simple character-based encoding of the corpus, and in which the results are observations made about that data. For example, I could take a corpus and count the frequency of sequences of characters that are bounded by either spaces or punctuation.
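As a rough illustration of this first kind of analysis, here is a minimal Python sketch. The function name and the sample sentence are invented for the example, and splitting on non-word characters is only one of several reasonable readings of "bounded by spaces or punctuation".

```python
import re
from collections import Counter

def count_surface_tokens(text):
    """Count character sequences bounded by whitespace or punctuation.

    No linguistic knowledge is applied: "computer" and "computers"
    are simply two different strings.
    """
    # \W+ matches runs of non-word characters (spaces, punctuation).
    tokens = [t for t in re.split(r"\W+", text) if t]
    return Counter(tokens)

# Illustrative input, not taken from any real corpus.
counts = count_surface_tokens("The computers hum. The computer hums, too.")

# Types ranked by frequency: the "meta data" (counts and rank) of the types.
ranked = counts.most_common()
```

The result stays entirely in the space of character strings: each type is just a distinct string with a count and a rank attached.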

A more interesting form of analysis is one which infers a new layer of data from the original. For example, we might take the lemma of each word in the corpus and count these (the lemma is the root form; cf. the monotonic process of stemming, in which some part of the token is removed). In the first example, the object data is a structure of token instances and the results can be thought of as types with associated metadata (counts and rank). In the second example, we first need to populate a new space of objects (that of lemmas) before associating metadata with the population. This inference requires a model - in this case a model of lemmas, morphology and, implicitly, a notion of performance which ultimately renders the lemmas in context as words and then as strings of characters.
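To make the contrast between lemmatisation and stemming concrete, here is a small Python sketch. The lemma table and the suffix-stripping rule are toy assumptions made up for illustration; a real system would use a proper morphological model rather than a hand-written dictionary.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a model of morphology.
TOY_LEMMAS = {
    "computers": "computer",
    "ran": "run",
    "running": "run",
    "runs": "run",
    "mice": "mouse",
}

def lemmatize(token):
    """Map a surface form into the space of lemmas (falls back to the token)."""
    t = token.lower()
    return TOY_LEMMAS.get(t, t)

def naive_stem(token):
    """Monotonic stemming: only ever removes a suffix, never rewrites."""
    t = token.lower()
    for suffix in ("ing", "s"):
        if t.endswith(suffix) and len(t) > len(suffix) + 2:
            return t[: -len(suffix)]
    return t

tokens = ["ran", "running", "runs", "mice", "computers", "computer"]

# Counts attached to a new population of objects (lemmas).
lemma_counts = Counter(lemmatize(t) for t in tokens)

# Counts attached to truncated strings; "ran" and "mice" are left unrelated.
stem_counts = Counter(naive_stem(t) for t in tokens)
```

The stemmer never leaves the space of character strings, so it cannot relate "ran" to "run" or "mice" to "mouse"; the lemma table can, because it encodes knowledge that is not present in the characters themselves.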

The distinction being made here is that in some forms of analysis, no transition is made from one space to another - we stay in the simple space of characters. In other forms of analysis, we make explicit transitions from one level of data (e.g. characters) to another (e.g. a model of language involving lemmas).

The transition from one domain to another permits the application of more and more powerful forms of inference and mining. Once in the world of lemmas, we can apply typological/ontological knowledge and so on - this can't be done with strings of characters in any sound manner.

Orthogonal to the notion of representation and inference is the nature of the statements that can be made as a result of observations of the data. Consider clustering. I could represent documents in some vector space and then, via some dimension reduction, present them to the user in a two-dimensional visualization. The user, upon seeing the visualization, might then observe that there are 'natural' clusters of documents. In this case it is the user that is making a statement (here is cluster A and here is cluster B).

On the other hand, an algorithm that not only represents the documents in some space but also makes assertions as to the composition of clusters is doing something more interesting: here it is the system, not the user, that is making the statement.
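A small Python sketch may help show what it means for the algorithm, rather than the user, to make the statement. It uses plain k-means on invented two-dimensional points; the choice of k-means, the starting centroids and the data are all assumptions for illustration, not a description of any particular system.

```python
import math

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

def kmeans(points, centroids, iterations=10):
    """Plain k-means: alternate assignment and centroid update."""
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                # New centroid is the coordinate-wise mean of its members.
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Keep an empty cluster's centroid where it was.
                updated.append(old)
        centroids = updated
    return assign(points, centroids), centroids

# Invented 2-D "document" coordinates, as if after dimension reduction.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]

# Starting centroids are picked by hand here (an assumption for determinism).
labels, centroids = kmeans(points, [points[0], points[3]])
```

Plotting `points` and leaving the reader to spot two groups corresponds to the first case; returning `labels` is the system itself asserting which documents belong to cluster A and which to cluster B.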

The question in the industrial context is: to what extent do you develop these two dimensions (the representation, and the nature of the statements made) in order to support a certain analytical need?

November 23, 2005

Elements of Data Mining: Concepts and Pragmatics

Data mining, like the specialised and related fields that I am interested in (including text mining and data visualization), has many useful definitions. Definitions in fields which are heavily academic in nature, and perhaps even more so when their methods are gaining commercial popularity, play a number of roles. Firstly, they give some broad description in layman's terms; secondly, they provide links to other areas of research; thirdly, they act as guidelines for practices and approaches.

This last aspect is worth further discussion. A short definition of data mining appearing in Wikipedia and attributed to Frawley et al. is:

The nontrivial extraction of implicit, previously unknown, and potentially useful information from data.

To me, there are two interesting aspects to this: 'potentially useful information', which describes the pragmatic aspects of data mining, and 'data', which encapsulates the domain over which mining (or inference) is made.

Consider mining information captured in text. If the system consumes a large volume of textual data and derives the fact that 'computers' and 'computer' are somehow related, then the domain is textual and the inference it captures is a piece of linguistic knowledge. Now consider a system in which 'computers' and 'computer' have been reduced in some preprocessing step to {lex:computer, number:plural} and {lex:computer, number:singular} - in other words, a linguistic preprocess has applied existing knowledge of morphology to the input and produced a representation of the data in a domain other than text. This new system has access to a richer explicit feature set from which to mine.
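The preprocessing step described above can be sketched in a few lines of Python. The `analyse` function and its single pluralisation rule are hypothetical stand-ins for real morphological knowledge; only the output shape, with `lex` and `number` features, follows the representation given in the text.

```python
def analyse(token):
    """Toy morphological analysis producing explicit features.

    Assumption for illustration only: a trailing "s" on a word of more
    than three letters marks a regular plural. Real morphology is far
    richer than this.
    """
    t = token.lower()
    if t.endswith("s") and len(t) > 3:
        return {"lex": t[:-1], "number": "plural"}
    return {"lex": t, "number": "singular"}

plural = analyse("computers")
singular = analyse("computer")

# In the feature domain the relation is explicit rather than something
# to be mined: the two analyses share a lex value.
related = plural["lex"] == singular["lex"]
```

A miner working on these feature structures starts with the relation between the two forms already in hand, whereas a miner working on raw text would have to discover it.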

A pragmatic approach to the definition would, in the case of text mining, discard any requirements that the domain being mined be fixed to the second type of representation as long as the results the system produced were, for some set of applications, useful. The conceptual approach to definitions would assert that the second domain is the right domain to best produce useful results from text.

The above has discussed the aspect of domain representation in data mining. Glenn Fannick recently asked why Marti Hearst's definition of text mining persists as the top result for a search on Google for the term. Marti's definition doesn't actually discuss the issue of the data domain; it focuses instead on a definition characterised by the nature of the results:

However, I am a bit of a purist when it comes to defining what text mining is.  I distinguish between what I call "real" text mining, that discovers new pieces of knowledge, from approaches that find overall trends in textual data. 

An analogy I like to use comes from the realm of crime fighting.   I think discovering new knowledge vs. showing trends is like the difference between a detective following clues to find the criminal vs. analysts looking at crime statistics to assess overall trends in car theft.

Clearly, her definition of text mining is not a pragmatic one. In fact, it relies on (either implicitly or explicitly) a representation of reasonably deep linguistic concepts. In addition, it is interesting to compare her characterization of the form of the results with that used in Wikipedia for data mining:

[data mining] is usually associated with a business or organization's need to identify trends.

Definitions are tricky, and any attempt to adhere strongly to a certain definition of a new field rejects the inherent fluidity that the field requires to develop. However, I do believe it is possible to define useful aspects of definitions (such as pragmatics and concepts). I have assumed a relationship here between data mining and text mining. I believe that this is a useful assumption to make as many of the methods originally used in data mining have found some application in text mining.
