Data mining, along with the specialised and related fields that I am interested in (including text mining and data visualization), has many useful definitions. Definitions in fields which are heavily academic in nature, and perhaps even more so when their methods are gaining commercial popularity, play a number of roles. Firstly, they give some broad description in layman's terms; secondly, they provide links to other areas of research; thirdly, they act as guidelines for practices and approaches.
This last aspect is worth further discussion. A short definition of data mining appearing in Wikipedia and attributed to Frawley et al. is:
The nontrivial extraction of implicit, previously unknown, and potentially useful information from data.
To me, there are two interesting aspects to this: "potentially useful information", which describes the pragmatic aspects of data mining, and "data", which encapsulates the domain over which mining (or inference) is made.
Consider mining information captured in text. If the system consumes a large volume of textual data and derives the fact that "computers" and "computer" are somehow related, then the domain is textual and the inference it captures is a piece of linguistic knowledge. Now consider a system in which "computers" and "computer" have been reduced in some preprocessing step to {lex:computer, number:plural} and {lex:computer, number:singular} - in other words, a linguistic preprocess has applied existing knowledge of morphology to the input and produced a representation of the data in a domain other than text. This new system has access to a richer explicit feature set from which to mine.
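To make the second representation concrete, here is a minimal Python sketch of such a preprocessing step. The function names `analyze` and `same_lexeme` are hypothetical, and the single rule used (a trailing "s" marks a plural) is a deliberately naive stand-in for real morphological analysis, not a description of any particular system.

```python
def analyze(token):
    """Map a surface token to a small feature structure.

    Assumption: a toy rule where a trailing 's' on a word longer than
    three characters marks a plural. Real morphology is far richer.
    """
    word = token.lower()
    if word.endswith("s") and len(word) > 3:
        return {"lex": word[:-1], "number": "plural"}
    return {"lex": word, "number": "singular"}


def same_lexeme(a, b):
    """True when two tokens share a lexeme under the toy analysis."""
    return analyze(a)["lex"] == analyze(b)["lex"]


print(analyze("computers"))
print(analyze("computer"))
print(same_lexeme("computers", "computer"))
```

A system mining the raw text would have to discover that the two strings are related; a system mining the output of `analyze` is handed that relationship as an explicit feature.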
A pragmatic approach to the definition would, in the case of text mining, discard any requirements that the domain being mined be fixed to the second type of representation as long as the results the system produced were, for some set of applications, useful. The conceptual approach to definitions would assert that the second domain is the right domain to best produce useful results from text.
The above has discussed the aspect of domain representation in data mining. Glenn Fannick recently asked why Marti Hearst's definition of text mining persists as the top result for a search on Google for the term. Marti's definition doesn't actually discuss the issue of the data domain; instead, it focuses on a definition characterised by the nature of the results:
However, I am a bit of a purist when it comes to defining what text mining is. I distinguish between what I call "real" text mining, that discovers new pieces of knowledge, from approaches that find overall trends in textual data.
An analogy I like to use comes from the realm of crime fighting. I think discovering new knowledge vs. showing trends is like the difference between a detective following clues to find the criminal vs. analysts looking at crime statistics to assess overall trends in car theft.
Clearly, her definition of text mining is not a pragmatic one. In fact, it relies on (either implicitly or explicitly) a representation of reasonably deep linguistic concepts. In addition, it is interesting to compare her characterization of the form of the results with that used in Wikipedia for data mining:
[data mining] is usually associated with a business or organization's need to identify trends.
Definitions are tricky, and any attempt to adhere strongly to a certain definition of a new field rejects the inherent fluidity that the field requires to develop. However, I do believe it is possible to define useful aspects of definitions (such as pragmatics and concepts). I have assumed a relationship here between data mining and text mining. I believe that this is a useful assumption to make as many of the methods originally used in data mining have found some application in text mining.