A simple characterization of text mining platforms can be given by two basic paradigms. Each paradigm involves some sort of interface, a data set, and a set of user expectations and needs.
In the search paradigm, a user expects almost instant results. The user interacts with the data via a simple text-based interface into which they type queries. Queries are run over a large data set and results are returned in a simple format (e.g. a standard search engine results page or a trend graph). The data that backs this paradigm is generally a full-text index of some sort with little annotation. This form of text mining permits rapid analysis, but only with a shallow view of the data.
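To make the idea of a full-text index with little annotation concrete, here is a minimal sketch of an inverted index in Python. The function names (`build_index`, `search`), the whitespace tokenizer, and the sample documents are all assumptions made for illustration; a real search platform would use a far more capable index.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Whitespace tokenization is an assumption; there is no deeper analysis.
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample data.
docs = {
    1: "text mining finds patterns in text",
    2: "search engines index text for fast lookup",
    3: "parsing is a costly analysis step",
}
index = build_index(docs)
print(sorted(search(index, "text index")))
```

Lookups touch only the posting sets for the query terms, which is why this style of platform can answer quickly, and also why it sees nothing beyond the surface tokens.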
Contrast this with the mining paradigm. Here, the user interacts with the data via a rich client that not only allows access to advanced analyses of the data (perhaps stored as annotations in the data), but also embeds data and text mining algorithms, allowing the user to perform on-the-fly analysis on subsets and views of that data. This form of text mining leverages the full power of sophisticated automated analytics (such as parsing and semantic analysis) but has a higher processing cost. Note, however, that this processing is in general data parallel, so a scalable, responsive system is certainly achievable.
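The data-parallel point can be sketched as follows: each document is analyzed independently, so the work can be spread across workers. This is only an illustration under stated assumptions. The `analyze` function is a toy stand-in for an expensive step such as parsing, and the names and sample texts are hypothetical. A thread pool is used to keep the sketch simple; for CPU-bound analysis in CPython, `concurrent.futures.ProcessPoolExecutor` offers the same `map` interface.

```python
from concurrent.futures import ThreadPoolExecutor

def analyze(text):
    """Toy stand-in for an expensive per-document analysis such as parsing."""
    tokens = text.split()
    return {
        "n_tokens": len(tokens),
        # Capitalized tokens serve as a crude placeholder for richer annotation.
        "capitalized": [t for t in tokens if t[:1].isupper()],
    }

def analyze_all(texts, max_workers=4):
    """Analyze documents independently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze, texts))

# Hypothetical subset selected by the user in a rich client.
subset = ["Alice met Bob in Paris", "the report was late"]
annotations = analyze_all(subset)
print(annotations)
```

Because no document depends on another, adding workers (or machines) is the natural way to keep such a system responsive as the subset grows.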
These two paradigms, of course, represent two ends of a spectrum, and many applications will land somewhere in between. For example, some amount of analysis may be performed on the data at ingest time, and the results of this analysis may be made available as a searchable feature. Note that these paradigms describe platforms. The applications and algorithms that may be brought to bear represent another dimension that can be used to characterize the space.
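As an illustration of the middle of the spectrum, the sketch below runs a small analysis once at ingest time and exposes its result as a filter at query time. Everything here is an assumption for the sake of the example: the `ingest` and `search` functions, the `has_capitalized` feature name, and the capitalization check standing in for a real analysis.

```python
from collections import defaultdict

def ingest(docs):
    """Build a token index plus a feature index computed once at ingest."""
    token_index = defaultdict(set)
    feature_index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            token_index[token.lower()].add(doc_id)
            # Hypothetical ingest-time analysis: flag documents that contain
            # a capitalized token, as a placeholder for entity detection.
            if token[:1].isupper():
                feature_index["has_capitalized"].add(doc_id)
    return token_index, feature_index

def search(token_index, feature_index, term, feature=None):
    """Look up a term, optionally restricted to documents with a feature."""
    hits = set(token_index.get(term.lower(), set()))
    if feature is not None:
        hits &= feature_index.get(feature, set())
    return hits

# Hypothetical sample data.
docs = {
    1: "the meeting moved to Berlin",
    2: "the meeting was cancelled",
}
token_index, feature_index = ingest(docs)
print(search(token_index, feature_index, "meeting", feature="has_capitalized"))
```

The analysis cost is paid once per document, while queries remain simple set lookups, so the user keeps search-like response times while gaining a slightly deeper view of the data.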