While trying to write up some thoughts on our big news, specifically highlighting the value of the text/data mining portion of the technology, I got an invitation to the NewsVine beta. My initial reaction was: this is a nicely presented, well-designed system. My second reaction was: what! More stuff to wade through?
You see, although products appear to be making improvements in clustering articles, gathering recommendations, etc., these are only tiny improvements towards the real goal: tell me what is going on in the spaces I am interested in. The articles that are paraded before the user should not be regarded as the end product.
Bloglines, and RSS aggregation in general, is another example of this problem. The more I am interested in, the less time I have to read all the stuff. I need a system that will tell me, in its own words, what is happening and then, and only then, point me to articles that colour the event, movie, or site in question.
The basic technologies required to do this (NLP, including information extraction, parsing, semantic interpretation and summarization; data mining; text mining) are all being deployed in enterprise-facing solutions. They are being applied to a certain extent in consumer solutions, but their results are never surfaced in those solutions as the primary result.
Imagine an extension of the BlogPulse person name analysis which combines information about who is being mentioned with information about why they are being mentioned. This is done to a certain extent with the Blog Bites feature, which associates mined phrases that appear in close proximity, but it doesn't go far enough. Imagine a twist on memeorandum where, rather than a major article with associated discussion, there is a major topic with associated discussion. For example, the topic might be presented as simple tuples: {Ang Lee, Brokeback Mountain, Golden Globes} or {Ariel Sharon, surgery, 1/18/2006}.
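To make the tuple idea a little more concrete, here is a minimal sketch in Python. It is not how BlogPulse or memeorandum work; it simply assumes that some upstream information extraction step (not shown) has already attached a set of entities to each post, and then indexes posts by the entity tuples they share. The `POSTS` data, the example URLs, and the `topic_index` function with its `size` and `min_posts` parameters are all invented for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: each post already carries the entities found by an
# upstream extraction step. URLs and entity sets are made up.
POSTS = [
    {"url": "http://example.com/a",
     "entities": {"Ang Lee", "Brokeback Mountain", "Golden Globes"}},
    {"url": "http://example.com/b",
     "entities": {"Ang Lee", "Brokeback Mountain", "Golden Globes", "awards season"}},
    {"url": "http://example.com/c",
     "entities": {"Ariel Sharon", "surgery"}},
    {"url": "http://example.com/d",
     "entities": {"Ariel Sharon", "surgery", "hospital"}},
]


def topic_index(posts, size=3, min_posts=2):
    """Map each entity tuple of the given size to the posts mentioning all of it.

    Tuples supported by fewer than min_posts posts are dropped, so only
    topics that several posts are discussing survive.
    """
    index = defaultdict(list)
    for post in posts:
        # Sort so the same set of entities always yields the same tuple.
        for combo in combinations(sorted(post["entities"]), size):
            index[combo].append(post["url"])
    return {topic: urls for topic, urls in index.items() if len(urls) >= min_posts}


if __name__ == "__main__":
    for topic, urls in topic_index(POSTS).items():
        print(topic, "->", urls)
```

The topic tuple becomes the primary result and the posts hang off it as the associated discussion. A real system would need entity normalization, a time window, and something smarter than raw co-occurrence to capture the "why", but the shape of the output is the point here.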
The technology required to do this is all pretty old hat. So why aren't the products around?
There is what we might call a textural argument against this type of tool: machine-generated summaries are not appealing because the manner in which people discuss issues is part of the value, not simply what they are talking about. This is true, but all I am suggesting is that we index and organize by what is being said. A user of such a tool could then drill down to read human-generated content (HGC; cf. MGC, machine-generated content).


