If you search for 'blogpulse' in any of the major blog search engines (BlogPulse, Technorati, IceRocket, Sphere and Google), you will see hits on areas of the document that are not what one might call object content. Problem areas include:
- Tags (where there is a footer line saying something like 'Technorati tags: foo', or where the term is actually one of the tags listed, e.g. 'Technorati tags: blogpulse')
- Links to search results ('blog linking to this post on : Technorati, Blogpulse, Bloglines, ...')
What is required to avoid this is some simple document analysis. Stripping footers and tag lines should be trivial - so why isn't it done? One of the major data quality components I've been involved in creating at Intelliseek is the document analysis code that takes raw post documents and marks up their structural areas (for message board data this might be quoted material and signatures; for blogs it is the type of thing I've outlined above).
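To make the idea concrete, here is a minimal sketch of this kind of markup in Python. It is not the Intelliseek code. The two patterns, the labels and the function names `mark_up` and `indexable_text` are all assumptions made up for illustration, and a real system would need many more footer patterns than these two.

```python
import re
from typing import List, Tuple

# Hypothetical patterns for the two problem areas listed above.
# Real blog footers vary widely; these only cover the quoted examples.
TAG_LINE = re.compile(r"^\s*(technorati\s+)?tags?\s*:", re.IGNORECASE)
SEARCH_LINKS_LINE = re.compile(
    r"^\s*blogs?\s+linking\s+to\s+this\s+post", re.IGNORECASE
)


def mark_up(post: str) -> List[Tuple[str, str]]:
    """Label each line of a post as 'tags', 'search_links' or 'content'."""
    marked = []
    for line in post.splitlines():
        if TAG_LINE.match(line):
            marked.append(("tags", line))
        elif SEARCH_LINKS_LINE.match(line):
            marked.append(("search_links", line))
        else:
            marked.append(("content", line))
    return marked


def indexable_text(post: str) -> str:
    """Return only the lines labelled as content, ready for indexing."""
    return "\n".join(
        line for label, line in mark_up(post) if label == "content"
    )


if __name__ == "__main__":
    sample = (
        "I tried a new search engine today.\n"
        "Technorati tags: blogpulse\n"
        "blog linking to this post on : Technorati, Blogpulse, Bloglines"
    )
    print(indexable_text(sample))
```

Labelling the lines rather than deleting them keeps the tags available as metadata while leaving them out of the full-text index.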
One of the temptations in developing products in a market as fast paced as the blogosphere is to deliver new features rather than improve existing ones. Content is the key asset here, but modifying the index to horizontally improve it is a large task which is often lost in the rush to add a new widget or gizmo. It is a lovely irony that tags are indexed as object content when they are intended to be meta-data.
Matt, this is great. I just left a comment about this problem at Rick Klau's article on upgrades to FeedFlare: http://www.rklau.com/tins/archives/2006/01/26/feedflare_is_big_really_big.php. Essentially, I think feeds would be greatly enhanced with this metadata explicitly marked. But then spammers could just as easily strip it. (I also think we should stop putting rel=tag links in the body of our feeds...) I see that Rick has also replied to my comment.
Posted by: Jack Vinson | January 28, 2006 at 11:06 PM