If you search for 'blogpulse' in any of the major blog search engines (BlogPulse, Technorati, IceRocket, Sphere and Google), you will see hits on areas of the document that are not what one might call object content, that is, the text the author actually wrote as the post. Problem areas include:
- Tags (where there is a footer line saying something like 'Technorati tags: foo', or where the term is actually one of the tags listed, e.g. 'Technorati tags: blogpulse')
- Links to search results ('blog linking to this post on : Technorati, Blogpulse, Bloglines, ...')
What is required to avoid this is some simple document analysis. Stripping footers and tag lines should be trivial - so why isn't it done? One of the major data quality components I've been involved in creating at Intelliseek is the document analysis code that takes raw post documents and marks up their structural areas (for message board data this might be quoted material and signatures; for blogs it is the type of thing I've outlined above).
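To give a feel for how little is needed for the two blog cases above, here is a minimal sketch in Python. It is not the Intelliseek code; the patterns, labels and function names (`TAG_LINE`, `SEARCH_LINKS_LINE`, `mark_up`, `indexable_text`) are all made up for illustration, and real posts would need a much richer set of rules.

```python
import re

# Hypothetical patterns for two kinds of non-content lines.
# They are illustrative assumptions, not a complete rule set.
TAG_LINE = re.compile(r"^\s*(technorati\s+)?tags\s*:", re.IGNORECASE)
SEARCH_LINKS_LINE = re.compile(
    r"^\s*blogs?\s+linking\s+to\s+this\s+post\b", re.IGNORECASE
)


def mark_up(post_text):
    """Label each line of a post as 'tags', 'search_links' or 'content'."""
    labelled = []
    for line in post_text.splitlines():
        if TAG_LINE.match(line):
            label = "tags"
        elif SEARCH_LINKS_LINE.match(line):
            label = "search_links"
        else:
            label = "content"
        labelled.append((label, line))
    return labelled


def indexable_text(post_text):
    """Return only the lines labelled 'content', ready for indexing."""
    return "\n".join(
        line for label, line in mark_up(post_text) if label == "content"
    )


if __name__ == "__main__":
    post = (
        "I tried a new search engine today.\n"
        "Technorati tags: blogpulse\n"
        "blog linking to this post on : Technorati, Blogpulse, Bloglines"
    )
    print(indexable_text(post))
```

Marking up the lines rather than deleting them outright keeps the tags available as meta-data, while only the 'content' lines go into the full-text index.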
One of the temptations in developing products in a market as fast paced as the blogosphere is to deliver new features rather than improve existing ones. Content is the key asset here, but modifying the index to improve it across the board is a large task which is often lost in the rush to add a new widget or gizmo. It is a lovely irony that tags are indexed as object content when they are intended to be meta-data.