I’ve been noticing some weird behaviour in Google’s blog search recently that makes me suspect that they are testing new features and possibly a new back end or index. Firstly, I noticed changes between different attempts to search for the same item: there was a significant reduction in the number of results, with many legitimate posts missing. Secondly, I’ve started to notice hits for search terms that are not in the post. For example, a search for “political streams” brings up this post from Blog About Stats, which doesn’t mention the phrase but which (as of the time of writing) has it in the title of a post in its ‘Recent Posts’ list. The strange thing is that this feed isn’t partial. I had originally thought that Google was attempting to fill in partial feed data and getting it wrong, but this doesn’t seem to be the case.
It looks to me like they're now indexing entire pages as posts. So in addition to the post text and links, individual results at GBS now include all sidebar info and links. Maybe they switched from indexing feeds to indexing HTML. The result is noisier search results, and it's become useless for tracking *new* mentions of keywords, as any new post with the same old keyword in a sidebar shows up as new.
Posted by: pb | November 05, 2008 at 02:13 PM
I think you are right, pb. I've noticed the same things. This was the approach that Technorati took (I'm not sure if they still do it). It certainly weakens the whole system.
Posted by: Matthew Hurst | November 05, 2008 at 03:13 PM
It looks like something has just been fixed (for now) in the index. I now see the result set I saw before I noticed these changes and the reduction in results.
Posted by: Matthew Hurst | November 06, 2008 at 01:55 AM