I think it's time there was a better general understanding of the shaky foundations of the blogosphere. Kevin Burton has been getting more and more upset with blog search. While pings have the potential to make the web far more efficient, the truth of the matter is that like any other open 'standard', they are just another vehicle for systematic abuse by spammers. On top of this, there is a pretty hefty chunk of legitimate use which is simply poorly executed (incorrect dates, lack of synchronization with the feed, etc.).
When one considers the expectations for real-time search in the blogosphere, periodic crawling (e.g. crawling every t minutes, which is what I believe both TailRank and Techmeme do - please correct me if this is wrong) is not a scalable solution. Leaving aside the issue of crawling tens of millions of pages multiple times a day, it also fails to provide an adequate discovery mechanism for new blogs.
While memetrackers are crawling something like tens to hundreds of thousands of blogs, I don't believe they could cope with a further two orders of magnitude. So once one accepts the ping ecology, one has to deal with spam. And a very clear parameter of any spam filtering algorithm is the time it is allowed to take to make its decision. If there is a need to make a quick decision - to ensure some sort of real-time system - then that limits the evidence one can use to make that decision.
What I'm getting at here, possibly in a roundabout manner, is that things are a mess. One can have high quality - as the memetrackers give us - but limited coverage and discoverability; or one can have near real-time and better discovery (coverage) with the danger of spam. Getting high coverage, a short mean time to index and no spam is hard - perhaps the real problem is that we have transferred mainstream web expectations (where sandboxing is appropriate due to a different view of real time) to the blogosphere.
It is not clear to me that a blog search engine has limited time to consider whether a weblog in the ping ecology is spam. Why is that?
Seems like you could assume all new weblogs are spam, withholding their posts until you have done a thorough check. Then, assume weblogs that have a history of no spam are not spam, but then eliminate them retroactively if they switch to spam.
That would seem to allow plenty of time to analyze the evidence, would it not?
Posted by: Greg Linden | December 31, 2006 at 09:37 PM
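The policy Greg describes can be sketched in a few lines. This is only an illustration under stated assumptions: the `SandboxIndex` and `Blog` names, the `is_spam` callback and the `threshold` parameter are all hypothetical, and no real blog search engine is claimed to work this way. New blogs have their posts withheld until they accumulate a clean history, trusted blogs are indexed immediately, and a blog that later flips to spam is removed retroactively.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Blog:
    url: str
    trusted: bool = False
    banned: bool = False
    held: List[str] = field(default_factory=list)     # posts withheld while sandboxed
    visible: List[str] = field(default_factory=list)  # posts in the public index


class SandboxIndex:
    """Hypothetical index: assume new blogs are spam until proven otherwise."""

    def __init__(self, is_spam: Callable[[str], bool], threshold: int = 10):
        self.is_spam = is_spam        # assumed spam classifier, supplied by the caller
        self.threshold = threshold    # clean posts required to leave the sandbox
        self.blogs: Dict[str, Blog] = {}

    def on_ping(self, url: str, post: str) -> None:
        blog = self.blogs.setdefault(url, Blog(url))
        if blog.banned:
            return
        if self.is_spam(post):
            # Retroactive removal: drop held posts and anything already visible.
            blog.banned = True
            blog.trusted = False
            blog.held.clear()
            blog.visible.clear()
            return
        if blog.trusted:
            blog.visible.append(post)
            return
        # Still sandboxed: withhold the post until the history is long enough.
        blog.held.append(post)
        if len(blog.held) >= self.threshold:
            blog.trusted = True
            blog.visible.extend(blog.held)
            blog.held.clear()

    def search_index(self) -> List[str]:
        return [p for b in self.blogs.values() for p in b.visible]
```

The cost of this design is visible in `on_ping`: a legitimate new blog contributes nothing to `search_index()` until it has published `threshold` clean posts.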
Greg,
While it is true that you can retroactively remove spam, the sandboxing method that you mention is, I believe, not possible with current expectations. Let's imagine it takes 10 posts before a blog is permitted out of the sandbox and into the visible index - that is going to be a serious negative for legitimate new bloggers (or new blogs). In addition, spamming is a very sophisticated process - with a number of decloaking practices (in which a blog posts non-spam content for a while and then flips to spam mode). The problem is also compounded by various endorsed and non-endorsed aggregators and rebloggers.
Personally, I believe that we are somewhat victims of our own expectations (or alternatively, victims of the infrastructure that the blogosphere is built on). Of course, an additional issue is the use of a different ranking function. In mainstream web search, relevance is a powerful antidote to spam - spam simply ranks low. With time as the primary ranking function, relevance can't play this role.
Posted by: Matthew Hurst | December 31, 2006 at 10:19 PM
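The ranking point in the reply above can be made concrete with a toy example. The `Post` type, the `relevance` score and the sample posts are hypothetical and chosen purely for illustration: with relevance as the sort key a low-scoring spam post sinks, while with recency as the sort key a freshly pinged spam post lands at the top.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Post:
    title: str
    timestamp: int    # seconds since epoch; larger means newer
    relevance: float  # hypothetical relevance score in [0, 1]


def rank_by_relevance(posts: List[Post]) -> List[Post]:
    # Mainstream web search style: most relevant first.
    return sorted(posts, key=lambda p: p.relevance, reverse=True)


def rank_by_time(posts: List[Post]) -> List[Post]:
    # Blog search style: newest first, regardless of relevance.
    return sorted(posts, key=lambda p: p.timestamp, reverse=True)


posts = [
    Post("legit analysis", timestamp=1000, relevance=0.9),
    Post("legit follow-up", timestamp=2000, relevance=0.7),
    Post("spam", timestamp=3000, relevance=0.1),
]

by_relevance = [p.title for p in rank_by_relevance(posts)]
by_time = [p.title for p in rank_by_time(posts)]
```

Here `by_relevance` ends with the spam post and `by_time` begins with it, which is why a time-ordered index has to catch spam before indexing rather than rely on ranking to bury it.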
I would agree that it's a hard problem. The benefit of solving a hard problem, though, is that you'll lack any competition in the market and be able to make a return on your investment.
That said... I think your comment presents a false dichotomy:
"One can have high quality - as the memetrackers give us - but limited coverage and discoverability; or one can have near real-time and better discovery (coverage) with the danger of spam."
I don't think so. It's certainly a hard problem, but we're planning on expanding by at least one order of magnitude and should be able to handle and remove spam.
The trick is to figure out the right blog-to-spam ratio. At 58M blogs it's pretty hard for Technorati to keep out spam.
Of course, again... if you can solve it, you're in a good position.
Kevin
Posted by: Kevin Burton | January 02, 2007 at 03:08 AM