Umbria's new white paper on spam blogs provides some solid insights into the problem of spam in both the search and marketing intelligence spaces. However, it continues the practice, established in the company's recent press release on the topic, of using flawed logic and misleading information to try to position Umbria ahead of the competition.
Umbria analyzes the percentage of spam in the leading blog search engines. There is an improvement here compared to the press release in that it identifies the search terms used and the counts of spam found. They are very open in admitting the anecdotal nature of this experiment. My real concern is that they then use these tests against consumer-facing search engines to imply that their enterprise marketing intelligence solution is somehow leading the pack.
They claim that the real-time nature of the search engine space prevents adequate spam filtering. Although the current state of existing search engines doesn't necessarily refute that claim, fundamentally I don't believe it to be the case. There is no reason why blog search engines can't approach the problem with some form of sandboxing for new blogs whose status is not yet determined. In addition, there are approaches to true real-time spam filtering that wouldn't even require sandboxing.
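To make the sandboxing idea concrete, here is a minimal sketch in Python. Every name in it (BlogIndex, ingest, classify) is hypothetical, and the spam decision itself is assumed to come from some external classifier or reviewer; it illustrates the idea only and does not describe how any existing search engine works.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    UNKNOWN = "unknown"   # new blog, not yet judged
    TRUSTED = "trusted"   # judged legitimate
    SPAM = "spam"         # judged spam


@dataclass
class BlogIndex:
    """Hypothetical index that holds posts from unjudged blogs in a sandbox."""
    status: dict = field(default_factory=dict)    # blog URL -> Status
    live: list = field(default_factory=list)      # (blog, post) pairs visible to search
    sandbox: dict = field(default_factory=dict)   # blog URL -> posts held back

    def ingest(self, blog, post):
        # Blogs we have never seen start out as UNKNOWN.
        state = self.status.setdefault(blog, Status.UNKNOWN)
        if state is Status.TRUSTED:
            # Trusted blogs are searchable immediately.
            self.live.append((blog, post))
        elif state is Status.UNKNOWN:
            # Unjudged blogs are held until a verdict arrives.
            self.sandbox.setdefault(blog, []).append(post)
        # Posts from known spam blogs are simply dropped.

    def classify(self, blog, is_spam):
        # The verdict is assumed to come from an external classifier.
        self.status[blog] = Status.SPAM if is_spam else Status.TRUSTED
        held = self.sandbox.pop(blog, [])
        if not is_spam:
            # Release everything that was held for this blog.
            self.live.extend((blog, post) for post in held)
```

In this sketch, blogs already known to be legitimate stay fully real time; only posts from blogs with no verdict yet are delayed, and they are released in one step once the blog is cleared.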
The most flawed part of their argument is that they imply that the quality of results in Intelliseek's consumer-facing product is identical to that in our enterprise marketing intelligence solution. Given their argument that they would have challenges filtering spam in a real-time consumer solution, why would they assume we use the same mechanism in a completely different application? I wouldn't, for example, use the fact that their home page presented the spam blog white paper in a buggy manner to conclude that their web-hosted enterprise solution was poorly implemented, or that the poor logic employed in their white paper indicates that the results in their customer reports were not to be trusted! Would I?
Perhaps their story is that their product is more useful than, say, trend mining on BlogPulse, or search on Technorati (which are both free).