Marketing Vox posts about Umbria's commentary on the state of the splogosphere:
Though some 80,000 blogs are being created every day, 10-20 percent of them may be spam, according to Umbria Communications, which monitors consumer-generated media, reports AdWeek. Umbria's research has found that 2.7 million blogs out of 20.3 million are spam blogs, or splogs, many of which are created solely as a shady marketing tactic, using stolen web content, often via RSS feeds, to profit from contextual ad programs.
Umbria examined blog search results in October from Technorati, IceRocket and BlogPulse, finding that on average 44 of the top 100 results were splogs.
Some subtleties here are being used to confound the reader. Umbria claims that 10-20% of all blogs are spam. It then implies that 44% of the results from the listed search engines are spam, suggesting that the search engines are effectively amplifying the problem.
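To make the implied amplification concrete, here is a quick back-of-the-envelope check using only the figures quoted above. The variable names are mine, and the comparison is loose: the first rate counts blogs, while the second counts search results (presumably individual posts), so treat the ratio as illustrative rather than a measured quantity.

```python
# Figures quoted from the Umbria/AdWeek report above.
total_blogs = 20.3e6        # blogs Umbria says it looked at
spam_blogs = 2.7e6          # of which Umbria classifies as splogs
splogs_in_top_100 = 44      # average splogs in the top 100 search results

# Share of all blogs that are splogs (per-blog rate).
base_rate = spam_blogs / total_blogs

# Share of top search results that are splogs (per-result rate).
search_rate = splogs_in_top_100 / 100

# How much higher the search-result rate is than the base rate.
amplification = search_rate / base_rate

print(f"splog share of all blogs:       {base_rate:.1%}")
print(f"splog share of top 100 results: {search_rate:.1%}")
print(f"implied amplification:          {amplification:.1f}x")
```

The base rate works out to about 13%, inside the quoted 10-20% range, and the search figure is roughly 3.3 times that. Without the search terms, there is no way to tell how much of that gap comes from the engines and how much from the choice of queries.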
This is a pretty ugly way to go about negative marketing. An important piece of information is missing: they don't disclose the search terms used. Did they search for 'real estate' or 'yoga for toddlers'? Were the searches representative of search user behaviour? I'm guessing not, as Umbria has no consumer-facing search application.
Umbria sells enterprise analysis applications over blog and other data, as do most of the blog search engines cited. The processes involved in delivering quality results to such clients are quite different from those involved in supporting a consumer-facing application like BlogPulse. In blog search engines, the primary dimension for ranking results is time, not relevance. Standard web search, where relevance is the primary ranking dimension, allows spam to simply fall to the bottom of the ranking, never to be seen by the reader. Combining time and relevance in a single ranking is a challenging problem. Determining the relevance of a post globally in a large-scale, data-driven analytical engine is a different task again, so customers of such systems are exposed to spam very differently from users of time-ranked blog search.
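The difference between the two ranking dimensions can be sketched with a toy example. Everything here is hypothetical: the corpus is synthetic, the `Post` fields and the 1-in-4 spam rate are made up, and the sketch assumes a relevance model that scores spam low, which real splogs built from stolen content may well defeat. It is not how any of the engines named above actually rank.

```python
from dataclasses import dataclass


@dataclass
class Post:
    timestamp: int    # larger means more recent
    relevance: float  # hypothetical relevance score in [0, 1]
    is_spam: bool


def make_corpus(n=1000, spam_every=4):
    """Build a synthetic corpus in which every `spam_every`-th post is spam."""
    posts = []
    for i in range(n):
        spam = i % spam_every == 0
        # Assumption: the relevance model gives spam a low score,
        # while legitimate posts get varying scores of 0.5 and up.
        relevance = 0.1 if spam else 0.5 + (i % 50) / 100
        posts.append(Post(timestamp=i, relevance=relevance, is_spam=spam))
    return posts


def spam_in_top_k(posts, key, k=100):
    """Count spam posts among the top k when sorted descending by `key`."""
    top = sorted(posts, key=key, reverse=True)[:k]
    return sum(p.is_spam for p in top)


corpus = make_corpus()
by_time = spam_in_top_k(corpus, key=lambda p: p.timestamp)
by_relevance = spam_in_top_k(corpus, key=lambda p: p.relevance)

print(f"spam in top 100 by time:      {by_time}")
print(f"spam in top 100 by relevance: {by_relevance}")
```

Under these assumptions, the time-ranked top 100 carries spam at the same rate as the incoming stream, while the relevance-ranked top 100 carries none. The point is only that time ranking gives the reader no protection beyond whatever filtering happens before indexing.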
Actually, they are right about totals, but the new adds every day are probably 50 pct or more on some days.
Given that blog engines return results primarily based on date, when there are heavy splog additions, the 44 out of 100 is probably low.
October was a light month compared to the last 30 days.
Posted by: mark cuban | December 22, 2005 at 03:56 PM
Our research shows that the splog rate may be worse than this. We looked at 40M pings from weblogs.com and 75% were from splogs. Our methodology differs from the one used by Umbria as does the splog test. It will be interesting to compare the results, which I hope we can do soon. See http://ebiquity.umbc.edu/getnews/html/id/31/ or http://ebiquity.umbc.edu/blogger/?p=429 for more information.
Posted by: tim finin | December 22, 2005 at 10:50 PM
They've certainly confounded the issue by not disclosing the terms they used for probing the indexes. Choosing 'spammy' terms can certainly raise the proportion of spam results.
Posted by: ryan | December 24, 2005 at 01:27 AM