The last few years have seen amazing growth in both the size of the blogosphere and the number of companies that have sprung up to help organize, index and exploit it. It has been fascinating to see how smaller companies have managed to keep ahead of the late-to-entry giants. It would be nice to think that the startups have managed to do this via tenacity and innovation. Nice, but romantic. The fact of the matter is that the blogosphere is a mess, and it simply takes time and experience to know how to deal with the mess in order to deliver anything that approximates the truth.
Take, for example, a search for the URL to this blog on Technorati:
Interspersed with posts from my blog (good), I also see things that look like posts, but are actually links to blog home pages (bad), with dates which are just plain wrong. For example, just now I see several links to categories on Kevin's feedblog (the links to categories are pretending to be links to posts), one of which is dated October 18th 2005, which Technorati records as being posted 3 days ago.
Take the URL that I searched on: TypePad has two key concepts, a user and a blog. A user has a one-to-one relationship with a URL, but because a user may have multiple blogs, the true URL for a blog is an extension of the user URL. The URL that I searched on above is actually my user URL, which redirects to my default blog (actually, my only TypePad blog). Authors often intend to link to the blog but use the user's URL instead. That appears to behave correctly, but it makes things very messy for link counting algorithms, which then have to discover what a user's default blog is and whether the user has multiple blogs at all. The URL for this blog is:
http://datamining.typepad.com/data_mining
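To make the link counting problem concrete, here is a minimal Python sketch. It assumes a hand-built alias table from user URLs to default blog URLs; the names DEFAULT_BLOG, normalize, resolve and count_inlinks are hypothetical, and a real system would have to discover the aliases itself, for instance by following redirects.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical alias table: normalized user URL -> normalized default blog URL.
# In practice this mapping has to be discovered, not typed in by hand.
DEFAULT_BLOG = {
    "datamining.typepad.com": "datamining.typepad.com/data_mining",
}


def normalize(url):
    """Drop the scheme, lower-case the host and strip any trailing slash."""
    parts = urlsplit(url)
    return parts.netloc.lower() + parts.path.rstrip("/")


def resolve(url):
    """Map a user URL to its default blog; leave other URLs as normalized."""
    key = normalize(url)
    return DEFAULT_BLOG.get(key, key)


def count_inlinks(links):
    """Count inlinks per blog after folding user URLs into blog URLs."""
    return Counter(resolve(url) for url in links)


links = [
    "http://datamining.typepad.com/",
    "http://datamining.typepad.com/data_mining",
    "http://DataMining.typepad.com",
]
print(count_inlinks(links))
```

Without the alias table, these three links would be split across two different targets, and the blog would be credited with only one of them.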
Take PubSub's site stats. PubSub uses feeds to monitor link data. The problem with feeds is that the consumer has to keep a record of what it has seen, and because the consumer is not in direct communication with the provider, there is no mechanism to ensure that data will not be missed. Everyone pretends it is a push model when in fact it is a pull model. Due to errors in accounting, PubSub's stats are often out of whack: they often overcount citations as well as posts.
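The bookkeeping burden on a feed consumer can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of how PubSub works: I assume a feed exposes only a sliding window of its most recent entries, each with a stable id, and the FeedConsumer class is hypothetical.

```python
class FeedConsumer:
    """Polls a feed and remembers which entry ids it has already counted."""

    def __init__(self):
        self.seen = set()
        self.count = 0

    def poll(self, feed_window):
        """feed_window is the list of entry ids currently visible in the feed."""
        new = [entry for entry in feed_window if entry not in self.seen]
        self.seen.update(new)
        self.count += len(new)
        return new


consumer = FeedConsumer()
naive_count = 0

# Each list is what the feed shows at poll time (a window of 3 entries).
# Entry 5 was published and pushed out of the window between polls.
for window in ([1, 2, 3], [2, 3, 4], [6, 7, 8]):
    consumer.poll(window)
    naive_count += len(window)

print(consumer.count)      # entries counted once each
print(naive_count)         # counting without a record of what was seen
print(5 in consumer.seen)  # whether entry 5 was ever observed
```

A consumer with no memory counts entries 2 and 3 twice, while even the careful consumer never sees entry 5, because nothing tells it that an entry came and went between polls.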
Take ping servers. Now, you may think that these fill in the gap between the idea of a push model and the actuality of the pull model. I have it from a reliable source that a certain major blog host pings the ping server before publishing the update to the feed. This means that if one reacted to the ping and pulled the feed, one would get no new data at all.
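One way a consumer might cope with a ping that arrives before the feed is updated is to retry the pull a bounded number of times. The sketch below is an assumption on my part rather than anything a particular service is known to do; fetch_after_ping and its parameters are hypothetical, and fetch stands in for whatever actually downloads and parses the feed.

```python
import time


def fetch_after_ping(fetch, seen, retries=3, delay=1.0, sleep=time.sleep):
    """Pull a feed in response to a ping, retrying if nothing new shows up.

    fetch: callable returning the list of entry ids currently in the feed.
    seen:  set of entry ids already processed; updated in place.
    """
    for attempt in range(retries + 1):
        new = [entry for entry in fetch() if entry not in seen]
        if new:
            seen.update(new)
            return new
        if attempt < retries:
            # Back off a little longer each time: delay, 2*delay, 4*delay...
            sleep(delay * (2 ** attempt))
    return []


# Simulate a host that pings first and publishes the feed slightly later.
responses = iter([["a"], ["a", "b"]])
waits = []
new_entries = fetch_after_ping(lambda: next(responses), {"a"}, sleep=waits.append)
print(new_entries, waits)
```

The retry limit matters: a ping may be spurious, so the consumer has to give up eventually and fall back to its normal polling schedule.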
Now that every consultancy out there is advising corporations to add a blog to their site, we are seeing more and more blogs integrated into home pages. This means that a link to corp.com becomes ambiguous - is it a link to the company, or a link to the blog? This makes counting inlinks very tricky.
Spam - let's not even discuss it.
In business strategy, it is very common to have a line item: create barriers to entry. What this means is: do stuff that makes it harder for competitors to steal your customers. From where I am sitting, the blogosphere is looking after its own.
What an interesting post. Thanks. Any sort of statistical analysis of blogs, whether of inlinks, outlinks, or the quality of the posts, can be very difficult and controversial.
We work hard to provide the best statistics possible. We are attentive to user issues with our stats and have worked to fix any problems. I'd like to help out if possible. Would you be able to send me some examples of what you see as "out of whack"? Please feel free to contact me.
In addition, we do provide a feed for any blog's stats. While you are right that there is no archive, I've found it useful to have the feed sent via e-mail (I've had success with RSSfwd). This way, there is some sort of archive present.
Steven Cohen
PubSub Concepts, Inc
[email protected]
Posted by: Steven Cohen | March 24, 2006 at 10:55 AM
I don't know how Technorati can call itself a service. Besides the fact that its results are simply horrendous, the bold colors and awkward typeface make it literally hard to read.
Posted by: pwb | March 24, 2006 at 04:30 PM
Hey Matt
Interesting post. I do have several comments:
1) You state "there is no mechanism to ensure that data will be missed". I am hoping that you intended to add a "not" in there. I think the current structure (or lack of it) of the blogosphere does its best to ensure that data will be missed.
2) I think one of the main problems in link counting is that there is no reliable mechanism to match a feed to its blog. In fact, my previous work experience counting links has shown that there is typically a one-to-many relationship between blogs and feeds. The 'good' ones map directly based on URL: just add a feed.rss and you're good to go. Now throw in FeedBurner and similar schemes, and it takes some extra programming and/or human intervention to accurately map a feed to a site.
For even more fun, throw in a couple of random blogs a directory level or two down from a domain in the URL. However, there is also 'valid' website content that lives at those same directory levels. Remember, you are being graded on accuracy!
And don't get me started on the sites that can be accessed via http://username.example.com or http://users.example.com/username or http://www.example.com/users/username . All of these refer to the same blog. The fun part comes in when trying to accurately count links.
Oh well, sorry for the rant...
Flexibility keeps life interesting...
(Note, comments/thoughts/emotions are mine and are not intended to reflect the views of any current or previous employers)
Posted by: mark wagner | March 26, 2006 at 10:35 PM
Hi Matt, Great post!
Hi Mark, you are completely right, and that is exactly why I am sure (at least, I hope) that people will annotate their blogs/websites with FOAF-like files. That way, crawlers/agents/systems will be able to easily bring order to that mess, otherwise....
Take care,
Salutations,
Fred
Posted by: Fred | March 27, 2006 at 08:09 AM