August 04, 2005

Blog engine evaluation and median time to index

There has been discussion recently regarding the methodology of evaluating blog search engines (a good place to start is Mary Hodder's recent post on the topic). David Sifry has also posted some recent stats regarding the world of blogs as viewed by Technorati.

Part of Sifry's analysis is the median time to index. This measures the typical time for a post to appear in Technorati's index after it pings one of the blog ping services (weblogs.com, blo.gs, etc.). This statistic is somewhat problematic. Firstly, it doesn't account for coverage. The statistic is computed by looking at the post date and the ingest date (that is to say, the timestamps). But what about all those posts that don't make it into the index? They effectively have an infinite time to index, which means that the statistic does not tell a blogger how long they should expect to wait to be indexed. Secondly, more information is needed about the numbers that back this stat. I'd like to know the distribution of times to index, and am especially interested in the mode of this distribution, that is to say, the most common time that indexed posts take to be included.
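To make the coverage point concrete, here is a minimal sketch in Python. It is not Technorati's method: the function names and the sample delays are made up for illustration, and a post that never reaches the index is represented as None. It compares the median over indexed posts only (the kind of number being reported) with the median over every pinged post, where a miss counts as an infinite delay.

```python
import math
from statistics import median

def coverage(delays):
    """Fraction of pinged posts that reached the index (None = never indexed)."""
    if not delays:
        return 0.0
    return sum(d is not None for d in delays) / len(delays)

def indexed_only_median(delays):
    """Median delay over indexed posts only; misses are silently ignored."""
    indexed = [d for d in delays if d is not None]
    return median(indexed) if indexed else math.inf

def all_posts_median(delays):
    """Median delay over every pinged post, counting a miss as infinite."""
    if not delays:
        return math.inf
    return median([math.inf if d is None else d for d in delays])

# Hypothetical delays in minutes between ping and ingest; None = never indexed.
sample = [2, 5, 7, None, 3, None, None, 4, 6, None]

print(coverage(sample))             # share of posts indexed
print(indexed_only_median(sample))  # what an indexed-only statistic reports
print(all_posts_median(sample))     # what an arbitrary blogger experiences
```

With these invented numbers the indexed-only median looks better than the all-posts median, and once fewer than half of the pinged posts are indexed the all-posts median becomes infinite no matter how fast the indexed ones were.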

It is great that Technorati is tracking and reporting this number. But keep in mind this extreme case: if I were to launch a new search engine that just indexed posts from Technorati's blog, I could trivially make this number very close to 0. The issues of timeliness (as measured by this and other statistics) and coverage are very much related. In fact, they are even related in terms of the architecture of the crawler/spider that backs the search engine. Consequently, measurements of both need to be brought together to give the complete picture.
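As a rough sketch of reporting the two together, the Python below joins a log of ping times with a log of ingest times and returns coverage alongside the median delay. Everything here is an assumption for illustration: the function name, the post ids, the timestamps, and the two engines are invented, not taken from any real system.

```python
from datetime import datetime, timedelta
from statistics import median

def timeliness_and_coverage(pinged, ingested):
    """Return (coverage, median delay in seconds) for one engine.

    pinged:   post id -> time the post pinged a ping service
    ingested: post id -> time the post appeared in the index (may omit posts)
    """
    delays = [
        (ingested[post] - ping_time).total_seconds()
        for post, ping_time in pinged.items()
        if post in ingested
    ]
    cov = len(delays) / len(pinged) if pinged else 0.0
    med = median(delays) if delays else None
    return cov, med

# Hypothetical ping log: four posts, all pinging at the same moment.
t0 = datetime(2005, 8, 4, 12, 0, 0)
pinged = {"a1": t0, "a2": t0, "b1": t0, "c1": t0}

# A narrow engine that only watches blog "a" and indexes it almost instantly.
narrow = {"a1": t0 + timedelta(seconds=1), "a2": t0 + timedelta(seconds=1)}

# A broad engine that indexes everything, but more slowly.
broad = {
    "a1": t0 + timedelta(seconds=60),
    "a2": t0 + timedelta(seconds=120),
    "b1": t0 + timedelta(seconds=300),
    "c1": t0 + timedelta(seconds=600),
}

print(timeliness_and_coverage(pinged, narrow))
print(timeliness_and_coverage(pinged, broad))
```

Judged on median delay alone the narrow engine wins easily; the coverage figure next to it is what shows that it misses half of the posts in this made-up log.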

The clock is ticking...

Comments

Well, you can see our coverage rate by looking at the growth of the blogosphere numbers posted earlier in the week. We think that we're indexing over 95% of the public blogosphere, according to the coverage tests that we've done internally.

Of course, those are our internal numbers, and don't necessarily mean they are "objective truth".

Dave

David - I can see your *rate of growth* in your first two posts, but no statistics on the percentage of coverage that you have. Your comment mentions this (95%), so this is new information. I'd be interested in what you mean by the public blogosphere. Does this include walled gardens that don't ping? What about languages? Actually, I'm not really after those numbers, but want to be clear about the point I made with respect to median time to index: it is a precision measurement, not a recall measurement, and thus only has meaning if the blog is captured by your system. In addition, while it is a metric with some good properties (reducing this number is probably a good thing!), it is not necessarily good for setting expectations for an arbitrary blogger.

I therefore conclude that you miss something! Give me more info. Thanks.

Great! It's awesome. I enjoyed reading this blog.
