How do readers get to blogs? There is plenty of work looking at the structure of the blogosphere as a standalone partition of the web, but what is its relationship with the web at large? I've collected a sample of referral statistics from 1,500 of the top weblogs. The data was collected between the 29th of July and the 26th of August 2006 and resulted in 890,469 referrals. The sample was collected by capturing the 20 most recent referrals at 10:00 EST.
Of this data, 539,593 were of unknown origin.
Of the remaining 350,516 referrals, the breakdown by host is as follows:
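As a rough illustration of how raw referrer strings can be reduced to hosts and an "unknown" bucket, here is a minimal sketch. It is not the code used to build this sample: the function names, the rule for what counts as unknown (an empty or unparseable referrer), and the folding of a leading `www.` are all assumptions made for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def referrer_host(referrer):
    """Return the lowercase host of a referrer URL, or None if there is none."""
    if not referrer:
        return None
    try:
        host = urlsplit(referrer.strip()).hostname
    except ValueError:
        # Malformed URL: treat it as having no usable host.
        return None
    if not host:
        return None
    # Assumption: "www.example.com" and "example.com" are the same host.
    return host[4:] if host.startswith("www.") else host


def tally(referrers):
    """Count referrals per host, with an 'unknown' bucket for the rest."""
    counts = Counter()
    for referrer in referrers:
        counts[referrer_host(referrer) or "unknown"] += 1
    return counts
```

Calling `tally` on a list of referrer strings returns a `Counter` keyed by host, with everything that could not be attributed collected under `"unknown"`.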
host | count | % |
---|---|---|
google.com | 82,203 | 23.5 |
images.google.com | 13,817 | 3.9 |
search.yahoo.com | 11,492 | 3.3 |
google.co.uk | 9,119 | 2.6 |
search.msn.com | 5,684 | 1.6 |
google.ca | 5,020 | 1.4 |
bloglines.com | 3,765 | 1.1 |
google.com.au | 3,597 | 1.0 |
google.fr | 3,435 | 1.0 |
google.de | 2,840 | 0.8 |
michellemalkin.com | 2,786 | 0.8 |
Note that these are per-host statistics, not per-domain (if we look at referrals by domain we will see, for example, that bloglines and typepad are in the top 10).
What these numbers suggest is that mainstream web search (MSW search) is a major channel of traffic to the blogosphere. In fact, Google domains (e.g. google.com, google.co.uk, images.google.com) account for 45.4% of referrals. Google traffic breaks down as follows:
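To make the host versus domain distinction concrete, here is a small sketch that rolls per-host counts up to per-domain counts. The counts are copied from the tables in this post; the `registered_domain` helper and its short hand-picked suffix list are assumptions for illustration only (a real analysis would use the full Public Suffix List).

```python
from collections import Counter

# Per-host counts copied from the tables in this post (a subset of hosts).
HOST_COUNTS = {
    "google.com": 82203,
    "images.google.com": 13817,
    "blogsearch.google.com": 1101,
    "news.google.com": 1015,
    "google.co.uk": 9119,
    "images.google.co.uk": 2015,
    "bloglines.com": 3765,
}

# Assumption: a tiny hand-picked set of two-label suffixes, enough for the
# hosts above. It is not a complete list.
TWO_LABEL_SUFFIXES = {"co.uk", "com.au", "com.br", "co.in"}


def registered_domain(host):
    """Naively reduce a host such as images.google.co.uk to google.co.uk."""
    labels = host.lower().split(".")
    keep = 3 if ".".join(labels[-2:]) in TWO_LABEL_SUFFIXES else 2
    return ".".join(labels[-keep:])


def by_domain(host_counts):
    """Sum per-host counts into per-domain counts."""
    totals = Counter()
    for host, count in host_counts.items():
        totals[registered_domain(host)] += count
    return totals
```

With this grouping, `images.google.com`, `blogsearch.google.com` and `news.google.com` all fold into `google.com`, which is why a per-domain ranking can look different from the per-host one.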
host | count | % |
---|---|---|
google.com | 82,203 | 23.45 |
images.google.com | 13,817 | 3.94 |
google.co.uk | 9,119 | 2.60 |
google.ca | 5,020 | 1.43 |
google.com.au | 3,597 | 1.03 |
google.fr | 3,435 | 0.98 |
google.de | 2,840 | 0.81 |
google.co.in | 2,284 | 0.65 |
images.google.co.uk | 2,015 | 0.57 |
google.es | 1,889 | 0.54 |
images.google.de | 1,430 | 0.41 |
images.google.fr | 1,398 | 0.40 |
google.nl | 1,308 | 0.37 |
images.google.ca | 1,245 | 0.36 |
blogsearch.google.com | 1,101 | 0.31 |
news.google.com | 1,015 | 0.29 |
google.it | 982 | 0.28 |
google.com.br | 948 | 0.27 |
google.be | 891 | 0.25 |
images.google.com.au | 838 | 0.24 |
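The percentages in this table are each host's count divided by the 350,516 known referrals. A short sketch of that arithmetic, using only the counts listed above (the variable and function names are made up for the example):

```python
# Counts for the 20 Google hosts listed in the table above.
GOOGLE_HOSTS = {
    "google.com": 82203, "images.google.com": 13817, "google.co.uk": 9119,
    "google.ca": 5020, "google.com.au": 3597, "google.fr": 3435,
    "google.de": 2840, "google.co.in": 2284, "images.google.co.uk": 2015,
    "google.es": 1889, "images.google.de": 1430, "images.google.fr": 1398,
    "google.nl": 1308, "images.google.ca": 1245,
    "blogsearch.google.com": 1101, "news.google.com": 1015,
    "google.it": 982, "google.com.br": 948, "google.be": 891,
    "images.google.com.au": 838,
}

# Referrals with a known origin, as stated in the post.
KNOWN_REFERRALS = 350_516


def percent_of_known(count, known=KNOWN_REFERRALS):
    """Share of known referrals, as a percentage."""
    return 100.0 * count / known


listed_total = sum(GOOGLE_HOSTS.values())
listed_share = percent_of_known(listed_total)
```

The 20 hosts listed here sum to 137,375 referrals, or about 39.2% of known referrals. That is lower than the 45.4% quoted for all Google domains, so presumably the remainder comes from Google hosts that fall below the cutoff of this table.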
Some thoughts:
- Are users who arrive at blogs from Google aware that they are on a blog?
- Are users intending to find data on blogs and going to Google to search for it? I suspect not. I believe that these referrals represent users who are searching for information in general and happen to land on a blog.
- Does this bias represent an opportunity for blog-specific search? Absolutely. I see this partly as an educational problem. If blog search engines play their cards right, these searchers will eventually be using blog-specific search engines as they begin to understand what blogs are and what their value is. The key to enabling this change is that blog search be better than general search. Given the fundamental difference between blog data and MSW data (particularly timeliness), this is not an unreasonable expectation.
Caveats of the data: note that this data isn't a straightforward sample. Due to the way in which it is gathered, it doesn't represent a snapshot in time. In addition, there is this mass of unknown referrals. I am assuming here that their distribution is the same as that for the known referrals, but there is no way to actually determine that to be the case.
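To spell out what that assumption implies, here is a minimal sketch that scales a host's known count up to the unknown referrals. It only illustrates the proportional assumption stated above; the function name is hypothetical, and the result is an extrapolation, not a measurement.

```python
KNOWN_REFERRALS = 350_516    # referrals with a known origin (from the post)
UNKNOWN_REFERRALS = 539_593  # referrals of unknown origin (from the post)


def estimated_unknown_share(known_count,
                            known_total=KNOWN_REFERRALS,
                            unknown_total=UNKNOWN_REFERRALS):
    """Estimate how many unknown referrals a host would account for,
    ASSUMING unknown referrals are distributed like the known ones."""
    return unknown_total * known_count / known_total


# Example: google.com has 82,203 known referrals.
google_com_estimate = estimated_unknown_share(82_203)
```

If the unknown referrals are in fact skewed (towards feed readers or bookmarks, say), this estimate would be off by an amount the data cannot reveal.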
And - this wouldn't be a data mining post without some graphic, so here is a treemap of the data.
Update - I've located a glitch in the data. The implication of the problem is roughly that the basic shape of the data is the same but the precise numbers are not accurate. I'm working on getting a cleaner version of the data and will repost this analysis and more when I do.
So you got the logs of 1,500 of the top blogs and did this analysis?
Where did you get this data?
Dave
Posted by: David Sifry | August 31, 2006 at 11:39 AM
Hi, Matt. I am confused by something in this data. Where are the other blog search engines and feed readers? In particular, why don't Technorati and My Yahoo show up as referrers?
Looking at the data, I was about to conclude that Google web search overwhelms blog search engines and feed readers as the major way people get to weblogs -- a claim I have also seen elsewhere -- but the data only weakly supports that conclusion with only 82k of 890k clicks labeled google.com.
What do you think?
Posted by: Greg Linden | August 31, 2006 at 12:03 PM
Dave,
I don't have the logs; I'm just using data that is freely available on the web.
Matt
Posted by: Matthew Hurst | August 31, 2006 at 05:46 PM
Greg,
I'm going to follow up with some more posts on this, so perhaps some of your questions will be answered later. For now, note that 82k of 350k are from google.com (not from 890k - don't forget those 'unknown' referrals). Of the 350k, 45% are from one of Google's hosts (which includes referrals from blogsearch.google.com btw, but this isn't much).
Stay tuned!
Matt
Posted by: Matthew Hurst | August 31, 2006 at 05:54 PM
Matt,
Can you point me to this data?
Where is it available on the web?
Dave
Posted by: David Sifry | September 01, 2006 at 12:12 AM
"How do readers get to blogs?" - speaking for myself there are two answers, different modes.
When I'm catching up on the day's new material I get to blogs from other blogs, usually from reading an online aggregation (I got here today via Corante, I also regularly read Planet RDF, Planet XMLHack, Planet Web 2.0 etc) or from the feeds I've got in Bloglines.
But when I'm looking for a specific piece of information (which happens a lot with tech stuff), I'll use Google and as often as not end up at a blog.
The significance in the context of your post is that in the latter case, I really don't care where I get the information from - so a blog-only search engine wouldn't really add anything.
Posted by: Danny | September 01, 2006 at 05:22 AM