[Update: I've created a new image with some improved qualities.]
What does the blogosphere look like? Well, I'm not really sure what the question means. However, it is certainly intuitive to think about the graph structure of the blogosphere, where nodes are blogs and edges are the links between them (whether from blogrolls, trackbacks or links). I've tried a few experiments to draw this graph, and basically they have shown that the blogosphere is one giant hairball. As a graph drawing problem, this visualization challenge has two solutions: use a different graph layout algorithm, or draw a different graph.
For the upcoming workshop, I'm really keen to produce a good visualization, so I've been thinking about drawing a different graph. The problem seems to be that there is essentially no typology to links. Topical blogs are good sources of rich link structure because they stay on topic. The majority of blogs, however, are diary-like in nature, and so their contents tend to be somewhat random (hey - have you seen this?). Consequently, the link structure is a mess.
There are some pretty obvious things one can do to start removing links with trivial count-based filters: remove all links between two blogs that have fewer than t instances, remove blogs which have fewer than c citations, etc. I'm interested in something a little more subtle, however. I want to look at the blogosphere from the point of view of robust, rich community structure. Basically, I want a magic filter that removes all blogs which don't participate in a community of some sort.
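To make the two trivial filters concrete, here is a minimal sketch. It is not the filtering method used for the image, just the count-based baseline. The function name `count_filter`, the treatment of links as directed (source, target) pairs, and the choice to apply the t filter before the c filter in a single pass are all assumptions made for illustration.

```python
from collections import Counter

def count_filter(links, t, c):
    """Apply the two count-based filters to a list of (source, target) links.

    Assumptions (not from the post): links are directed, a "citation" is an
    incoming link instance, and the filters run once, t first and then c.
    """
    # Count how many times each source blog links to each target blog.
    edge_counts = Counter((s, d) for s, d in links if s != d)

    # Filter 1: drop links with fewer than t instances.
    strong = {edge: n for edge, n in edge_counts.items() if n >= t}

    # Filter 2: drop blogs with fewer than c citations on the surviving links.
    citations = Counter()
    for (s, d), n in strong.items():
        citations[d] += n
    keep = {blog for blog, n in citations.items() if n >= c}

    # Keep only links whose endpoints both survived.
    return {(s, d): n for (s, d), n in strong.items() if s in keep and d in keep}

links = [("a", "b")] * 3 + [("b", "a")] * 3 + [("c", "a")] + [("a", "d")] * 2
filtered = count_filter(links, t=2, c=3)
```

In this toy input the single c-to-a link falls to the t filter, and blog d (only two citations) falls to the c filter, leaving just the mutual a/b links. Note that removing a blog can push its neighbours below c, so one might repeat the second step until nothing changes.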
The following image is my first pass at doing this. I'm not yet ready to talk about the filtering method used. However, it does attempt to follow the basic goal above. The nodes displayed represent blogs. The size of a node is a rough indication of its number of citations. The colour of a node indicates its host: LiveJournal (blue), Blogspot (red), TypePad (green), WordPress (cyan) and Weblogsinc (pink) - all other blogs are gray. The layout algorithm is a variation on the force-based organic method and has been iterated 1,000 times. The basic interpretation: blogs that are near each other cite each other more than those that are further apart. The data comes from the workshop dataset (which contains approximately 1 million blogs and 10 million posts).
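For readers unfamiliar with force-based layouts, the sketch below shows the general idea in the Fruchterman-Reingold style: every pair of nodes repels, linked nodes attract, and the allowed movement shrinks each iteration. This is a generic textbook version, not the variation used for the image; the function name `force_layout`, the force formulas, the cooling schedule and the use of link counts as attraction weights are all assumptions.

```python
import math
import random

def force_layout(nodes, edges, iterations=1000, seed=0):
    """Generic Fruchterman-Reingold-style layout (an illustrative assumption).

    nodes: list of node ids; edges: dict mapping (a, b) -> link weight.
    Returns a dict mapping node id -> [x, y].
    """
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1, 1), rng.uniform(-1, 1)] for n in nodes}
    k = 1.0 / math.sqrt(len(nodes))  # preferred spacing between nodes

    for step in range(iterations):
        temp = 0.1 * (1 - step / iterations)  # max move per step, cools to 0
        disp = {n: [0.0, 0.0] for n in nodes}

        # Repulsion between every pair of nodes.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                d = math.hypot(dx, dy) or 1e-9
                f = k * k / d
                disp[a][0] += dx / d * f
                disp[a][1] += dy / d * f
                disp[b][0] -= dx / d * f
                disp[b][1] -= dy / d * f

        # Attraction along links, stronger for heavier links.
        for (a, b), w in edges.items():
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            d = math.hypot(dx, dy) or 1e-9
            f = w * d * d / k
            disp[a][0] -= dx / d * f
            disp[a][1] -= dy / d * f
            disp[b][0] += dx / d * f
            disp[b][1] += dy / d * f

        # Move each node along its net force, capped by the temperature.
        for n in nodes:
            dx, dy = disp[n]
            d = math.hypot(dx, dy) or 1e-9
            move = min(d, temp)
            pos[n][0] += dx / d * move
            pos[n][1] += dy / d * move
    return pos

# Two tightly linked triangles joined by one weak link.
nodes = list("abcdef")
edges = {("a", "b"): 3, ("b", "c"): 3, ("a", "c"): 3,
         ("d", "e"): 3, ("e", "f"): 3, ("d", "f"): 3,
         ("c", "d"): 1}
pos = force_layout(nodes, edges, iterations=500, seed=1)
```

The all-pairs repulsion loop is quadratic in the number of nodes, so a graph at blogosphere scale would need an approximation (a grid or a quadtree, say); the sketch is only meant to show why heavily interlinked blogs end up near each other.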
A couple of observations:
- LiveJournal is very self-referential and keeps away from the rest of the blogosphere.
- TypePad and Blogspot appear to be well mixed with the rest of the blogosphere.
- There is a small group of WordPress blogs off to the right.
- Weblogsinc blogs (pink) form a tight little cluster - probably due to lots of interlinking.
Note that this is a very preliminary result. Note also that the image shows only the largest connected component in the entire graph - I cut out the rest of the blogosphere that wasn't linked to this cluster.
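Extracting the largest connected component is a standard step, and a minimal sketch looks like this. The function name `largest_component` is hypothetical, and treating links as undirected (i.e. taking the weakly connected component) is an assumption, since the post does not say whether link direction was considered.

```python
from collections import defaultdict, deque

def largest_component(edges):
    """Return the node set of the largest connected component.

    edges: iterable of (a, b) pairs. Direction is ignored (an assumption).
    """
    # Build an undirected adjacency map.
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    seen = set()
    best = set()
    for start in adj:
        if start in seen:
            continue
        # Breadth-first search to collect everything reachable from start.
        comp = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in comp:
                    comp.add(neighbour)
                    queue.append(neighbour)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best

edges = [("a", "b"), ("b", "c"), ("x", "y")]
core = largest_component(edges)
```

Here the a-b-c chain is kept and the isolated x-y pair is discarded, which mirrors cutting out everything that isn't linked to the main cluster.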
Feels like I just had the first glimpse into a new galaxy!