Visualizing the blogosphere as a graph seems like a very natural thing to do. Graph layout is a rich research area, with plenty of theory and many implementations that can deal with very large graphs (see this post from the excellent Infosthetics). Graphs, though potentially appealing as eye candy for any data set that represents relational information, are incredibly difficult to parse intuitively. Graph layout algorithms are often suited to particular types of graph: force-based algorithms work reasonably well with tree-like structures, but can only deal with small graphs (thousands of nodes) because of their iterative nature. Algorithms like Harel/Koren's eigenvector methods are theoretically appealing for their speed, but are far better suited to grid-like structures. In between are hierarchical and analytical approaches that analyse the structure of the graph and use the result to guide heuristics for appealing layouts.
In exploring this space, one aspect of the blogosphere became very clear: the big head. It is fashionable to talk about the long tail, but the blogosphere has a very big head. What this means for graph layout is that if you characterize the blogosphere as a set of nodes (blogs) with edges representing link structure (e.g. blogroll data), everyone points to a small number of A-list bloggers in addition to their other favourite blogs, which certainly turned my naive attempts at layout into a big hairball.
Thinking about this led me to consider which parts of the blogosphere could be described clearly without suffering from this behaviour. Filtering on blogs that have some, but not too many, inlinks or outlinks is one way to do this. Another is to look at depth-first trees rooted in popular blogs.
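Both ideas can be sketched in a few lines of Python. This is only an illustration, not the code used for this post: the function names, the `lo`/`hi` thresholds, and the reading of `limit` as a cap on the number of blogs visited are all assumptions, and the blogroll data is taken to be a plain dict mapping each blog to the blogs it links to.

```python
from collections import defaultdict


def degree_band(edges, lo=1, hi=10):
    """Keep only blogs whose inlink count lies in [lo, hi] (assumed thresholds).

    Blogs with no inlinks never appear in the count, so they are dropped too.
    Returns the edges whose two endpoints both survive the filter.
    """
    indeg = defaultdict(int)
    for _src, dst in edges:
        indeg[dst] += 1
    keep = {blog for blog, count in indeg.items() if lo <= count <= hi}
    return [(s, d) for s, d in edges if s in keep and d in keep]


def dfs_tree(blogroll, root, limit=100):
    """Depth-first tree rooted at `root`, stopping once `limit` blogs are seen.

    `blogroll` is a hypothetical dict: blog -> list of blogs on its blogroll.
    Returns the tree edges in the order they were discovered.
    """
    seen = {root}
    tree = []
    stack = [(root, iter(blogroll.get(root, ())))]
    while stack and len(seen) < limit:
        node, links = stack[-1]
        for nxt in links:
            if nxt not in seen:
                seen.add(nxt)
                tree.append((node, nxt))
                stack.append((nxt, iter(blogroll.get(nxt, ()))))
                break  # descend into the new blog before trying siblings
        else:
            stack.pop()  # every link from this blog has been explored
    return tree
```

Filtering on inlinks removes the A-list hubs that everyone points to, while the depth-first tree keeps a hub as the root but discards the cross links that produce the hairball.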
I created some test data by crawling Blogrolling.com in a depth-first fashion. In the examples below, the colour of each node represents the average distance between it and all other nodes in the graph. Colours towards the cooler end of the spectrum indicate a lower average distance, whereas the hot nodes have a higher average distance.
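A minimal sketch of that colouring, under some assumptions that are mine rather than anything fixed by the images: distance is counted in hops, links are treated as undirected, the adjacency dict `adj` lists every node as a key, and the blue-to-red scale is just one way to run from cool to hot.

```python
import colorsys
from collections import deque


def average_distances(adj):
    """Mean shortest-path length (in hops) from each node to all it can reach.

    `adj` maps every node to its neighbours and is assumed to be symmetric.
    """
    result = {}
    for start in adj:
        dist = {start: 0}
        queue = deque([start])
        while queue:  # plain breadth-first search from `start`
            node = queue.popleft()
            for neighbour in adj[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        others = len(dist) - 1
        result[start] = sum(dist.values()) / others if others else 0.0
    return result


def to_colour(value, lo, hi):
    """Map a value in [lo, hi] to RGB: blue (cool) for low, red (hot) for high."""
    t = 0.0 if hi == lo else (value - lo) / (hi - lo)
    hue = (1.0 - t) * (2.0 / 3.0)  # HSV hue 2/3 is blue, hue 0 is red
    return colorsys.hsv_to_rgb(hue, 1.0, 1.0)


# Usage: colour a tiny path graph a - b - c.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
avg = average_distances(adj)
colours = {n: to_colour(v, min(avg.values()), max(avg.values()))
           for n, v in avg.items()}
```

Running one breadth-first search per node is fine for crawls of this size, but it is the kind of cost that grows quickly on larger graphs.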
Blogroll links for DailyKos for a depth first crawl of 100.
Blogroll links for Slashdot for a depth first crawl of 100.
As stated above, reading graphs is very hard, so I will leave what these two graphs actually mean up to the reader. One of the key tenets of data mining and, by extension, data visualization is: let the data speak. The promise of data visualization is that patterns that may not be algorithmically derivable may be intuitively derivable using human pattern recognition capabilities. It may not be clear what these graphs mean, but they do at least appear to be quite different.