The notion of community is key to the blogosphere. One signal of community can be found in the graph whose nodes are blogs and whose edges are links between blogs, determined by citations found in posts. Many of the example graphs posted on this and other blogs have been built this way.
Given a graph, we can partition it into connected components. A connected component is a sub-graph in which every node can be reached from every other node by following links, and from which no link leads to a node outside it. By carrying out this partitioning, we can observe the distribution of sub-graphs in terms of the number of sub-graphs of a particular size. It is not surprising to discover that there are many small sub-graphs and progressively fewer sub-graphs as the size increases, ending up in the core of the blogosphere - the largest sub-graph.
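The partitioning described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for this post: the names `connected_components`, `size_distribution`, `blogs` and `links` are hypothetical, the data is a toy example, and links are assumed to be undirected (a citation in either direction joins two blogs).

```python
from collections import Counter, defaultdict, deque


def connected_components(nodes, edges):
    """Return a list of sets, one per connected component.

    Assumption: links are treated as undirected.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    component.add(other)
                    queue.append(other)
        components.append(component)
    return components


def size_distribution(components):
    """Map sub-graph size -> number of sub-graphs of that size."""
    return Counter(len(component) for component in components)


# Toy data (hypothetical): six blogs, three citation links.
blogs = ["a", "b", "c", "d", "e", "f"]
links = [("a", "b"), ("b", "c"), ("d", "e")]
distribution = size_distribution(connected_components(blogs, links))
```

In this toy graph there is one sub-graph of size 3, one of size 2 and one single-node sub-graph, which is the kind of size-versus-count table that the distribution plot is drawn from.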
If we plot this distribution, we get the following graph (note that both the x and y axes are plotted on a logarithmic scale):
Note that the 'long tail' is represented here by all the single-node sub-graphs, of which there are 187,376 in this data set. The core is all the way over on the right - a sub-graph of size 19,712. This distribution suggests a power-law relationship between the number of nodes in a sub-graph and the number of sub-graphs of that size.
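A power law of the form count = C * size^(-alpha) appears as a straight line with slope -alpha on a log-log plot, which is why the logarithmic axes are useful here. As a rough check (not a rigorous test, and not something computed for this post), one could fit a least-squares line to the log-log points. The function name `loglog_slope` and the toy numbers below are hypothetical.

```python
import math


def loglog_slope(distribution):
    """Least-squares slope and intercept of log10(count) against log10(size).

    `distribution` maps sub-graph size -> number of sub-graphs of that size.
    A roughly straight line on log-log axes is consistent with a power law;
    this is only an informal check, not a statistical test.
    """
    points = [
        (math.log10(size), math.log10(count))
        for size, count in distribution.items()
        if size > 0 and count > 0
    ]
    n = len(points)
    if n < 2:
        raise ValueError("need at least two (size, count) points")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all sizes are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy distribution (hypothetical) that follows count = 1600 * size**-2 exactly.
toy = {1: 1600, 2: 400, 4: 100}
slope, intercept = loglog_slope(toy)
```

On real data the single core sub-graph sits far from the rest of the points, so it may be worth fitting with and without it before reading anything into the slope.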
The graph being used here is derived from the data prepared for the upcoming Workshop on Weblogging Ecosystems. We can further refine the basic graph by filtering out links whose weight falls below a certain threshold. The weight of a link is the count of citations between the two blogs. If we filter on progressively higher weights - requiring first at least 1 citation, then 2, then 3 and so on - we can plot the resulting distributions of connected component sizes:
In this plot, the z axis (the vertical axis) shows the count of sub-graphs, the x axis (left to right) shows the number of nodes in the sub-graph (the number of blogs in the community), and the y axis shows the minimum weight of the links in the graph (the number of citations between blogs). Here the sub-graph size has been limited to 100.
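The threshold filtering behind this plot can be sketched as follows. This is an illustration under stated assumptions, not the code used for this post: `distribution_at_threshold`, `blogs` and `citations` are hypothetical names, weights are assumed to be keyed by an unordered pair of blogs, and a blog that loses all of its links is assumed to remain in the graph as a single-node sub-graph. A union-find structure is used here simply because it is compact.

```python
from collections import Counter


def distribution_at_threshold(nodes, weighted_links, min_weight):
    """Size distribution of connected components, keeping only links
    whose citation count is at least `min_weight`.

    Assumptions: links are undirected, every link endpoint appears in
    `nodes`, and blogs left without links count as size-1 sub-graphs.
    """
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for (a, b), weight in weighted_links.items():
        if weight >= min_weight:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_a] = root_b

    # Count nodes per root, then count how many components have each size.
    component_sizes = Counter(find(node) for node in nodes)
    return Counter(component_sizes.values())


# Toy data (hypothetical): citation counts between pairs of blogs.
blogs = ["a", "b", "c", "d", "e"]
citations = {("a", "b"): 3, ("b", "c"): 1, ("d", "e"): 2}
by_threshold = {
    w: distribution_at_threshold(blogs, citations, w) for w in (1, 2, 3)
}
```

Each entry of `by_threshold` corresponds to one slice of the plot along the minimum-weight axis. In the toy data, raising the threshold from 1 to 3 breaks the three-blog sub-graph apart and leaves progressively more single-node sub-graphs.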
It is interesting to plot this distribution for different weight thresholds because one might argue that a single citation between blogs does not indicate a real social relationship - it could be an arbitrary link to a blog outside the community. By increasing the minimum weight, we require a stronger pattern of citation and therefore a more reliable indication of a relationship between the two blogs. In fact, there are other, more interesting ways in which this relationship may be established - material for another post.