Two of the major differences between the mainstream web and the various flavours of social media are the temporal aspect (time matters far more in social media) and the document representation (documents tend to be fragments of complete HTML pages, for example). A persistent theme in social media monitoring circles is the desire for real-time monitoring. Customers want to know what is happening to their brands and reputations as it happens (actually, I suspect there is a mismatch between how frequently they want this type of information and how quickly they can react to it - but that is another story). The blogosphere has enough infrastructure to allow for near real-time monitoring (that is to say acquisition, analysis and reporting): both Technorati (historically) and TailRank (via this comment on my blog) report five minutes as the mean time to index a post. But what about the boardscape - the less loved but far larger world of 300MM+ social media participants?
I had many conversations with Ron Kass, CEO of BoardTracker and Klostu, at ICWSM (Klostu is a conference sponsor). One data acquisition idea I've been particularly fond of in the social media space is predictive crawling: rather than crawling on a fixed schedule, a predictive crawler builds a dynamic model of each source and uses it to predict when the next post will be published. That prediction then determines the most efficient time for each data pull. Ron indicated that BoardTracker uses this approach and that its mean time to index is under two hours - pretty impressive for a content space with no publication notification mechanism to speak of.
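To make the idea concrete, here is a minimal sketch of a predictive poll scheduler. This is purely illustrative - it is not BoardTracker's actual algorithm, whose model I don't know. It assumes the simplest plausible source model: an exponentially weighted moving average of the gaps between observed posts, used to predict when the next post will appear and hence when to fetch next.

```python
class PredictivePoller:
    """Hypothetical predictive-crawl scheduler: models the inter-post
    gap of a single source as an exponentially weighted moving average
    and schedules the next fetch just after the predicted next post."""

    def __init__(self, alpha=0.3, initial_gap=3600.0):
        self.alpha = alpha          # smoothing factor for the gap model
        self.avg_gap = initial_gap  # predicted seconds between posts
        self.last_post_time = None  # timestamp of most recent post seen

    def observe_post(self, post_time):
        # Fold a newly observed post into the inter-post gap model.
        if self.last_post_time is not None:
            gap = post_time - self.last_post_time
            self.avg_gap = self.alpha * gap + (1 - self.alpha) * self.avg_gap
        self.last_post_time = post_time

    def next_fetch_time(self, slack=0.1):
        # Poll slightly after the predicted next post (10% slack here),
        # trading a little latency for far fewer wasted fetches.
        return self.last_post_time + self.avg_gap * (1 + slack)
```

For example, a forum posting roughly every ten minutes (timestamps 0, 580, 1190, 1790 seconds) converges on a ~600-second gap, so the scheduler would pull again around t=2450 rather than hammering the source every minute. A production version would also need a fallback for sources with no recent posts and a cap on how long it is willing to wait.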
Note that a paper on a related topic was also presented at ICWSM by Ka Cheung Sia et al.: mining the browsing patterns of a client-side RSS reader to use bandwidth efficiently while ensuring that the most recent post is available for the reader.