Thomas Richardson, at the University of Washington, emailed me a pointer to a talk that I wished I could have attended: How to Read 100 Million Blogs (and How to Classify Deaths without Physicians). I was travelling at the time, but I read (most of) the paper that Gary King presented. The content of the paper ties in with a key challenge in industrial text mining settings, one we encountered again and again at BuzzMetrics.
Let's say you have a system for mining sentiment from blog posts. Using this system (which, let's assume, can also associate the sentiment with a topic of discourse), we can count the number of positive and negative opinions found online for some topic. From these counts, we can then state the proportion of posts that contained positive, negative, or neutral opinions about the topic. However, such a system will have some error rate, in the sentiment mining, in the topic recognition, and in the association between the two. By investigating the nature of these errors, we can account for the bias in the automation and provide a more accurate estimate of the proportions. For example, if we find 80 negative posts on X and know that our recall for negative posts is 80%, we can predict (with a number of obvious caveats) that we failed to observe a further 20 posts with a negative opinion of X. In addition, we can provide some form of error bounds on the estimates.
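To make the arithmetic concrete, here is a minimal sketch in Python of that recall correction plus rough error bounds. The post counts and recall figures are invented for illustration; in practice the per-class recall would be estimated from held-out labelled data, and a real correction would also have to account for false positives (precision), which this sketch ignores:

```python
import math

def correct_for_recall(observed: int, recall: float) -> float:
    """Estimate the true count from an observed count and known recall.

    E.g. 80 observed negatives at 80% recall implies 80 / 0.8 = 100
    true negatives, i.e. roughly 20 posts we failed to observe.
    """
    return observed / recall

def proportion_with_bounds(count: float, total: float, z: float = 1.96):
    """Point estimate and normal-approximation 95% bounds for a
    proportion; a crude stand-in for proper error bounds."""
    p = count / total
    half_width = z * math.sqrt(p * (1 - p) / total)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Invented counts of posts about topic X, as labelled by the system,
# and assumed per-class recall measured on held-out data.
observed = {"positive": 150, "negative": 80, "neutral": 270}
recall = {"positive": 0.90, "negative": 0.80, "neutral": 0.95}

corrected = {label: correct_for_recall(n, recall[label])
             for label, n in observed.items()}
total = sum(corrected.values())

for label, n in corrected.items():
    p, lo, hi = proportion_with_bounds(n, total)
    print(f"{label}: ~{n:.0f} posts, {p:.1%} ({lo:.1%} to {hi:.1%})")
```

The point of the sketch is that the aggregate proportions can be made more trustworthy than any individual label: the per-post classifier stays noisy, but its biases are measured and divided out at the level of counts.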
The challenge with deploying such a system is that users want to drill down to individual instances, and when they encounter errors at that level, they get nervous about the results in general.
King and Hopkins' paper is all about being comfortable with proportional predictions. I've not yet read the paper thoroughly and will probably revisit it here when I have. I did notice that the data they used was provided by BlogPulse.
One thing to note about this general form of analysis: text/data mining is fundamentally about finding relationships within the data. This is far easier to do when one can annotate features of data instances.
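As an illustrative sketch (the schema and values here are hypothetical), once each post carries machine-annotated features, relational questions about the collection reduce to simple aggregations:

```python
from collections import Counter

# Hypothetical annotated instances: each post carries features
# added by the mining pipeline (topic, sentiment, metadata).
posts = [
    {"id": 1, "topic": "X", "sentiment": "negative", "region": "US"},
    {"id": 2, "topic": "X", "sentiment": "positive", "region": "UK"},
    {"id": 3, "topic": "Y", "sentiment": "negative", "region": "US"},
]

# With features in place, a relationship such as how sentiment
# co-varies with topic is a one-line count over the annotations.
by_topic = Counter((p["topic"], p["sentiment"]) for p in posts)
print(by_topic)
```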