Since my earlier post on the new trending tool provided by Google Books, I've been thinking more about the service. While I've found plenty of interesting trends (more on which later), I've also been considering the underlying data and interface. Many of these considerations are common to any trending or other data-probing interface (such as BlogPulse).
While there has been a lot of reasonably visible copy written about the opportunities presented by the data set - the potential to understand trends in our culture and language - this enthusiastic data geekery is somewhat lacking in data diligence. The original article in Science, for example, doesn't describe the data in even the most basic terms.
At the very least, the data needs to be described in terms of design, accuracy and bias.
By data design I mean the intentions of the data. These intentions are somewhat exposed in the interface (where one can choose from things like 'American English', 'British English', etc.). I'd love to understand the rationale behind some of the corpora - e.g. English 1 Million - and the reason for missing corpora (we have English Fiction but not English Non-Fiction).
The accuracy of the data, with respect to that design, can at least be considered in terms of the current specifications. How accurate are the years associated with the books? How accurate is the recorded country of publication? In addition, as Google points out, the accuracy of the OCR is also of great interest, especially for older texts (Danny Sullivan has an interesting post on this topic).
Finally, given any data designed along a set of dimensions, one can always take another set of dimensions and see how they are distributed and correlated - if at all. For example, what is the mixture of fiction and non-fiction in the English corpus? What is the distribution of topics? Are these representative with respect to historically accurate accounts of linguistic and cultural shifts (e.g. the introduction of the novel, the impact of the Enlightenment on the mixture of fiction and non-fiction)? What is the sampling from different publishing houses, and is that representative of the number of books published, or the number of copies sold? This last point is intriguing - does a book with 1 million copies in circulation have more 'culturomic' impact than a book with only a single copy out there?
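As a concrete example of the kind of diligence I have in mind, here is a minimal sketch (in Python) of how one might estimate the fiction share of the English corpus per year. It assumes you have yearly total word counts for the 'English' and 'English Fiction' corpora in two simple CSV files; the file names and column layout are hypothetical, and the ratio is only a rough proxy since the exact relationship between the two corpora isn't documented.

```python
import csv

def load_counts(path):
    """Return {year: total_words} from a CSV with columns year,total_words."""
    counts = {}
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            counts[int(row['year'])] = int(row['total_words'])
    return counts

english = load_counts('english_totals.csv')          # hypothetical export
fiction = load_counts('english_fiction_totals.csv')  # hypothetical export

# Rough fiction share per year; only a proxy, since the exact relationship
# between the 'English' and 'English Fiction' corpora isn't documented.
for year in sorted(set(english) & set(fiction)):
    print(f"{year}\t{fiction[year] / english[year]:.3f}")
```

Plotted over time, a shifting ratio like this would itself be a confound for any 'cultural' trend read off the combined corpus.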
While the data sets are clearly labeled 'American English' and 'British English', the books in those collections are not actually classified as such. Rather, they are defined by their country of publication. With this in mind, how do we interpret the color v colour graph from my earlier post? As Natalie pointed out in an email, the trend in 'British English' of the difference between these terms could be explained either by an underlying cultural shift towards the American spelling, or by a change in the proportion of American books published in the UK without editorial 'translations'.
Searching a corpus for terms from another language still brings up hits (e.g. 'because' in the Spanish corpus, 'pourquoi' in the English corpora).
Regarding the English Fiction corpus, I was surprised to see mentions of figures and tables in works of fiction.
Drilling down on these in the interface surfaces what are clearly non-fiction publications (though it is not clear whether this search is filtered by the various corpora visualized in the ngram interface). It is also important, when looking at these anomalies, to bear in mind the volume of hits: here only a very small fraction of the overall corpus contains these apparent false positives.
Another subtle, but easily missed (I missed it!), aspect of the interface is that it is case-sensitive. This allows us to do interesting queries like 'however' versus 'However'.
How do we interpret this? The most obvious interpretation might be that 'however' at the beginning of a sentence is becoming more frequent. We could also conclude that 'however' in general is becoming more frequent (imagine if we could combine the lines). Alternatively, it could mean that sentence length in the corpus is shifting. Given that we don't know the exact cultural mix of the 'British English' corpus, it could be somehow related to the mixture of American and British content. Finally, it could be due to the mix of fiction and non-fiction. Interestingly, the 'American English' corpus has quite a different signal.
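One rough way to tease these interpretations apart is to work with the per-year numbers directly: summing the two case variants approximates the combined trend, and the share contributed by 'However' is a crude proxy for sentence-initial use. A minimal sketch, assuming the yearly frequencies for each query have been exported to CSV files (the file names and column names here are hypothetical):

```python
import csv

def load_series(path):
    """Return {year: frequency} from a CSV with columns year,frequency."""
    series = {}
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            series[int(row['year'])] = float(row['frequency'])
    return series

lower = load_series('however_lowercase.csv')    # hypothetical export of 'however'
upper = load_series('However_capitalised.csv')  # hypothetical export of 'However'

for year in sorted(set(lower) & set(upper)):
    combined = lower[year] + upper[year]                         # combined, case-insensitive trend
    initial_share = upper[year] / combined if combined else 0.0  # proxy for sentence-initial use
    print(f"{year}\t{combined:.6g}\t{initial_share:.3f}")
```

Of course, this only separates the first two interpretations; the corpus-mixture explanations would still need the kind of metadata discussed above.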
When investigating temporal data, it is always interesting to try to discover things that don't change over time. What words would we expect to be relatively stable? From a simple initial probing, it seems like numbers and days of the week are reasonably stable. In looking at this, I did find that certain colours come and go in a very correlated pattern.
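To make the 'correlated pattern' observation more than eyeballing, one could compute pairwise correlations between the yearly frequency series of a few colour terms. A minimal sketch with numpy; the frequency values below are placeholders standing in for real exported series:

```python
import numpy as np

# {word: {year: frequency}} - placeholder values standing in for real exports.
series = {
    'red':    {1900: 1.2e-4, 1950: 1.4e-4, 2000: 1.1e-4},
    'green':  {1900: 0.8e-4, 1950: 1.0e-4, 2000: 0.9e-4},
    'purple': {1900: 0.2e-4, 1950: 0.3e-4, 2000: 0.4e-4},
}

# Restrict to years present in every series so the vectors line up.
years = sorted(set.intersection(*(set(s) for s in series.values())))
matrix = np.array([[series[w][y] for y in years] for w in series])

corr = np.corrcoef(matrix)  # rows are words, so this is a word-by-word correlation matrix
words = list(series)
for i in range(len(words)):
    for j in range(i + 1, len(words)):
        print(f"{words[i]} vs {words[j]}: r = {corr[i, j]:.2f}")
```

High pairwise correlations would suggest the colour terms are driven by a common factor (such as the changing mixture of the corpus) rather than by independent shifts in usage.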
Overall, I find this to be a hugely exciting project. I'm disappointed in the general lack of analysis given to the data set before jumping to conclusions, but perhaps this is more a reflection of the blogosphere and the quality of writing there. I'd love to see a more in-depth analysis of the corpora provided by the team that wrote the Science article.
Update: Read more at The Binder Blog (1, 2), and at the Language Log.
The data is not super accurate. Searches for Internet, for example, turn up a couple of books dated 100 years ago, etc. As an aggregate, though, I think the results look good. Agree it would be helpful to learn more about how data was selected etc.
Posted by: Marshallk | December 20, 2010 at 01:46 PM