An interesting evaluation of any text is the probability of encountering a new word as one progresses through the writing. In the chart below, I compare Romeo and Juliet, Pride and Prejudice, War and Peace and A Child's History of England. It is interesting to note that War and Peace has a very similar trajectory to Romeo and Juliet, that Austen is clearly below this curve and that a book aimed at children is the one in which one is most likely to encounter novel terms. This latter insight might be attributed to the fact that historical documentation likely includes a continuous stream of new characters and locations, whereas fiction tends to focus on a limited number of both.
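In the chart below, the vertical axis indicates the size of the vocabulary (the number of distinct words seen so far) and the horizontal axis represents progress through the book (i.e., words read). The steeper a curve, the more often a new word is being encountered at that point in the text.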
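For anyone who wants to try this on their own texts, here is a minimal sketch of how such a curve could be computed. It is not necessarily how the chart above was produced: the function name `vocabulary_growth` and the tokenization rule (lowercased runs of letters and apostrophes count as words) are my own assumptions for illustration, and a different definition of "word" would shift the curves somewhat.

```python
import re


def vocabulary_growth(text):
    """Return a list whose i-th entry is the number of distinct words
    seen after reading the first i + 1 words of `text`."""
    # Assumption: a "word" is a lowercased run of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    seen = set()
    growth = []
    for word in words:
        seen.add(word)            # no effect if the word was already seen
        growth.append(len(seen))  # vocabulary size at this point in the text
    return growth


sample = "the cat sat on the mat and the dog sat on the cat"
curve = vocabulary_growth(sample)
print(curve)  # [1, 2, 3, 4, 4, 5, 6, 6, 7, 7, 7, 7, 7]
```

Plotting the returned list against its index gives the kind of chart described above: words read on the horizontal axis, vocabulary size on the vertical axis.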
An interesting observation - but could you provide an explanation for the graphic?
I don't quite get what the numbers along the vertical axis are meant to represent (number of novel terms, perhaps?), and the horizontal axis isn't labeled at all ... so I'm at a bit of a loss.
Posted by: Jim Shamlin | September 08, 2011 at 06:57 AM
Jim - it is a mortal sin in the data viz world not to label axes! I've added an update to the post to describe what is going on in case it is not obvious.
Posted by: Matthew Hurst | September 08, 2011 at 09:13 AM
Thanks!
Posted by: Jim Shamlin | September 08, 2011 at 06:42 PM
This is very interesting. Perhaps it could be used as a feature to classify text. Scientific publications or technical reports that present concepts first and then expand on them would probably appear more curved (a steeper slope at the beginning, then a shallower rise).
Posted by: AA | September 13, 2011 at 08:13 AM