Over the past few years, there has been a real burst in research and application development in areas that mine the textual content of the blogosphere: appraisal analysis, keyword mining, entity extraction, etc. It is remarkable, then, that there is little or no published work on the size of the blogosphere's lexicon.
Understanding the number of adjectives being used by a community, or associated with a topic, for example, is fundamental to understanding how opinion is expressed. Adjectives are an important sub-area that any opinion mining technology needs to master (along with all the other forms of opinion expression).
Understanding the growth of nouns in a community, and the appearance of new ones, is an important signal when tracking conversations and social trends at all levels.
A simple experiment to explore this space is to scan a collection of documents and graph the appearance of hitherto unseen terms. The graph below shows this for a sample of blog data. The x-axis shows the number of documents inspected; the y-axis shows the cumulative number of distinct types seen for each part of speech (NN = nouns, JJ = adjectives, VB = verbs, RB = adverbs).
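For readers who want to try the same experiment, here is a minimal sketch in Python. It is not the code behind the graph; it assumes the documents have already been tokenised and tagged with Penn Treebank style tags, and that tags are grouped by prefix (so NNS counts as NN, VBZ as VB). The function name `type_growth`, the lower-casing of tokens, and the toy documents are all my own assumptions for illustration.

```python
def type_growth(tagged_docs, tag_prefixes=("NN", "JJ", "VB", "RB")):
    """Return, per tag prefix, the cumulative number of distinct types
    seen after each document (one list entry per document)."""
    seen = {prefix: set() for prefix in tag_prefixes}
    curve = {prefix: [] for prefix in tag_prefixes}
    for doc in tagged_docs:
        for token, tag in doc:
            for prefix in tag_prefixes:
                if tag.startswith(prefix):
                    # Assumption: case is folded, so "Good" and "good" are one type.
                    seen[prefix].add(token.lower())
                    break
        # Record the vocabulary size for each part of speech after this document.
        for prefix in tag_prefixes:
            curve[prefix].append(len(seen[prefix]))
    return curve


# Toy, hand-tagged input (hypothetical data, not the blog sample).
docs = [
    [("good", "JJ"), ("blog", "NN"), ("reads", "VBZ"), ("well", "RB")],
    [("good", "JJ"), ("post", "NN"), ("blog", "NN")],
    [("new", "JJ"), ("reader", "NN"), ("writes", "VBZ"), ("often", "RB")],
]
growth = type_growth(docs)
print(growth["NN"])  # cumulative noun types after each document
```

Plotting each list in `growth` against the document index gives one growth curve per part of speech.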
What is clear from this is the relative growth of lexical types: nouns grow faster than adjectives, while verbs and adverbs grow at a significantly slower rate. Drilling down into the data, I’ve noticed that, for adjectives, the top 1,000 types (that is to say, the 1,000 adjectives that occur most frequently) account for 82.85% of all observations; the top 100 for 50.15% and the top 1 (good) for 2.65%.
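Coverage figures of this kind are simple to compute: count every adjective occurrence, take the k most frequent types, and divide their combined count by the total. The sketch below shows the arithmetic on a tiny made-up list; the function name `top_k_coverage` and the sample data are hypothetical and are not the data behind the percentages above.

```python
from collections import Counter


def top_k_coverage(observations, k):
    """Fraction of all observations accounted for by the k most frequent types."""
    counts = Counter(observations)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(k))
    return top / total


# Hypothetical adjective observations: 10 tokens, 6 distinct types.
adjectives = ["good", "good", "good", "new", "new",
              "great", "bad", "big", "old", "good"]

for k in (1, 2, 6):
    print(k, top_k_coverage(adjectives, k))
```

Run over the real adjective observations with k = 1, 100 and 1,000, the same function would give the three shares quoted above.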