Over the past few years, there has been a real burst in research and application development in areas that mine the textual content of the blogosphere: appraisal analysis, keyword mining, entity extraction, etc. It is remarkable, then, that there is little or no published work on the size of the lexicon.
Understanding the number of adjectives being used by a community, or associated with a topic, for example, is fundamental to understanding how opinion is expressed. Adjectives are an important sub-area that any opinion mining technology needs to master (along with all the other forms of opinion expression).
Understanding the growth of nouns in a community, and the appearance of new ones, is an important signal when tracking conversations and social trends at all levels.
A simple experiment to explore this space is to scan a collection of documents and graph the appearance of hitherto unseen terms. The graph below shows this for a sample of blog data. The x-axis shows the number of documents inspected, the y-axis shows the number of types of a certain part of speech (NN = nouns, JJ = adjectives, VB = verbs, RB = adverbs).
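As a rough illustration of this kind of scan, here is a minimal sketch, not the code used to produce the graph. It assumes each document has already been tokenized and POS-tagged with Penn Treebank style tags, and that tags are grouped by prefix (so NNS counts as NN). The function name `vocabulary_growth` and the lowercasing of tokens are illustrative choices, not details taken from the experiment.

```python
def vocabulary_growth(tagged_docs, prefixes=("NN", "JJ", "VB", "RB")):
    """Track the number of distinct types per part of speech as documents are scanned.

    tagged_docs: iterable of documents, each a list of (token, tag) pairs.
    Returns a dict mapping each tag prefix to a list whose i-th entry is the
    number of distinct types seen after inspecting i + 1 documents.
    """
    seen = {p: set() for p in prefixes}      # types observed so far, per POS group
    curves = {p: [] for p in prefixes}       # one point per document, per POS group

    for doc in tagged_docs:
        for token, tag in doc:
            for p in prefixes:
                if tag.startswith(p):        # assumption: NNS, NNP etc. count as NN
                    seen[p].add(token.lower())
                    break
        # Record the running total after this document.
        for p in prefixes:
            curves[p].append(len(seen[p]))
    return curves


if __name__ == "__main__":
    docs = [
        [("good", "JJ"), ("dog", "NN")],
        [("Good", "JJ"), ("cats", "NNS"), ("run", "VB")],
    ]
    for pos, curve in vocabulary_growth(docs).items():
        print(pos, curve)
```

Each list in the result can be plotted against the document index to give one line per part of speech.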
What is clear from the graph is the relative growth of lexical types: nouns grow faster than adjectives, while verbs and adverbs grow at a significantly slower rate. Drilling down into the data, I’ve noticed that, for adjectives, the top 1,000 types (that is to say, the 1,000 adjectives that occur most frequently) account for 82.85% of all observations, the top 100 for 50.15%, and the top 1 ("good") for 2.65%.
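Coverage figures of this kind can be computed with a few lines of code. The sketch below is an illustration under stated assumptions: the input is a flat list of adjective tokens that have already been extracted and normalized, and `top_k_coverage` is a hypothetical helper name.

```python
from collections import Counter


def top_k_coverage(tokens, ks=(1, 100, 1000)):
    """Return the share of all observations covered by the k most frequent types.

    tokens: iterable of (already normalized) word tokens for one part of speech.
    Returns a dict mapping each k to a fraction between 0 and 1.
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no tokens supplied")

    # Frequencies sorted from most to least common.
    ranked = [count for _, count in counts.most_common()]
    return {k: sum(ranked[:k]) / total for k in ks}


if __name__ == "__main__":
    sample = ["good"] * 5 + ["bad"] * 3 + ["new"] * 2
    print(top_k_coverage(sample, ks=(1, 2)))
```

If k exceeds the number of distinct types, the coverage is simply 1.0.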
Matthew, do you have a simple list of the top 1000 adjectives in order of popularity? Or even a list of common adjectives and verbs? I've looked for this before without success. If so, I'd appreciate a pointer to it.
Posted by: Jeff Clark | February 05, 2009 at 01:20 PM
This graph doesn't show that nouns grow faster than adjectives, only that there are more of them. If, for example, there were twice as many nouns as adjectives, then even if they were both growing at exactly the same rate the noun line would always be higher.
Looking at the shape of the data, I'd plot it on a log-log scale. This will better show how the numbers grow, independent of their absolute numbers.
Posted by: Aler | February 05, 2009 at 04:52 PM
There's a lot of good research you could do with this sort of analysis. Heaps' law describes the sublinear growth you're seeing in vocabulary size compared to word count, and Zipf's law is related to your statistics about usage of common adjectives. I recall reading that Zipf's law (the n'th most common word occurs with a frequency roughly proportional to 1/(n^a) for some a) applies only to all words taken together, and roughly to either verbs or nouns, I forget which, but doesn't apply within most parts of speech. Thus understanding growth rate and usage within parts of speech, and within different domains, could be revealing for lexical analysis.
Posted by: Paul | February 09, 2009 at 09:45 AM
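One way to follow up on the Heaps' law and log-log suggestions in the comments above is to fit V = K * N^beta by least squares on the logarithms of both quantities, where N is the number of tokens seen and V the number of distinct types. The sketch below is only an illustration: `fit_heaps` is a hypothetical helper, the sample points are synthetic, and the post's own graph counts documents rather than tokens on the x-axis.

```python
import math


def fit_heaps(points):
    """Fit V = K * N**beta to (N, V) pairs by least squares in log-log space.

    points: iterable of (tokens_seen, vocabulary_size) pairs with positive values.
    Returns (K, beta).
    """
    pairs = [(math.log(n), math.log(v)) for n, v in points if n > 0 and v > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two points with positive values")

    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all points share the same N; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)

    beta = sxy / sxx                          # slope of the log-log line
    k = math.exp(mean_y - beta * mean_x)      # intercept mapped back from log space
    return k, beta


if __name__ == "__main__":
    # Synthetic data generated from V = 10 * N**0.5, for illustration only.
    synthetic = [(n, 10 * n ** 0.5) for n in (100, 400, 1600, 6400)]
    print(fit_heaps(synthetic))
```

Fitting each part of speech separately and comparing the beta values would separate growth rate from absolute size, which is the distinction raised in the earlier comment about the noun and adjective lines.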
How much do typos (a new so***o with 17 o's, say) and errors made by the POS tagging system contribute to these numbers? The tagger errors include a lot of things that can't really be considered lexical units, because such systems are missing major information. Usually you will see more of these when looking at new lexical units. Anyway, the real numbers may be lower, but they are still interesting. The main measure missing here, I think, is what happens next with those new lexical units. My guess is that at least 90% of them will reoccur very, very rarely. Assuming I'm right again :-), we still have a remaining number that is important to deal with in opinion mining, but doing well with those will not bring you that much in terms of quality. Once you have a solid base, there are more difficult challenges that also cause many more errors, namely already known lexical units combined in different ways.
Posted by: alltoute | February 10, 2009 at 12:38 AM
Did you do this with a particular POS tagger? Does it have trouble with words not in its lexicon?
Posted by: Brendan O'Connor | February 19, 2009 at 05:21 AM
Are these findings applicable to languages other than English as well? I have been trying to analyze the stopwords and word counts on blogs in different languages for some time, and it seems that while there is little difference between English and German, the difference from Russian, for example, is far bigger. Do you have experience with language-specific analysis?
Posted by: Rafael | March 09, 2009 at 06:12 PM