February 05, 2009


Jeff Clark

Matthew, do you have a simple list of the top 1000 adjectives in order of popularity? Or even a list of common adjectives and verbs? I've looked for this before without success. If so, I'd appreciate a pointer to it.


This graph doesn't show that nouns grow faster than adjectives, only that there are more of them. If, for example, there were twice as many nouns as adjectives, then even if they were both growing at exactly the same rate the noun line would always be higher.

Looking at the shape of the data, I'd plot it on a log-log scale. This will better show how the numbers grow, independent of their absolute numbers.
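To make the log-log point concrete, here is a minimal sketch. The numbers are made up for illustration (a power-law exponent of 0.6 and twice as many nouns as adjectives are assumptions, not figures from the post), and `loglog_slope` is a hypothetical helper name. It fits a straight line to log(count) against log(tokens); the slope is the growth exponent, and it comes out the same for both series even though the noun counts are always twice as large.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var = sum((a - mean_x) ** 2 for a in lx)
    return cov / var

# Hypothetical data: both series grow as tokens ** 0.6,
# but there are always twice as many nouns as adjectives.
tokens = [10_000, 100_000, 1_000_000, 10_000_000]
adjectives = [5.0 * t ** 0.6 for t in tokens]
nouns = [2.0 * a for a in adjectives]

noun_slope = loglog_slope(tokens, nouns)
adjective_slope = loglog_slope(tokens, adjectives)
print(noun_slope, adjective_slope)
```

On a log-log plot the two lines would be parallel, offset vertically by log(2); only a difference in slope would indicate a difference in growth rate.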


There's a lot of good research you could do with this sort of analysis. Heaps' law describes the sublinear growth you're seeing in vocabulary size compared to word count, and Zipf's law is related to your statistics about usage of common adjectives. I recall reading that Zipf's law (the n-th most common word occurs with a frequency roughly proportional to 1/(n^a) for some a) applies only to all words taken together, and roughly to either verbs or nouns, I forget which, but doesn't apply within most parts of speech. Thus understanding growth rate and usage by part of speech, and within different domains, could be revealing for lexical analysis.
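For anyone who wants to check both laws on their own data, a rough sketch of the two measurements follows. The function names and the toy sentence are illustrative assumptions; a real test would use a large corpus and, for the per-part-of-speech question, tokens filtered by tag. `vocabulary_growth` gives the curve Heaps' law is about (distinct words seen after each token), and `rank_frequencies` gives the counts in rank order that Zipf's law is about.

```python
from collections import Counter

def vocabulary_growth(tokens):
    """Number of distinct tokens seen after each position (Heaps-style curve)."""
    seen = set()
    growth = []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth

def rank_frequencies(tokens):
    """Token counts sorted from most to least common (Zipf-style curve)."""
    return [count for _, count in Counter(tokens).most_common()]

# Toy input, far too small to show either law; it only shows the shape of the output.
sample = "the cat saw the dog and the dog saw the bird".split()
growth = vocabulary_growth(sample)
freqs = rank_frequencies(sample)
print(growth)
print(freqs)
```

Plotting either list against its index on log-log axes and checking for a roughly straight line is the usual informal test; running it separately on the adjectives, nouns, and verbs would show whether the law holds within a part of speech.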


How much of these numbers comes from typos (a new so***o with 17 o's) and from errors made by the POS tagging system (including a lot of things that can't really be considered lexical units, because such systems are missing major information)? Usually you will see more of these when looking at new lexical units. Anyway, the real numbers may be lower, but they are still interesting. The main measure missing here, I think, is what happens next to those new lexical units. My guess is that at least 90% of them will recur very, very rarely. If I'm right again :-), we still have a remaining number that is important to deal with in opinion mining, but handling those well will not gain you that much in terms of quality. Once you have a solid base, there are more difficult challenges that also cause many more errors: already-known lexical units combined in different ways.

Brendan O'Connor

Did you do this with a particular POS tagger? Does it have trouble with words not in its lexicon?


Are these findings applicable to languages other than English as well? I have been trying to analyze stopwords and word counts on blogs in different languages for some time, and it seems that while there is little difference between English and German, the difference from Russian, for example, is far bigger. Do you have experience with language-specific analysis?

The comments to this entry are closed.
