
February 14, 2007



Actually, my test was more rudimentary than that. I was just trying to see whether these would pick up Ian Anderson's name. My tests with both ClearForest and Alias-I used the partial sentence, "Ian Anderson was born in". Neither picked up the name, which is all I was expecting to see. As for the longer sentence, ClearForest picked up "Fife" and "Scotland" but missed "Ian Anderson". I didn't try the full sentence with Alias-I, so I'm glad that worked better.


If Dunfermline is a city, it would be appropriate in many contexts to consider it an organization. For that matter, the same is true of Fife (which I take to be a state, going by the example--correct me if I'm wrong) and Scotland itself. I'm curious how these systems deal with ambiguities like this, since Dunfermline the organization is essentially the same entity as Dunfermline the location.

Peter Jones

A quick look on Google does suggest that the term "Dunfermline", out of context, is associated more with the football club, an organization, than with the city.

Bob Carpenter

The first four results for "Ian Anderson was born in Dunfermline, Fife, Scotland." from the LingPipe NE English News web demo are (use the resultType,nBest mode):

Ian, Dunfermline, Fife, Scotland
2. null, ORG, LOC, LOC

The "right" answer according to the MUC 6 coding standard was the 4th-best output. The confusability of organizations and locations is what led to the "geopolitical-entity" (GPE) type in later MUCs.

The reason no entities are found by our demo for "Ian Anderson was born in" is that it's not considered a sentence. The sentence detector may be configured to treat fragments as sentences, and I probably should've set it that way in the demo. With a period at the end, it finds the person mention.

If you run in resultType,conf mode (and then look at the source, because I forgot a close tag in the output of the servlet), here are the top ranked entities for the input:

1. Scotland: LOC [.25, 1.0]
2. Dunfermline : ORG [.25, .5]
3. Fife : LOC [.5, .66]
4. Ian : PER [.75, .75]
5. Fife : ORG [.75, .60]
6. Dunfermline : LOC [1.0, .66]
7. Ian : LOC [1.0, ...
8. Ian : ORG
9. Dunfermline, Fife : ORG
10. Fife : PER

The bracketed figures are recall/precision numbers resulting from returning the first n-best outputs. By rank 6, all 4 entities have been found, but there are also 2 erroneous entities found, for a 100% recall at 66% precision.
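
The arithmetic behind the bracketed figures can be reproduced with a short sketch. It assumes the reference answers are the four (mention, type) pairs implied by the figures above (Scotland, Dunfermline and Fife as LOC, Ian as PER, using the mention strings as the demo displays them). The function and variable names are hypothetical and are not part of LingPipe.

```python
# Ranked (mention, type) guesses copied from ranks 1-6 of the list above.
ranked = [
    ("Scotland", "LOC"),
    ("Dunfermline", "ORG"),
    ("Fife", "LOC"),
    ("Ian", "PER"),
    ("Fife", "ORG"),
    ("Dunfermline", "LOC"),
]

# Assumed reference answers, inferred from the bracketed figures.
gold = {
    ("Scotland", "LOC"),
    ("Dunfermline", "LOC"),
    ("Fife", "LOC"),
    ("Ian", "PER"),
}


def recall_precision_at_ranks(ranked, gold):
    """Return (rank, recall, precision) after each of the first n guesses."""
    rows = []
    correct = 0
    for n, guess in enumerate(ranked, start=1):
        if guess in gold:
            correct += 1
        # recall: share of reference entities found so far
        # precision: share of returned guesses that are correct
        rows.append((n, correct / len(gold), correct / n))
    return rows


rows = recall_precision_at_ranks(ranked, gold)
for n, recall, precision in rows:
    print(f"{n}. [{recall:.2f}, {precision:.2f}]")
```

At rank 6 this gives recall 4/4 and precision 4/6. The printout rounds 2/3 to 0.67, where the list above truncates it to .66.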

Dan Kauwell

By the way, the URL for Dan Roth's Cognitive Computation Group is


Very interesting blog you have here. One reason certain companies working in text analytics and text data mining get such poor results is that they stick to one or two approaches at most, either pattern-based or statistical, and think that theirs is the only way to get at the real meaning (and thus the real entities).
They also tend to linger too long on developing solutions for a very limited and unrealistic range of texts.
One needs to work on the text from different perspectives. It is not an easy task, but it is a very exciting field.

