Named entity recognition – the discovery in text of strings that refer to classes of things like places, people, companies, etc. – has become a standard tool in the text mining world. For many classes (e.g. people) recognition is pretty good. But what are these entities doing? Figuring out the verb associated with a named entity, and the type of association (or the role) is trickier than you might think.
I’ve been playing a little with Thomson Reuters’ OpenCalais – their publicly available service built on top of the ClearForest technology acquired in 2007. To illustrate the challenges that the verb poses, take this sentence:
The Director of National Intelligence, Mike McConnell, has meanwhile implied he suspects the Pakistan-based group Lashkar-e-Toiba was responsible.
The verbs here are imply (in "has implied") and suspect. OpenCalais analyses this as follows:
First, it works with the named entities, discovering
The Director of National Intelligence, Mike McConnell, has meanwhile implied he suspects the Pakistan-based group Lashkar-e-Toiba was responsible
Not too bad. Note that they are smart enough to pick up the use of ‘he’. In addition, interaction with the tool reveals that they resolve the pronoun to the preceding entity. Now, with the propositional analysis, things start getting tricky. Rather than recognizing that McConnell implied that he suspected [the …], the system looks for the verb (imply) and then takes the preceding and succeeding noun phrases as the grammatical subject and object, resulting in The Director of National Intelligence implying Mike McConnell.
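To make the failure mode concrete, here is a toy sketch of that linear heuristic in Python. The chunking and tags are supplied by hand, and this is an illustration of nearest-noun-phrase pairing in general, not OpenCalais's actual algorithm; on this sentence a flat scan around "implied" pairs the appositive with the pronoun, a different but equally wrong triple:

```python
# A toy illustration (not OpenCalais internals) of the linear heuristic
# described above: find a verb, then take the nearest noun phrase on
# each side as subject and object.

# Hand-chunked version of the example sentence; the NP/VP tags are assumed.
chunks = [
    ("The Director of National Intelligence", "NP"),
    ("Mike McConnell", "NP"),
    ("has meanwhile implied", "VP"),
    ("he", "NP"),
    ("suspects", "VP"),
    ("the Pakistan-based group Lashkar-e-Toiba", "NP"),
    ("was responsible", "VP"),
]

def naive_svo(chunks, verb_index):
    """Pair a verb with the noun phrases immediately before and after it."""
    subj = next(text for text, tag in reversed(chunks[:verb_index]) if tag == "NP")
    obj = next(text for text, tag in chunks[verb_index + 1:] if tag == "NP")
    return subj, chunks[verb_index][0], obj

# The verb phrase "has meanwhile implied" is at index 2.
print(naive_svo(chunks, 2))
```

No flat left-to-right pairing can get this sentence right, because the true object of "implied" is the entire subordinate clause ("he suspects … was responsible"), not any single noun phrase.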
This example illustrates some of the challenges in the space. It is costly and difficult to do a full parse (which in this case would have discovered the subordinate clause), and discourse analysis (figuring out what pronouns refer to) is complex. Finally, the lexical resources required to guide the association of noun phrases with verbs are a large investment.
While, to a large extent, one can get away with lightweight solutions to named entities that treat the text as linear, getting the verbs right requires a richer understanding of the structure. You can no longer treat the text sequentially (or, to be fair, you have to at least approximate the grammatical, tree-like structure of the language).
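As a sketch of what approximating the tree-like structure buys you, here is a hand-built nested-clause representation of the example sentence. The clause structure and the resolved pronoun are supplied by hand; producing them automatically is exactly the expensive part discussed above:

```python
# A sketch of the tree-shaped analysis the paragraph calls for.
# The nesting is hand-built; a real system would need a parser
# (and pronoun resolution) to produce it.

clause = {
    "subject": "Mike McConnell",       # appositive with the Director title
    "verb": "implied",
    "object": {                        # the object is itself a clause
        "subject": "Mike McConnell",   # "he", resolved to the entity
        "verb": "suspects",
        "object": "Lashkar-e-Toiba",
    },
}

def triples(clause):
    """Flatten a nested clause into (subject, verb, object) triples."""
    obj = clause["object"]
    if isinstance(obj, dict):
        yield (clause["subject"], clause["verb"], "<clause>")
        yield from triples(obj)
    else:
        yield (clause["subject"], clause["verb"], obj)

for t in triples(clause):
    print(t)
```

Recursing into the clausal object yields (Mike McConnell, suspects, Lashkar-e-Toiba), the triple that no flat scan over the surface string can reach.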
To some degree, certain text mining applications can help here, if designed appropriately. While the above example surfaces a clear error in analysis, if the application relies on aggregates of this type of analysis, and if in aggregate you tend to get your facts right, then the user can still win. However, if at any stage one has to reveal a specific document, the user is likely to be exposed to errors.
that project looks pretty interesting! thanks for sharing
Posted by: Christoph | December 04, 2008 at 07:10 AM
This is definitely a hard problem, but making "he" the object of a verb is a little pathetic. Pronouns are the one part of English where there's some semblance of case marking; take advantage of it!
Posted by: PetWolverine | December 04, 2008 at 08:30 AM
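PetWolverine's point lends itself to a cheap filter. A sketch of the idea (the pronoun lists are standard English case forms; the function and its name are mine, for illustration only):

```python
# English pronouns carry case, so a nominative form like "he" can be
# ruled out as a grammatical object without any parsing at all.

NOMINATIVE = {"i", "he", "she", "we", "they", "who"}
ACCUSATIVE = {"me", "him", "her", "us", "them", "whom"}

def can_be_object(word):
    """Reject nominative-only pronouns as candidate objects."""
    if word.lower() in NOMINATIVE:
        return False   # "implied ... he" cannot be a verb-object pair
    return True        # accusative pronouns, nouns, "you"/"it" all pass

print(can_be_object("he"))    # → False
print(can_be_object("him"))   # → True
```

Forms like "you" and "it" are ambiguous between cases, so the check is asymmetric: it can veto a candidate object but never confirm one.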
Calais looks incredibly promising, and it's encouraging that they have Reuters ownership. It would be incredible if all Reuters data was semantically marked up (instead of just hackish IPTC notation).
I just used this simplified interface for playing around with calais: http://sws.clearforest.com/calaisViewer/
Posted by: Chris Blow | December 05, 2008 at 05:32 AM
There's a cluster of academics (Suzanne Stevenson, Paula Merlo, Sabine Schulte im Walde and me come to mind) who've been working on verbs specifically, and making a bit of progress, we think. Jianguo Li (http://www.ling.ohio-state.edu/~jianguo/) and Kirk Baker (http://www.ling.ohio-state.edu/~kbaker/) just finished Ph.D. theses in which they explored the consequences of various ways of aggregating verbs over instances.
Briefly, by using verbs and their classes, you can get useful deltas on various performance measures for parsing, prepositional-phrase attachment and so on, but we totally agree that any system based on this stuff is going to produce ugly-looking errors along with the good stuff.
I believe (this is my well-informed guess, not Jianguo and Kirk's carefully supported scientific claim) that ANY attempt to make all Reuters' data semantically marked up will have ugly-looking errors. This would be true even if humans were doing all the work, because there is no known effective mechanism for ensuring the necessary consistency across the required numbers of annotators, and because semantic annotation is extremely hard. The carefully edited examples in papers look OK, but on the large scale there are lots of difficult judgements, and some are bound to fail. All the examples in dictionaries and WordNet have been looked at multiple times by multiple skilled people, and still errors get through.
Posted by: Chris Brew | December 06, 2008 at 09:41 AM
Thanks Matthew for evaluating OpenCalais.
OpenCalais has two general modes for extracting relations (finding the verb that connects multiple "things").
One mode is for predefined relation types - we have a large set of relations that the service specifically looks for (like M&A, IPO, Movie Release, etc.)
The other mode is what you've been experimenting with; we call it Generic Relations, and sometimes it's called exhaustive extraction. The idea is to detect all Subj-Verb-Obj relations that can (potentially) highlight interesting information. This is the mode where we see the biggest challenges (because the domain is not bounded, because different verbs take different complements, and so on).
We're already working on incorporating verb classes into this capability, to help us avoid errors of the kind you pointed out. It will be released in early 2009, so come visit us again :)
Posted by: Michal | December 07, 2008 at 04:30 AM
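The two modes Michal describes can be caricatured as a schema lookup with a generic fallback. The verb-to-relation mapping below is invented for illustration and is not OpenCalais internals:

```python
# Hypothetical contrast between the two extraction modes described
# above: predefined relation types match known verbs against a fixed
# schema; everything else falls through to Generic Relations.

PREDEFINED = {
    "acquire": "M&A",            # invented verb-to-type mapping
    "release": "MovieRelease",
}

def classify_relation(verb):
    """Return a predefined relation type, or fall back to GenericRelation."""
    return PREDEFINED.get(verb, "GenericRelation")

print(classify_relation("acquire"))  # → M&A
print(classify_relation("imply"))    # → GenericRelation
```

The bounded mode can lean on a known schema for each relation type; the fallback has to cope with any verb and any complement pattern, which is exactly where the post's example went wrong.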