Named entity recognition – the discovery in text of strings that refer to classes of things like places, people, companies, etc. – has become a standard tool in the text mining world. For many classes (e.g. people) recognition is pretty good. But what are these entities doing? Figuring out the verb associated with a named entity, and the type of association (or the role) is trickier than you might think.
I’ve been playing a little with Thomson Reuters’ OpenCalais – their publicly available service built on top of the ClearForest technology acquired in 2007. To illustrate the challenges that the verb poses, take this sentence:
The Director of National Intelligence, Mike McConnell, has meanwhile implied he suspects the Pakistan-based group Lashkar-e-Toiba was responsible.
The verbs here include has, imply and suspect. OpenCalais analyses this as follows:
First, it works with the named entities, discovering:
The Director of National Intelligence, Mike McConnell, has meanwhile implied he suspects the Pakistan-based group Lashkar-e-Toiba was responsible
Not too bad. Note that the system is smart enough to pick up the use of ‘he’, and interaction with the tool reveals that it resolves the pronoun to the preceding entity. With the propositional analysis, though, things start getting tricky. Rather than recognizing that McConnell implied that he suspected [the …], the system looks for the verb (imply) and then takes the preceding and succeeding noun phrases as the grammatical subject and object. The result is a proposition in which The Director of National Intelligence implies Mike McConnell.
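To make the failure concrete, here is a minimal Python sketch of a purely linear heuristic that produces the same kind of error. It is not OpenCalais's actual implementation, which I have not seen. The hand-made chunk list, the `COREF` table and the rule "leftmost noun phrase before the verb is the subject, first noun phrase after it is the object" are all assumptions chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

Chunk = Tuple[str, str]  # (text, kind): "NP" noun phrase, "V" verb, "O" other

# Hand-chunked version of the example sentence (an assumption, not tool output).
SENTENCE: List[Chunk] = [
    ("The Director of National Intelligence", "NP"),
    ("Mike McConnell", "NP"),
    ("has", "O"),
    ("meanwhile", "O"),
    ("implied", "V"),
    ("he", "NP"),
    ("suspects", "V"),
    ("the Pakistan-based group Lashkar-e-Toiba", "NP"),
    ("was responsible", "O"),
]

# Hypothetical pronoun resolution: 'he' points back to the preceding person.
COREF: Dict[str, str] = {"he": "Mike McConnell"}


def linear_triple(chunks: List[Chunk], verb: str) -> Optional[Tuple[str, str, str]]:
    """Pick subject and object by position only, ignoring clause structure."""
    idx = next(
        (i for i, (text, kind) in enumerate(chunks) if kind == "V" and text == verb),
        None,
    )
    if idx is None:
        return None
    before = [text for text, kind in chunks[:idx] if kind == "NP"]
    after = [text for text, kind in chunks[idx + 1:] if kind == "NP"]
    if not before or not after:
        return None
    # Leftmost NP before the verb as subject, first NP after it as object.
    subject = COREF.get(before[0], before[0])
    obj = COREF.get(after[0], after[0])
    return (subject, verb, obj)


print(linear_triple(SENTENCE, "implied"))
# ('The Director of National Intelligence', 'implied', 'Mike McConnell')
```

The pronoun is resolved correctly, yet the proposition is still wrong, because nothing in the linear view says that 'he' opens a new clause rather than serving as the object of 'implied'.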
This example illustrates some of the challenges in the space. It is costly and difficult to do a full parse (which in this case would have discovered the subordinate clause), and discourse analysis (figuring out what pronouns refer to) is complex. Finally, the lexical resources required to guide the association of noun phrases with verbs are a large investment.
To a large extent, one can get away with lightweight solutions to named entities that treat the text as linear. Getting the verbs right, however, requires a richer understanding of the structure. You can no longer treat the text sequentially (or, to be fair, you have to at least approximate the grammatical, tree-like structure of the language).
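As a contrast with the linear view, the following Python sketch reads propositions off a tree-like structure. The `Clause` type, the hand-built `TREE` and the `propositions` helper are hypothetical names made up for this illustration. A real system would have to obtain the structure from a parser, which is exactly the expensive part.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Clause:
    subject: str
    verb: str
    # The complement is either a plain phrase or an embedded clause.
    complement: Union[str, "Clause", None] = None


# Hand-built structure for the example sentence (an assumption, not parser output).
TREE = Clause(
    subject="Mike McConnell",
    verb="imply",
    complement=Clause(
        subject="he",
        verb="suspect",
        complement=Clause(
            subject="Lashkar-e-Toiba", verb="be", complement="responsible"
        ),
    ),
)


def propositions(clause: Clause, coref: Dict[str, str]) -> List[Tuple]:
    """Return one proposition per clause, outermost first."""
    subject = coref.get(clause.subject, clause.subject)
    if isinstance(clause.complement, Clause):
        inner = propositions(clause.complement, coref)
        # The object of this verb is the whole embedded proposition.
        return [(subject, clause.verb, inner[0])] + inner
    return [(subject, clause.verb, clause.complement)]


for prop in propositions(TREE, {"he": "Mike McConnell"}):
    print(prop)
```

Here the object of imply is the embedded proposition (McConnell suspects that Lashkar-e-Toiba was responsible) rather than whichever noun phrase happens to come next in the word sequence.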
To some degree, appropriately designed text mining applications can help here. The above example surfaces a clear error in analysis, but if the application relies only on aggregates of this type of analysis, and if in aggregate you tend to get your facts right, then the user can still win. However, if at any stage the application has to reveal a specific document, the user is likely to be exposed to errors.
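One way to picture the aggregate approach is the rough Python sketch below. The `aggregate` function, the `min_support` threshold and the three toy "documents" are invented for illustration. They are not real extraction results.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def aggregate(extractions: Iterable[List[Triple]], min_support: int = 2) -> Dict[Triple, int]:
    """Keep only propositions extracted from at least min_support documents."""
    # set(doc) counts each proposition at most once per document.
    counts = Counter(triple for doc in extractions for triple in set(doc))
    return {triple: n for triple, n in counts.items() if n >= min_support}


# Invented toy input: one list of extracted propositions per document.
docs = [
    [("Mike McConnell", "suspect", "Lashkar-e-Toiba")],
    [("Mike McConnell", "suspect", "Lashkar-e-Toiba")],
    [("The Director of National Intelligence", "imply", "Mike McConnell")],
]

print(aggregate(docs))
# {('Mike McConnell', 'suspect', 'Lashkar-e-Toiba'): 2}
```

The one-off misparse falls below the threshold and disappears from the aggregate view, but it is still attached to its source document, so it resurfaces as soon as that document is shown to the user.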