
December 04, 2008



that project looks pretty interesting! thanks for sharing


This is definitely a hard problem, but making "he" the object of a verb is a little pathetic. Pronouns are the one part of English where there's some semblance of case marking; take advantage of it!
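As a rough illustration of the case-marking idea, here is a minimal sketch that drops subject-verb-object triples whose pronoun case contradicts its role. The function name, the pronoun sets, and the sample triples are all hypothetical; this is not part of Calais or any real extraction API, and the ambiguous forms "you" and "it" are deliberately left out.

```python
# Hypothetical post-filter for extracted (subject, verb, object) triples.
# Nothing here comes from a real extraction service; it only checks pronoun case.

NOMINATIVE = {"i", "he", "she", "we", "they"}    # forms that mark a subject
ACCUSATIVE = {"me", "him", "her", "us", "them"}  # forms that mark an object


def case_is_plausible(subject: str, obj: str) -> bool:
    """Return False when a bare pronoun's case contradicts its role."""
    if subject.lower() in ACCUSATIVE:
        return False  # e.g. "him" cannot be the subject
    if obj.lower() in NOMINATIVE:
        return False  # e.g. "he" cannot be the object
    return True


# Invented sample triples, for illustration only.
triples = [
    ("he", "said", "nothing"),
    ("Palin", "told", "he"),
    ("she", "thanked", "them"),
]

kept = [t for t in triples if case_is_plausible(t[0], t[2])]
```

A check like this only catches bare pronouns, so it would sit alongside whatever the extractor already does rather than replace it.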

Chris Blow

Calais looks incredibly promising, and it's encouraging that they have Reuters ownership. It would be incredible if all Reuters data was semantically marked up (instead of just hackish IPTC notation).

I just used this simplified interface for playing around with calais:

Chris Brew

There's a cluster of academics (Suzanne Stevenson, Paola Merlo, Sabine Schulte im Walde and me come to mind) who've been working on verbs specifically, and making a bit of progress, we think. Jianguo Li and Kirk Baker just finished Ph.D. theses, in which they explored the consequences of various ways of aggregating verbs over instances.

Briefly, by using verbs and their classes, you can get useful deltas on various performance measures for parsing, prepositional-phrase-attachment and so on, but we totally agree that any system based on this stuff is going to be producing ugly-looking errors along with the good stuff.

I believe (this is my well-informed guess, not Jianguo and Kirk's carefully supported scientific claim) that ANY attempt to make all Reuters' data semantically marked up will have ugly looking errors. This would be true even if there were humans doing all the work, because there is no known effective mechanism for ensuring the necessary consistency across the required numbers of annotators, and because semantic annotation is extremely hard. The carefully edited examples in papers look OK, but on the large scale there are lots of difficult judgements, and some are bound to fail. All the examples in dictionaries and WordNet have been looked at multiple times by multiple skilled people, and still errors get through.


Thanks Matthew for evaluating OpenCalais.
OpenCalais has two general modes for extracting relations (finding the verb that connects multiple "things").
One mode is for predefined relation types - we have a large set of relations that the service specifically looks for (like M&A, IPO, Movie Release etc.)
The other mode is what you've been experimenting with; we call it Generic Relations, and it is sometimes called exhaustive extraction. The idea is to detect all Subj-Verb-Obj relations that can (potentially) highlight interesting information. And this is the mode where we see the biggest challenges (because the domain is not bounded, because different verbs take different complements and so on).
We're already working on incorporating verb classes into this capability, to help us avoid errors of the kind you pointed out. It will be released in early 2009, so come visit us again :)
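
To make the verb-class idea concrete, here is a minimal sketch of how a class lexicon could veto implausible Subj-Verb-Obj relations. It is an assumption-laden toy, not OpenCalais's implementation: the class names, the verb entries, the object types, and the function `keep_relation` are all invented for illustration.

```python
# Toy verb-class filter. Every class, verb entry, and object type below is
# hypothetical; a real system would draw on a much larger lexicon.

VERB_CLASS = {
    "acquire": "transfer",
    "buy": "transfer",
    "say": "communication",
    "announce": "communication",
}

# Which object types each (invented) class accepts as a complement.
ALLOWED_OBJECT_TYPES = {
    "transfer": {"Company", "Product"},
    "communication": {"Quote", "Event"},
}


def keep_relation(verb: str, object_type: str) -> bool:
    """Keep a relation unless the verb's class rules out the object type."""
    verb_class = VERB_CLASS.get(verb)
    if verb_class is None:
        return True  # unknown verb: no evidence to reject the relation
    return object_type in ALLOWED_OBJECT_TYPES[verb_class]
```

Letting unknown verbs through keeps recall high in an unbounded domain, at the cost of passing along some of the errors the classes were meant to catch.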

