Bob Carpenter makes an important contribution to the debate around NLP and search (thanks to Mark Liberman for capturing this as it unfolds). Before addressing Bob's post, I'd like to put a stake in the ground. Some of the discussion around NLP concerns the performance of certain, one might say atomic, components of any NLP system. In terms of quality, I don't believe that we are really at the point where we need to wait to eke out another 0.01% improvement. Gains of that size have been the norm for a while in various tasks (including POS tagging and parsing). It is interesting to see a reflection of this point of view in the ACL's attitude towards paper reviewing. I say that now is the time to take what we have and figure out how to apply it. Sure, it would be nice to get a slightly better parser or a slightly better POS tagger, but these shouldn't be fundamental barriers to creating rich applications.
Ok - so Bob's post. Bob asks what a search engine that took advantage of NLP would actually look like. He uses the example of trying to find out whether Ian Anderson (famed flute player and salmon farmer) is Scottish or English. To illustrate some of the challenges here, Bob reports the results from Google using the wildcard '*' in Google's search syntax.
Google QUERY: ian anderson was born in *
Here’s the first few “answers”:
1. Paiseley, Scotland
3. Scotland in 1947
4. Fife in 1947
6. Croydon before England won the world cup
7. Williston, ND
8. Nottingham on June 29, 1948
Now which of those is the “right” answer? Well, first it depends on which Ian Anderson we’re talking about. There are 10 with their own Wikipedia pages.
The voting thing doesn’t help much here unless you happen to know that Dunfermline is in West Fife, which is in Scotland, which is in the United Kingdom. I’m confused about answer 1, because it’s “Paiseley”, which is spelled wrong. Wikipedia claims the answer is Dunfermline. Clearly the source is important in providing answers.
This is such a great example because of the ambiguity of the word 'in', which introduces both places ('born in Fife') and times ('born in 1947'). In addition, a system with a modicum of world knowledge would be off to a running start knowing that a person was being discussed (simple entity extraction would deal with this). Consequently, it would be able to report relationships between people and (birth) times and between people and (birth) places. I believe that for this example, we can easily imagine the presentation of these results involving an explicit indication of the potential for ambiguity.
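To make the 'in' ambiguity concrete, here is a minimal sketch that splits the fillers from Bob's query into a place part and a date part. The function name `classify_filler` and the date pattern are my own illustrative assumptions, not anything Bob or Google uses; a real system would rely on proper entity extraction rather than a regular expression.

```python
import re

# Assumption: a date is a four-digit year (1800-2099), optionally preceded
# by a month name and a day, as in "June 29, 1948".
DATE = re.compile(
    r"(?:(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+\d{1,2},\s+)?\b(?:1[89]|20)\d{2}\b"
)


def classify_filler(filler):
    """Split a 'born in *' filler into a (place, date) pair; either may be None."""
    match = DATE.search(filler)
    date = match.group(0) if match else None
    place = filler[:match.start()] if match else filler
    # Drop the preposition left dangling once the date has been removed.
    place = re.sub(r"\s+(?:in|on)\s*$", "", place).strip(" ,")
    return (place or None, date)


if __name__ == "__main__":
    for answer in ["Paiseley, Scotland", "Scotland in 1947", "Fife in 1947",
                   "Williston, ND", "Nottingham on June 29, 1948"]:
        print(answer, "->", classify_filler(answer))
```

Note that this sketch does nothing useful with 'Croydon before England won the world cup': the temporal expression there is not a date, which is exactly the kind of case where a little more world knowledge is needed.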
However, this example is better discussed in the light of the information in the entire corpus. For example, Google reports 372k pages for 'ian anderson fife'. Any discussion of being able to parse correctly always has to consider this level of redundant information - the same facts expressed over and over in many different ways.
- Notable Fifers: ... Ian Anderson (list format)
- Born: Aug 10, 1947 in Dunfermline, Fife, Scotland (semi-structured)
- Ian Scott Anderson, born August 10, 1947 in Dunfermline, Fife, Scotland, is a Scottish singer, songwriter, guitarist and flautist, and is best known as the head of the rock band Jethro Tull.
- Ian was born in 1947 in Dunfermline, Fife, Scotland.
- Birthplace: Dunfermline, Fife, Scotland (semi-structured)
- ...and so on, 372,000 times.
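The redundancy argument can be sketched in a few lines: pull a birthplace out of each expression with a couple of shallow patterns and count the votes. The snippets are the ones listed above; the patterns, the `PLACE` definition and the function names are hypothetical simplifications of mine, and they would need to be far more robust on real web text.

```python
import re
from collections import Counter

SNIPPETS = [
    "Notable Fifers: ... Ian Anderson",
    "Born: Aug 10, 1947 in Dunfermline, Fife, Scotland",
    "Ian Scott Anderson, born August 10, 1947 in Dunfermline, Fife, "
    "Scotland, is a Scottish singer, songwriter, guitarist and flautist",
    "Ian was born in 1947 in Dunfermline, Fife, Scotland.",
    "Birthplace: Dunfermline, Fife, Scotland",
]

# Assumption: a place is one to three comma-separated capitalized names.
PLACE = r"([A-Z][a-z]+(?:, [A-Z][a-z]+){0,2})"
PATTERNS = [
    # Sentential and semi-structured "born ... in <place>".
    re.compile(r"[Bb]orn\b.*?\bin " + PLACE),
    # Semi-structured "Birthplace: <place>".
    re.compile(r"Birthplace: " + PLACE),
]


def extract_birthplace(snippet):
    """Return the first birthplace found in the snippet, or None."""
    for pattern in PATTERNS:
        match = pattern.search(snippet)
        if match:
            return match.group(1)
    return None


def vote(snippets):
    """Count how often each extracted birthplace occurs across snippets."""
    places = (extract_birthplace(s) for s in snippets)
    return Counter(p for p in places if p is not None)


if __name__ == "__main__":
    print(vote(SNIPPETS).most_common())
```

The list-format snippet yields nothing here, which is fine: with this much redundancy, no single expression has to be understood for the right answer to win the vote.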
Note here, BTW, that a number of these expressions are in a semi-structured form. While parsing sentences is a key area of NLP research, there are many ways in which the relationships between things and ideas may be expressed, and tabular and list-like formats are potentially more accessible than sentential forms (it is this observation that drives much of the technology behind Google's simple onebox answers - see this post for a description, or amuse yourself by asking Google 'what is the density of France?').
While I haven't really addressed Bob's question, I am proposing that a more sophisticated search engine would be explicit about ambiguity (rather than leaving the user and the documents to figure it out for themselves) and would draw on information from many sources to recognize ambiguity, resolve it where possible and synthesize results.
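As a rough illustration of being explicit about ambiguity, the sketch below groups candidate birthplaces so that answers which merely differ in granularity (Dunfermline, Fife, Scotland) end up together, while genuinely conflicting answers (Croydon, Nottingham) are kept apart and could be shown to the user as alternatives. The `PARENT` gazetteer is a tiny hand-written assumption standing in for real world knowledge, and the function names are hypothetical.

```python
# Hypothetical hand-written gazetteer: each place maps to the region containing it.
PARENT = {
    "Dunfermline": "Fife",
    "Fife": "Scotland",
    "Scotland": "United Kingdom",
    "Croydon": "England",
    "Nottingham": "England",
    "England": "United Kingdom",
}


def chain(place):
    """Return the place followed by every region that contains it."""
    result = [place]
    while result[-1] in PARENT:
        result.append(PARENT[result[-1]])
    return result


def compatible(a, b):
    """Two answers are compatible if one equals or lies inside the other."""
    return a in chain(b) or b in chain(a)


def group_answers(answers):
    """Greedily cluster answers so that every pair within a cluster is compatible."""
    groups = []
    for answer in answers:
        for group in groups:
            if all(compatible(answer, member) for member in group):
                group.append(answer)
                break
        else:
            # No existing cluster fits, so this answer is a distinct alternative.
            groups.append([answer])
    return groups


if __name__ == "__main__":
    candidates = ["Scotland", "Fife", "Croydon", "Nottingham", "Dunfermline"]
    for group in group_answers(candidates):
        print(group)
```

The greedy grouping is order-dependent and a very coarse answer such as 'United Kingdom' would be compatible with everything, so a real system would also want to weigh sources and the number of supporting documents before presenting the alternatives.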