I've been playing ping-pong with Fernando over the potential and value of NLP in search. I've yet to respond to Fernando's latest post (which I hope to do RSN). In the meantime, I'd like to point to something in a recent VentureBeat interview with Peter Norvig, Google's Director of Research. The interview was prompted by a news cycle around Powerset, though it does not directly concern the company. Norvig states:
It would be great if we understood every word of every document and every query, but that’s a long way off.
To me, this kind of statement illustrates a crucial issue in the NLP debate. Norvig quite rightly talks about evaluating technologies from a user-needs and quality perspective. But the quote reflects a very traditional view of search: a query is entered and a set of documents is returned. If one thinks outside those narrow constraints, and instead views the information found online independently of the documents that capture it in human language, then the huge redundancy across those documents suggests ways of serving the user that don't require a perfect analysis of every document.
One of the basic paradigms of text mining, and a simple though constraining architectural paradigm, is the one-document-at-a-time pipeline. A document comes in, the machinery turns, and results pop out. This is limiting: it fails to leverage redundancy, the great antidote to the illusion that perfection is required at every step. The key to cracking the problem open is the ability to measure, or estimate, confidence in the results. Given 10 different ways in which the same information is presented, one can simply pick the result associated with the most confident outcome - and possibly fix the other results in its light. A related approach is to measure the difficulty of a particular interpretation. This is very pertinent to social media, or blog data. While there is plenty of reasonably well-formed language in blogs, there is also a huge volume of noisy text with little or no traditional grammatical quality, disfluencies, and so on. What a system does, or attempts to do, with this data could be different from what it attempts with more formal language.
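To make the confidence-over-redundancy idea concrete, here is a minimal sketch. The extractions, field names and confidence scores are all hypothetical, not from any particular system: given several redundant extractions of the same underlying fact, keep the one with the highest confidence.

```python
# Hypothetical extractions of the same underlying fact, harvested from
# redundant documents: (entity, relation, value, confidence). The
# confidences might come from a tagger or chunker's probability estimates.
extractions = [
    ("Ian Anderson", "born_in",   "Dunfermline", 0.92),
    ("Ian Anderson", "born_in",   "Paiseley",    0.41),  # noisy variant
    ("Ian Anderson", "born_in",   "Scotland",    0.78),
    ("Jethro Tull",  "formed_in", "Blackpool",   0.66),
]

# Group redundant mentions by (entity, relation) and keep the most
# confident value for each; the losing mentions could then be revisited
# ("fixed") in light of the winner.
best = {}
for entity, rel, value, conf in extractions:
    key = (entity, rel)
    if key not in best or conf > best[key][1]:
        best[key] = (value, conf)

for (entity, rel), (value, conf) in best.items():
    print(f"{entity} {rel}: {value} (confidence {conf:.2f})")
```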
And as for 'understanding every query', this is where what Barney calls a grunting pidgin language comes in. For example, I recently saw someone land on this blog via the Google query 'youtube data mining'. As the Google results page suggested, this cannot be 'understood' in the way that a query like 'data mining over youtube data' can. Does the user want to find out about data mining over YouTube data, or a video on YouTube about data mining?
One thing is clear about the current debate: the issue is gaining visibility, and there will be pressure on various parties to push particular points of view or interpretations of what these technologies can do. The PR may well become dissociated from what is really going on, and from the real debate about the liberating potential of freeing users from existing behaviours and expectations - which is, of course, the hardest part of any fundamental change.
Since writing the above, I see that Fernando has another post up. In it he argues that while relevance with respect to some keyword query is in some sense a natural - or observable - function (it deals with constructs we can observe, namely the query and the documents), this is not the case for NLP-type analysis, in which the result is an artificial construct (a parse tree or a relational assertion, for example). I don't disagree with the basis of this argument.
The situation is different for keyword-based search, because the input-output function is in a sense trivial: return all the documents containing query terms. The only, very important, subtlety is the order in which to return documents, so that those most relevant are returned first. Relevance judgments are relatively easy to elicit in bulk, compared with trying to figure out whether an entity or relation is an appropriate answer to a natural language question across a wide range of domains.
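In code, that trivial input-output function looks roughly like the sketch below - a toy index and a crude term-frequency score standing in for real relevance ranking, all of it made up for illustration:

```python
# Toy keyword search: return every document containing all query terms,
# ordered by a crude relevance score (here, raw term frequency).
docs = {
    "d1": "data mining over youtube data",
    "d2": "a youtube video about data mining",
    "d3": "swivel the youtube of data",
}

def search(query: str) -> list[str]:
    terms = query.lower().split()
    hits = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        if all(t in words for t in terms):              # containment is trivial
            score = sum(words.count(t) for t in terms)  # ranking is the hard part
            hits.append((score, doc_id))
    return [doc_id for score, doc_id in sorted(hits, reverse=True)]

print(search("youtube data mining"))  # ['d1', 'd2'] - both readings match
```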
It is the notion of relevancy against keywords that is problematic, however. As in the example I gave, one can't determine the user's intention from the keywords alone. So while there may be a gradation in the relevance of some set of documents, there are also wholly separate collections of results that are not relevant in the same way. To change this perspective, one needs the user and the system, together, to accurately capture the user's intention - which is where Barney's 'books for children' versus 'books by children' example comes in (or my 'youtube data mining' example).
I'd argue that there is a temporal axis to relevancy that is not being considered by NLP systems, nor by many others (I only know of one that does consider it). The temporal axis would help in answering a question like "show me the baddest movies of 1950". Had this question been asked in 1950, the answer would have been different from the answer today: the term "bad" has only recently evolved, in jargon, to mean "good". Hence, in 1950 the response might have correctly elicited movies that were undesirable or poorly made, while today the same query should correctly elicit movies that are really good. None of the proposed technologies has a notion of the temporal axis required to provide context to NLP queries. Hence my pessimism about Powerset's hype - that, and the fact that many people have a tough time expressing themselves in writing. How will an algorithm capture what a person means when the person doesn't have the tools (the ability to write properly) to convey it? Call it a UI problem more than an algorithmic one.
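(A crude sketch of the missing piece, with a made-up time-indexed lexicon - interpretation would have to be conditioned on when a word sense was current:)

```python
# Hypothetical time-indexed lexicon: a word's sense depends on the era
# in which the query (or document) is interpreted.
SENSES = {
    "bad": [
        (1900, "poorly made, undesirable"),
        (1980, "excellent (slang)"),  # rough date, for illustration only
    ],
}

def sense_at(word: str, year: int) -> str:
    """Return the sense of `word` that was current in `year`."""
    current = None
    for since, meaning in sorted(SENSES[word]):
        if year >= since:
            current = meaning
    return current

print(sense_at("bad", 1950))  # poorly made, undesirable
print(sense_at("bad", 2007))  # excellent (slang)
```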
Posted by: P-Air | February 11, 2007 at 04:42 PM
Regarding the example you give of the query [youtube data mining], one could think of yet another interpretation: this person might be looking for Swivel, which can be thought of as the "YouTube" of data mining - maybe they heard someone talk about it but did not remember the name of the site ;)
Posted by: Olivier Bousquet | February 12, 2007 at 04:15 AM
I couldn't agree more about doing confidence-based collective data extraction. Those who've participated in the DARPA bakeoffs over time have been doing just what Matthew suggests.
That's why we built LingPipe to handle confidence-based extraction for tagging, chunking and classification. But we prefer an expectation-based form of inference to winner-take-all, which only makes sense for bakeoff entries. Most social networking and clustering software works just fine with expectations.
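(Not LingPipe's actual API - just a generic sketch of the distinction: winner-take-all commits to the single best analysis, while expectation-based inference weights every candidate analysis by its confidence.)

```python
# Generic illustration (not LingPipe's API): candidate analyses of one
# phrase with confidences, e.g. alternative readings of an ambiguous query.
candidates = [
    ("data mining over YouTube data",     0.55),
    ("a YouTube video about data mining", 0.35),
    ("the 'YouTube of data mining' site", 0.10),
]

# Winner-take-all: commit to the single best analysis (bakeoff style).
winner = max(candidates, key=lambda c: c[1])
print("winner-take-all:", winner[0])

# Expectation-based: downstream scores are averaged over all candidates,
# weighted by confidence, so no hard commitment is made up front.
def downstream_score(analysis: str) -> float:
    # stand-in for whatever the downstream component computes
    return float("video" in analysis)

expected = sum(conf * downstream_score(a) for a, conf in candidates)
print("expected downstream score:", expected)  # 0.35
```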
In real life, at least for intelligence analysts and biologists, recall is the name of the game. The reason is the long tail: most relations are not mentioned dozens of times in various forms. In other words, you may have to deal with the equivalent of "youtube data mining", because language is vague.
But it's not just recall of relations. Intelligence analysts need to link back to sources to generate their reports. Biologists won't trust current relation extraction programs to guide their research because even reasoning cumulatively, they're not precise enough. Too much depends on context outside of the phrase, sentence or even paragraph/document in which a fact is stated.
Posted by: Bob Carpenter | February 12, 2007 at 02:19 PM
Could anyone help someone like me with limited imagination? I'm trying to envision what the next generation of NLP-enabled search is going to look like from a user's perspective.
Let's say my wife and I are trying to settle a bet that arose over dinner, such as whether the lead singer of Jethro Tull was Scottish or English. I actually went to Google and used their newish pattern-based search:
Google QUERY: ian anderson was born in *
Here's the first few "answers":
1. Paiseley, Scotland
2. 1947
3. Scotland in 1947
4. Fife in 1947
5. Philadelphia
6. Croydon before England won the world cup
7. Williston, ND
8. Nottingham on June 29, 1948
9. 1981
10. digg
Now which of those is the "right" answer? Well, first it depends on which Ian Anderson we're talking about. There are 10 with their own Wikipedia pages.
The voting thing doesn't help much here unless you happen to know that Dunfermline is in West Fife, which is in Scotland, which is in the United Kingdom. I'm confused about answer 1, because it's "Paiseley", which is spelled wrong. Wikipedia claims the answer is Dunfermline. Clearly the source is important in providing answers.
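(A minimal sketch of what source-aware voting might look like, with made-up trust weights - a real system would have to learn these:)

```python
# Made-up trust weights per source; purely illustrative.
TRUST = {"wikipedia": 0.9, "fan_site": 0.3, "forum": 0.2}

# Candidate answers paired with the source each came from.
answers = [
    ("Dunfermline", "wikipedia"),
    ("Paiseley",    "fan_site"),   # misspelled, low-trust source
    ("Fife",        "forum"),
    ("Dunfermline", "forum"),
]

# Accumulate trust-weighted votes per answer instead of raw counts.
votes: dict[str, float] = {}
for answer, source in answers:
    votes[answer] = votes.get(answer, 0.0) + TRUST[source]

best = max(votes, key=votes.get)
print(best, votes[best])  # Dunfermline 1.1
```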
Posted by: Bob Carpenter | February 12, 2007 at 02:30 PM