I've been playing ping-pong with Fernando over the potential and value of NLP in search. I've yet to respond to Fernando's latest post (which I hope to do real soon now). In the meantime, I'd like to point to something in a recent VentureBeat interview with Peter Norvig, Google's Director of Research. The interview was prompted by a news cycle around Powerset, though it does not directly concern the company. In it, Norvig states:
It would be great if we understood every word of every document and every query, but that’s a long way off.
To me, this kind of statement illustrates a crucial issue in the NLP debate. Norvig quite rightly talks about evaluating technologies from a user needs and quality perspective. This quote indicates a very traditional search point of view: a query is entered and a set of documents is returned. If one thinks outside those extreme constraints, and instead views the information found online independently of the documents that capture that information in human language, the huge redundancy in those documents suggests approaches to serving the user that don't require the perfect analysis of every document.
One of the basic paradigms of text mining, and a simple though constraining architectural paradigm, is the one-document-at-a-time pipeline. A document comes in, the machinery turns, and results pop out. However, this is limiting. It fails to leverage redundancy - the great antidote to the illusion that perfection is required at every step. The key to cracking the problem open is the ability to measure, or estimate, the confidence in the results. With this in mind, given 10 different ways in which the same information is presented, one should simply pick the result associated with the most confident outcome - and possibly fix the other results in that light. Another, related approach is to measure the difficulty associated with a particular interpretation. This is very pertinent to social media, or blog data. While there is plenty of reasonably well formed language in blogs, there is also a huge volume of noisy text, full of dysfluencies and with little or no traditional grammatical quality. What a system does, or attempts to do, with this data could be different from what is attempted with more formal language.
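To make the redundancy idea concrete, here is a minimal sketch in Python. It assumes each mention of a piece of information has already been analysed and given a confidence estimate. The Extraction record, the function names, the 0.5 threshold and the sample values are all invented for illustration and are not taken from any particular system.

```python
from typing import NamedTuple


class Extraction(NamedTuple):
    source: str        # hypothetical document identifier
    fact_key: str      # which piece of information this mention is about
    value: str         # the interpretation the system produced
    confidence: float  # the system's own estimate, assumed to lie in [0, 1]


def resolve_by_confidence(extractions):
    """For each fact, keep the single most confident extraction."""
    best = {}
    for ex in extractions:
        current = best.get(ex.fact_key)
        if current is None or ex.confidence > current.confidence:
            best[ex.fact_key] = ex
    return best


def repair_low_confidence(extractions, best, threshold=0.5):
    """Overwrite low-confidence values with the most confident value for the same fact.

    The threshold is an arbitrary illustrative choice.
    """
    repaired = []
    for ex in extractions:
        winner = best[ex.fact_key]
        if ex.confidence < threshold and ex.value != winner.value:
            ex = ex._replace(value=winner.value)
        repaired.append(ex)
    return repaired


# Made-up sample: three documents stating the same fact, one analysed badly.
mentions = [
    Extraction("doc-1", "norvig.employer", "Google", 0.94),
    Extraction("doc-2", "norvig.employer", "Goggle", 0.31),
    Extraction("doc-3", "norvig.employer", "Google", 0.78),
]
best = resolve_by_confidence(mentions)
repaired = repair_low_confidence(mentions, best)
```

The point of the sketch is only that the analysis of doc-2 does not need to be right on its own, as long as the system knows it is unsure and has a more confident reading of the same information elsewhere.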
As for 'understanding every query', this is where what Barney calls a grunting pidgin language comes in. For example, I recently saw someone land on this blog via the Google query 'youtube data mining'. As the Google results page suggested, this cannot be 'understood' in the same way that a query like 'data mining over youtube data' can. Does the user want to find out about data mining over YouTube data, or a video on YouTube about data mining?
One thing is clear about the current debate - the issue is gaining visibility, and there will be pressure on various parties to push particular points of view or interpretations of the capability of these technologies. This PR activity may well become dissociated from what is really going on, and from the real debate about the liberating potential of freeing users from existing behaviours and expectations - which is, of course, the hardest part of any fundamental change.
Since writing the above, I see that Fernando has another post up. In it he argues that while relevance with respect to some keyword query is in some sense a natural - or observable - function (it makes sense because it deals with constructs that we can observe, namely the query and the documents), this is not the case for NLP-type analysis, in which the result is an artificial construct (a parse tree or a relational assertion, for example). I don't disagree with the basis of this argument.
The situation is different for keyword-based search, because the input-output function is in a sense trivial: return all the documents containing query terms. The only, very important, subtlety is the order in which to return documents, so that those most relevant are returned first. Relevance judgments are relatively easy to elicit in bulk, compared with trying to figure out whether an entity or relation is an appropriate answer to a natural language question across a wide range of domains.
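As a rough illustration of that input-output function, here is a small Python sketch. It assumes "containing query terms" means containing all of them, and it uses a raw count of term occurrences as a crude stand-in for relevance ordering. Both are simplifying assumptions, and the function name and sample documents are invented.

```python
def keyword_search(query, documents):
    """Return ids of documents containing every query term, highest score first."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        if all(term in words for term in terms):
            # Crude relevance proxy: total occurrences of the query terms.
            score = sum(words.count(term) for term in terms)
            hits.append((score, doc_id))
    # Sort by descending score, breaking ties by document id.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _, doc_id in hits]


# Made-up documents for illustration only.
docs = {
    "a": "data mining over youtube data",
    "b": "a youtube video about data mining",
    "c": "mining equipment reviews",
}
results = keyword_search("youtube data mining", docs)
```

Documents "a" and "b" both match the query, although they answer quite different needs; the ordering ranks them against each other without saying anything about which need the user had.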
It is the notion of relevance against keywords that is problematic, however. As in the 'youtube data mining' example, one can't determine the intention of the user simply from the keywords. So while there may be a gradation in the relevance of some set of documents, there are also wholly separate collections of results which are not all relevant in the same way. To change this perspective, the user and the system together need to capture the user's intention accurately - which is where Barney's 'books for children' and 'books by children' example comes in (or my 'youtube data mining' example).