I've written before about the differences between clever algorithms and approaches to AI which use massive data volumes as their core MO. Philipp Lenssen over at Google Blogoscoped has a great example of this supplied by Google's spell checker. Typing in 'she invented' returns the spelling suggestion 'he invented'. Other examples surface in the comments. There is a nice irony here. I used Schmidt's claim that the Google spell checker was a form of AI as an example of why Google's centrality in the public's perception of AI is a sort of tragedy.
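As an aside, here's a minimal sketch of how a purely frequency-driven suggester can produce exactly this behavior (the phrase counts and the one-deletion candidate generator are my own toy assumptions, in the spirit of well-known data-driven correctors, not Google's actual system):

```python
from collections import Counter

# Hypothetical phrase counts, standing in for a large web corpus.
# On the real web, "he invented" vastly outnumbers "she invented".
phrase_counts = Counter({
    "he invented": 1_200_000,
    "she invented": 95_000,
})

def deletions(phrase):
    """All strings one character-deletion away from the input."""
    return {phrase[:i] + phrase[i + 1:] for i in range(len(phrase))}

def suggest(phrase):
    """Return a known variant within one deletion if it is much more
    common than the query itself -- pure frequency, no semantics."""
    candidates = deletions(phrase) | {phrase}
    best = max(candidates, key=lambda p: phrase_counts.get(p, 0))
    if phrase_counts.get(best, 0) > 2 * phrase_counts.get(phrase, 0):
        return best
    return None

print(suggest("she invented"))  # -> 'he invented': frequency wins over sense
```

Nothing here "understands" anything; the suggestion falls out of raw counts, which is the whole point.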
An issue that came up at the Future of Search event (about which I'm attempting to write a longer piece) is that of correctness (the term precisiation came up...). You search for some information and get a page that makes a statement of the correct form, but is it correct? Some great examples flew about regarding the origins of various electrical inventions, with the names of Edison and Tesla attached. The problem is that the majority vote may not always be correct.
Another point that Prof. Lotfi Zadeh (of the Computer Science Division, Dept. of EECS, Berkeley) made, which I thought was well explained, was that "imprecision in natural language cannot be dealt with through the use of bivalent logic and probability theory. What are needed are new concepts and new techniques" beyond what's available today. In summarizing the future of search as "question answering", he elaborated the shortcomings of natural language elegantly. I have to agree that communication and question answering go beyond natural language alone. Communication and answers are always relative, frequently shaped by some exchange between people, relying not just on the words used but also on the context in which they're used: one's facial expressions, tone of voice, time of day or point in history, our mutually shared experiences, and/or all of the above and other contexts. Even the written word carries baggage along these lines, except perhaps for tone of voice and facial expressions.
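To make Prof. Zadeh's point concrete, here is a toy sketch of my own (the "tall" thresholds are made up, and this is only an illustration, not his formalism) contrasting a bivalent predicate with a fuzzy membership function for a vague word:

```python
def is_tall_bivalent(height_cm: float) -> bool:
    """Bivalent logic: a hard cutoff that natural language never specifies."""
    return height_cm >= 180

def is_tall_fuzzy(height_cm: float) -> float:
    """Fuzzy membership in Zadeh's sense: a degree of truth in [0, 1],
    ramping between 'clearly not tall' and 'clearly tall'."""
    if height_cm <= 165:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 165) / 25

for h in (160, 175, 179, 181, 195):
    print(h, is_tall_bivalent(h), round(is_tall_fuzzy(h), 2))
# 179 vs. 181: the bivalent answer flips from False to True over 2 cm,
# while the fuzzy degree barely moves (0.56 vs. 0.64), closer to how we speak.
```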
The use of language to understand intent, as positioned by NLP, falls short in that language is but one channel in a multi-channel communication. Understanding a question by its words alone falls short of truly understanding the question being asked. This leads me to the thought that Prof. Zadeh may have it right.
It was good meeting you there, Matt, and great to hear your perspectives live.
Posted by: p-air | May 07, 2007 at 02:25 AM
The fact that the majority might not be correct is exactly why you use a trust metric.
Trust certain nodes over others.
Imagine these applied to voting systems. :)
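As a toy sketch (the source names and trust weights are made up), compare head-counting with trust-weighted tallying:

```python
from collections import defaultdict

# Hypothetical claims about who invented the light bulb, with per-source trust.
votes = [
    ("random-forum-post",  "Tesla",  0.1),
    ("another-forum-post", "Tesla",  0.1),
    ("yet-another-post",   "Tesla",  0.1),
    ("encyclopedia",       "Edison", 0.9),
]

def majority(votes):
    """Plain majority: every source counts the same."""
    tally = defaultdict(int)
    for _, claim, _ in votes:
        tally[claim] += 1
    return max(tally, key=tally.get)

def trust_weighted(votes):
    """Trust metric: sum trust per claim, so one trusted node
    can outweigh many untrusted ones."""
    tally = defaultdict(float)
    for _, claim, trust in votes:
        tally[claim] += trust
    return max(tally, key=tally.get)

print(majority(votes))        # Tesla  (3 votes to 1)
print(trust_weighted(votes))  # Edison (0.9 beats 0.3)
```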
Posted by: Kevin Burton | May 07, 2007 at 03:51 AM
I'm pretty baffled by Zadeh's comments and the questions about correctness. It seems like a resignation that is only acceptable outside of business. Imagine people at a search company saying "There is just too much context to try to figure out what this person is looking for based on keywords alone."
The natural language search panel devolved into the quest for a perfect question-answering machine that knows, for instance, that Tesla invented the light bulb. It's not clear to me why we would want our search technology determining the correctness of pages. I just want search to be best at finding the pages I've described.
Posted by: Mike Love | May 07, 2007 at 12:38 PM
Using massive volumes of data creates problems if term interdependence is relied upon. Building representations of text collections using Latent Semantic Indexing, for example, becomes counterproductive once a document set reaches a certain size (somewhere on the order of 100,000 documents). I imagine that's the very source of "he" replacing "she." LSI is one of many term interdependence schemes that are supposed to be clever and are often used to represent features of documents in large data sets. Bigrams, trigrams, phrases, etc. are all clever approaches for which you will receive nice stars from your professors. But in practice they are not only a pain; they are ultimately really dumb as document sets get larger and more "realistic." The lesson is not that data sets can become too large. There are approaches that mitigate the tyrannical majorities of pure frequencies. Ultimately, term independence must be enforced.
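For anyone who hasn't seen LSI's machinery, here is a toy sketch (tiny made-up counts; the trouble I'm describing shows up at the ~100,000-document scale): build a term-document matrix and truncate its SVD so that co-occurring terms share latent dimensions.

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
terms = ["he", "she", "invented", "cooked"]
X = np.array([
    [9, 8, 0],   # "he"
    [1, 0, 2],   # "she"
    [8, 7, 1],   # "invented"
    [0, 1, 3],   # "cooked"
], dtype=float)

# LSI: a truncated SVD keeps only the k strongest latent dimensions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# The rank-k reconstruction blends counts across co-occurring terms --
# exactly the interdependence effect that lets one term stand in for another.
print(terms)
print(np.round(X_k, 2))
```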
Alas, too clever, whether by algorithm or by learning from large data sets, is never smart enough.
I still don't really know what "natural language" means.
Posted by: Patrick Herron | May 10, 2007 at 04:24 PM