February 11, 2007



I'd argue that there is a temporal axis to relevancy that is not being considered by NLP or by many other systems (I know of only one that does consider it). The temporal axis would help answer a query like "show me the baddest movies of 1950". Had this question been asked in 1950, the answer would have been different from the answer today. Only recently has "bad" evolved in slang to mean "good". Hence, in 1950 the query should have elicited movies that were undesirable or poorly made, while today the same query should elicit movies that are really good. However, none of the proposed technologies has a notion of the temporal axis required to provide context for NLP queries. That is why I'm somewhat pessimistic about Powerset's hype. That, and the fact that many people have a tough time expressing themselves in writing: how will an algorithm capture what a person means when the person lacks the tools (the ability to write properly) to convey it? Call it a UI problem more than an algorithmic one.

Olivier Bousquet

Regarding the example you give of the query [youtube data mining], one could think of yet another interpretation: this guy might be looking for Swivel, which can be thought of as the "YouTube" of data mining. Maybe he heard someone talk about it but did not remember the name of the site ;)

Bob Carpenter

I couldn't agree more about doing confidence-based collective data extraction. Those who've participated in the DARPA bakeoffs over time have been doing just what Matthew suggests.

That's why we built LingPipe to handle confidence-based extraction for tagging, chunking and classification. But we prefer to use an expectation-based form of inference rather than a winner-take-all approach, which only makes sense for bakeoff entries. Most social networking and clustering software works just fine with expectations.
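To make the contrast between winner-take-all and expectation-based inference concrete, here is a minimal Python sketch. It is not LingPipe's API; the function names, labels and probabilities are all made up for illustration. Each mention carries a probability distribution over labels, and the two functions aggregate those distributions differently.

```python
from collections import defaultdict

def winner_take_all(mentions):
    """Count only the single highest-probability label for each mention."""
    counts = defaultdict(float)
    for candidates in mentions:
        best_label = max(candidates, key=candidates.get)
        counts[best_label] += 1.0
    return dict(counts)

def expected_counts(mentions):
    """Add every label's probability, so uncertain mentions count fractionally."""
    counts = defaultdict(float)
    for candidates in mentions:
        for label, prob in candidates.items():
            counts[label] += prob
    return dict(counts)

# Hypothetical extractor output: one label distribution per mention.
mentions = [
    {"PERSON": 0.60, "ORG": 0.40},
    {"PERSON": 0.55, "ORG": 0.45},
    {"ORG": 0.90, "PERSON": 0.10},
]

print(winner_take_all(mentions))
print(expected_counts(mentions))
```

With these made-up numbers, winner-take-all favors PERSON two to one, while the expected counts favor ORG, because the two PERSON decisions were close calls and the ORG decision was not.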

In real life, at least for intelligence analysts and biologists, recall is the game. The reason is the long tail. Most relations are not mentioned dozens of times in various forms. In other words, you may have to deal with the equivalent of "youtube data mining", because language is vague.

But it's not just recall of relations. Intelligence analysts need to link back to sources to generate their reports. Biologists won't trust current relation extraction programs to guide their research because even reasoning cumulatively, they're not precise enough. Too much depends on context outside of the phrase, sentence or even paragraph/document in which a fact is stated.

Bob Carpenter

Could anyone help someone like me with limited imagination? I'm trying to envision what the next generation of NLP-enabled search is going to look like from a user's perspective.

Let's say my wife and I are trying to settle a bet that arose over dinner, such as whether the lead singer of Jethro Tull was Scottish or English. I actually went to Google and used their newish pattern-based search:

Google QUERY: ian anderson was born in *

Here are the first few "answers":

1. Paiseley, Scotland
2. 1947
3. Scotland in 1947
4. Fife in 1947
5. Philadelphia
6. Croydon before England won the world cup
7. Williston, ND
8. Nottingham on June 29, 1948
9. 1981
10. digg

Now which of those is the "right" answer? Well, first it depends on which Ian Anderson we're talking about. There are 10 with their own Wikipedia pages.

The voting thing doesn't help much here unless you happen to know that Dunfermline is in West Fife, which is in Scotland, which is in the United Kingdom. I'm confused about answer 1, because it's "Paiseley", which is spelled wrong. Wikipedia claims the answer is Dunfermline. Clearly the source is important in providing answers.
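As a rough illustration of why voting needs that geographic knowledge, here is a small Python sketch. The containment map is hand-written for this example (a real system would need a gazetteer), and the substring matching is deliberately naive. It takes the ten answers listed above and credits each vote to the place mentioned and to every region that contains it.

```python
from collections import Counter

# Hypothetical hand-written containment map (place -> containing region).
CONTAINED_IN = {
    "Dunfermline": "Fife",
    "Fife": "Scotland",
    "Scotland": "United Kingdom",
    "Croydon": "England",
    "Nottingham": "England",
    "England": "United Kingdom",
}

def ancestors(place):
    """Return the place followed by every region that contains it."""
    chain = [place]
    while chain[-1] in CONTAINED_IN:
        chain.append(CONTAINED_IN[chain[-1]])
    return chain

def rolled_up_votes(answers):
    """Give each answer one vote per region it mentions or implies."""
    known = set(CONTAINED_IN) | set(CONTAINED_IN.values())
    votes = Counter()
    for answer in answers:
        regions = set()
        for place in known:
            if place in answer:  # naive substring match
                regions.update(ancestors(place))
        for region in regions:
            votes[region] += 1
    return votes

answers = [
    "Paiseley, Scotland", "1947", "Scotland in 1947", "Fife in 1947",
    "Philadelphia", "Croydon before England won the world cup",
    "Williston, ND", "Nottingham on June 29, 1948", "1981", "digg",
]

votes = rolled_up_votes(answers)
print(votes["Scotland"], votes["England"])
```

Even with the roll-up, the votes mix together every Ian Anderson, and the misspelled "Paiseley" only counts because the answer also says "Scotland". The tally says nothing about which source to trust.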
