
March 20, 2008



It seems from the examples that they are attempting to categorize the content without consideration for context, treating it as a general case instead. We take a context-based approach, although when dealing with generic Web results some amount of "noise" is inevitable. I think they may have a better chance of success in a vertical, by creating solid taxonomies tuned for a specific area of content. That's probably why the challenge seeks app ideas for a vertical.

Frank Goertzen

Perhaps the proper semantic name for this contest should be the AmericanSemanticHackerContest.


Interesting examples. I'm guessing they got Maharashtra from MSR, since MSR appears on the Wikipedia Maharashtra page, though a good tokenizer would not have fallen for it.
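One plausible reading of the tokenizer point (an assumption, not something stated in the comment): "MSR" may have been matched as a raw substring of a longer token on the page, such as "MSRTC" (the Maharashtra state transport corporation). A minimal sketch of the difference between substring matching and token-boundary matching:

```python
import re

def substring_match(term, text):
    # Naive approach: fires whenever the characters appear anywhere,
    # even inside a longer word (e.g. "MSR" inside "MSRTC").
    return term.lower() in text.lower()

def token_match(term, text):
    # Token-based approach: split into word tokens first, so an
    # acronym only matches when it occurs as a standalone token.
    tokens = re.findall(r"[A-Za-z]+", text)
    return term.lower() in (t.lower() for t in tokens)

page = "Buses are run by the MSRTC across the state."
print(substring_match("MSR", page))  # True  -> spurious association
print(token_match("MSR", page))      # False -> no standalone "MSR" token
```

Under this (hypothetical) failure mode, a tokenizer that respects word boundaries would never have linked the text to Maharashtra in the first place.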

I wouldn't expect a good disambiguation solution any time soon; the problem is just that hard. Personally, I find it even more striking that what they're looking for is a business plan. So they have some technology, and it might not be perfect, but keep in mind how tough the problem is -- categorizing some random text without any domain restriction. That's extremely ambitious. What they're lacking is a good business application for it. Isn't this indicative of text mining in general? Great algorithms and ideas, but not many viable practical applications?

I think the bigger question is what are good applications of "crummy NLP" in Church and Hovy's sense...

