TechCrunch writes about SemanticHacker - a challenge put out by TextWise to see what the crowd can do with its NLP technology. On the front page they have a demo of their system, which creates 'semantic signatures' (essentially nodes from a broad hierarchical classification scheme) summarizing the content entered.
When dealing with the analysis of social media content - weblogs, usenet, etc. - one has to be very careful when transferring state-of-the-art NLP and text mining solutions. There are a number of key reasons, two of which are: i) noisy text and ii) the relationship between document structure and the dialogue/conversation that is taking place between the author and the entire content space. This has a big impact on getting at what the document is 'about'. How do you treat quoted material? for example. [Not to mention my use of intersentential question marks...]
I took this opening paragraph, which is essentially about Microsoft and Microsoft Research:
It is almost exactly a year ago that I joined Microsoft. I was lucky enough with my timing that my first week here coincided with TechFest. TechFest is an expo put on by Microsoft Research to showcase new and ongoing innovation internally. What I remember most about that first week was how impressed I was at the diversity of work being carried out by MSR. While this event is an internal one, there is also a press day which takes some of these research projects and demonstrates them to the media. This year's press day was yesterday.
And TextWise produced this semantic signature:
.../Education/Colleges_and_Universities/Asia/Maharashtra 98
Recreation/Autos/Makes_and_Models/BMW 8
.../Software/Operating_Systems/Microsoft_Windows/Windows_XP 6
Computers/Hardware/Components 6
.../Microsoft_Windows/Windows_2000/FAQs_and_Tutorials 6
Semantic Signatures® are built from weighted concepts. This simplified display shows the concept on the left, with its respective weight on the right. The weights represent the significance of ALL topics in the block of text. For the purpose of this demo, we are only displaying the top 5 concepts. Also, the weights have been placed on a 1 through 100 scale, 100 being the highest significance possible.
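To make that description concrete, here is a minimal sketch of how such a display could be produced. It is not TextWise's method, which the demo does not document. I am assuming each concept comes with a raw score between 0.0 and 1.0; the function name top_concepts, the score range, and the linear mapping onto 1 through 100 are all my own illustrative choices.

```python
from typing import Dict, List, Tuple


def top_concepts(raw_scores: Dict[str, float], k: int = 5) -> List[Tuple[str, int]]:
    """Return the k highest-weighted concepts on a 1-100 scale.

    Assumption (hypothetical): raw scores lie in [0.0, 1.0]. Values outside
    that range are clipped before scaling.
    """
    scaled = []
    for concept, score in raw_scores.items():
        clipped = min(max(score, 0.0), 1.0)
        # Linear map: 0.0 -> 1, 1.0 -> 100.
        weight = 1 + round(clipped * 99)
        scaled.append((concept, weight))
    # Highest weight first; ties are broken alphabetically by concept path.
    scaled.sort(key=lambda item: (-item[1], item[0]))
    return scaled[:k]


if __name__ == "__main__":
    # Made-up scores, purely for illustration.
    demo = {
        "Computers/Hardware/Components": 0.05,
        "Recreation/Autos/Makes_and_Models/BMW": 0.07,
        "Shopping/Jewelry/Diamonds": 0.21,
    }
    for concept, weight in top_concepts(demo):
        print(concept, weight)
```

Because the scale is absolute rather than relative to the best match, the top concept for a given text can land well below 100.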
I cut down the bush.
.../North_America/Presidents/Bush,_George_Walker/Opposing_Views 38
.../By_Region/North_America/Presidents/Bush,_George_Walker/Humor 31
.../By_Region/North_America/Presidents/Bush,_George_Walker 31
.../North_America/Presidents/Bush,_George_Walker/Opposing_Views 24
Shopping/Jewelry/Diamonds 22
Not a promising start. Note also that the $1MM prize is paid out as $100k initially with 'up to an additional $900k during the first year after the application is released.' So the winner may only see 10% of the prize.
I'm all for more visibility for NLP in the consumer space, and definitely into semantics and the transformation of object data (text) into a logical form, so I wish TextWise all the best. That being said, I personally believe that deploying large-scale NLP applications in the consumer space requires a more incremental and controlled plan.
I suspect that a big piece that they are missing out on with the structure of their competition is getting the community to improve the lexical and ontological resources (e.g. to fix the ambiguity in the example above).


It seems from the examples that they are attempting to categorize the content in the general case, without consideration for the context. We do it context-based, although when dealing with generic Web results some amount of "noise" is inevitable. I think they may have a better chance of success in a vertical, by creating solid taxonomies tuned for a specific area of content. That's why I think in the challenge they seek app ideas for a vertical.
Posted by: keywitness | March 20, 2008 at 09:50 AM
Perhaps the proper semantic name for this contest should be the AmericanSemanticHackerContest.
Posted by: Frank Goertzen | March 20, 2008 at 10:29 AM
Interesting examples. I'm guessing they got Maharashtra from MSR, since MSR appears on the Wikipedia Maharashtra page, though a good tokenizer would not have fallen for it.
I wouldn't expect a good disambiguation solution any time soon; the problem is just that hard. Personally, I find it even more striking that what they're looking for is a business plan. So they have some technology, it might not be perfect, but keep in mind how tough the problem is -- categorizing some random text without any domain restriction. That's extremely ambitious. But what they're lacking is a good business application for it. Isn't this indicative of text mining in general? Great algorithms and ideas, but not many viable practical applications?
I think the bigger question is what are good applications of "crummy NLP" in Church and Hovy's sense...
Posted by: Entroppy | March 24, 2008 at 06:03 PM