TechCrunch writes about SemanticHacker - a challenge put out by TextWise to see what the crowd can do with its NLP technology. On the front page they have a demo of their system, which creates 'semantic signatures' (essentially nodes from a broad hierarchical classification scheme) summarizing the content entered.
When dealing with the analysis of social media content - weblogs, usenet, etc. - one has to be very careful when transferring state-of-the-art NLP and text mining solutions. There are a number of key reasons, two of which are: i) noisy text and ii) the relationship between document structure and the dialogue/conversation that is taking place between the author and the entire content space. This has a big impact on getting at what the document is 'about'. How do you treat quoted material? for example. [Not to mention my use of intersentential question marks...]
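To make the quoted-material question concrete, here is a minimal sketch that separates an author's own words from quoted lines before any classification is attempted. It assumes the usenet-style convention of prefixing quoted lines with '>'; the function name split_quoted and the sample post are hypothetical, and real weblog content would need handling for HTML blockquotes, inline quotes and so on.

```python
def split_quoted(post):
    """Separate the author's own lines from '>'-prefixed quoted lines.

    Assumption: quoting follows the usenet/email convention of a leading '>'.
    """
    own, quoted = [], []
    for line in post.splitlines():
        if line.lstrip().startswith(">"):
            # Drop the quote markers and surrounding whitespace.
            quoted.append(line.lstrip().lstrip(">").strip())
        else:
            own.append(line)
    return "\n".join(own).strip(), "\n".join(quoted).strip()


# Hypothetical post: one quoted line followed by the author's reply.
post = "> I cut down the bush.\nThat reading is ambiguous."
own, quoted = split_quoted(post)
```

Whether the quoted part should then be dropped, down-weighted or analysed separately is exactly the open question; the sketch only shows that the two need to be distinguished first.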
I took this opening paragraph, which is essentially about Microsoft and Microsoft Research:
It is almost exactly a year ago that I joined Microsoft. I was lucky enough with my timing that my first week here coincided with TechFest. TechFest is an expo put on by Microsoft Research to showcase new and ongoing innovation internally. What I remember most about that first week was how impressed I was at the diversity of work being carried out by MSR. While this event is an internal one, there is also a press day which takes some of these research projects and demonstrates them to the media. This year's press day was yesterday.
And TextWise produced this semantic signature:
.../Education/Colleges_and_Universities/Asia/Maharashtra 98
Recreation/Autos/Makes_and_Models/BMW 8
.../Software/Operating_Systems/Microsoft_Windows/Windows_XP 6
Computers/Hardware/Components 6
.../Microsoft_Windows/Windows_2000/FAQs_and_Tutorials 6
What you see here are categories and scores. Here is the explanation:
Semantic Signatures® are built from weighted concepts. This simplified display shows the concept on the left, with its respective weight on the right. The weights represent the significance of ALL topics in the block of text. For the purpose of this demo, we are only displaying the top 5 concepts. Also, the weights have been placed on a 1 through 100 scale, 100 being the highest significance possible.
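As a rough illustration of reading such a signature, the sketch below parses the five lines from the demo output above into (category, weight) pairs and adds up the weight of the categories that mention Microsoft. The helper names parse_signature and mentions are my own and not part of any TextWise API; the only assumption about the format is that each line ends with an integer weight.

```python
def parse_signature(lines):
    """Split each 'category weight' line into a (category, weight) pair."""
    pairs = []
    for line in lines:
        # The weight is the last whitespace-separated token on the line.
        category, weight = line.rsplit(maxsplit=1)
        pairs.append((category, int(weight)))
    return pairs


def mentions(signature, term):
    """Return the pairs whose category path contains the given term."""
    return [(c, w) for c, w in signature if term.lower() in c.lower()]


# The demo output for the Microsoft paragraph, copied from above.
signature = parse_signature([
    ".../Education/Colleges_and_Universities/Asia/Maharashtra 98",
    "Recreation/Autos/Makes_and_Models/BMW 8",
    ".../Software/Operating_Systems/Microsoft_Windows/Windows_XP 6",
    "Computers/Hardware/Components 6",
    ".../Microsoft_Windows/Windows_2000/FAQs_and_Tutorials 6",
])

top_category, top_weight = signature[0]
microsoft_weight = sum(w for _, w in mentions(signature, "microsoft"))
```

Read this way, the two Microsoft-related categories carry a combined weight of 12, against 98 for a category about universities in Maharashtra.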
They also have problems with more obvious ambiguities:
I cut down the bush.
Produces:
.../North_America/Presidents/Bush,_George_Walker/Opposing_Views 38
.../By_Region/North_America/Presidents/Bush,_George_Walker/Humor 31
.../By_Region/North_America/Presidents/Bush,_George_Walker 31
.../North_America/Presidents/Bush,_George_Walker/Opposing_Views 24
Shopping/Jewelry/Diamonds 22
Not a promising start. Note also that the $1MM prize is paid out as $100k initially with 'up to an additional $900k during the first year after the application is released.' So the winner may only see 10% of the prize.
I'm all for more visibility for NLP in the consumer space, and definitely into semantics and the transformation of object data (text) into a logical form, so I wish TextWise all the best. That being said, I personally believe that deploying large-scale NLP applications in the consumer space requires a more incremental and controlled plan.
I suspect that the structure of their competition misses a big opportunity: getting the community to improve the lexical and ontological resources (e.g. to fix the ambiguity in the example above).