March 20, 2008

SemanticHacker

TechCrunch writes about SemanticHacker - a challenge put out by TextWise to see what the crowd can do with its NLP technology. On the front page they have a demo of their system, which creates 'semantic signatures' (essentially nodes from a broad hierarchical classification scheme) summarizing the content entered.

When dealing with the analysis of social media content - weblogs, usenet, etc. - one has to be very careful when transferring state-of-the-art NLP and text mining solutions. There are a number of key reasons, two of which are: i) noisy text and ii) the relationship between document structure and the dialogue/conversation that is taking place between the author and the entire content space. This has a big impact on getting at what the document is 'about'. How do you treat quoted material? for example. [Not to mention my use of intersentential question marks...]

I took this opening paragraph, which is essentially about Microsoft and Microsoft Research:

It is almost exactly a year ago that I joined Microsoft. I was lucky enough with my timing that my first week here coincided with TechFest. TechFest is an expo put on by Microsoft Research to showcase new and ongoing innovation internally. What I remember most about that first week was how impressed I was at the diversity of work being carried out by MSR. While this event is an internal one, there is also a press day which takes some of these research projects and demonstrates them to the media. This year's press day was yesterday.

And TextWise produced this semantic signature:

.../Education/Colleges_and_Universities/Asia/Maharashtra 98
Recreation/Autos/Makes_and_Models/BMW 8
.../Software/Operating_Systems/Microsoft_Windows/Windows_XP 6
Computers/Hardware/Components 6
.../Microsoft_Windows/Windows_2000/FAQs_and_Tutorials 6

What you see here are categories and scores. Here is the explanation:

Semantic Signatures® are built from weighted concepts. This simplified display shows the concept on the left, with its respective weight on the right. The weights represent the significance of ALL topics in the block of text. For the purpose of this demo, we are only displaying the top 5 concepts. Also, the weights have been placed on a 1 through 100 scale, 100 being the highest significance possible.
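
To make that description concrete, here is a rough sketch of how a display like this could be produced. TextWise does not say how the weights are rescaled, so everything here is an assumption: the function name display_signature is made up, and I am guessing that the raw weights lie between 0 and 1 and are simply multiplied by 100.

```python
def display_signature(raw_weights, top_k=5):
    """Return the top_k concepts with weights mapped onto a 1-100 scale.

    Assumption (not from TextWise): raw weights lie in [0.0, 1.0] and the
    display value is the raw weight times 100, clamped to the range 1..100.
    """
    # Highest weight first; ties keep their original order.
    ranked = sorted(raw_weights.items(), key=lambda kv: kv[1], reverse=True)
    return [
        (concept, max(1, min(100, round(weight * 100))))
        for concept, weight in ranked[:top_k]
    ]


if __name__ == "__main__":
    # Made-up raw weights, purely for illustration.
    toy = {"concept_a": 0.5, "concept_b": 0.25, "concept_c": 0.004}
    for concept, score in display_signature(toy):
        print(concept, score)
```

Whatever the real scaling is, the point stands that only the five heaviest concepts are shown, so a single badly matched concept with a large weight dominates the display.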

They also have problems with more obvious ambiguities:

I cut down the bush.

Produces:

.../North_America/Presidents/Bush,_George_Walker/Opposing_Views 38
.../By_Region/North_America/Presidents/Bush,_George_Walker/Humor 31
.../By_Region/North_America/Presidents/Bush,_George_Walker 31
.../North_America/Presidents/Bush,_George_Walker/Opposing_Views 24
Shopping/Jewelry/Diamonds 22

Not a promising start. Note also that the $1MM prize is paid out as $100k initially with 'up to an additional $900k during the first year after the application is released.' So the winner may only see 10% of the prize.

I'm all for more visibility for NLP in the consumer space, and I am definitely into semantics and the transformation of object data (text) into a logical form, so I wish TextWise all the best. That being said, I personally believe that deploying large-scale NLP applications in the consumer space requires a more incremental and controlled plan.

I suspect that a big piece that they are missing out on with the structure of their competition is getting the community to improve the lexical and ontological resources (e.g. to fix the ambiguity in the example above).
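
As an illustration of what community-maintained lexical resources could buy you, here is a minimal sketch of a simplified Lesk-style disambiguator: each sense of an ambiguous word carries a set of context cue words, and the sense with the most cues in the surrounding text wins. This is not how TextWise works as far as I know; the SENSES lexicon, its category names and the function names are all hypothetical.

```python
import re

# Hypothetical, hand-built sense inventory. In the scenario above, the cue
# words are the part a community of contributors would add to and correct.
SENSES = {
    "bush": {
        "Science/Biology/Plants/Shrubs": {
            "cut", "prune", "garden", "plant", "leaves", "hedge",
        },
        "Society/Politics/Presidents/Bush": {
            "president", "election", "white", "house", "administration",
        },
    }
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def disambiguate(word, text, senses=SENSES):
    """Pick the sense whose cue words overlap most with the context.

    Returns None when the word is not in the lexicon or no cue word
    appears in the text.
    """
    context = set(tokenize(text)) - {word}
    best, best_overlap = None, 0
    for sense, cues in senses.get(word, {}).items():
        overlap = len(context & cues)
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best


if __name__ == "__main__":
    print(disambiguate("bush", "I cut down the bush."))
    print(disambiguate("bush", "President Bush spoke at the White House."))
```

With a single cue word ("cut") the first sentence lands on the shrub sense rather than the president. A real system would need far richer cues, but each added cue is a small, local fix of the kind a crowd can contribute.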

Comments

It seems from the examples that they are attempting to categorize the content without consideration for the context, but rather in a general case. We do it context-based, although when dealing with generic Web results some amount of "noise" is inevitable. I think they may have a better chance of success in a vertical, by creating solid taxonomies tuned for a specific area of content. That's why I think in the challenge they seek app ideas for a vertical.

Perhaps the proper semantic name for this contest should be the AmericanSemanticHackerContest.

Interesting examples. I'm guessing they got Maharashtra from MSR, since MSR appears on the Wikipedia Maharashtra page, though a good tokenizer would not have fallen for it.

I wouldn't expect a good disambiguation solution any time soon; the problem is just that hard. Personally, I find it even more striking that what they're looking for is a business plan. So they have some technology, it might not be perfect, but keep in mind how tough the problem is -- categorizing some random text without any domain restriction. That's extremely ambitious. But what they're lacking is a good business application for it. Isn't this indicative of text mining in general? Great algorithms and ideas, but not many viable practical applications?

I think the bigger question is what are good applications of "crummy NLP" in Church and Hovy's sense...
