April 06, 2009

Strings are not Meanings Part 2

I wrote briefly about an article by Google researchers called The Unreasonable Effectiveness of Data – saying it was worth a read, but making three points to keep in mind when reading it. Fernando responded quite enthusiastically, and I want to clarify two of the points here (I’ll be following up with clarifications on the third in a later post).

Data may be unreasonably effective, but effective at what?

In asking this, I was drawing attention to two things: the ability of large volumes of data (and not much else) to deliver interesting and useful results, and the inability of that data to tell us how humans produce and interpret it. One of the original motivations for AI was not simply to create machines that play chess better than people, but to actually understand how people’s minds work.

Despite all the ontology nay-sayers, a big chunk of our world is structured due to the well organized, systematic and predictable ways in which industry, society and even biology creates stuff.

Here, I want to draw attention to the skepticism around ontologies. Yes, they come at a cost, but they also offer a true grounding for interpretations of textual data. Let me give an example. The Lord of the Rings is a string used to refer to a book (in three parts), a sequence of films, various video games, board games, and so on. Because the phrase is ambiguous, a plurality of interpretations has to be available for it. This is a 1-many mapping. The 1 is a string, but what is the type of the many? I actually see the type of work described in the paper as being wholly complementary to categorical knowledge structures.
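To make the 1-many mapping concrete, here is a minimal sketch in Python. The entity identifiers and type names (Book, FilmSeries and so on) are hypothetical, invented for illustration; they are not taken from any real ontology. The point is only that the "many" side is a set of typed things, not more strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One typed interpretation of a surface string."""
    entity_id: str    # hypothetical identifier, for illustration only
    entity_type: str  # category from some (assumed) ontology


# Hypothetical 1-many table: one string on the left, many typed entities on the right.
INTERPRETATIONS = {
    "the lord of the rings": [
        Entity("lotr_book", "Book"),
        Entity("lotr_films", "FilmSeries"),
        Entity("lotr_video_games", "VideoGame"),
        Entity("lotr_board_games", "BoardGame"),
    ],
}


def interpret(surface, wanted_type=None):
    """Return every known interpretation of a string, optionally filtered by type."""
    candidates = INTERPRETATIONS.get(surface.strip().lower(), [])
    if wanted_type is None:
        return list(candidates)
    return [e for e in candidates if e.entity_type == wanted_type]
```

Calling interpret("The Lord of the Rings") returns all four candidates, while passing a type narrows the set. The string alone cannot do that narrowing; the categories can.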

February 05, 2009

Lexical Growth in the Blogosphere

Over the past few years, there has been a real burst in research and application development in areas that mine the textual content of the blogosphere: appraisal analysis, keyword mining, entity extraction, etc. It is remarkable, then, that there is little or no published work on the size of the lexicon.

Understanding the number of adjectives being used by a community, or associated with a topic, for example, is fundamental to understanding how opinion is expressed. Adjectives are an important sub-area that any opinion mining technology needs to master (along with all the other forms of opinion expression).

Understanding the growth of nouns in a community, and the appearance of new ones, is an important signal when tracking conversations and social trends at all levels.

A simple experiment to explore this space is to scan a collection of documents and graph the appearance of hitherto unseen terms. The graph below shows this for a sample of blog data. The x-axis shows the number of documents inspected, the y-axis shows the number of types of a certain part of speech (NN = nouns, JJ = adjectives, VB = verbs, RB = adverbs).

image

What is clear from this is the relative growth of lexical types: nouns grow faster than adjectives, while verbs and adverbs grow at a significantly slower rate. Drilling down into the data, I’ve noticed that, for adjectives, the top 1,000 types (that is to say, the 1,000 adjectives that occur most frequently) account for 82.85% of all observations; the top 100 for 50.15% and the top 1 (good) for 2.65%.
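For anyone who wants to try this on their own data, the experiment can be sketched roughly as follows. This is not the code behind the graph; it assumes the documents have already been tokenized and tagged elsewhere, that each document is a list of (word, tag) pairs, and that the tags have been collapsed to the four coarse labels used above. The function names are mine.

```python
from collections import Counter


def growth_curves(tagged_docs, tags=("NN", "JJ", "VB", "RB")):
    """Count distinct word types per tag after each document.

    tagged_docs: iterable of documents, each a list of (word, tag) pairs.
    Returns {tag: [types seen after doc 1, after doc 2, ...]}.
    """
    seen = {tag: set() for tag in tags}
    curves = {tag: [] for tag in tags}
    for doc in tagged_docs:
        for word, tag in doc:
            if tag in seen:
                seen[tag].add(word.lower())
        # One point per document on the x-axis, for every tag.
        for tag in tags:
            curves[tag].append(len(seen[tag]))
    return curves


def top_k_coverage(tokens, k):
    """Fraction of all observations covered by the k most frequent types."""
    counts = Counter(t.lower() for t in tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total
```

Plotting each list in the result of growth_curves against the document index gives the kind of curve described above, and top_k_coverage over all adjective tokens with k set to 1,000, 100 and 1 gives the kind of percentages quoted.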

February 03, 2009

Evri – Entity Extraction for All

Greg recently pointed me to evri.com. The site does a number of really interesting things. Firstly, it attempts to solve the named entity extraction problem in a broad way. Named entity recognition is often limited to person names, places and organizations. Evri doesn’t seem to have any limit to the types of things it discovers: music, bands, movies, books. Secondly, it looks for relationships between those entities. This is largely via collocation in a document. Thirdly, it attempts to disambiguate concepts with more than one possible type, thus Blue, which could be a film, a band or an album (not to mention a colour) is disambiguated. Finally, it gives access to the web via an interface which allows the user to both search and wander across the relationships between entities.

In named entity recognition, there are three key features which largely determine the nature of the task:

  1. Inherent types: a person name is generally recognizable as such without context (with some obvious exceptions – names like White, Black, etc.)
  2. Syntactic types: product names and addresses often have some syntactic pattern that gives internal coherence.
  3. Cultural types: the name of a book, film, video game, etc. is often simply some number of words from the language (The Lord of the Rings, Lips, Wanted, Today).
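A toy sketch of why the second and third types behave so differently. Everything in it is an assumption made for illustration: the address pattern is a deliberately narrow regular expression, and the title list is a four-entry stand-in for a real gazetteer. It says nothing about how evri works.

```python
import re

# Syntactic type: an assumed toy pattern for a street address.
# The internal structure (number, capitalized name, street word) does the work.
ADDRESS = re.compile(r"\b\d{1,5} [A-Z][a-z]+ (?:Street|Avenue|Road)\b")

# Cultural type: a hypothetical gazetteer of titles that are also ordinary words.
TITLES = {"today", "wanted", "lips", "it"}


def find_addresses(text):
    """Return substrings that match the toy address pattern."""
    return ADDRESS.findall(text)


def title_candidates(text):
    """Return every token that could be a title; without context most are noise."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [t for t in tokens if t.lower() in TITLES]
```

The pattern finds an address with no other help, whereas the gazetteer lookup flags every ordinary use of it and today as a possible title. Something beyond the string itself has to decide which hits are real.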

The third type, cultural entities, is the hardest to match, and this is exactly the type that evri appears to excel at. It does have some trouble with the harder cases (It, the book or film; Today, the US television show).

In evri, I see a glimpse of the future: a new way to craft the user's relationship not just to the documents of the web, but to the information on the web.

December 30, 2008

Processing Language Natural Statistical of Foundations

Oops.

image

July 02, 2008

acquire(_microsoft, _powerset)

Congratulations to everyone at Powerset! I’m psyched to be working once more with Barney.

While many may remark on aspects of this deal, I for one am quite amazed that industry watchers still completely misunderstand what Powerset, and companies in this space, actually do. Take this post by Todd Bishop:

Powerset's technology attempts to figure out the meaning and intent behind phrases typed into search engines, seeking to improve search results.

Powerset’s technology will still deliver results even if you put in a simple noun phrase. A huge part of the equation is what they do in the back end. Kevin Heisler at Search Engine Watch continues with the lack of depth:

[natural language search] means you can type questions in a search box the way you normally ask them. (Think Ask Jeeves 1.5)

May 12, 2008

Powerset Launches!

Powerset, which provides a new relationship with web data via innovative interfaces and natural language processing, launched this evening. Take a look at this video:

I'll write more later, but for now, check out other posts I've made on Powerset and NLP. I'll try to keep abreast of the commentary as it comes in. Meanwhile, I'm waiting for Fernando to pounce.

Update: OK, some comments. There are a couple of things that people are going to get hung up on. Firstly, writers seem to be referring to the technology as context or contextual search. Why not call it NLP? Not sure where that is coming from. Secondly (actually, this is more important), pundits are going to write about the Wikipedia-only issue. They're not getting it. 90% of search results come from a tiny fraction of web pages, due to the huge redundancy on the web and the differences between searcher needs and author/publisher intents. The task isn't to always search that huge set, but to get the answers to the user.

April 18, 2008

He Said, She Said

Matt Cutts points to a neat new feature in Google news search which extracts quotes by individuals and displays them at the top of the result set. You can click through to more quotes by the same person. Regardless of what you might think of the value of this, it does expose some key capabilities on the linguistic side.

  • Disambiguation: a search for Hillary Clinton produces quotes like the following, which would require the system to resolve 'Clinton' to 'Hillary Clinton' : "When it comes to finishing the fight, Rocky and I have a lot in common. I never quit," Clinton said recently.
  • Pronoun resolution: the same search produces quotes qualified by 'she': She said last week that she knows, "what it means to get knocked down, but I've never stayed down."
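As a rough illustration of the first capability only, here is a naive sketch that pulls out quotes of the form "..." Surname said and resolves the surname through an alias table. The pattern and the alias table are my assumptions for illustration and are certainly not how Google does it. The sketch does not attempt pronoun resolution, which needs the surrounding discourse rather than a single pattern.

```python
import re

# Matches: "quoted text" Surname said
# Assumes straight double quotes and a single capitalized surname.
QUOTE = re.compile(r'"([^"]+)"\s+([A-Z][a-z]+)\s+said')


def attribute_quotes(text, aliases):
    """Return (resolved speaker, quote) pairs.

    aliases: hypothetical table mapping a bare surname to a full name,
    e.g. built from the entity the user searched for.
    """
    results = []
    for quote, speaker in QUOTE.findall(text):
        resolved = aliases.get(speaker, speaker)
        results.append((resolved, quote.rstrip(", ")))
    return results
```

With an alias table such as {"Clinton": "Hillary Clinton"}, the first example above is attributed to the full name. The same table would also wrongly claim quotes from any other Clinton, which is one reason a precision-tuned product might restrict itself to a short list of people.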

I'm guessing that the product has been tuned highly for precision (that is, after all, what web search companies are all about). Thus, a search for just 'clinton' on the front end only presents results for 'Hillary Rodham Clinton', and a search for 'Bill Clinton' produces no quote results. My guess is that there is some general technology underneath this, but there is a strong editorial layer designed to ensure that all the results are of high quality at the expense of recall. This is not surprising and quite reasonable.

It'd be interesting to know who is on the list of people that get passed through. I see Gordon Brown, but not Tony Blair. No sign of the Dalai Lama saying anything quotable even though the top news search result has this very quotable passage:

"From the very beginning I have supported the Olympics," said the Dalai Lama. "We must support China's desires. Even after this sad situation in Tibet, today I support the Olympics." Still, he said he fully understands why people would express frustration and protest.

April 16, 2008

NLP and Search: The Topodia Story

I recently came across a new (stealth) company which is exploring an innovative approach to leveraging NLP in search. The innovation is not just in the matching technology, but also - importantly - in the form that interactions take with the system.

Topodia uses documents as queries, not short strings. It synthesizes a query from the query document and uses that as its starting point. When documents are retrieved for the query, Topodia does further analysis to rank documents by relevance.
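I don't know how Topodia synthesizes its queries, but the general idea of turning a document into a query can be sketched with something as simple as picking the most frequent content words. The stopword list, the length cutoff and the function name below are all assumptions for illustration; a real system would presumably use something richer than raw term frequency.

```python
import re
from collections import Counter

# A tiny, assumed stopword list; a real one would be much longer.
STOPWORDS = {
    "the", "a", "an", "of", "and", "to", "in", "is", "it",
    "that", "for", "on", "as", "with",
}


def synthesize_query(document, k=5):
    """Return the k most frequent content words of a document as query terms."""
    tokens = re.findall(r"[a-z]+", document.lower())
    counts = Counter(
        t for t in tokens if t not in STOPWORDS and len(t) > 2
    )
    return [term for term, _ in counts.most_common(k)]
```

The resulting term list would be sent to an ordinary search back end, with the heavier analysis saved for re-ranking the documents that come back.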

The fun doesn't stop there. Interactions with the system are done via a browser plugin, and the resulting collections of documents (which the user can modify) are then shared with the Topodia back end making the collection available to other searchers.

Have a look at their video (original) below and check out their site.

With NLP and semantic technologies getting more visibility, observers generally compare them with the existing paradigm. In evaluating these systems, I think it is healthy to look at cases where the front end and UX change as well as the back end analysis and matching components.

March 20, 2008

SemanticHacker

TechCrunch writes about SemanticHacker - a challenge put out by TextWise to see what the crowd can do with its NLP technology. On the front page they have a demo of their system, which creates 'semantic signatures' (essentially nodes from a broad hierarchical classification scheme) summarizing the content entered.

When dealing with the analysis of social media content - weblogs, usenet, etc. - one has to be very careful when transferring state of the art NLP and text mining solutions. There are a number of key reasons, two of which are: i) noisy text and ii) the relationship between document structure and the dialogue/conversation that is taking place between the author and the entire content space. This has a big impact on getting at what the document is 'about'. How do you treat quoted material? for example. [Not to mention my use of intersentential question marks...]

I took this opening paragraph, which is essentially about Microsoft and Microsoft Research:

It is almost exactly a year ago that I joined Microsoft. I was lucky enough with my timing that my first week here coincided with TechFest. TechFest is an expo put on by Microsoft Research to showcase new and ongoing innovation internally. What I remember most about that first week was how impressed I was at the diversity of work being carried out by MSR. While this event is an internal one, there is also a press day which takes some of these research projects and demonstrates them to the media. This year's press day was yesterday.

And TextWise produced this semantic signature:

.../Education/Colleges_and_Universities/Asia/Maharashtra 98
Recreation/Autos/Makes_and_Models/BMW 8
.../Software/Operating_Systems/Microsoft_Windows/Windows_XP 6
Computers/Hardware/Components 6
.../Microsoft_Windows/Windows_2000/FAQs_and_Tutorials 6

What you see here are categories and scores. Here is the explanation:

Semantic Signatures® are built from weighted concepts. This simplified display shows the concept on the left, with its respective weight on the right. The weights represent the significance of ALL topics in the block of text. For the purpose of this demo, we are only displaying the top 5 concepts. Also, the weights have been placed on a 1 through 100 scale, 100 being the highest significance possible.

They also have problems with more obvious ambiguities:

I cut down the bush.

Produces:

.../North_America/Presidents/Bush,_George_Walker/Opposing_Views 38
.../By_Region/North_America/Presidents/Bush,_George_Walker/Humor 31
.../By_Region/North_America/Presidents/Bush,_George_Walker 31
.../North_America/Presidents/Bush,_George_Walker/Opposing_Views 24
Shopping/Jewelry/Diamonds 22

Not a promising start. Note also that the $1MM prize is paid out as $100k initially with 'up to an additional $900k during the first year after the application is released.' So the winner may only see 10% of the prize.

I'm all for more visibility for NLP in the consumer space, and definitely into semantics and the transformation of object data (text) into a logical form, so I wish TextWise all the best. That being said, I personally believe that deploying large scale NLP applications in the consumer space requires a more incremental and controlled plan.

I suspect that a big piece that they are missing out on with the structure of their competition is getting the community to improve the lexical and ontological resources (e.g. to fix the ambiguity in the example above).

September 01, 2007

The Carolina Index

What's not to love? Powerset parsed the answer that Miss Teen South Carolina generated to the question: "Recent polls have shown that a fifth of Americans can't locate the US on a map. Why do you think this is?"

I personally believe that U.S. Americans are unable to do so because uh some uh people out there in our nation don't have maps and uh I believe that our ed- education like such as in South Africa and uh the- the Iraq everywhere like such as and I believe that they should uh our education over here in the U.S. should help the U.S. or- or- should help South Africa and should help the Iraq and the Asian countries so we will be able to build up our future

Southcarolina
