In my previous post, Google Book Search and Geographic Entity Extraction, I fell, regrettably, into a blogger hackerism. Having been aware of the post that I cited via my feed reading, and having been reminded of it by a friend via email, I thought I should post something about it. What I should have been doing was looking at the feature, discussing the potential technology behind it and giving some thought to how well the feature works. Hopefully, this post will remedy these oversights.
Firstly, a little about entity extraction. The simplest type of entity extraction ought really to be termed entity recognition. This is the task of identifying spans of text that the author has used to refer to a real-world entity of a particular type. For example, in the following sentence from Around The World In 80 Days, we can easily identify the span denoting a location:
Mr. Phileas Fogg lived, in 1872, at No. 7, Saville Row, Burlington Gardens, the house in which Sheridan died in 1814.
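To make the recognition step concrete, here is a minimal sketch in Python. It assumes a hand-written pattern for British-style street addresses; the pattern, the list of street types and the function name find_location_spans are my own inventions for illustration, and say nothing about how Google Book Search actually finds its locations.

```python
import re

# Hypothetical list of street-type words; a real system would need far more.
STREET_TYPES = r"(?:Row|Street|Road|Lane|Square|Gardens)"
# One address part: one or more capitalised words followed by a street type.
PART = r"(?:[A-Z][a-z]+\s+)+" + STREET_TYPES

# "No. <number>, <part>" optionally followed by further ", <part>" pieces.
ADDRESS = re.compile(r"No\.\s*\d+,\s+" + PART + r"(?:,\s+" + PART + r")*")


def find_location_spans(text):
    """Return (start, end, matched_text) for each address-like span."""
    return [(m.start(), m.end(), m.group()) for m in ADDRESS.finditer(text)]


sentence = (
    "Mr. Phileas Fogg lived, in 1872, at No. 7, Saville Row, "
    "Burlington Gardens, the house in which Sheridan died in 1814."
)

for start, end, span in find_location_spans(sentence):
    print(start, end, span)
```

On the sentence above this picks out the span "No. 7, Saville Row, Burlington Gardens" and nothing else. It would miss any location that does not look like a numbered street address, which is exactly why real recognizers tend to rely on more than a single pattern.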
The next part of the problem is to provide a logical reduction of this span of text, that is, to resolve it to a specific place. Clearly, even with the above - the first location in this particular novel - this is challenging, for we don't know which town, city or country the address is in. This is where all the clever bits of work come in. The next paragraph, for example, starts:
Certainly an Englishman, it was more doubtful whether Phileas Fogg was a Londoner.
We might, then, guess that the address was in London. This example introduces another problem: should one capture the term 'Londoner', recognize the location it implies and ultimately, as in the application being discussed, locate it on the map?
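The guess above can be sketched in code. This is a toy illustration that assumes a tiny, hand-made table mapping demonyms to places (DEMONYMS, context_places and qualify are hypothetical names of my own). It simply collects the places hinted at by nearby text and attaches them to an unresolved span; it makes no attempt to judge how strong the hint is.

```python
import re

# Hypothetical demonym table; illustrative only, not a real gazetteer.
DEMONYMS = {"Londoner": "London", "Englishman": "England"}


def context_places(text):
    """Return places implied by demonyms in the text, in order of appearance."""
    found = []
    for word in re.findall(r"[A-Za-z]+", text):
        place = DEMONYMS.get(word)
        if place and place not in found:
            found.append(place)
    return found


def qualify(span, context):
    """Attach any places hinted at by the context to an unresolved span."""
    hints = context_places(context)
    if not hints:
        return span
    return span + " [" + ", ".join(hints) + "]"


address = "No. 7, Saville Row, Burlington Gardens"
next_paragraph = (
    "Certainly an Englishman, it was more doubtful whether "
    "Phileas Fogg was a Londoner."
)
print(qualify(address, next_paragraph))
```

Note that the sentence itself expresses doubt about Fogg being a Londoner, so a hint gathered this way is only a candidate. Deciding whether 'Londoner' should count as a mention of London at all is precisely the open question.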
The Google Book Search system lists 10 locations in its analysis of Jules Verne's classic. The map, shown below, shows many more.
However, it doesn't show No. 7, Saville Row, Burlington Gardens [London].
So then the question is: what are the precision and recall? Without really knowing the answer to this, we must question the general impression given by the visualization of the collection of locations discovered in a book. I'm not familiar with the book, but there appears to be a bit of a hop between Cairo and India - a recall issue perhaps?
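For reference, precision is the fraction of extracted locations that are correct, and recall is the fraction of the locations really mentioned in the book that were extracted. A minimal sketch of the calculation follows, assuming we had a hand-annotated list of the true locations. The two sets below are made up purely to exercise the function; they are not measurements of the Google Book Search output.

```python
def precision_recall(predicted, gold):
    """Compute (precision, recall) for two sets of location names."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid dividing by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical data: what an annotator marked versus what a system returned.
gold = {"London", "Suez", "Bombay", "Calcutta"}
predicted = {"London", "Suez", "Cairo"}

p, r = precision_recall(predicted, gold)
print("precision = %.2f, recall = %.2f" % (p, r))
```

With these invented sets, two of the three extracted locations are correct and two of the four true locations are found. A gap on the map like the one between Cairo and India would show up as low recall, whereas a pin in the wrong place would hurt precision.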