In my previous post, Google Book Search and Geographic Entity Extraction, I fell, regrettably, into a blogger hackerism. Having seen the post that I cited in my feed reading, and having been reminded of it by a friend via email, I thought I should post something about it. What I should have been doing was looking at the feature, discussing the potential technology behind it and giving some thought to how well the feature works. Hopefully, this post will remedy these oversights.
Firstly, a little about entity extraction. The simplest type of entity extraction ought really to be termed entity recognition. This is the task of identifying spans of text that the author has used to refer to a real-world entity of a particular type. For example, in the following, from Around The World In 80 Days, we can easily identify the span denoting a location.
Mr. Phileas Fogg lived, in 1872, at No. 7, Saville Row, Burlington Gardens, the house in which Sheridan died in 1814.
The next part of the problem is to provide a logical reduction of this span of text, that is, to resolve it to a specific place. Clearly, even with the above - the first location in this particular novel - this is challenging, for we don't know which town, city or country this address is in. This is where all the clever bits of work come in. The next paragraph, for example, starts:
Certainly an Englishman, it was more doubtful whether Phileas Fogg was a Londoner.
We might, then, guess that the address was in London. This example introduces another problem: does one capture the term 'Londoner', recognize the location it implies and ultimately, as in the application being discussed, locate it on the map?
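To make the two steps concrete, here is a minimal sketch of recognition followed by a crude reduction, written in Python. It assumes a tiny hand-made gazetteer and demonym table (both hypothetical, invented for this illustration) and says nothing about how Google Book Search actually works.

```python
import re

# Hypothetical toy gazetteer: surface form -> canonical place name.
GAZETTEER = {
    "Saville Row": "Saville Row",
    "Burlington Gardens": "Burlington Gardens",
    "London": "London",
}

# Hypothetical demonym table: a word for an inhabitant -> the place it implies.
DEMONYMS = {
    "Londoner": "London",
    "Englishman": "England",
}


def find_location_spans(text):
    """Return (start, end, surface, canonical) tuples, ordered by position."""
    spans = []
    for table in (GAZETTEER, DEMONYMS):
        for surface, canonical in table.items():
            # Word boundaries stop "London" from matching inside "Londoner".
            pattern = r"\b" + re.escape(surface) + r"\b"
            for match in re.finditer(pattern, text):
                spans.append((match.start(), match.end(), surface, canonical))
    return sorted(spans)


passage = (
    "Mr. Phileas Fogg lived, in 1872, at No. 7, Saville Row, "
    "Burlington Gardens, the house in which Sheridan died in 1814. "
    "Certainly an Englishman, it was more doubtful whether "
    "Phileas Fogg was a Londoner."
)

for start, end, surface, canonical in find_location_spans(passage):
    print(f"{start:3d}-{end:3d}  {surface!r} -> {canonical}")
```

Even this toy version shows where the hard part lies: the lookup finds 'Londoner' and maps it to London, but nothing here connects that hint back to Saville Row, and the text itself says it is doubtful that Fogg is a Londoner at all.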
The Google Book Search system lists 10 locations in its analysis of Jules Verne's classic. The map, shown below, shows many more.
However, it doesn't show No. 7, Saville Row, Burlington Gardens [London].
So then the question is: what are the precision and recall? Without really knowing the answer to this, we must question the general impression given by the visualization of the collection of locations discovered in a book. I'm not familiar with the book, but there appears to be a bit of a hop between Cairo and India - a recall issue perhaps?
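For reference, precision is the fraction of the locations the system reports that are actually correct, and recall is the fraction of the locations really in the book that the system finds. A minimal sketch of the calculation in Python follows; the two lists are made up purely for illustration and are not measurements of Book Search.

```python
def precision_recall(predicted, gold):
    """Compute (precision, recall) for predicted locations against a gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid dividing by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative data only: a hand-labelled "gold" list versus a system's output.
gold_locations = ["London", "Suez", "Bombay", "Calcutta"]
system_locations = ["London", "Bombay", "Sheridan"]  # "Sheridan" is a person, not a place

precision, recall = precision_recall(system_locations, gold_locations)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this made-up case the spurious 'Sheridan' lowers precision, while the missing Suez and Calcutta lower recall, which is the kind of gap the hop on the map hints at.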




I did a little poking around with the map app in Book Search based on your earlier post. Being an O'Brian fan, I did a search on the Aubrey/Maturin books. As it turns out, there's a location in Venezuela named "Maturin." The Google search also finds Berkeley, California, in a passage about "a young attaché called Berkeley..."
Don't get me wrong, I see a real use for this thing. We just have to remember its limitations.
Posted by: MIke | January 28, 2007 at 08:17 PM
This reminds me of another project, GutenKarte, which maps place names found in free books using open source tools and MetaCarta's API: http://gutenkarte.org/
They have a tool here which will attempt to do the same for any web page: http://labs.metacarta.com/PageMapper/
Before Flickr had its own geotags, my colleagues built Mappr to plot photos with placenames in tags onto a map of the USA: http://www.mappr.com/
All these projects suffer from the same problems identified in the post and first comment. Mappr was interesting though because it didn't require a "place" to be blessed by the big geocoding databases - tourist trails like Route 66, or events like Burning Man, could emerge as "places" in their own right.
I wonder if the book search tools will develop in this direction, and also how they will deal with historical locations that no longer exist, or that change name. An exciting challenge!
Posted by: Tom Carden | January 28, 2007 at 08:39 PM
Speaking as someone currently working on a geocoder, it's difficult enough to reliably parse addresses when they're already identified as such, and even when you can assume some vague sort of consistent format will be present. How does one determine that "4th Ave Bypass" has two suffixes ('ave' and 'bypass'), while "Lyttelton Close Road" has one? Worse, "Lyttelton Close" could refer to the street called "Lyttelton" with the suffix "close", or it could refer to the street "Lyttelton Close" with the suffix omitted. Address parsing is littered with problems and ambiguities like this, and it only gets worse if you want to recognise addresses in free text.
Posted by: Nick Johnson | January 28, 2007 at 09:09 PM