November 10, 2005


Glenn Fannick

I'm very interested in geocoding as well. The challenge seems to be accurately mining the appropriate region from the document in question. Are we talking about the region from which the document was published, or the region the document is about? It's not clear which is easier, and not every document contains the clues needed. Further, many documents mention regions that aren't at the heart of the issue. Ideas?

Matthew Hurst

Glenn - firstly, it's no coincidence that many of these demo/mashup systems work with news articles. Location is very much a part of the information in a news article, so articles with no location information are either localised (in which case a simple lookup on the source of publication is sufficient) or contain information with no geographic importance.

I have a lot of faith in using large databases of location names with appropriate methods for dealing with ambiguity. For example, 'Casablanca' with no modifier is far more likely to be the African city than its US counterpart. One also has to be careful with transliterations of foreign names (e.g. 'The Hat' is a location in Asia, I believe). Other aspects of news that help include the huge amount of syndication and redundancy in the aggregate data stream, the ability to tag location and topic/theme for news feeds (e.g. Edinburgh/Culture), the common use of by-lines that include location, and knowledge of the location of reporters.
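
To make the ambiguity point concrete, here is a minimal sketch in Python of a name lookup that falls back on a prior when no modifier is present. Everything in it is an assumption made up for illustration: the tiny gazetteer, the prior weights, the hint words and the `resolve` function itself. A real system would draw on a far larger database and much richer context.

```python
import re
from typing import Optional

# Toy gazetteer: lower-cased name -> list of (place, prior, hint words).
# Entries, priors and hints are illustrative assumptions, not real data.
GAZETTEER = {
    "casablanca": [
        ("Casablanca, Morocco", 0.95, {"morocco", "africa"}),
        ("Casablanca, US", 0.05, {"usa", "america"}),
    ],
    "edinburgh": [
        ("Edinburgh, Scotland", 1.0, {"scotland"}),
    ],
}


def resolve(name: str, context: str = "") -> Optional[str]:
    """Return the most plausible place for a name, or None if unknown.

    If a hint word for a candidate appears in the context, that candidate
    wins; otherwise the candidate with the highest prior is returned.
    """
    candidates = GAZETTEER.get(name.strip().lower())
    if not candidates:
        return None
    # Compare whole words only, so short hints cannot match inside other words.
    tokens = set(re.findall(r"[a-z]+", context.lower()))
    for place, _prior, hints in candidates:
        if hints & tokens:
            return place
    return max(candidates, key=lambda c: c[1])[0]
```

With no modifier, `resolve("Casablanca")` falls back to the highest prior, while `resolve("Casablanca", "a small town in America")` lets the surrounding text override it. The same shape would extend to other evidence such as by-lines or the feed's location tag.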

This comment is getting too long, so I'll put a post together on the topic soon.

