[this post is a work in progress]
If I take a look at Google's local search results for a moderately populated road in my neighborhood, I can see a considerable number of errors.
Varsity (E) is positioned in the wrong block; Bernu's (F) is in the wrong location and it is actually closed; Himalayan Kitchen (I - not shown) is long gone and has been replaced by another unlisted restaurant, etc.
Similarly, the results on Bing have problems:
Some of the results are an improvement over Google: Varsity (8) is in the right location. On the other hand, Hot Dish (4) is in the right location, but it has since closed and the location was then occupied by Himalayan Kitchen, ...
The local search problem has two key components: data curation (creating and maintaining a set of high quality statements about what the world looks like) and relevance (returning those statements in a manner that satisfies a user need). The first part of the problem is a key enabler of success, but how hard is it?
There are many problems which involve bringing together various data sources (which might be automatically or manually created) and synthesizing an improved set of statements intended to denote something about the real world. The way in which we judge the results of such a process is to take the final database, sample it, and test it against what the world looks like.
In the local search space, this might mean testing to see if the phone number in a local listing is indeed that associated with a business of the given name and at the given location.
But how do we quantify this challenge? We might perform the above evaluation and find that 98% of the phone numbers are correctly associated. Is that good? Expected? Poor?
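To make the question concrete, here is a minimal sketch (in Python, with made-up sample numbers) of how one might turn a verified sample into an accuracy estimate with a confidence interval, so that a figure like 98% at least comes with an error bar:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (centre - margin, centre + margin)

# Suppose we sampled 500 listings and verified 490 phone numbers as correct.
verified_correct, sample_size = 490, 500
accuracy = verified_correct / sample_size
low, high = wilson_interval(verified_correct, sample_size)
print(f"accuracy = {accuracy:.1%}, 95% CI = [{low:.1%}, {high:.1%}]")
```

Whether 98% is good still depends on the application, but the interval at least tells us whether the sample was large enough to support the claim in the first place.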
There are many factors at play. Some have to do with the nature of the real world entities in question:
- The number of real world entities that we are attempting to describe (certainly, if our goal were to describe a single restaurant it would be easier than describing, say, 1 million).
- The complexity of the real world entities that we are attempting to describe (for example, having to include just the phone number would be easier than phone, address, website, chef, year of incorporation, etc.).
- The legal nature of the entity (entities that are legally defined are probably easier to describe than those that are not, due to stricter requirements governing their appearance and disappearance).
- The dynamic nature of the entities (for example, businesses appear and disappear rapidly, making it hard to maintain a 'fresh' database).
Others have to do with the manner in which the entity interacts with people and other real world entities. These interactions often result in data sets being generated as a by-product. The quality of these data sets is determined by their original application, which may not have the same requirements as their opportunistic application in the local search space.
- How does the entity interact with people?
- How does the entity interact with other entities (for example, other businesses or government organizations)?
- The privacy associated with the entity (people in many countries have some legal documentation, but that documentation is not necessarily in the public domain).
Finally, there are issues relating to computation over the available explicit and implicit data sets that denote, or are otherwise related to or influenced by, the real world entity.
- How can we infer or derive required properties (for example, how hard is it to determine the specific geographic location of something given an address?)
- How does variance in descriptive titles (think: business names) relate to algorithmically inferring that these titles are intended to denote the same real world entity (e.g. Fred's Diner, Fred's Cafe, Fred's Restaurant)? A small matching sketch follows this list.
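As a toy illustration of the title-variance problem, here is a sketch (my own, not any production matcher) that scores how likely two names are to denote the same business by comparing their distinctive tokens after stripping common category words; a real resolver would combine this signal with location, phone number, and so on:

```python
import re

# Category words that vary freely across listings for the same business
# (an assumption for illustration; a real list would be learned or curated).
GENERIC = {"diner", "cafe", "restaurant", "grill", "bar", "kitchen", "the"}

def tokens(name: str) -> set[str]:
    """Lowercased tokens of a business name with generic category words removed."""
    return {t for t in re.findall(r"[a-z0-9']+", name.lower()) if t not in GENERIC}

def name_similarity(a: str, b: str) -> float:
    """Jaccard overlap of the distinctive tokens in two business names."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

pairs = [("Fred's Diner", "Fred's Cafe"),
         ("Fred's Diner", "Fred's Restaurant"),
         ("Fred's Diner", "Himalayan Kitchen")]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {name_similarity(a, b):.2f}")
```

The stripped "category" vocabulary is itself an assumption; in practice it would be learned from the data rather than hard-coded.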
Any data set we have access to can be characterized in terms of the following (a rough sketch of how some of these might be measured appears after the list):
- How much of the model it covers (e.g. name, phone and address but not web site)
- How many of the entities it covers (e.g. all restaurants, all businesses, just Seattle restaurants, etc.)
- How much noise is in the data (at the time the statements were made, how accurate were they?)
- The latency in the data (or, how much of the world has changed since these statements were made)
- Is the information direct (statements of the exact form of the desired ultimate data set) or indirect (statements that can be used in whole or in part to derive the desired statements, in which case the accuracy of an algorithm for doing so must be determined)?
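One crude way to operationalise a few of these characteristics (a sketch under my own assumptions about record structure, not any standard schema) is to compute, for a single source, its model coverage, its entity coverage against an estimated universe of real world entities, and the fraction of its records that are likely stale:

```python
from datetime import date

# The target model for a listing (an assumption for this sketch).
MODEL_FIELDS = {"name", "phone", "address", "website"}

def characterize(records: list[dict], universe_size: int,
                 today: date, stale_after_days: int = 365) -> dict:
    """Rough coverage and latency summary for one source of listing records."""
    if not records:
        return {"model_coverage": 0.0, "entity_coverage": 0.0, "stale_fraction": 0.0}
    # Model coverage: average fraction of the target fields each record fills.
    filled = sum(len(MODEL_FIELDS & {k for k, v in r.items() if v}) for r in records)
    model_coverage = filled / (len(records) * len(MODEL_FIELDS))
    # Entity coverage: fraction of the (estimated) real world entities present.
    entity_coverage = len(records) / universe_size
    # Latency proxy: fraction of records last verified too long ago.
    stale = sum((today - r["last_verified"]).days > stale_after_days for r in records)
    return {"model_coverage": model_coverage,
            "entity_coverage": entity_coverage,
            "stale_fraction": stale / len(records)}

# Entirely made-up records for illustration.
records = [
    {"name": "Fred's Diner", "phone": "555-0100", "address": "12 Main St",
     "website": None, "last_verified": date(2009, 6, 1)},
    {"name": "Fred's Cafe", "phone": None, "address": "12 Main St",
     "website": None, "last_verified": date(2010, 8, 15)},
]
print(characterize(records, universe_size=40, today=date(2010, 12, 1)))
```

Noise and directness are harder to score mechanically; they generally require the kind of sampled verification against the world described above.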
Finally, across all data sets we can determine:
- Correlation or other dependencies. If we discover two sets of statements about the world but find that they are from the same source (or were originally and have since diverged), then their combined value is less than if we find independent sets of statements that can be used for mutual corroboration, as sketched below.
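A minimal way to probe for this (again just a sketch, assuming the sources can be keyed by some shared entity id) is to measure pairwise agreement between sources on a field: very high agreement over a large overlap is as likely to indicate shared provenance as it is genuine independent corroboration.

```python
def agreement(source_a: dict, source_b: dict, field: str = "phone") -> float:
    """Fraction of commonly listed entities on which two sources agree for one field."""
    common = source_a.keys() & source_b.keys()
    if not common:
        return 0.0
    same = sum(source_a[k].get(field) == source_b[k].get(field) for k in common)
    return same / len(common)

# Hypothetical sources keyed by an (assumed) shared entity id.
yellow_pages = {1: {"phone": "555-0100"}, 2: {"phone": "555-0111"}, 3: {"phone": "555-0122"}}
web_crawl    = {1: {"phone": "555-0100"}, 2: {"phone": "555-0199"}, 3: {"phone": "555-0122"}}

print(f"agreement on phone: {agreement(yellow_pages, web_crawl):.0%}")
# Near-total agreement across a large overlap may be a sign of a shared upstream
# source rather than two genuinely independent observations of the world.
```

Telling the two cases apart properly requires provenance information, which an agreement score alone cannot supply.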