A colleague brought to my attention a post on the influential search blog Search Engine Land that makes claims about the quality of local data found on search engines and local verticals: Yellow Pages Sites Beat Google In Local Data Accuracy Test. The author describes surprise at the reported outcome - that Yellow Pages sites are better at local search than Google. Rather, we should express surprise at how poorly this article is written and at the intentionally misleading nature of the title.
The article describes an analysis done by Implied Intelligence. The analysis looks at 1,000 local businesses in the US. Here is the first problem - these businesses exclude chains and franchises. In addition, any business without a known website was also excluded. With some general assumptions about the definition of a local business, it is safe to assert firstly that there are many instances of chains and franchises out there and secondly that many (if not most) businesses don't have a website (the distribution varies by category, of course). Quite where the original sample of 1,000 came from is not reported.
This biases the analysis - Google, like Bing, is interested in all local entities.
The initial part of the analysis is reasonable - looking at coverage (% in the sample found on the site) and quality (duplicates, phone number errors and address errors). Note, however, that this is a measure of the local data, not of local search. A search product includes a relevance component and it is quite possible that a well-tuned relevance algorithm might suppress duplicates.
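For concreteness, here is a rough Python sketch of how measures like these might be computed. The article does not give exact definitions, so the ones below are my assumptions, as are the function name listing_metrics and the toy data: coverage is the share of sampled businesses with at least one listing, the duplicate rate is the share of found businesses listed more than once, and the phone error rate is the share of matched listings whose number differs from the reference.

```python
from collections import Counter


def listing_metrics(sample, listings):
    """Return (coverage, duplicate_rate, phone_error_rate).

    sample   -- dict mapping business id to its reference phone number
    listings -- list of (business id, phone number) pairs found on a site
    The metric definitions are assumptions, not the article's.
    """
    # Only listings that match a business in the sample are considered.
    matched = [(b, phone) for b, phone in listings if b in sample]
    counts = Counter(b for b, _ in matched)

    found = len(counts)
    coverage = found / len(sample) if sample else 0.0

    duplicated = sum(1 for n in counts.values() if n > 1)
    duplicate_rate = duplicated / found if found else 0.0

    phone_errors = sum(1 for b, phone in matched if phone != sample[b])
    phone_error_rate = phone_errors / len(matched) if matched else 0.0

    return coverage, duplicate_rate, phone_error_rate


# Made-up sample of four businesses with reference phone numbers.
sample = {
    "cafe": "555-0100",
    "barber": "555-0101",
    "florist": "555-0102",
    "garage": "555-0103",
}

# Made-up listings: the cafe appears twice, the barber has a wrong
# number, and the garage is missing altogether.
listings = [
    ("cafe", "555-0100"),
    ("cafe", "555-0100"),
    ("barber", "555-0199"),
    ("florist", "555-0102"),
]

coverage, duplicate_rate, phone_error_rate = listing_metrics(sample, listings)
```

Note that all three numbers describe the underlying data only. Nothing here says which listing a search for the cafe would actually return, which is the job of the relevance component.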
The last table in the analysis sees us swinging back to bad reporting. It describes the percentage of records that have a certain attribute: URL, Hours of Operation and 'additional info'. Did you see what they did there? This is what we call the coverage of an attribute, and it tells us nothing about the quality of the value. I can quite easily populate a local database with 100% coverage for all attributes. They might all be wrong, but the coverage would still be 100%. Consequently, this table is reasonably close to meaningless. If they had included the precision of these values, then coverage could have been used to compute recall, but that wasn't done.
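To make the distinction concrete, here is a minimal Python sketch. The records, the reference values and the function name attribute_metrics are all made up for illustration; none of it comes from the Implied Intelligence analysis. I take recall to mean the fraction of all records that carry a correct value, which works out to coverage multiplied by precision.

```python
def attribute_metrics(records, truth, attribute):
    """Return (coverage, precision, recall) for one attribute.

    coverage  = records with any value / all records
    precision = records with a correct value / records with any value
    recall    = records with a correct value / all records
    """
    total = len(records)
    populated = 0
    correct = 0
    for business_id, record in records.items():
        value = record.get(attribute)
        if value is None:
            continue  # attribute not populated for this record
        populated += 1
        if value == truth[business_id][attribute]:
            correct += 1
    coverage = populated / total if total else 0.0
    precision = correct / populated if populated else 0.0
    recall = correct / total if total else 0.0
    return coverage, precision, recall


# Hypothetical reference values for four businesses.
truth = {
    1: {"url": "a.example"},
    2: {"url": "b.example"},
    3: {"url": "c.example"},
    4: {"url": "d.example"},
}

# Site A fills in every URL, but only one of them is right.
site_a = {
    1: {"url": "a.example"},
    2: {"url": "wrong.example"},
    3: {"url": "wrong.example"},
    4: {"url": "wrong.example"},
}

# Site B fills in only half the URLs, and both are right.
site_b = {
    1: {"url": "a.example"},
    2: {"url": "b.example"},
    3: {},
    4: {},
}

metrics_a = attribute_metrics(site_a, truth, "url")  # (1.0, 0.25, 0.25)
metrics_b = attribute_metrics(site_b, truth, "url")  # (0.5, 1.0, 0.5)
```

On coverage alone, Site A looks twice as good as Site B. Once precision is taken into account, Site B delivers twice as many correct URLs.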
In summary, an important search publication has either written an intentionally misleading article, or has demonstrated that it doesn't really get data.