Infogroup, one of the leading providers of business listings, has an interesting post on its site about the problem of errors in local data. The post focuses on one error in particular, business closure, and on the frustration consumers experience when they look up a business and travel to the location, only to find that it has closed.

A report released today by Infogroup, the leading provider of high-value data and multichannel solutions, finds that 52 percent of consumers using local search services have visited a closed business and 44 percent have had a social outing ruined by outdated business listing information.

You can read the full article here.

Now, upon reading this, you might reflect on your own experience and grumble in recognition of this problem. However, a failure rate reported this way can be misleading, because the chance of hitting at least one error accumulates with every lookup.

Let's imagine we have some event that has a 0.99 probability of success. This means that if we attempt it once, there is a 0.01 chance that we will experience failure. If we attempt it twice, and the attempts are independent, the probability of at least one failure is 0.0199. This is computed by calculating the probability of two successes and then subtracting that from 1 (i.e. 1 - 0.99^2).
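The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name `failure_probability` is made up for illustration, and it assumes every attempt is independent and has the same success probability.

```python
def failure_probability(p_success: float, attempts: int) -> float:
    """Probability of at least one failure in `attempts` tries.

    Assumes each try is independent with the same success probability.
    """
    # P(all succeed) = p_success ** attempts, so P(at least one failure)
    # is the complement of that.
    return 1.0 - p_success ** attempts


if __name__ == "__main__":
    print(round(failure_probability(0.99, 1), 4))  # 0.01
    print(round(failure_probability(0.99, 2), 4))  # 0.0199
```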

If we interpret the survey data from Infogroup as meaning consumers have a 52% chance of experiencing an error (on the closure of a business) then we can ask - for a given quality of data, how many unique businesses would a user have to encounter in a search engine before the probability of seeing at least one error reached 52%?

For example, if our data's accuracy for being open was 0.99, we find that 1 - 0.99^73 is approximately 0.52. In other words, a user would have to see only 73 distinct businesses before the probability of having seen at least one error reaches 52%, the figure reported in the Infogroup article.
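The same question can be turned around and solved directly for the number of listings: setting 1 - a^n = t and rearranging gives n = ln(1 - t) / ln(a). The sketch below does that for a few accuracy levels. The function name `listings_until_error` and the accuracy values other than 0.99 are chosen purely for illustration, and listings are again assumed to be independent.

```python
import math


def listings_until_error(accuracy: float, target: float) -> float:
    """Number of listings n at which 1 - accuracy**n equals `target`.

    Returns a real number; round up if a whole number of listings is needed.
    Assumes each listing is independently correct with probability `accuracy`.
    """
    if not (0.0 < accuracy < 1.0 and 0.0 < target < 1.0):
        raise ValueError("accuracy and target must both lie strictly between 0 and 1")
    return math.log(1.0 - target) / math.log(accuracy)


if __name__ == "__main__":
    # Illustrative accuracy levels; only 0.99 comes from the example above.
    for accuracy in (0.90, 0.95, 0.99, 0.999):
        n = listings_until_error(accuracy, 0.52)
        print(f"accuracy {accuracy}: about {n:.1f} listings")
```

For an accuracy of 0.99 this gives a value just over 73, in line with the figure above. Each additional "nine" of accuracy multiplies the number of listings a user can view before reaching the same 52% chance by roughly ten.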

As data is never perfect, we can then ask - for any corpus of data - how good is it? Determining that a corpus of local listings has a precision of 0.99 for some attribute (e.g. being open rather than closed) is actually very difficult. Firstly, there is the size of the sample required to get reasonable error bounds at 95% confidence; secondly, there is the error in labeling (which at this degree of precision is a very tricky issue, since even a small labeling error rate is comparable to the 1% being measured).
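To give a rough sense of the sample sizes involved, the sketch below uses the textbook normal-approximation formula for a proportion, n = z^2 * p * (1 - p) / e^2, where e is the desired margin of error. This is an assumption on my part rather than a method described in the Infogroup article: the approximation is known to be crude when p is close to 1, and it treats every label as correct, so it ignores the labeling-error problem entirely. The function name `sample_size` and the margins tried are illustrative.

```python
import math
from statistics import NormalDist


def sample_size(precision: float, margin: float, confidence: float = 0.95) -> int:
    """Listings to sample so the estimate of `precision` is within +/- `margin`.

    Uses the normal approximation to the binomial; assumes a simple random
    sample and perfectly accurate labels (both simplifying assumptions).
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil(z ** 2 * precision * (1.0 - precision) / margin ** 2)


if __name__ == "__main__":
    # Illustrative margins of error around an assumed true precision of 0.99.
    for margin in (0.01, 0.005, 0.001):
        print(f"+/- {margin}: sample at least {sample_size(0.99, margin)} listings")
```

Under these assumptions, confirming a precision of 0.99 to within half a percentage point takes on the order of 1,500 hand-checked listings, and tightening the margin by a factor of five multiplies that by twenty-five.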

All told, while this is an interesting article, it is important to step back and look at the big picture both in terms of interpreting the results and in terms of understanding not just the challenges in getting accurate data, but even the problem of determining how accurate that data is.

The most important thing in data engineering (the job of building systems that aggregate data and improve it in some regard) is building a system that can respond to change and apply updates and improvements in a fluid manner. When evaluating a data provider, it is important to ask for details on the quality of their data (surprisingly, many of them won't be able to tell you), but it is equally important to learn about the processes they have in place to update and correct data with as low a latency as possible.