The web search industry is making great progress in moving from tools for finding pages and sites to tools that leverage and surface facts and knowledge. The local search space, where I work in Bing, is founded on structured knowledge: the entity data that represents businesses and other things that necessarily have a location. That data is a core piece of the knowledge space required for this future.
Over the past few years, my team has been working on mining the web for information about local entities. This data now helps to power a significant percentage of local search interactions in a number of countries around the world.
While working on this system, we have thought deeply not only about how to build systems for web mining, but also about how to construct efficient developer workflows and how to add data management components that take advantage of human input when appropriate.
These processes constitute what I term Agile Web Mining, the core principles of which are: optimize for developer productivity, optimize for data management, and invest in low-latency systems. Much of what we currently hear about in the industry revolves around very large data sets (big data), which often entail long processing times and high-latency interactions. In contrast, we think of our data differently: each data set is relatively small (on the order of the size of a web site), but there are many such small data sets.
We are currently growing our team, so if you are interested in learning more about Agile Web Mining, please get in touch with me.