I work in local search at Microsoft, which means that, like everyone working in this space, I have to deal with an identity crisis on a daily basis. Currently, most local search products, like Bing's and Google's, leverage multiple data sets to derive a digital model of the world that users can then interact with. In creating this digital model, multiple statements have to be conflated to form a unified representation. This can be extremely challenging for two reasons. Firstly, the system has to decide when two records are intended to denote the same real-world entity. Secondly, the designers of the system have to determine what real-world entities are and how to describe them.
For example, if a business moves, is that the same business, or the closure of one and the opening of another? What does it mean to categorize a business? The cafe in Barnes and Noble is branded Starbucks but isn't actually part of the Starbucks chain. Should it surface as a separate entity, or is it 'hidden' within the bookshop as an attribute ('has cafe')?
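To make the first of those two problems concrete, here is a minimal sketch of deciding whether two records denote the same entity. It is purely illustrative and not how Bing or any other production system does it: the record fields (`name`, `address`), the `same_entity` helper, and the 0.8 threshold are all assumptions made up for the example, and the string comparison uses Python's standard `difflib`.

```python
from difflib import SequenceMatcher

def normalize(text):
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())

def similarity(a, b):
    """Return a ratio in [0, 1] describing how alike two strings are."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def same_entity(rec_a, rec_b, threshold=0.8):
    """Hypothetical rule: both name and address must be similar enough."""
    return (similarity(rec_a["name"], rec_b["name"]) >= threshold
            and similarity(rec_a["address"], rec_b["address"]) >= threshold)

# Made-up records from two imaginary data sets.
feed_a = {"name": "Starbucks Coffee", "address": "100 Main St"}
feed_b = {"name": "starbucks  coffee", "address": "100 MAIN ST"}
bookshop = {"name": "Barnes and Noble", "address": "100 Main St"}

print(same_entity(feed_a, feed_b))    # same name after normalization
print(same_entity(feed_a, bookshop))  # same address, different name
```

Note that a rule like this only addresses the matching question. It says nothing about the second, harder question of whether the cafe at that address ought to be its own entity in the first place.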
Thinking through these hard representational problems is as much part of the transformative trends going on in the tech industry as are those characterized by terms like 'big data' and 'data scientist'.
Another example can be found in my recent interaction with Spotify, where I have the option of the following albums from Rush:
The only distinction I can see here is that, upon drilling down, the tracks appear to have been published by different sources. They are, as far as I can tell, otherwise identical. There is a slight variation in the cover art that, if you squint a little, you may be able to perceive.
As the online world continues to move towards knowledge, the Ph in PhD will become more and more useful.


hi Thomas,
I just discovered your Twitter feed, your blog and RapidMiner. I'm new to the world of neural networks. I do enjoy data mining and data visualization quite a lot and hope to build a career out of it. I hope to go through your tutorials on RapidMiner over the next week but wanted your advice on how I can actually gather data from sites like Twitter and Facebook. I imagine you have to hire a developer who knows the Twitter and Facebook APIs to collect everyone's tweets and status updates to have your data sets, which you would put into a tool like RapidMiner for analysis? I'd love to learn from you regarding how I should go about the data acquisition part. Should I hire a developer, and where is the best place to find someone that I can hire? I love the trending analytics that Yahoo and MSN put on their websites and wanted to also learn how to do that... like what's hot among my friends on Facebook and what's not, or counting the number of times the word wedding appears in a status update.
If you have any advice on how I can get started on the entire project, from data acquisition and data packaging to data analysis and visualization, I would love for you to mentor me. Or if you already have a post in your blog archives on how to get started, I would be all ears!
Kindest regards,
Neil Patel
Posted by: Shorty2hops | March 04, 2012 at 01:57 PM