There were a couple of papers that I really liked at WWW. One was Mapping the world's photos by David Crandall et al; another was SOFIE: a self-organizing framework for information extraction by Fabian Suchanek et al. Both papers (while operating in different domains) combined features from orthogonal spaces (images, tags and geocoding in the first case; entities, relations, text patterns and logic in the second) to automatically mine new facts from a data set.
However, after the initial impression of how great these systems are (and they are great), one realises that the facts they have surfaced are already known. In the first case, which excels at discovering landmarks and images of landmarks, the knowledge set is well known – landmarks, by their definition, are discovered facts. In the second case, the fact used in the example (by which we shouldn't, of course, judge the entire system) was also well known.
What one would want to see in these papers, and certainly in their presentations, is the long tail of facts. The head of the knowledge-sphere is well known by definition; rather than discovering it, we should assume it. The long tail will have weaker signals, and it is there that the power of these systems really matters.
Other papers in this and related areas from the conference include:
- Exploiting web search to generate synonyms for entities by Surajit Chaudhuri et al
- Measuring the similarity between implicit semantic relations from the web by Danushka Bollegala et al