Fernando is still skeptical about the potential of NLP to play a major role in search. I may be putting words in Fernando's mouth, but I believe the reason he says this is that he is assessing NLP's impact against the standard search interaction (type words in a box, get a list of URLs back). This misses the point.
When one is dealing with text (that is to say, an abstraction of content which is ultimately a sequence of letters represented in some form) there is a considerable amount of ambiguity. Sure, there are some manipulations that can be performed (OK, there is exactly one: it's called stemming), but ultimately you are at the mercy of the most trivial representation of the speaker's or author's intended communication. It is not surprising, then, that the standard search paradigm (which is equivalent to the IR paradigm) is to deliver a list of documents: text in, documents out.
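For concreteness, stemming really is about the limit of pure string manipulation. Here is a toy suffix-stripping stemmer in Python; this is a sketch only (real stemmers such as Porter's apply many more rules, and the suffix list and length guard here are my own invention):

```python
# Toy suffix-stripping stemmer: the one "manipulation" available at
# the level of raw character sequences. Order matters: try longer
# suffixes before shorter ones.
SUFFIXES = ("ing", "es", "ed", "s")

def stem(word):
    """Strip the first matching suffix, but only if enough of the
    word would remain (a crude guard against mangling short words)."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["searching", "searched", "searches"]])
# ['search', 'search', 'search']
```

Even this collapses distinct words to one crude form; it resolves none of the ambiguity that actually matters.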
When one is dealing with language, one is dealing at a higher level of abstraction. Rather than sequences of characters (or tokens, what we might rudely refer to as words), we are dealing with logical symbols. Rather than the primary relationships being before and after (as in: this word is before that word), we can capture relationships that are either grammatical (pretty interesting) or semantic (extremely interesting). To transform simple text into logical representations, one has to resolve a lot of ambiguity along the way. The current search paradigm relies on a number of statistical qualities relating the query to the text of all the documents in the index, and resolves these ambiguities with the help of the user: something interesting ought to be found somewhere in the documents at the top of the heap, so please go and find it. When the content system itself deals with the ambiguity, the interface no longer has that job, and search systems (in fact, they won't even be called search systems) will be able to provide far more interesting applications.
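To make the contrast concrete, here is a toy sketch of the same sentence viewed first as a token sequence and then as a logical relation. The function names and the relation schema are entirely hypothetical, and there is no real parser behind this; it only illustrates the shift from "before and after" to logical symbols:

```python
# The IR view versus the (toy) logical view of the same text.

def as_tokens(text):
    """The IR view: text is just an ordered sequence of tokens."""
    return text.lower().split()

def as_relation(tokens):
    """A deliberately naive subject-verb-object reading, assuming a
    three-word declarative sentence. Real systems must resolve far
    more ambiguity than this sketch suggests."""
    subject, verb, obj = tokens
    return {"predicate": verb, "args": (subject, obj)}

tokens = as_tokens("Dogs chase cats")
print(tokens)               # ['dogs', 'chase', 'cats']
print(as_relation(tokens))  # {'predicate': 'chase', 'args': ('dogs', 'cats')}
```

The token view can only answer "which documents contain these strings?"; the relation view can, in principle, answer "who chases what?".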
This is why I find the notion of the Data Web so interesting (see this post on Cognos for an illustration).
In some sense, I've broken the clear distinction I made earlier, in the post Fernando responded to, between the back and front end of search: I'm claiming that changes to the back end will enable fundamental changes to how 'results' are served.
...search quality can be and has to be traded off with search cost.
If you change the game (e.g. by changing the way results are provided) then the notion of quality has been disrupted. I'm not sure what the costs are that Fernando is referring to. CPU (e.g. time to process all content)? Response time?
One incarnation of the (vaguely defined) Data Web that I've illustrated on this blog involves providing statistical results to a query. The example I gave was querying a system for the number of bloggers in China and getting back a result which, breaking with current search result models, wasn't a list of 'relevant' documents, but instead a plot of the estimates of this variable over time. For reference, the post is here.
Recently, John Furnari (Cognos Grassroots Marketing Specialist) dropped me a pointer to the Cognos enterprise search product. I checked it out and, in amongst the marketing materials, spotted something that had some resonance with the above idea in a video of Cognos' CEO talking about their technology. After some interaction with John, Delaney Turner (Manager, Product and Solutions Communications) has furnished me with some screen shots. The pick of the litter is shown below.
What you see here is a search for 'revenue' resulting in a graph (not a document!). Conceptually, I see this as a big deal.
I've posted on technologies available right now that are better than existing consumer (that is to say, free) geographic visualization tools/toys. Here is something similar for 3D buildings. This project from Edinburgh Uni's EPCC (where I was an intern many years ago) is a demonstration of high-detail 3D modeling for town planning. There are also movies on the site.
Note that this is non-Google data imported into Google Earth, which is essentially being used as a rendering engine.
One of the big issues I see looming on the horizon is how data like this is distributed and discovered. Should it be consolidated and licensed (as in the Google Earth model), should it be made a specialized commodity that people have to download onto their desktops to use, or should it be distributed and automatically (and opportunistically) integrated into data-smart clients? This seems like a Data Web issue to me.
One of the favourite themes of Anderson's Long Tail thesis is the movie business: blockbusters are on the way out, and with them the cinema business. I am still somewhat skeptical of this and always on the lookout for more data. I recently came across this page from the UK Film Council: their statistical yearbook. While I don't think it settles the argument either way, there is lots of interesting data to take into consideration.
Incidentally, this is one of the cases where the absence of my dreamed-of Data Web makes finding data painful. Imagine if you could land on this page and have it link to, manipulate and compare data from Chris' blog or other sources, so that you could research and investigate the data right in the browser.
I'm writing a paper with some colleagues on estimating the number of bloggers from various countries. We wanted to compare our statistics with those published elsewhere online. In my mind, this is a great example of the Data Web.
Imagine the following interaction. A user asks an interface: How many Chinese bloggers are there? Instead of the usual ranked list of web pages which match some or all of these words, and which may or may not contain content that associates their meaning in the way intended, we get the following output:
This graph shows the various estimates for the size of the Chinese blogosphere (actually, the number of Chinese bloggers) published online against the date associated with the estimate. This was gathered by hand from 14 different published estimates in 10 different documents (with some pain, I might add).
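The mechanics behind such a result are trivial once the data exists; the gathering is the painful part. A minimal Python sketch of turning hand-gathered observations into a time-ordered series ready to plot (the dates and values below are made-up placeholders, not the actual published estimates):

```python
from datetime import date

def estimate_series(observations):
    """Sort (publication_date, estimate) pairs chronologically so a
    plot of estimates over time can be drawn directly from the result."""
    return sorted(observations, key=lambda pair: pair[0])

# Placeholder observations -- NOT the real published estimates.
gathered = [
    (date(2006, 3, 1), 2_000_000),
    (date(2005, 6, 15), 1_000_000),
    (date(2006, 9, 30), 4_000_000),
]

series = estimate_series(gathered)
print([d.isoformat() for d, _ in series])
# ['2005-06-15', '2006-03-01', '2006-09-30']
```

The point is that the hard work, locating fourteen estimates across ten documents and associating each with a date, is exactly what a Data Web would make unnecessary.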
James Fee notes that Swivel originally had a data api. (What follows was originally a comment I submitted to James' post but his MT installation appears to be having trouble.)
The API issue is interesting. I've also encouraged Swivel to provide an API, and I'm sure they are going to release something very cool in this area. However, the more one thinks about it, the more the convergence of ideas leads me to the notion of the Data Web. This is an environment in which data is distributed, discoverable, described and linked, just as the text of documents is now. Rather than data tools exposed through current browsing technology, one would have a data browser, and a service like Swivel would be more of an aggregator/search engine than a data repository.
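To sketch what "distributed, discoverable, described and linked" might mean in practice, here is a hypothetical Python example: each dataset carries metadata plus links to related datasets, and a crawler discovers them the way a web crawler follows hyperlinks. The schema and identifiers are invented for illustration.

```python
# A toy in-memory "Data Web": datasets described by metadata and
# linked to one another, discoverable by link-following.
datasets = {
    "blog-counts/china": {
        "title": "Estimates of Chinese bloggers over time",
        "columns": ["date", "estimate"],
        "links": ["blog-counts/global"],
    },
    "blog-counts/global": {
        "title": "Worldwide blogger counts",
        "columns": ["date", "estimate"],
        "links": [],
    },
}

def discover(start, web):
    """Follow dataset-to-dataset links from a starting point,
    collecting every reachable dataset id (a crawl, in miniature)."""
    seen, frontier = set(), [start]
    while frontier:
        current = frontier.pop()
        if current in seen:
            continue
        seen.add(current)
        frontier.extend(web[current]["links"])
    return seen

print(sorted(discover("blog-counts/china", datasets)))
# ['blog-counts/china', 'blog-counts/global']
```

In this picture an aggregator like Swivel would run the crawl and index the descriptions, while the datasets themselves live wherever their publishers put them, just as web pages do.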
Note that I'm not talking about the semantic web here. I'll try to get my thinking together and provide a deeper dive on the Data Web soon. For now, this is also appropriate.