I've been having fun playing with Google's new image search. Naturally, the first thing I tried was a vanity search. I wondered what it would do with the images in my blogosphere gallery. It was interesting to see how one of the images had propagated around the net, being used by a number of sites to illustrate network effects and to capture the idea of the blogosphere.
However, while it is interesting to see the exact-match results, what I think is more revealing is the discovery of similar images. Before jumping into that, here's a quick guess at how Google is implementing this feature.
Firstly, I'm going to guess that it is using an indexing system just like its system for retrieving regular textual documents. This means that images are converted into discrete tokens, which together effectively represent the image. When two images' tokens are close to identical, the images are likely to be identical.
Secondly, the tokens that are used are analytical. That means that they represent qualities of the data, not the thing captured in the image. What I mean by this is that these tokens don't capture features of the object denoted by the image, rather they capture characteristics of the raster data that encodes the image. These features may be aggregates (e.g. the histogram of colours) and also associated with specific subareas of the image.
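To make that guess a little more concrete, here is a minimal Python sketch of the idea. It is purely illustrative and not Google's actual method: the function names, the grid size, the colour quantisation and the use of Jaccard overlap are all my own assumptions. Each image is cut into subareas, each subarea is reduced to its most common coarse colour, and the resulting strings act as the image's "tokens", which can then be compared like words in a document.

```python
from collections import Counter

def image_tokens(pixels, grid=2, levels=4):
    """Turn a raster (rows of (r, g, b) tuples, values 0-255) into a set of tokens.

    Hypothetical scheme: one token per grid cell, naming the cell and the
    most frequent quantised colour inside it.
    """
    height, width = len(pixels), len(pixels[0])
    step = 256 // levels  # width of each colour bucket per channel
    tokens = set()
    for gy in range(grid):
        for gx in range(grid):
            counts = Counter()  # colour histogram for this subarea
            for y in range(gy * height // grid, (gy + 1) * height // grid):
                for x in range(gx * width // grid, (gx + 1) * width // grid):
                    r, g, b = pixels[y][x]
                    counts[(r // step, g // step, b // step)] += 1
            if counts:
                colour, _ = counts.most_common(1)[0]
                tokens.add(f"cell{gy}{gx}:" + "-".join(map(str, colour)))
    return tokens

def similarity(tokens_a, tokens_b):
    """Jaccard overlap of two token sets: 1.0 means identical tokens."""
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0

# Three tiny 4x4 "images": red/blue halves, a slightly different shade of
# the same layout, and a solid blue square.
red, blue = (200, 30, 30), (30, 30, 200)
red2, blue2 = (210, 20, 40), (20, 40, 210)
half_and_half = [[red, red, blue, blue] for _ in range(4)]
reshaded = [[red2, red2, blue2, blue2] for _ in range(4)]
solid_blue = [[blue] * 4 for _ in range(4)]

print(similarity(image_tokens(half_and_half), image_tokens(reshaded)))
print(similarity(image_tokens(half_and_half), image_tokens(solid_blue)))
```

The reshaded copy produces exactly the same tokens as the original, while the solid blue square only shares the tokens for its right-hand cells. Note that nothing in the tokens says what the picture is of; they only describe the raster data.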
Finally, the system relies on the user to perceive the similarity. Thus when you put in a picture of the London Eye, you get back the following images:
and you then remark - wow, it understands what was in the query image - how cool is that!
This is, of course, no mean feat. What Google has excelled at here, as it generally does, is the execution at scale of a reasonably well understood approach to image matching.
However, there is another side to this paradigm. What is this a picture of?
A person with wavy hair?
A man in a gray shirt?
A man holding a microphone?
A person gesturing?
Sergey Brin?
When we ask Google for similar images, we get the following:
The result set includes, interestingly: transformations of the original image, Bill Gates, Larry Page, people in gray shirts, an agent from The Matrix, both men and women.
So, on the one hand, we had the impressive response to the London landmark query, which gives the appearance of intelligence, and on the other, a rather confusing set of images with a high variance in what precisely counts as 'similar'.
Of course, the missing ingredient here is what is traditionally referred to as semantics. The images are not interpreted in terms of what they denote; they are interpreted in terms of the characteristics of the way in which they are encoded.
Image search as an application of this paradigm makes for a good metaphor for certain approaches to text understanding.

