I've been having fun playing with Google's new image search. Naturally, the first thing I tried was a vanity search. I wondered what it would do with the images in my blogosphere gallery. It was interesting to see how one of the images had propagated around the net, being used by a number of sites to illustrate network effects and to capture the idea of the blogosphere.
However, while it is interesting to see the exact-match results, what I think is more revealing is the discovery of similar images. Before jumping into that, here's a quick guess at how Google is implementing this feature.
Firstly, I'm going to guess that it is using an indexing system just like its system for retrieving regular textual documents. This means that images are converted into discrete tokens which, taken together, effectively represent the image. When the token sets of two images are close to identical, the images are likely to be identical.
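To make that guess concrete, here is a minimal sketch of a text-style inverted index over image tokens. It is purely illustrative and not a description of Google's actual system: the `TokenIndex` class, the token strings and the use of Jaccard overlap as the score are all my own assumptions.

```python
from collections import defaultdict


class TokenIndex:
    """Toy inverted index: token -> ids of the images containing it (hypothetical)."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.tokens_by_image = {}

    def add(self, image_id, tokens):
        token_set = set(tokens)
        self.tokens_by_image[image_id] = token_set
        for token in token_set:
            self.postings[token].add(image_id)

    def search(self, query_tokens):
        query = set(query_tokens)
        # Candidate images share at least one token with the query.
        candidates = set()
        for token in query:
            candidates |= self.postings.get(token, set())
        # Score by Jaccard overlap (an assumed measure): 1.0 means identical token sets.
        scored = []
        for image_id in candidates:
            tokens = self.tokens_by_image[image_id]
            scored.append((image_id, len(query & tokens) / len(query | tokens)))
        return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


index = TokenIndex()
index.add("original", ["c0:red", "c1:blue", "c2:grey", "c3:grey"])
index.add("recoloured", ["c0:red", "c1:blue", "c2:grey", "c3:white"])
index.add("unrelated", ["c0:green", "c1:green", "c2:black", "c3:black"])

results = index.search(["c0:red", "c1:blue", "c2:grey", "c3:grey"])
for image_id, score in results:
    print(image_id, round(score, 2))
```

The exact copy comes back with a score of 1.0, the near-copy scores lower, and the image that shares no tokens is never even considered, which is the same behaviour you get from a text index.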
Secondly, the tokens that are used are analytical. That means that they represent qualities of the data, not the thing captured in the image. What I mean by this is that these tokens don't capture features of the object denoted by the image, rather they capture characteristics of the raster data that encodes the image. These features may be aggregates (e.g. the histogram of colours) and also associated with specific subareas of the image.
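As a rough illustration of what such analytical tokens might look like, the sketch below derives them from raw pixel values only: an aggregate colour histogram plus one coarse colour token per sub-area. The function names, the 2x2 grid and the four quantisation levels are assumptions chosen for the example, not anything Google has documented.

```python
from collections import Counter


def quantise(value, levels=4):
    """Map a 0-255 channel value onto one of `levels` coarse buckets."""
    return min(value * levels // 256, levels - 1)


def colour_histogram(pixels, levels=4):
    """Aggregate feature: count of pixels per quantised (r, g, b) colour."""
    return Counter(
        tuple(quantise(v, levels) for v in pixel) for row in pixels for pixel in row
    )


def image_tokens(pixels, grid=2, levels=4):
    """Sub-area features: one token per grid cell, from the cell's mean colour.

    Assumes the image is at least `grid` pixels wide and tall.
    """
    height, width = len(pixels), len(pixels[0])
    tokens = []
    for gy in range(grid):
        for gx in range(grid):
            rows = range(gy * height // grid, (gy + 1) * height // grid)
            cols = range(gx * width // grid, (gx + 1) * width // grid)
            cell = [pixels[y][x] for y in rows for x in cols]
            mean = [sum(p[ch] for p in cell) // len(cell) for ch in range(3)]
            r, g, b = (quantise(v, levels) for v in mean)
            tokens.append(f"cell{gy}{gx}:{r}{g}{b}")
    return tokens


# A 4x4 image: top half red, bottom half blue, and a slightly faded copy of it.
red_blue = [[(255, 0, 0)] * 4] * 2 + [[(0, 0, 255)] * 4] * 2
faded = [[(250, 5, 5)] * 4] * 2 + [[(5, 5, 250)] * 4] * 2

print(image_tokens(red_blue))
print(image_tokens(faded) == image_tokens(red_blue))
```

Nothing here knows what the picture is of. The faded copy produces the same tokens as the original simply because its pixel values fall into the same coarse buckets.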
Finally, the system relies on the user to perceive the similarity. Thus when you put in a picture of the London Eye, you get back the following images:
and you then remark - wow, it understands what was in the query image - how cool is that!
This is, of course, no mean feat. What Google has excelled at here, as it generally does, is the execution at scale of a reasonably well understood approach to image matching.
However, there is another side to this paradigm. What is this a picture of?
A person with wavy hair?
A man in a gray shirt?
A man holding a microphone?
A person gesturing?
Sergey Brin?
When we ask Google for similar images, we get the following:
The result set includes, interestingly: transformations of the original image, Bill Gates, Larry Page, people in gray shirts, an agent from The Matrix, both men and women.
So, on the one hand we had the impressive response to the London landmark query, which gives the appearance of intelligence, and on the other a rather confusing set of images with a high variance in what precisely counts as 'similar'.
Of course, the missing ingredient here is what is traditionally referred to as semantics. The images are not interpreted in terms of what they denote; they are interpreted in terms of the characteristics of the way in which they are encoded.
Image search as an application of this paradigm makes for a good metaphor for certain approaches to text understanding.


Mhmm, in my opinion they are using some wavelet-related techniques to keep track of the intrinsic features of the images. Using wavelet identifiers it is easy to recognize images even if they are presented in different colors, sizes and rotations.
To be a little bit more formal, a wavelet rendition of an image is invariant under rotation, translation and homothety.
Depending on the depth of the wavelet used, it returns images really similar to the pattern (...useful for example for face recognition) or is able to distinguish only the main features ... in the latter case it is clear that if the pattern is a human being the system will return human beings, but it is not able to distinguish among different humans :).
Let me bring to your attention my blog:
http://textanddatamining.blogspot.com/
Posted by: Cristian Mesiano | June 30, 2011 at 03:17 PM
Cristian - thanks for the comment; what you say is pretty much aligned with how I described the tokenization / indexing approach. The features are, ideally, invariant to certain transformations of the image including saturation, chromaticity, orientation, etc.
Posted by: Matthew Hurst | June 30, 2011 at 04:04 PM