Snakes on a Plane is a movie. It has never been a book, a radio play or a graphic novel. The Lord of the Rings, on the other hand, is both a book and a movie. If we wanted to compare trends for the two movies, how would we deal with this ambiguity? In other words, if we search for Snakes on a Plane, we might assume that all mentions are related to the movie. However, a search for Lord of the Rings would return a mixture of book and movie references (technically speaking, we might consider these to be two different referents of the phrase Lord of the Rings).
One strategy is to constrain the search by some disambiguating term, e.g. '+movie.'
For Snakes on a Plane, we get the following:
The last six months have 20,140 hits for the open query and 8,527 for the restricted query, giving a ratio of 0.423.
The Lord of the Rings gives us the following graph:
Here, the hits are 85,603 and 40,324, giving a ratio of 0.471.
Here's another example. The Devil Wears Prada is both a book and a movie:
We have 24,051 and 11,497, giving a ratio of 0.478.
From these numbers, one might assume that where there is ambiguity (Lord of the Rings and Devil Wears Prada), the proportion of posts that use the disambiguating term is higher. This seems intuitive - there is a greater need to disambiguate the reference to the movie from the reference to the book.
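For anyone who wants to check the arithmetic, here is a minimal sketch that recomputes the three ratios from the hit counts quoted above. The dictionary layout and the function name restricted_ratio are my own choices for illustration, not part of any search tool.

```python
# Hit counts quoted in the post: (open query, query restricted with +movie).
hits = {
    "Snakes on a Plane": (20140, 8527),
    "The Lord of the Rings": (85603, 40324),
    "The Devil Wears Prada": (24051, 11497),
}


def restricted_ratio(open_hits, restricted_hits):
    """Fraction of all hits that also contain the disambiguating term."""
    return restricted_hits / open_hits


# Illustrative helper structure: title -> ratio.
ratios = {title: restricted_ratio(*counts) for title, counts in hits.items()}

for title, ratio in ratios.items():
    print(f"{title}: {ratio:.3f}")
```

The two titles that exist as both book and movie come out a few points higher than the unambiguous one, which is the gap the paragraph above is describing. Three titles is of course a very small sample.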
Now look at the undisambiguated searches. If we assume that Snakes on a Plane searches are all references to the movie and that Lord of the Rings searches are a mixture of movie and book, we can't reliably compare volume or trending. Can we compare Snakes on a Plane with Lord of the Rings +movie? To do that, we would have to assume that every reference to the Lord of the Rings movie mentioned the term 'movie' - something which I don't believe to be the case. Can we compare Snakes on a Plane +movie with Lord of the Rings +movie? To do that, we would have to assume that the percentage of posts about Snakes on a Plane that contained the term 'movie' was equal to the percentage of posts about the Lord of the Rings movie which contained the term 'movie.' Again, we can't really do that if we believe that there is an increased use of 'movie' due to the ambiguity of the phrase.
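A rough sanity check on that last assumption, using only the counts quoted earlier. This is a sketch under two loudly stated assumptions: that every Snakes on a Plane post is about the movie, and that every Lord of the Rings +movie hit really is about the movie (neither is guaranteed). The variable names are mine.

```python
# Counts quoted earlier in the post.
snakes_open, snakes_movie = 20140, 8527
lotr_open, lotr_movie = 85603, 40324

# Assumption: all Snakes on a Plane posts are about the movie, so this is
# the rate at which movie posts happen to contain the term 'movie'.
movie_term_rate = snakes_movie / snakes_open

# If Lord of the Rings movie posts used 'movie' at that same rate, this is
# how many movie posts there would have to be in total.
implied_lotr_movie_posts = lotr_movie / movie_term_rate

# Compare against the number of Lord of the Rings posts of any kind.
exceeds_open_total = implied_lotr_movie_posts > lotr_open

print(f"Assumed rate of 'movie' in movie posts: {movie_term_rate:.3f}")
print(f"Implied Lord of the Rings movie posts: {implied_lotr_movie_posts:.0f}")
print(f"All Lord of the Rings posts:           {lotr_open}")
print(f"Implied count exceeds the total:       {exceeds_open_total}")
```

Under the equal-rate assumption, the implied number of movie posts comes out larger than the total number of Lord of the Rings posts, which can't be right. So on these numbers the rate of 'movie' usage for Lord of the Rings has to be higher than for Snakes on a Plane, which fits the idea that ambiguity pushes people to add the term.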
I'll leave discussion of solutions to this problem to a later post.
Note that the examples used in this post all show movies which are at different points in their life cycle. Snakes on a Plane is just about to be released, The Lord of the Rings was released a long time ago and The Devil Wears Prada was a recent release. The life stage of the movie might also affect the use of disambiguating terms. For example, as more people come to hear of The Devil Wears Prada through the release of its movie, the movie might become the standard reference, whereas prior to the release the book would have been. Note also that there are more ambiguities at stake here: Lord of the Rings DVDs, video games, picture books, etc.
Disambiguation is indeed both an art and a science. Getting it wrong will often mean that you're getting misinformation out of the data. It's pretty near impossible to disambiguate two similar subjects (a movie and a book with the same title) unless you use a series of supporting evidence terms. Even this can fail to produce something approaching 100% recall. Often we find that settling for 90-95% recall is "good enough" for what you're trying to accomplish. Bayesian techniques might take you closer in some instances, but further away in others. All in all, I find it quite challenging to figure out how to separate two things that are intertwined.
Posted by: Glenn Fannick | August 24, 2006 at 03:46 PM