Over the past 4 years or so at Intelliseek/BuzzMetrics, I was responsible for the document analysis system which took posts from blogs, Usenet and message boards and recognized various structural elements. In the body of the document, the most important of these were quoted material and signatures. Detecting them is critical because recognizing a word in a quote or a signature is quite different to recognizing it in newly minted content (consider, for example, the inclusion of 'games I own' in signatures on a video gaming message board).
Now that we are in the last leg of preparing for ICWSM, we've been passing around quite a few messages with schedule information in them. These messages are basically lists of things like '8:30-10:00 Session X'. In reading and writing these in GMail, Google's free email service, I noticed something weird: some things which were new content were being treated as quotes. Like most email clients, when you reply to a message in GMail, it quotes the original message and gives a visual cue as to the quotation. However, it turns out it also appears to use some sort of substring matching algorithm to infer quoted material.
Try the following:
- Send yourself (that is to say, your GMail self) a message with the three lines 'This is line one\n\nThis is line two\n\nThis is line three\n'
- Reply to that message, but remove all the content in the reply and drop in 'This is line two\n\nThis is line three'
- When you open the second message, you will see nothing but the 'Show quoted text' message.
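The behaviour in the steps above can be imitated with a very small sketch. GMail's actual algorithm is not public, so this is only a guess at the general idea: it assumes matching is done line by line against the message being replied to, and the function name `split_quoted` is made up for illustration.

```python
def split_quoted(original: str, reply: str):
    """Label each non-blank reply line as quoted if it also appears in the original.

    Assumption: exact line-level matching after trimming whitespace.
    This is a guess at the idea, not GMail's real method.
    """
    original_lines = {line.strip() for line in original.splitlines() if line.strip()}
    labelled = []
    for line in reply.splitlines():
        text = line.strip()
        if not text:
            continue  # blank lines carry no evidence either way
        labelled.append((text, text in original_lines))
    return labelled


original = "This is line one\n\nThis is line two\n\nThis is line three\n"
reply = "This is line two\n\nThis is line three"

labelled = split_quoted(original, reply)
new_content = [text for text, quoted in labelled if not quoted]
```

With the two messages from the experiment, every line of the reply is labelled as quoted and `new_content` comes out empty, which matches seeing nothing but 'Show quoted text'.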
Okay - so you probably think I'm a pedant picking up on a bug (or at least, an inference that may or may not be correct) like this. However, the reason that it seems of interest to me is that the most trivial approach to detecting signatures in messages is - of course - to recognize repeated text found at the end of multiple posts by the same author. But hang on, that is exactly what the quotation finding method that GMail uses in the example above does.
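To make the comparison concrete, here is a minimal sketch of that trivial signature detector. It assumes a signature is simply the longest run of lines shared at the end of every post by one author; the function name `common_trailing_lines` and the sample posts are invented for illustration, and a real system would need to tolerate small variations.

```python
def common_trailing_lines(posts):
    """Return the longest run of lines found at the end of every post.

    Assumption: all posts come from the same author and the signature
    is repeated verbatim, apart from surrounding whitespace.
    """
    line_lists = [[line.strip() for line in post.strip().splitlines()] for post in posts]
    if len(line_lists) < 2:
        return []  # one post gives no evidence of repetition
    shared = []
    # Walk all posts backwards in step, stopping at the first disagreement.
    for lines in zip(*(reversed(lines) for lines in line_lists)):
        if len(set(lines)) != 1:
            break
        shared.append(lines[0])
    shared.reverse()
    return shared


posts = [
    "I finished it last night.\n--\nGames I own: Halo, Zelda",
    "Anyone tried the new patch?\n--\nGames I own: Halo, Zelda",
]
signature = common_trailing_lines(posts)
```

Here `signature` holds the separator line and the 'Games I own' line. Feed the same function an original message and a reply that repeats its last lines, and it will flag the repeated text just as readily, which is the overlap between the two tasks.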
Novel discovery. Can it infer threading from a large collection of mail? Also think of what can be collected about an individual's profile merely by examining what, who, and how people respond to their emails. I'd love to continue working on signature detection.
Posted by: Matthew Siegler | March 11, 2007 at 09:00 PM
I think William Cohen at CMU has one or more papers exactly on this topic. Very interesting stuff.
http://www.cs.cmu.edu/~wcohen/postscript/email-2004.pdf
Posted by: Mark D. | March 12, 2007 at 12:06 AM