Site wrapping is nothing new. Dapper, however, brings the idea to the masses with a web-based wrapper learning system. TechCrunch gives the service a positive review and introduces a showcase example that wraps Technorati blog profile pages to produce a graph of various blog statistics.
Dapper follows a pretty common process for wrapper induction:
- Acquire examples of pages by adding URLs to a basket.
- Get annotations from the user marking the location of the elements to extract. For example, you might select a particular text node containing a statistic, as in the Technorati case.
- Derive rules which reliably extract that information over the example pages you have defined.
- Manage and deliver the running of the new wrapper by hooking it up to some application.
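To make the middle steps concrete, here is a minimal sketch of rule derivation in Python. It is not how Dapper works internally (I don't know its rule language); it simply assumes that a "rule" is the tag path to the annotated text node, and that a usable rule is one that picks out the annotated value on every example page. The names `TextNodeCollector`, `induce_rule` and `apply_rule` are made up for illustration.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not be pushed on the path.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TextNodeCollector(HTMLParser):
    """Collects (path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []     # open elements, e.g. ["html[0]", "body[0]"]
        self.counts = [{}]  # per open element: how many children of each tag
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        index = self.counts[-1].get(tag, 0)
        self.counts[-1][tag] = index + 1
        if tag in VOID_TAGS:
            return
        self.stack.append(f"{tag}[{index}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        # Assumes reasonably well-nested markup; stray end tags are ignored.
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def induce_rule(examples):
    """examples: list of (html, annotated_text). Returns a path or None."""
    common = None
    for html, annotated in examples:
        paths = {p for p, text in text_nodes(html) if text == annotated}
        common = paths if common is None else common & paths
    return sorted(common)[0] if common else None


def apply_rule(path, html):
    """Returns the first text node found at the given path, or None."""
    for p, text in text_nodes(html):
        if p == path:
            return text
    return None


def make_page(name, rank):
    # Hypothetical profile page layout, used only for this example.
    return (f"<html><body><h1>{name}</h1>"
            f"<div><span>Rank</span><span>{rank}</span></div></body></html>")


rule = induce_rule([(make_page("Blog A", 120), "120"),
                    (make_page("Blog B", 4507), "4507")])
extracted = apply_rule(rule, make_page("Blog C", 98))
```

A real system would need more robust rules than absolute tag paths, since these break as soon as the page layout shifts, but the shape of the process is the same: examples in, annotations in, a rule out that holds over all the examples.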
Below are the results for TechCrunch with the Technorati graphing instance of this solution:
I've tried playing with the service and find it a little clunky. Wrapper induction systems are common in a number of industrial applications. The system we built at WhizBang!Labs also suffered from some clunkiness in the interface due to the use of JavaScript to display annotations and interact with the user.
Have a look at REXA, or CiteSeer, for relevant research papers.
The system that I'd like to see removes the human from the loop but is restricted in capability. Given a URL, it locates numeric fields in the document that a) are located in the same place for different instances of the document and b) change over time. I believe that locating such fields would be straightforward. This would allow the type of analysis that I occasionally post on this blog - e.g. tracking stats for the BoardTracker search engine - to be trivially achieved.
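As a rough sketch of what I mean, assuming we already have several snapshots of the same URL fetched at different times: index every purely numeric text node by its tag path, keep the paths that appear in every snapshot, and then keep only those whose value differs between snapshots. Everything here (`NumericFields`, `changing_numeric_fields`, the sample page) is hypothetical, and "same place" is approximated by an identical tag path.

```python
import re
from html.parser import HTMLParser

# Text that is only a number, e.g. "120", "1,234" or "3.5".
NUMBER = re.compile(r"^-?\d[\d,]*(\.\d+)?$")
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class NumericFields(HTMLParser):
    """Maps the tag path of each numeric text node to its value."""

    def __init__(self):
        super().__init__()
        self.stack, self.counts, self.fields = [], [{}], {}

    def handle_starttag(self, tag, attrs):
        index = self.counts[-1].get(tag, 0)
        self.counts[-1][tag] = index + 1
        if tag not in VOID_TAGS:
            self.stack.append(f"{tag}[{index}]")
            self.counts.append({})

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1].startswith(tag + "["):
            self.stack.pop()
            self.counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if NUMBER.match(text):
            value = float(text.replace(",", ""))
            # Keep the first numeric node seen at a given path.
            self.fields.setdefault("/".join(self.stack), value)


def numeric_fields(html):
    parser = NumericFields()
    parser.feed(html)
    parser.close()
    return parser.fields


def changing_numeric_fields(snapshots):
    """snapshots: non-empty list of HTML strings for one URL, oldest first.

    Returns {path: [value per snapshot]} for fields present at the same
    path in every snapshot whose value is not constant.
    """
    per_snapshot = [numeric_fields(html) for html in snapshots]
    common = set.intersection(*(set(fields) for fields in per_snapshot))
    series = {}
    for path in sorted(common):
        values = [fields[path] for fields in per_snapshot]
        if len(set(values)) > 1:
            series[path] = values
    return series


def snapshot(threads):
    # Hypothetical stats page: the year is constant, the count changes.
    return (f"<html><body><p>Founded <b>2005</b></p>"
            f"<span>{threads}</span></body></html>")


tracked = changing_numeric_fields(
    [snapshot("1,200"), snapshot("1,350"), snapshot("1,500")])
```

Each entry in `tracked` is then a time series ready for graphing. The obvious weak spots are pages whose structure shifts between fetches and numbers embedded in longer strings of text, neither of which this sketch attempts to handle.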