The 3rd Web Document Analysis Workshop closed with an interesting discussion around the provocative question:
What does document analysis give us, how can we take advantage of it and how can we encourage it?
The question was inspired largely by the content of Dan Lopresti's excellent invited talk ('The case of the missing dimension(s)'). Dan observed that traditional systems view web documents as linear sequences of tokens, but that they are in fact encodings of two-dimensional documents.
Much of the discussion focused on search: how would document analysis affect search results? A number of responses to this were proposed including:
- The interpretation of tabular material.
For example, if you were interested in climatic information about cities in Korea, you might use the query 'average rainfall seoul pusan'. Thomas Breuel pointed out, quite correctly, that issuing this search would most likely produce a page with the desired tabular data.
In later discussions I had with Robert Dale and Vanessa Long, we discussed the notion of search result quality: relevance is not the same as quality. In the case of the search for climatic information, imagine a system that, given such a query, could produce a statistical summary of the results found in all tables (e.g. giving the mean and variance in a super table).
- Title and other block segmentation.
Here the desire is to ensure that adjacency in the linear stream of tokens is not confused with token adjacency in the document. An example of the error to avoid is treating the last word in a title or section heading as the first word in a phrase that continues with the initial tokens of the following paragraph.
PDF documents and other layout-weak document encodings are commonly returned in search results. These documents pose significant challenges at very low levels. Consequently, a reasonable number of standard document analysis processes need to be run against the document prior to indexing.
- Document zoning.
This is of particular interest to blog and message board search engines. Web pages are generally made up of a number of functional elements (including title, navigation, adverts and main content). Indexers have no recognition of the significance of these areas, which is why in some cases a result takes you to a page that no longer contains the query terms that got you there. The blogosphere offers a good example: TypePad blogs include a list of recently updated blogs. This list changes constantly and is almost guaranteed to differ from how it appeared at index time.
- Sub-page documents.
Similar to document zoning, the problem of sub-page documents is familiar to blog search engine implementers. The basic unit of content is not the web page but some smaller unit (e.g. a blog post), and a single web page may contain many such units, each of which needs to be indexed individually.
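To make the title and block segmentation point from the list above concrete, here is a minimal sketch in Python. It is not something presented at the workshop. The choice of block-level tags and the names BlockTokenizer, block_bigrams and linear_bigrams are my own assumptions for illustration. It compares phrase candidates (bigrams) taken from the linear token stream with those taken only within a block.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries. This set is an assumption for the sketch,
# not a standard list.
BLOCK_TAGS = {"title", "h1", "h2", "h3", "p", "li", "td", "th", "div"}


class BlockTokenizer(HTMLParser):
    """Collects tokens per block so adjacency never crosses a block boundary."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # one list of tokens per block
        self._current = []  # tokens of the block currently being read

    def _close_block(self):
        if self._current:
            self.blocks.append(self._current)
            self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._close_block()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._close_block()

    def handle_data(self, data):
        self._current.extend(data.lower().split())

    def close(self):
        super().close()      # flushes any buffered text first
        self._close_block()


def _blocks(html):
    parser = BlockTokenizer()
    parser.feed(html)
    parser.close()
    return parser.blocks


def block_bigrams(html):
    """Bigrams formed only from tokens that share a block."""
    return [(a, b) for block in _blocks(html) for a, b in zip(block, block[1:])]


def linear_bigrams(html):
    """Bigrams formed from the flat token stream, ignoring layout."""
    tokens = [t for block in _blocks(html) for t in block]
    return list(zip(tokens, tokens[1:]))
```

Given a heading followed by a paragraph, linear_bigrams pairs the last word of the heading with the first word of the paragraph, while block_bigrams does not. A real indexer would need a much richer notion of a block than a fixed tag set, particularly for PDF and similar encodings.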
There was recognition that discussion of search applications makes broad assumptions about use cases and user expectations which have been drilled into the consumers of such interfaces. The example of a search result returning a summary of tabular data illustrates this point and hints at the potential for new interfaces, new user experiences and new user expectations in the search space.
Document analysis researchers often view the problem of analysing web pages as a very partitioned space: the web documents must be consumed as is. The second part of the discussion looked at what can be done to assist in the analysis of online documents. A big part of this is the inclusion of information in the markup which will help with various tasks. In the case of certain layout elements (e.g. titles) that information is already present. However, for many of the issues raised above, there is no clear standard. It was recognized that there are a number of ad-hoc inclusions (e.g. comments to indicate where ads appear, or where navigation appears). These inclusions may be taken advantage of opportunistically but do not represent a stable path to success.
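As an illustration of what taking advantage of such ad-hoc inclusions opportunistically might look like, here is a small Python sketch. The marker convention (comments of the form 'begin NAME' and 'end NAME'), the zone names and the function indexable_tokens are all hypothetical. No site is known to me to use exactly this scheme, which is precisely the weakness noted above.

```python
from html.parser import HTMLParser

# Hypothetical convention: <!-- begin NAME --> ... <!-- end NAME -->
# Zones listed here are left out of the index. Both names are assumptions.
SKIP_ZONES = {"ads", "navigation"}


class ZonedTextExtractor(HTMLParser):
    """Keeps only text that falls outside comment-delimited skip zones."""

    def __init__(self):
        super().__init__()
        self.zone_stack = []  # currently open zones, innermost last
        self.tokens = []      # tokens that would be passed to the indexer

    def handle_comment(self, data):
        parts = data.strip().lower().split()
        if len(parts) != 2:
            return  # not a zone marker, ignore
        keyword, name = parts
        if keyword == "begin":
            self.zone_stack.append(name)
        elif keyword == "end" and self.zone_stack and self.zone_stack[-1] == name:
            self.zone_stack.pop()

    def handle_data(self, data):
        if not any(zone in SKIP_ZONES for zone in self.zone_stack):
            self.tokens.extend(data.lower().split())


def indexable_tokens(html):
    """Return the tokens of a page with marked ad and navigation zones removed."""
    parser = ZonedTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

The sketch only works on pages that happen to follow the assumed convention, and it silently indexes everything on pages that do not.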
As with the inclusion of any novel information, adding in this data is going to be challenging from the human behaviour point of view, though it was recognized that structured blogging and microformats were a start.
I was encouraged to write these notes sooner rather than later by Abdel Belaid (thanks), but I do recognize that these are not minutes of the meeting and that they include my own personal bias and some subsequent conversations with others. This content will be posted both on the WDA2005 blog and on my own blog. Please comment on the WDA blog only.