December 27, 2006

The Ignorant Web Browser

John Battelle riffs about an interface to models and information about the human body. We are all used to rich interfaces to geographic information now, thanks to Google Earth and MSFT's Virtual Earth. However, I'm still amazed, flabbergasted even, that our web browsers don't understand documents. This is partly the fault of HTML, which is too limited and has to be hijacked if one is to present a rich document experience akin to that which we see encoded in PDF files. I can understand that there is a desire to leverage a simple information asymmetry - I mean, if we knew which parts were adverts, which were navigation and which were content, then perhaps we would do something about what we see. That being said, the fact that one can't work with tables, lists, headlines, etc. in a browser, e.g. to select parts of the document with a simple cut and paste action, is beyond me.

Sometimes I wish the web would freeze so that we could catch up to its potential...

August 18, 2006

Online Site Wrapping From Dapper

Site wrapping is nothing new - Dapper, however, brings the idea to the masses with a web based wrapper learning system. TechCrunch gives a positive review of the service as well as an introduction to a showcase example which wraps Technorati blog profile pages to produce a graph of various blog statistics.

Dapper follows a pretty common process for wrapper induction:

  1. Acquire examples of pages by adding URLs to a basket.
  2. Get annotations from the user as to the location of certain elements that you wish to extract. For example, you might select a particular text node containing a statistic you wish to extract as in the Technorati case.
  3. Derive rules which reliably extract that information over the example pages you have defined.
  4. Manage and deliver the running of the new wrapper by hooking it up to some application.

Below are the results for TechCrunch with the Technorati graphing instance of this solution:

[Screenshot: Dapper's Technorati graph for TechCrunch]

I've tried playing with the service and find it a little clunky. Wrapper induction systems are common in a number of industrial applications. The system we built at WhizBang!Labs also suffered from some clunkiness in the interface due to the use of JavaScript to display annotations and interact with the user.

Have a look at REXA, or CiteSeer, for relevant research papers.

The system that I'd like to see removes the human from the loop but is restricted in capability. Given a URL, it locates numeric fields in the document that a) are located in the same place for different instances of the document and b) change over time. I believe that locating such fields would be straightforward. This would allow the type of analysis that I occasionally post on this blog - e.g. tracking stats for the BoardTracker search engine - to be trivially achieved.
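A minimal sketch of that idea, assuming well-formed HTML and that "the same place" can be approximated by a path of tag names and sibling positions. The names `NumericFieldParser`, `numeric_fields` and `changing_fields` are made up for illustration, and a real page would need more forgiving parsing:

```python
import re
from html.parser import HTMLParser

# Assumption: a "numeric field" is a text node that is only digits,
# optional thousands separators and an optional decimal part.
NUMBER = re.compile(r"^\d[\d,]*(?:\.\d+)?$")
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class NumericFieldParser(HTMLParser):
    """Collects numeric text nodes keyed by their tag path."""

    def __init__(self):
        super().__init__()
        self.path = []            # e.g. ["html[0]", "body[0]", "td[1]"]
        self.child_counts = [{}]  # per-depth count of tags seen so far
        self.fields = {}          # path -> numeric value

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        counts = self.child_counts[-1]
        index = counts.get(tag, 0)
        counts[tag] = index + 1
        self.path.append(f"{tag}[{index}]")
        self.child_counts.append({})

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.path:
            return
        self.path.pop()
        self.child_counts.pop()

    def handle_data(self, data):
        text = data.strip()
        if NUMBER.match(text):
            # If an element holds several numeric nodes, the last one wins.
            self.fields["/".join(self.path)] = float(text.replace(",", ""))


def numeric_fields(html):
    parser = NumericFieldParser()
    parser.feed(html)
    return parser.fields


def changing_fields(snapshots):
    """Paths that hold a number in every snapshot and whose value varies."""
    per_page = [numeric_fields(html) for html in snapshots]
    shared = set(per_page[0]).intersection(*per_page[1:])
    return {
        path: [fields[path] for fields in per_page]
        for path in sorted(shared)
        if len({fields[path] for fields in per_page}) > 1
    }
```

Feeding `changing_fields` a list of snapshots of the same URL, fetched on different days, gives back one series per candidate field, ready for graphing.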

January 26, 2006

Indexing Content, not Documents

If you search for 'blogpulse' in any of the major blog search engines (BlogPulse, Technorati, IceRocket, Sphere and Google), you will see hits on areas of the document that are not what one might call object content. Problem areas include:

  • Tags (where there is a footer line saying something like 'Technorati tags: foo', or where the term is actually one of the tags listed, e.g. 'Technorati tags: blogpulse')
  • Links to search results ('blog linking to this post on : Technorati, Blogpulse, Bloglines, ...')

What is required to avoid this is some simple document analysis. Stripping footers and tag lines should be trivial - so why isn't it done? One of the major data quality components I've been involved in creating at Intelliseek is the document analysis code that takes raw post documents and marks up the structural areas (for message board data this might be quoted material and signatures, for blogs this is the type of thing I've outlined above).
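To show how little is needed for the simplest cases, here is a rough sketch that drops tag lines and "linking to this post" footers before the text reaches the indexer. The patterns and the `strip_non_content` name are my own illustrations, not the Intelliseek code, and a production system would need far more than two regular expressions:

```python
import re

# Hypothetical patterns for the two problem areas listed above.
NON_CONTENT_PATTERNS = [
    re.compile(r"^\s*technorati tags\s*:", re.IGNORECASE),
    re.compile(r"^\s*blogs? linking to this post", re.IGNORECASE),
]


def strip_non_content(post_text):
    """Return the post text without tag lines and search-link footers."""
    kept = [
        line
        for line in post_text.splitlines()
        if not any(pattern.search(line) for pattern in NON_CONTENT_PATTERNS)
    ]
    return "\n".join(kept).strip()
```

Only the returned text would be indexed as content; the dropped lines could still be kept separately as meta-data.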

One of the temptations in developing products in a market as fast paced as the blogosphere is to deliver new features rather than improve existing ones. Content is the key asset here, but modifying the index to horizontally improve it is a large task which is often lost in the rush to add a new widget or gizmo. It is a lovely irony that tags are indexed as object content when they are intended to be meta-data.

November 30, 2005

Question Answering: What is the Density of France?

A subtle but powerful feature on Google is question answering. If you type in a question like 'what is the population of UK', you will see an answer at the top of the page:

[Screenshot: Google's answer for the population of the UK]

When I was working at the University of Edinburgh, I was involved in a project that required the extraction of information from tables. To cut a long story short, this train of research resulted in my thesis: Understanding Tables in Text.

Why do I mention this? Google is using table understanding to come up with the answers to these questions. If you take a look at the source in the above example, you can see that it is a view of the data presented in the CIA World Factbook. The page is, in fact, a big table of country names and population statistics. If you browse the other listed sources, you will find that they all contain some element of structured content pertaining to the answer of the question. It may not be an explicit table - there may be, for example, the name of the country and the key value pair population: #.
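As a toy illustration of the kind of light-weight trick I have in mind (and certainly not a description of what Google actually runs), a question of the form 'what is the X of Y' can be answered by treating X as a column header and Y as a row label. The table contents, the `QUESTION` pattern and the `answer` function are all invented for this sketch:

```python
import re

# Assumption: questions follow the template "what is the <attribute> of <entity>".
QUESTION = re.compile(
    r"^what is the (?P<attribute>.+?) of (?:the )?(?P<entity>.+?)\??$",
    re.IGNORECASE,
)

# Placeholder table with made-up values; the first row is the header,
# the first column names the entity.
TABLE = [
    ["Country", "Population", "Religion"],
    ["Atlantis", "1,000", "placeholder"],
    ["Utopia", "2,000", "placeholder"],
]


def answer(question, table):
    """Look up the cell at (entity row, attribute column), or None."""
    match = QUESTION.match(question.strip())
    if not match:
        return None
    attribute = match.group("attribute").lower()
    entity = match.group("entity").lower()
    header = [cell.lower() for cell in table[0]]
    if attribute not in header:
        return None
    column = header.index(attribute)
    for row in table[1:]:
        if row[0].lower() == entity:
            return row[column]
    return None
```

Note that nothing here understands what a population is: the match is purely on strings, which is why the approach can be pushed into giving odd answers.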

Once you grok this, you can get Google to do some impressive party tricks. By entering 'what is the religion of nepal?' you can find an answer: Religion: 90% Hindu, 5% Buddhist, 3% Muslim, 2% other from this page. By reverse engineering the table to create a question, you can get Google to answer something very interesting, like: who are the major trading partners of nepal? A neat trick to suggest real intelligence in the machine.

Of course, as Google is doing a light-weight trick which avoids any real semantics, you can get it to give odd answers:

  • What is the density of France? 110.
  • Who is the Queen? is the female head of a royal family.
  • Who is the King? town (1990 pop. 4,059), Stokes Co., North, North Carolina, 15 mi/24 KM NNW of Winston-Salem;
  • Where is the end? Country: UK

My guess is that for certain sources (like Wikipedia) which they have directly wrapped, they trust the answers. For other sources, where a loose match is found, they require multiple sources of evidence - what I call the WoodStein inference paradigm (Woodward and Bernstein wouldn't report anything unless they had multiple sources).
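The guess above can be written down as a small rule, purely as my own sketch: accept an answer outright if it comes from a trusted (directly wrapped) source, otherwise require agreement between at least two distinct sources. The function name, the `min_sources` threshold and the example domains are all assumptions:

```python
def corroborated_answer(candidates, trusted_sources=frozenset(), min_sources=2):
    """candidates is a list of (source, answer) pairs.

    Returns an answer from a trusted source if there is one, else the
    answer backed by the most distinct sources, provided that number
    reaches min_sources. Returns None when nothing is corroborated.
    """
    for source, answer in candidates:
        if source in trusted_sources:
            return answer

    sources_by_answer = {}
    for source, answer in candidates:
        sources_by_answer.setdefault(answer, set()).add(source)

    best = max(
        sources_by_answer.items(),
        key=lambda item: len(item[1]),
        default=None,
    )
    if best is not None and len(best[1]) >= min_sources:
        return best[0]
    return None
```

With `min_sources=2` a single loosely matched page is never enough to produce an answer on its own.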

Google plans to roll out this feature gradually as it becomes more powerful and as the answers become more reliable. When they step over the threshold of simple text based answers to actual semantics, they will have taken a first step into a world where search no longer means 'give me a list of documents'.

September 08, 2005

Wrapper Induction, Recovery 2.0

I've been meaning to write something on a document analysis technology called wrapper induction for a while. I now have at least one good reason (and several plain reasons) to do this. The good reason is related to the Katrina disaster, and Ethan Zuckerman's (and others') mobilization of web volunteers to create a single database of lost/found people.

The specific problem that Ethan described was that of multiple web sites set up to collect information about people affected by the hurricane. Most of this data was in the form of repeated groupings of text in tables, message board posts, blog posts, blog comments, and other web pages. At Intelliseek, part of our content collection system deals with message boards and another with blogs. Wrappers are a central part of both systems. A wrapper, in this context, is essentially some piece of software (and often some form of parametrization or model) which can take a web page or a web site and provide programmatic access to the data on the site - essentially creating a structured access mechanism to areas of the document.

Wrappers which involve a model require some parameters that encode things like 'the person's name is located in the second column of the third table', 'the address is in the next cell to the right', and so on. Wrapper systems that are driven by constraints use rules like 'partition the page by dates, and then find all the bold text between dates, call these titles'. I'm interested in a variation of this type of wrapper which starts off with even less information: 'things in this page are grouped - find the groups of sections of the document that are similar'.
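To make the model-driven case concrete, here is a small sketch in which the model is just a dictionary saying which table and which columns to read. It assumes simple, non-nested tables, and `TableCollector`, `apply_wrapper` and the model layout are hypothetical names of mine rather than anything from our production system:

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collects every table as a list of rows of cell text (no nesting)."""

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def apply_wrapper(html, model):
    """Extract one record per row of the table the model points at."""
    collector = TableCollector()
    collector.feed(html)
    table = collector.tables[model["table"]]
    widest = max(model["columns"].values())
    return [
        {field: row[column] for field, column in model["columns"].items()}
        for row in table
        if len(row) > widest  # skip rows too short to hold every field
    ]


# 'name in the second column of the third table, address one cell to the right'
# (indexes are zero-based).
PERSON_MODEL = {"table": 2, "columns": {"name": 1, "address": 2}}
```

The weakness is obvious: the moment the site adds a table or reorders a column, the model is wrong, which is what makes inducing and maintaining wrappers the interesting part.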

It just so happened that this type of technology was just the right thing to point at web pages that contain lists of repeated, semi-structured free text - the ad hoc online repositories for the lost and found of Katrina.
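A very rough sketch of the 'find the groups' idea: give every element a signature made of its tag name plus the tags nested inside it, and report the elements whose signatures repeat. This is a simplification I'm using for illustration (exact signature matching, well-formed HTML, invented names such as `SignatureParser` and `repeated_groups`); a practical system would need approximate similarity rather than equality:

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class SignatureParser(HTMLParser):
    """Records a (tag, descendant tags) signature and the text of each element."""

    def __init__(self):
        super().__init__()
        self.open = []    # stack of [tag, descendant_tags, text_parts]
        self.closed = []  # (signature, text) for every closed element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        for frame in self.open:
            frame[1].append(tag)
        self.open.append([tag, [], []])

    def handle_data(self, data):
        text = data.strip()
        if text:
            for frame in self.open:
                frame[2].append(text)

    def handle_endtag(self, tag):
        if self.open and self.open[-1][0] == tag:
            name, descendants, text = self.open.pop()
            self.closed.append(((name, tuple(descendants)), " ".join(text)))


def repeated_groups(html, min_size=2):
    """Return lists of text from elements that share the same structure."""
    parser = SignatureParser()
    parser.feed(html)
    groups = defaultdict(list)
    for signature, text in parser.closed:
        if signature[1]:  # ignore leaf elements such as a lone <b>
            groups[signature].append(text)
    return [texts for texts in groups.values() if len(texts) >= min_size]
```

No model and no rules are supplied here; the only assumption is that records on the page are marked up alike.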

Ethan requests write-ups of technology that can be of assistance in this type of situation. I certainly believe that technology that can consume pages or sites of semi-structured data and which can both segment that data and even group data found on different heterogeneous sites is already here, albeit scattered around a number of research communities and industrial sites. I think that a related problem that needs addressing is the discovery problem. How do we know where all the databases are? That was in part solved by the communities online; the community was the solution in this case. Assisting technologies which are reasonably mature, such as named entity extraction, can also become part of the solution, but the use of the global volunteer network was a fantastic idea - one which has a strong parallel in industry: whenever you come up with some new technology that derives value from web resources, your competition must always include outsourced manpower.

I have a strong belief that getting ready for the next disaster may have a lot to do not with the digirati pooling resources, but with many other fundamental social issues that the USA is rife with. This is not a political blog, so I'm not going to start writing about how narrowing the rich/poor divide, removing guns from the public sphere, implementing solid social and medical welfare, etc. would have resulted in a considerably different outcome.

August 30, 2005

What does document analysis give us?

The 3rd Web Document Analysis Workshop closed with an interesting discussion around the provocative question:

What does document analysis give us, how can we take advantage of it and how can we encourage it?

The question was inspired largely by the content of Dan Lopresti's excellent invited talk ('The case of the missing dimension(s)'). Dan observed that traditional systems view web documents as linear sequences of tokens but that they were in fact encodings of two dimensional documents.

Much of the discussion focused on search: how would document analysis affect search results? A number of responses to this were proposed including:

  • The interpretation of tabular material.

For example, if you were interested in climatic information about cities in Korea, you might use the query 'average rainfall seoul pusan'. Thomas Breuel pointed out, quite correctly, that issuing this search would most likely produce a page with the desired tabular data. In later discussions I had with Robert Dale and Vanessa Long, we discussed the notion of search result quality. In other words, relevancy is not the same as quality. In the case of the search for climatic information, imagine a system that given such a query could produce a statistical summary of the results found in all tables (e.g. giving the mean and variance in a super table).

  • Title and other block segmentation.

Here the desire is to ensure that adjacency in the linear stream of tokens is not confused with token adjacency in the document. For example, a naive indexer might treat the last word in a title or section heading as the first word in a phrase that includes the initial tokens of the following paragraph.

  • Accurate PDF search.

PDF documents, and other layout-weak document encodings are commonly returned in search results. These documents pose significant challenges at very low levels. Consequently, a reasonable number of standard document analysis processes need to be run against the document prior to indexing.

  • Document zoning.

This is something of particular interest to blog or message board search engines. Web pages are generally made up of a number of functional elements (including title, navigation, adverts, main content). Indexers have no recognition of the significance of these areas, which is why in some cases results that take you to a page may not contain the query that got you there. The blogosphere offers a good example with the inclusion of recently updated blogs on TypePad blogs. This list is changing constantly and is almost guaranteed to be different from how it appeared at index time.

  • Sub-page documents.

Similar to document zoning, the problem of sub-page documents is familiar to blog search engine implementers. It addresses the fact that the basic unit of content is not the web page, but some smaller unit (e.g. a blog post). In addition, the web page contains many such elements which all need to be indexed individually.
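Going back to the title and block segmentation point, the difference between the two notions of adjacency can be sketched in a few lines. This is my own illustration, assuming the page has already been segmented into blocks (titles, paragraphs); both function names are made up:

```python
def phrase_matches(blocks, phrase):
    """True if the phrase occurs within a single block.

    blocks is a list of strings, one per layout block.
    """
    words = phrase.lower().split()
    for block in blocks:
        tokens = block.lower().split()
        for start in range(len(tokens) - len(words) + 1):
            if tokens[start:start + len(words)] == words:
                return True
    return False


def naive_phrase_matches(blocks, phrase):
    """Same search over the page flattened into one linear token stream."""
    return phrase_matches([" ".join(blocks)], phrase)


# A title followed by the first paragraph of a page.
PAGE = ["Indexing Content", "Search engines treat pages as token streams."]
```

Here `naive_phrase_matches(PAGE, "content search")` finds a phrase that no reader would see on the page, while `phrase_matches(PAGE, "content search")` does not.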

There was recognition that discussion on search applications makes broad assumptions about use cases and user expectation which have been drilled into the consumers of such interfaces. The example of a search result returning a summary of tabular data illustrates this point and hints at the potential for new interfaces, new user experiences and new user expectations in the search space.
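As a sketch of what such a 'super table' might look like, assume the tables have already been extracted from the result pages as (label, value) rows. The function name is hypothetical and any figures fed to it in testing are invented, not real rainfall data:

```python
from collections import defaultdict
from statistics import mean, pvariance


def summarize_tables(tables):
    """Merge (label, value) rows from many tables into one summary table.

    Labels are matched case-insensitively; values are parsed as floats.
    """
    values = defaultdict(list)
    for table in tables:
        for label, value in table:
            values[label.strip().lower()].append(float(value))
    return {
        label: {
            "n": len(observed),
            "mean": mean(observed),
            "variance": pvariance(observed),  # population variance
        }
        for label, observed in values.items()
    }
```

The variance is the interesting column: a large spread across sources is itself a signal about the quality of the individual results.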

Document analysis researchers often view the problem of analysing web pages as a very partitioned space - the web documents must be consumed as is. The second part of the discussion looked at what can be done to assist in the analysis of online documents. A big part of this problem is the inclusion of information in the markup which will help with various tasks. In the case of certain layout elements (e.g. titles) that information is already present. However, for many of the issues raised above, there is no clear standard. It was recognized that there are a number of ad-hoc inclusions (e.g. comments to indicate where ads appear, or where navigation appears). These inclusions may be taken advantage of opportunistically but do not represent a stable path to success.

As with the inclusion of any novel information, adding in this data is going to be challenging from the human behaviour point of view, though it was recognized that structured blogging and microformats were a start.

I was encouraged to write these notes sooner rather than later by Abdel Belaid (thanks), but do recognize that these are not minutes of the meetings and include my own personal bias and some subsequent conversations with others. This content will be posted both on the WDA2005 blog and on my own blog. Please comment on the WDA blog only.

August 25, 2005

ICDAR 2005

I will soon be heading out to the airport to fly to Korea where I will be co-chairing Web Document Analysis 2005 and attending ICDAR 2005. Hopefully this doesn't mean a break in posting, but there is that possibility.

When working on the organization of WDA with Ethan Munson, we decided to use a blog as the communication channel. I think using a blog in this situation is perfect - actually, it's the syndication part that is perfect as it allows interested parties to keep up to date with developments in a passive way, and it allows the organizers to have a single drop point for updates - no email lists to maintain!

I'm still searching for blogs that cover document analysis/document understanding, and I noticed that links to the main conference were very thin on the ground. There is some irony here as blogs and other forms of online personal media are a great area for web document analysis research.

August 06, 2005

Document Analysis and Blogging

Looking at the posts in this blog, I realised that I was not posting on a couple of areas that are of real interest to me, namely Computational Linguistics and Document Analysis. As these areas of research have a strong academic portion, I thought it made sense to look for references to conferences in blogs to find blogs that might be worth reading. Taking document analysis, I searched on TalkDigger for the URL http://www.icdar2005.org - ICDAR is a major international conference on document analysis. The results were astonishing - all blog engines reported 1 (or 2 with duplicates) result (here).

I find this amazing as blogs represent a very interesting data source for document analysis. Firstly there is the problem of segmenting blogs - something which requires wrapper induction, or wrapping of some sort if a feed is not available (wrapping is the process of using a model of the HTML to allow a program to access the content of a web page in a structured manner). Secondly, there is the problem of analysing blog posts in order to determine things like quoted material versus original content. Both of these problems need to be solved if any system is going to aggregate and analyse blog content.

Certainly ICDAR is not the only conference out there, and there are other approaches to finding blogs which post on document analysis, but that said, a single post on this important conference suggests that the document analysis community is not paying attention to this important field - or document analysis researchers are not blogging. It is also interesting to note that Google reports only 13 links to the conference web site, so perhaps there is something about this conference or conferences in general which doesn't encourage links.

At any rate, I'm still trying to find a really powerful way to find blogs on a topic instead of blog posts mentioning a certain phrase.
