I've been meaning to write something on a document analysis technology called wrapper induction for a while. I now have at least one good reason (and several plain reasons) to do this. The good reason is related to the Katrina disaster, and Ethan Zuckerman's (and others') mobilization of web volunteers to create a single database of lost/found people.
The specific problem that Ethan described was that of multiple web sites set up to collect information about people affected by the hurricane. Most of this data was in the form of repeated groupings of text in tables, message board posts, blog posts, blog comments, and other web pages. At Intelliseek, part of our content collection system deals with message boards and another with blogs. Wrappers are a central part of both systems. A wrapper, in this context, is essentially some piece of software (and often some form of parametrization or model) which can take a web page or a web site and provide programmatic access to the data on it - essentially creating a structured access mechanism to areas of the document.
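To make that concrete, here is a minimal sketch of a model-driven wrapper in Python (the page layout, field names, and parameter values are all made up for illustration): the 'model' is nothing more than a table index and a mapping from field names to column positions, and the wrapper turns matching rows into structured records.

```python
# A minimal model-driven wrapper sketch: the "model" is just a table index
# plus a field-name -> column-index mapping. All names here are illustrative.
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    def __init__(self, table_index, columns):
        super().__init__()
        self.table_index = table_index      # which <table> on the page holds the data
        self.columns = columns              # e.g. {"name": 1, "last_seen": 2}
        self.records = []
        self._table = -1                    # current table number
        self._row = None                    # cells of the row being read
        self._cell = None                   # text of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table += 1
        elif tag == "tr" and self._table == self.table_index:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = ""

    def handle_data(self, data):
        if self._cell is not None:
            self._cell += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append(self._cell.strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if len(self._row) > max(self.columns.values()):
                self.records.append(
                    {field: self._row[i] for field, i in self.columns.items()})
            self._row = None

page = """
<table><tr><td>ignore this navigation table</td></tr></table>
<table>
  <tr><td>1</td><td>Jane Doe</td><td>Baton Rouge, 2005-09-01</td></tr>
  <tr><td>2</td><td>John Smith</td><td>Superdome, 2005-08-30</td></tr>
</table>
"""

# The parameters encode "names are in the second column of the second table".
wrapper = TableWrapper(table_index=1, columns={"name": 1, "last_seen": 2})
wrapper.feed(page)
print(wrapper.records)
```

The interesting part, of course, is not writing a wrapper like this by hand but inducing those parameters (or doing without them entirely), which is where the next distinction comes in.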
Wrappers that involve a model require parameters that encode things like 'the person's name is located in the second column of the third table', 'the address is in the next cell to the right', and so on. Wrapper systems that are driven by constraints use rules like 'partition the page by dates, then find all the bold text between dates, and call those the titles'. I'm interested in a variation of this type of wrapper which starts off with even less information: 'things in this page are grouped - find the groups of sections of the document that are similar'.
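Here is a rough sketch of that last idea, again in Python with invented sample markup: collect the tag path above every piece of text on the page and treat paths that repeat many times as the fields of a repeated record. There is no trained model and no hand-written extraction rule, only the repetition signal itself.

```python
# Sketch of "find the groups of similar sections": group text by the tag path
# above it, then keep paths that repeat often enough to look like record fields.
from collections import defaultdict
from html.parser import HTMLParser

class RepeatFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.path = []                      # stack of currently open tags
        self.groups = defaultdict(list)     # tag path -> texts seen under it

    def handle_starttag(self, tag, attrs):
        self.path.append(tag)

    def handle_endtag(self, tag):
        if tag in self.path:
            # pop back to the matching open tag (tolerates sloppy HTML)
            while self.path and self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.groups["/".join(self.path)].append(text)

page = """
<div class="post"><b>Jane Doe</b><p>Last seen in Baton Rouge</p></div>
<div class="post"><b>John Smith</b><p>Last seen near the Superdome</p></div>
<div class="post"><b>A. Nother</b><p>Safe, staying with family</p></div>
"""

finder = RepeatFinder()
finder.feed(page)
for path, texts in finder.groups.items():
    if len(texts) >= 3:                     # repeated often enough to be a field
        print(path, "->", texts)
```

A real system would also look at attributes (class names, ids) and the order in which the paths occur, so that the fields can be stitched back together into per-record groupings, but the repetition signal is the starting point.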
It just so happened that this type of technology was exactly the right thing to point at web pages containing lists of repeated, semi-structured free text - the ad hoc online repositories for the lost and found of Katrina.
Ethan requests write-ups of technology that can be of assistance in this type of situation. I certainly believe that technology which can consume pages or sites of semi-structured data, segment that data, and even group data found on different heterogeneous sites is already here, albeit scattered around a number of research communities and industrial sites. I think that a related problem that needs addressing is the discovery problem: how do we know where all the databases are? In this case that was solved, in part, by the communities online - the community was the solution. Assisting technologies which are reasonably mature, such as named entity extraction, can also become part of the solution, but the use of the global volunteer network was a fantastic idea - one which has a strong parallel in industry: whenever you come up with some new technology that derives value from web resources, your competition must always include outsourced manpower.
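As an illustration of how named entity extraction slots in (a sketch only - the library, model name, and sample text are my own choices, not anything used in the Katrina effort), an off-the-shelf tagger can pull people, places, and dates out of the free text once a wrapper has isolated it:

```python
# Sketch: run an off-the-shelf named entity tagger over free text that a
# wrapper has already extracted. Requires spaCy and its small English model
# (python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
post = "John Smith, last seen near the Superdome in New Orleans on September 1."
for ent in nlp(post).ents:
    print(ent.text, ent.label_)   # e.g. "John Smith" PERSON, "New Orleans" GPE
```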
I have a strong belief that getting ready for the next disaster may have a lot to do not with the digirati pooling resources, but with many other fundamental social issues that the USA is rife with. This is not a political blog, so I'm not going to start writing about how narrowing the rich/poor divide, removing guns from the public sphere, implementing solid social and medical welfare, etc. would have resulted in a considerably different outcome.