I noted William's post on a new paper (to be published at ICDM) that his student, Richard Wang, has written. The paper describes a system that uses a smart combination of wrapper induction, set (list) discovery and graph-based ranking to materialize, as if by magic, expanded sets of terms. For example, enter Ursa Major, Ursa Minor and Orion, and you will get back a nice list of related terms (including Taurus, Gemini, etc.).
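To make the idea concrete, here is a toy sketch of wrapper-based set expansion. It is not the system from the paper: the function names `induce_wrapper` and `expand` are made up for illustration, the wrapper is just the longest character context shared by the first occurrence of each seed in a document, and the graph-based ranking is swapped for a plain count of how many documents' wrappers extract each candidate.

```python
import re
from collections import Counter
from os.path import commonprefix  # character-wise common prefix of strings


def induce_wrapper(doc, seeds):
    """Return the (left, right) context shared by every seed in doc, or None."""
    lefts, rights = [], []
    for seed in seeds:
        pos = doc.find(seed)  # first occurrence only (a simplification)
        if pos < 0:
            return None  # this document does not mention every seed
        # Reverse the left context so a common suffix becomes a common prefix.
        lefts.append(doc[:pos][::-1])
        rights.append(doc[pos + len(seed):])
    left = commonprefix(lefts)[::-1]
    right = commonprefix(rights)
    if not left or not right:
        return None  # no usable context on one side
    return left, right


def expand(docs, seeds):
    """Score each candidate by the number of documents whose wrapper extracts it."""
    scores = Counter()
    for doc in docs:
        wrapper = induce_wrapper(doc, seeds)
        if wrapper is None:
            continue
        left, right = wrapper
        # The right context is a lookahead so adjacent list items can share it.
        pattern = re.escape(left) + r"(.+?)(?=" + re.escape(right) + r")"
        scores.update(set(re.findall(pattern, doc)))
    return scores.most_common()


# Hypothetical input pages, invented for this example.
docs = [
    "<ul><li>Ursa Major</li><li>Taurus</li><li>Gemini</li><li>Orion</li></ul>",
    "<ol><li>Orion</li><li>Taurus</li><li>Ursa Major</li></ol>",
]
ranked = expand(docs, ["Ursa Major", "Orion"])
```

Note that nothing in the sketch knows what a constellation is. Anything that sits in the same markup context as the seeds is extracted, whatever it happens to be.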
I love this type of thing. It is close to some of the things I worked on during my thesis (mining information from tables in text) and is an example of a very active research area (which includes quite a bit of work on mining Wikipedia for entities and relationships).
However, there is no free lunch. This approach to knowledge discovery operates only at the surface level of text (and, to be complete, at the surface level of the documents' representation language). Consequently, the system's performance highlights both what is good about statistical surface techniques (little training required, which is often the case for systems that combine document structure, textual data and high-precision seed input; works in (m)any language(s); fast) and what is bad (no real knowledge of language).
An example of this problem can be seen when we give the seeds {obama, clinton} to the system. The following results appear:
# | Entity | Weight |
1 | obama | 1.00000 |
2 | clinton | 1.00000 |
3 | edwards | 0.13000 |
4 | romney | 0.11125 |
5 | mccain | 0.10493 |
6 | he | 0.08484 |
7 | giuliani | 0.07974 |
8 | the | 0.06658 |
9 | bush | 0.06585 |
10 | hillary | 0.06373 |
While many of these results are fine, there are also errors that illustrate the separation of surface and symbolic processing: the function words "he" and "the" are returned as entities.
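As a rough illustration of how little linguistic knowledge it would take to catch these two errors, here is a minimal sketch that filters the ranked list above against a small hand-picked set of function words. The list `FUNCTION_WORDS` and the helper `drop_function_words` are my own assumptions for the example; nothing here is part of the system described in the paper.

```python
# Ranked output copied from the table above.
results = [
    ("obama", 1.00000), ("clinton", 1.00000), ("edwards", 0.13000),
    ("romney", 0.11125), ("mccain", 0.10493), ("he", 0.08484),
    ("giuliani", 0.07974), ("the", 0.06658), ("bush", 0.06585),
    ("hillary", 0.06373),
]

# A tiny, hand-picked list of English function words (an assumption for
# illustration; a real list would be much longer and language specific).
FUNCTION_WORDS = {"a", "an", "the", "he", "she", "it", "they", "we", "i", "you"}


def drop_function_words(ranked, stopwords=FUNCTION_WORDS):
    """Keep only (entity, weight) pairs whose entity is not a function word."""
    return [(entity, weight) for entity, weight in ranked
            if entity.lower() not in stopwords]


filtered = drop_function_words(results)
for rank, (entity, weight) in enumerate(filtered, start=1):
    print(f"{rank} | {entity} | {weight:.5f}")
```

Of course, such a list has to be written separately for every language, which gives up part of the "works in (m)any language(s)" advantage noted above.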
Google Sets is always fun for this. Here are the results for {Obama, Clinton}:
http://labs.google.com/sets?hl=en&q1=obama&q2=clinton&q3=&q4=&q5=&btn=Large+Set
Posted by: Darius K. | August 15, 2007 at 10:29 AM
From what I have seen, SEAL is generally better, and sometimes much better, than Google Sets, EXCEPT when the seeds are single words. My guess is that it's just a bug. Try multi-token seeds and compare Google and SEAL.
Posted by: Stefano | August 27, 2007 at 11:10 PM