I just read that Swivel is up and live (TechCrunch). I have a pretty crushed day today so won't be able to do a deep dive. However, in looking around I did spot something of interest (of interest, that is, if you are obsessed with tabular data processing). Although Swivel indicates where you can get data to upload, it currently doesn't have the ability to slurp tables in raw HTML format. As the site puts it:
Swivel is not yet smart enough to automatically extract tables from web pages. Getting data from web pages into Swivel is tricky, because of the wild variety of formats and structures of data in web pages. Follow the instructions below and give it a shot.
I can believe this - it is a hard problem (which is why I spent at least 4 years researching it for my PhD). Swivel peeps may be interested in some of the literature in this area (appearing in a special issue of IJDAR on tables). Here's also an approximate list of papers that I've written/contributed to.
Having .csv upload would be simple enough.
It would be awesome to have them work on a plugin so I can import from google analytics or sitemeter....
Big source of data...
Kevin
Posted by: Kevin Burton | December 06, 2006 at 01:48 PM
They do have csv upload as standard (though when I tried to upload a single dimension of data it failed). The import from other sources is an idea I passed along - imagine access to blogpulse data so you could mix it with stock prices or average temperatures, etc.
Posted by: Matthew Hurst | December 06, 2006 at 01:59 PM
I uploaded a dataset from a website. I didn't notice it had a bunch of references until after I uploaded it, and they caused the formatting and everything to be all messed up. I couldn't find a way to change the input dataset. I'm trying to upload it again as a new set (the site is being hammered), but I don't see how to delete the old one.
Posted by: Andrew Hitchcock | December 06, 2006 at 03:30 PM
ooh, those papers should be an interesting read. but personally, i have to admit that pulling meaningful data out of html web pages consistently and correctly all the time is impossible. add on top of that the fact that the table will likely have been digested into a more human-readable format (totals rows, aggregated data values) and it becomes even more intractable. swivel happens to shine at near-raw formatted data, the type that people look at and instantly need to put into a tool anyway.
now, what would you say to a data input api, free for anyone to use or write plugins for? hotness.
visnu
swivel eng #2
Posted by: visnu | December 07, 2006 at 12:08 AM
witches, i didn't read everyone's comments before commenting myself. kevin is insightful.
Posted by: visnu | December 07, 2006 at 12:11 AM
Visnu - o ye of little faith. The problem is not to provide an algorithm that is perfect for every table, but to provide an algorithm that can correctly pull data from some tables and, importantly, know when it is being accurate.
Posted by: Matthew Hurst | December 07, 2006 at 06:35 AM
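To make the point in the comment above concrete (pull data from some tables and know when the result can be trusted), here is a minimal sketch in Python using only the standard library. It is not Swivel's code and not the method from the papers mentioned in the post. The names `TableGrabber` and `extract_reliable` are hypothetical, and the reliability test (no nested tables, no merged cells, every row the same width) is an assumed heuristic chosen for illustration.

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collect cell text from each top-level <table> and flag risky markup."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one {"rows": [...], "suspect": bool} per top-level table
        self._depth = 0    # current <table> nesting depth
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def _flush_cell(self):
        # Close the open cell, if any (tolerates a missing </td>).
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._depth += 1
            if self._depth == 1:
                self.tables.append({"rows": [], "suspect": False})
            else:
                # Assumed heuristic: a nested table means layout, not data.
                self.tables[-1]["suspect"] = True
        elif self._depth == 1 and tag == "tr":
            self._row = []
        elif self._depth == 1 and tag in ("td", "th") and self._row is not None:
            self._flush_cell()
            self._cell = []
            # Merged cells break the simple row/column grid.
            if any(name in ("rowspan", "colspan") for name, _ in attrs):
                self.tables[-1]["suspect"] = True

    def handle_endtag(self, tag):
        if tag == "table" and self._depth > 0:
            self._depth -= 1
        elif self._depth == 1 and tag in ("td", "th"):
            self._flush_cell()
        elif self._depth == 1 and tag == "tr" and self._row is not None:
            self._flush_cell()
            self.tables[-1]["rows"].append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._depth == 1 and self._cell is not None:
            self._cell.append(data)


def extract_reliable(html, min_rows=2):
    """Return a list of (rows, trusted) pairs, one per top-level table."""
    grabber = TableGrabber()
    grabber.feed(html)
    grabber.close()
    results = []
    for table in grabber.tables:
        rows = table["rows"]
        widths = {len(row) for row in rows}
        trusted = (
            not table["suspect"]
            and len(rows) >= min_rows
            and len(widths) == 1   # every row has the same number of cells
            and 0 not in widths    # and that number is not zero
        )
        results.append((rows, trusted))
    return results
```

A caller would keep only the tables whose flag is true and hand everything else back to the user for manual cleanup. The sketch deliberately rejects the "totals rows" style of table mentioned earlier in the thread when it uses merged cells, rather than guessing at its structure.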
matthew - that's what i'd expect to deliver for html table parsing, i'd be satisfied if it works for 80+% of the cases, but falls flat on a good 1% or more. i actually cut my teeth on machine learning parsing resumes into richer xml formats for an old company and resumes fall in the scary territory of nlp-type things.
now, microformats. those would be fun to support too. point swivel at a page that isn't even in a table, but in more semantic html mark up and have it bring back something useful. that's web _3.0_. aw yeah.
but again, my lazy self would love a gang of plugin/api coders doing the work for me.
Posted by: visnu | December 07, 2006 at 07:48 PM