Socrata (for which I otherwise have a lot of respect) posted on their blog about the use of their platform for the state of Oregon's data assets. The site, data.oregon.gov, demonstrates the sophistication and solid engineering of the Socrata platform. However, looking through the HTML source of the dataset pages (e.g. this data set on unemployment statistics), it appears to me that the data is not crawlable. In other words, a web crawler will not be able to spider the site and pull down the data in a reasonable form (e.g. CSV, Excel, HTML tables, etc.). Of course, the data can be downloaded by a user interacting with the site, but this appears to rely on JavaScript modifying the DOM to create the download links, so those links are not present in the HTML that a crawler fetches.
As I've mentioned in several posts about d8taplex, my belief is that there is sufficient data on the web that can be discovered, crawled and automatically interpreted by a system like d8taplex or Timetric. Making the automated access to this data more complex or impossible is against the spirit of open data.
Of course, I'm not 100% sure that the data is not openly crawlable on the Oregon site, but my initial inspection suggests that it isn't - I'd be very happy to be proved wrong on this!
For the unemployment data, can't you just follow this link: http://data.oregon.gov/views/db5t-4zd9/rows.csv?accessType=DOWNLOAD
...download it as a CSV and then parse it?
Posted by: Ptoulis | July 04, 2011 at 02:04 PM
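As a rough illustration of the download-and-parse approach suggested above, here is a minimal Python sketch. The URL is the one quoted in the comment; whether it still resolves is not verified here. The helper names `fetch_text` and `parse_csv` are hypothetical, and the sample text is a placeholder with made-up columns, not the actual Oregon data.

```python
import csv
import io
import urllib.request

# URL as quoted in the comment above (not verified to still resolve).
CSV_URL = "http://data.oregon.gov/views/db5t-4zd9/rows.csv?accessType=DOWNLOAD"


def fetch_text(url, timeout=30):
    """Download a URL and return the body decoded as UTF-8 text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


# Placeholder input with invented column names, used so the sketch runs offline.
SAMPLE = "year,value\n2010,1\n2011,2\n"
rows = parse_csv(SAMPLE)
```

Against the live site one would call `parse_csv(fetch_text(CSV_URL))`. This only works if the URL is already known, which is the point at issue in the post.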
Ptoulis - that URL would be perfect, but it is not in the source of the page. I suspect it is created by some javascript when the user interacts with the page. Of course, a sophisticated crawler could run javascript on the page and simulate all user interactions and then figure out if any of them create a new link, but that seems less than ideal.
Posted by: Matthew Hurst | July 04, 2011 at 02:31 PM
Might want to look at the API - a much better way to work with data.
http://dev.socrata.com/
Posted by: Jufemaiz.wordpress.com | July 04, 2011 at 11:56 PM
Thanks for your feedback. We strive to make our data as accessible and discoverable as possible, so we've put a lot of effort into making sure that the data on our platform is crawlable by search engines like Google:
- Our data catalog is entirely accessible without Javascript, so it can easily be crawled directly. This is also important for Section 508 accessibility
- Our dataset pages include all of the metadata for the dataset included in the base HTML, not AJAXed in from our API
- While the data itself is dynamically loaded via our API into our advanced data grid and visualizations, it is also crawlable as a simple HTML table via the accessible HTML-only version of the dataset page (which can be reached by adding "/alt" to any dataset URL or by using accessible keyboard links)
What other improvements do you suggest we could make to the dataset page to make it more crawlable? To provide the interactive experience that our users are expecting, we won't be able to remove the interactive data grid, but we might be able to provide more prominent links in the source to our data feed formats or API endpoints.
That said, if you want programmatic access to data on the Socrata platform, it's probably better to go through the API anyway.
Thanks,
Chris Metcalf
Director of Product Development
and Developer Evangelism
Posted by: Chris Metcalf | July 15, 2011 at 02:31 AM
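To make the "/alt" route described in the comment above concrete, here is a minimal Python sketch that builds the alternate URL and reads a plain HTML table using only the standard library. The names `alt_url` and `TableParser`, the example dataset URL, and the sample markup are all hypothetical; the actual structure of Socrata's HTML-only page is not shown in the comment and may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit


def alt_url(dataset_url):
    """Append '/alt' to the path of a dataset URL, as the comment describes."""
    parts = urlsplit(dataset_url)
    path = parts.path.rstrip("/") + "/alt"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical dataset URL and placeholder markup, so the sketch runs offline.
example = alt_url("http://data.example.org/dataset/abcd-1234")
parser = TableParser()
parser.feed("<table><tr><th>year</th><th>value</th></tr>"
            "<tr><td>2010</td><td>1</td></tr></table>")
table = parser.rows
```

A crawler still has to know to append "/alt", unless a link to that page appears in the ordinary HTML of the dataset page.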
Chris,
Thanks for your comment. On reading it, I understand that a) the metadata is trivially available to a crawler and b) the data itself is available if one alters the URL or uses keyboard 'links' (I'm not sure what these are).
What I believed when I looked at the html source was that a standard crawling algorithm (loading the html of a page, extracting out links and recursing) would not be able to access the data. From your response, I'm actually still not sure if that is possible or not - in fact it still sounds to me like it isn't.
I *do* see that excel files are showing up in Google's index, but as far as I can tell that is due to their execution of javascript on the site rather than following the plain html graph. I will run my crawler on the site to see if it is trivially crawlable via the basic spidering logic I described.
Posted by: Matthew Hurst | July 17, 2011 at 02:38 AM
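The "standard crawling algorithm" mentioned in the last comment (load the HTML of a page, extract the links, recurse) can be sketched in a few lines of Python. This is an illustrative sketch, not the crawler used by d8taplex. The names `LinkExtractor`, `extract_links`, `crawl` and the in-memory `PAGES` site are hypothetical; `fetch` is any function that returns HTML for a URL, or `None` if the page is unavailable.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in static HTML (no JavaScript is run)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs for every <a href> in the page."""
    parser = LinkExtractor()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns the set of URLs discovered via plain links."""
    seen = {start_url}
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = fetch(url)
        fetched += 1
        if html is None:
            continue
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


# Hypothetical three-page site: the download path exists only inside a script.
PAGES = {
    "http://site.example/": '<a href="/catalog">Catalog</a>',
    "http://site.example/catalog": '<a href="/dataset/1">Dataset</a>',
    "http://site.example/dataset/1": '<script>var u = "/dataset/1/rows.csv";</script>',
}
found = crawl("http://site.example/", PAGES.get)
```

In this toy site the dataset page is reachable, but the CSV path is never discovered because it appears only in script text, which mirrors the concern raised in the post. A real crawler would also restrict itself to one host, respect robots.txt and rate-limit its requests; none of that is shown here.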