
July 02, 2011



For the unemployment data, can't you just follow this link to a CSV and then parse it?
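
If such a CSV link were available, the parsing step itself would be short. A minimal Python sketch, assuming the file has a header row; the column names and values below are made up for illustration and are not real unemployment figures:

```python
import csv
import io


def parse_csv_text(text):
    """Parse CSV text with a header row into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(text))
    return [dict(row) for row in reader]


# Made-up sample standing in for a downloaded file (hypothetical columns).
sample = "region,rate\nA,5.0\nB,6.5\n"
rows = parse_csv_text(sample)
```

Every value comes back as a string, so numeric columns would still need converting before any analysis.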

Matthew Hurst

Ptoulis - that URL would be perfect, but it is not in the source of the page. I suspect it is created by some JavaScript when the user interacts with the page. Of course, a sophisticated crawler could run the JavaScript on the page, simulate all user interactions and then check whether any of them creates a new link, but that seems less than ideal.

Might want to look at the API - a much better way to work with data.

Chris Metcalf

Thanks for your feedback. We strive to make our data as accessible and discoverable as possible, so we've put a lot of effort into making sure that the data on our platform is crawlable by search engines like Google:

- Our data catalog is entirely accessible without JavaScript, so it can easily be crawled directly. This is also important for Section 508 accessibility.
- Our dataset pages include all of the metadata for the dataset in the base HTML, not AJAXed in from our API.
- While the data is dynamically loaded via our API into our advanced data grid and visualizations, it is also crawlable as a simple HTML table via the accessible HTML-only version of the dataset page (which can be reached by adding "/alt" to any dataset URL or by using accessible keyboard links).
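
As a rough illustration of the last point, here is a sketch in Python of how a client might derive the "/alt" URL and read rows out of a plain HTML table. The example host and dataset path are hypothetical, and the table markup is assumed to be ordinary `<tr>`, `<th>` and `<td>` elements; the real page may be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, urlunsplit


def alt_url(dataset_url):
    """Append "/alt" to the path of a dataset URL, keeping any query string."""
    parts = urlsplit(dataset_url)
    path = parts.path.rstrip("/") + "/alt"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_rows(html):
    """Return the table content of an HTML string as a list of rows of cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

For example, `alt_url("https://data.example.org/d/abcd-1234")` yields the same URL with "/alt" appended, and `table_rows` can then be applied to whatever HTML that URL returns.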

What other improvements do you suggest we could make to the dataset page to make it more crawlable? To provide the interactive experience that our users are expecting, we won't be able to remove the interactive data grid, but we might be able to provide more prominent links in the source to our data feed formats or API endpoints.

That said, if you want programmatic access to data on the Socrata platform, it's probably better to go through the API anyway.


Chris Metcalf
Director of Product Development
and Developer Evangelism

Matthew Hurst


Thanks for your comment. From it I understand that (a) the metadata is trivially available to a crawler and (b) the data itself is available if one alters the URL or uses keyboard 'links' (I'm not sure what these are).

What I believed when I looked at the HTML source was that a standard crawling algorithm (loading the HTML of a page, extracting the links and recursing) would not be able to reach the data. From your response, I'm still not sure whether that is possible - in fact it still sounds to me like it isn't.
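
To make "standard crawling algorithm" concrete, here is a minimal Python sketch of that spidering logic. It uses a queue rather than literal recursion, and the `fetch` function is a stand-in supplied by the caller (here a dictionary of made-up pages on a hypothetical site), so nothing below describes how any real crawler or the Socrata site behaves.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in the static HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute, fragment-free URLs for the links found in html."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return [urldefrag(urljoin(base_url, href)).url for href in parser.hrefs]


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. fetch(url) returns HTML text, or None if unavailable.

    Returns the set of URLs discovered by following plain <a href> links only.
    """
    seen = {start_url}
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = fetch(url)
        fetched += 1
        if html is None:
            continue
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


# Made-up pages: one data link is in the HTML, the other exists only inside a script.
pages = {
    "http://site.test/": '<a href="/a">A</a> <a href="#top">top</a>',
    "http://site.test/a": '<a href="/b.csv">data</a><script>var u = "/hidden.csv";</script>',
}
found = crawl("http://site.test/", pages.get)
```

A link that only appears once a script has run never enters `found`, which is the gap I am describing: the crawler sees only what is in the plain HTML graph.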

I *do* see that Excel files are showing up in Google's index, but as far as I can tell that is because Google executes the JavaScript on the site rather than following the plain HTML link graph. I will run my crawler on the site to see whether it is trivially crawlable via the basic spidering logic I described.

The comments to this entry are closed.
