July 02, 2011

Is Data Open if it Can't be Crawled?

Comments

Ptoulis

For the unemployment data, can't you just follow this link: http://data.oregon.gov/views/db5t-4zd9/rows.csv?accessType=DOWNLOAD

...download it as a CSV and then parse it?
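A minimal Python sketch of the download-and-parse approach suggested here. The URL is the one quoted in the comment; whether it still resolves, the UTF-8 encoding, and the presence of a header row are all assumptions. The names `fetch_csv_text` and `parse_rows` are illustrative, not part of any platform API.

```python
import csv
import io
import urllib.request

# URL quoted in the comment above; it may no longer resolve.
CSV_URL = "http://data.oregon.gov/views/db5t-4zd9/rows.csv?accessType=DOWNLOAD"


def fetch_csv_text(url, timeout=30):
    """Download a CSV export and return it as text (assumes UTF-8)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))

# Typical use (performs a network request):
#   rows = parse_rows(fetch_csv_text(CSV_URL))
```

This only works if you already know the URL, which is exactly the point at issue in the rest of the thread.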

Matthew Hurst

Ptoulis - that URL would be perfect, but it is not in the source of the page. I suspect it is generated by JavaScript when the user interacts with the page. Of course, a sophisticated crawler could run the JavaScript on the page, simulate every user interaction and check whether any of them creates a new link, but that seems less than ideal.

Jufemaiz.wordpress.com

Might want to look at the API - a much better way to work with data.

http://dev.socrata.com/

Chris Metcalf

Thanks for your feedback. We strive to make our data as accessible and discoverable as possible, so we've put a lot of effort into making sure that the data on our platform is crawlable by search engines like Google:

- Our data catalog is entirely accessible without JavaScript, so it can easily be crawled directly. This is also important for Section 508 accessibility.
- Our dataset pages include all of the metadata for the dataset in the base HTML, not AJAXed in from our API.
- While the data is dynamically loaded via our API into our advanced data grid and visualizations, it is also crawlable as a simple HTML table via the accessible HTML-only version of the dataset page (which can be reached by adding "/alt" to any dataset URL or by using accessible keyboard links).
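As a rough illustration of the last point, the sketch below builds the "/alt" address for a dataset page and pulls rows out of a plain HTML table using only the Python standard library. The "/alt" suffix comes from this comment; that the suffix goes on a URL with no query string, and that the page holds a single simple table, are assumptions. `alt_url`, `TableParser` and `extract_table_rows` are hypothetical helper names.

```python
from html.parser import HTMLParser


def alt_url(dataset_url):
    """Append "/alt" to a dataset URL (assumes no query string)."""
    return dataset_url.rstrip("/") + "/alt"


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_table_rows(html):
    """Return the table contents as a list of rows of cell strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

A crawler still has to know to add "/alt" in the first place, unless the base page links to it in its HTML.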

What other improvements do you suggest we could make to the dataset page to make it more crawlable? To provide the interactive experience that our users are expecting, we won't be able to remove the interactive data grid, but we might be able to provide more prominent links in the source to our data feed formats or API endpoints.

That said, if you want programmatic access to data on the Socrata platform, it's probably better to go through the API anyway.


Thanks,

Chris Metcalf
Director of Product Development
and Developer Evangelism

Matthew Hurst

Chris,

Thanks for your comment. On reading it, I understand that a) the metadata is trivially available to a crawler and b) the data itself is available if one alters the URL or uses keyboard 'links' (I'm not sure what these are).

What I believed when I looked at the HTML source was that a standard crawling algorithm (loading the HTML of a page, extracting the links and recursing) would not be able to reach the data. From your response, I'm still not sure whether that is possible. In fact, it still sounds to me like it isn't.
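For concreteness, here is a minimal sketch of that kind of basic spider in Python, using only the standard library. It is not my actual crawler: the function names, the same-host restriction and the page limit are assumptions made for illustration, and it uses a queue rather than literal recursion. It only follows `href` values of `<a>` tags present in the static HTML, so any URL that is built by script at interaction time is invisible to it.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
import urllib.request


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag in the static HTML."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links found in html, fragments removed."""
    parser = LinkExtractor()
    parser.feed(html)
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https"):
            links.append(absolute)
    return links


def fetch_html(url, timeout=30):
    """Download a page as text (assumes UTF-8)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch=fetch_html, max_pages=50):
    """Breadth-first crawl of one host; returns the set of URLs discovered."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        fetched += 1
        try:
            html = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        for link in extract_links(url, html):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen
```

If the CSV download URL never shows up in the set returned by `crawl`, the data is not reachable through the plain HTML link graph.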

I *do* see that Excel files are showing up in Google's index, but as far as I can tell that is because Google executes JavaScript on the site rather than following the plain HTML link graph. I will run my crawler on the site to see whether it is trivially crawlable via the basic spidering logic I described.
