It is 18 months until US citizens decide whether to keep their current president or roll in another. While hard for many to imagine, this means that chatter is starting now about who will be running for election, and we have already started to hear sound bites about why A is better than B and how party X did this and party Y did that. Naturally, bikini statistics will play a major part in the discussion. Or rather, they will be used as a tool to bamboozle the electorate.
It doesn't have to be that way.
With services like d8taplex, Timetric, BuzzData, Socrata and other data engines, there is a real opportunity to help people cut through the mumbo-jumbo and go directly to data assets to help make better informed decisions and, perhaps more importantly, to hold the circus accountable for honesty in the use and presentation of data.
A simple idea that I plan to play with further is to create data sets in d8taplex, as well as some specialized visualizations, to help people understand a number of key points:
The statistical history of their parties
Relative measures of different countries (what does a country with good health care look like?)
Straight forward presentations of scientific data (should we invest in ethanol?)
I rattled out an example of the first area tonight. The graph below shows spending on national defense in billions of FY 2000 dollars. Overlaid on this data set are coloured areas that represent the party in power at any given time (red = Republican, blue = Democrat). The data is taken from www.census.gov and is available currently in d8taplex (though not in the form below and not as discoverable as it could be).
I would love to see the other data engines help get out the data!
Part of the idea behind d8taplex is that, by not only discovering but actually interpreting the data sets found online, the user can start to learn something new from all the data being accessible in one place. Naturally, this will mean some functionality in the future allowing for the discovery of relationships between data sets. But before we get there, I've started coding up some ideas behind visualizing the correlation between variables in a single data set.
The first attempt at this employs small multiples to show sparklines of clustered variables. In the example below, we are looking at the consumption of oil by various nations (according to oilcrisis.com).
Clearly some labeling would be useful here (there are tooltips on the dev version of this), but as a proof of concept I like it. The second line includes South Korea, Singapore, Pakistan and Thailand. The oil consumption of these nations can be compared with the third line which includes Belarus, The Russian Federation and Ukraine.
Here are a couple more interesting correlations: (Denmark, Sweden) and (Chile, Iceland):
As these data are correlated, it may make sense to display them with a single sparkline per cluster - in some sense the icon of the cluster - then list the names of the variables.
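To make the cluster-icon idea concrete, here is a minimal sketch of how variables in a single data set might be grouped by pairwise correlation, with the mean of each group serving as the single sparkline "icon". This is my own illustration, not the code behind d8taplex: the greedy threshold clustering and the `0.9` cutoff are assumptions, and real series would first need aligning to a common time axis.

```python
import numpy as np

def correlation_clusters(series, threshold=0.9):
    """Greedily group variables whose pairwise Pearson correlation with
    a seed variable exceeds the threshold. `series` maps a variable
    name to a 1-D array; all arrays are assumed to be aligned in time."""
    names = list(series)
    data = np.array([series[n] for n in names])
    corr = np.corrcoef(data)  # matrix of pairwise Pearson correlations
    clusters = []
    unassigned = set(range(len(names)))
    while unassigned:
        seed = unassigned.pop()
        cluster = [seed]
        for j in list(unassigned):
            if corr[seed, j] >= threshold:
                cluster.append(j)
                unassigned.remove(j)
        clusters.append([names[i] for i in cluster])
    return clusters

def cluster_icon(series, cluster):
    """The 'icon' of a cluster: the element-wise mean of its member
    series, suitable for rendering as a single representative sparkline."""
    return np.mean([series[n] for n in cluster], axis=0)
```

With the oil-consumption example above, South Korea, Singapore, Pakistan and Thailand would fall into one cluster and Belarus, the Russian Federation and Ukraine into another, each reduced to one sparkline plus a list of names.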
Google recently launched a new labs tool called Google Correlate. Given some data (either via a search term you enter or from data you upload), the system ranks all queries (in what I assume is a subset of the entire space of queries issued to Google) in terms of how well they correlate to the object data.
With this, we might notice that searches for 'margaret thatcher' have a reasonable correlation with 'ronald reagan biography', the most correlated term for 'live labs' is '10 meters is how many feet', 'world cup' with 'copa mondial', 'bayes' with 'lithography', etc.
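The core of what Correlate is doing can be sketched in a few lines: compute the Pearson correlation between the target series and every candidate query series, then rank. The query names and weekly volumes below are made up for illustration; Google obviously does this at vastly larger scale over its query logs.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def most_correlated(target, candidates):
    """Rank candidate query series by correlation with the target series,
    most correlated first - the essence of the Correlate idea."""
    return sorted(candidates,
                  key=lambda name: pearson(target, candidates[name]),
                  reverse=True)
```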
Perhaps a little less well known, but predating Google Correlate is Eidosearch. This is similar to the above but with the twist that you can define the query in terms of an arbitrary span of a time series. Watch the video:
I've been having fun with a new search engine for graphs called Zanran. Zanran crawls the web and classifies content in a number of formats, including PDF, Excel, HTML and image. Resources that it classifies as either graphical presentations of data (time series, pie charts, bar graphs, etc.) or tabular data are indexed. When you search over this data set, the results are charts and tables containing data relating to your query.
The site, like any sensible engineering endeavour, is very focused in terms of scenario. There is a simple search / response interaction similar to a web search engine.
One can interact with the search results by hovering over the icon on the left (which also indicates the type of the result document, be it PDF, excel, etc.). This action brings up an overlay which includes the page or other document element containing the data.
Of all the data engines out there, including d8taplex, Infochimps, DataMarket and Timetric, Zanran possibly has the most data. A search for 'temperature' brought up over a million hits.
However, as it has opted for the same interaction paradigm as a search engine, it is forced to optimize for relevance as its core competency in connecting users to data (other models might include browsing by categories, browsing by source, or even browsing by the shape of the data as done by Eidosearch). This is where the site has some real challenges. The site also suffers from some outright failures: searches for 'per' and 'for' produce a server-side error of some sort (possibly due to a high hit count?).
At any rate, I see Zanran as being somewhat aligned with the philosophy behind d8taplex in terms of data sources - the wild web, rather than being limited to open data. However, it is focused on a different user scenario (getting the user to a source with the data rather than getting the data to the user for interaction).
A couple of interesting and very different stories today:
There is a new data engine on the block: Zanran is a search engine for data that is a little closer to a traditional search engine. It doesn't provide mechanisms to visualize or manipulate the data that it uncovers. In some ways, I see it as being far nearer to the vision of d8taplex in that it is hunting down data wherever it may be on the web (see Wild Data and Open Data). On the other hand, as it handles a wider range of formats, including images and various document formats in addition to the Excel and HTML that others deal with, it is more of a challenge for the site to deliver a uniform experience over these disparate data sets. Look for a more in-depth review in a future post.
DataMarket has just launched an experiment in their labs to get users to help rank data: visiting Rate this Data Set, users are shown a time series graph and asked to rate it as interesting or not. While fun, I see this as another experiment in user engagement. All sites, d8taplex included, are experimenting with ways to better engage users. The barriers are high given the precedent set by web search engines and the general lack of data literacy.
I've made a few updates to d8taplex including the addition of a Facebook like button, so if you love data, please like d8taplex!
One of the claims of d8taplex is that it discovers data in the wild, in the nooks and crannies, mountain tops and plains of the web. This data can be found in spreadsheets, HTML, plain text documents and eventually other formats.
When finding raw data in this manner, the system has to do something sensible with duplicate, redundant and similar data sets. Ideally, a fully functioning data engine would recognize all relationships between all data sets and describe them appropriately in any given user interaction.
As a first step, I've implemented a simple similarity mechanism which identifies time series data that has a high potential overlap. This feature is only deployed for data sets on the same host (or source).
The reason this is important is that many sources publishing statistical data will add new, fresh data sets which overlap with the earlier publication of the same variables. For example, a site might publish every month the unemployment statistics for the past 50 years. If each of these were surfaced as a search result, then the user would see an endless list of similar but slightly different data sets.
If the same, or similar, data sets are found on different sites, then currently the results will present all the data sets.
This problem is less of an issue for those data engines that focus primarily on open data from government APIs as there are well established data assets that are singular in nature. Once the semantics of the open data site are internalized by whatever mechanism is used to pull the data set, then it is simple to overwrite the previous version of the data with the freshest.
However, in both the wild data and the open data environment, it is useful to recognize when the same data is present on multiple sites. This allows one to understand the importance of the data and potentially how it has been disseminated on the internet.
I've been learning a lot by following the processes described in Khoi Vinh's design book, applying those principles to some new ideas for d8taplex. Below is a screenshot of where I am in designing a page specifically for the data set.
I've tried to both follow the grid layout design principles and also adopt the golden ratio presentation of the graph as the main constraint. I'm aiming for wide screens, so the width of the overall page is 1200px. The grid is based on 24 50px columns, with widths of 300px, 500px and 400px for the left, middle and right areas.
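For my own sanity I keep a tiny script that checks this arithmetic: that the three rails sum to the page width, that each sits on the 50px grid, and what height a golden-ratio graph should get. The assumption that the main graph fills the 500px middle rail is mine, for illustration.

```python
# Sanity-check of the grid arithmetic described above.
COLUMN = 50                          # px per grid column
COLUMNS = 24
PAGE = COLUMN * COLUMNS              # 1200px overall page width
LEFT, MIDDLE, RIGHT = 300, 500, 400  # rail widths in px

assert LEFT + MIDDLE + RIGHT == PAGE
assert all(w % COLUMN == 0 for w in (LEFT, MIDDLE, RIGHT))

# Golden-ratio graph: given a width, derive the matching height.
PHI = (1 + 5 ** 0.5) / 2

def golden_height(width):
    return round(width / PHI)
```

So a graph spanning the middle rail would be roughly 500 x 309px.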
I've been exploring the use of Wikipedia and other sources of content to describe the sources of data and, in addition to the main central graph, have been looking at using the exploded view of the time series to provide more detail and visibility into the variables in the data set on the right rail.
d8taplex (me) added some more data over the weekend to take the number of time series over the 1 million mark. This data is discovered from over 122 sites and over 50k data sets (tables). As I mentioned in a previous post, I'm going to try to stop thinking about data sets now and try to get the site better designed and functioning.
Note: there is some functionality delivered via Protovis which I find extremely interesting (though simple): the ability to filter visualizations of large data sets dynamically via a simple regex text box. This seems like something important to explore when presenting time series data sets with 100s of variables. I believe I first saw this many years ago on the famous baby names explorer.
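The filter box itself is a small amount of logic. Here is a sketch in Python (Protovis is JavaScript, but the idea is language-neutral): compile whatever the user has typed so far, and treat an invalid pattern - which is the common case mid-keystroke - as no filter at all. The country names are just illustrative.

```python
import re

def filter_series(names, pattern):
    """Filter a large set of variable names down to those matching a
    user-supplied regex, as typed into a dynamic filter box."""
    try:
        rx = re.compile(pattern, re.IGNORECASE)
    except re.error:
        return list(names)  # partially typed / invalid pattern: show all
    return [n for n in names if rx.search(n)]
```

Only the series whose names match are then handed to the charting layer, so a data set with hundreds of variables stays explorable.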