Firstly, what is BuzzData? Functionally, it supports the following features:
creating identity: a user has a profile, etc. with all the normal social capabilities between objects in the BuzzData universe (following people, following data)
uploading data: as per other data markets, BuzzData permits the uploading of data files
associating objects with data: these can be visualizations (note that it doesn't provide its own visualization technology) or articles (discovered online, relating to the data set in question)
searching for data sets: the usual keyword interaction
This set of functionality supports an ecosystem intended to snowball value onto data sets. Users follow data sets, and users curate the data (e.g. I find a visualization of a data set and share it); I can comment on data sets, and so on. Like any ecological system, one has to pursue one of two strategies. Either you provide value to individual users independent of the designed ecosystem (this was exactly the clever part of delicious: it was useful to the user for bookmarking even without all the social effects of discovery and sharing), or you ensure there is no cold-start issue (in the case of BuzzData, this would mean the site launching already rich with data sets).
Whatever the use cases, personas or other design intentions of the site, I'm not sure that BuzzData has yet solved these initial conditions in either of the two ways described above. It doesn't have the data coverage of other sites like Timetric, ZanRan or d8taplex, yet it doesn't provide data tools such as visualization, statistical analysis or manipulation either. Perhaps, however, this points to the intended value proposition of the site - bringing social to data. It is the user base that provides both the coverage and the tools (or will, if things turn out right). That being the case, the data priming challenge is perhaps where the company needs to focus.
Overall, I like the design principles and implementation of the site. True, there are some beta (and alpha) level bugs (I'm having trouble loading up a small data set right now), but that is not exceptional in the highly iterative web application world.
It is going to be very interesting to see how the site grows and evolves as a consequence. Is it a commercial version of IBM's Many Eyes? A twist on DataMarket or InfoChimps? A reimplementation of Swivel (the YouTube of data)?
I've been thinking a little about how to make d8taplex more accessible. One of the challenges is that users don't necessarily know what data - or what variables - are available in the million plus time series in the system. One idea to help surface this sort of information is to mine the tables for concepts in the labels given to the variables.
I've done a little of this and while there is a long way to go, I can see a couple of trends.
Firstly, country names and the names of other geographic and political areas are extremely common. Thus it would seem appealing to provide some sort of location-based pivot into the data.
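As a minimal sketch of what this label mining might look like: match each series label against a gazetteer of place names and group the labels by the places they mention. The labels and the tiny country list below are illustrative, not d8taplex's actual data or code.

```python
# Sketch: surface geographic concepts from time-series labels.
# COUNTRIES is a stand-in gazetteer; a real system would use a far
# larger list of countries, regions and political areas.
COUNTRIES = {"france", "japan", "ukraine", "chile"}

def geo_mentions(labels):
    """Map each mentioned country to the labels that mention it."""
    hits = {}
    for label in labels:
        tokens = set(label.lower().replace(",", " ").split())
        for country in COUNTRIES & tokens:
            hits.setdefault(country, []).append(label)
    return hits

labels = ["Oil consumption, France", "GDP of Japan", "Steel output"]
print(geo_mentions(labels))
```

The inverted mapping (country to labels) is what a location pivot needs: pick a place, see every series that mentions it.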
In the last roll-out of features on d8taplex I included an experimental dynamic filter for data sets. To access it, you expand the graph, click in the filter text box and start typing. As you do, only time series whose names partially match the regular expression you type remain in the graph.
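The core of the filter can be sketched in a few lines: treat the typed text as a case-insensitive regular expression and keep any series whose name it matches anywhere. The series names here are invented for illustration, and a production version would also need to tolerate invalid in-progress patterns.

```python
import re

# Sketch of the dynamic filter: keep only series whose names partially
# match the typed pattern (case-insensitive).
def filter_series(names, pattern):
    rx = re.compile(pattern, re.IGNORECASE)
    return [n for n in names if rx.search(n)]

names = ["Crude oil - Norway", "Natural gas - Norway", "Crude oil - Mexico"]
print(filter_series(names, "crude"))
```

Because `re.search` matches anywhere in the string, a partial fragment like "crude" is enough; the user never has to type a full name.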
I've finally rolled out some updates to d8taplex that I've been tinkering with this summer:
table title extraction: where possible the system now extracts the title of tables and displays it in the results page (it will be added to the data set pages shortly)
correlated data set visualization: as per a few posts on the topic on this blog, I've added a visualization of correlated time series to the data set page allowing you to spot variables that are highly correlated
improved relevance: the system now uses more of the textual context to help rank time series (though there is still lots of work to be done here)
speed improvements: I've made some improvements to the speed of serving search results
I've also introduced a breaking change that improves the id system for data sets, so in some older blog posts that embedded d8taplex data you will now see an error message.
As I've mentioned in several posts about d8taplex, my belief is that there is sufficient data on the web that can be discovered, crawled and automatically interpreted by a system like d8taplex or Timetric. Making the automated access to this data more complex or impossible is against the spirit of open data.
Of course, I'm not 100% sure that the data is not openly crawlable on the Oregon site, but my initial inspection suggests that it isn't - I'd be very happy to be proved wrong on this!
It is 18 months until US citizens will have decided to keep their current president or roll in another. While it may be hard to imagine, this means that chatter is starting now about who will be running for election, and we have already started to hear sound bites about why A is better than B and how party X did this and party Y did that. Naturally, bikini statistics will play a major part in the discussion. Or rather, will be used as a tool to bamboozle the electorate.
It doesn't have to be that way.
With services like d8taplex, Timetric, BuzzData, Socrata and other data engines, there is a real opportunity to help people cut through the mumbo-jumbo and go directly to data assets to help make better informed decisions and, perhaps more importantly, to hold the circus accountable for honesty in the use and presentation of data.
A simple idea that I plan to further play with is to create data sets in d8taplex as well as some specialized visualizations to help people understand a number of key points:
The statistical history of their parties
Relative measures of different countries (what does a country with good health care look like?)
Straightforward presentations of scientific data (should we invest in ethanol?)
I rattled out an example of the first area tonight. The graph below shows the spend on national defense in billions of FY 2000 dollars. Overlaid on this data set are coloured areas that represent the party in power at any given time (red = Republican, blue = Democrat). The data is taken from www.census.gov and is currently available in d8taplex (though not in the form below, and not as discoverable as it could be).
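The overlay itself reduces to a small data transformation: collapse a year-by-year record of the party in power into contiguous spans, each of which can then be drawn as a coloured band behind the series. A sketch of that step, with illustrative years and party labels rather than the census data:

```python
# Sketch: collapse a year-by-year party record into contiguous
# (start_year, end_year, party) spans for drawing coloured bands.
def party_spans(years, parties):
    """Return one span per unbroken run of the same party."""
    spans = []
    start = years[0]
    for i in range(1, len(years)):
        if parties[i] != parties[i - 1]:
            spans.append((start, years[i], parties[i - 1]))
            start = years[i]
    spans.append((start, years[-1], parties[-1]))
    return spans

years = [1997, 1998, 1999, 2000, 2001, 2002]
parties = ["D", "D", "D", "D", "R", "R"]
print(party_spans(years, parties))
```

Each resulting span maps directly to one shaded rectangle on the chart, coloured by its party label.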
I would love to see the other data engines help get out the data!
Part of the idea behind d8taplex is that by not only discovering but actually interpreting data sets found online, the user can start to learn something new from all the data being accessible in one place. Naturally, this will mean some functionality in the future allowing for the discovery of relationships between data sets. But before we get there, I've started coding up some ideas behind visualizing the correlation between variables in a single data set.
The first attempt at this employs small multiples to show sparklines of clustered variables. In the example below, we are looking at the consumption of oil by various nations (according to oilcrisis.com).
Clearly some labeling would be useful here (there are tooltips on the dev version of this), but as a proof of concept I like it. The second line includes South Korea, Singapore, Pakistan and Thailand. The oil consumption of these nations can be compared with the third line which includes Belarus, The Russian Federation and Ukraine.
Here are a couple more interesting correlations (Denmark, Sweden), (Chile, Iceland):
As these data are correlated, it may make sense to display them with a single sparkline per cluster - in some sense the icon of the cluster - then list the names of the variables.
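The clustering behind these small multiples can be sketched with plain Pearson correlation and a greedy grouping pass: each series joins the first existing cluster whose representative it correlates with above a threshold, otherwise it starts a new one. This is an illustrative approximation, not d8taplex's actual clustering code, and the series values are invented.

```python
# Sketch: group variables whose time series are highly correlated.
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def cluster(series, threshold=0.9):
    """Greedily group series correlated (>= threshold) with a cluster's
    first member."""
    clusters = []
    for name, values in series.items():
        for group in clusters:
            if pearson(values, series[group[0]]) >= threshold:
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters

series = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8],   # perfectly correlated with A
    "C": [4, 3, 2, 1],   # anti-correlated with A
}
print(cluster(series))
```

With one cluster per line of sparklines, the first member of each group could serve as the "icon" series suggested above.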
One of the key metrics for any search engine is relevance. A relevance metric essentially estimates how useful a result would be if one could divine the intentions of the searcher. However, it is quite possible that the search results are entirely relevant, but presented in such a way that the user has no idea. Thus one really needs to measure how effectively the presentation of the results conveys the essence of each result object, so that the user can make a judgement as to their next action.
Initially in d8taplex, there was no mechanism to explain to the user why the result was coming back. Part of the challenge with interpreting data in the wild is that there is no guarantee that the system will find the title of the table, the units on the axes, the labels of the time series, etc. The hope is that there is enough information for the user to figure out if the data is of value.
However, there was consistent feedback around there not being enough information present to figure out if the data was relevant or not.
I've just rolled out a simple step to address this. Now, when you search on d8taplex, the cells in the table which contain terms from the query are presented to the user, with the usual highlighting of the specific terms.
In the example below, as a result of the query for "crude oil", matching table cells are presented in the first line of the result object.
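The highlighting step can be sketched as follows: split the query into terms, keep only the cells that contain at least one term, and wrap each occurrence in an emphasis tag. The cell contents and the `<em>` markup are illustrative assumptions, not d8taplex's actual implementation.

```python
import re

# Sketch of result-snippet highlighting: select table cells containing
# query terms and wrap each matched term in <em> tags.
def highlight_cells(cells, query):
    terms = map(re.escape, query.lower().split())
    rx = re.compile("(" + "|".join(terms) + ")", re.IGNORECASE)
    return [rx.sub(r"<em>\1</em>", c) for c in cells if rx.search(c)]

cells = ["Crude oil imports", "Natural gas", "Oil price index"]
print(highlight_cells(cells, "crude oil"))
```

Note that matching term-by-term (rather than on the whole phrase) means a cell like "Oil price index" still surfaces for the query "crude oil", which mirrors the usual search-snippet behaviour.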