There is a new category of online experience - the data engine. These sites (which include timetric, Data Market, Infochimps, The Guardian's Data Store, Nation Master and my own experimental site d8taplex) represent the intermediary between the formal data being released by many networked organizations and researchers (and many other data sources besides) and a user base spanning data journalists, data geeks and an unwary public.
There are two central challenges facing these utilities: collecting data, and providing mechanisms that allow users to discover, explore and interact with data sets.
Data collection happens in many ways, including crawling government sites, focused crawling of the web at large, and business models which provide marketplaces for the exchange of data. It is in the aggregate body of data that these systems are first differentiated.
The second differentiation point is the mechanisms used to connect users with data sets and to let them explore, manipulate and modify these assets. The data, while available via the web, is wholly different from 'traditional' web data. Notions of importance, popularity and quality are unfamiliar here. In fact, while one might assume that the data, being of a more structured form, is accurate, it is not uncommon to find data sets from the same organization describing the same facts but with different values.
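As a rough illustration of that last observation, here is a minimal Python sketch of how a data engine might flag facts that are reported with different values across data sets. The record layout (data set name, fact key, value), the function name and the sample figures are all hypothetical and are not taken from any of the sites mentioned.

```python
from collections import defaultdict


def find_conflicts(records):
    """Return {fact: {value: [data set names]}} for facts with more than one value.

    `records` is an iterable of (dataset_name, fact_key, value) tuples.
    This layout is an assumption made for the sketch.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for dataset, fact, value in records:
        seen[fact][value].append(dataset)
    # Keep only the facts for which at least two distinct values were reported.
    return {fact: dict(values) for fact, values in seen.items() if len(values) > 1}


# Invented sample records: two tables from one publisher disagree on one fact.
sample = [
    ("agency_table_a", ("population", 2009), 1200),
    ("agency_table_b", ("population", 2009), 1250),
    ("agency_table_a", ("population", 2010), 1300),
    ("agency_table_b", ("population", 2010), 1300),
]

conflicts = find_conflicts(sample)
```

In this invented sample only the 2009 figure is flagged, because both tables agree on 2010.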
While most of the sites mentioned above offer some form of search, their responses fall into two essential types. First, there are those (listed below) that offer a graphical response. Second, there are those that provide a more textual interaction.
timetric indicates data via sparklines (which brings clarity, but removes the data from the context within which it is published).
Data Market indicates the presence of graphical data but uses a single, large icon for all results which tantalizes the user.
With d8taplex, being an experiment in HTML5 technologies, I opted for a combination of fully interactive, small scale graphs and snippets of the original table.
As these variants suggest, the exciting thing about this space is that the objects of interest are complex yet direct connections can be made with the consumer visually. A reasonably small graphic (such as the sparkline or full graph) can summarize very large volumes of data. In fact, on a personal note, I find these interactions to be rewarding purely due to the discovery of interestingly shaped data.
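To make the idea of a small graphic summarizing a large series concrete, here is a minimal Python sketch of a text sparkline. It is not how timetric or d8taplex render their graphics. The function names, the default width and the choice to average chunks of the series are assumptions made for illustration.

```python
BARS = "▁▂▃▄▅▆▇█"  # eight Unicode block characters, lowest to highest


def downsample(values, width):
    """Average consecutive chunks so that at most `width` points remain."""
    n = len(values)
    if n <= width:
        return list(values)
    points = []
    for i in range(width):
        # Integer arithmetic keeps chunk boundaries exact and non-empty.
        start = i * n // width
        end = (i + 1) * n // width
        chunk = values[start:end]
        points.append(sum(chunk) / len(chunk))
    return points


def sparkline(values, width=40):
    """Render a numeric series as a string of at most `width` bar characters."""
    points = downsample(values, width)
    if not points:
        return ""
    lo, hi = min(points), max(points)
    span = hi - lo
    if span == 0:
        # A flat series has no shape to show, so use the lowest bar throughout.
        return BARS[0] * len(points)
    scale = len(BARS) - 1
    return "".join(BARS[round((p - lo) / span * scale)] for p in points)


# A thousand data points reduced to a twenty-character summary.
summary = sparkline(list(range(1000)), width=20)
```

Averaging hides short spikes, so a real site might prefer to keep the minimum and maximum of each chunk. Even so, the shape of a long series survives in a handful of characters.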
This last point leads to the next challenge: selection. Given a set of data sets which mention a specific thing (be it votes or herons) which one is going to be of value to the user? In fact, how will users even begin to model the data sets that they are interested in and how will these data engines evolve to capture this model?
The third differentiation point is what you can do with the data once you have it. Some of these sites allow you to download various versions of the data; others allow you to view, manipulate and overlay the data in graphical form. timetric, for example, provides a large graph and various manipulation functions.
The good news is that we are going to see all of this unfold in the near future as the salience and value of these data sets ride a wave of increasing data literacy at all levels.
Good post. BTW, one of my favorite data sites is FreeBase.com (now owned by Google). I have two sites off of it.
Posted by: AnalyticalWay | February 20, 2011 at 12:12 PM