To replicate his experiment, follow these steps:
Visit Google Finance and get the MSFT stock data up.
Visit the 'historical prices' link and edit the date range so that it includes a good chunk of data (say, 10 years).
Click the update button to update the data on the page (note that the graph won't update, which appears to be a bug).
Click on the download option on the right hand side (this is not available for all data on Google Finance, btw).
Once the data is downloaded, upload it to Google Documents as a spreadsheet and delete all but the first column (dates) and the second-to-last column (closing price). You might see a server error when deleting columns; just ignore it and keep going.
Now download the result, open it in a text editor, and delete the first line.
Next, upload this data to Google Correlate.
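The manual spreadsheet steps (keep the date and closing-price columns, drop the header line) can also be scripted. Here is a minimal sketch, assuming the downloaded CSV has a header row and holds the closing price in the second-to-last column, as described above; `prepare_for_correlate` is just an illustrative name:

```python
import csv

def prepare_for_correlate(in_path, out_path):
    """Keep only the date and closing-price columns; drop the header row."""
    with open(in_path, newline="") as f:
        rows = list(csv.reader(f))
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        # Skip the header (first line); keep column 0 (date) and the
        # second-to-last column (closing price), as in the manual steps.
        for row in rows[1:]:
            writer.writerow([row[0], row[-2]])
```

The output is the same two-column, headerless file the manual steps produce.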
I did this for a couple of other stocks: Google and IBM. For IBM, the correlated terms are as shown below:
While there are various views on what value of Pearson's r constitutes a good or strong correlation, it is clear from the above that these terms are more strongly correlated with IBM's stock than the terms Paul showed for Microsoft's and, importantly, that they admit no intuitive explanation - what explains the relationship between 'mi novio' (Spanish for 'my boyfriend') and IBM?
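For reference, the r in question is Pearson's product-moment correlation coefficient, which is straightforward to compute directly. A minimal sketch (the function name is mine; it assumes neither series is constant):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Values near +1 or -1 indicate a strong linear relationship; values near 0, none.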
You can see the relationship between the top correlated term and the stock below:
One of the key metrics for any search engine is relevance. A metric of relevance essentially estimates how useful the result was, if one could divine the intentions of the searcher. However, it is quite possible that the search results are entirely relevant but presented in such a way that the user has no idea. Thus one really needs to measure how effectively the presentation of the results explains to the user the essence of the result object, so that they can make a judgement as to their next action.
Initially in d8taplex, there was no mechanism to explain to the user why the result was coming back. Part of the challenge with interpreting data in the wild is that there is no guarantee that the system will find the title of the table, the units on the axes, the labels of the time series, etc. The hope is that there is enough information for the user to figure out if the data is of value.
However, there was consistent feedback around there not being enough information present to figure out if the data was relevant or not.
I've just rolled out a simple step to address this. Now, when you search on d8taplex, the cells in the table which contain terms from the query are presented to the user, with the usual highlighting of the specific terms.
In the example below, for the query "crude oil", matching table cells are presented in the first line of the result object.
The Dohop blog points to some great work by Eric Fischer which provides a visualization contrasting a local's view of a city with that of a tourist. By looking at the frequency of photographs uploaded by a user, Fischer figured out approximately which were taken by locals (many pictures over time from the same city) and which by tourists (pictures only over a short period in a city). The resulting plots give an indication of how these different personas view cities.
Blue points indicate concentrations of images by locals; red, by tourists.
Google recently launched a new labs tool called Google Correlate. Given some data (either via a search term you enter or from data you upload), the system ranks all queries (in what I assume is a subset of the entire space of queries issued to Google) in terms of how well they correlate to the object data.
With this, we might notice that searches for 'margaret thatcher' have a reasonable correlation with 'ronald reagan biography', the most correlated term for 'live labs' is '10 meters is how many feet', 'world cup' with 'copa mondial', 'bayes' with 'lithography', etc.
Perhaps a little less well known, but predating Google Correlate, is Eidosearch. This is similar to the above, but with the twist that you can define the query in terms of an arbitrary span of a time series. Watch the video:
I've been having fun with a new search engine for graphs called Zanran. Zanran crawls the web and classifies content in a number of formats including PDF, Excel, HTML and images. Resources that it classifies as either graphical presentations of data (time series, pie charts, bar graphs, etc.) or tabular data are indexed. When you search over this data set, the results are charts and tables containing data relating to your query.
The site, like any sensible engineering endeavour, is very focused in terms of scenario. There is a simple search / response interaction similar to a web search engine.
One can interact with the search results by hovering over the icon on the left (which also indicates the type of the result document, be it PDF, excel, etc.). This action brings up an overlay which includes the page or other document element containing the data.
Of all the data engines out there, including d8taplex, Infochimps, DataMarket and timetric, Zanran possibly has the most data. A search for 'temperature' brought up over a million hits.
However, as it has opted for the same interaction paradigm as a search engine, it is forced to optimize for relevance as its core competency in connecting users to data (other models might include browsing by category, browsing by source, or even browsing by the shape of the data, as done by Eidosearch). This is where the site has some real challenges. It also suffers from some outright failures: searches for 'per' and 'for' produce a server-side error of some sort (possibly due to a high hit count?).
At any rate, I see Zanran as somewhat aligned with the philosophy behind d8taplex in terms of data sources - the wild web, rather than being limited to open data. However, it is focused on a different user scenario (getting the user to a source with the data rather than getting the data to the user for interaction).
A couple of interesting and very different stories today:
There is a new data engine on the block: Zanran is a search engine for data that is a little closer to a traditional search engine, in that it doesn't provide mechanisms to visualize or manipulate the data it uncovers. In some ways, I see it as being far nearer to the vision of d8taplex in that it is hunting down data wherever it may be on the web (see Wild Data and Open Data). On the other hand, as it handles a wider range of formats - including images and various document formats in addition to the Excel and HTML that others deal with - it is more of a challenge for the site to deliver a uniform experience over these disparate data sets. Look for a more in-depth review in a future post.
DataMarket has just launched an experiment in their labs to get users to help rank data: visiting Rate this Data Set, users are shown a time series graph and asked to rate it as interesting or not. While this is fun, I see it as another experiment in user engagement. All sites, d8taplex included, are experimenting with ways to better engage users. The barriers are high given the precedent set by web search engines and the general lack of data literacy.
I've made a few updates to d8taplex including the addition of a Facebook like button, so if you love data, please like d8taplex!
The BBC website is held in extremely high regard by web designers. Having briefly explored the grid and golden ratio as design principles, I'm interested in deconstructing the BBC site to see what I can learn from it.
The first thing I observed is that the image ratios do not hold to the golden ratio. The main story image is 304 wide and 171 high, a ratio of 1.78.
This ratio is repeated for the two smaller image sizes associated with sub stories in the main area and also those in the right column.
Other (non-news) areas of the site have a couple of other sizes. The large lead image on the sports area maintains the ratio at 464 x 261.
But the smaller thumbnail for video elements on the right column of the sports section has a ratio consistent with the video player (66 x 49, a ratio of 1.35).
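The ratios quoted above are easy to check; note that 1.78 is essentially the 16:9 widescreen ratio rather than the golden ratio (~1.62). A quick sketch (the labels are mine):

```python
# Aspect ratios (width / height) of the image sizes measured above.
sizes = {
    "main story": (304, 171),
    "sports lead": (464, 261),
    "video thumbnail": (66, 49),
}
ratios = {name: round(w / h, 2) for name, (w, h) in sizes.items()}
print(ratios)  # -> {'main story': 1.78, 'sports lead': 1.78, 'video thumbnail': 1.35}
```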
The site has two main columns. The primary column has a width of 640px and contains two sub-columns of 320px each.
The right-hand column is slightly wider than these sub-columns, coming in at around 340px, suggesting a fine-grained grid metric of 20px.
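The 20px guess can be sanity-checked: 20 is the greatest common divisor of the measured widths (treating my approximate 340px reading as exact):

```python
from functools import reduce
from math import gcd

# Widths observed above: primary column, its sub-columns, right-hand column.
widths = [640, 320, 340]
grid_unit = reduce(gcd, widths)
print(grid_unit)  # -> 20
```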
The navigational tabs at the top of the page, which expand out to two rows, do not align with the columnar structure and are designed to fit the width of the text in the tab control rather than some standard tab width.
The Wall Street Journal has created an animated presentation of a week of Foursquare checkins focused on New York. The analysis also covers top checkin events in other regions, gender differences and a comparison between New York and the Bay Area.