One of the claims of d8taplex is that it discovers data in the wild, in the nooks and crannies, mountain tops and plains of the web. This data can be found in spreadsheets, HTML, plain text documents and eventually other formats.
When finding raw data in this manner, the system has to do something sensible with duplicate, redundant and similar data sets. Ideally, a fully functioning data engine would recognize all relationships between all data sets and describe them appropriately in any given user interaction.
As a first step, I've implemented a simple similarity mechanism which identifies time series data that has a high potential overlap. This feature is only deployed for data sets on the same host (or source).
The reason this is important is that many sources publishing statistical data will add new, fresh data sets which overlap with the earlier publication of the same variables. For example, a site might publish every month the unemployment statistics for the past 50 years. If each of these were surfaced as a search result, then the user would see an endless list of similar but slightly different data sets.
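The kind of same-host overlap check described above can be sketched roughly as follows. This is an illustrative assumption, not d8taplex's actual implementation, which is not public: the function names, the dataset structure, and the 0.8 threshold are all hypothetical.

```python
# Hypothetical sketch of a same-host overlap check for time series
# data sets. Each data set is a dict with a "name", a "host" (source),
# and a "series" mapping timestamps to values.

def overlap_score(series_a, series_b):
    """Fraction of the smaller series whose (timestamp, value)
    pairs also appear in the other series."""
    a, b = set(series_a.items()), set(series_b.items())
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def near_duplicates(datasets, threshold=0.8):
    """Return pairs of same-host data sets whose series mostly overlap."""
    flagged = []
    for i in range(len(datasets)):
        for j in range(i + 1, len(datasets)):
            d1, d2 = datasets[i], datasets[j]
            if d1["host"] != d2["host"]:
                continue  # only compare data sets from the same source
            if overlap_score(d1["series"], d2["series"]) >= threshold:
                flagged.append((d1["name"], d2["name"]))
    return flagged
```

Under this sketch, a monthly re-publication of the same 50-year series would score near 1.0 against last month's version and be collapsed into a single search result, while genuinely distinct data sets from the same host would fall below the threshold.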
If the same, or similar, data sets are found on different sites, the results currently present all of them.
This problem is less of an issue for data engines that focus primarily on open data from government APIs, since those are well-established data assets that are singular in nature. Once the semantics of the open data site are internalized by whatever mechanism pulls the data set, it is simple to overwrite the previous version of the data with the freshest one.
However, in both the wild data and the open data environment, it is useful to recognize when the same data is present on multiple sites. This allows one to understand the importance of the data and, potentially, how it has been disseminated across the internet.
See a related discussion on the BuzzData blog.