Of all the concepts thrown about in the web and data mining space, data, information, knowledge and content seem to be the most important, but also the most overloaded and slippery. In many situations this isn't much of a problem, but when building large-scale systems that are intended to derive one (e.g. knowledge) from another (e.g. data), it becomes vital to get some idea of where the boundaries are.
The definition in Wikipedia for data is a reasonable start, but, for me, not firm enough:
The term data means groups of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data (plural of datum, which is seldom used) are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived. Raw data refers to a collection of numbers, characters, images or other outputs from devices that collect information to convert physical quantities into symbols, that are unprocessed.
I need a bit more.
To me, data must include:
- a representation: the set of symbols used to capture the data
- an intention: the reason for capturing the data
- an object concept: the thing the data is about
- a semantics: a set of 'rules' for interpreting the representation
Think of a sequence of numbers representing a sample of frequencies recorded from a microphone pointed at a song bird. We might represent those frequencies as numbers. The numbers have to be written down as symbols, and those symbols have to be known to map to frequencies of sound waves in a certain unit of measure. The object concept is the sound wave, and the intention is to capture the song in order to determine (for example) its characteristics.
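As a rough sketch of how those four components might sit together in code, here is the song bird example as a small Python record. The class name `Data`, the helper `hertz_from_decimal`, the sample values and the choice of hertz as the unit are all my own assumptions for illustration, not part of any standard definition.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Data:
    """Hypothetical container for the four components listed above."""
    representation: Sequence[str]      # the symbols as captured
    intention: str                     # why the data was captured
    object_concept: str                # what the data is about
    semantics: Callable[[str], float]  # rule for interpreting one symbol

    def interpret(self) -> list[float]:
        """Apply the semantics to every symbol in the representation."""
        return [self.semantics(symbol) for symbol in self.representation]


def hertz_from_decimal(symbol: str) -> float:
    """Assumed rule: each symbol is a decimal string giving a frequency in Hz."""
    return float(symbol)


birdsong = Data(
    representation=["2100", "2350.5", "1980"],  # made-up sample values
    intention="determine the characteristics of a bird's song",
    object_concept="sound wave recorded by a microphone",
    semantics=hertz_from_decimal,
)

frequencies_hz = birdsong.interpret()
```

Strip away everything except `representation` and you are left with a list of strings that could mean almost anything, which is roughly the position of someone who finds numbers on a page with no context.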
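I believe all of this belongs in a definition because I look at data from the perspective of archiving and discovery. If I come across a document with numbers written on it, is that data? On the definition above, not until I can recover what the numbers are about, why they were captured and how to interpret them.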
Of course, the interesting thing about capturing these details is that they can then be layered into arbitrary representations: electrons to binary, binary to digital, digital to Unicode, Unicode to XML, and so on.
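To make the layering a little more concrete, here is a minimal Python sketch in which each layer is just an encoding rule paired with the decoding rule (its semantics) that undoes it. The particular layers chosen (numbers to text, text to UTF-8 bytes, bytes to bits), the function names and the sample values are assumptions for illustration only.

```python
def numbers_to_text(frequencies):
    """Layer 1: numbers written as comma-separated decimal symbols."""
    return ",".join(str(f) for f in frequencies)


def text_to_bytes(text):
    """Layer 2: Unicode text encoded as UTF-8 bytes."""
    return text.encode("utf-8")


def bytes_to_bits(raw):
    """Layer 3: each byte written out as eight binary digits."""
    return "".join(f"{byte:08b}" for byte in raw)


def bits_to_bytes(bits):
    """Undo layer 3: read the bit string back eight digits at a time."""
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


def bytes_to_text(raw):
    """Undo layer 2: decode UTF-8 bytes back to text."""
    return raw.decode("utf-8")


def text_to_numbers(text):
    """Undo layer 1: parse the decimal symbols back into numbers."""
    return [float(part) for part in text.split(",")]


sample = [2100.0, 2350.5, 1980.0]  # made-up frequencies
bits = bytes_to_bits(text_to_bytes(numbers_to_text(sample)))
recovered = text_to_numbers(bytes_to_text(bits_to_bytes(bits)))
```

The bit string on its own is only recoverable as frequencies if the rules for every layer travel with it, which is the archiving concern in a nutshell.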


Prescriptivist grammar experts should run screaming from the room right now: it will save time and avoid disturbing other readers.
It is important to be clear that data is a collection of individual datums. People who say that data are collections of individual data points are just confused.
Seriously, I agree that data has to be ABOUT something. I think there are (imagined) circumstances under which you could truthfully say "we had no data on the potential for long term pollution underwater cities until we realized that our existing measurements of ocean salinity were relevant".
Posted by: www.facebook.com/profile.php?id=691375626 | May 22, 2010 at 12:07 PM
…pollution due to underwater…
Posted by: www.facebook.com/profile.php?id=691375626 | May 22, 2010 at 12:08 PM