Of all the concepts thrown about in the web and data mining space, data, information, knowledge and content seem to be the most important, but also the most overloaded and slippery. In many situations this isn't much of a problem, but when building large-scale systems that are intended to derive one (e.g. knowledge) from another (e.g. data), it becomes vital to get some idea of where the boundaries are.
The definition in Wikipedia for data is a reasonable start, but, for me, not firm enough:
The term data means groups of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data (plural of datum, which is seldom used) are typically the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived. Raw data refers to a collection of numbers, characters, images or other outputs from devices that collect information to convert physical quantities into symbols, that are unprocessed.
I need a bit more.
To me, data must include:
- a representation: the set of symbols used to capture the data
- an intention: the reason for capturing the data
- an object concept: the thing that the data is about
- a semantics: a set of 'rules' for interpreting the representation
Think of a sequence of numbers representing a sample of frequencies recorded from a microphone pointed at a songbird. We might represent those frequencies as numbers. The numbers have to be represented as symbols, and those symbols have to be known to map to frequencies of sound waves in a certain unit of measure. The object concept is the sound wave, and the intention is to capture the song in order to determine (for example) its characteristics.
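To make the four parts concrete, here is a minimal sketch of the songbird example as a Python record. The class name `Datum`, the helper `parse_hertz`, the choice of hertz as the unit and the frequency values are all my own illustrative assumptions, not measurements or an established schema.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Datum:
    """Hypothetical record holding the four parts listed above."""
    representation: str                      # the symbols as captured
    intention: str                           # why the data was captured
    object_concept: str                      # the thing the data is about
    semantics: Callable[[str], List[float]]  # rule for interpreting the symbols


def parse_hertz(symbols: str) -> List[float]:
    """Read whitespace-separated decimal numerals as frequencies in Hz."""
    return [float(token) for token in symbols.split()]


# Made-up sample values, for illustration only.
birdsong = Datum(
    representation="2100 2350 2800 3100",
    intention="determine the characteristics of the song",
    object_concept="sound wave recorded from a songbird",
    semantics=parse_hertz,
)

# Applying the semantics to the representation yields usable values.
frequencies_hz = birdsong.semantics(birdsong.representation)
```

Without the `semantics` field the string "2100 2350 2800 3100" is just a row of symbols; with it, the same string can be read as frequencies.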
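I believe all of this belongs in a definition because I look at data from the perspective of archiving and discovery. If I come across a document with numbers written on it, is that data? Without knowing what the numbers are about, why they were captured and how to read them, I can't really say.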
Of course, the interesting thing about capturing these details of data is that they can then be layered into arbitrary representations: electrons to binary, binary to digital, digital to Unicode, Unicode to XML, and so on.
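As a rough sketch of that layering, the snippet below walks one small payload up through three of the layers: bytes, Unicode text, and XML, and finally numbers with a unit. The element names `sample` and `f`, the `unit` attribute and the values are hypothetical, chosen only to illustrate the idea.

```python
import xml.etree.ElementTree as ET

# Layer 1: raw bytes, as they might sit on disk or on the wire.
raw_bytes = b'<sample unit="Hz"><f>2100</f><f>2350</f></sample>'

# Layer 2: bytes interpreted as Unicode text (semantics: UTF-8).
text = raw_bytes.decode("utf-8")

# Layer 3: Unicode text interpreted as an XML tree (semantics: XML syntax).
root = ET.fromstring(text)

# Layer 4: XML content interpreted as frequencies
# (semantics: this made-up schema, where each <f> is a decimal number).
unit = root.get("unit")
frequencies = [float(f.text) for f in root.findall("f")]
```

Each step needs its own rule of interpretation, and the output of one layer becomes the representation for the next.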