Below is the first part of a post in response to Chris Anderson's latest cover article in Wired magazine, entitled "The End of Theory: The Data Deluge Makes The Scientific Method Obsolete". I had the whole post finished and ready to go, but hadn't counted on the poor quality of the software that Six Apart (which hosts this blog) has rolled out in the latest version of its post editing system. Most of my work was lost while checking spelling. Rather than try to recover it, I'll point you to John Timmer's post at Ars Technica.
First of all, let's be abstract. A system S produces a number of events. Any single event E may generate a set of observable data O. This data is interpreted by some observer and may be recorded in some form. The system might be the weather, the event a hurricane, and the observable data the change in atmospheric pressure.
Now, let's imagine that you have a big collection of data. You can look back at it and say, "I saw X, then I saw Y" (perhaps two subsequent readings of a barometer). In fact, if you saw X right now, you might be inclined to say that you expect to see Y shortly afterwards. However, let's imagine that you see X', a value that appears nowhere in your collection. What could you say about your expectations for the next reading? Without some model, you can't really say anything. Now this model might be a model of the data. That is to say, you might fit a function to the data and use that to predict the next point. Or the model could be that of the underlying system (which you can't observe directly). Either way, you have stepped over the line from data to a model.
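To make the barometer case concrete, here is a minimal sketch in Python. The readings are invented for illustration, and the names lookup_next, fit_line and model_next are hypothetical helpers, not anything from Anderson's article. The first function only consults the recorded data; the second fits a straight line (next reading as a function of the current one) by ordinary least squares, which is just one possible "model of the data".

```python
# Hypothetical barometer readings in hPa, invented for illustration only.
readings = [1012.0, 1010.0, 1008.0, 1006.0, 1004.0]


def lookup_next(history, x):
    """Data only: return what followed x the first time x was recorded.

    Returns None when x never appears as a 'current' reading in the history.
    """
    for current, following in zip(history, history[1:]):
        if current == x:
            return following
    return None


def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def model_next(history, x):
    """Model of the data: fit next = slope * current + intercept, then predict."""
    slope, intercept = fit_line(history[:-1], history[1:])
    return slope * x + intercept


seen = 1010.0    # X: a reading that is in the collection
unseen = 1009.0  # X': a reading that was never recorded

print(lookup_next(readings, seen))    # the data alone can answer this
print(lookup_next(readings, unseen))  # the data alone has nothing to say
print(model_next(readings, unseen))   # the fitted line still makes a prediction
```

Whether the fitted line's prediction is any good is a separate question; the point is only that some prediction for X' becomes possible once a model is in place, and never before.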
The really neat thing about models is they allow us to peer through the thin veneer of data and glimpse the next layer of the world. They extend our context and our understanding. In addition, and from a more utilitarian point of view, they allow us to predict things in the future based on observations that haven't been made before.
Another example. I come across a word that I've never seen before. Immediately, I can use that word and apply all manner of morphology to it such that others who speak my language can understand. It is new data, but because I have at some level abstracted the language (we might say, modeled the language) I can painlessly handle that novelty.
And yet another example. No matter how many times we observe an apple falling from a tree, our data will tell us nothing about the trajectory required to send a rocket to the moon. The only questions we can ask of the data are about the past, limited to events that have already occurred. Without a model, without a theory of the underlying system S, we can't ask questions about things that we have never experienced.