June 25, 2008

Chris Anderson: Aware of All Statistical Traditions (with bonus fall course announcement)

Attention conservation notice: Someone is wrong in Wired magazine.

I recently made the mistake of trying to kill some waiting-room time with Wired. (Yes, I should know better.) The cover story was a piece by editor Chris Anderson, about how having lots of data means we can just look for correlations by data mining, and drop the scientific method in favor of statistical learning algorithms. Now, I work on model discovery, but this struck me as so thoroughly, and characteristically, foolish — "saucy, ignorant contrarianism", indeed — that I thought I was going to have to write a post picking it apart. Fortunately, Fernando Pereira (who actually knows something about machine learning) has said, crisply, what needs to be said about this. I hope he won't mind (or charge me) if I quote him at length:

I like big data as much as the next guy, but this is deeply confused. Where does Anderson think those statistical algorithms come from? Without constraints in the underlying statistical models, those "patterns" would be mere coincidences. Those computational biology methods Anderson gushes over all depend on statistical models of the genome and of evolutionary relationships.

Those large-scale statistical models are different from more familiar deterministic causal models (or from parametric statistical models) because they do not specify the exact form of observable relationships as functions of a small number of parameters, but instead they set constraints on the set of hypotheses that might account for the observed data. But without well-chosen constraints — from scientific theories — all that number crunching will just memorize the experimental data.

I might add that anyone who thinks the power of data mining will let them write a spam filter without understanding linguistic structure deserves the in-box they'll get; and that anyone who thinks they can overcome these obstacles by chanting "Bayes, Bayes, Bayes", without also employing exactly the kind of constraints Pereira mentions, is simply ignorant of the relevant probability theory.
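Pereira's remark that unconstrained "patterns" would be mere coincidences can be illustrated with a small simulation. The sketch below is not from Anderson's or Pereira's pieces; the function names, sample sizes and seed are all assumptions chosen for illustration. It generates a target and a large pool of candidate features that are pure independent noise, "mines" the pool for the feature most correlated with the target, and then checks the same relationship on fresh draws.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def spurious_correlation_demo(n_samples=20, n_features=2000, n_fresh=2000, seed=0):
    """Return (in-sample, fresh-data) correlation for the best-looking noise feature.

    Everything here is independent Gaussian noise (an illustrative assumption),
    so any correlation that turns up is a coincidence by construction.
    """
    rng = random.Random(seed)

    def noise(n):
        return [rng.gauss(0.0, 1.0) for _ in range(n)]

    target = noise(n_samples)
    features = [noise(n_samples) for _ in range(n_features)]

    # "Data mining" with no constraints: keep whichever feature happens
    # to correlate most strongly with the target in this sample.
    best = max(features, key=lambda f: abs(pearson(f, target)))
    in_sample = pearson(best, target)

    # The winning feature is still independent of the target, so new
    # observations of both are just two fresh, unrelated noise series.
    out_of_sample = pearson(noise(n_fresh), noise(n_fresh))
    return in_sample, out_of_sample


if __name__ == "__main__":
    in_r, out_r = spurious_correlation_demo()
    print(f"best in-sample correlation: {in_r:+.2f}")
    print(f"same relationship on fresh data: {out_r:+.2f}")
```

With a few thousand candidates and only twenty observations, the selected feature should show a strong in-sample correlation that disappears on new data, which is the sense in which pattern-hunting without constraints only memorizes the sample.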

By coincidence, I am going to teach our data mining course (36-350) again in the fall. The theme for the semester, which I decided on back in the spring, will be "waste, fraud and abuse" — not so much detecting suspicious activity, though some examples of that might be fun, as warnings against wasteful, fraudulent and/or abusive data mining.

Update, 29 June: see next post.

Update, 2 July: A correspondent writes to let me know that Anderson's essay and the linked pieces from Wired are up at Edge.org, along with responses from some of the other clients of John Brockman's literary agency (that is, "leading public intellectuals" associated with that site). So far, the only one whose reaction is both substantial and not completely clueless is Danny Hillis, who politely says that Anderson's idea does not have "even a little bit of truth in it".

There's no reason we couldn't have an interesting public discussion about what big data, and data-mining, could contribute to science. We already have a very large and successful scientific discipline which routinely generates and deals with petabytes of data, namely experimental high-energy physics. Its example suggests that theory becomes more rather than less important with huge volumes of data. That may not hold for the biological and social sciences, but I'd like some argument as to why. Of course, if one looks at actually-existing quantitative models in those sciences, it seems clear that part of what they are doing is representing scientists' substantive knowledge and/or guesses, but another part is just put in for tractability, especially statistical tractability — linear or logistic dependence, Gaussian noise, etc., etc. One of the things modern statistics and big data could do is to drastically weaken those tractability constraints. (To repeat a slogan from my class, "More science, fewer t-tests.")
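As a rough illustration of weakening a tractability constraint, the sketch below compares a straight-line fit with a simple local-average smoother on simulated data. The sine-shaped relationship, the noise level, the rule of thumb k ≈ √n and all function names are assumptions made for the example, not anything taken from the post or from a real data set. The smoother is deliberately crude: it averages the k training points adjacent in sorted order rather than the exact nearest neighbours.

```python
import bisect
import math
import random


def simulate(n, rng, noise_sd=0.3):
    """Draw n points from y = sin(x) + Gaussian noise (an assumed toy model)."""
    xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise_sd) for x in xs]
    return xs, ys


def fit_line(xs, ys):
    """Ordinary least squares for a single predictor; returns a predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_local_average(xs, ys, k):
    """Predict by averaging the k training responses around x in sorted order."""
    pairs = sorted(zip(xs, ys))
    sx = [p[0] for p in pairs]
    sy = [p[1] for p in pairs]
    n = len(sx)
    k = min(k, n)

    def predict(x):
        i = bisect.bisect_left(sx, x)
        # Centre a window of k points on the insertion position,
        # clipped so it stays inside the training data.
        lo = max(0, min(i - k // 2, n - k))
        return sum(sy[lo:lo + k]) / k

    return predict


def mse(predict, xs, ys):
    """Mean squared prediction error on the given points."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def compare(n_train, n_test=1000, seed=1):
    """Return held-out (linear MSE, local-average MSE) for a given sample size."""
    rng = random.Random(seed)
    train = simulate(n_train, rng)
    test = simulate(n_test, rng)
    k = max(1, int(math.sqrt(n_train)))
    return mse(fit_line(*train), *test), mse(fit_local_average(*train, k), *test)


if __name__ == "__main__":
    for n in (50, 500, 5000):
        linear, local = compare(n)
        print(f"n={n:5d}  linear MSE={linear:.3f}  local-average MSE={local:.3f}")
```

In this toy setting the linear model's error is dominated by its misspecification and does not improve with more data, while the smoother, which assumes only that nearby inputs have similar responses, can get close to the noise floor as the sample grows. The smoother still encodes a constraint (smoothness); it is only the convenient-but-wrong functional form that has been dropped.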

We could have a conversation about these matters. But its participants would have to know something about scientific practice, about statistics and about data-mining. Some of these participants might even argue quite strongly that discovery can be automated, if one goes about it the right way. If someone — say, a literary agent and impresario whose client list includes just about every well-known popular science writer in America — wanted to organize such a discussion, it would certainly be possible and a contribution to public enlightenment. That would, however, require such impresarios to have somewhat more critical acumen than a puppy, which evidently is not the case. So the actually-existing conversation is a source not of light but of noise.

Why oh why can't we have a better consciousness industry?

Manual trackback: Entertaining Research; Tongue but no door; O Hermenauta; Whimsley; Quantum of Wantum; The Statistical Mechanic; Lies and Stats; sciber

Enigmas of Chance; Corrupting the Young

Posted by crshalizi at June 25, 2008 15:43

Three-Toed Sloth: Hosted, but not endorsed, by the Center for the Study of Complex Systems