Attention conservation notice: Someone is wrong in Wired magazine.

I recently made the mistake of trying to kill some waiting-room time
with Wired. (Yes, I should know better.) The cover story was a piece
by editor Chris Anderson, about how having lots of data means we can just look
for correlations by data mining, and drop the scientific method in favor of
statistical learning algorithms. Now, I *work* on model discovery, but this
struck me as so thoroughly, and characteristically, foolish — "saucy,
ignorant contrarianism", indeed — that I thought I was going to have
to write a post picking it apart. Fortunately, Fernando Pereira (who actually
knows something about machine learning) has said, crisply, what needs to be
said about this. I hope he won't mind (or charge me) if I quote him at length:

I like big data as much as the next guy, but this is deeply confused. Where does Anderson think those statistical algorithms come from? Without constraints in the underlying statistical models, those "patterns" would be mere coincidences. Those computational biology methods Anderson gushes over all depend on statistical models of the genome and of evolutionary relationships. Those large-scale statistical models are different from more familiar deterministic causal models (or from parametric statistical models) because they do not specify the exact form of observable relationships as functions of a small number of parameters, but instead they set constraints on the set of hypotheses that might account for the observed data. But without well-chosen constraints — from scientific theories — all that number crunching will just memorize the experimental data.

I might add that anyone who thinks the power of data mining will let them
write a spam filter without understanding linguistic
structure deserves the in-box they'll get; and that anyone who thinks they
can overcome these obstacles by chanting "Bayes, Bayes, Bayes", without also
employing
*exactly* the kind of constraints Pereira mentions, is simply
ignorant of the relevant
probability theory.
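
A minimal sketch of the "mere coincidences" point, not taken from Pereira's post or Anderson's article: the snippet below generates variables that are independent Gaussian noise by construction, then mines every pair for the strongest correlation. The function names, sample sizes and seed are arbitrary choices for illustration. With few observations and many variables, an unconstrained search should be expected to turn up strong-looking "patterns" where none exist.

```python
import itertools
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def strongest_noise_correlation(n_vars, n_obs, seed=0):
    """Largest |correlation| over all pairs of independent noise variables.

    Every variable is drawn independently, so any correlation found
    here is a coincidence (hypothetical helper, for illustration only).
    """
    rng = random.Random(seed)
    data = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_vars)]
    return max(abs(pearson(a, b)) for a, b in itertools.combinations(data, 2))


if __name__ == "__main__":
    # More variables means more pairs to search, hence bigger flukes.
    for n_vars in (2, 20, 200):
        print(n_vars, round(strongest_noise_correlation(n_vars, n_obs=20), 3))
```

The search gets more "successful" as the number of variables grows, which is exactly why the constraints have to come from somewhere other than the data.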

By coincidence, I am going to teach our data mining course (36-350) again in the fall. The theme for the semester, which I decided on back in the spring, will be "waste, fraud and abuse" — not so much detecting suspicious activity, though some examples of that might be fun, as warnings against wasteful, fraudulent and/or abusive data mining.

**Update**, 29 June: see next post.

**Update**, 2 July: A correspondent writes to let me know that
Anderson's essay and the linked pieces from Wired are up at Edge.org,
along with responses from some of the other
~~clients of John Brockman's literary agency~~ leading public intellectuals
associated with that site. So far, the only one whose reaction is both
substantial and not completely clueless is Danny Hillis, who politely says
that Anderson's idea does not have "even a little bit of truth in it".

There's no reason we couldn't have an interesting public discussion about
what big data, and data-mining, could contribute to science. We already have a
very large and successful scientific discipline which routinely generates and
deals with petabytes of data, namely experimental
high-energy physics. Its example suggests that theory becomes more rather
than less important with huge volumes of data. That may not hold for the
biological and social sciences, but I'd like some argument as to why. Of
course, if one looks at actually-existing quantitative models in those
sciences, it seems clear that part of what they are doing is representing
scientists' substantive knowledge and/or guesses, but another part is just put
in for tractability, especially statistical tractability — linear or
logistic dependence, Gaussian noise, etc., etc. One of the things modern
statistics and big data *could* do is to drastically weaken those
tractability constraints. (To repeat a slogan from
my class, "More science,
fewer *t*-tests.")

We *could* have a conversation about these matters. But its
participants would have to know something about scientific practice, about
statistics and about data-mining. Some of these participants might
even argue
quite strongly that discovery can be automated, *if* one goes about
it the right way. If someone — say, a literary agent and impresario
whose client list includes just about every well-known popular science writer
in America — wanted to organize such a discussion, it would certainly be
possible and a contribution to public enlightenment. That would, however,
require such impresarios to have somewhat more critical acumen than a puppy,
which
evidently is not the case.
So the actually-existing conversation is a source not of light but of noise.

Why oh why can't we have a better consciousness industry?

*Manual trackback*: Entertaining Research; Tongue but no door; O Hermenauta;
Whimsley; Quantum of Wantum; The Statistical Mechanic; Lies and Stats; sciber

Posted by crshalizi at June 25, 2008 15:43