ALL YOUR BAYES ARE BELONG TO US |

we share with him |

Attention Conservation Notice:jargony, self-promotional ramblings about a new paper on the frequentist properties of non-parametric Bayesian procedures, which is exactly as statistics-geeky as it sounds. Plus a weird analogy to mathematical models of evolution. Even if this sounds vaguely interesting, you could always check back later and see if peer review exposed it as a tissue of fallacies. [It didn't.]

Here's the new preprint:

- CRS, "Dynamics of Bayesian Updating with Dependent Data and Misspecified Models", arxiv:0901.1342 [now also Electronic Journal of
Statistics
**3**(2009): 1039--1074] *Abstract*: Recent work on the convergence of posterior distributions under Bayesian updating has established conditions under which the posterior will concentrate on the truth, if the latter has a perfect representation within the support of the prior, and under various dynamical assumptions, such as the data being independent and identically distributed or Markovian. Here I establish sufficient conditions for the convergence of the posterior distribution in non-parametric problems even when*all*of the hypotheses are wrong, and the data-generating process has a complicated dependence structure. The main dynamical assumption is the generalized asymptotic equipartition (or "Shannon-McMillan-Breiman") property of information theory. I derive a kind of large deviations principle for the posterior measure, and discuss the advantages of predicting using a combination of models known to be wrong. An appendix sketches connections between the present results and the "replicator dynamics" of evolutionary theory.

There are people out there who see Bayes's Rule as the key to all methodologies, something essential to rationality. I find this view thoroughly misguided and not even a regulative ideal. But I've written about that at great length elsewhere and won't repeat myself here.

While there are certainly plenty of statisticians who embrace the Bayesian
way in all
its Savagery
(some of them will be on my tenure committee), I think it's fair to say that
most of the time, Bayesian methods are *not* used in contemporary
statistics and machine learning to impose coherence on statisticians', or
computers', personal beliefs. When people
like Andy Gelman,
Aleks Jakulin
&c. estimate logistic
regressions with a particular default prior, what they are doing is
really *regularization*, which is something that a frequentist like me
can understand.

Actually, Bayesian inference is a regularization-by-smoothing technique,
only in the space of probability distributions instead of the sample space. In
ordinary least-squares regression, one asks "what straight line comes closest
(in the Euclidean sense) to all the data points?" In subtler methods, one has
a whole menagerie of possible curves, and asks which *superposition* of
curves comes closest to the data points. To avoid over-fitting, one needs to
keep the weight given to very wiggly basis curves under control, e.g., by
simply ignoring possible terms which are not sufficiently smooth. (Through the
magic
of Lagrange
multipliers, this is equivalent to penalizing wiggliness.) As one gets
more and more data, finer and finer features of the regression curve can be
reliably discerned, so the amount of imposed smoothing should be relaxed. For
estimating regression curves, this is
all thoroughly understood
and even automated (making the
persistence of linear regression a mystery; but another time).

All of this carries over to estimating entire distributions rather than just
curves. In the space of probability distributions, the entire sample is a
single point,
the empirical
distribution. (I'm ignoring some subtleties about time series here.) The
family of models we're willing to consider forms a manifold (or set of
manifolds; more details) in the space of probability measures. In this space,
the right way to measure distance isn't the Euclidean metric, but rather the
relative entropy,
a.k.a. the Kullback-Leibler divergence, which generates
the information
geometry. The maximum likelihood estimate is simply the geometric
projection of the empirical distribution on to the manifold of models. In
other words, the most likely predictive distribution comes from
this *single* point on the manifold. (There
are cleverer ways to get
frequentist predictive distributions, though.) The Bayesian posterior
predictive distribution gives some weight to *all* points on the
manifold; this is the smoothing. The weights are proportional to the prior,
and also decline exponentially with the product of sample size and relative
entropy. The effect of the prior is to blunt the impact of what we actually
see, keeping us from simply trying to match it — exactly like the effect of
smoothing. The amount of smoothing done by the prior declines as the sample
size grows.

Why smooth or regularize in the first place?
As every school-child
knows, the answer lies in the bias-variance trade-off. (Pedantically,
bias-variance is for mean-squared error, but other loss functions have similar
decompositions.) Regularized estimates are less responsive to differences
between data sets, which means that they have less variance. No matter what
the data look like, the regularized estimates all have the same shape, so to
speak. But unless reality *also* has that shape, this creates a
systematic error, i.e., bias. So the more aggressively we smooth, the more we
decrease variance, but the more we increase bias. If some degree of smoothing
removes more variance than it adds bias, we come out ahead. Of course, the
variance should fall automatically as we get more data, but the bias depends on
the strength of smoothing, so the latter should shrink as the sample grows.
The important thing in analyzing this type of estimation or prediction scheme
is to check whether they relax their regularization at the right rate, so that
they don't get killed by either bias or variance, but rather consistently
converge on the truth, or at least the best approximation to the truth among
the available models.

All of this applies to Bayesian learning. Like any other regularization
scheme, it is a way of deliberately introducing bias into our inferences, not
so much on Burkean/Quinean
grounds that *our* prejudices providentially point to the truth, but
simply to reduce variance, the extent to which our inferences are at the mercy
of Fortune (in her
role as the goddess of sampling fluctuations). The question about it then
becomes whether it gets the trade-off right, and manages to converge despite
its bias. In other
words, when
does Bayesian inference possess frequentist consistency?

A surprisingly large number of people have been satisfied with a result
given by the great probabilist J. L. Doob, which essentially says that under
some reasonable-looking conditions, the Bayesian learner's
*prior* probability of being inconsistent is zero. (More exactly, the
posterior probability of any set of models containing the truth goes to 1,
except on a set of sample paths whose prior probability is zero.) Even taken
at face value, this just says that each Bayesian is convinced a priori that
they'll converge on the truth, not that they actually *are* almost sure
to find the truth.

Examining the reasonable-looking conditions of Doob's result, it turns out
that they entail the existence of a consistent *non-Bayesian* estimator.
(Doob's original assumptions can actually be weakened to just the existence of
such an estimator;
see Schervish's Theory of
Statistics, ch. 7.) It is a curious fact that *every* proof
of the consistency of Bayesian learning I know of requires the existence of a
consistent non-Bayesian estimator. (Though it need not be the maximum
likelihood estimator.) There don't seem to be any situations where Bayesian
updating is the *only* convergent route to the truth.

It turns out that Bayesian inference is *not* consistent in general.
The late David Freedman
and the very-much-alive Persi Diaconis
showed that if you
choose a prior distribution which is badly adapted to the actual
data-generating process, your posterior distribution will happily converge on
the *wrong* model, even though the rest of the set-up is very tame
— independent and identically distributed data in a boringly well-behaved
sample space, etc.

Still, there are many situations where Bayesian learning does seem to work reasonably effectively, which in light of the Freedman-Diaconis results needs explaining, ideally in a way which gives some guidance as to when we can expect it to work. This is the origin of the micro-field of Bayesian consistency or Bayesian nonparametrics, and it's here that I find I've written a paper, rather to my surprise.

I never intended to work on this. In the spring of 2003, I was going to the
statistics seminar in Ann Arbor, and one week the speaker happened to
be Yoav Freund, talking
about this
paper (I think) on model averaging for classifiers. I got hung up
on *why* the weights of different models went down exponentially with
the number of errors they'd made. It occurred to me that this was what would
happen in a very
large genetic
algorithm, if a solution's fitness was inversely proportional to the number
of errors it made, and there was no mutation or cross-over. The model-averaged
prediction would just be voting over the population. This made me feel better
about why model averaging was working, because using a genetic algorithm to
evolve classifier rules was something I was
already pretty familiar with.

The next day it struck me that this story would work just as well for
Bayesian model averaging, with weights depending on the likelihood rather than
the number of errors. In fact, I realized, Bayes's rule just *is* the
discrete-time replicator
equation, with different hypotheses being so many different replicators,
and the fitness function being the conditional likelihood.

As you know, Bob, the replicator dynamic is a mathematical representation of the basic idea of natural selection. There are different kinds of things, the kinds being called "replicators", because things of one kind cause more things of that same kind to come into being. The average number of descendants per individual is the replicator's fitness; this can depend not only on the properties of the replicator and on time and chance, but also on the distribution of replicators in the population; in that case the fitness is "frequency dependent". In its basic form, fitness-proportional selection is the only evolutionary mechanism: no sampling, no mutation, no cross-over, and of course no sex. The result is that replicators with above-average fitness increase their share of the population, while replicators with below-average fitness dwindle.

This is a pretty natural way of modeling half of the
mechanism Darwin
and Wallace
realized was behind evolution, the "selective retention" part — what it
leaves out is "blind variation". Even with this omission, the replicator
equation is a surprisingly interesting kind of dynamical system, especially
when fitness is frequency-dependent, which opens up deep connections to
evolutionary game theory. (ObBook.) Interestingly, however, Bayes is a very
limited special case of the replicator equation, since fitness is
frequency *in*dependent.

"Selective retention" is also the idea that lies behind reinforcement learning and Thorndike's law of effect. Crudely, these are all variations on the theme of "do more of what worked, and less of what didn't". Less crudely, there are at least three independent discoveries of how reinforcement learning, itself, leads to the replicator equation in the appropriate limit. So Bayesian updating isn't just a special case of evolutionary optimization; it's also something like habit formation.

Initially, I wrote this all up in the spring of 2003 and then set it aside,
because, after making the formal connections, I didn't see what it
was *good* for. Then, that summer, I went to
the first "Science et Gastronomie" workshop, talked
about this idea, and realized, from the conversations, that I
actually *could* do something with it. The fitness function was going
to end up being the relative entropy rate, and this would control where the
posterior distribution concentrated in the long run. This would let me say
something about the convergence of Bayesian learning with non-independent and
even non-Markovian data, but also about what happens when the true distribution
is not "in the support of the prior", i.e., when all the models
really *are* wrong.

So I spent a lot of time in Lyon sweating over the asymptotic behavior of the integrated likelihood and so forth (literally, given the great heat-wave). By the end of the summer, I had versions of what are now Theorems 2, 3 and 4 in the paper. These say, respectively, that the posterior density goes to zero everywhere where fitness isn't maximized, and the rate is the fitness difference; that eventually the posterior distribution concentrates on the global peaks of the fitness landscape; and that the posterior distribution in a subset of model space not including those peaks is driven by the highest fitness in the subset. All of this was pretty nice and I was fairly pleased with myself.

Then I saw that if my results were correct, Bayesian updating
should *always* be consistent. So I put the thing aside for a while....

Eventually I figured out what I was doing wrong; I was presuming that all
the points in model-space were equally well-behaved. I was explicitly assuming
that the (log) likelihood would eventually converge to the relative entropy
for *each* hypothesis. This is on the one hand mathematically harmless
(it's asymptotic
equipartition), and on the other hand statistically I can't see how any
likelihood-based method can hope to converge unless performance over a
sufficiently long past is indicative of future results. (This is why such
methods do not solve the problem of induction.) But I was further assuming,
implicitly, that there was no way for the likelihood of a hypothesis with very
high relative entropy to also converge very slowly. That is, I was assuming the
Bayesian learner could not acquire bad habits. But of course it can; the bad
habits just need to seem like good ideas *at first*.
In human
terms:

No one ever forgets how to do something that's worked for them in the past. Just replacing it with another behavior can be hard enough, and the old behavior is still going to be lurking there underneath it. Thieves keep stealing. Liars keep lying. Drunks never forget about chemically modifying their nervous systems.Similarly, the Bayesian learner never forgets about the model which matches the data

What's needed is some restrictions on the prior distribution which keep it
from putting too much weight on these bad hypotheses, though actually in
non-parametric problems one doesn't want to give them strictly zero weight,
because, hey, maybe a particular sequence of 13,127 observations really will
repeat forever. (Metaphorically: a *little* lying, stealing and
drinking is part of a complete life.) We are back to regularization, which has
the duty of promoting virtue and suppressing vice.

The most common ways of doing this divide the space of models into distinct
classes, and then
use statistical
learning theory
or empirical
process theory to gauge the risk of over-fitting within these constrained
subspace. Typical arguments involve things like showing that every hypothesis
in the constrained set is close to one of a finite number of hypotheses which
"cover" or "bracket" the space, which gives *uniform* convergence. Then
one relaxes the constraint as more data arrives, according to some
deterministic schedule
("the method of
sieves") or to optimize the trade-off between actual performance and a
bound on over-fitting
("structural risk
minimization"), etc.

Existing proofs of Bayesian consistency in non-parametric problems basically
work similarly. To simplify just a bit, the trick has two parts. The first is
to find constraints on the hypotheses which are strong enough to ensure uniform
convergence, but can be relaxed to include everything; this is what you'd do
anyway if you wanted a consistent non-parametric procedure. (Some people don't
explicitly posit constraint sets, but do other things with much the same
effect.) The second is to construct a prior whose bias towards the
most-constrained sets is strong enough to keep the wild, high-capacity parts of
the hypothesis space from dominating the posterior distribution, but
isn't *so* biased that the constraint can't be overcome with enough
data.

This is what I ended up having to do. (There was a gap of several years
between seeing that *some* constraint was needed, and
seeing *what* constraint would work. I am, in fact, a sloth.) My
constraint was, roughly, "the log likelihood converges at least *this*
fast". Unfortunately, I wasn't able to express it in relatively nice terms,
like covering numbers, though I suspect someone cleverer could replace it with
something along those lines. (It's about one-sided deviations of time-averages
from their limiting values, so it *feels* empirical-process-y.) Anyone
who actually cares will actually be better served by reading the paper than by
my trying to re-express it verbally.

I didn't figure out the right constraint to impose, and the right way to
relax it, until the summer of 2008. (This was what was on my mind when I read
Chris Anderson's ode to
overfitting.) Once I did, everything else was pretty direct, especially
since it turned out I could salvage most (but definitely not all) of what I'd
done in Lyon. One of the bigger efforts in actually writing the paper was
uniformly eliminating talk of replicators and fitness, except for a small
appendix, in favor of more statistical jargon. (I hope I succeeded.) Adding
an example, namely trying to use Markov chains to predict
a sofic
system, took about a day. This is an amusing case, because the posterior
distribution *never* converges — you can always do strictly better
by moving to Markov chains of higher order. Nonetheless, the (Hellinger)
distance between the posterior predictive distribution and the actual
predictive distribution goes to zero.

Having submitted this, I'm going to put it aside again until I hear from the referees, because there's a lot of other work which needs finishing. (When I do hear from the referees, I anticipate a certain amount of gnashing of teeth.) But in the meanwhile, I'll use this space to jot down some ideas.

- How can my Assumption 6 be replaced by a more sane-looking condition on, say, the capacity of the log-likelihood functions?
- Reinforcement learning converges to the
*continuous-time*replicator dynamics — imagine that the learner makes more and more trials per unit time, but dials down its learning rate per trial. For Bayesian learning, this would be something like make more and more observations per unit time, but with a proportionately shrinking amount of (Fisher) information per observation. What's a natural example of this situation, and does it work? - Can we say anything more about the paths followed by the Bayesian learner by using the averaging techniques developed for stochastic approximation?
- What can be said intelligently about the next-to-leading-order terms in
the posterior weights? That is, my results cover the part of the weights which
goes to zero like
*e*^{-an}for some deterministic*a*, but presumably there are, e.g., correction factors that go like*e*^{-Bn1/2}for random*B*. (In fact, I know there are, from one of the examples I left on the cutting-room floor.) These can't involve the relative entropy rate (can they?), so what do they depend on? - Hand-wavingly, what Bayes's rule is doing is like Fisher's fundamental theorem of natural selection — the population-average fitness increases, at the expense of the diversity of the population. Presumably this is related to Carl Bergstrom and Michael Lachmann's work on the fitness value of information (which I first heard about in Lyon); someone should make this precise.
- Related to the last, a point I touched on at the end of the paper is that
the accuracy of the
*population-averaged prediction*depends equally on the average accuracy*and*the diversity. (This is pure Scott Page, and makes me feel less guilty about the time I spent on this stuff while funded by his grants.) There is, in general, no reason that the gains from selection must exceed the loss of diversity, and I have examples (cut from the paper) where they don't, when Bayesian updating is overly-selective. When, in general, does this happen? How could it be detected? - Bayesian updating
*really is*just a special case of evolutionary search. What can we do with the more general case? What sort of model averaging procedures can we get by adding random variation (mutation, sex, etc.)? Ones which could come up with new models? Ones which avoid over-concentration and loss of diversity? What would a frequency-dependent fitness function mean, statistically? — At the suggestion of a (noble!) Savage Bayesian of my acquaintance, I tried to work out what decision problem is being solved by the population in the Hawk-Dove game, only to conclude that the population is not solving any decision problem. But I could be wrong, or Hawk-Dove could be anomalous.

*Manual Trackback*: 3 Quarks Daily;
Dynamic Ecology

Thanks to Nicolas Della Penna, Shiva Kaul and Chris Wiggins for typo-spotting.

Posted by crshalizi at January 30, 2009 22:47 | permanent link