Frequentist Consistency of Bayesian Procedures
06 Feb 2013 23:21
"Bayesian consistency" is usually taken to mean showing that, under Bayesian updating, the posterior probability concentrates on the true model. That is, for every (measurable) set of hypotheses containing the truth, the posterior probability goes to 1. (In practice one shows that the posterior probability of any set not containing the truth goes to zero.) There is a basic result here, due to Doob, which essentially says that the Bayesian learner is consistent, except on a set of data of prior probability zero. That is, the Bayesian is subjectively certain they will converge on the truth. This is not as reassuring as one might wish, and showing Bayesian consistency under the true distribution is harder. In fact, it usually involves assumptions under which non-Bayes procedures will also converge. These are things like the existence of very powerful consistent hypothesis tests (an approach favored by Ghosal, van der Vaart, et al., supposedly going back to Le Cam), or, inspired by learning theory, constraints on the effective size of the hypothesis space which are gradually relaxed as the sample size grows (as in Barron et al.). If these assumptions do not hold, one can construct situations in which Bayesian procedures are inconsistent.
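As a toy illustration of posterior concentration (a minimal sketch, not drawn from any of the papers below), take three Bernoulli hypotheses with a uniform prior and IID data generated from one of them. The hypothesis values, the sample size, and the function name `posterior_path` are all assumptions made up for this example; the true parameter is deliberately included in the hypothesis set, so this is the easy, well-specified case.

```python
import numpy as np

def posterior_path(thetas, prior, data):
    """Posterior over a finite set of Bernoulli parameters after each observation.

    Works in log space so that long runs of data do not underflow.
    """
    thetas = np.asarray(thetas, dtype=float)
    log_post = np.log(np.asarray(prior, dtype=float))
    path = []
    for x in data:
        # Log-likelihood of this single observation under each hypothesis.
        log_post = log_post + np.where(x == 1, np.log(thetas), np.log1p(-thetas))
        # Renormalize so the weights sum to one.
        log_post = log_post - np.logaddexp.reduce(log_post)
        path.append(np.exp(log_post))
    return np.array(path)

rng = np.random.default_rng(0)
thetas = [0.3, 0.5, 0.7]          # hypothetical hypothesis set
true_theta = 0.5                  # the truth is among the hypotheses
data = rng.binomial(1, true_theta, size=2000)
path = posterior_path(thetas, [1 / 3, 1 / 3, 1 / 3], data)

# Posterior weights after 10, 100, and 2000 observations.
print(path[[9, 99, 1999]].round(4))
```

The posterior weight on the true parameter should approach 1, with the weights on the wrong hypotheses shrinking exponentially fast in the sample size. Nothing this simple says anything about the hard cases (infinite-dimensional or mis-specified models), which are what the references are about.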
Concentration of the posterior around the truth is only a preliminary. One would also want to know that, say, the posterior mean converges, or even better that the predictive distribution converges. For many finite-dimensional problems, what's called the "Bernstein-von Mises theorem" basically says that the posterior mean and the maximum likelihood estimate converge on each other, so if one is consistent the other will be too. This breaks down for infinite-dimensional problems.
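For a one-parameter case where everything is available in closed form, the agreement between posterior and maximum likelihood can be checked directly. The sketch below assumes a Bernoulli model with a conjugate Beta(a, b) prior; the function name `beta_bernoulli_summary`, the prior, and the true parameter 0.3 are illustrative choices, not anything from the literature cited here.

```python
import numpy as np

def beta_bernoulli_summary(data, a=1.0, b=1.0):
    """Compare the MLE with the Beta(a, b)-prior posterior for Bernoulli data.

    Returns (mle, posterior mean, posterior sd, plug-in standard error of the MLE).
    """
    n = len(data)
    s = int(np.sum(data))
    mle = s / n
    # Conjugate update: the posterior is Beta(a + s, b + n - s).
    post_a, post_b = a + s, b + n - s
    total = post_a + post_b
    post_mean = post_a / total
    post_sd = np.sqrt(post_a * post_b / (total ** 2 * (total + 1)))
    mle_se = np.sqrt(mle * (1 - mle) / n)
    return mle, post_mean, post_sd, mle_se

rng = np.random.default_rng(1)
for n in (10, 100, 10_000):
    data = rng.binomial(1, 0.3, size=n)   # assumed true parameter
    mle, post_mean, post_sd, mle_se = beta_bernoulli_summary(data)
    print(n, round(mle, 4), round(post_mean, 4), round(post_sd, 4), round(mle_se, 4))
```

Here the gap between the posterior mean and the MLE is at most 1/(n + 2) under the uniform prior, and the posterior standard deviation approaches the usual standard error, which is the finite-dimensional behavior described above.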
(PAC-Bayesian results don't fit into this picture particularly neatly. Essentially, they say that if you find a set of classifiers which all classify correctly in-sample, and ask about the average out-of-sample performance, the bounds on the latter are tighter for big sets than for small ones. This is for the unmysterious reason that it takes a bigger coincidence for many bad classification rules to happen to all work on the training data than for a few bad rules to get lucky. The actual Bayesian machinery of posterior updating doesn't really come into play, at least not in the papers I've seen.)
I believe I have contributed a Result to this area, on what happens when the data are dependent and all the models are mis-specified, but some are more mis-specified than others. This turns on realizing that Bayesian updating is just a special case of evolutionary search, i.e., an infinite-dimensional stochastic replicator equation.
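The correspondence is easiest to see for a finite or countable set of hypotheses. The notation here is an assumption introduced for illustration, not necessarily that of the paper: write $w_t(\theta)$ for the posterior weight of hypothesis $\theta$ after $t$ observations, and $p_\theta(x_{t+1} \mid x_1^t)$ for the conditional likelihood it assigns to the next observation. Bayes's rule is then

$$
w_{t+1}(\theta) = w_t(\theta) \, \frac{p_\theta(x_{t+1} \mid x_1^t)}{\sum_{\theta'} w_t(\theta') \, p_{\theta'}(x_{t+1} \mid x_1^t)}
$$

which has the form of a discrete-time replicator equation: each hypothesis's share of the population grows in proportion to the ratio of its "fitness" (its likelihood for the new data) to the population-average fitness (the posterior predictive probability in the denominator). The fitness is random because the data are, hence "stochastic".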
Query: are there any situations where Bayesian methods are consistent but no non-Bayesian method is? (My recollection is that John Earman, in Bayes or Bust, provides a negative answer, but I forget how.)
- Recommended:
- Andrew Barron, Mark J. Schervish and Larry Wasserman, "The Consistency of Posterior Distributions in Nonparametric Problems", Annals of Statistics 27 (1999): 536--561 [While I am biased — Mark and Larry are senior faculty here — I think this is definitely one of the best-written papers on the topic.]
- Robert H. Berk [Old but quite nice papers on the effect of mis-specification, though with IID data assumed, and stronger assumptions about the models than modern writers are comfortable with.]
- "Limiting Behavior of Posterior Distributions when the Model is Incorrect", Annals of Mathematical Statistics 37 (1966): 51--58 [see also the correction]
- "Consistency a Posteriori", Annals of Mathematical Statistics 41 (1970): 894--906
- David Blackwell and Lester Dubins, "Merging of Opinions with Increasing Information", Annals of Mathematical Statistics 33 (1962): 882--886
- Taeryon Choi, R. V. Ramamoorthi, "Remarks on consistency of posterior distributions", arxiv:0805.3248
- Ronald Christensen, "Inconsistent Bayesian Estimation", Bayesian Analysis 4 (2009): 413--416 [An extremely simple example of how inconsistency can be generated]
- Dennis D. Cox, "An Analysis of Bayesian Inference for Nonparametric Regression", Annals of Statistics 21 (1993): 903--923
- Persi Diaconis and David Freedman, "On the Consistency of Bayes Estimates", The Annals of Statistics 14 (1986): 1--26 [With accompanying discussion; the latter is worth reading if only to fully savor the academic snark in Diaconis and Freedman's reply.]
- David Freedman, "On the Bernstein-von Mises Theorem with Infinite-Dimensional Parameters", Annals of Statistics 27 (1999): 1119--1140 [As you know, Bob, the Bernstein-von Mises theorem asserts that, "under the usual conditions", in the large sample limit the distribution of the maximum likelihood estimate is basically the same as the Bayesian posterior distribution, so you can take credible intervals as approximate confidence intervals and vice versa. It turns out that the usual conditions can fail drastically even for very simple infinite-dimensional problems.]
- Subhashis Ghosal, "A review of consistency and convergence rates of posterior distribution" [PDF]
- Subhashis Ghosal, Jayanta K. Ghosh and R. V. Ramamoorthi, "Consistency Issues in Bayesian Nonparametrics" [Review of the IID case, on Ghosal's website someplace]
- Subhashis Ghosal, Jayanta K. Ghosh and Aad W. van der Vaart, "Convergence Rates of Posterior Distributions", Annals of Statistics 28 (2000): 500--531
- Subhashis Ghosal and Yongqiang Tang, "Bayesian Consistency for Markov Processes", Sankhya 68 (2006): 227--239 [This is slick, but I think the cuteness of the proof of the main theorem is achieved at the cost of the ugliness of verifying the main conditions, as in their example. (That may just be jealousy speaking.) PDF]
- Subhashis Ghosal and Aad van der Vaart, "Convergence Rates of Posterior Distributions for Non-IID Observations", Annals of Statistics 35 (2007): 192--223
- J. K. Ghosh and R. V. Ramamoorthi, Bayesian Nonparametrics [Mini-review]
- Peter Grünwald, "Bayesian Inconsistency under Misspecification" [PDF preprint of talk given at the Valencia 8 meeting in 2006]
- Peter Grünwald and John Langford, "Suboptimal behavior of Bayes and MDL in classification under misspecification", Machine Learning 66 (2007): 119--149 [PDF reprint via Prof. Grünwald]
- B. J. K. Kleijn and A. W. van der Vaart, "Misspecification in infinite-dimensional Bayesian statistics", Annals of Statistics 34 (2006): 837--877
- Antonio Lijoi, Igor Prünster and Stephen G. Walker, "Bayesian Consistency for Stationary Models", Econometric Theory 23 (2007): 749--759 [Gives a Doob-style result, that the prior probability of failing to converge is zero.]
- David A. McAllester, "Some PAC-Bayesian Theorems", Machine Learning 37 (1999): 355--363
- Lorraine Schwartz, "On Bayes Procedures", Z. Wahrsch. Verw. Gebiete 4 (1965): 10--26 [The journal now known as Probability Theory and Related Fields]
- X. Shen and Larry Wasserman, "Rates of convergence of posterior distributions", Annals of Statistics 29 (2001): 687--714
- Stephen Walker, "New Approaches to Bayesian Consistency", Annals of Statistics 32 (2004): 2028--2043 = math.ST/0503672 [Clever martingale tricks.]
- Yang Xing, "Convergence rates of posterior distributions for observations without the iid structure", arxiv:0811.4677
- Yang Xing and Bo Ranneby, "Both necessary and sufficient conditions for Bayesian exponential consistency", arxiv:0812.1084 [Essentially, a unifying presentation of several existing conditions for IID samples.]
- Tong Zhang, "From $\epsilon$-entropy to KL-entropy: Analysis of minimum information complexity density estimation", Annals of Statistics 34 (2006): 2180--2210 = arxiv:math.ST/0702653
- Modesty forbids me to recommend:
- CRS, "Dynamics of Bayesian Updating with Dependent Data and Mis-specified Models", arxiv:0901.1342 = Electronic Journal of Statistics 3 (2009): 1039--1074 [Less-technical explanation of the paper]
- To read:
- P. J. Bickel and B. J. K. Kleijn, "The semiparametric Bernstein-von Mises theorem", Annals of Statistics 40 (2012): 206--237
- Natalia A. Bochkina, Peter J. Green
- "Consistency and efficiency of Bayesian estimators in generalised linear inverse problems", arxiv:1110.3015
- "The Bernstein-von Mises theorem for non-regular generalised linear inverse problems", arxiv:1211.3434
- Ismaël Castillo, Gerard Kerkyacharian, Dominique Picard, "Thomas Bayes' walk on manifolds", arxiv:1206.0459
- Ismaël Castillo, Richard Nickl, "Nonparametric Bernstein-von Mises Theorems", arxiv:1208.3862
- Ismaël Castillo and Aad van der Vaart, "Needles and Straw in a Haystack: Posterior concentration for possibly sparse sequences", Annals of Statistics 40 (2012): 2069--2101
- René de Jonge and Harry van Zanten, "Semiparametric Bernstein-von Mises for the error standard deviation", Electronic Journal of Statistics 7 (2013): 217--243
- J. L. Doob, "Application of the theory of martingales", pp. 23--27 in Colloques Internationaux du Centre National de la Recherche Scientifique, no. 13, Centre National de la Recherche Scientifique, Paris, 1949 [Summary in Mathematical Reviews by William Feller]
- Bradley Efron, "Bayesian inference and the parametric bootstrap", Annals of Applied Statistics 6 (2012): 1971--1997
- Stefano Favaro, Alessandra Guglielmi, and Stephen G. Walker, "A class of measure-valued Markov chains and Bayesian nonparametrics", Bernoulli 18 (2012): 1002--1030
- Subhashis Ghosal, Jüri Lember and Aad van der Vaart, "Nonparametric Bayesian model selection and averaging", Electronic Journal of Statistics 2 (2008): 63--89
- Evarist Giné and Richard Nickl, "Rates of contraction for posterior distributions in $L^r$-metrics, $1 \leq r \leq \infty$", Annals of Statistics 39 (2011): 2883--2911
- Peter Grünwald, "The Safe Bayesian: Learning the Learning Rate via the Mixability Gap" [PDF preprint]
- Marcus Hutter, "Exact Non-Parametric Bayesian Inference on Infinite Trees", arxiv:0903.5342
- Bas Kleijn, Bartek Knapik, "Semiparametric posterior limits under local asymptotic exponentiality", arxiv:1210.6204
- B. J. K. Kleijn and A. W. van der Vaart, "The Bernstein-Von-Mises theorem under misspecification", Electronic Journal of Statistics 6 (2012): 354--381
- John Langford, "Tutorial on Practical Prediction Theory for Classification", Journal of Machine Learning Research 6 (2005): 273--306 [For the PAC-Bayesian result]
- Lucien Le Cam, "On the Speed of Convergence of Posterior Distributions" [PDF]
- Ryan Martin, "A martingale law of large numbers and convergence rates of Bayesian posterior distributions", arxiv:1201.3102
- Ryan Martin, Liang Hong, "On convergence rates of Bayesian predictive densities and posterior distributions", arxiv:1210.0103
- David A. McAllester, "PAC-Bayesian Stochastic Model Selection", Machine Learning 51 (2003): 5--21
- XuanLong Nguyen, "Borrowing strength in hierarchical Bayes: convergence of the Dirichlet base measure", arxiv:1301.0802
- Y. Ritov, P. J. Bickel, A. Gamst, B. J. K. Kleijn, "The Bayesian Analysis of Complex, High-Dimensional Models: Can it be CODA?", arxiv:1203.5471
- Vincent Rivoirard, Judith Rousseau
- "Bernstein Von Mises Theorem for linear functionals of the density", arxiv:0908.4167
- "Posterior Concentration Rates for Infinite Dimensional Exponential Families", Bayesian Analysis 7 (2012): 311--334
- Jean-Bernard Salomond, "Concentration rate and consistency of the posterior under monotonicity constraints", arxiv:1301.1898
- Alessio Sancetta, "Universality of Bayesian Predictions", Bayesian Analysis 7 (2012): 1--36
- Frank van der Meulen and Harry van Zanten, "Consistent nonparametric Bayesian inference for discretely observed scalar diffusions", Bernoulli 19 (2013): 44--63
- A. W. van der Vaart, J. H. van Zanten, "Rates of contraction of posterior distributions based on Gaussian process priors", Annals of Statistics 36 (2008): 1435--1463, arxiv:0806.3024
- Yuefeng Wu, Subhashis Ghosal, "Kullback Leibler property of kernel mixture priors in Bayesian density estimation", Electronic Journal of Statistics 2 (2008): 298--331, arxiv:0710.2746
- To write:
- CRS, "Bayesian Learning, Information Theory, and Evolutionary Search"
