## The Minimum Description Length Principle (MDL)

*26 Jun 2012 23:00*

MDL is an information-theoretic approach to machine learning, or statistical model selection, which basically says you should pick the model which gives you the most compact description of the data, including the description of the model itself. More precisely, given a probabilistic model, Shannon's coding theorems tell you the minimal number of bits needed to encode your data, i.e., the maximum extent to which it can be compressed. Really, however, to complete the description, you need to specify the model as well, from among some set of alternatives, and this will also require a certain number of bits. Hence you really want to minimize the combined length of the description of the model, plus the description of the data under that model. This works out to being a kind of penalized maximum likelihood — the data-given-model bit is the negative log likelihood, and the model-description term is the penalty.

It's a very appealing idea, and a lot of work has (rightly) been done under
this heading, though I have to say I'm not altogether convinced, both because
of the issues involved in chosing a coding scheme for models, and because it's
not clear that, in *practice*, it actually does that much better than
other penalization schemes — sometimes, even, than straightforward
likelihood maximization.

I should also say that what I've described above is the old-fashioned "two-part" MDL, and there are now "one-part" schemes, where (so to speak) the model coding scheme is supposed to be fixed by the data as well, in unambiguous and nearly-optimal ways. These are remarkably ingenious constructions, intimately linked to low-regret learning, as well as minimax statistics more generally, but I will not attempt to describe them here; see rather Grünwald.

See also: Universal Prediction Algorithms

- Recommended, the big story:
- Jorma Rissanen, Stochastic Complexity in Statistical
Inquiry [Review:
Less Is More, or
*Ecce data!*] - Peter Grünwald
- "A Tutorial Introduction to the Minimum Description Length Principle", math.ST/0406077
- The Minimum Description Length Principle [Blurb, sample chapter]

- MDL on the Web [Centralized website for MDL research]

- Recommended, details and applications:
- Pedro
Domingos, "The Role of Occam's Razor in Knowledge Discovery," Data
Mining and Knowledge Discovery,
**3**(1999) [Online] - Peter Grünwald and John Langford, "Suboptimal behaviour of Bayes
and MDL in classification under misspecification", math.ST/0406221
= Machine
Learning
**66**(2007): 119--149 - Peter T. Hraber, Bette T. Korber, Steven Wolinsky, Henry Erlich and Elizabeth Trachtenberg, "HLA and HIV Infection Progression: Application of the Minimum Description Length Principle to Statistical Genetics", SFI Working Paper 03-04-23
- Shane Legg, "Is There an Elegant Universal Theory of Prediction?", cs.AI/0606070 [A nice set of diagonalization arguments against the hope of a universal prediction scheme which has the nice features of Solomonoff-style induction, but is actually computable.]
- Beong Soo So, "Maximized log-likelihood updating and model
selection", Statistics and Probability Letters
**64**(2003): 293--303 [Shows how to relate some of Rissanen's ideas on predictive MDL to more conventionally-statistical notions, e.g., connecting Rissanen's "stochastic complexity" to something that looks like, but isn't quite, a Fisher information.] - Paul M. B. Vitanyi and Ming Li, "Minimum Description Length
Induction, Bayesianism, and Kolmogorov Complexity", IEEE Transactions on
Information Theory
**46**(2000): 446--464 = cs.LG/9901014

- Not recommended:
- Dana Ballard, An Introduction to Natural Computation [Review: Not Natural Enough]

- To read (with thanks to David
Dowe for suggestions):
- Pieter Adriaans and Paul Vitanyi, "The Power and Perils of MDL", cs.LG/0612095
- Joshua W. Comley and David L. Dowe, "Minimum Message Length and Generalized Bayesian Nets with Asymmetric Languages", in Grünwald et al.
- The Computer Journal,
**42:4**(1999) [Special issue on Kolmogorov complexity and inference] - Steven de Rooij and Peter Grünwald, "An Empirical Study of MDL Model Selection with Infinite Parametric Complexity", cs.LG/0501028
- Dowe, Korb and Oliver (eds.), Information, Statistics and Induction in Science
- M. Drmota and W. Szpankowski, "Precise minimax redundancy and
regret", IEEE Transactions on Information Theory
**50**(2004): 2686--2707 - Dean
P. Foster and Robert
A. Stine, "Local Asymptotic Coding and the Minimum Description Length", IEEE Transactions on
Information Theory
**45**(1999): 1289--1293 [PDF preprint] - Ciprian Doru Giurcuaneanu and Jorma Rissanen, "Estimation of AR and ARMA models by stochastic complexity", math.ST/0702765
- Peter D. Grünwald, In Jae Myung and Mark A. Pitt (eds.), Advances in Minimum Description Length: Theory and Applications [blurb]
- Marcus Hutter
- "General Loss Bounds for Universal Sequence Prediction," cs.AI/0101019
- "On Generalized Computable Universal Priors and their Convergence", cs.LG/0503026
- "Optimal Sequential Decisions based on Algorithmic Probability," cs/0306091
- "Towards a Universal Theory of Artificial Intelligence based on Algorithmic Probability and Sequential Decision Theory," cs.AI/0012011

- F. Liang and A. Barron, "Exact Minimax Strategies for Predictive
Density Estimation, Data Compression, and Model Selection", IEEE Transactions on Information Theory
**50**(2004): 2708--2726 - Daniel J. Navarro, "A Note on the Applied Use of MDL Approximations",
Neural
Computation
**16**(2004): 1763--1768 - S. L. Needham and D. L. Dowe, "Message Length as an Effective Ockham's Razor in Decision Tree Induction", pp. 253--260 in AI+STATS 2001 [Available from Prof. Dowe in several formats]
- Jan Poland and Marcus Hutter, "Asymptotics of Discrete MDL for
Online
Prediction",
IEEE Transactions on
Information Theory
**51**(2005): 3780--3795, arxiv:cs.IT/0506022 - Guoqi Qian and Hans R. Künsch, "Some Notes on Rissanen's Stochastic Complexity", Tech. Report 79, Seminar fur Statistik, ETH-Zurich (1996) [Online but I've mislaid the URL at the moment]
- Jorma Rissanen, Lectures on Statistical Modeling Theory [PDF]
- E. Rivals and J.-P. Delahae, "Optimal Representation in Average
Using Kolmogorov Complexity," Theoretical Computer Science
**200**(1998): 261--287 - Teemu Roos, Petri Myllymäki and Jorma Rissanen, "MDL Denoising Revisited", cs.IT/0609138
- Michael Small, "Optimal time delay embedding for nonlinear time series modeling", nlin.CD/0312011
- Michael Small and C. K. Tse, "Optimal embedding parameters: A modeling paradigm", physics/0308114
- Nikolai Vereshchagin and Paul Vitanyi, "Kolmogorov's Structure Functions with an Application to the Foundations of Model Selection," cs.CC/0204037
- C. S. Wallace, Statistical and Inductive Inference by Minimum Message Length [Blurb]
- C. S. Wallace and David L. Dowe, "Minimum Message Length and Kolmogorov Complexity", The Computer Journal
**42**(1999): 270--283