October 28, 2005

Gauss Is Not Mocked

By now, everyone and her brother has read, or at least read about, the papers by Albert-László Barabási and co., purporting to show that response times in e-mail, and in Darwin and Einstein's correspondence, follow a power law distribution, and that this is due to queuing processes.

Unfortunately, this is not true; the apparent power law is merely an artifact of a bad analysis of the data, which is immensely better described by a log-normal distribution. (Via Aaron Clauset.)

Daniel B. Stouffer, R. Dean Malmgren and Luís A. N. Amaral, "Comment on Barabási, Nature 435, 207 (2005)", physics/0510216
Abstract: In a recent letter, Barabási claims that the dynamics of a number of human activities are scale-free [1]. He specifically reports that the probability distribution of time intervals tau between consecutive e-mails sent by a single user and time delays for e-mail replies follow a power-law with an exponent -1, and proposes a priority-queuing process as an explanation of the bursty nature of human activity. Here, we quantitatively demonstrate that the reported power-law distributions are solely an artifact of the analysis of the empirical data and that the proposed model is not representative of e-mail communication patterns.
Authors' comment: This manuscript re-analyzes data from Barabási's paper in Nature, "The origins of bursts and heavy tails in human dynamics", but it should be clear that the same problems are to be found in physics/0510117 and the upcoming Nature advertised in Barabási's web site concerning the correspondence of Einstein and Darwin

As every school-child knows (at least, these school-children do!), adding together many independent random variables, each of which makes a small contribution to the over-all result, generally gives you a Gaussian or normal distribution (unless the contributing variables are, themselves, kind of pathological). This fact is the central limit theorem.
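For anyone who would rather see this than take it on faith, here is a minimal simulation sketch. The choice of Uniform(0, 1) inputs, the 50 terms per sum, and the helper names are all assumptions made purely for illustration; nothing in them comes from the papers under discussion.

```python
import math
import random

def standardized_sums(n_terms, n_samples, seed=0):
    """Draw n_samples sums of n_terms independent Uniform(0, 1) variables,
    each centred and scaled to have mean 0 and variance 1."""
    rng = random.Random(seed)
    mean = n_terms * 0.5               # expected value of the sum
    sd = math.sqrt(n_terms / 12.0)     # each Uniform(0, 1) term has variance 1/12
    return [(sum(rng.random() for _ in range(n_terms)) - mean) / sd
            for _ in range(n_samples)]

def fraction_within(values, k):
    """Fraction of values lying within k units of zero."""
    return sum(abs(v) <= k for v in values) / len(values)

# Illustrative settings (assumptions): 50 terms per sum, 20000 sums.
sums = standardized_sums(n_terms=50, n_samples=20000)

# A standard normal puts about 0.683 of its mass within one standard
# deviation of the mean and about 0.954 within two.
print(fraction_within(sums, 1.0))
print(fraction_within(sums, 2.0))
```

The individual inputs are flat, not bell-shaped, yet the two printed fractions should land close to the Gaussian values.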

What happens if the inputs are multiplied together, rather than added? Well, take the logarithm: log(XY) = log(X) + log(Y). The logarithm of the product will be the sum of the logarithms of the inputs. The latter will still be independent, so the logarithm of the output will be normally distributed. Undoing the log gives what's imaginatively called the log-normal distribution. Log-normals are very common, for the same reasons that normals are. Unlike normals, they are very easy to mistake for power law distributions, especially if your knowledge of statistics is as limited as most theoretical physicists'. (The distribution of links to weblogs, for instance, is much better fit by a log-normal than a power law, as we've seen.) In their comment, Stouffer et al. show that a log-normal actually gives a textbook-quality fit to Barabási's data. (The only change I'd make to their procedure is that I'd report the likelihood ratio directly, and let people work out their own Bayesian posteriors if so inclined.) Looking at the data reported in the new Nature paper on Darwin's and Einstein's correspondence, if it's not log-normal too, well, I'd say I'd eat my hat, but I don't own one; I'll buy a Notre Dame hat and eat it.
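To make the likelihood-ratio idea concrete, here is a rough sketch on simulated data, not on the e-mail data and not following Stouffer et al.'s actual procedure. It builds products of independent positive factors, fits a log-normal and a pure power law (Pareto) by maximum likelihood, and reports the difference in log-likelihoods. The Uniform(0.5, 1.5) factors, the sample size, and the function names are assumptions for illustration. So is the decision to start the power law at the smallest observation; a careful analysis would fit the power law only above a chosen cut-off.

```python
import math
import random

def simulate_products(n_factors, n_samples, seed=0):
    """Products of n_factors independent Uniform(0.5, 1.5) variables.
    The factor distribution is an arbitrary illustrative choice."""
    rng = random.Random(seed)
    return [math.prod(rng.uniform(0.5, 1.5) for _ in range(n_factors))
            for _ in range(n_samples)]

def lognormal_loglik(xs):
    """Maximised log-likelihood of a log-normal fitted to xs."""
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mu = sum(logs) / n
    var = sum((v - mu) ** 2 for v in logs) / n   # MLE divides by n, not n - 1
    # Sum over points of: -log x - 0.5*log(2*pi*var) - (log x - mu)^2 / (2*var);
    # at the MLE the last term sums to n/2.
    return -sum(logs) - 0.5 * n * math.log(2 * math.pi * var) - 0.5 * n

def pareto_loglik(xs):
    """Maximised log-likelihood of a Pareto (pure power law) fitted to xs,
    with the lower cut-off set to min(xs): a simplifying assumption."""
    n = len(xs)
    logs = [math.log(x) for x in xs]
    log_xmin = min(logs)
    alpha = n / sum(v - log_xmin for v in logs)  # MLE of the tail exponent
    # Density: alpha * xmin^alpha * x^-(alpha + 1) for x >= xmin.
    return n * math.log(alpha) + n * alpha * log_xmin - (alpha + 1) * sum(logs)

data = simulate_products(n_factors=30, n_samples=5000)
log_ratio = lognormal_loglik(data) - pareto_loglik(data)
print(f"log-likelihood ratio (log-normal minus Pareto): {log_ratio:.1f}")
```

A positive value favours the log-normal. Reporting the number itself, rather than a posterior derived from it, leaves readers free to combine it with whatever prior odds they like.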

Let me turn the microphone over to Francis Galton (as quoted in Ian Hacking's The Taming of Chance):

I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by `the law of error.' A savage, if he could understand it, would worship it as a god. It reigns with severity in complete self-effacement amidst the wildest confusion. The huger the mob and the greater the anarchy the more perfect its sway. Let a large sample of chaotic elements be taken and marshalled in order of their magnitudes, and then, however wildly irregular they appeared, an unexpected and most beautiful form of regularity proves to have been present all along.

As Hacking notes, on further consideration Galton was even more impressed by the central limit theorem, and accordingly replaced the sentence about savages with "The law would have been personified by the Greeks and deified, if they had known of it." Whether deified by Hellenes or savages, however, the CLT has a message for those doing data analysis, and the message is:

Thou shalt have no other distribution before me, for I am a jealous limit theorem.

I restrain myself from making any observations on the editorial process at Nature, or on the competence of the referees of Barabási's papers. I do wish it to be noted, however, that this post is not an entry in the "Why Oh Why Can't Physicists Learn Better Probability and Statistics?" series, as Amaral and Barabási are both associated with Gene Stanley's school of statistical physics.

Update, Halloween: Suresh Venkatasubramanian, at Geomblog, turns his microphone over to Michael Mitzenmacher, who has some very good comments. (This led me to read Mitzenmacher's nice paper on generating mechanisms for power-laws.) I am more convinced than Mitzenmacher by the difference in the goodness of fits, simply because it is so overwhelmingly large. It hardly seems to make sense, in this case, to say that the data are even approximately power-law distributed...

Update, 23 November: Barabási's group has posted a reply (physics/0511186). To my eyes, the crucial observation by Stouffer et al. was that the fit of the data to a power law is in fact really, really bad, so it's pointless to talk about what mechanism might produce a power law in such situations. The reply's take on this point is that this is "merely" a statistical issue! In short, I don't find the reply at all convincing on the major points, but if you care, by all means read it. (The reply claims that Stouffer et al.'s comment was rejected by "three referees" at Nature; one wonders if they were the referees who approved Barabási's original paper.)

Update, 25 November: To hammer the point home, let's look at Figure 1b from Stouffer et al.'s comment.

The solid black line is the empirical distribution of the data. The red dashed line is the lognormal distribution. This is, as I said, a textbook-quality fit. Correcting for right censoring (the measured response intervals are all less than 83 days, because that's the length of time over which the data were collected) would only improve the fit. (Thanks to Prof. Amaral for permission to reproduce the figure.)

Update, 29 November: Yet more commentary, from Aaron Clauset.

Update, 29 September 2006: In the event you still care about this, see G. Grinstein and R. Linsker, "Biased Diffusion and Universality in Model Queues", Physical Review Letters (2006): 130201. Grinstein and Linsker analytically solve for the asymptotic distribution of Barabási's queueing model, finding either a power law or a power-law with an exponential cut-off; they also show that the result is very sensitive to introducing a cost for switching between different kinds of tasks.

Manual trackback: In Search of 42; Pharyngula; hakank.blogg; Juan de Mairena [v.2.718]; Three Quarks Daily; Metamerist; Zoltán Sylvester; Language Log; Statistical Modeling, Causal Inference, and Social Science

Power Laws; Enigmas of Chance; Complexity

Posted by crshalizi at October 28, 2005 09:30

Three-Toed Sloth:   Hosted, but not endorsed, by the Center for the Study of Complex Systems