In which we consider evolutionary trends in body size, aided by regression modeling and the bootstrap.

Posted by crshalizi at January 31, 2012 19:11 | permanent link

Quantifying uncertainty by looking at sampling distributions. The bootstrap principle: sampling distributions under a good estimate of the truth are close to the true sampling distributions. Parametric bootstrapping. Non-parametric bootstrapping. Many examples. When does the bootstrap fail?
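As a rough illustration of the non-parametric version, here is a minimal sketch in Python (not the course's R code). It resamples a data set with replacement and re-computes an estimator each time. The data, the choice of the median as the estimator, and the function names are all made up for the example.

```python
import random
import statistics

def bootstrap_distribution(data, estimator, n_boot=2000, seed=0):
    """Resample the data with replacement and re-apply the estimator each time."""
    rng = random.Random(seed)
    n = len(data)
    return [estimator(rng.choices(data, k=n)) for _ in range(n_boot)]

def percentile_interval(replicates, level=0.95):
    """Crude percentile confidence interval from the bootstrap replicates."""
    ordered = sorted(replicates)
    alpha = (1 - level) / 2
    lo = ordered[int(alpha * len(ordered))]
    hi = ordered[int((1 - alpha) * len(ordered)) - 1]
    return lo, hi

# Hypothetical data: 200 draws from an exponential distribution with mean 1.
data_rng = random.Random(42)
sample = [data_rng.expovariate(1.0) for _ in range(200)]

reps = bootstrap_distribution(sample, statistics.median)
boot_se = statistics.stdev(reps)   # bootstrap standard error of the median
ci = percentile_interval(reps)     # approximate 95% interval
print(statistics.median(sample), boot_se, ci)
```

A parametric bootstrap would have the same shape, except that each replicate would be simulated from a fitted model rather than drawn from the data with `rng.choices`.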

*Reading*:
Notes, chapter 5
(R for figures and examples;
`pareto.R`;
`wealth.dat`);
R for in-class examples

Posted by crshalizi at January 31, 2012 19:10 | permanent link

Fortunately, however, the methods of those who *can* handle big data
are neither grotesque nor incomprehensible, and we will hear about them on
Monday.

- Alekh Agarwal, "Computation Meets Statistics: Trade-offs and Fundamental Limits for Large Data Sets"
*Abstract:* The past decade has seen the emergence of datasets of unprecedented scale, with both large sample sizes and dimensionality. Massive data sets arise in various domains, among them computer vision, natural language processing, computational biology, social networks analysis and recommendation systems, to name a few. In many such problems, the bottleneck is not just the number of data samples, but also the computational resources available to process the data. Thus, a fundamental goal in these problems is to characterize how estimation error behaves as a function of the sample size, number of parameters, and the computational budget available.
In this talk, I present three research threads that provide complementary lines of attack on this broader research agenda: (i) lower bounds for statistical estimation with computational constraints; (ii) interplay between statistical and computational complexities in structured high-dimensional estimation; and (iii) a computational budgeted framework for model selection. The first characterizes fundamental limits in a uniform sense over all methods, whereas the latter two provide explicit algorithms that exploit the interaction of computational and statistical considerations.
- Joint work with John Duchi, Sahand Negahban, Clement Levrard, Pradeep Ravikumar, Peter Bartlett, and Martin Wainwright.
*Time and place*: 4--5 pm on Monday, 6 February 2012, in Scaife Hall 125

As always, the talk is free and open to the public.

Posted by crshalizi at January 31, 2012 19:00 | permanent link

Attention conservation notice: Only of interest if you (1) care about combinatorial stochastic processes and their statistical applications, and (2) will be in Pittsburgh on Wednesday afternoon.

It is only in very special weeks, when we have been very good, that we
get *two* seminars.

- Harry Crane, "The Cut-and-Paste Process"
*Abstract:* In this talk, we present the cut-and-paste process, a novel infinitely exchangeable process on the state space of partitions of the natural numbers whose sample paths differ from previously studied exchangeable coalescent (Kingman 1982; Pitman 1999) and fragmentation (Bertoin 2001) processes. Though it evolves differently, the cut-and-paste process possesses some of the same properties as its predecessors, including a unique equilibrium measure, associated measure-valued process, a Poisson point process construction and transition probabilities which can be described in terms of Kingman's paintbox process. A parametric subfamily is related to the Chinese restaurant process and we illustrate potential applications of this model to phylogenetic inference based on RNA/DNA sequence data. There are some natural extensions of this model to Bayesian inference, hidden Markov models and tree-valued Markov processes which we will discuss.
We also discuss how this process and its extensions fit into the more general framework of statistical modeling of structure and dependence via combinatorial stochastic processes, e.g. random partitions, trees and networks, and the practical importance of infinite exchangeability in this context.
*Time and place:* 4--5 pm on Wednesday, 1 February 2012, in Scaife Hall 125

As always, the talk is free and open to the public.

Posted by crshalizi at January 31, 2012 18:45 | permanent link

Attention conservation notice: Associate editor at a non-profit scientific journal endorses a call for boycotting a for-profit scientific journal publisher.

I have for years been refusing to publish in or
referee for journals published by Elsevier; pretty much all of the commercial
journal publishers are bad deals^{1}, but
Elsevier is outrageously worse than most. Since learning
that Elsevier
had a business line in putting out publications designed to look like
peer-reviewed journals, and *calling* themselves journals, but actually
full of paid-for BS, I have had a form letter I use for declining requests
to referee, letting editors know about this, and inviting them to switch to a
publisher which doesn't deliberately seek to profit by corrupting the process
of scientific communication.

I am thus extremely happy to learn from Michael Nielsen that Tim Gowers is organizing a general boycott of Elsevier, asking people to pledge not to contribute to its journals, referee for them, or do editorial work for them. You can sign up here, and I strongly encourage you to do so. There are fields where Elsevier does publish the leading journals, and where this sort of boycott would be rather more personally costly than it is in statistics, but there is precedent for fixing that. Once again, I strongly encourage readers in academia to join this.

(To head off the inevitable mis-understandings, I am
not, today, calling for getting rid of journals as we
know them. I *am* saying that Elsevier is ripping us off outrageously,
that conventional journals can be published without ripping us off, and so we
should not help Elsevier to rip us off.)

*Disclaimer*, added 29 January: As I should have thought went without
saying, I am speaking purely for myself here, and not with any kind of
institutional voice. In particular, I am not speaking for the Annals of
Applied Statistics, or for the IMS,
which publishes it. (Though if the IMS asked its members to join in boycotting
Elsevier, I would be very happy.)

1: Let's review how scientific journals work, shall
we? Scientists are not paid by journals to write papers: we do that as
volunteer work, or more exactly, part of the money we get for teaching and from
research grants is supposed to pay for us to write papers. (We all have
day-jobs.) Journals are edited by scientists, who volunteer for this and get
nothing from the publisher. (New editors get recruited by old editors.)
Editors ask other scientists to referee the submissions; the referees are
volunteers, and get nothing from the publisher (or editor). Accepted papers
are typeset by the authors, who usually have to provide "camera-ready" copy.
The journal publisher typically provides an electronic system for keeping track
of submitted manuscripts and the refereeing process. Some of them also provide
a minimal amount of copy-editing on accepted papers, of dubious value.
Finally, the publisher actually prints the journal, and runs the server
distributing the electronic version of the paper, which is how, in this day and
age, most scientists read it. While the publisher's contribution
isn't *nothing*, it's also completely out of proportion to the fees they
charge, let alone economically efficient pricing. The
whole thing would grind to a halt without the work done by scientists, as
authors, editors and referees. That work, to repeat, is paid for either by our
students or by our grants, not by the publisher. This makes the whole system
of for-profit journal publication economically insane, a check on the
dissemination of knowledge which does nothing to encourage its creation.
Elsevier is simply one of the worst of these parasites.

*Manual trackback*: Cosmic Variance

Posted by crshalizi at January 28, 2012 11:15 | permanent link

Attention conservation notice: Only of interest if you (1) care about covariance matrices and (2) will be in Pittsburgh on Monday.

Since so much of multivariate statistics depends on patterns of correlation among variables, it is a bit awkward to have to admit that in lots of practical contexts, correlation matrices are just not very stable, and can change quite drastically. (Some people pay a lot to rediscover this.) It turns out that there are more constructive responses to this situation than throwing up one's hands and saying "that sucks", and on Monday a friend of the department and general brilliant-type-person will be kind enough to tell us about them:

- Emily Fox, "Bayesian Covariance Regression and Autoregression"
*Abstract*: Many inferential tasks, such as analyzing the functional connectivity of the brain via coactivation patterns or capturing the changing correlations amongst a set of assets for portfolio optimization, rely on modeling a covariance matrix whose elements evolve as a function of time. A number of multivariate heteroscedastic time series models have been proposed within the econometrics literature, but are typically limited by lack of clear margins, computational intractability, and curse of dimensionality. In this talk, we first introduce and explore a new class of time series models for covariance matrices based on a constructive definition exploiting inverse Wishart distribution theory. The construction yields a stationary, first-order autoregressive (AR) process on the cone of positive semi-definite matrices.
We then turn our focus to more general predictor spaces and scaling to high-dimensional datasets. Here, the predictor space could represent not only time, but also space or other factors. Our proposed Bayesian nonparametric covariance regression framework harnesses a latent factor model representation. In particular, the predictor-dependent factor loadings are characterized as a sparse combination of a collection of unknown dictionary functions (e.g., Gaussian process random functions). The induced predictor-dependent covariance is then a regularized quadratic function of these dictionary elements. Our proposed framework leads to a highly-flexible, but computationally tractable formulation with simple conjugate posterior updates that can readily handle missing data. Theoretical properties are discussed and the methods are illustrated through an application to the Google Flu Trends data and the task of word classification based on single-trial MEG data.
*Time and place*: 4--5 pm on Monday, 30 January 2012, in Scaife Hall 125

As always, the talk is free and open to the public.

Posted by crshalizi at January 27, 2012 14:25 | permanent link

The constructive alternative to complaining about linear regression is non-parametric regression. There are many ways to do this, but we will focus on the conceptually simplest one, which is smoothing; especially kernel smoothing. All smoothers involve local averaging of the training data. The bias-variance trade-off tells us that there is an optimal amount of smoothing, which depends both on how rough the true regression curve is, and on how much data we have; we should smooth less as we get more information about the true curve. Knowing the truly optimal amount of smoothing is impossible, but we can use cross-validation to select a good degree of smoothing, and adapt to the unknown roughness of the true curve. Detailed examples. Analysis of how quickly kernel regression converges on the truth. Using smoothing to automatically discover interactions. Plots to help interpret multivariate smoothing results. Average predictive comparisons.
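To make the "local averaging plus cross-validation" idea concrete, here is a small Python sketch (not the `np` package, and not the notes' R code) of a Gaussian-kernel smoother whose bandwidth is picked by leave-one-out cross-validation. The simulated data, the bandwidth grid, and the function names are assumptions for illustration only.

```python
import math
import random

def kernel_smooth(x0, xs, ys, h):
    """Nadaraya-Watson estimate at x0: a Gaussian-weighted local average."""
    weights = [math.exp(-0.5 * ((x0 - x) / h) ** 2) for x in xs]
    total = sum(weights)
    if total == 0.0:  # every weight underflowed; fall back to the global mean
        return sum(ys) / len(ys)
    return sum(w * y for w, y in zip(weights, ys)) / total

def loo_cv_error(xs, ys, h):
    """Leave-one-out mean squared prediction error for bandwidth h."""
    err = 0.0
    for i in range(len(xs)):
        rest_x = xs[:i] + xs[i + 1:]
        rest_y = ys[:i] + ys[i + 1:]
        err += (ys[i] - kernel_smooth(xs[i], rest_x, rest_y, h)) ** 2
    return err / len(xs)

# Hypothetical data: a sine curve observed with Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(0, 3) for _ in range(150)]
ys = [math.sin(2 * x) + rng.gauss(0, 0.3) for x in xs]

bandwidths = [0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0]
cv = {h: loo_cv_error(xs, ys, h) for h in bandwidths}
best_h = min(cv, key=cv.get)
print(best_h, cv[best_h])
```

Very small bandwidths average over too few points (high variance) and very large ones flatten the curve (high bias); the cross-validated error is one way to pick something in between without knowing the true curve.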

*Readings*: Notes, chapter 4 (R); Faraway, section 11.1

*Optional readings*: Hayfield and Racine, "Nonparametric Econometrics: The `np` Package"; Gelman and Pardoe, "Average Predictive Comparisons for Models with Nonlinearity, Interactions, and Variance Components" [PDF]

Posted by crshalizi at January 26, 2012 10:30 | permanent link

In which we try to discern whether poor countries grow faster.

Posted by crshalizi at January 26, 2012 09:30 | permanent link

Goals of statistical analysis: summaries, prediction, scientific inference. Evaluating predictions: in-sample error, generalization error; over-fitting. Cross-validation for estimating generalization error and for model selection. Justifying model-based inferences; Luther and Süleyman.
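A toy Python sketch of why in-sample error is a poor guide to model selection, and how cross-validation helps. The nearest-neighbor predictor, the simulated data and the candidate values of `k` are all assumptions chosen to keep the example short; they are not taken from the notes.

```python
import random

def knn_predict(x0, train, k):
    """Average the responses of the k training points nearest to x0."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, test, k):
    """Mean squared error on `test` of the predictor built from `train`."""
    return sum((y - knn_predict(x, train, k)) ** 2 for x, y in test) / len(test)

def k_fold_cv(data, k, folds=5):
    """Average held-out MSE over the folds."""
    scores = []
    for f in range(folds):
        test = data[f::folds]
        train = [p for i, p in enumerate(data) if i % folds != f]
        scores.append(mse(train, test, k))
    return sum(scores) / folds

# Hypothetical data: a parabola observed with Gaussian noise.
rng = random.Random(7)
xs = [rng.uniform(-2, 2) for _ in range(200)]
data = [(x, x ** 2 + rng.gauss(0, 0.5)) for x in xs]

candidates = [1, 3, 5, 10, 20, 50]
in_sample = {k: mse(data, data, k) for k in candidates}
cv_error = {k: k_fold_cv(data, k) for k in candidates}
print(min(in_sample, key=in_sample.get), min(cv_error, key=cv_error.get))
```

With `k = 1` every training point is its own nearest neighbor, so the in-sample error is exactly zero, which is over-fitting in its purest form; the cross-validated error penalizes that choice because the held-out points are not in the training set.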

*Reading*: Notes,
chapter 3
(R for
examples and figures).

Posted by crshalizi at January 24, 2012 10:30 | permanent link

Multiple linear regression: general formula for the optimal linear
predictor. Using Taylor's theorem to justify linear regression locally.
Collinearity. Consistency of ordinary least squares estimates under weak
conditions. Linear regression coefficients will change with the distribution
of the input variables: examples. Why R^{2} is usually a distraction.
Linear regression coefficients will change with the distribution of unobserved
variables (omitted variable problems). Errors in variables. Transformations of
inputs and of outputs. Utility of probabilistic assumptions; the importance of
looking at the residuals. What "controlled for in a linear regression" really
means.
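A small Python sketch of the point that linear regression coefficients depend on the distribution of the inputs when the true relationship is not linear. The quadratic truth, the two input ranges and the noise level are invented for the example. For X uniform on [a, b] and Y = X^2 plus noise, the optimal linear slope Cov(X, Y)/Var(X) works out to a + b, so the two fits below should come out near 1 and near 5.

```python
import random

def ols_slope(xs, ys):
    """Slope of the least-squares line: sample Cov(x, y) / Var(x)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var

rng = random.Random(3)

def simulate_slope(low, high, n=5000):
    """Fit a line to y = x^2 + noise with x uniform on [low, high]."""
    xs = [rng.uniform(low, high) for _ in range(n)]
    ys = [x ** 2 + rng.gauss(0, 0.1) for x in xs]
    return ols_slope(xs, ys)

slope_low = simulate_slope(0, 1)    # inputs concentrated on [0, 1]
slope_high = simulate_slope(2, 3)   # same relationship, inputs on [2, 3]
print(slope_low, slope_high)
```

Nothing about the relationship between the variables changed between the two runs; only where the inputs happened to fall.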

*Reading*: Notes,
chapter 2
(R for
examples and figures); Faraway, chapter 1 (continued).

Posted by crshalizi at January 24, 2012 10:15 | permanent link

Attention conservation notice: A silly idea about gamifying credit cards, which would be evil if it worked.

To make a profit in an otherwise competitive industry, it helps if you
can impose switching costs on your customers, making them either pay to stop
doing business with you, or give up something of value to
them. There are whole books about this,
written by respected economists^{1}.

This is why credit card companies are happy to offer rewards for use: accumulating points on a card, which would not move with you if you got a new card and transferred the balance, is an attempt to create switching costs. Unfortunately, from the point of view of the banks, people will redeem their points from time to time, so some money must be spent on the rewards. The ideal would be points which people would value but which would never cost the bank anything.

Item: Computer games are, deliberately, addictive. Social games are especially addictive.

Accordingly, if I were an evil and unscrupulous credit card company (but I repeat myself), I would create an online game, where people could get points either from playing the game, or from spending money with my credit card. For legal reasons, I think it would probably be best to allow the game to technically be open to everyone, but with a registration fee which is, naturally, waived for card-holders. Of course, the game software would be set up to announce on Facebook (etc.) whenever the player/debtor leveled up. I would also be tempted to award double points for fees, and triple for interest charges, but one could experiment with this. If they close their credit card account, they have to start the game over from the beginning.

The fact that online acquaintances can't tell whether the debtor is advancing through spending or through game-play helps keep the reward points worth having. It's true that the credit card company has to pay for the game's design (a one-time start-up cost) and the game servers, but these are fairly cheap, and the bank never has to cash out points in actual dollars or goods. The debtors themselves do all the work of investing the points with meaning and value. They impose the switching costs on themselves.

My plan is sheer elegance in its simplicity, and I will be speaking to an attorney about a business method patent first thing Monday.

1: Much can be learned about our benevolent new-media overlords from the fact that this book carries a blurb from Jeff Bezos of Amazon, and that Varian now works for Google.

Posted by crshalizi at January 22, 2012 10:15 | permanent link

Attention conservation notice: An academic paper you've never heard of, about a distressing subject, has bad statistics and is generally foolish.

Because my so-called friends like to torment me, several of them made sure that I knew a remarkably idiotic paper about power laws was making the rounds, promoted by the ignorant and credulous, with assistance from the credulous and ignorant, supported by capitalist tools:

- M. V. Simkin and V. P. Roychowdhury, "Stochastic modeling of a serial killer", arxiv:1201.2458
*Abstract*: We analyze the time pattern of the activity of a serial killer, who during twelve years had murdered 53 people. The plot of the cumulative number of murders as a function of time is of "Devil's staircase" type. The distribution of the intervals between murders (step length) follows a power law with the exponent of 1.4. We propose a model according to which the serial killer commits murders when neuronal excitation in his brain exceeds certain threshold. We model this neural activity as a branching process, which in turn is approximated by a random walk. As the distribution of the random walk return times is a power law with the exponent 1.5, the distribution of the inter-murder intervals is thus explained. We confirm analytical results by numerical simulation.

Let's see if we can't stop this before it gets too far, shall we? The
serial killer in question is
one Andrei
Chikatilo, and that Wikipedia article gives the dates of death of his
victims, which seems to have been Simkin and Roychowdhury's data source as
well. Several of these are known only imprecisely, so I made guesses within
the known ranges; the results don't seem to be very sensitive to the guesses.
Simkin and Roychowdhury plotted the distribution of days between killings in a
binned histogram on a logarithmic scale;
as we've explained elsewhere, this
is a bad idea, which destroys information to no good purpose, and a better
display shows the (upper or complementary) cumulative distribution
function^{1}, which looks like so:

When I fit a power law to this by maximum likelihood, I get an exponent of 1.4, like Simkin and Roychowdhury; that looks like this:

On the other hand, when I fit a log-normal (because Gauss is not mocked), I get this:

After that figure, a formal statistical test is almost superfluous,
but let's do it anyway, because why just trust our eyes when we can calculate?
The data are better fit by the log-normal than by the power-law (the data are
*e*^{10.41} or about 33 thousand times more likely under the
former than the latter), but that could happen via mere chance fluctuations,
even when the power law is
right. Vuong's model comparison
test lets us quantify that probability, and tells us a power-law would
produce data which seems to fit a log-normal this well no more than 0.4
percent^{2} of the time. Not only does the log-normal distribution fit
better than the power-law, the difference is so big that it would be absurd to
try to explain it away as bad luck. In absolute terms, we can find the
probability of getting as big a deviation between the fitted power law and the
observed distribution through sampling fluctuations, and it's about 0.03
percent^{2b}. [R code for figures,
estimates and test, including data.]

Since Simkin and Roychowdhury's model produces a power law, and these data,
whatever else one might say about them, are not power-law distributed, I will
refrain from discussing all the ways in which it is a bad model.
I *will* re-iterate that it is an idiotic paper — which is
different from saying that Simkin and Roychowdhury are idiots; they are not and
have done interesting work on,
e.g., estimating how often
references are copied from bibliographies without being read by tracking
citation errors^{4}. But the idiocy in this paper goes beyond
statistical incompetence. The model used here was originally proposed for the
time intervals between epileptic fits. The authors realize that

[i]t may seem unreasonable to use the same model to describe an epileptic and a serial killer. However, Lombroso [5] long ago pointed out a link between epilepsy and criminality.

That would be the 19th-century pseudo-scientist

As for the general issues about power laws and their abuse, say something once, why say it again?

**Update** 9 pm that day: Added the goodness-of-fit test (text
before note 2b, plus that note), updated code, added PNG versions of figures,
added attention conservation notice.

21 January: typo fixes (missing pronoun, mis-placed decimal point), added
bootstrap confidence interval for exponent, updated code accordingly.

*Manual trackback*: Hacker News (do I really need to link to this?), Naked Capitalism (?!);
Mathbabe;
Wolfgang Beirl;
Ars Mathematica (yes, I *am* that predictable)

1: This is often called the "survival function", but that seems inappropriate here.

2: On average, the log-likelihood of each observation was 0.20 higher under the log-normal than under the power law, and the standard deviation of the log likelihood ratio over the samples was only 0.54. The test statistic thus comes out to -2.68, and the one-sided *p*-value to 0.36%.
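For readers who want to see the arithmetic of note 2 spelled out, here is a minimal Python sketch of the normalized log-likelihood ratio (this is not the linked R code, and the function names are made up). It takes the per-observation log-likelihoods under the two fitted models as given. Using the sample standard deviation rather than the divide-by-n version is an assumption that matters little at these sample sizes.

```python
import math
import statistics

def vuong_statistic(loglik_a, loglik_b):
    """Normalized log-likelihood ratio of model A over model B.

    Inputs are per-observation log-likelihoods at each model's fitted
    parameters. Positive values favor A, negative values favor B.
    """
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    n = len(diffs)
    return math.sqrt(n) * statistics.mean(diffs) / statistics.stdev(diffs)

def one_sided_p(z):
    """Standard normal probability of a statistic at least as large as z."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Tiny made-up example, just to show the calling convention.
ll_a = [-1.0, -1.2, -0.8, -1.1]
ll_b = [-1.5, -1.3, -1.4, -1.2]
print(vuong_statistic(ll_a, ll_b), one_sided_p(2.68))
```

The sign of the statistic depends only on which model is listed first, so a value of -2.68 for power law versus log-normal corresponds to `one_sided_p(2.68)`.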

2b: Use a Kolmogorov-Smirnov test. Since the power
law has a parameter estimated from data (namely, the exponent), we can't just
plug in to the usual tables for a K-S test, but we can find a *p*-value by
simulating the power law (as in my
paper with Aaron and Mark), and when I do that, with a hundred thousand
replications, the *p*-value is about 3*10^{-4}.
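The simulation in note 2b can be sketched in Python as follows. This is an illustration of the general recipe, not the linked R code. It assumes a continuous power law with a fixed lower cutoff `xmin`, whereas the actual data are counts of days, and every function name here is hypothetical.

```python
import math
import random

def fit_alpha(xs, xmin):
    """Maximum-likelihood exponent of a continuous power law on [xmin, inf)."""
    return 1 + len(xs) / sum(math.log(x / xmin) for x in xs)

def ks_distance(xs, xmin, alpha):
    """Largest gap between the empirical CDF and the fitted power-law CDF."""
    ordered = sorted(xs)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        cdf = 1 - (x / xmin) ** (1 - alpha)
        d = max(d, abs(cdf - i / n), abs(cdf - (i + 1) / n))
    return d

def draw_power_law(n, xmin, alpha, rng):
    """Inverse-CDF sampling from the continuous power law."""
    return [xmin * (1 - rng.random()) ** (-1 / (alpha - 1)) for _ in range(n)]

def bootstrap_p_value(xs, xmin, n_sim=500, seed=0):
    """Share of simulated power-law samples that fit as badly as the data."""
    rng = random.Random(seed)
    alpha_hat = fit_alpha(xs, xmin)
    d_obs = ks_distance(xs, xmin, alpha_hat)
    exceed = 0
    for _ in range(n_sim):
        sim = draw_power_law(len(xs), xmin, alpha_hat, rng)
        # Re-fit the exponent on each simulated sample, so the simulation
        # goes through the same estimation step as the real data did.
        if ks_distance(sim, xmin, fit_alpha(sim, xmin)) >= d_obs:
            exceed += 1
    return exceed / n_sim

# Hypothetical data: one log-normal sample and one genuine power-law sample.
rng = random.Random(5)
lognormal_sample = [rng.lognormvariate(0, 1) for _ in range(200)]
p_lognormal = bootstrap_p_value(lognormal_sample, min(lognormal_sample))
power_sample = draw_power_law(200, 1.0, 2.5, rng)
p_power = bootstrap_p_value(power_sample, 1.0)
print(p_lognormal, p_power)
```

Re-estimating the exponent inside the loop is the step that replaces the usual K-S tables, which assume the null distribution was fixed in advance rather than fitted.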

3: There are in fact subtle, not to say profound,
issues in the sociology and philosophy of science here: was
Lombroso *always* a pseudo-scientist, because his investigations never
came up to any acceptable standard of reliable inquiry? Or just because they
didn't come up to the standards of inquiry prevalent at the time he wrote? Or
did Lombroso *become* a pseudo-scientist, when enough members of enough
intellectual communities woke up from the pleasure of having their prejudices
about the lower orders echoed to realize that he was full of it? However that
may be, this paper has the dubious privilege of being the first time I have
ever seen Lombroso cited as an *authority* rather than
a *specimen*.

4: Actually, for several years my bibliography data
base had the wrong page numbers for one of *my own* papers, due to a
typo, so their method would flag some of my subsequent works as written by
someone who had cited that paper without reading it, which I assure you was not
the case. But the idea seems reasonable in general.

Posted by crshalizi at January 17, 2012 20:23 | permanent link

In which we practice the art of linear regression upon the California real-estate market, by way of warming up for harder tasks.

(Yes, the data set is now about as old as my students, but last week in
Austin I was too busy ~~drinking on 6th street~~ having lofty
conversations about the future of statistics to update the file with
the `UScensus2000`
package.)

Posted by crshalizi at January 17, 2012 10:31 | permanent link

Statistics is the science which studies methods for learning from imperfect data. Regression is a statistical model of functional relationships between variables. Getting relationships right means being able to predict well. The least-squares optimal prediction is the expectation value; the conditional expectation function is the regression function. The regression function must be estimated from data; the bias-variance trade-off controls this estimation. Ordinary least squares revisited as a smoothing method. Other linear smoothers: nearest-neighbor averaging, kernel-weighted averaging.

*Readings*: Notes,
chapter 1; Faraway, chapter 1, through page 17.

Posted by crshalizi at January 17, 2012 10:30 | permanent link

If you sent me e-mail at my @stat.cmu.edu address in the last few days, I haven't gotten it, and may never get it. The address firstinitiallastname at cmu dot edu now points somewhere where I can read.

Posted by crshalizi at January 07, 2012 20:40 | permanent link

I'll be speaking at UT-Austin next week, through the kindness of the division of statistics and scientific computation:

- "When Can We Learn Network Models from Samples?"
*Abstract*: Statistical models of network structure are models for the entire network, but the data are typically just a sampled sub-network. Parameters for the whole network, which are what we care about, are estimated by fitting the model on the sub-network. This assumes that the model is "consistent under sampling" (forms a projective family). For the widely-used exponential random graph models (ERGMs), this trivial-looking condition is violated by many popular and scientifically appealing models; satisfying it drastically limits ERGMs' expressive power. These results are special cases of more general ones about exponential families of dependent variables, which we also prove. As a consolation prize, we offer easily checked conditions for the consistency of maximum likelihood estimation in ERGMs, and discuss some possible constructive responses.
*Time and place*: 2--3 pm on Wednesday, 11 January 2012, in Hogg Building (WCH), room 1.108

This will of course be based on my paper with Alessandro, but since I understand some non-statisticians may sneak in, I'll try to be more comprehensible and less technical.

Since this will be my first time in Austin (indeed my first time in Texas), and I have (for a wonder) absolutely no obligations on the 12th, suggestions on what I should see or do would be appreciated.

Posted by crshalizi at January 06, 2012 14:15 | permanent link

It's that time again:

- 36-402, Advanced Data Analysis, Spring 2012
*Description*: This course introduces modern methods of data analysis, building on the theory and application of linear models from 36-401. Topics include nonlinear regression, nonparametric smoothing, density estimation, generalized linear and generalized additive models, simulation and predictive model-checking, cross-validation, bootstrap uncertainty estimation, multivariate methods including factor analysis and mixture models, and graphical models and causal inference. Students will analyze real-world data from a range of fields, coding small programs and writing reports.
*Prerequisites*: 36-401 (modern regression); or consent of instructor, in extraordinary cases
*Time and place*: 10:30--11:50 am, Tuesdays and Thursdays, in Porter Hall 100
*Note*: Graduate students in other departments wishing to take this course for credit need consent of the instructor, and should register for 36-608.

Fuller details on the class homepage, including a detailed (but subject to change) list of topics, and links to the compiled course notes. I'll post updates here to the notes for specific lectures and assignments, like last time.

This is the same course I taught last spring, only grown from sixty-odd students to (currently) ninety-three (from 12 different majors!). The smart thing for me to do would probably be to change nothing (I haven't gotten to re-teach a class since 2009), but I felt the urge to re-organize the material and squeeze in a few more topics.

The biggest change I am making is introducing some quality-control sampling. The course is too big for me to look over much of the students' work, and even then, that gives me little sense of whether the assignments are really probing what they know (much less helping them learn). So I will be randomly selecting six students every week, to come to my office and spend 10--15 minutes each explaining the assignment to me and answering live questions about it. Even allowing for students being randomly selected multiple times*, I hope this will give me a reasonable cross-section of how well the assignments are working, and how well the grading tracks that. But it's an experiment and we'll see how it goes.

* (exercise for the student): Find the probability distribution of the number of times any given student gets selected. Assume 93 students, with 6 students selected per week, and 14 weeks. (Also assume no one drops the class.) Find the distribution of the total number of distinct students who ever get selected.
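One way to check an answer to the exercise is by simulation. The Python sketch below assumes that each week's six students are drawn without replacement, independently of previous weeks, which makes the number of times a given student is picked binomial with 14 trials and success probability 6/93. The variable names and the number of simulated semesters are arbitrary.

```python
import random
from collections import Counter
from math import comb

STUDENTS, PER_WEEK, WEEKS = 93, 6, 14

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def simulate_semester(rng):
    """Return how many times each selected student was picked in one semester."""
    counts = Counter()
    for _ in range(WEEKS):
        counts.update(rng.sample(range(STUDENTS), PER_WEEK))
    return counts

rng = random.Random(2012)
n_sims = 5000
times_student_0 = Counter()   # how often student 0 is picked 0, 1, 2, ... times
distinct = Counter()          # how many distinct students are ever picked

for _ in range(n_sims):
    counts = simulate_semester(rng)
    distinct[len(counts)] += 1
    times_student_0[counts[0]] += 1

mean_distinct = sum(d * c for d, c in distinct.items()) / n_sims
expected_distinct = STUDENTS * (1 - (1 - PER_WEEK / STUDENTS) ** WEEKS)
print(times_student_0[0] / n_sims, binomial_pmf(0, WEEKS, PER_WEEK / STUDENTS))
print(mean_distinct, expected_distinct)
```

The simulated frequencies in `distinct` approximate the full distribution of the number of distinct students; only its mean has the simple closed form used for `expected_distinct`.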

Posted by crshalizi at January 03, 2012 23:00 | permanent link

*Attention conservation notice*: Navel-gazing.

Paper manuscripts completed: 12

Papers accepted: 2 [i, ii], one from last year

Papers rejected: 10 (fools! I'll show you all!)

Papers rejected with a comment from the editor that no one should take the
paper I was responding to, published in the same glossy high-impact journal,
"literally": 1

Papers in refereeing limbo: 4

Papers in progress: I won't look in that directory and you can't make me

Grant proposals submitted: 3

Grant proposals rejected: 4 (two from last year)

Grant proposals in refereeing limbo: 1

Grant proposals in progress for next year: 3

Talks given and conferences attended: 20, in 14 cities

Manuscripts refereed: 46, for 18 different journals and conferences

Manuscripts waiting for me to referee: 7

Manuscripts for which I was the responsible associate editor
at Annals of Applied
Statistics: 10

Book proposals reviewed: 3

Classes taught: 2

New classes taught: 2

Summer school classes taught: 1

New summer school classes taught: 1

Pages of new course material written: about 350

Students who are now ABD: 1

Students who are not just ABD but on the job market: 1

Letters of recommendation written: 8 (with about 100 separate destinations)

Promotion packets submitted: 1 (for promotion to associate professor, but without tenure)

Promotion cases still working through the system: 1

Book reviews published on dead trees: 2 [i, ii]

Non-book-reviews published on dead trees: 1

Weblog posts: 157

Substantive weblog posts: 54, counting algal
growths

Books acquired: 298

E-book readers gratefully received: 1

Books driven by my mother from her house to Pittsburgh: about 800

Books begun: 254

Books finished: 204 (of which 34 on said e-book reader)

Books given up: 16

Books sold: 133

Books donated: 113

Book manuscripts completed: 0

Wisdom teeth removed: 4

Unwise teeth removed: 1

Major life transitions: 0

Posted by crshalizi at January 01, 2012 12:00 | permanent link