## April 20, 2012

### Just How Quickly Do We Forget?

Attention conservation notice: 2500+ words on estimating how quickly time series forget their own history. Only of interest if you care about the intersection of stochastic processes and statistical learning theory. Full of jargon, equations, log-rolling and self-promotion, yet utterly abstract.

I promised to say something about the content of Daniel's thesis, so let me talk about two of his papers, which go into chapter 4; there is a short conference version and a long journal version.

Daniel J. McDonald, Cosma Rohilla Shalizi and Mark Schervish, "Estimating beta-mixing coefficients", AIStats 2011, arxiv:1103.0941
Abstract: The literature on statistical learning for time series assumes the asymptotic independence or "mixing" of the data-generating process. These mixing assumptions are never tested, nor are there methods for estimating mixing rates from data. We give an estimator for the $$\beta$$-mixing rate based on a single stationary sample path and show it is $$L_1$$-risk consistent.
----, "Estimating beta-mixing coefficients via histograms", arxiv:1109.5998
Abstract: The literature on statistical learning for time series often assumes asymptotic independence or "mixing" of data sources. Beta-mixing has long been important in establishing the central limit theorem and invariance principle for stochastic processes; recent work has identified it as crucial to extending results from empirical processes and statistical learning theory to dependent data, with quantitative risk bounds involving the actual beta coefficients. There is, however, presently no way to actually estimate those coefficients from data; while general functional forms are known for some common classes of processes (Markov processes, ARMA models, etc.), specific coefficients are generally beyond calculation. We present an $$L_1$$-risk consistent estimator for the beta-mixing coefficients, based on a single stationary sample path. Since mixing coefficients involve infinite-order dependence, we use an order-d Markov approximation. We prove high-probability concentration results for the Markov approximation and show that as $$d \rightarrow \infty$$, the Markov approximation converges to the true mixing coefficient. Our estimator is constructed using d dimensional histogram density estimates. Allowing asymptotics in the bandwidth as well as the dimension, we prove $$L_1$$ concentration for the histogram as an intermediate step.

Recall the world's simplest ergodic theorem: suppose $$X_t$$ is a sequence of random variables with common expectation $$m$$ and variance $$v$$, and stationary covariance $$\mathrm{Cov}[X_t, X_{t+h}] = c_h$$. Then the time average $$\overline{X}_n \equiv \frac{1}{n}\sum_{i=1}^{n}{X_i}$$ also has expectation $$m$$, and the question is whether it converges on that expectation. The world's simplest ergodic theorem asserts that if the correlation time $T = \frac{\sum_{h=1}^{\infty}{|c_h|}}{v} < \infty$ then $\mathrm{Var}\left[ \overline{X}_n \right] \leq \frac{v}{n}(1+2T)$

Since, as I said, the expectation of $$\overline{X}_n$$ is $$m$$ and its variance is going to zero, we say that $$\overline{X}_n \rightarrow m$$ "in mean square".

From this, we can get a crude but often effective deviation inequality, using Chebyshev's inequality: $\Pr{\left(|\overline{X}_n - m| > \epsilon\right)} \leq \frac{v}{\epsilon^2}\frac{1+2T}{n}$

The meaning of the condition that the correlation time $$T$$ be finite is that the correlations themselves have to trail off as we consider events which are widely separated in time — they don't ever have to be zero, but they do need to get smaller and smaller as the separation $$h$$ grows. (One can actually weaken the requirement on the covariance function to just $$\lim_{n\rightarrow \infty}{\frac{1}{n}\sum_{h=1}^{n}{c_h}} = 0$$, but this would take us too far afield.) In fact, as these formulas show, the convergence looks just like what we'd see for independent data, only with $$\frac{n}{1+2T}$$ samples instead of $$n$$, so we call the former the effective sample size.
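To make the effective-sample-size arithmetic concrete, here is a minimal sketch in Python. It assumes, purely for illustration, an AR(1) process with coefficient `phi`, for which $$c_h = v\phi^h$$ and hence $$T = |\phi|/(1-|\phi|)$$. Neither the AR(1) example nor the function names come from the papers.

```python
def ar1_correlation_time(phi):
    """Correlation time T = sum_{h>=1} |c_h| / v, assuming c_h = v * phi**h (AR(1))."""
    if not -1 < phi < 1:
        raise ValueError("phi must lie strictly between -1 and 1 for stationarity")
    return abs(phi) / (1 - abs(phi))


def effective_sample_size(n, T):
    """n / (1 + 2T): how many independent samples n dependent ones are worth."""
    return n / (1 + 2 * T)


def chebyshev_bound(v, T, n, eps):
    """Upper bound on Pr(|mean - m| > eps) from the deviation inequality above."""
    return min(1.0, v * (1 + 2 * T) / (n * eps ** 2))


T = ar1_correlation_time(0.5)          # geometric series: 0.5 / (1 - 0.5)
n_eff = effective_sample_size(300, T)  # 300 dependent points
bound = chebyshev_bound(v=1.0, T=T, n=300, eps=0.5)
print(T, n_eff, bound)
```

With `phi = 0.5`, three hundred observations are worth about a hundred independent ones, and the Chebyshev bound is correspondingly three times looser than it would be for IID data.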

All of this is about the convergence of averages of $$X_t$$, and based on its covariance function $$c_h$$. What if we care not about $$X$$ but about $$f(X)$$? The same idea would apply, but unless $$f$$ is linear, we can't easily get its covariance function from $$c_h$$. The mathematicians' solution to this has been to invent stronger notions of decay-of-correlations, called "mixing". Very roughly speaking, we say that $$X$$ is mixing when, if you pick any two (nice) functions $$f$$ and $$g$$, I can always show that $\lim_{h\rightarrow\infty}{\mathrm{Cov}\left[ f(X_t), g(X_{t+h}) \right]} = 0$

Note (or believe) that this is "convergence in distribution"; it happens if, and only if, the distribution of events up to time $$t$$ is becoming independent of the distribution of events from time $$t+h$$ onwards.

To get useful results, it is necessary to quantify mixing, which is usually done through somewhat stronger notions of dependence. (Unfortunately, none of these have meaningful names. The review by Bradley ought to be the standard reference.) For instance, the "total variation" or $$L_1$$ distance between probability measures $$P$$ and $$Q$$, with densities $$p$$ and $$q$$, is $d_{TV}(P,Q) = \frac{1}{2}\int{|p(u) - q(u)| du}$ This has several interpretations, but the easiest to grasp is that it says how much $$P$$ and $$Q$$ can differ in the probability they give to any one event: for any $$E$$, $$d_{TV}(P,Q) \geq |P(E) - Q(E)|$$. One use of this distance is to measure the dependence between random variables, by seeing how far their joint distribution is from the product of their marginal distributions. Abusing notation a little to write $$P(U,V)$$ for the joint distribution of $$U$$ and $$V$$, we measure dependence as $\beta(U,V) \equiv d_{TV}(P(U,V), P(U) \otimes P(V)) = \frac{1}{2}\int{|p(u,v)-p(u)p(v)|du dv}$ This will be zero just when $$U$$ and $$V$$ are statistically independent, and one when, on average, conditioning on $$U$$ confines $$V$$ to a set which would otherwise have probability zero. (For instance if $$U$$ has a continuous distribution and $$V$$ is a function of $$U$$, or one of two randomly chosen functions of $$U$$.)
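For discrete variables the integrals become sums, and both quantities are a few lines of Python. This is only a sketch of the discrete analogue, with the joint distribution given as a nested list of probabilities; the function names are made up for illustration.

```python
def total_variation(p, q):
    """d_TV between two discrete distributions given as equal-length lists."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def beta_dependence(joint):
    """beta(U, V): total variation between the joint and the product of marginals.

    joint[i][j] is Pr(U = i, V = j).
    """
    p_u = [sum(row) for row in joint]        # marginal of U
    p_v = [sum(col) for col in zip(*joint)]  # marginal of V
    return 0.5 * sum(
        abs(joint[i][j] - p_u[i] * p_v[j])
        for i in range(len(p_u))
        for j in range(len(p_v))
    )


independent = [[0.25, 0.25], [0.25, 0.25]]
copies = [[0.5, 0.0], [0.0, 0.5]]  # V is always equal to U
print(beta_dependence(independent), beta_dependence(copies))
```

Note that with only $$k$$ equally likely values, a variable and its exact copy give $$1 - 1/k$$ rather than one (here, one half); the value of one needs the continuous setting described above.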

We can relate this back to the earlier idea of correlations between functions by realizing that $\beta(U,V) = \frac{1}{2}\sup_{|r|\leq 1}{\left|\int{r(u,v) dP(U,V)} - \int{r(u,v)dP(U)dP(V)}\right|} ~,$ that is, $$\beta$$ says how much the expected value of a bounded function $$r$$ could change between the dependent and the independent distributions. (The factor of one half matches the normalization of $$d_{TV}$$ above: the supremum is attained by letting $$r$$ be the sign of the difference in densities, which gives the full $$L_1$$ distance.) (There is no assumption that the test function $$r$$ factorizes, and in fact it's important to allow $$r(u,v) \neq f(u)g(v)$$.)

We apply these ideas to time series by looking at the dependence between the past and the future: $\begin{eqnarray*} \beta(h) & \equiv & d_{TV}(P(X^t_{-\infty}, X_{t+h}^{\infty}), P(X^t_{-\infty}) \otimes P(X_{t+h}^{\infty})) \\ & = & \frac{1}{2}\int{|p(x^t_{-\infty},x_{t+h}^{\infty})-p(x^t_{-\infty})p(x^{\infty}_{t+h})|dx^t_{-\infty}dx^{\infty}_{t+h}} \end{eqnarray*}$ (By stationarity, the integral actually does not depend on $$t$$.) When $$\beta(h) \rightarrow 0$$ as $$h \rightarrow \infty$$, we have a "beta-mixing" process. (These are also called "absolutely regular".) Convergence in total variation implies convergence in distribution, but not vice versa, so beta-mixing is stronger than common-or-garden mixing.

Notions like beta-mixing were originally introduced purely for probabilistic convenience, to handle questions like "when does the central limit theorem hold for stochastic processes?" These are interesting for people who like stochastic processes, or indeed for those who want to do Markov chain Monte Carlo and want to know how long to let the chain run. For our purposes, though, what's important is that when people in statistical learning theory have given serious attention to dependent data, they have usually relied on a beta-mixing assumption.

The reason for this focus on beta-mixing is that it "plays nicely" with approximating dependent processes by independent ones. The usual form of such arguments is as follows. We want to prove a result about our dependent but mixing process $$X$$. For instance, we realize that our favorite prediction model will tend to do worse out-of-sample than on the data used to fit it, and we might want to bound the probability that this over-fitting will exceed $$\epsilon$$. If we know the beta-mixing coefficients $$\beta(h)$$, we can pick a separation, call it $$a$$, where $$\beta(a)$$ is reasonably small. Now we divide $$X$$ up into $$\mu = n/a$$ blocks of length $$a$$. If we take every other block, they're nearly independent of each other (because $$\beta(a)$$ is small) but not quite (because $$\beta(a) \neq 0$$). Introduce a (fictitious) random sequence $$Y$$, where blocks of length $$a$$ have the same distribution as the blocks in $$X$$, but there's no dependence between blocks. Since $$Y$$ is an IID process, it is easy for us to prove that, for instance, the probability of over-fitting $$Y$$ by more than $$\epsilon$$ is at most some small $$\delta(\epsilon,\mu/2)$$. Since $$\beta$$ tells us about how well dependent probabilities are approximated by independent ones, the probability of the bad event happening with the dependent data is at most $$\delta(\epsilon,\mu/2) + (\mu/2)\beta(a)$$. We can make this as small as we like by letting $$\mu$$ and $$a$$ both grow as the time series gets longer. Basically, any result which holds for an IID process will also hold for a beta-mixing one, with a penalty in the probability that depends on $$\beta$$. There are some details to fill in here (how to pick the separation $$a$$? should the blocks always be the same length as the "filler" between blocks?), but this is the basic frame.
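The shape of that bound can be sketched numerically. The snippet below is illustrative only: it plugs in a two-sided Hoeffding bound for averages of $$[0,1]$$-valued block statistics as a stand-in for $$\delta(\epsilon,\mu/2)$$, which is an assumption made here and not the bound used in the papers, and `blocked_bound` is a hypothetical name.

```python
import math


def blocked_bound(n, a, beta_a, eps):
    """delta(eps, mu/2) + (mu/2) * beta(a) for n points cut into blocks of length a.

    ASSUMPTION: delta is taken to be the two-sided Hoeffding bound
    2 * exp(-2 * m * eps**2) for m independent [0, 1]-valued blocks.
    """
    mu = n // a  # number of blocks of length a
    m = mu // 2  # every other block is kept
    iid_term = 2 * math.exp(-2 * m * eps ** 2)
    mixing_penalty = m * beta_a
    return min(1.0, iid_term + mixing_penalty)


# Same data, same separation; only the dependence at lag a differs.
print(blocked_bound(n=1000, a=10, beta_a=0.0, eps=0.5))
print(blocked_bound(n=1000, a=10, beta_a=0.001, eps=0.5))
```

The trade-off in choosing $$a$$ is visible in the two terms: a larger separation shrinks $$\beta(a)$$ but leaves fewer blocks, which loosens the IID part.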

What it leaves open, however, is how to estimate the mixing coefficients $$\beta(h)$$. For Markov models, one could in principle calculate them from the transition probabilities. For more general processes, though, calculating beta from the known distribution is not easy. In fact, we are not aware of any previous work on estimating the $$\beta(h)$$ coefficients from observational data. (References welcome!) Because of this, even in learning theory, people have just assumed that the mixing coefficients were known, or that it was known they went to zero at a certain rate. This was not enough for what we wanted to do, which was actually calculate bounds on error from data.

There were two tricks to actually coming up with an estimator. The first was to reduce the ambitions a little bit. If you look at the equation for $$\beta(h)$$ above, you'll see that it involves integrating over the infinite-dimensional distribution. This is daunting, so instead of looking at the whole past and future, we'll introduce a horizon, $$d$$ steps away, and cut things off there: $\begin{eqnarray*} \beta^{(d)}(h) & \equiv & d_{TV}(P(X^t_{t-d}, X_{t+h}^{t+h+d}), P(X^t_{t-d}) \otimes P(X_{t+h}^{t+h+d})) \\ & = & \frac{1}{2}\int{|p(x^t_{t-d},x_{t+h}^{t+h+d})-p(x^t_{t-d})p(x^{t+h+d}_{t+h})|dx^t_{t-d}dx^{t+h+d}_{t+h}} \end{eqnarray*}$ If $$X$$ is a Markov process, then there's no difference between $$\beta^{(d)}(h)$$ and $$\beta(h)$$. If $$X$$ is a Markov process of order $$p$$, then $$\beta^{(d)}(h) = \beta(h)$$ once $$d \geq p$$. If $$X$$ is not Markov at any order, it is still the case that $$\beta^{(d)}(h) \rightarrow \beta(h)$$ as $$d$$ grows. So we have an approximation to $$\beta$$ which only involves finite-dimensional integrals, which we might have some hope of doing.

The other trick is to get rid of those integrals. Another way of writing the beta-dependence between the random variables $$U$$ and $$V$$ is $\beta(U,V) = \sup_{\mathcal{A},\mathcal{B}}{\frac{1}{2}\sum_{a\in\mathcal{A}}{\sum_{b\in\mathcal{B}}{\left| \Pr{(a \cap b)} - \Pr{(a)}\Pr{(b)} \right|}}}$ where $$\mathcal{A}$$ runs over finite partitions of values of $$U$$, and $$\mathcal{B}$$ likewise runs over finite partitions of values of $$V$$. I won't try to show that this formula is equivalent to the earlier definition, but I will contend that if you think about how that integral gets cashed out as a sum, you can sort of see how it would be. If we want $$\beta^{(d)}(h)$$, we can take $$U = X^{t}_{t-d}$$ and $$V = X^{t+h+d}_{t+h}$$, and we could find the dependence by taking the supremum over partitions of those two variables.

Now, suppose that the joint density $$p(x^t_{t-d},x_{t+h}^{t+h+d})$$ was piecewise constant, with those pieces being rectangles parallel to the coordinate axes. Then sub-dividing those rectangles would not change the sum, and the $$\sup$$ would actually be attained for that particular partition. Most densities are not of course piecewise constant, but we can approximate them by such piecewise-constant functions, and make the approximation arbitrarily close (in total variation). More, we can estimate those piecewise-constant approximating densities from a time series. Those estimates are, simply, histograms, which are about the oldest form of density estimation. We show that histogram density estimates converge in total variation on the true densities, when the bin-width is allowed to shrink as we get more data.

Because the total variation distance is in fact a metric, we can use the triangle inequality to get an upper bound on the true beta coefficient, in terms of the beta coefficients of the estimated histograms, and the expected error of the histogram estimates. All of the error terms shrink to zero as the time series gets longer, so we end up with consistent estimates of $$\beta^{(d)}(h)$$. That's enough if we have a Markov process, but in general we don't. So we can let $$d$$ grow as $$n$$ does, and that (after a surprisingly long measure-theoretic argument) turns out to do the job: our histogram estimates of $$\beta^{(d)}(h)$$, with suitably-growing $$d$$, converge on the true $$\beta(h)$$.
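The plug-in idea can be sketched in a few lines of standard-library Python. This is a toy version under stated assumptions, not the estimator as specified in the papers: the number of bins and the block length `d` are fixed by hand rather than tuned to grow with $$n$$, `d` here counts the points in each block, and `beta_hat` and `bin_index` are names invented for the example.

```python
from collections import Counter


def bin_index(value, lo, hi, bins):
    """Index of the equal-width histogram bin on [lo, hi] containing value."""
    if hi == lo:
        return 0
    return min(int((value - lo) / (hi - lo) * bins), bins - 1)


def beta_hat(x, h, d=1, bins=2):
    """Histogram plug-in estimate of beta^(d)(h) from one sample path x."""
    lo, hi = min(x), max(x)
    codes = [bin_index(v, lo, hi, bins) for v in x]
    joint, past, future = Counter(), Counter(), Counter()
    # t is the last index of the past block; the future block starts at t + h.
    for t in range(d - 1, len(codes) - h - d + 1):
        a = tuple(codes[t - d + 1 : t + 1])
        b = tuple(codes[t + h : t + h + d])
        joint[(a, b)] += 1
        past[a] += 1
        future[b] += 1
    m = sum(joint.values())
    if m == 0:
        raise ValueError("series too short for this h and d")
    # Half the L1 distance between the joint histogram and the product of marginals.
    total = 0.0
    for a, n_a in past.items():
        for b, n_b in future.items():
            total += abs(joint[(a, b)] / m - (n_a / m) * (n_b / m))
    return 0.5 * total


alternating = [0, 1] * 50 + [0]    # the next value is determined by the current one
balanced = [0, 0, 1, 1] * 2 + [0]  # every (current, next) pair is equally frequent
print(beta_hat(alternating, h=1), beta_hat(balanced, h=1))
```

The two toy series are deterministic, chosen only so the arithmetic can be checked by hand; on real data the estimate is noisy, and the number of cells grows like `bins ** (2 * d)`, which is why the rates at which the bin-width shrinks and $$d$$ grows matter.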

To confirm that this works, the papers go through some simulation examples, where it's possible to cross-check our estimates. We can of course also do this for empirical time series. For instance, in his thesis, Daniel took four standard macroeconomic time series for the US (GDP, consumption, investment, and hours worked, all de-trended in the usual way). This data goes back to 1948, and is measured four times a year, so there are 255 quarterly observations. Daniel estimated a $$\beta$$ of 0.26 at one quarter's separation, $$\widehat{\beta}(2) = 0.15$$, $$\widehat{\beta}(3) = 0.02$$, and somewhere between 0 and 0.11 for $$\widehat{\beta}(4)$$. (That last is a sign that we don't have enough data to go beyond $$h = 4$$.) Optimistically assuming no dependence beyond a year, one can calculate the effective number of independent data points, which is not 255 but 31. This has morals for macroeconomics which are worth dwelling on, but that will have to wait for another time. (Spoiler: $$\sqrt{\frac{1}{31}} \approx 0.18$$, and that's if you're lucky.)
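How 31 comes out of 255 is not spelled out above. One reading, assumed here rather than taken from the thesis, is the blocking argument with separation $$a = 4$$ quarters and every other block kept, which reproduces both numbers:

```python
import math


def effective_blocks(n, a):
    """Number of nearly independent blocks: n // a blocks, keeping every other one.

    ASSUMPTION: this is one reading of the calculation, following the
    blocking argument; it is not stated in the source.
    """
    return (n // a) // 2


n_eff = effective_blocks(255, 4)  # 255 quarters, one-year separation
rate = 1 / math.sqrt(n_eff)       # typical size of the sampling error
print(n_eff, round(rate, 2))
```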

It's inelegant to have to construct histograms when all we want is a single number, so it wouldn't surprise us if there were a slicker way of doing this. (For estimating mutual information, which is in many ways analogous, estimating the joint distribution as an intermediate step is neither necessary nor desirable.) But for now, we can do it, when we couldn't before.

Posted by crshalizi at April 20, 2012 14:57 | permanent link

## April 15, 2012

### Graphical Causal Models (Advanced Data Analysis from an Elementary Point of View)

Probabilistic prediction is about passively selecting a sub-ensemble, leaving all the mechanisms in place, and seeing what turns up after applying that filter. Causal prediction is about actively producing a new ensemble, and seeing what would happen if something were to change ("counterfactuals"). Graphical causal models are a way of reasoning about causal prediction; their algebraic counterparts are structural equation models (generally nonlinear and non-Gaussian). The causal Markov property. Faithfulness. Performing causal prediction by "surgery" on causal graphical models. The d-separation criterion. Path diagram rules for linear models.

Posted by crshalizi at April 15, 2012 20:03 | permanent link

### Exam: Is This Test Really Necessary? (Advanced Data Analysis from an Elementary Point of View)

In which the analysis of multivariate data is recursively applied.

Posted by crshalizi at April 15, 2012 20:02 | permanent link

### Graphical Models (Advanced Data Analysis from an Elementary Point of View)

Conditional independence and dependence properties in factor models. The generalization to graphical models. Directed acyclic graphs. DAG models. Factor, mixture, and Markov models as DAGs. The graphical Markov property. Reading conditional independence properties from a DAG. Creating conditional dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with DAGs; does asbestos whiten teeth?

Posted by crshalizi at April 15, 2012 20:01 | permanent link

### Mixture Models (Advanced Data Analysis from an Elementary Point of View)

From factor analysis to mixture models by allowing the latent variable to be discrete. From kernel density estimation to mixture models by reducing the number of points with copies of the kernel. Probabilistic formulation of mixture models. Geometry: planes again. Probabilistic clustering. Estimation of mixture models by maximum likelihood, and why it leads to a vicious circle. The expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious circle with iterative approximation. More on the EM algorithm: convexity, Jensen's inequality, optimizing a lower bound, proving that each step of EM increases the likelihood. Mixtures of regressions. Other extensions.

Extended example: Precipitation in Snoqualmie Falls revisited. Fitting a two-component Gaussian mixture; examining the fitted distribution; checking calibration. Using cross-validation to select the number of components to use. Examination of the selected mixture model. Suspicious patterns in the parameters of the selected model. Approximating complicated distributions vs. revealing hidden structure. Using bootstrap hypothesis testing to select the number of mixture components.

Posted by crshalizi at April 15, 2012 20:00 | permanent link

## April 08, 2012

### "Generalization Error Bounds for Time Series"

On Friday, my student Daniel McDonald, who I have been lucky enough to jointly advise with Mark Schervish, defeated the snake — that is, defended his thesis:

Generalization Error Bounds for Time Series
In this thesis, I derive generalization error bounds — bounds on the expected inaccuracy of the predictions — for time series forecasting models. These bounds allow forecasters to select among competing models, and to declare that, with high probability, their chosen model will perform well — without making strong assumptions about the data generating process or appealing to asymptotic theory. Expanding upon results from statistical learning theory, I demonstrate how these techniques can help time series forecasters to choose models which behave well under uncertainty. I also show how to estimate the beta-mixing coefficients for dependent data so that my results can be used empirically. I use the bound explicitly to evaluate different predictive models for the volatility of IBM stock and for a standard set of macroeconomic variables. Taken together my results show how to control the generalization error of time series models with fixed or growing memory.
PDF

I hope to have a follow-up post very soon about the substance of Daniel's work, which is part of our INET grant, but in the meanwhile: congratulations, Dr. McDonald!

Posted by crshalizi at April 08, 2012 17:25 | permanent link

## April 06, 2012

### On Refereeing a Manuscript for PNAS with Roughly a Hundred Hypothesis Tests

Posted by crshalizi at April 06, 2012 01:03 | permanent link

## April 04, 2012

### On Academic Talks: Memory and Fear

Attention conservation notice: 2000 words of advice to larval academics, based on mere guesswork and ill-assimilated psychology.

It being the season for job-interview talks, student exam presentations, etc., the problems novices have with giving them are much on my mind. And since I find myself composing the same e-mail of advice over and over, why not write it out once and for all?

Once you understand the purpose of academic talks, it becomes clear that the two fundamental obstacles to giving good talks are memory and fear.

The point of an academic talk is to try to persuade your audience to agree with you about your research. This means that you need to raise a structure of argument in their minds, in less than an hour, using just your voice, your slides, and your body-language. Your audience, for its part, has no tools available to it but its ears, eyes, and mind. (Their phones do not, in this respect, help.)

This is a crazy way of trying to convey the intricacies of a complex argument. Without external aids like writing and reading, the mind of the East African Plains Ape has little ability to grasp, and more importantly to remember, new information. (The great psychologist George Miller estimated the number of pieces of information we can hold in short-term memory as "the magical number seven, plus or minus two", but this may if anything be an over-estimate.) Keeping in mind all the details of an academic argument would certainly exceed that slight capacity*. When you over-load your audience, they get confused and cranky, and they will either tune you out or avenge themselves on the obvious source of their discomfort, namely you.

People can remember things more easily if they have a scheme they can relate them to, which helps them appreciate their relevance. Your audience will come to the talk with various schemata; use them.

• Use their existing schema to help them see why they should care about what you're talking about. Why should it interest or matter to them?
• Make sure to relate your new information to ideas the audience is already familiar with, as examples, extensions, etc.
• If you must introduce new ideas, try to build up to them from things the audience knows, explaining how to modify those ideas to get yours, rather than hammering them with an unmotivated and abstract definition. (Even if you are trying to persuade them that everything they think they know is wrong, and their ideas are mere nonsense, you want to be understood, which means starting from where they are.)
• Near the very beginning of your talk, give them a scheme or big, over-all picture or outline for your argument. (This is the rational kernel behind the ritual of a table-of-contents slide.) The point of this outline is to help them grasp the relevance of the particulars you present as you go along. (If it only all comes together in the end, you've lost them long before the end.)
• Avoid complicated sub-arguments. If you must make one, begin it with a sketch or outline of its own, and end it with the one important conclusion the audience needs to remember.

As for limiting the information the audience needs to remember, the main rule is to ask yourself "Do they need to know this to follow the argument?" and "Will they need to remember this later?" If they do not need to know it even for a moment, cut it. (Showing or telling them details, followed by "don't worry about the details", does not work.) If they will need to remember it later, emphasize it, and remind them when you need it.

To answer "Do they need to know this?" and "Will they have to recall this?", you need to be intimately familiar with the logic of your own talk. The ideal of such familiarity is to have that logic committed to memory — the logic, not some exact set of words. When you really understand it, when you grasp all the logical connections and see why everything that's necessary is needed, the argument can "carry you along" through the presentation, letting you compose appropriate words as you go, without rote memorization. This has many advantages, not least the ability to field questions.

As a corollary to limiting what the audience needs to remember, if you are using slides, their text should be (1) prompts for your exposition and your audience's memory, or (2) things which are just too hard to say, like equations**. (Do not, whatever you do, read aloud the text of your slides.) But whether spoken or on the slide, cut your talk down to the essentials. This requires you to know what is essential.

"But the lovely, no the divine, details!" you protest. "All those fine points I checked, all the intricate work I did, all the alternatives I ruled out? When do I get to talk about them?" To which there are several responses.

1. The point of the talk is not to please you, by reminding yourself of what a badass you are, but to tell your audience something useful and interesting. (Note to graduate students: It is important that you internalize that you are, in fact, a badass, but it is also important that then you move on. Needing to have your ego stroked by random academics listening to talks is a sign that you have not yet reached this stage.) Unless something matters to your actual message, it really doesn't belong in the main body of the talk.
2. You can stick an arbitrary amount of detail in the "I'm glad you asked that" slides, which go after the one which says "Thank you for your attention! Any questions?".
3. You also can and should put all these details in your paper, and the people who really care, to whom it really matters, will go read your paper. Once again, think of an academic talk as an extended oral abstract.

To sum up on memory, then: successful academic talks persuade your audience of your argument. To do this, and not instead alienate your audience, you have to work with their capacities and prior knowledge, and not against them. Negatively, this means limiting the amount of information you expect them to retain. Positively, you need to use, and make, schemata which help them see the relevance of particulars. You can still give an awful talk this way (maybe your argument is incredibly bad), but you can hardly give a good talk without it.

The major consideration in crafting the content of your talk is your audience's memory. The major consideration for the delivery of the talk is your fear. (Your own memory is not so great, but you have of course internalized the schema for your own talk, and so you can re-generate it as you go, using your slides as prompts.) Public speaking, especially about something important to you, and to an audience whose opinion matters to you, is intimidating to many people. Fear makes you a worse public speaker; you mumble, you forget your place in the argument, you can't think on your feet, you project insecurity (possibly by over-compensating), etc. You do not need to become a great, fearless public speaker; you do need to be adequate at it. The three major routes to doing this, in my experience, are desensitization, dissociation, and deliberate acts.

Desensitization is simple: the more you do it, and emerge unscathed, the less fearful you will be. Practice giving your talks to safe but critical audiences. ("But critical" is key: you need them to tell you honestly what wasn't working well. [Something can always be done better.]) If you can't get a safe-but-critical audience, get an audience you don't care about (e.g., some random conference), and practice on them. Remind yourself, too, that while your talk may be a big deal for you, it's rarely a big deal for your audience.

Dissociation is about embracing being a performer on a stage: the audience's idea of you is already a fictional character, so play a character. It can, once again, be very liberating to separate the persona you're adopting for the talk from the person you actually are. If that seems unethical, go read The Presentation of Self in Everyday Life. An old-fashioned insistence that what really matters are the ideas, and not their merely human vessel, can also be helpful here.

At the outset, I said that the two great obstacles to giving a good talk are memory and fear. The converse is that if you truly understand your own argument, and you truly believe in it, you can convey it in a way which works with your audience's memory, and overcome your own fear. The sheer mechanics of presentation will come with practice, and you will have something worth presenting.

• Aristotle, Rhetoric
• Erving Goffman, The Presentation of Self in Everyday Life
• Albert B. Lord, The Singer of Tales
• Neil Mercer, Words and Minds: How We Use Language to Think Together
• Herbert Simon, The Sciences of the Artificial
• Dan Sperber and Deirdre Wilson, Relevance: Cognition and Communication

*: Some branches of the humanities and the social sciences have the horrible custom of reading an academic paper out loud, apparently on the theory that this way none of the details get glossed over. The only useful advice which can be given about this is "Don't!". Academic prose has many virtues, but it is simply not designed for oral communication. Moreover, all of your audience consists of people who are very good at reading such prose, and can certainly do so at least as fast as you can recite it. Having people recite their papers, or even prepared remarks written in the style of a paper, does nothing except waste an hour in the life of the speaker and the audience, and none of us has hours to waste.

**: As a further corollary, and particularly important in statistics, big tables of numbers (e.g., regression coefficients) are pointless; and here "big" means "larger than 2x2".

Posted by crshalizi at April 04, 2012 01:09 | permanent link

## April 03, 2012

### How the Recent Mammals Got Their Size Distribution (Advanced Data Analysis from an Elementary Point of View)

Homework 8: in which returning to paleontology gives us an excuse to work with simulations, and to compare distributions.

Posted by crshalizi at April 03, 2012 23:40 | permanent link

### Red Brain, Blue Brain (Advanced Data Analysis from an Elementary Point of View)

Homework 8: in which we try to predict political orientation from ~~bumps on the skull~~ the volume of brain regions determined by MRI and adjusted by (unknown) formulas.

Posted by crshalizi at April 03, 2012 09:20 | permanent link

### Factor Analysis (Advanced Data Analysis from an Elementary Point of View)

Adding noise to PCA to get a statistical model. The factor model, or linear regression with unobserved independent variables. Assumptions of the factor model. Implications of the model: observable variables are correlated only through shared factors; "tetrad equations" for one factor models, more general correlation patterns for multiple factors. Our first look at latent variables and conditional independence. Geometrically, the factor model says the data cluster on some low-dimensional plane, plus noise moving them off the plane. Estimation by heroic linear algebra; estimation by maximum likelihood. The rotation problem, and why it is unwise to reify factors. Other models which produce the same correlation patterns as factor models.

Reading: Notes, chapter 19; factors.R and sleep.txt

Posted by crshalizi at April 03, 2012 09:15 | permanent link

### Principal Components Analysis (Advanced Data Analysis from an Elementary Point of View)

Principal components is the simplest, oldest and most robust of dimensionality-reduction techniques. It works by finding the line (plane, hyperplane) which passes closest, on average, to all of the data points. This is equivalent to maximizing the variance of the projection of the data on to the line/plane/hyperplane. Actually finding those principal components reduces to finding eigenvalues and eigenvectors of the sample covariance matrix. Why PCA is a data-analytic technique, and not a form of statistical inference. An example with cars. PCA with words: "latent semantic analysis"; an example with real newspaper articles. Visualization with PCA and multidimensional scaling. Cautions about PCA; the perils of reification; illustration with genetic maps.

Reading: Notes, chapter 18; pca.R, pca-examples.Rdata, and cars-fixed04.dat

Posted by crshalizi at April 03, 2012 09:10 | permanent link

### Relative Distributions and Smooth Tests (Advanced Data Analysis from an Elementary Point of View)

Applying the right CDF to a continuous random variable makes it uniformly distributed. How do we test whether some variable is uniform? The smooth test idea, based on series expansions for the log density. Asymptotic theory of the smooth test. Choosing the basis functions for the test and its order. Smooth tests for non-uniform distributions through the transformation. Dealing with estimated parameters. Some examples. Non-parametric density estimation on [0,1]. Checking conditional distributions and calibration with smooth tests. The relative distribution idea: comparing whole distributions by seeing where one set of samples falls in another distribution. Relative density and its estimation. Illustrations of relative densities. Decomposing shifts in relative distributions.