April 01, 2014

"Proper scoring rules and linear estimating equations in exponential families" (Next Week at the Statistics Seminar)

Attention conservation notice: Only of interest if you (1) care about estimating complicated statistical models, and (2) will be in Pittsburgh on Monday.

Steffen Lauritzen, "Proper scoring rules and linear estimating equations in exponential families"
Abstract: In models of high complexity, the computational burden involved in calculating the maximum likelihood estimator can be forbidding. Proper scoring rules (Brier 1950, Good 1952, Bregman 1967, de Finetti 1975) such as the logarithmic score, the Brier score, and others, induce natural unbiased estimating equations that generally lead to consistent estimation of unknown parameters. The logarithmic score corresponds to maximum likelihood estimation whereas a score function introduced by Hyvärinen (2005) leads to linear estimating equations for exponential families.
We shall briefly review the facts about proper scoring rules and their associated divergences, entropy measures, and estimating equations. We show how Hyvärinen's rule leads to particularly simple estimating equations for Gaussian graphical models, including Gaussian graphical models with symmetry.
The lecture is based on joint work with Philip Dawid, Matthew Parry, and Peter Forbes. For a recent reference see: P. G. M. Forbes and S. Lauritzen (2013), "Linear Estimating Equations for Exponential Families with Application to Gaussian Linear Concentration Models", arXiv:1311.0662.
Time and place: 4--5 pm on Monday, 7 April 2014, in 125 Scaife Hall.
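To unpack the linearity claim slightly (a gloss of my own, not from the abstract): the Hyvärinen score of a density $$q$$ at $$x$$ is $$S(x,q) = \Delta_x \log q(x) + \frac{1}{2}\|\nabla_x \log q(x)\|^2$$. For an exponential family, $$\log q_{\theta}(x) = \theta^T t(x) - \psi(\theta)$$, and the log normalizing factor $$\psi(\theta)$$ is killed by the derivatives in $$x$$, so the empirical average score is $\frac{1}{n}\sum_{i=1}^{n}{S(x_i, q_{\theta})} = \theta^T \left(\frac{1}{n}\sum_{i}{\Delta t(x_i)}\right) + \frac{1}{2} \theta^T \left(\frac{1}{n}\sum_{i}{J_t(x_i) {J_t(x_i)}^T}\right) \theta$ where $$J_t$$ is the Jacobian of the sufficient statistic $$t$$ and $$\Delta t$$ its component-wise Laplacian. This is quadratic in $$\theta$$, so setting its gradient to zero gives estimating equations which are linear in $$\theta$$: no normalizing constant to compute, no iterative maximization.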

Much of what I know about graphical models I learned from Prof. Lauritzen's book. His work on sufficient statistics and extremal models, and their connections to symmetry and prediction, has shaped how I think about big chunks of statistics, including stochastic processes and networks. I am really looking forward to this.

(To add some commentary purely of my own: I sometimes encounter the idea that frequentist statistics is somehow completely committed to maximum likelihood, and has nothing to offer when that fails, as it sometimes does [1]. While I can't of course speak for every frequentist statistician, this seems silly. Frequentism is a family of ideas about when probability makes sense, and it leads to some ideas about how to evaluate statistical models and methods, namely, by their error properties. What justifies maximum likelihood estimation, from this perspective, is not the intrinsic inalienable rightness of taking that function and making it big. Rather, it's that in many situations maximum likelihood converges to the right answer (consistency), and in a somewhat narrower range will converge as fast as anything else (efficiency). When those fail, so much the worse for maximum likelihood; use something else that is consistent. In situations where maximizing the likelihood has nice mathematical properties but is computationally intractable, so much the worse for maximum likelihood; use something else that's consistent and tractable. Estimation by minimizing a well-behaved objective function has many nice features, so when we give up on likelihood it's reasonable to try minimizing some other proper scoring function, but again, there's nothing which says we must.)

[1]: It's not worth my time today to link to particular examples; I'll just say that from my own reading and conversation, this opinion is not totally confined to the kind of website which proves that rule 34 applies even to Bayes's theorem. ^

Posted by crshalizi at April 01, 2014 10:45 | permanent link

March 15, 2014

"The Neglect of Fluctuations in the Thermodynamics of Computation" (Next Week at the Energy and Information Seminar)

Lo these many years ago, I blogged about how a paper of John Norton's had led me to have doubts about Landauer's Principle. Prof. Norton has continued to work on this topic, and I am very happy to share the news about his upcoming talk at CMU's "Energy and Information" seminar:

John D. Norton, "The Neglect of Fluctuations in the Thermodynamics of Computation"
Abstract: The thermodynamics of computation assumes that thermodynamically reversible processes can be realized arbitrarily closely at molecular scales. They cannot. Overcoming fluctuations so that a molecular scale process can be completed creates more thermodynamic entropy than the small quantities tracked by Landauer's Principle. This no go result is the latest instance of a rich history of problems posed by fluctuations for thermodynamics.
Time and place: Noon--1 pm on Wednesday, 19 March 2014, in room D-210, Hamerschlag Hall.
Related papers: "All Shook Up: Fluctuations, Maxwell's Demon and the Thermodynamics of Computation", Entropy 15 (2013): 4432--4483; "The End of the Thermodynamics of Computation: A No-Go Result", Philosophy of Science 80 (2013): 1182--1192

(For the record, I remain of at least two minds about Landauer's principle. The positive arguments for it seem either special cases or circular, but the conclusion makes so much sense...)

Manual trackback / update, 1 April 2014: Eric Drexler, at Metamodern, objects that "the stages of computation themselves need not be in equilibrium with one another, and hence subject to back-and-forth fluctuations" (his italics). In particular, Drexler suggests introducing an external time-varying potential that "can carry a system deterministically through a series of stages while the system remains at nearly perfect thermodynamic equilibrium at each stage". But I think this means that the whole set-up is not in equilibrium, and in fact this proposal seems quite compatible with sec. 2.2 of Norton's "No-Go" paper. Norton agrees that "there is no obstacle to introducing a slight disequilibrium in a macroscopic system in order to nudge a thermodynamically reversible process to completion"; his claim is that the magnitude of the required disequilibria, measured in terms of free energy, is large compared to Landauer's bound. The point is not that it's impossible to build molecular-scale computers (which would be absurd), but that they will have to dissipate much more heat than Landauer suggests. I won't pretend this settles the matter, but I do have a lecture to prepare...

Posted by crshalizi at March 15, 2014 11:25 | permanent link

March 04, 2014

"Unifying the Counterfactual and Graphical Approaches to Causality" (Tomorrow at the Statistics Seminar)

Attention conservation notice: Late notice of an academic talk in Pittsburgh. Only of interest if you care about the places where the kind of statistical theory that leans on concepts like "the graphical Markov property" merges with the kind of analytical metaphysics which tries to count the number of possibly fat men not currently standing in my doorway.

A great division in the field of causal inference in statistics is between those who like to think of everything in terms of "potential outcomes", and those who like to think of everything in terms of graphical models. More exactly, while partisans of potential outcomes tend to denigrate graphical models (*), those of us who like the latter tend to presume that potential outcomes can be read off from graphs, and hope someone will get around to showing some sort of formal equivalence.

That somebody appears to have arrived.

Thomas S. Richardson, "Unifying the Counterfactual and Graphical Approaches to Causality via Single World Intervention Graphs (SWIGs)"
Abstract: Models based on potential outcomes, also known as counterfactuals, were introduced by Neyman (1923) and later popularized by Rubin (1974). Such models are used extensively within biostatistics, statistics, political science, economics, and epidemiology for reasoning about causation. Directed acyclic graphs (DAGs), introduced by Wright (1921), are another formalism used to represent causal systems. Graphs are also extensively used in computer science, bioinformatics, sociology and epidemiology.
In this talk, I will present a simple approach to unifying these two approaches via a new graph, termed the Single-World Intervention Graph (SWIG). The SWIG encodes the counterfactual independences associated with a specific hypothetical intervention on a set of treatment variables. The nodes on the SWIG are the corresponding counterfactual random variables. The SWIG is derived from a causal DAG via a simple node splitting transformation. I will illustrate the theory with a number of examples. Finally I show that SWIGs avoid a number of pitfalls that are present in an alternative approach to unification, based on "twin networks" that has been advocated by Pearl (2000).
Joint work with James Robins; paper, and shorter summary paper from the Causal Structure Learning Workshop at UAI 2013
Time and place: 4:30--5:30 pm on Wednesday, 5 March 2014, in Scaife Hall 125

As always, the talk is free and open to the public, whether the public follows their arrows or not.

*: I myself have heard Donald Rubin assert that graphical models cannot handle counterfactuals, or non-additive interactions between variables (particularly that they cannot handle non-additive treatments), and that their study leads to neglecting analysis-of-design questions. (This was during his talk at the CMU workshop "Statistical and Machine Learning Approaches to Network Experimentation", 22 April 2013.) This does not diminish Rubin's massive contributions to statistics in general, and to causal inference in particular, but does not exactly indicate a thorough knowledge of a literature which goes rather beyond "playing with arrows".

Posted by crshalizi at March 04, 2014 17:50 | permanent link

February 28, 2014

On Reaching the Clearing at the End of the Tenure Track

Attention conservation notice: Navel-gazing by a middle-aged academic.

I got tenure a few weeks ago. (Technically it takes effect in July.) The feedback from the department and university which accompanied the decision was gratifyingly positive, and I'm pleased that blogging didn't hurt me at all, and perhaps even helped. I got here with the help of a lot of mentors, colleagues, and friends (not a few of whom I met through this blog), and I feel some vindication on their behalf. For myself, I feel — relieved, even pleased.

Relieved and pleased, but not triumphant. I benefited from a huge number of lucky breaks. I know too many people who would be at least as good in this sort of job as I am, and would like such a job, but instead have ones which are far less good for them. If a few job applications, grant decisions, or choices about what to work on when had turned out a bit different, I could have been just as knowledgeable, had ideas just as good, worked on them as obsessively, etc., and still been in their positions, at best. Since my tenure decision came through, I've had two papers and a grant proposal rejected, and another paper idea I've been working on for more than a year scooped. A month like that at the wrong point earlier on might well have sunk my academic career. You don't get tenure at a major university without being a productive scholar (not for the most part anyway), but you also don't get it without being crazily lucky, and I can't feel triumphant about luck.

It's also hard for me to feel triumph because, by the time I get tenure, I will have been at CMU for nine years and change. Doing anything for that long marks you, or at least it marks me, and I'm not sure I like the marks. The point of tenure is security, and I hope to broaden my work, to follow some interests which are more speculative and risky and seem like they will take longer to pay off, if they ever do. But I have acquired habits and made commitments which will be very hard to shift. One of those habits is to think of my future in terms of what sort of scholarly work I'm going to be doing, and presuming that I will be working all the time, with only weak separation between work and the rest of life. I even have some fear that this has deformed my character, making some ordinary kinds of happiness insanely difficult. But maybe "deformed" is the wrong word; maybe I stuck with this job because I was already that kind of person. I can't bring myself to wish I wasn't so academic in my interests, or that I hadn't pursued the career I have, or that I had been less lucky in it. But I worry about what I have given up for it, and how those choices will look in another nine years, or twenty-nine.

Sometime in the future, I may write about what I think about tenure as an institution. But today is a beautiful winter's day here in Pittsburgh, cold but clear, the sky is a brilliant pale blue right now. It's my 40th birthday. I'm going outside to take a walk — and then probably going back to work.

Posted by crshalizi at February 28, 2014 15:52 | permanent link

January 02, 2014

36-350, Fall 2013: Self-Evaluation and Lessons Learned (Introduction to Statistical Computing)

This was not one of my better performances as a teacher.

I felt disorganized and unmotivated, which is a bit perverse, since it's the third time I've taught the class, and I know the material very well by now. The labs were too long, and my attempts to shove the excess parts of the labs into revised homework assignments did not go over very well. The final projects were decent, but on average not as good as the previous two years.

I have two ideas about what went wrong. One is of course about kids these days (i.e., blaming the victims), and the other is about my own defects of character.

First, in retrospect, previous iterations of the course benefited from the fact that there hadn't been an undergraduate course here in statistical computing. This meant there was a large pool of advanced statistics majors who wanted to take it, but already knew a lot of the background materials and skills; the modal student was also more academically mature generally. That supply of over-trained students is now exhausted, and it's not coming back either — the class is going to become a requirement for the statistics major. (As it should.) So I need to adjust my expectations of what they know and can do on their own downward in a major way. More exactly, if I want them to know how to do something, I have to make sure I teach it to them, and cut other things from the curriculum to make room. This, I signally failed to do.

Second, I think the fact that this was the third time I have taught basically the same content was in fact part of the problem. It made me feel too familiar with everything, and gave me an excuse for not working on devising new material up until the last moment, which meant I didn't have everything at my fingers' ends, and frankly I wasn't as excited about it either.

Putting these together suggests that a better idea for next time would be something like the following.

• They need to be drilled in interacting-with-the-machine skills, like how to use text editors (not word processors) and command lines. Therefore: require the use of RStudio, and that all assignments be turned in either in RMarkdown or Sweave. (This would be a bit hypocritical, since I normally don't use any of those myself, but I am not going to teach them my own Emacs-centric work habits.)
• Programming style (meaningful names, commenting, writing effective tests which are clearly separate from the main code, etc.) needs to be religiously enforced. Therefore: enforce it religiously, and make it an explicit part of the grading scheme. Grading rubrics need to be transparent about this.
• They need to learn to think through debugging, design, and re-design. Therefore: Devote some lectures to live examples of all these, with non-trivial code, and pair them with assignments.
• They need more practice with collaboration. (If nothing else this will help them see the importance of writing code others can read.) Therefore: Institute pair programming in labs, and/or require paired work on homework; rotate partners. (I'm not so sure about this requirement.)
• They need to learn about version control and collaboration tools. Therefore: at least one lecture on version control with git, and require its use on projects. (The start-up costs for learning git may be too harsh, but it's what I use and I really won't go through teaching myself something else.)
• They need me to walk them through linear regression (or at least lm) and the logic of maximum likelihood. Therefore: Add a lecture (and assignment) with a systematic demo of lm, formula, residuals, predict. Include at least one non-linear regression method as a contrast case.
• The in-class midterm exam consistently fails to give useful additional information about what they know and can do. Therefore: scrap it in favor of a mini-project.
• The final project needs a lot more scaffolding, or at least feedback. Therefore: have them start projects earlier, and require more interim reports from them.
• They need more concrete data-analytic examples earlier, and they need to see the data-manipulation material earlier. Therefore: re-organize the lectures, putting off the classic computational-statistics topics of simulation and optimization towards the end. Chop off material as needed to fit the basics.
• The labs are too long and too poorly integrated with homework. (Also, they are sequenced to the current lectures, and there are too many copies of old solution sets circulating.) Therefore, throw out all the old assignments and write new ones. One student suggested making the labs not the beginning of a take-home assignment, but a chance to get feedback/corrections at the end; this seems promising. Provide very explicit rubrics.
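For instance, the lm walk-through proposed above might start from something like this (a sketch of my own, on made-up data, not an actual assignment):

```r
# Toy demo of the lm / formula / residuals / predict cycle.
set.seed(42)
x <- runif(100, 0, 10)
y <- 3 + 2*x + rnorm(100, sd = 0.5)        # truth: intercept 3, slope 2
fit <- lm(y ~ x)                           # the formula interface
coef(fit)                                  # estimates, near c(3, 2)
head(residuals(fit))                       # y minus fitted values
predict(fit, newdata = data.frame(x = c(0, 5)))  # predictions at new x
fit.np <- loess(y ~ x)                     # a non-linear contrast case
```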

All of this will be a lot of work for me, but that's part of the point. Hopefully, I will make the time to do this, and it will help.

Posted by crshalizi at January 02, 2014 18:01 | permanent link

January 01, 2014

End of Year Inventory, 2013

Attention conservation notice: Navel-gazing.

Paper manuscripts completed: 4
Papers accepted: 3
Papers rejected: 4 (fools! we'll show you all!)
Papers in revise-and-resubmit purgatory: 2
Papers in refereeing limbo: 1
Papers with co-authors waiting for me to revise: 7
Other papers in progress: I won't look in that directory and you can't make me

Grant proposals submitted: 5
Grant proposals funded: 1
Grant proposals rejected: 3 (fools! we'll show you all!)
Grant proposals in refereeing limbo: 2
Grant proposals in progress for next year: 1
Grant proposals refereed: 2

Talks given and conferences attended: 17, in 10 cities

Classes taught: 2 [i, ii]
New classes taught: 0
Summer school classes taught: 1
New summer school classes taught: 0
Pages of new course material written: not that much

Manuscripts refereed: 21
Number of times I was asked to referee my own manuscript: 0
Manuscripts waiting for me to referee: 5
Manuscripts for which I was the responsible associate editor at Annals of Applied Statistics: 4
Book proposals reviewed: 1
Book proposals submitted: 0
Book outlines made and torn up: 3
Book manuscripts completed: 0
Book manuscripts due soon: 1

Students who completed their dissertations: 0
Students who completed their dissertation proposals: 0
Students preparing to propose in the coming year: 4
Letters of recommendation sent: 60+
Dissertations at other universities for which I was an external examiner: 2 (i, ii)

Tenure packets submitted: 1
Days until final decision on tenure: < 30

Book reviews published on dead trees: 0

Weblog posts: 93
Substantive posts: 17, counting algal growths
Incomplete posts in the drafts folder: 39
Incomplete posts transferred to the papers-in-progress folder: 1

Books acquired: 260
Books begun: 104
Books finished: 76
Books given up: 3
Books sold: 28
Books donated: 0

Major life transitions: 1

Posted by crshalizi at January 01, 2014 00:01 | permanent link

December 04, 2013

Computing for Statistics (Introduction to Statistical Computing)

(My notes from this lecture are too fragmentary to post; here's the sketch.)

What should you remember from this class?

Not: my mistakes (though remember that I made them).

Not: specific packages and ways of doing things (those will change).

Not: the optimal algorithm, the best performance (human time vs. machine time).

Not even: R (that will change).

R is establishing itself as a standard for statistical computing, but you can expect to have to learn at least one new language in the course of a reasonable career of scientific programming, and probably more than one. I was taught rudimentary coding in Basic and Logo, but really only learned to program in Scheme. In the course of twenty years of scientific programming, I have had to use Fortran, C, Lisp, Forth, Expect, C++, Java, Perl and of course R, with glances at Python and OCaml over collaborators' shoulders, to say nothing of near-languages like Unix shell scripting. This was not actually hard, just tedious. Once you have learned to think like a programmer with one language, getting competent in the syntax of another is just a matter of finding adequate documentation and putting in the time to practice it --- or finding minimal documentation and putting in even more time (I'm thinking of you, CAM-Forth). It's the thinking-like-a-programmer bit that matters.

Instead, remember rules and habits of thinking

1. Programming is expression: take a personal, private, intuitive, irreproducible series of acts of thought, and make it public, objective, shared, explicit, repeatable, improvable. This resembles both writing and building a machine: communicative like writing, but with the impersonal, it-all-must-fit check of the machine. All the other principles follow from this fact, that it is turning an act of individual thought into a shared artifact --- reducing intelligence to intellect (cf.).
2. Top-down design
1. What are you trying to do, with what resources, and what criteria of success?
2. Break the whole solution down into a few (say, 2--6) smaller and simpler steps
3. If those steps are so simple you can see how to do them, do them
4. If they are not, treat each one as a separate problem and recurse
This means that you do not have to solve a complex problem all at once, but rather can work on pieces, or share pieces of the work with others, with confidence that your incremental efforts will fit meaningfully into an integrated whole; reductionism is a strategy for making the whole work.
3. Modular and functional programming
1. Use data structures to group related values together
1. Select, or build, data structures which make it easy to get at what you want
2. Select, or build, data structures which hide the implementation details of no relevance to you
2. Use functions to group related operations together
1. Do not reinvent the wheel
2. Do unify your approach
3. Avoid side-effects, if possible
4. Consider using functions as inputs to other functions
3. Take the whole-object view when you can
1. Cleaner
2. More extensible
3. Sometimes faster
4. Essential to clarity in things like split/apply/combine
4. Code from the bottom up; code for revision
1. Start with the smallest, easiest bits from your outline/decomposition and work your way up
2. Document your code as you go
1. Much easier than going back and trying to document after
2. Comment everything
3. Use meaningful names
4. Use conventional names
3. Write tests as you go
1. Make it easy to re-run tests
2. Keep working on your code until it passes your tests
3. Make it easy to add tests
4. Whenever you find a new bug, add a corresponding test
4. Prepare to revise
1. Once code is working, look for common or analogous operations: refactor as general functions; conversely, look if one big function mightn't be split apart
2. Look for data structures that show up together: refactor as one object; conversely, look if some pieces of one data structure couldn't be split off
3. Be willing to re-do some or all of the plan, once you know more about the problem
5. Statistical programming is different
1. Bear in mind noise in input data
1. Consider random test cases, or generally avoid neat numerical values in tests
2. Try to have some idea of how much precision is actually possible, and avoid asking for more (wasted effort)
2. Use probability and simulation to make test cases
1. What should (semi-) realistic data look like?
2. How well does the procedure work?
3. Use probability and simulation to approximate when exact answers are intractable
1. Monte Carlo
2. Stochastic optimization
3. Replacing deterministic exact problems with relaxed probabilistic counterparts
6. Get used to working with multiple systems
1. Use existing packages/software (unless instructed otherwise!)
2. Use specialized tools for specialized purposes
1. Numerical methods code (implicit but hidden)
2. Regular expressions for text processing: grep/sed/awk, Perl, ...
3. Databases for massive storage and retrieval
4. Graphics, etc.
3. Expect to have to keep learning
1. Learning new formalisms: gets easier as you go
2. Learning how to translate their data structures back and forth into yours
To sum up, in order of increasing importance:
• Plan for randomness
• Take the whole-object view when you can
• Think in terms of functions and objects
• Practice top-down design
• Code with an eye to debugging and revision
• Remember that programming is expression
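As a concrete instance of "write tests as you go" (point 4.3 above), here is a deliberately tiny made-up function with a test harness that is easy to re-run and to extend:

```r
# A small function plus its tests, kept together and re-runnable.
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}
test.standardize <- function() {
  stopifnot(abs(mean(standardize(1:10))) < 1e-12)  # centered
  stopifnot(all.equal(sd(standardize(1:10)), 1))   # unit scale
  # a regression test, of the kind you'd add after a (hypothetical) bug:
  stopifnot(!any(is.na(standardize(c(1, 2, 3)))))
  invisible(TRUE)
}
test.standardize()
```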

Posted by crshalizi at December 04, 2013 10:30 | permanent link

December 02, 2013

Speed, Complexity and Interfacing with Other Systems (Introduction to Statistical Computing)

(My notes from this lecture are too fragmentary to type up; here's the sketch)

Programmer time is (usually) much more valuable than computer time; therefore, "premature optimization is the root of all evil". That said, computer time isn't free, and if you have to wait around for the machine to finish, computer time is programmer time.

Vectorize

In R, vectorization can lead to great speed-ups. The reason has to do with how R changes existing objects: it makes a new instance of the object, copies over the unchanged values and fills in the changed ones, and then deletes the old instance of the object. This makes repeated small changes to large objects very inefficient. (On the other hand, incrementally updating a handful of parameters in optimization isn't that bad.) Changing the size of the object is even more inefficient. Vectorization avoids all this allocation, copying, and de-allocation.
very.stupid.adder <- function(x,y) {
  z <- c()
  for (i in 1:length(x)) {
    z <- c(z,x[i]+y[i])   # grows z by one element every pass
  }
  return(z)
}

stupid.adder <- function(x,y) {
  z <- vector(length=length(x))   # pre-allocate the result
  for (i in 1:length(x)) {
    z[i] <- x[i] + y[i]
  }
  return(z)
}

> a <- rnorm(1e5)
> b <- rnorm(1e5)
> 1e6*system.time(very.stupid.adder(a,b)) # time in microseconds
    user   system  elapsed
20266000  3224000 23409000
> 1e6*system.time(stupid.adder(a,b))
  user  system elapsed
204000    1000  204000
> 1e6*system.time(a+b)
  user  system elapsed
     0       0       0

(R's copy-on-change policy isn't a pointless perversity; knowing that objects don't "change in place" makes some memory-management issues much easier to deal with.)

One way to speed up your code, then, is to think carefully about how to vectorize it. Again, the major advantage of vectorization is clarity, but orders-of-magnitude speed-ups aren't bad either.

Cultivate Slop and Arbitrary Ignorance

We can often speed things up by accepting a more approximate answer in place of a more exact one. For statistical problems, in particular, there is no point in making our answers more precise than the data will justify. We've emphasized this for optimization, but it applies to lots of numerical computations as well.

(Here, it is useful to understand the trade-off between computing effort and approximation. If getting a precision of $$\pm \epsilon$$ takes $$O(\log{1/\epsilon})$$ operations, then doubling the $$\epsilon$$ we tolerate shaves only a constant amount of time off the calculation --- we'd have to accept exponentially more error to get real speed-ups. [Optimistically, if we can do the calculation at all, we should be able to do it precisely.] If however the scaling is $$O(1/\epsilon^2)$$, doubling $$\epsilon$$ would reduce the time by a factor of four.)

Computational problems also generally slow down as the data size increases. For many statistical problems, we can actually make use of this, by sampling the data and only running our procedure on a sample. This introduces some approximation error, compared to running on the full data, but this trade-off can be well worth it. Moreover, we can cross-check different randomized subsets of the data, if we're worried about being misled by small parts of it.
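A toy sketch of the subsampling idea (invented data, not from the lecture): fit on a random 1% of the observations and compare to the truth.

```r
# Subsampling: trade a little precision for a lot of speed.
set.seed(7)
n <- 1e6
x <- rnorm(n)
y <- 2*x + rnorm(n)          # truth: slope 2
idx <- sample(n, 1e4)        # a random 1% of the data
fit.sub <- lm(y ~ x, subset = idx)
coef(fit.sub)                # close to the full-data answer,
                             # at a fraction of the cost
```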

If you can't sample, try to reduce how often you have to go over the data.

There is also a whole, very important, field of computer science concerned with developing randomized approximation algorithms (see e.g. Mahoney on randomized matrix algorithms).

Parallelize

If you can break your problem up into independent chunks, you can farm each chunk out to a separate processor, and do them in parallel. (This presumes you have a lot of processors.) The split/apply/combine strategy is designed to work with this. It may be possible to do this for one stage in your problem, followed by others which don't divide neatly; you then alternate between embarrassingly-parallel and centralized processing. There are also problems which lend themselves to parallel processing with communication between the processors; these tend to be ones which are somehow "local" in space.

I am vague here, because the technology for parallelizing data processing is advancing rapidly.
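Still, a minimal split/apply/combine sketch (my own, not from the lecture) can use the parallel package that ships with R; note that mclapply forks, so mc.cores > 1 is Unix-only:

```r
# Split the work into chunks, process them in parallel, combine the results.
library(parallel)
chunks <- split(1:1e4, rep(1:4, length.out = 1e4))   # split into 4 chunks
partial <- mclapply(chunks, function(ix) sum(sqrt(ix)),
                    mc.cores = 2)                    # apply, in parallel
total <- sum(unlist(partial))                        # combine
all.equal(total, sum(sqrt(1:1e4)))                   # same answer as serial
```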

Exploit Others' Work

Don't re-invent the wheel: look for well-executed software which does a part of your job and use it. (Don't write your own linear algebra package, unless you really know what you are doing, and have a good reason not to just improve existing packages.) Modularity and abstraction make it easier to use others' work, and to replace one external library with another, better one.

Give Up on R

The Unix shell provides highly optimized tools for particular data-manipulation jobs. (Bradnam and Koch just scratch the surface.) Cultivating their acquaintance is a good use of your time. Remember that system() lets you run arbitrary shell commands from R, and capture the output.

At some point, speed (and maybe memory) efficiency point to a more streamlined, compiled language for specific parts of your work. The natural choices here are C and (still) Fortran; see Writing R Extensions. You could write all your software for data analysis in C, but then you will spend all your time tracking down memory-allocation bugs.

Actual Complexity Analysis

Computational complexity: resources (time [in operations not seconds], space [in bits not liters], communication, samples) needed as function of problem size.

Merge-sort, for example.

Intrinsic limits (so far as we can tell) to how easily certain sorts of problems can be solved.

Clever algorithms vs. exploiting special structure in problems.

Really deserves attention on its own: read Moore and Mertens. (Though Kleinberg and Tardos may be a bit more directly practical, Moore and Mertens will do more for your soul.)
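To make the merge-sort example above concrete, here is a minimal R implementation (my own sketch): each level of recursion does O(n) work merging, and there are O(log n) levels, hence O(n log n) comparisons overall.

```r
# Merge sort: split in half, sort each half recursively, merge.
merge.sort <- function(x) {
  if (length(x) <= 1) return(x)
  mid <- floor(length(x)/2)
  left <- merge.sort(x[1:mid])
  right <- merge.sort(x[(mid+1):length(x)])
  # merge the two sorted halves in one O(n) pass
  merged <- numeric(length(x)); i <- j <- 1
  for (k in 1:length(merged)) {
    if (j > length(right) || (i <= length(left) && left[i] <= right[j])) {
      merged[k] <- left[i]; i <- i + 1
    } else {
      merged[k] <- right[j]; j <- j + 1
    }
  }
  merged
}
merge.sort(c(3, 1, 4, 1, 5, 9, 2, 6))  # 1 1 2 3 4 5 6 9
```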

Posted by crshalizi at December 02, 2013 10:30 | permanent link

November 25, 2013

Simulation V: Matching Simulation Models to Data (Introduction to Statistical Computing)

$\newcommand{\Expect}[1]{\mathbb{E}\left[ #1 \right]} \DeclareMathOperator*{\argmin}{argmin}$

(My notes for this lecture are too incomplete to be worth typing up, so here's the sketch.)

Methods, Models, Simulations

Statistical methods try to draw inferences about unknown realities from data. They always involve some more or less tightly specified stochastic model, some story (myth) about how the data were generated, and so how the data are linked to the world. We want to know how well our methods work, which is usually hard to assess by just running them on real data. Sometimes, we can push through analytic probability calculations under our model assumptions, but often not. Then we can get some baseline idea of how well our methods should work by simulating the models, running the simulation through the methods, and comparing the results to what we know the truth was in the simulation world. (That is, we see if we can recover the myth.) We can also see how robust our methods are by simulating processes which break the modeling assumptions more or less grossly.

If your method, or your implementation of a common method, doesn't work under simulations, then you had better have a very good argument for why it should be trusted with real data. Always go back and forth between simulations and method development: $\mathrm{simulation}:\mathrm{methods}::\mathrm{unit\ tests}:\mathrm{functions}$.

Adjusting Simulation Models to Match Data

Many (most?) useful probability models are easily specified as step-by-step recipes for simulation, but have ugly, complicated distributions for the data. (Nonlinear dependencies and latent variables both tend to lead to complicated distributions.) Since the models have adjustable parameters, we would like to adjust them so that they match the data we have about the world. If the distributions were straightforward, we could just calculate the likelihood and maximize it (or use Bayesian updating, etc.), but that's not really possible when the distribution is ugly. What to do?

Method of Moments and Friends

Moments ($$\Expect{X}, \Expect{X^2}, \ldots$$) are functionals of the probability distribution. If our model has a $$k$$-dimensional parameter vector $$\theta$$, then we should generally be able to find $$k$$ moments which characterize or identify the parameter, meaning that there are functions $$f_1, f_2, \ldots f_k$$ such that (i) $$\Expect{X} = f_1(\theta), \Expect{X^2} = f_2(\theta), \ldots \Expect{X^k} = f_k(\theta)$$, and (ii) if $$\theta_1 \neq \theta_2$$, then at least one of the $$k$$ moments differs, $$f_j(\theta_1) \neq f_j(\theta_2)$$ for some $$j \in 1:k$$.

If we knew the functions $$f_1, \ldots f_k$$ and the population moments, we could solve for the matching $$\theta$$. (We could also test the model, as opposed to the parameter value, by seeing whether that $$\theta$$ let us match other population moments; and of course the model would be rejected outright if there were no solution for $$\theta$$ at all.) Since we only see sample moments, in the method of moments we try to solve for the parameters which match them: $$\hat{\theta}_{MM}$$ is the $$\theta$$ where $f_i(\theta) = \overline{x^i} ~, ~ i \in 1:k$ (We can then test this by looking at whether we can match other, mathematically-independent functionals.) Since it's annoying to keep the subscripts and superscripts $$i$$ around, let's say that $$f(\theta)$$ is the vector-valued function of theoretical moments, and $$m$$ is the vector of sample moments. Our earlier assumption (ii) means that $$f^{-1}$$ exists, so $\hat{\theta}_{MM} = f^{-1}(m)$ By the law of large numbers, we expect $$m$$ to approach the population moments as we see more data. For $$n$$ not-too-dependent observations, the error should be $$O(1/\sqrt{n})$$, and then if $$f^{-1}$$ is a reasonably smooth function, propagation of error tells us that $$\|\hat{\theta}_{MM} - \theta \| = O(1/\sqrt{n})$$ as well.
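As a worked illustration of this recipe (a minimal sketch in Python; the Gamma family is my choice of example, not one from the lecture): for a Gamma distribution with shape $$a$$ and scale $$s$$, $$\Expect{X} = as$$ and $$\mathrm{Var}(X) = as^2$$, and inverting these two moment equations gives closed-form estimates $$\hat{s} = v/m$$ and $$\hat{a} = m^2/v$$.

```python
import random
import statistics

def gamma_mm(xs):
    """Method-of-moments fit of a Gamma(shape, scale) distribution.
    From E[X] = shape*scale and Var[X] = shape*scale^2, inverting gives
    scale = var/mean and shape = mean^2/var."""
    m = statistics.fmean(xs)
    v = statistics.pvariance(xs)
    return m * m / v, v / m  # (shape, scale)

# Check on synthetic data with known parameters: shape 2, scale 3.
random.seed(1)
xs = [random.gammavariate(2.0, 3.0) for _ in range(100_000)]
shape_hat, scale_hat = gamma_mm(xs)
```

With $$n = 10^5$$ draws, the estimates should land within a few percent of the truth, in line with the $$O(1/\sqrt{n})$$ error above.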

It might happen, due to sampling fluctuations, that the observed moments $$m$$ are not ones which could ever be theoretical moments for the model --- that $$m$$ is not in the range of $$f(\theta)$$, no matter how we adjust $$\theta$$. The usual approach then is to minimize: $\hat{\theta}_{MM} = \argmin_{\theta}{\| f(\theta) - m \|}$ (While I've written this for the ordinary Euclidean norm, it is often better to minimize a weighted norm, giving more emphasis to the moments which have smaller standard errors. This refinement can come later.)

So far I've said nothing about simulation. I've presumed that we can find the moments as explicit functions of the parameters --- that we know $$f$$. However, if we can simulate our model at our favorite $$\theta$$, we can get arbitrarily accurate approximations to $$f$$. Specifically, if we do $$b$$ separate runs of the model at $$\theta$$ and average the output, we get an approximation to $$f(\theta)$$, say $$\tilde{f}(\theta)$$, whose error is $$O(1/\sqrt{b})$$. We can thus take $\hat{\theta}_{MSM} = \argmin_{\theta}{\| \tilde{f}(\theta) - m \|}$ as our method of simulated moments estimate. There will now be two sources of error: simulation error (shrinking with $$b$$, and under our control), and statistical error (shrinking with $$n$$, and out of our control). Again, it can be better to use a weighted norm, emphasizing the moments which can be estimated more precisely.
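When we can't (or pretend we can't) invert $$f$$ analytically, simulation takes over. A minimal sketch in Python (the exponential model, the crude grid search, and the values of $$n$$ and $$b$$ are all my illustrative choices): estimate an exponential rate by matching the simulated mean to the sample mean.

```python
import random
import statistics

def sim_moment(theta, b, rng):
    """Approximate f(theta) = E[X] under Exponential(rate=theta) by
    averaging b simulated draws; simulation error is O(1/sqrt(b))."""
    return statistics.fmean(rng.expovariate(theta) for _ in range(b))

rng = random.Random(42)
data = [rng.expovariate(0.5) for _ in range(5_000)]  # "observed" data; true rate 0.5
m = statistics.fmean(data)                           # the sample moment

# Method of simulated moments by grid search: pick the theta whose
# simulated moment comes closest to the sample moment.
grid = [i / 100 for i in range(10, 101)]
theta_hat = min(grid, key=lambda th: abs(sim_moment(th, 2_000, rng) - m))
```

A real application would use a proper optimizer rather than a grid, and would reuse common random numbers across values of $$\theta$$ to make the objective smoother; this sketch only shows the two error sources at work.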

There is nothing magic about moments --- all we need is a set of functionals of the distribution which characterize the parameters. The foregoing carries over without modification to the generalized method of moments, and the generalized method of simulated moments. It is often particularly useful to make the "generalized moments" parameters in some other statistical model, which is mis-specified but very easy to estimate; then we have indirect inference.

A Brute-Force Approach to Simulated Maximum Likelihood

Observe data $$x$$. Pick an initial $$\theta$$. Do $$b$$ independent runs at that parameter value. Use your favorite non-parametric density estimator to get $$\tilde{p}_{\theta}$$. (As $$b \rightarrow \infty$$, the density estimate converges on the true density, $$\tilde{p}_{\theta} \rightarrow p_{\theta}$$.) Now evaluate that estimated density at $$x$$, and treat $$\tilde{p}_{\theta}(x)$$ as the likelihood. Adjust $$\theta$$ to maximize $$\tilde{p}_{\theta}(x)$$.

This is not very efficient --- as the dimension of $$x$$ grows, $$b$$ has to grow exponentially to maintain a constant accuracy. Moreover, our representation of $$\tilde{p}_{\theta}$$ is going to be horridly redundant, using a huge number of degrees of freedom when we only need $$k$$. There are ways one could begin to get around these issues, but that's beyond the scope of these notes.
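For what it's worth, here is the brute-force recipe in one dimension, where it is still tolerable (a sketch in Python; the Gaussian model, the rule-of-thumb bandwidth, and the grid search over $$\theta$$ are my illustrative choices, not part of the lecture):

```python
import math
import random
import statistics

def kde_log_density(x, draws, h):
    """Log of a Gaussian kernel density estimate built from simulated
    draws, evaluated at the observed point x."""
    dens = sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in draws)
    dens /= len(draws) * h * math.sqrt(2 * math.pi)
    return math.log(max(dens, 1e-300))  # floor guards against log(0)

def sim_loglik(theta, x, b, rng):
    """Do b runs at theta (here, a N(theta, 1) model), estimate the
    density nonparametrically, and evaluate it at the data."""
    draws = [rng.gauss(theta, 1.0) for _ in range(b)]
    h = 1.06 * statistics.pstdev(draws) * b ** (-1 / 5)  # rule-of-thumb bandwidth
    return kde_log_density(x, draws, h)

rng = random.Random(7)
x_obs = 1.3  # a single observed data point
grid = [i / 10 for i in range(-20, 21)]
theta_hat = max(grid, key=lambda th: sim_loglik(th, x_obs, 5_000, rng))
```

Since the exact MLE here is just $$\hat{\theta} = x$$, the simulated version should land near 1.3; the point of the sketch is the mechanism, not this toy answer.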

Checking the Simulation Model Against Data

Because we can simulate, we can look at what our model predicts for arbitrary aspects of the distribution of data. Basically any statistic of the data can thus be a test statistic. What makes a good test statistic?

Obviously, if we adjusted the parameters to match a certain statistic, that statistic is not a good test. (Or at least, I hope it's obvious that if we adjusted the parameters to match the sample variance, we learn nothing by seeing that simulations have about the same sample variance.) A little more subtly, nothing which depends mathematically on the statistics we matched is a good test. If we matched the first $$k$$ moments, no function of those moments is suitable. (If we matched $$\overline{x}$$ and $$\overline{x^2}$$, we learn nothing by seeing if we can also match the sample variance.) We might turn to seeing if we can match higher moments, but those tend to be rather unstable, so that doesn't have much power.

The best recommendation I can make is to actually understand the process you're studying, and pick out interesting, relevant features which are logically independent of the moments you matched. This however requires subject-matter knowledge, rather than statistical or computational skill. In the absence of understanding, we can try to check the model using summary statistics other than the moments we used to estimate parameters --- e.g., if we matched the mean, how well do we do on the median? Another useful fall-back is to try re-running exploratory data analyses on simulation output, and seeing how well that matches the EDA on the data.

Over-identifying: If $$\theta$$ is $$k$$-dimensional but we try to match more than $$k$$ moments, it's not guaranteed that we can match all of them at once. The over-all discrepancy $$\| \tilde{f}(\theta) - m \|$$ should still shrink as we get more data if the model is right. Under some circumstances, one can work out analytically what the distribution of that discrepancy should be under the model (e.g., $$\chi^2$$ or related), but you can also just simulate.
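Here is a sketch in Python of "just simulating" that null distribution (the exponential model and the parametric-bootstrap details are my illustrative choices): fit a one-parameter exponential by matching the mean, treat the variance as the over-identifying moment, and compare the observed discrepancy to discrepancies from data simulated under the fitted model.

```python
import random
import statistics

def fit_and_discrepancy(sample):
    """Fit Exponential(rate) by matching the mean (just-identified), then
    measure how far the over-identifying moment, the variance, sits from
    its implied value 1/rate^2."""
    rate = 1 / statistics.fmean(sample)
    return abs(statistics.pvariance(sample) - 1 / rate ** 2)

rng = random.Random(3)
data = [rng.expovariate(1.5) for _ in range(500)]
d_obs = fit_and_discrepancy(data)

# Parametric bootstrap: simulate the fitted model to get the discrepancy's
# null distribution, then see where the observed discrepancy falls.
rate_hat = 1 / statistics.fmean(data)
null = [fit_and_discrepancy([rng.expovariate(rate_hat) for _ in range(500)])
        for _ in range(400)]
p_value = sum(d >= d_obs for d in null) / len(null)
```

Since the toy data really do come from the model, the p-value should usually be unremarkable; a tiny p-value on real data would be evidence against the model, not just the parameter value.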

Breaking models: If you want to know how robust some model prediction is, you can try to deliberately search for parameters which will not make that prediction. Again, out of space I won't go into details, but see Miller's paper. This isn't, strictly speaking, about whether the model fits the data, though it could be adapted to that.


Posted by crshalizi at November 25, 2013 10:30 | permanent link

November 22, 2013

Homework: Several Hundred Degrees of Separation (Introduction to Statistical Computing)

Homework 10: in which we refine our web-crawler from the previous assignment, by way of further working with regular expressions, and improving our estimates of page-rank.

(This assignment ripped off from Vince Vu, with permission.)

Posted by crshalizi at November 22, 2013 11:30 | permanent link

Lab: Baseball Salaries (Introduction to Statistical Computing)

In which America's true pastime proves to be wrestling with relational databases. Assignment, database (large!)

Posted by crshalizi at November 22, 2013 10:30 | permanent link

November 20, 2013

Databases (Introduction to Statistical Computing)

Lecture 25: The idea of a relational database. Tables, fields, keys, normalization. Server-client model. Example of working with a database server. Intro to SQL, especially SELECT. Aggregation in databases is like split/apply/combine. Joining tables: what it is and how to do it. Examples of joinery. Accessing databases from R with the DBI package.
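A minimal, self-contained illustration of SELECT, joining, and aggregation-as-split/apply/combine, using Python's built-in sqlite3 instead of a database server (the toy tables only imitate the shape of the baseball database; the names and numbers are made up):

```python
import sqlite3

# An in-memory toy database: one table of players, one of salaries,
# linked by the key player_id (illustrative schema, not the real one).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE players (player_id INTEGER PRIMARY KEY, name TEXT, team TEXT);
CREATE TABLE salaries (player_id INTEGER, year INTEGER, salary REAL);
INSERT INTO players VALUES (1, 'Ruth', 'NYY'), (2, 'Gehrig', 'NYY'), (3, 'Cobb', 'DET');
INSERT INTO salaries VALUES (1, 1927, 70000), (2, 1927, 8000), (3, 1927, 50000);
""")

# Join the two tables on the key, then aggregate: split by team, apply
# AVG, combine into one row per team --- split/apply/combine in SQL.
rows = con.execute("""
    SELECT p.team, AVG(s.salary) AS mean_salary
    FROM players AS p JOIN salaries AS s ON p.player_id = s.player_id
    GROUP BY p.team ORDER BY p.team
""").fetchall()
```

From R, the DBI package sends the same SQL strings to the server and returns the result as a data frame.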

Example database file (30MB): baseball.db

Readings: Spector, chapter 3

Posted by crshalizi at November 20, 2013 10:30 | permanent link

November 18, 2013

Change of Representation (Introduction to Statistical Computing)

(My notes for this lecture are too fragmentary to post. What follows is the sketch.)

The "raw data" is often not in the format most useful for the model one wants to work with. Lots of statistical computing work is about moving the information from one format to another --- about changing representations. Lossless transformations vs. lossy; why we often want lossy transformations. Re-organizing data to group it properly. (Example: going from multi-dimensional arrays to 2D data-frames and vice versa.) Aggregation as a change of representation. (Example: Going from dates of adoption for each doctor to cumulative proportion of adopters.)

Text processing via change of representation: the bag-of-words ("vector space") representation. Cosine and Jaccard similarities. Term frequency-inverse document frequency. Document clustering and classification.
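A minimal sketch in Python of the bag-of-words representation and cosine similarity (the example documents are mine): each document becomes a vector of word counts, and similarity is the cosine of the angle between the vectors.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Represent a document as word counts, throwing away word order."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors: 1 means the
    same direction in 'word space', 0 means no words in common."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

d1 = bag_of_words("the cat sat on the mat")
d2 = bag_of_words("the cat ate the rat")
d3 = bag_of_words("stochastic gradient descent")
```

Weighting the counts by inverse document frequency (tf-idf) before taking cosines keeps common words like "the" from dominating the similarity.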

Readings: Spector, chapters 8 and 9.

On text: Lectures 1, 2 and 4 (+ slides) from data mining (vintage 2009).

Posted by crshalizi at November 18, 2013 10:30 | permanent link

November 15, 2013

Homework: A Maze of Twisty Little Passages (Introduction to Statistical Computing)

Homework 10: In which we build a little web-crawler to calculate page-rank (the hard way), so as to practice working with text, regular expressions, and Markov chains. Supplied code, which may or may not contain deliberate bugs.

(This assignment ripped off from Vince Vu, with permission.)

Posted by crshalizi at November 15, 2013 11:30 | permanent link

Lab: Scrape the Rich (Introduction to Statistical Computing)

In which we practice extracting data from text, to learn about our betters.

(This assignment ripped off from Vince Vu, with permission.)

Posted by crshalizi at November 15, 2013 10:30 | permanent link
