Attention conservation notice: Academic statistico-algorithmic navel-gazing.
With the grading done, but grades not yet posted while we wait for the students to fill out faculty evaluations, it's time to reflect on the class just finished. (Since this is the third time I've done a post like this, I guess it's now one of my traditions.)
Overall, it went a lot better than my worst fears, especially considering this was the first time the class was offered. There was a lot of attrition initially, both from students who had taken a lot of programming, and from students who had done no programming at all. (I was truly surprised by how many students had never used a command-line before.) The ones who stuck around all (I think) learned a lot --- more for those who knew less about programming to start with, naturally. Most of the credit for this goes to Vince, naturally.
Some stuff didn't work well:
Stuff that worked well:
Stuff I'd try to do next time:
Over-all assessment: B; promising, but with definite areas for improvement.
Obligatory disclaimer: Don't blame Vince, or anyone else, for what I say here.
Posted by crshalizi at December 20, 2011 09:35 | permanent link
Posted by crshalizi at December 18, 2011 16:35 | permanent link
Lecture 26: Aggregation in databases is like split/apply/combine. Joining tables: what it is and how to do it. Examples of joinery. Accessing databases from R with the DBI package.
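The course did this from R through the DBI package; purely as an illustration, here is the same pattern in Python's built-in sqlite3, with a made-up two-table schema (the names and grades are invented). The JOIN matches rows across tables on a key, and GROUP BY is the split/apply/combine of the database world:

```python
import sqlite3

# Hypothetical schema, standing in for whatever the class examples used.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
cur.execute("CREATE TABLE grades (student_id INTEGER, course TEXT, grade REAL)")
cur.executemany("INSERT INTO students VALUES (?, ?, ?)",
                [(1, "Ada", "stat"), (2, "Grace", "cs"), (3, "Emmy", "stat")])
cur.executemany("INSERT INTO grades VALUES (?, ?, ?)",
                [(1, "36-350", 4.0), (2, "36-350", 3.7), (3, "36-350", 4.0),
                 (1, "36-401", 3.3)])

# Joining: match each grade to its student's row by the key.
rows = cur.execute("""
    SELECT s.name, g.course, g.grade
    FROM students s JOIN grades g ON s.id = g.student_id
""").fetchall()

# Aggregation: split by department, apply AVG, combine into one row per group.
means = cur.execute("""
    SELECT s.dept, AVG(g.grade)
    FROM students s JOIN grades g ON s.id = g.student_id
    GROUP BY s.dept
""").fetchall()
print(means)
```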
Posted by crshalizi at December 18, 2011 16:34 | permanent link
Lecture 25: The idea of a relational database. Tables, fields, keys, normalization. Server-client model. Example of working with a database server. Intro to SQL, especially SELECT.
Posted by crshalizi at December 18, 2011 16:33 | permanent link
Posted by crshalizi at December 18, 2011 16:32 | permanent link
Posted by crshalizi at December 18, 2011 16:31 | permanent link
Lecture 23: Importing data from webpages. Example: scraping weblinks. Using regular expressions again (with multiple capture groups). Example: how long does a random surfer take to get to Facebook? Exception handling.
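The lecture's examples were in R; as a rough Python equivalent, here is a regex with two capture groups pulling links out of a made-up scrap of HTML, plus the try/except idiom for surviving failed fetches rather than letting one bad URL kill the whole crawl:

```python
import re

# Hypothetical snippet of HTML, standing in for a fetched web page:
html = '<a href="http://example.com/one">One</a> <a href="http://example.com/two">Two</a>'

# A regex with two capture groups: one for the URL, one for the link text.
link_pattern = re.compile(r'<a href="([^"]+)">([^<]+)</a>')
links = link_pattern.findall(html)  # list of (url, text) pairs
print(links)

# Exception handling: a fetch can fail, so catch the error and move on.
def safe_fetch(url):
    try:
        raise IOError("pretend the network is down")  # stand-in for a real urlopen(url)
    except IOError:
        return None  # a failed fetch becomes missing data, not a crash

print(safe_fetch("http://example.com/three"))  # None
```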
Posted by crshalizi at December 18, 2011 16:30 | permanent link
One of the final projects was to build first- and second-order Markov models based on the text of Heart of Darkness. I present their last slide:
(Whatever merit this might have is due to the students: Jason Capehart, Seung Su Han, Alexander Murray-Watters, and Elizabeth Silver.)
Update, 18 December: Of course, what I should have titled this post is "I'm now becoming my own self-fulfilled prophecy". (I'm really not very good at quotation-capping.)
Posted by crshalizi at December 07, 2011 11:56 | permanent link
Attention conservation notice: I have no taste.
The Commonwealth of Letters; Scientifiction and Fantastica; Writing for Antiquity; The Progressive Forces; Philosophy; Natural Science of the Human Species; The Collective Use and Evolution of Concepts; Commit a Social Science; Linkage; The Beloved Republic
Posted by crshalizi at November 30, 2011 23:59 | permanent link
Attention conservation notice: Only of interest if you (1) do statistical computing and (2) will be in Pittsburgh on Monday.
As always, the talk is free and open to the public. R groupies should however contain themselves while Prof. Wickham is speaking.
Posted by crshalizi at November 30, 2011 15:00 | permanent link
Attention conservation notice: 1000+ words on the limits of welfare economics, in the form of a thought experiment or parable superficially tuned to the holiday (and brooding on my hard-disk for months). Gloomy, snarky, heavy-handed, academic, and obvious to anyone who knows enough about the subject to care. Have you no friends and family to whom you should be showing your love (perhaps in the form of food)?
Let us consider a simple economy with three individuals. Alice is a restaurateur; she has fed herself, and has just prepared a delicious turkey dinner, at some cost in materials, fuel, and her time.
Dives is a wealthy conceptual artist [1], who has eaten and is not hungry, but would like to buy the turkey dinner so he can "feed" it to the transparent machine he has built, and film it being "digested" and eventually excreted [2]. To achieve this, he is willing and able to spend up to $5000. Dives does not care, at all, about what happens to anyone else; indeed, as an exponent of art for art's sake, he does not even care whether his film will have an audience.
Huddled miserably in a corner of the gate of Dives's condo is Lazarus, who is starving, on the brink of death, but could be kept alive for another day by eating the turkey. The sum total of Lazarus's worldly possessions consists of filthy rags, of no value to anyone else, and one thin dime. Since, however, he is starving, there is no amount of money which could persuade Lazarus to part with the turkey, should he gain possession of it.
Assume that everyone is a rational agent, with these resources and preferences. What does economics tell us about this situation?
First, whatever Alice has spent preparing the turkey is a sunk cost, and irrelevant to deciding what to do next.
Second, Alice would be better off selling the turkey to either Dives or Lazarus than keeping it for herself, and either trade would also benefit the buyer, so that's a win-win. Either trade would be Pareto-improving. However, neither trade is strictly better for everyone than the other: if she sells to Lazarus, Dives is disappointed, and if she sells to Dives, Lazarus starves. Of course, if we are being exact, Lazarus starves to death whether Alice keeps the turkey or sells it to Dives, so that trade makes Lazarus no worse off.
Third, Lazarus can only offer ten cents. Since Dives would be willing to spend up to $5000, Alice will prefer to sell to Dives. Since Dives, being a rational agent, knows how much Lazarus can pay, he will offer 11 cents, which Alice will accept as the superior offer. (Alternately, we add in a Walrasian auctioneer, and reach this price by tatonnement.) [Update: See below.] The market clears, Alice is 11 cents better off, Dives enjoys a consumer surplus of $4999.89, and Lazarus starves to death in the street, clutching his dime. Nothing can be changed without making someone worse off, so this is Pareto optimal.
And so, in yet another triumph, the market mechanism has allocated a scarce resource, viz., the turkey, to its most efficient use, viz., being turned into artificial shit. What makes this the most efficient use of the scarce resource? Why, simply that it goes to the user who will pay the highest price for it. This is all that economic efficiency amounts to. It is not about meeting demand, but meeting effective demand, demand backed by purchasing power.
(Incidentally, nothing in this hinges on some failure of perfect competition arising from having only three agents in the market. If we had another copy of Alice, another copy of Dives, and another copy of Lazarus, both Alices will sell their turkeys to the Diveses, and both Lazaruses will starve. By induction, then, the same allocation will be replicated for any finite number of Alices, Diveses, and Lazaruses, so long as there are at least as many Diveses as there are Alices.)
You may be refusing to take this seriously, objecting that I have loaded the rhetorical deck pretty blatantly --- and I have! (Though not more than is customary in teaching economics.) But this is the core of Amartya Sen's model of famines, which grows from the observation that food is often exported, at a profit, from famine-stricken regions in which people are dying of hunger. This occurs not just in cases like the USSR in the 1930s, but in impeccably capitalist situations, like British India. This happens, as Sen shows, because the hungry, while they have a very great need for food, do not have the money to buy it, or, more precisely, people elsewhere will pay more. It is thus not economically efficient to feed the hungry, so the market starves them to death.
I do not, however, want to end this on a completely gloomy note. As Sen said, the same market would feed the hungry if they could afford it, so the way to combat famines is to make sure they have money or paying work or both. (If in this country we don't have to worry about famine, it's because we've arranged things so that most of us do have those resources; we still have a hunger problem because our arrangements are imperfect.) The larger point is that while what is technologically efficient depends on facts of nature, what is economically efficient is a function of our social arrangements, of who owns how much of what. Economic efficiency may be a good tool, but it is perverse to serve your own tools, and monstrous to be ruled by them. Let us be thankful for the extent to which we escape perversion and monstrosity.
Update, 27 November: Yes, I was presuming an ascending-price auction to get a price of 11 cents. If the auctioneer uses a descending-price auction, Alice could extract up to $5000 from Dives, driving his consumer surplus to zero; Lazarus, of course, starves at any price which clears the market. No, I did not say (and do not think) that we should abolish the market and replace it with a National Turkey Allocation Board. No, Dives having orders of magnitude more money than Lazarus is not essential; Dives just needs to be willing and able to spend 11 cents.
Also, further to the theme of delicious food and the invisible hand.
Manual trackback: Quomodocumque, MetaFilter; The Edge of the American West; The Browser; Aluation; Nanopolitan; Crooked Timber; I Got Here on My Bike; Oook; Siris; Slacktivist; Wolfgang Beirl; Andrew Gelman
1: It's a thought experiment.
2: I actually saw such a machine at the modern art museum in Lyon in 2003, fed in turn by the city's leading restaurants, but I cannot now remember the artist's name. Perhaps this is just as well. Update: Cris Moore, with whom I saw it, reminds me that the work in question was "Cloaca", by Wim Delvoye.
Posted by crshalizi at November 24, 2011 10:48 | permanent link
Someone, somewhere, has assembled a fairly reliable, comprehensive and machine-readable data set on contentious politics in the United States over the 20th century, or some large part of it. A detailed event catalog would be ideal, but I would settle for an annual index-number time series if need be. Who has done this, where are the results, and how can I get them? Leads will be rewarded with acknowledgments and/or citations, as appropriate.
In the meanwhile:
Posted by crshalizi at November 22, 2011 14:00 | permanent link
Attention conservation notice: Puffery about a paper in statistical learning theory. Again, if you care, why not let the referees sort it out, and check back later?
Now this one I really do not have time to expound on (see: talk in Chicago on Thursday), but since it is related to the subject of that talk, I tell myself that doing so will help me with my patter.
VC dimension is one of the many ways of measuring how flexible a class of models is, or its capacity to match data. Specifically, it is the largest number of data-points which the class can always (seem to) match perfectly, no matter how the observations turn out, by judiciously picking one model or another from the class. It is called a "dimension" because, through some clever combinatorics, this turns out to control the rate at which the number of distinguishable models grows with the number of observations, just as Euclidean dimension governs the rate at which the measure of a geometrical body grows as its length expands. Knowing the number of effectively-distinct models, in turn, tells us about over-fitting. The true risk of a fitted model will generally be higher than its in-sample risk, precisely because it was fitted to the data and so tuned to take advantage of noise. High-capacity model classes can do more such tuning. One can actually bound the true risk in terms of the in-sample risk and the effective number of models, and so in terms of the VC dimension.
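To make the definition concrete — this is an illustrative toy of mine, not anything from the paper — here is a brute-force shattering check: thresholds on the line shatter any one point but no two (VC dimension 1), while intervals shatter two points but not three (VC dimension 2):

```python
# A class shatters a set of points if every +/- labelling of them is
# achieved by some classifier in the class; the VC dimension is the size
# of the largest shatterable set.
def shatters(points, classifiers):
    labellings = {tuple(f(x) for x in points) for f in classifiers}
    return len(labellings) == 2 ** len(points)

# Class 1: thresholds on the line, h_t(x) = 1 if x >= t, else 0.
def threshold(t):
    return lambda x: 1 if x >= t else 0

# Class 2: intervals, h_{a,b}(x) = 1 if a <= x <= b, else 0.
def interval(a, b):
    return lambda x: 1 if a <= x <= b else 0

grid = [i / 2 for i in range(-4, 9)]  # enough parameter values for these points
thresholds = [threshold(t) for t in grid]
intervals = [interval(a, b) for a in grid for b in grid]

print(shatters([1.0], thresholds))            # a single point: shattered
print(shatters([1.0, 2.0], thresholds))       # (+,-) unachievable: not shattered
print(shatters([1.0, 2.0], intervals))        # shattered
print(shatters([1.0, 2.0, 3.0], intervals))   # (+,-,+) unachievable: not shattered
```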
How then does one find the VC dimension? Well, the traditional route was through yet more clever combinatorics. As someone who has never quite gotten the point of the birthday problem, I find this unappealing, especially when the models are awkward and fractious, as the interesting ones generally are.
An alternative, due to Vapnik, Levin and LeCun, is to replace math with experiment. Roughly, the idea is this: make up simulated data, fit the model class to it, see how variable the fit is from run to run, and then plug this average discrepancy into a formula relating it to the VC dimension and the simulated sample size. Simulating at a couple of sample sizes and doing some nonlinear least squares then yields an estimate of the VC dimension, which is consistent in the right limits. (If you really want details, see the papers.)
The problem with the experimental approach is that it doesn't tell you how to use the estimated VC dimension in a risk bound, which is after all what you want it for. The estimate, after all, is not perfectly precise, and how is one to account for that imprecision in the bound?
This turns out to be an eminently solvable problem. One can use the estimated VC dimension, plus a comparatively-small-and-shrinking safety margin, and plug it into the usual risk bound, with just a small hit to the confidence level. Showing this qualitatively relies on the results in van de Geer's Empirical Processes in M-Estimation, which, pleasingly, was one of the first books I read on statistical learning theory lo these many years ago. Less pleasingly, getting everything needed for a bound we can calculate (and weakening some assumptions) meant re-proving many of those results, excavating awkward constants previously buried in big C's and K's and little oP's.
In the end, however, all is as one would hope: estimated VC dimension concentrates (at a calculable rate) around the true VC dimension, and the ultimate risk bound is very close to the standard one. As someone who likes learning theory and experimental mathematics, I find this pleasing. It is also a step in a Cunning Plan, which I will not belabor, as it will be obvious to readers who go all the way through the paper.
Update, 22 November: The title is a catch-phrase of my mother's; I believe she got it from one of her biochemistry teachers.
Posted by crshalizi at November 15, 2011 21:25 | permanent link
Attention conservation notice: The only thing more pathetic than a writer whining about editorial decisions is a writer whining about negative reviews and being misunderstood. Also, nothing which is both so geeky and so careless as to begin with a mis-quotation of Monty Python can end well.
So, Henry Farrell and I have an opinion piece in New Scientist about how the "libertarian paternalism" of Sunstein and Thaler, and policy-making by "nudging" more generally, are Bad Ideas. The reason we think they are Bad Ideas is that they try to do good by stealth, and thereby break the feedback mechanisms which (1) keep policy-makers accountable to those over whom they exercise power, and (2) allow policy-makers to tell whether what they are doing is working, and revise their initial policies and plans in light of experience. (And by this we very much include the experience of getting something you think you want, and discovering that it is no good for you at all.) Granting the best will in the world on the part of the nudgers, it is putting a very high value on one's own conjectures to deliberately break the most important mechanism for improving them.
In short, I thought we were making a Popperian point about how democracy is best understood not in terms of "the people's will" or the like, but accountability and rational policy revision. I also thought we were making a Popperian point about the dangers of top-down social engineering. Indeed, I was strongly tempted to quote chapter (10 and 9, respectively) and verse from The Open Society and Its Enemies for both points, but the constraints of space, and of not sounding like complete pedants, prevailed. It would, I thought, be tolerably plain what our objections were.
I had not counted on two things. First, we were, evidently, nowhere near as clear in our writing as I thought. (I take full blame for this.) Second, whoever is in charge of such matters at New Scientist gave us the headline "Nudge Policies Are Another Name for Coercion". This was so far from being our objection that we rather deliberately did not use the word "coercion" (or "coerce", etc.) at all. Everyone who is not a complete anarchist, after all, believes that some coercion is legitimate, and so the question is what sorts, to what ends, under what conditions, etc. And I regard the usual right-libertarian attempt to claim that deploying coercion only and always in favor of the interests of the rich is somehow minimizing it as simply confused, when it is not deliberate sophistry. (As Sunstein put it in a good book with Stephen Holmes, "liberty depends on taxes".) It is, I suppose, a testimony to the hegemony of right-wing ideas that when we said something which amounted to a paraphrase and dilution of the third thesis on Feuerbach, the headline writer heard Milton Friedman, or perhaps Ayn Rand. This did not help get our point across, and the only two responses I've seen which obviously got it are two comments at Crooked Timber, by Scott Martens and by Salient.
I would also like to add that I had no idea New Scientist would syndicate our piece to Slate. The latter changed the headline to the less actively-misleading "Nudge No More", but added a gratuitous cheesecake photo, and provoked some ribbing on the part of friends who recalled my stated views about the magazine. Those views, for the record, remain unchanged; if anything, I am disturbed that Slate thought we fit either their editorial line or their tone. I console myself with thoughts of Dahlia Lithwick and Jordan Ellenberg.
Disclaimer: Henry is not responsible for this post.
Posted by crshalizi at November 14, 2011 23:30 | permanent link
Attention conservation notice: Puffery about a new manuscript on the statistical theory of some mathematical models of networks. In the staggeringly unlikely event this is actually of interest to you, why not check back later, and see if peer review has exposed it all as a tissue of fallacies?
A new paper, which I flatter myself is of some interest to those who care about network models, or exponential families, or especially about exponential families of network models:
Obligatory Disclaimer: Ale didn't approve this post.
This started because Ale and I shared an interest in exponential family random graph models (ERGMs), whose basic idea is sheer elegance in its simplicity. You want to establish some distribution over graphs or networks; you decree some set of functions of the graph to be the sufficient statistics; and then you make the log probability of any given graph proportional to a weighted sum of these statistics. The weights are the parameters, and this is an exponential family [1]. They inherit all of the wonderfully convenient mathematical and statistical properties of exponential families in general, e.g., finding the maximum likelihood estimator by equating expected and observed values of the sufficient statistics. (This is also the maximum entropy distribution, though I set little store by that.) They are also, with judicious choices of the statistics, quite spiffy-looking network models. This paper by Goodreau et al., for instance, is exemplary in using them to investigate teenage friendship networks and what they can tell us about general social mechanisms, and deserves a post of its own. (Indeed, a half-written post sits in my drafts folder.) This is probably the best class of statistical models of networks now going, which I have happily taught and recommended to students, with a special push for statnet.
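To make this concrete, a toy of my own devising (not from the paper): an ERGM on three nodes whose only sufficient statistic is the edge count, small enough to enumerate the normalizing constant exactly. With an observed graph having 2 of the 3 possible edges, the MLE equates expected and observed edge counts, giving theta = log 2:

```python
from itertools import product
from math import exp, log

# Toy ERGM: P(g) proportional to exp(theta * edges(g)), over the 2^3 graphs
# on three nodes (each graph = presence/absence of the 3 possible dyads).
graphs = list(product([0, 1], repeat=3))

def prob(g, theta):
    z = sum(exp(theta * sum(h)) for h in graphs)  # exact normalizing constant
    return exp(theta * sum(g)) / z

def expected_edges(theta):
    return sum(sum(g) * prob(g, theta) for g in graphs)

# MLE sets expected = observed sufficient statistic. Observing 2 of 3 edges,
# solve 3 * e^theta / (1 + e^theta) = 2, i.e. theta = log 2.
theta_hat = log(2.0)
print(round(expected_edges(theta_hat), 6))  # 2.0
```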
What Ale and I wanted to do was to find conditions under which maximum likelihood estimation would be consistent --- when we saw more and more data from the same source, our estimates of the parameters would come closer and closer to each other, and to the truth. The consistency of maximum likelihood estimates for independent observations is classical, but networks, of course, are full of dependent data. People have proved the consistency of maximum likelihood for some kinds of models of time series and of spatial data, but those proofs (at least the ones we know) mostly turned on ordering or screening-off properties of time and space, lacking in arbitrary graphs. Those which didn't turned on the "blocking" trick, where one argues that widely-separated events are nearly independent, and so approximates the dependent data by independent surrogates, plus weak corrections. This can work with random fields on networks, as in this excellent paper by Xiang and Neville, but it doesn't seem to work for models of networks, where distance itself is endogenous.
I remember very distinctly sitting in Ale's office on a sunny October afternoon just over a year ago [2], trying to find some way of making the blocking trick work, when it occurred to us that maybe the reason we couldn't show that estimates converged as we got more and more data from the same ERGM was that the very idea of "more and more data from the same ERGM" did not, in general, make sense. What exactly prompted this thought I do not recall, though I dare say the fact that we had both recently read Lauritzen's book on sufficiency, with its emphasis on repetitive and projective structures, had something to do with it.
The basic point is this. Suppose we observe a social network among (say) a sample of 500 students at a high school, but know there are 2000 students in all. We might think that the whole network should be described by some ERGM or other. How, however, are we to estimate it from the mere sample? Any graph for the whole network implies a graph for the sampled 500 students, so the toilsome and infeasible, but correct, approach would be to enumerate all whole-network graphs compatible with the observed sample graph, and take the likelihood to be the sum of their probabilities in the whole-network ERGM. (If you do not strictly know how large the whole network is, then I believe you are strictly out of luck.) This is not, of course, what people actually do. Rather, guided by experience with problems of survey sampling, regression, time series, etc., they have assumed that the same ERGM, with the same sufficient statistics and the same parameter values, applies to both the whole network and to the sample. They have assumed, in other words, that the ERGMs are projectible: that the model for the whole network projects down to a model of the same form, with the same parameters, for the sub-network.
Once you recognize this, it turns out to be straightforward to show that projectibility imposes very strong restrictions on the sufficient statistics --- they have to obey a condition about how they "add up" across sub-graphs which we called [3] "having separable increments". This condition is "physically" reasonable but not automatic, and I will not attempt to write it out in HTML. (Read the paper!) Conversely, so long as the statistics have such "separable increments", the exponential family is projectible. (Pinning down the converse was the tricky bit.) Once we have this, conditions for consistency of maximum likelihood turn out to be straightforward, as all the stuff about projectibility implies that the change to the statistics when adding new data must be unpredictable from the old data. The sufficient statistics themselves form a stochastic process with independent increments, something for which there is a lot of convergence theory. (This does not mean the data must be independent, as we show by example.) All of these results prove to be perfectly general facts about exponential families of dependent variables, with no special connection to networks.
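The paper states the condition precisely; purely as a loose numerical illustration of its flavor (my construction, so take the details with salt): when new nodes arrive, the change in the edge count depends only on the edges touching the new nodes, but the change in the triangle count also depends on what the old sub-network looked like:

```python
from itertools import combinations

def edge_count(nodes, edges):
    return len(edges)

def triangle_count(nodes, edges):
    e = set(frozenset(p) for p in edges)
    return sum(1 for tri in combinations(nodes, 3)
               if all(frozenset(p) in e for p in combinations(tri, 2)))

old_nodes, new_nodes = [0, 1, 2], [3, 4]
new_edges = [(0, 3), (1, 3), (3, 4)]  # edges touching the new nodes: same in both

# Two networks agreeing on everything that involves the new nodes,
# differing only in the old sub-network: G1 has the edge (0,1), G2 does not.
g1_old, g2_old = [(0, 1)], []

def increment(stat, old_edges):
    whole = stat(old_nodes + new_nodes, old_edges + new_edges)
    sub = stat(old_nodes, old_edges)
    return whole - sub

# Edge-count increments agree (both 3): the statistic "adds up" separably.
print(increment(edge_count, g1_old), increment(edge_count, g2_old))
# Triangle-count increments differ (1 vs. 0): the old edge (0,1) completes a
# triangle with the new node 3, so the increment depends on the old data.
print(increment(triangle_count, g1_old), increment(triangle_count, g2_old))
```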
The punch-line, though, is that the most commonly used specifications for ERGMs all include — for good reasons! — statistics which break projectibility. Models with "dyadic independence", including the models implicit or explicit in a lot of community discovery work, turn out to be spared. Anything more sophisticated, however, has got a very real, though admittedly somewhat subtle, mathematical pathology. Consistency of estimation doesn't even make sense, because there is no consistency under sampling.
We have some thoughts on where this leaves statistical models of networks, and especially about how to actually move forward constructively, but I will let you read about them in the paper.
Update, next day: fixed typos, clarified a sentence and added a reference.
1: Or if, like me, you were brought up in statistical mechanics, a Boltzmann-Gibbs ensemble, with the statistics being the extensive thermodynamic variables (think "volume" or "number of oxygen molecules"), and the parameters their conjugate intensive variables (think "pressure" or "chemical potential of oxygen"). If this line of thought intrigues you, read Mandelbrot.
2: With merely a year between the idea and the submission, this project went forward with what is, for me, unseemly haste.
3: We couldn't find a name for the property the statistics needed to have, so we made one up. If you have encountered it before, please let me know.
Posted by crshalizi at November 14, 2011 22:00 | permanent link
Posted by crshalizi at November 14, 2011 10:31 | permanent link
Posted by crshalizi at November 14, 2011 10:30 | permanent link
Today seems like a good day to propose setting up a stela on the National Mall in Washington, with a celebratory inscription in Maya and a Long Count date, to be inaugurated on 21 December 2012, or rather 13.0.0.0.0.
Posted by crshalizi at November 11, 2011 11:11 | permanent link
Posted by crshalizi at November 11, 2011 10:38 | permanent link
Lecture 21: Regular expressions are descriptions of patterns. Why we want to use them. Search, search and replace.
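The lecture's code was in R (think grepl and gsub); here are the same ideas in Python's re module, run on an invented string of phone-number-shaped data:

```python
import re

# Made-up text with a pattern worth describing: ddd-ddd-dddd.
text = "call 412-555-0123 or 412-555-0199"

# Search: does the pattern occur anywhere in the string?
print(bool(re.search(r"\d{3}-\d{3}-\d{4}", text)))  # True

# Search and replace: swap every match for a placeholder.
print(re.sub(r"\d{3}-\d{3}-\d{4}", "[number]", text))  # call [number] or [number]
```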
Posted by crshalizi at November 11, 2011 10:37 | permanent link
Lecture 20: Overview of character data. Basic string operations: extract and concatenate.
Posted by crshalizi at November 11, 2011 10:36 | permanent link
Posted by crshalizi at November 11, 2011 10:35 | permanent link
Posted by crshalizi at November 11, 2011 10:34 | permanent link
Lecture 19: Mixing times and correlation time. Continuous-valued Markov processes. The Metropolis algorithm for Markov chain Monte Carlo.
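For concreteness, a minimal Metropolis sampler in Python (the course's code was in R, and the target and tuning here are my own illustrative choices): sample from a standard Gaussian known only up to its normalizing constant, then check the time averages against the known mean and variance:

```python
import random
from math import exp

def metropolis(log_target, x0, proposal_sd, n_steps, rng):
    """Random-walk Metropolis: propose a Gaussian step, accept with
    probability min(1, target(proposal)/target(current))."""
    x, samples = x0, []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0, proposal_sd)
        if rng.random() < exp(min(0.0, log_target(proposal) - log_target(x))):
            x = proposal
        samples.append(x)
    return samples

rng = random.Random(42)
# Log-density of N(0,1), up to an additive constant: -x^2/2.
samples = metropolis(lambda x: -x * x / 2, x0=0.0, proposal_sd=1.0,
                     n_steps=20000, rng=rng)
burned = samples[2000:]  # discard burn-in before taking time averages
mean = sum(burned) / len(burned)
var = sum((s - mean) ** 2 for s in burned) / len(burned)
print(round(mean, 2), round(var, 2))  # near 0 and 1, up to Monte Carlo error
```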
Posted by crshalizi at November 11, 2011 10:33 | permanent link
Posted by crshalizi at November 11, 2011 10:32 | permanent link
Posted by crshalizi at November 11, 2011 10:31 | permanent link
Lecture 18: the Monte Carlo method for numerical integration; Monte Carlo for expectation values; importance sampling. Markov chains: definition, the roots of the Markov property; asymptotics of Markov chains via linear algebra; Markov chains and graphs; the law of large numbers (ergodic theorem) for Markov chains.
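As a sketch of the first two ideas — again in Python rather than the course's R, with targets chosen by me so the answers can be checked exactly — plain Monte Carlo for a simple integral, then importance sampling with a wider proposal distribution:

```python
import random
from math import exp, pi, sqrt

rng = random.Random(7)
n = 100000

# Plain Monte Carlo: estimate an integral as an average over uniform draws.
# Here, the integral of x^2 over [0,1], whose true value is 1/3.
plain = sum(rng.random() ** 2 for _ in range(n)) / n
print(round(plain, 3))  # close to 1/3

# Importance sampling: estimate E_p[h(X)] by sampling from q and weighting
# each draw by p/q. Here p = N(0,1), q = N(0,4), h(x) = x^2, so the true
# answer is E[X^2] = 1 under the standard Gaussian.
def normal_pdf(x, sd):
    return exp(-x * x / (2 * sd * sd)) / (sd * sqrt(2 * pi))

draws = [rng.gauss(0, 2) for _ in range(n)]
weights = [normal_pdf(x, 1) / normal_pdf(x, 2) for x in draws]
is_est = sum(w * x * x for w, x in zip(weights, draws)) / n
print(round(is_est, 3))  # close to 1
```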
Posted by crshalizi at November 11, 2011 10:30 | permanent link
I mentioned trips for upcoming talks, didn't I?
The Columbia talk is free and open to the public. I will be disillusioned unless Chicago not only charges admission, but uses a carefully optimized scheme of price discrimination.
Posted by crshalizi at November 02, 2011 10:30 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Enigmas of Chance; Scientifiction and Fantastica; Statistical Computing; The Dismal Science; Physics; Kith and Kin; Cthulhiana; The Beloved Republic; Philosophy; The Natural Science of the Human Species; Complexity; Mathematics
Posted by crshalizi at October 31, 2011 23:59 | permanent link
By now, you have probably heard about how the Washington Post decided to illustrate a news story about the Oakland police using tear gas on peaceful demonstrators, breaking skulls, etc., with a picture of a police officer "pet[ting] a cat that was left behind by protestors". There is now an Oakland Riot Cat tumblr, naturally (this is my favorite so far), and I wouldn't be surprised if the animal becomes a minor icon of the movement — and Scott Olsen, the veteran of our war in Iraq who got his head cracked open by the police, becomes a major icon.
What I keep thinking though, is that somebody in Oakland must be very upset not just at having been assaulted by the cops while exercising their rights, but at losing their cat at the same time. How is that cat ever going to get home? Is anybody even trying to get it back to its owner?
Posted by crshalizi at October 28, 2011 23:00 | permanent link
Between now and mid-December, I have a class to teach, a grant proposal to fabricate, and four trips to take and five talks to give. As for manuscripts to referee, letters of recommendation to write, and papers to finish, it would be futile to try counting them; only mass nouns are appropriate. Since it is unlikely that you will see much here other than teaching materials for the next few months, look elsewhere:
(These are some of what I happen to have been reading recently. I should really update my blog-roll, apparently last touched in 2006.)
Posted by crshalizi at October 27, 2011 23:55 | permanent link
Attention conservation notice: 4600 words on a legal ruling in another country, from someone who knows nothing about the law even in his own country. Contains many long quotations from the ruling, plus unexplained statistical jargon; written while trapped in an airport trying to get home, and so probably excessively peevish.
Back at the beginning of the month, as constant readers will recall, there was a bit of a kerfuffle over newspaper reports — starting with this story in the Guardian, by one Angela Saini — to the effect that a judge had ruled the application of Bayes's rule was inadmissible in British courts. This included much wailing and gnashing of teeth over the innumeracy of lawyers and the courts, anti-scientific obscurantism and injustice, etc., etc. At the time, I was skeptical that anything like this had actually happened, but had no better information than the newspaper reports themselves. A reader kindly sent me a copy of the judgment by the court of appeals [PDF], and US Airlines kindly provided me with time to read it.
To sum up what follows, the news reports were thoroughly misleading: the issue in the case was the use not of Bayes's rule but of likelihood ratios; the panel of three judges (not one judge) affirmed existing law, rather than making new law; the existing law allows for the use of likelihood ratios and of Bayes's theorem when appropriate; and the court gave sound reasons for thinking that their use in cases like this one would be mere pseudo-science. We are, then, listening to the call-in show on Radio Yerevan:
Question to Radio Yerevan: Is it correct that Grigori Grigorievich Grigoriev won a luxury car at the All-Union Championship in Moscow?
Answer: In principle, yes. But first of all it was not Grigori Grigorievich Grigoriev, but Vassili Vassilievich Vassiliev; second, it was not at the All-Union Championship in Moscow, but at a Collective Farm Sports Festival in Smolensk; third, it was not a car, but a bicycle; and fourth he didn't win it, but rather it was stolen from him.
Taking advantage again of the generous opportunities provided to me by US Airlines, I will try to explain the case before the court, and what it decided and why. [Square brackets will indicate the numbered paragraphs of the judgment.] I will not fisk the news story (you can go back and read it for yourself), but I will offer some speculations about who found this eminently sensible ruling so upsetting, and why, that we got treated to this story.
The case (Regina vs. T.) was an appeal of a murder conviction. The appeal apparently raised three issues, the only one of which not redacted in the public judgment is "the extent to which evaluative expert evidence of footwear marks is reliable and the way in which it was put before the jury" . One of — and in fact it seems to be the main — pieces of evidence claimed to identify T. as the murderer was the match between shoe marks found at the scene of the murder and those of a pair of "trainers" (what I believe we'd call "sneakers") "found in the appellant's house after his arrest" . A forensic technician, one Mr. Ryder, compared the prints and concluded, in a written report, that there was "a moderate degree of scientific evidence to support the view that the [Nike trainers recovered from the appellant] had made the footwear marks" . This report was entirely qualitative and contained no statistical formulas or results of any kind. This, however, did not reflect how the conclusion was actually reached, as I will come to shortly.
Statistics were mentioned during the trial. T.'s lawyers (who seem rather hapless and were not retained on appeal) cross-examined Ryder about
figures in the UK over 7--8 years for the distribution of Nike trainers of the same model as that found in the appellant's house; some figures had been supplied to him by the defence lawyers the day before. Mr. Ryder gave evidence that there were 1,200 different sole patterns of Nike trainers; the pattern of Nike trainers that made the marks on the floor was encountered frequently and had been available since 1995; distribution figures for the pattern were only available from 1999. In the period 1996--2006 there would have been 786,000 pairs of trainers distributed by Nike. On those figures some 3% were size 11 [like those in question: CRS]. The pattern could also have been made by shoes distributed by Foot Locker and counterfeits of Nike shoes for which there were no figures. In answer to the suggestion that the pattern on the Nike trainers found at the appellant's house was of a common type, he said: "It is just one example of the vast number of different shoes that are available and to put the figures into context, there are around 42 million pairs of shoes sold every year so if you put that back over the previous 7 or 8 years, sports shoes alone, that multiplies up to nearly 300 million pairs of sports shoes so that particular number of shoes, produced which is a million, based on round numbers, is a very small proportion." These figures were repeated, with emphasis, by the trial judge in his instructions to the jury .
I said a moment ago that Ryder's written report, pre-trial, was entirely qualitative. This turns out to not really reflect what he did. In addition to looking at the shoes and the shoe-prints, he also worked through a likelihood ratio calculation, as follows [34--38]. The two hypotheses he considered were, as nearly as I can make out, "These prints were made by these shoes", and "These prints were made by some other shoe, randomly selected from all of the UK". (I will come back to these alternatives.) He considered that there were four variables he could work with: the pattern of the print, the size, the amount of wear, and the amount of damage to the shoe.
He then turned to a scale which had been plucked from the air (to put it politely) by some forensics policy entrepreneurs a few years before, which runs as follows :
|1--10||Weak or limited support|
|10--100||Moderate support|
|100--1,000||Moderately strong support|
|1,000--10,000||Strong support|
|10,000--1,000,000||Very strong support|
|>1,000,000||Extremely strong support|
In Mr Ryder's reports for the trial... there was no reference at all to any of these statistics, the formula [for the likelihood ratio], or to the use of a likelihood ratio or to the scale of numerical values set out [above]. The conclusion in his first report, which was supported by the statistics, formula, and resulting likelihood ratio, was expressed solely in terms of the verbal scale... this was dated one day after the notes in which he had recorded his calculations. Mr Ryder's explanation for the omission was that it was not standard practice for the detail relating to the statistics and likelihood ratios to be included in a report. He made clear that the data were not available to an exact and precise level and it was only used to confirm an opinion substantially based on his experience and so that it could be expressed in a standardised form. 
There are a couple of things to note about this, not all of which the court did.
First, the numbers Ryder used were vastly different from those mentioned during the trial. "He made clear that the pattern was the one that was encountered most frequently in the laboratory, but he did not give the actual figures used by him... even though the figures in the database which he used in his formula were more favorable to the appellant". With those numbers, the likelihood ratio would be not 100:1 but 13,200:1 in favor of T.'s shoes having left the marks. But what's two orders of magnitude in a murder trial between friends?
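To see how sensitive such a figure is to its inputs, here is a sketch of the kind of calculation the judgment describes. All the frequencies below are invented for illustration (the court did not publish all of Ryder's actual inputs): under "these shoes made the marks" a match on every variable has probability near one, while under "some other, randomly chosen shoe made them" it is roughly the product of the population frequencies of the matching features.

```python
# Likelihood ratio for a shoe-mark "match", as sketched in the judgment.
# All frequencies below are invented for illustration only.

def likelihood_ratio(match_freqs):
    """LR = P(match | same shoe) / P(match | random other shoe),
    taking the numerator as ~1 and the features as independent."""
    p_match_given_other = 1.0
    for f in match_freqs:
        p_match_given_other *= f
    return 1.0 / p_match_given_other

def verbal_scale(lr):
    """A verbal scale of the general shape used by the FSS."""
    if lr > 1_000_000: return "extremely strong support"
    if lr > 10_000:    return "very strong support"
    if lr > 1_000:     return "strong support"
    if lr > 100:       return "moderately strong support"
    if lr > 10:        return "moderate support"
    if lr > 1:         return "weak or limited support"
    return "no support"

# Hypothetical frequencies for pattern, size, wear and damage:
rough = [0.2, 0.25, 0.5, 0.2]     # coarse figures
finer = [0.02, 0.1, 0.25, 0.15]   # different data for the same variables

for freqs in (rough, finer):
    lr = likelihood_ratio(freqs)
    print(round(lr), verbal_scale(lr))
```

Plausible-looking changes in the frequency inputs move the same piece of evidence across the verbal categories, which is exactly the court's worry about the data not bearing the weight of the formula.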
Second, neither set of numbers is anything like a reliable basis for calculation:
It is evident from the way in which Mr Ryder identified the figures to be used in the formula for pattern and size that none has any degree of precision. The figure for pattern could never be accurately known. For example, there were only distribution figures for the UK of shoes distributed by Nike; these left out of account the Footlocker shoes and counterfeits. The figure for size again could not be any more than a rough approximation because of the factors specified by Mr Ryder. Indeed, as Mr Ryder accepted, there is no certainty as to the data for pattern and size. (The Guardian, incidentally, glossed this as "The judge complained that he couldn't say exactly how many of one particular type of Nike trainer there are in the country", which is not the point at all.)
More importantly, the purchase and use of footwear is also subject to numerous other factors such as fashion, counterfeiting, distribution, local availability and the length of time footwear is kept. A particular shoe might be very common in one area because a retailer has bought a large number or because the price is discounted or because of fashion or choice by a group of people in that area. There is no way in which the effect of these factors has presently been statistically measured; it would appear extremely difficult to do so, but it is an issue that can no doubt be explored for the future. [81--82]
Third, the use of the likelihood ratio and statistical evidence is more than a bit of a bureaucratic fiction.
Mr Lewis [the "principal scientist at the FSS responsible for Case Assessment and Interpretation"] explained that in relation to footwear the first task of the examiner was to decide whether the mark could have been made by the shoe. If it could have been made, then what the FSS tried to do was to use the likelihood ratio to convey to the court the meaning of "could have been made" and how significant that was.
As Mr Lewis accepted, numbers were not put into reports because there was a concern about the accuracy and robustness of the data, given the small size of the data set and factors such as distribution, purchasing patterns and the like. It was therefore important that the emphasis on the use of a numerical approach was to achieve consistency; the judgment on likelihood was based on experience. [57--58]
Or, shorter: the examiners go by their trained judgments, but then work backwards to the desired numbers to satisfy bureaucratic mandates, even though everyone realizes the numbers don't bear scrutiny.
Fourth, to the extent that likelihood ratios and related statistics actually are part of the forensic process, they need to be presented during the trial, so that they can be assessed like any other evidence. Using them internally for the prosecution, but then sweeping them away, is a recipe for mischief. "It is simply wrong in principle for an expert to fail to set out the way in which he has reached his conclusion in his report.... [T]he practice of using a Bayesian approach and likelihood ratios to formulate opinions placed before a jury without that process being disclosed and debated in court is contrary to principles of open justice."  This, ultimately, was the reason for granting the appeal.
So where do we get to the point where (to quote The Guardian again) "a mathematical formula was thrown out of court"? Well, nowhere, because, to the extent that the court limited the use of Bayes's rule and likelihood ratios, it was re-affirming long-settled British law. As the judgment makes plain, "the Bayesian approach" and this sort of use of likelihood ratios were something "which this court had robustly rejected for non-DNA evidence in a number of cases" starting with R. vs. Dennis Adams in 1996 . The basis for this "robust rejection" is also old, and in my view sound:
The principles for the admissibility of expert evidence [are that] the court will consider whether there is a sufficiently reliable scientific basis for the evidence to be admitted, but, if satisfied that there is a sufficiently reliable scientific basis for the evidence to be admitted, then it will leave the opposing views to be tested in the trial before the jury. 
In the case of DNA evidence, "there has been for some time a sufficient statistical basis that match probabilities can be given" . But for footwear,
In accordance with the approach to expert evidence [laid down by previous judgments], we have concluded that there is not a sufficiently reliable basis for an expert to be able to express an opinion based on the use of a mathematical formula. There are no sufficiently reliable data on which an assessment based on data can properly be made... An attempt to assess the degrees of probability where footwear could have made a mark based on figures relating to distribution is inherently unreliable and gives rise to a verisimilitude of mathematical probability based on data where it is not possible to build that data in a way which enables this to be done; none in truth exists for the reasons we have explained. We are satisfied that in the area of footwear evidence, no attempt can realistically be made in the generality of cases to use a formula to calculate the probabilities. The practice has no sound basis.
It is of course regrettable that there are, at present, insufficient data for a more certain and objective basis for expert opinion on footwear marks, but it cannot be right to seek to achieve objectivity by reliance on data which does not enable this to be done. We entirely understand the desire of the experts to try and achieve the objectivity in relation to evidence of footwear marks, but the work done has never before, as we understand it, been subject to open scrutiny by a court. [86--87]
It is worth repeating that, despite the newspapers, this is not new law: "It is quite clear therefore that outside the field of DNA (and possibly other areas where there is a firm statistical base), this court has made it clear that Bayes theorem and likelihood ratios should not be used" . Nonetheless, this does not amount to an obscurantist rejection of Bayes's theorem:
It is not necessary for us to consider ... how likelihood ratios and Bayes theorem should be used where there is a sufficient database. If there were a sufficient database in footwear cases an expert might be able to express a view reached through a statistical calculation of the probability of the mark being made by the footwear, very much in the same way as in the DNA cases subject to suitable qualification, but whether the expert should be permitted to go any further is, in our view, doubtful. The judgment goes on [91--95] to make clear that experts can have a sound scientific basis for their opinions even if these cannot be expressed as statistical calculations from a database. The objection rather is to spurious precision, and spurious claims to a scientific status .
There is a legitimate criticism to make of the court here, which is that it is not very specific about what would count as a "sufficient database", or "firm" statistics. It may be that the earlier cases cited fill this in; I haven't read them. This didn't matter for DNA, because people other than the police had other reasons for assembling the relevant data, but for something like shoes it's hard to see who would ever do it other than something like the FSS, and they are not likely to do so without guidance about what would be acceptable to the courts. On the other hand, the judges might have felt that articulating a specific standard simply went beyond what was needed to decide this case.
There is more in the judgment, including a discussion of what the court thought footwear examiners legitimately can and cannot generally say based on the evidence (drawing heavily on how this is done in the US). Rather than go into that, I will mention some more technical issues suggested by, but not discussed in, the judgment.
So, we have a situation where the "Bayesian approach" supposedly being taken by the forensic specialists was not noticeably Bayesian, in addition to being based on hopelessly vague numbers and more than a bit of an administrative fiction.
The verbal scale for likelihoods I mentioned above was the brain-child of a trade organization of British forensic specialists [52--53] in the 2000s. It grew out of a movement to formalize the evaluation of forensic evidence through likelihood ratios, which participants described as "the Bayesian approach". "On the evidence before us this development occurred in the late 1990s and was based on the approach to expert evidence on DNA. It was thought appropriate to translate [that] approach... to other areas of forensic evidence" . Several of the leading participants in this movement were evidently employees of the FSS, or otherwise closely affiliated with it. They seem to have been the ones responsible for insisting that all evaluative opinions be justified for internal consumption by a likelihood ratio calculation, and then expressed on that verbal scale.
That they started pushing for that just a few years after the British courts had ruled that such calculations were inadmissible when based on unreliable (or no) data might explain why these calculations were kept internal, rather than being exposed to scrutiny. That they pushed such calculations at all seems to be explained by a very dogmatic case of Bayesian ideology, expressed, e.g., in an extraordinary statement of subjectivism  that out-Savages Savage. Why they thought likelihood ratios were the Bayesian approach, though, I couldn't begin to tell you. (It would certainly be news to, say, Neyman and Pearson.) It would be extraordinary if these people were confusing likelihood ratios and Bayes factors, but that's the closest I can come to rationalizing this.
Sociologically considered, "forensic science", so called, is a relatively new field which is attempting to establish itself as a profession, with legitimate and recognized claims to authority over certain matters, what Abbott, in the book linked to just now, calls "jurisdiction". Part of professionalization is convincing outsiders that they really do need the specialized knowledge of the professionals, and it's very common, in attempts at this, for people to try to borrow authority from whatever forms of knowledge are currently prestigious. I suppose it's a good thing for us statisticians that Bayesian inference currently seems, to a would-be profession, like a handy club with which to beat down those who would claim its desired territory.
Still, if this aspect of professionalization often seems like aping the external forms of real science, while missing everything which gives those forms meaning, I think that's because it is. Forensics people making a fetish of the probability calculus when they have no basis for calculation is thus of a piece with attempts to turn cooking into applied biochemistry, or eliminate personality conflicts through item response theory. One has to hope that if a profession does manage to establish itself, it grows out of such things; sometimes they don't.
Naturally, being comprehensively smacked down by the court is going to smart for these people. I imagine prosecutors are unhappy as well, as this presumably creates grounds for appeals in lots of convictions. Expert witnesses (such as those quoted in the Guardian story) are probably not best pleased at having to admit that when they give precise probabilities, it is because their numbers are largely made up. I can sympathize with these people as human beings in an awkward and even, in some cases, deeply unenviable position, and certainly understand why they'd push back. (If I had to guess why a decision dated October 2010 got written up, in a thoroughly misleading way, in a newspaper in October 2011, it would be that it took them a while to find a journalist willing to spin it for them.) But this doesn't change the fact that they are wrong, and the judges were right. If they really want to use these formulas, they need to get better data, not complain that they're not allowed to give their testimony — in criminal trials, no less! — a false air of scientific authority.
Update, next day: Typo fixes, added name and link for the journalist.
Update, 29 October: Scott Martens points to a very relevant paper, strikingly titled "Is it a Crime to Belong to a Reference Class?" (Mark Colyvan, Helen M. Regan and Scott Ferson, Journal of Political Philosophy 9 (2001): 168--181; PDF via Prof. Colyvan). This concerns a US case (United States vs. Shonubi). There, the dispute was not about whether Shonubi was smuggling drugs (he was), or had been convicted fairly (he had), but about whether his sentence could be based on a statistical model of how much he might have smuggled on occasions when he was not caught. The appeals court ruled that this was not OK, leading to a parallel round of lamentations about "the legal system's failure to appreciate statistical evidence" and the like. The paper by Colyvan et al. is a defense of the appeals court's decision, largely on the grounds of the reference class problem, or, as they equivalently note (p. 179 n. 27) of model uncertainty (as well as crappy figures), though they also raise some interesting points about utilities.
Manual trackback: Abandoned Footnotes
1: I say "unfortunate", because, while the court makes clear he was just following standard procedure as set by his bosses and is not to be blamed in any way, he cannot be a popular man with those bosses after this.
2: To drive home the difference between more likely and more probable, recall Kahneman and Tversky's famous example of Linda the feminist bank teller:
Linda is 31 years old, single, outspoken, and very bright. She
majored in philosophy. As a student, she was deeply concerned with issues of
discrimination and social justice, and also participated in anti-nuclear
demonstrations. Which is more probable? Linda is a bank teller, or
Linda is a bank teller and is active in the feminist movement.
The trick is that while Linda is more likely to be as described if she is a feminist bank teller than if she is a bank teller with unknown views on feminism, it is nonetheless more probable that she is a bank teller. Of course in the legal case the alternatives are not nested (as here) but mutually exclusive.
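In toy numbers (all invented), the distinction looks like this: the description can be more *likely* under the conjunction, while the conjunction remains less *probable* both before and after conditioning on the description.

```python
# Invented probabilities illustrating likely-vs-probable.
p_teller          = 0.05   # P(Linda is a bank teller)
p_feminist_teller = 0.01   # P(teller AND feminist); necessarily <= p_teller

# Suppose the description fits 90% of feminist tellers but only 5% of
# other tellers:
p_desc_given_feminist_teller = 0.90
p_desc_given_teller = (0.90 * p_feminist_teller
                       + 0.05 * (p_teller - p_feminist_teller)) / p_teller

print(p_desc_given_feminist_teller > p_desc_given_teller)  # more likely: True

# Posteriors by Bayes's rule, up to the common factor P(description):
post_teller          = p_desc_given_teller * p_teller
post_feminist_teller = p_desc_given_feminist_teller * p_feminist_teller
print(post_feminist_teller < post_teller)  # still less probable: True
```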
3: I have no reason to think this murder case had anything to do with Bristol in general or Clifton in particular, both of which I remember fondly from a year ago.
4: I think one could do more with notions like ergodicity, and algorithmic, Martin-Löf randomness, than Spanos is inclined to, but in practice of course one simply uses a model.
Posted by crshalizi at October 27, 2011 23:50 | permanent link
Attention conservation notice: 2300 words about an odd, un-influential old book on radical political economy and statistical mechanics, plus gratuitous sniping at respectable mainstream economics.
I tracked this down because I somehow ran across a link to a conference devoted to it. It appears to have emerged from the debates provoked among heterodox economists by input-output analysis, especially as employed by Sraffa and his followers.
A word about input-output analysis. This is a technique, developed largely by the great economist Wassily Leontief, for analyzing the technological interdependencies of different sectors of the economy, and especially physical resource flows. Start with some good, say (because I am writing this as I do laundry) washing machines. Making a washing machine calls for certain inputs: so much steel, rubber, glass, wire, a motor, switches, tubing, paint, ball-bearings, etc.; also electric power for the factory, workers, wear and tear on assembly-line machinery. To provide each one of those inputs in turn requires other inputs. Ultimately, one can imagine (if not actually estimate) a gigantic matrix which shows, for each distinct good in the economy, the physical quantity of all other goods required to produce one unit of that commodity. (At least, in a linear approximation.) Given an initial vector of inputs, this defines the range of possibilities of production. Conversely, given a desired vector of outputs, this defines the minimum required inputs. (It is no coincidence that input-output analysis fits so well together with linear programming.)
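A minimal numerical sketch of that bookkeeping, with an invented two-good matrix (Python here, since only the arithmetic matters):

```python
import numpy as np

# Toy input-output system: A[i, j] = units of good i needed to make one
# unit of good j.  Coefficients are invented.
A = np.array([[0.1, 0.3],    # e.g. steel into steel, steel into washers
              [0.0, 0.2]])   # washers into steel, washers into washers
d = np.array([10.0, 5.0])    # desired final deliveries of each good

# Gross output x must cover both final demand and intermediate use:
# x = A x + d, so x = (I - A)^{-1} d.
x = np.linalg.solve(np.eye(2) - A, d)
print(x)
```

The solution necessarily exceeds d, the excess being the intermediate flows; running the map the other way gives the minimum inputs for a desired output vector, hence the fit with linear programming.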
If one takes input-output analysis seriously, and assumes (following Marx, and indeed the whole tradition of classical economics back to Adam Smith at least) a uniform rate of profit across industries and even firms, then one runs into insuperable difficulties for the labor theory of value. Put simply, prices, at least equilibrium prices, are then determined by the uniform profit rate and the coefficients in the input-output matrix, with no real relation to how much labor goes into different commodities.
The authors — mathematicians who are, plainly, Marxian socialists, if not perhaps strictly Marxists — deny the premise that the rate of profit is uniform. ("Profit" here is defined as money received for goods sold, minus money paid for wages, raw materials, rent, and wear-and-tear on capital assets. It is thus before taxes, repayment of loans, and investment.) They agree that firms and industries where it is above average will tend to attract investment, and those where it is below average will tend to shed capital, and that these forces tend to equalize the profit rate. But they deny that there is any reason to think that this force should produce complete uniformity, or even very close uniformity. After all, there is a tendency for the speed of molecules in a gas to equalize, but that doesn't mean they all end up with the same speed. This is their main, driving analogy, and they think it so important that they devote chapter two [of eight] to accurately expounding the elements of the kinetic theory of gases and of statistical mechanics. They suggest that there should be a random distribution of profit rates, and (on the analogy with statistical mechanics again, and for no deeper reason that I noticed) that it should be a gamma distribution. (Why not a beta? Why not a log-normal? Why any of the cookbook distributions?)
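Their gamma conjecture is at least easy to confront with data once one has profit-rate figures: fit a gamma, say by the method of moments, and compare. A sketch on simulated rates (no real data here, and the "rates" are drawn from a gamma to begin with):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in "profit rate" sample: positive, dispersed, right-skewed.
rates = rng.gamma(shape=3.0, scale=0.05, size=10_000)

# Method-of-moments fit: for Gamma(k, theta), mean = k*theta and
# variance = k*theta**2, so k = mean**2/var and theta = var/mean.
m, v = rates.mean(), rates.var()
k_hat, theta_hat = m * m / v, v / m
print(k_hat, theta_hat)   # close to the generating (3.0, 0.05)
```

Of course the same two moments would fit a log-normal about as well over much of its range, which is the parenthetical complaint: nothing in the argument picks out the gamma.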
They then try to mesh this with something very much like a labor theory of value, though they are careful not to actually assert such a theory. Starting from the assumption that "labor" is a universal input into the production of all commodities, they define the "labor content" of a commodity as the total amount of labor needed to produce it using current technology (and summing over all the goods needed to produce that technology, all the goods needed to produce those goods, and so on). Because this is defined with respect to current technology, this is not the same as the amount of labor which, historically, happened to have gone into any one good. (By design, it is however reminiscent of Marx's attempt to define the value of a commodity as the quantity of "socially necessary" labor time which went into producing it.) They further claim that there will be a certain characteristic distribution of labor content over the commodities bought and sold in a given economy over a given span of time.
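In matrix terms (my notation, not theirs), labor content is the same kind of fixed point as the Leontief calculation: if l gives direct labor per unit of each good and A the input-output coefficients, the total content v solves v = l + vA. A sketch with invented numbers:

```python
import numpy as np

A = np.array([[0.1, 0.3],   # invented input-output coefficients
              [0.0, 0.2]])
l = np.array([2.0, 1.0])    # direct labor per unit of each good

# v = l + v A  =>  v (I - A) = l  =>  solve the transposed system.
v = np.linalg.solve((np.eye(2) - A).T, l)
print(v)   # direct plus indirect labor content, componentwise > l
```

Swapping l for a vector of direct water (or electricity) inputs gives the "water content" of every commodity by the identical formula, which is the substitution argument made below.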
With these two distributions, they then argue as follows.
It is only in the last chapter that they present any sort of empirical evidence whatsoever. This is scanty, and it is not clear that the compilations they found on profit rates really are using a definition of "profit", much less of "capital", which matches theirs, but the comparison between the histograms and their fitted gamma distributions isn't visually painful. It shows that realized profit rates, from firms which are large enough, and live long enough, to be included in directories of companies have a wide dispersion and are somewhat right-skewed. Even this does not quite settle the matter of uniformity of profit rates. Because investments must be made now for profit later, what the forces of competition should equalize are not these realized, ex-post profit rates, but rather predicted, ex ante rates. Even if everyone agreed in their predictions of profitability (obviously not the case), and even if ex ante rates were uniform, one would expect the ex-post profit rates to have non-trivial dispersion, though a stable distribution for the latter is another story.
To sum up what's gone so far: I am happy with the idea that there is no uniform rate of profit, though their case is hardly air-tight. I am utterly unpersuaded by the attempts to rehabilitate even a shadow of the labor theory of value on this basis. There seem to me to be two key points where it fails. One is the traditional problem that labor is not really a homogeneous commodity. The other is that labor does not have any unique role in their formal framework.
The traditional issue here is that they have to assume there is a single commodity called "labor" (or "labor-power" or "abstract labor"), and that producing one unit of this requires the same inputs, no matter where in the economic system the labor is applied, i.e., what type of work it is really doing. This has long been recognized as a huge problem with labor theories of value; they devote Appendix II to acknowledging it; and they wave it away. This seems to me to make no more economic sense than lumping together all the different fuels produced by an oil refinery, electricity from a wind-mill, and fields of beets as the commodity of "energy" (or even "abstract energy").
Granting, for the sake of argument, that we can treat all forms of labor as equivalent (including equality in what's needed to produce them), there is still another problem. They can define a labor-content for every commodity because labor is "universal", a direct or indirect input to the production of every other commodity. But this is the only feature of labor which they really use in their arguments. So any other universal input would do as well. Water, for instance, is an input into the production of labor, and so one could just as well go through everything in their analysis in terms of water-content rather than labor-content. Indeed, water and electricity, being much more nearly homogeneous physical substances than "labor", would seem to make an even better basis for the analysis. So to the extent that they have a basis for saying that the ratio between the prices of commodities and their labor content is nearly constant, I could equally say the same of the ratio between prices and water content, or electric content1. They were, I think, aware of this objection to at least some degree, since they single out labor on the grounds that economists should be interested in the metabolism of the social organism, which necessarily involves labor. But I fail to see why materialist economists, studying the social metabolism, should not be equally interested in water, or electricity, or indeed thermodynamic free energy in general.
At a deeper level, Farjoun and Machover think economics suffers from assuming economic variables have deterministic relationships, which we just measure imperfectly; they want to take stochastic models as basic. (They want to introduce noise into the dynamics, and not just into observations2.) I am, naturally, very sympathetic to this, but they fail to convince me that it really would make as much difference as they claim. Someone like Haavelmo could, I think, have accepted this postulate with no change at all in his econometric practice. On the other hand, something like John Sutton's approach of finding inequalities which hold across huge ranges of economic models actually seems to lead to real insights into how the economy is organized and evolves, and is a much bigger departure, methodologically, from the mainstream approach than what Farjoun and Machover advocated.
If you want to understand how capitalism works, I think you are no worse off spending your time reading Farjoun and Machover than, say, Kydland and Prescott3. The math is fine, and where sketchy could be elaborated endlessly by clever graduate students, but in neither case does it really support a valuable account of the mechanisms and processes of the real economy, because the mathematical structure is raised along lines laid down by a tradition which is irrelevant when not actively misguided. One might ask, then, why one of these efforts languishes in obscurity, and the other does not; but whether that is because one of them is very congenial to both right-wing politics and to a well-entrenched style of economics, and the other is not, is a question I will leave to the competence of the historians of economics.
1: This would imply a nearly-constant ratio between labor content and water content, which I suspect would be the ratio of the entries for labor and water in the dominant eigenvector of the input-output matrix. But that's just a guess based on the Frobenius-Perron theorem. (It does not seem worthwhile to pursue this to a definite answer.)
2: Note that in a dynamic stochastic general equilibrium model, the "stochastic" part comes solely from an unobserved, and generically unobservable, "shock" process. (This process may be vector-valued, and its projections along some preferred basis may be given suggestive names, like "technology".) The actions of the agents in such models are however deterministic functions of the state of the system facing them, which leads to the use of complicated, face-saving machinery for observational noise.
3: It may be worth noting that Kydland and Prescott, and their intellectual tradition, also assume homogeneous abstract labor. In fact, the Kydland-Prescott "real business cycle" model further assumes homogeneous abstract "capital", and a single homogeneous abstract consumption good. (One could even argue that it has an embedded labor theory of value.) In fairness, this was to some degree inherited from earlier approaches like Solow's growth model; in further fairness, Solow is too wise to mistake his model for a deep, "structural" description of the economy.
Furthermore, all models in the real business cycle/DSGE tradition have a huge, but generally ignored, measurement problem, since it is by no means obvious that the model variables called "output", "capital", "labor", etc., correspond exactly to standard statistics like GDP, market capitalization, and recorded hours worked (respectively), though almost all attempts to connect these models to data assume that they do. At most, typically, one allows for IID Gaussian measurement error. (Boivin and Giannoni's "DSGE Models in a Data-Rich Environment" is a notable exception, and even they handle this systematic mis-match between theoretical variables and empirical measurements through an ad hoc factor model.) The point being, while Farjoun and Machover's scheme has serious issues with the definition of its variables and their measurement, it is not as though such defects stop economists from adopting modeling approaches they otherwise find attractive, or even bother them very much.
Posted by crshalizi at October 26, 2011 09:45 | permanent link
Lecture 16: Why simulate? Generating random variables as first step. The built-in R commands: rnorm, runif, etc.; sample. Transforming uniformly-distributed random variables into other distributions: the quantile trick; the rejection method; illustration of the rejection method. Understanding pseudo-random number generators: irrational rotations; the Arnold cat map as a toy example of an unstable dynamical system; illustrations of the Arnold cat map. Controlling the random number seed.
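The quantile trick can be sketched in a few lines of R: if Q is the quantile function of the target distribution and U is uniform on (0,1), then Q(U) has the target distribution. Here is a minimal version for the exponential distribution, written as though the built-in rexp did not exist (rexp.quantile is a made-up name for illustration):

```r
# Inverse-transform ("quantile trick") sampling for the Exponential(rate)
# distribution, whose quantile function is Q(u) = -log(1-u)/rate.
rexp.quantile <- function(n, rate = 1) {
  u <- runif(n)          # uniform draws on (0,1)
  -log(1 - u) / rate     # pushed through the quantile function
}

set.seed(42)             # controlling the seed, as in the lecture
x <- rexp.quantile(1e5, rate = 2)
mean(x)                  # should be close to 1/2, the Exponential(2) mean
```

The same recipe works for any distribution whose quantile function is available, which is exactly why R's q* functions are so useful for simulation.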
Posted by crshalizi at October 24, 2011 13:54 | permanent link
Lecture 15: Abstraction as a way to make programming more friendly to human beings. Refactoring as a form of abstraction. The rectification of names. Consolidation of related values into objects. Extracting common operations. Defining general operations. Extended example with the jackknife. R.
Posted by crshalizi at October 24, 2011 13:53 | permanent link
Lecture 14: Implementing the split/apply/combine pattern with the plyr package. Advantages over implementations in base R. Drawbacks. Examples. Limitations of the split/apply/combine pattern. R and data.
Posted by crshalizi at October 24, 2011 13:52 | permanent link
Posted by crshalizi at October 24, 2011 13:51 | permanent link
Lecture 12: Design patterns and their benefits: clarity on what is to be done, flexibility about how to do it, ease of adapting others' solutions. The split/apply/combine pattern: divide big structured data sets up into smaller, related parts; apply the same analysis to each part independently; combine the results of the analyses. Trivial example: rowSums, colSums. Further examples. Iteration as a verbose, painful and clumsy implementation of split/apply/combine. Tools for split/apply/combine in basic R: the apply function for arrays, lapply for lists, mapply, etc.; split. Detailed example with a complicated data set: Masters 2011 Golf Tournament. R, data.
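In base R, the pattern can be spelled out with split() and sapply(), or compressed into a single tapply() call. A toy stand-in for the golf data (invented scores, purely for illustration):

```r
# split/apply/combine by hand: split() divides the strokes by player,
# sapply() applies the same summary to each piece and combines the
# results into a named vector.
scores <- data.frame(player  = rep(c("A", "B", "C"), each = 4),
                     strokes = c(70, 72, 71, 69,
                                 74, 73, 75, 72,
                                 68, 70, 69, 71))
by.player <- split(scores$strokes, scores$player)   # the "split" step
avg <- sapply(by.player, mean)                      # "apply" + "combine"

# The same pattern in one call:
avg2 <- tapply(scores$strokes, scores$player, mean)
```

Compare this to the for-loop version, which needs explicit bookkeeping for the output and is much easier to get wrong.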
Posted by crshalizi at October 24, 2011 13:50 | permanent link
In which we estimate the parameters of a linear regression by minimizing the median absolute error, rather than the mean squared error, so as to reduce the influence of outliers (and to practice using functions as arguments and as return values).
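A minimal sketch of the idea: one function builds the objective (a function returning a function), and optim() takes that objective as an argument. The model, data, and the name make.mae are all invented for illustration:

```r
# Least-median-absolute-error regression: build the objective with a
# function factory, then hand it to optim() (default Nelder-Mead).
make.mae <- function(x, y) {
  function(beta) median(abs(y - (beta[1] + beta[2] * x)))
}

set.seed(7)
x <- runif(100, 0, 10)
y <- 3 + 2 * x + rnorm(100)
y[1:5] <- y[1:5] + 50          # plant a few gross outliers
fit <- optim(c(0, 1), make.mae(x, y))
fit$par                        # should be near (3, 2) despite the outliers
```

Because the objective is the median of the absolute errors, the five corrupted points have essentially no influence on the fit, which is the whole point of the exercise.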
Posted by crshalizi at October 08, 2011 17:30 | permanent link
In which we made our cats a likelihood function, but we maximized it*.
*: I was tempted to title this lab "I can has likelihood surface?", but resisted.
Posted by crshalizi at October 08, 2011 17:20 | permanent link
Attention conservation notice: Only of interest if you (1) care about learning complex stochastic models from limited data, and (2) are in Pittsburgh.
The CMU statistics department sponsors an annual distinguished lecture series in memory of our sainted founder, Morris H. DeGroot. This year, it comes at the end of the workshop on Case Studies in Bayesian Statistics and Machine Learning. We are very happy to have as the lecturer Daphne Koller.
As always, the talk is free and open to the public.
Update, after the talk: We more than filled the auditorium; I had to sit on the stairs.
Posted by crshalizi at October 07, 2011 18:00 | permanent link
Lecture 11: Functions in R are objects, just like everything else, and so can be returned by other functions, with no special machinery required. Examples from math (especially calculus) of operators, which turn one function into another. The importance of scoping when using functions as return values. Example: creating a linear predictor. Example: implementing the gradient operator (two different ways). Example: writing surface, as a two-dimensional analog to the standard curve. The use of eval and substitute to control when and in what context an expression is evaluated. Three increasingly refined versions of surface, employing eval. — R for examples.
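For example, a function factory whose returned function remembers, through scoping, the coefficients it was built with (make.linear.predictor is a hypothetical name, not from the lecture code):

```r
# A function that returns a function: the returned predictor keeps
# access to intercept and slope via the environment where it was defined.
make.linear.predictor <- function(intercept, slope) {
  function(x) intercept + slope * x
}

# Celsius-to-Fahrenheit as a linear predictor:
fahrenheit <- make.linear.predictor(32, 9/5)
fahrenheit(100)   # 212 (water boils)
fahrenheit(0)     # 32 (water freezes)
```

Nothing special was needed to return the inner function; it is an ordinary R object, carrying its defining environment along with it.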
Posted by crshalizi at October 05, 2011 11:50 | permanent link
Lecture 10: Functions in R are objects, just like everything else, and so can be both arguments to and return values of functions, with no special machinery required. Examples from math (especially calculus) of functions with other functions as arguments. Some R syntax relating to functions. Examples with curve. Using sapply to extend functions of single numbers to functions of vectors; its combination with curve. We write functions with lower-level functions as arguments to abstract out a common pattern of operations. Example: calculating a gradient. Numerical gradients by first differences, done two different ways. (Limitations of taking derivatives by first differences.) Incorporating this as a part of a larger algorithm, such as gradient descent. Using adapters, like wrapper functions and anonymous functions, to fit different functions together. — R for examples.
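A sketch of the first-differences gradient, with the function to be differentiated passed in as an ordinary argument (grad is a made-up name; note the O(h) error of first differences that the lecture warns about):

```r
# Numerical gradient by first differences: f is itself an argument.
grad <- function(f, x, h = 1e-6) {
  sapply(seq_along(x), function(i) {
    x.plus <- x
    x.plus[i] <- x[i] + h
    (f(x.plus) - f(x)) / h     # partial derivative in coordinate i
  })
}

f <- function(x) x[1]^2 + 3 * x[2]
grad(f, c(1, 5))   # close to the exact gradient (2, 3)
```

The anonymous function inside sapply() is an adapter: it turns "perturb one coordinate" into the single-argument form that sapply() expects.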
Posted by crshalizi at October 03, 2011 10:30 | permanent link
In which we use Tukey's rule for identifying outliers as an excuse to learn about debugging and testing.
Posted by crshalizi at October 03, 2011 10:29 | permanent link
Via Mason Porter, Danny Yee and others, I see a news story which my kith are glossing along the lines of a judge has ruled that Bayes's Theorem does not apply in Britain. Leave to one side my "tolerate/hate" relationship with Bayesianism; there are certainly cases, and ones of legal application at that, where Bayes's rule amounts to a simple arithmetic statement about population counts, so it would be very remarkable indeed if these were inadmissible in court. While I enjoy disparaging the innumeracy of the legal profession as much as the next mathematically-trained person, this seems like a distortion.
Let me quote from the Guardian story Mason linked to. (I can't find the actual opinion, at least not without more work than it's worth before lecture.) The story
begins with a convicted killer, "T", who took his case to the court of appeal in 2010. Among the evidence against him was a shoeprint from a pair of Nike trainers, which seemed to match a pair found at his home. While appeals often unmask shaky evidence, this was different. This time, a mathematical formula was thrown out of court. The footwear expert made what the judge believed were poor calculations about the likelihood of the match, compounded by a bad explanation of how he reached his opinion. The conviction was quashed.
But more importantly, as far as mathematicians are concerned, the judge also ruled against using similar statistical analysis in the courts in future. ...
In the shoeprint murder case, for example, [applying Bayes's rule] meant figuring out the chance that the print at the crime scene came from the same pair of Nike trainers as those found at the suspect's house, given how common those kinds of shoes are, the size of the shoe, how the sole had been worn down and any damage to it. Between 1996 and 2006, for example, Nike distributed 786,000 pairs of trainers. This might suggest a match doesn't mean very much. But if you take into account that there are 1,200 different sole patterns of Nike trainers and around 42 million pairs of sports shoes sold every year, a matching pair becomes more significant.
The data needed to run these kinds of calculations, though, isn't always available. And this is where the expert in this case came under fire. The judge complained that he couldn't say exactly how many of one particular type of Nike trainer there are in the country. National sales figures for sports shoes are just rough estimates.
And so he decided that Bayes' theorem shouldn't again be used unless the underlying statistics are "firm". The decision could affect drug traces and fibre-matching from clothes, as well as footwear evidence, although not DNA.
What I take from this is that the judge was asking for reasons to believe the numbers going in to Bayes's rule be accurate. This is, of course, altogether the right reaction. Unless the component numbers in the calculation --- the base rates and the likelihoods --- are right, the posterior probability has no value as evidence, because it has no connection whatsoever to the truth. Unless those components are validated, the differences between a witness who says "My posterior probability is 0.99" and one who says "I'm, like, really sure" are:
To reinforce just how badly wrong a simple-minded application of Bayes's rule can go, I invite you to consider the saga of the Phantom of Heilbronn. The combined police forces of Europe spent years searching for a criminal known from high-quality forensic evidence (DNA) left at more than 40 crime scenes across a wide swathe of Europe. In the end, it turned out that the reason all these different crime scenes turned up the same DNA is that the swabs used to collect the DNA from the scenes all came from the same factory, and had been contaminated by DNA from a worker there. (Presumably the contamination was accidental.) The case unraveled because while the common DNA was female, it was recovered from a male corpse. If it had been recovered from some unfortunate woman, it's very likely that this would now be regarded as a closed case. No doubt we would then be hearing Bayesian calculations about the odds against the suspect being anyone other than the Heilbronn serial killer --- who, recall, did not exist. (In fact, it's instructive to do a back-of-the-envelope version of the calculation, ignoring the contamination of the swabs.) If you say "Well, of course those calculations are off, the likelihood of the suspect matching a crime-scene in the test when the suspect wasn't really there is all wrong", I can only reply, "Exactly", and add that sensitivity analysis is no substitute for actually understanding where and how the data arise. This is related, of course, to the certainty of the Bayesian fortune-teller.
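That back-of-the-envelope calculation might look something like the following. Every number here is an invented, order-of-magnitude placeholder, not a forensic fact; the point is the structure, not the figures:

```r
# Naive Bayesian calculation for "this suspect is the Heilbronn killer",
# ignoring the possibility that the evidence process is contaminated.
# All numbers are made-up, order-of-magnitude placeholders.
random.match.prob <- 1e-9        # assumed chance an unrelated person matches
prior.odds <- 1 / 1e8            # assumed: one suspect among ~10^8 candidates
posterior.odds <- prior.odds / random.match.prob
p.naive <- posterior.odds / (1 + posterior.odds)
p.naive                          # about 0.9: near-certain guilt

# Now grant even a small probability that the evidence-gathering
# process itself is broken (e.g., swabs contaminated at the factory):
error.prob <- 1e-4
capped.odds <- prior.odds / (random.match.prob + error.prob)
capped.odds                      # about 1e-4: the apparent certainty evaporates
```

Once any appreciable error probability enters the likelihood, it, and not the astronomically small random-match probability, dominates the denominator, which is exactly why the calculation cannot be better than one's understanding of how the data arise.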
It is never pleasant to have claims to professional authority checked, so I certainly feel where my learned British colleagues are coming from*. But I have to conclude that, in so far as the judge said that Bayes's rule "shouldn't ... be used unless the underlying statistics are 'firm'," he was being entirely reasonable. He may, of course, have gone on to establish unreasonable standards for what counts as "firm" statistics; the news stories don't say. Unless that can be shown, however, the most damning verdict we statisticians can return is (what else?) "not proven".
Update, later that day: A reader has kindly supplied me with a copy of the ruling. On a first scan, phrases like "Maths! Nasty, wicked, tricksy maths! We hates them, Precious, hates them forever!" are absent, but I will try to read it and report back.
Update, 27 October: More than you would ever want to know.
*: Let me remind them that one trick which is proven to help people use Bayes's rule rightly is to eschew talk of probabilities, and employ frequency formats. Since Gigerenzer and Hoffrage were able to get doctors — a tribe notorious for their mis-understanding and mis-use of inverse probability — to use Bayes's rule correctly this way, it would be rather surprising if lawyers weren't helped too.
Posted by crshalizi at October 03, 2011 08:50 | permanent link
Attention conservation notice: I have no taste.
(Out of sequence because I didn't get around to posting on the weekend.)
Despite how it looks, I actually put most of my reading time this month into a most wonderful mathematical book, but a review will have to wait until I am completely finished with it.
"Elsinore itself? The very Elsinore? God bless my soul: and yours too, joy. A noble pile. I view it with reverence. I had supposed it to be merely ideal — hush, do not move. They come, they come!"
A flight of duck wheeled overhead, large powerful heavy swift-flying duck in files, and pitched between the castle and the ship.
"Eiders without a doubt," said Stephen, his telescope fixed upon them. "They are mostly young: but there on the right is a drake in full dress. He dives: I see his black belly. This is a day to mark with a white stone." A great jet of white water sprang from the surface of the sea. The eiders vanished. "Good God!" he cried, staring in amazement, "What was that?"
"They have opened on us with their mortars," said Jack. "That was what I was looking for." A puff of smoke appeared on the nearer terrace, and half a minute later a second fountain rose, two hundred yards short of the Ariel.
"The Goths," cried Stephen, glaring angrily at Elsinore. "They might have hit the birds. These Danes have always been a very froward people. Do you know, Jack, what they did at Clonmacnois? They burnt it, the thieves, and their queen sat on the high altar mother-naked, uttering oracles in a heathen frenzy. Ota was the strumpet's name. It is all of a piece: look at Hamlet's mother. I only wonder her behaviour caused any comment."
Posted by crshalizi at September 30, 2011 23:59 | permanent link
In which we practice debugging and testing, while learning about measures of nonlinear association.
Posted by crshalizi at September 28, 2011 15:16 | permanent link
Lecture 9: Our code implements a method for solving problems we expect to encounter in the future; but why should we trust those solutions? We establish the reliability of the code by testing it. To respect the interfaces of the code, we test the substance of the answers, not the procedure used to obtain them, even though it is the reliability of the procedure we ultimately care about. We test both for the actual answer in particular cases and by cross-checking different uses of the same code which should lead to the same answer. Because we do not allow our tests to give us any false alarms, their power to detect errors is limited, and must be focused on particular kinds of errors. We make a virtue of necessity by using a diverse battery of tests, and shaping the tests so that they tell us where errors arise. The testing-programming cycle alternates between writing code and testing its correctness, adding new tests as new errors are discovered. The logical extreme of this is test-driven development, where tests represent the specification of the software's behavior in terms of practical consequences. Drawbacks of testing. Some pointers to more advanced tools for writing, maintaining and using tests in R.
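In miniature, such a battery might look like this, for a hypothetical robust.scale() function standing in for the code under test. The tests check the substance of the answers (a particular case, plus cross-checks of invariances that must hold), not the internal procedure:

```r
# Code under test (hypothetical): center by the median, scale by the MAD.
robust.scale <- function(x) (x - median(x)) / mad(x)

x <- c(1, 2, 3, 4, 100)
# Particular case: the median of the output must be exactly 0.
stopifnot(median(robust.scale(x)) == 0)
# Cross-check: shifting the input must not change the output...
stopifnot(isTRUE(all.equal(robust.scale(x + 7), robust.scale(x))))
# ...and neither should rescaling it by a positive constant.
stopifnot(isTRUE(all.equal(robust.scale(3 * x), robust.scale(x))))
```

Each stopifnot() is silent on success and loud on failure, so a clean run is itself the report; when a test does fail, its message points at which property broke, which helps localize the error.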
(Why yes, this lecture was something of a lay sermon on epistemology.)
Posted by crshalizi at September 28, 2011 15:15 | permanent link
Attention conservation notice: Defense of professional territory (or jurisdiction) against potential rivals.
Cathy O'Neil has an interesting post up about "Why and how to hire a data scientist for your business". I confess that I have never been on the hiring end of such a decision, but everything she says sounds quite reasonable. What strikes me about it, though, is that the skills she's describing a good "data scientist" as having are a subset of the skills of a good statistician. At most, they are a subset of the skills of a good computationally competent statistician. These are even, at least here, undergraduate-level skills. Everyone who gets a bachelor's degree from our department has, after all, taken modern regression and advanced data analysis, and most of them respond to our promptings to take statistical graphics and visualization, data mining, and/or statistical computing. (IMHO, graphics and computing ought to be mandatory courses, but that's another story for another audience.) While I modestly admit to the unrivaled greatness of our undergrad program, I draw two conclusions:
Obligatory disclaimer: I am, of course, speaking for myself and not for the department, much less the school.
Manual trackback: Mims's Bits
Posted by crshalizi at September 27, 2011 13:31 | permanent link
Yet again, the Santa Fe Institute is recruiting post-docs for three-year appointments. If the idea of having the freedom to pursue your own interdisciplinary research in a remarkably stimulating, genuinely collaborative, and physically beautiful environment is appealing, then I strongly encourage you to apply. (Even though more applications will mean more for me to read during the evaluations.) Follow the link for details.
Posted by crshalizi at September 26, 2011 12:25 | permanent link
Lecture 8: Debugging is an essential and perpetual part of programming. Debugging as differential diagnosis: characterize the bug, localize it in the code, try corrections. Tactics for characterizing the bug. Tactics for localizing the bug: traceback, print, warning, stopifnot. Test cases and dummy input generators. Interactive debuggers. Programming with an eye to debugging: writing code with comments and meaningful names; designing the code in a top-down, modular, functional manner. A hint at the exception-handling system.
Posted by crshalizi at September 26, 2011 12:20 | permanent link
In which we meet the jackknife, by way of seeing how much error there is in our estimates from the last lab.
Posted by crshalizi at September 26, 2011 09:32 | permanent link
In which we meet the parametric bootstrap, traveling incognito: we probe the precision of our estimation method from the last lab by seeing how well it would work when the model is true and we know the parameters.
Posted by crshalizi at September 26, 2011 09:31 | permanent link
Lecture 7: R looks for the values of names in the current environment; if it cannot find a value, it looks for the name in the environment which spawned this one, and so on up the tree to the common, global environment. Assignment is modifying the name/value association list which represents the environment. The scope of a name is limited by the current environment. Implications: changes within the current scope do not propagate back to the larger environments; changes in the larger environment do propagate to all smaller ones which it encloses, unless over-ridden by local names. Subtlety: the larger environment for a function is the one in which it was defined, not the one in which it is called. Some implications for design. Examination of the last homework from this stance.
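That subtlety (the larger environment is where the function was defined, not where it is called) can be shown in a few lines:

```r
# Lexical scoping: f was defined at top level, so when it cannot find
# x locally it looks in the global environment -- not in g, its caller.
x <- 10
f <- function() x        # no local x inside f
g <- function() {
  x <- 99                # a local x in g's environment...
  f()                    # ...which f never sees
}
g()                      # returns 10, not 99
```

If f had instead been defined inside g's body, its enclosing environment would be g's, and the call would return 99; this is exactly the design implication the lecture draws out.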
Posted by crshalizi at September 26, 2011 09:30 | permanent link
There's not much connection between the talks, other than that they should both be great, and I don't feel like writing two posts.
As always, both talks are free and open to the public.
Posted by crshalizi at September 24, 2011 16:06 | permanent link
Lecture 6: Top-down design is a recursive heuristic for solving problems by writing functions: start with a big-picture view of the problem; break it into a few big sub-problems; figure out how to integrate the solutions to each sub-problem; and then repeat for each part. The big-picture view: resources (mostly arguments), requirements (mostly return values), the steps which transform the one into the other. Breaking into parts: try not to use more than 5 sub-problems, each one a well-defined and nearly-independent calculation; this leads to code which is easy to understand and to modify. Synthesis: assume that a function can be written for each sub-problem; write code which integrates their outputs. Recursive step: repeat for each sub-problem, until you hit something which can be solved using the built-in functions alone. Top-down design forces you to think not just about the problem, but also about the method of solution, i.e., it forces you to think algorithmically; this is why it deserves to be part of your education in the liberal arts. Exemplification: how we could write the lm function for linear regression, if it did not exist and it were necessary to invent it.
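As a sketch of where that exercise ends up, here is a stripped-down my.lm (a hypothetical name) whose top level just integrates two sub-problems: build the design matrix, then solve for the coefficients with the built-in solve(). This uses the normal equations for brevity; a production version would use the QR decomposition, as lm itself does:

```r
# Top-down linear regression: top level states the plan, sub-steps do it.
my.lm <- function(x, y) {
  X <- cbind(Intercept = 1, x)                  # sub-problem 1: design matrix
  coefs <- solve(t(X) %*% X, t(X) %*% y)        # sub-problem 2: normal equations
  fitted <- X %*% coefs                         # sub-problem 3: predictions
  list(coefficients = coefs, fitted = fitted, residuals = y - fitted)
}

fit <- my.lm(cars$speed, cars$dist)
# Agrees with the built-in, up to numerical precision:
all.equal(as.vector(fit$coefficients),
          as.vector(coef(lm(dist ~ speed, data = cars))))
```

Each line of the function body corresponds to one box in the top-down decomposition, which is what makes the code easy to read, test, and modify piece by piece.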
Posted by crshalizi at September 19, 2011 10:30 | permanent link
In which we practice the arts of writing functions and of estimating distributions, while contemplating just how little room there is in the heart of a cat.
Posted by crshalizi at September 19, 2011 10:29 | permanent link
Attention conservation notice: 1900+ words of log-rolling promotion of an attempt by friends to stir up an academic controversy, in a matter where pedantic points of statistical theory intersect the artificial dilemmas of psychological experiments.
There's a growing interest among psychologists in modeling how people think as a process of Bayesian learning. Many of the papers that come from this are quite impressive as exercises in hypothetical engineering, in the Design for a Brain tradition, but long-time readers will be bored and unsurprised to hear that I don't buy them as psychology. Not only do I deny that Bayesianism is any sort of normative ideal (and so that Bayesian models are standards of rationality), but the obstacles to implementing Bayesian methods on the nervous system of the East African Plains Ape seem quite insurmountable, even invoking the computational power of the unconscious mind*. Nonetheless, there are all those experimental papers, and it's hard to argue with experimental results...
Unless, of course, the experimental results don't show what they seem to. This is the core message of a new paper, whose insight is completely correct and something I kick myself for not having realized.
Let me give an extended quotation from the paper to unfold the logic.
In a standard experimental set-up used to confirm a Bayesian model, experimental participants are provided with a cover story about the evidence they are about to see. This cover story indicates (either implicitly or explicitly) the possible hypotheses that could explain the forthcoming data. Either the cover story or pre-training is used to induce in participants a prior probability distribution over this space. Eliciting participants' prior probabilities over various hypotheses is notoriously difficult, and so the use of a novel cover story or pre-training helps ensure that every participant has the same hypothesis space and nearly the same prior distribution. In addition, cover stories are almost always designed so that each hypothesis has equal utility for the participants, and so the participant should care only about the correctness of her answer. In many experiments, an initial set of questions elicits the participant's beliefs to check whether she has extracted the appropriate information from the cover story. Participants are then presented with evidence relevant to the hypotheses under consideration. Typically, in at least one condition of the experiment, the evidence is intended to make a subset of the hypotheses more likely than the remaining hypotheses. After, or sometimes even during, the presentation of the evidence, subjects are asked to identify the most likely hypothesis in light of the new evidence. This identification can take many forms, including binary or n-ary forced choice, free response (e.g., for situations with infinitely many hypotheses), or the elicitation of numerical ratings (for a close-to-continuous hypothesis space, such as causal strength, or to assess the participant's confidence in their judgment that a specific hypothesis is correct). Any change over time in the responses is taken to indicate learning in light of evidence, and those changes are exactly what the Bayesian model aims to capture.
These experiments must be carefully designed so that the experimenter controls the prior probability distribution, the likelihood functions, and the evidence. This level of control ensures that we can confirm the predictions of the Bayesian model by directly comparing the participants' belief changes (as measured by the various elicitation methods) with the mathematically computed posterior probability distribution predicted by the model. As is standard in experimental research, results are reported for a participant population (split over the experimental conditions) to control for any remaining individual variation. Since the model is supposed to provide an account of each participant in the population individually, experimental results must be compared to the predictions of an aggregate (or "population") of model predictions.
Here's the problem: in these experiments (at least the published ones...), there is a decent match between the distribution of choices made by the population, and the posterior distribution implied by plugging the experimenters' choices of prior distribution, likelihood, and data into Bayes's rule. This is however not what Bayesian decision theory predicts. After all, the optimal action should be a function of the posterior distribution (what a subject believes about the world) and the utility function (the subject's preferences over various sorts of error or correctness). Having carefully ensured that the posterior distributions will be the same across the population, and having also (as Eberhardt and Danks say) made the utility function homogeneous across the population, Bayesian decision theory quite straightforwardly predicts that everyone should make the same choice, because the action with the highest (posterior) expected utility will be the same for everyone. Picking actions with frequencies proportional to the posterior probability is simply irrational by Bayesian lights ("incoherent"). It is all very well and good to say that each subject contains multitudes, but the experimenters have contrived it that each subject should contain the same multitude, and so should acclaim the same choice. Taking the distribution of choices across individuals to confirm the Bayesian model of a distribution within individuals then amounts to a fallacy of composition. It's as though the poet saw two of his three blackbirds fly east and one west, and concluded that each of them "was of three minds", two of said minds agreeing that it was best to go east.
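The contrast is easy to see in a toy simulation. The two hypotheses and the 70/30 posterior are purely illustrative numbers, not drawn from any actual experiment:

```r
# Every subject has the SAME posterior over two hypotheses.  A Bayesian
# decision-maker with 0/1 utility picks the posterior mode every time;
# "probability matching" spreads choices in proportion to the posterior.
posterior <- c(H1 = 0.7, H2 = 0.3)
n.subjects <- 1000

# What Bayesian decision theory predicts: unanimity on the mode.
bayes.choices <- rep(which.max(posterior), n.subjects)

# What the experiments actually find, roughly: probability matching.
set.seed(1)
matching.choices <- sample(1:2, n.subjects, replace = TRUE, prob = posterior)

table(bayes.choices) / n.subjects      # 100% choose H1
table(matching.choices) / n.subjects   # roughly 70/30 split
```

The population-level 70/30 split looks like the posterior distribution, but it is the unanimous column, not the matched one, that the decision theory actually predicts.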
By hypothesis, then, the mind is going to great lengths to maintain and update a posterior distribution, but then doesn't use it in any sensible way. This hardly seems sensible, let alone rational or adaptive. Something has to give. One possibility, of course, is that this sort of cognition is not "Bayesian" in any strong or interesting sense, and this is certainly the view I'm most sympathetic to. But in fairness we should (as Eberhardt and Danks do) explore branches of the escape tree for the Bayesians.
There are, of course, situations where the utility-maximizing strategy is randomized; but the conditions needed for that don't seem to hold for these sorts of experiments. The decision problem the experimentalists are trying to set up is one where the optimal decision is indeed a deterministic function of the posterior distribution. And even when a randomized strategy is optimal, it rarely just matches posterior probabilities. An alternative escape is to consider that while the experimentalists try to make prior, likelihood, data and utility homogeneous across the subject population, they almost certainly don't succeed completely. One way this could be modeled is to actually include a random term in the decision model. This sort of technology has actually been fairly well developed by economists, who also try to match actual human behavior to (specious, over-precise) models of choice. This "curse of determinism" is broken by economists by adding a purely stochastic term to the utility being maximized, leading to a distribution of choices. Such random-utility models have not been applied to Bayesian cognition experiments, and, even granting that the individual-level noise terms could be adjusted just so as to get the distribution of individual choices to approximate the noise-free posterior distribution, why should they be?
Now, I do want to raise a possibility which goes beyond Eberhardt and Danks, which goes to the specificity of the distributional evidence. The dynamics of Bayesian updating is an example of the replicator dynamics from evolutionary theory, with hypotheses as replicators and fitness as likelihood. But not only is Bayes a very narrow special case of the replicator equations (no sources of variation analogous to mutation or sex; no interaction between replicators analogous to frequency dependence), lots of other adaptive processes approximately follow those equations as well. Evolutionary search processes (a la Holland et al.'s Induction) naturally do so, for instance, but so does mere reinforcement learning, as several authors have shown. At the level of changing probability distributions within an individual, all of these would look extremely similar to each other and to Bayesian updating. Even if Bayesian models find a way to link distributions within subjects to distributions across populations, specifically supporting Bayesian models would need evidence which differentially favored them over all other replicator-ish models. One way to provide such differential support would be to show that Bayesian models are not only rough matches to the data, but fit it in detail, and fit it better than non-Bayesian models could. Another kind of differential support would be showing that the Bayesian models account for other features of the data, beyond the dynamics of distributions, that their rivals do not. It's for the actual psychologists to say how much hope there is for any such approach; I will content myself by observing that it is very easy to tell an evolutionary-search or reinforcement-learning story that ends with the distribution of people's choices matching the global probability distribution**.
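Concretely, one step of discrete-time replicator dynamics, with likelihood playing the role of fitness, just is Bayes's rule. A sketch, with made-up numbers:

```r
# One step of discrete-time replicator dynamics: each "replicator"
# (hypothesis) grows in proportion to its fitness (likelihood), then
# the population shares are renormalized.
replicator.step <- function(p, fitness) {
  p * fitness / sum(p * fitness)
}

prior <- c(0.5, 0.3, 0.2)
likelihood <- c(0.1, 0.4, 0.9)   # P(data | each hypothesis), invented
posterior <- replicator.step(prior, likelihood)
posterior                        # shares shift toward the fit hypotheses
# This is exactly Bayes's rule, term for term:
all.equal(posterior, prior * likelihood / sum(prior * likelihood))
```

Any process whose aggregate behavior follows this share-times-fitness update (evolutionary search, reinforcement learning under suitable conditions) will trace out the same trajectory of distributions, which is why distribution-level evidence alone cannot single out the Bayesian story.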
What is not secondary at all is the main point of this paper: Bayesian models of inference and decision do not predict that the population distribution of choices across individuals should mirror the posterior distribution of beliefs within each individual. Indeed, the observed matching is so far from the models' predictions as to refute the models. Perhaps, with a lot of technical work in redefining the decision problem and/or modeling experimental noise, the theories could be reconciled with the data. Unless that work is done, and done successfully, these theories are doomed as accounts of human cognition. Anyone who finds these issues interesting would do well to read the paper.
Disclaimer: Frederick is a friend, and David is on the faculty here, though in a different department. Neither of them is responsible for anything I'm saying here.
*: There are times when uninstructed people are quite good at using Bayes's rule: these are situations where they are presented with some population frequencies and need to come up with others. See Gerd Gigerenzer and Ulrich Hoffrage, "How to Improve Bayesian Reasoning without Instruction: Frequency Formats", Psychological Review 102 (1995): 684--704, and Leda Cosmides and John Tooby, "Are Humans Good Intuitive Statisticians After All? Rethinking Some Conclusions from the Literature on Judgement Under Uncertainty", Cognition 58 (1996): 1--73 [PDF]. In my supremely arrogant and unqualified opinion, this is one of those places where evolutionary psychology is not only completely appropriate, but where Cosmides and Tooby's specific ideas are also quite persuasive.
**: It is also very easy to tell an evolutionary-search story in which people have new ideas, while (as Andy and I discussed) it's impossible for a Bayesian agent to believe something it hasn't always already believed at least a little.
Posted by crshalizi at September 18, 2011 21:29 | permanent link
In which we see how to estimate both parameters of the West et al. model from lab, in the process learning about writing functions, decomposing problems into smaller steps, testing the solutions to the smaller steps, and minimization by gradient descent.
Posted by crshalizi at September 15, 2011 11:06 | permanent link
Lecture 5: Using multiple functions to solve multiple problems; to sub-divide awkward problems into more tractable ones; to re-use solutions to recurring problems. Value of consistent interfaces for functions working with the same object, or doing similar tasks. Examples: writing prediction and plotting functions for the model from the last lab. Advantages of splitting big problems into smaller ones with their own functions: understanding, modification, design, re-use of work. Trade-off between internal sub-functions and separate functions. Re-writing the plotting function to use the prediction function. Recursion. Example: re-writing the resource allocation code to be more modular and recursive. R for examples.
Posted by crshalizi at September 15, 2011 11:05 | permanent link
Lecture 4: Just as data structures tie related values together into objects, functions tie related commands together into objects. Declaring functions. Arguments (inputs) and return values (outputs). Named arguments, defaults, and calling functions. Interfaces: controlling what the function can see and do; first sketch of scoping rules. The importance of the interface. An example of writing and improving a function, for fitting the model from the last lab. R for examples.
Posted by crshalizi at September 15, 2011 11:04 | permanent link
In which we use nonlinear least squares to fit the West et al. model.
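In R, nonlinear least squares is a one-liner with `nls()`. The power-law formula and simulated data below are purely illustrative stand-ins, not necessarily the form of the West et al. model:

```r
# Fitting y = a * x^b by nonlinear least squares on simulated data.
set.seed(42)
x <- runif(100, 1, 10)
y <- 3 * x^0.5 + rnorm(100, sd = 0.1)   # true a = 3, b = 0.5, plus noise

fit <- nls(y ~ a * x^b, start = list(a = 1, b = 1))
coef(fit)   # estimates should land close to a = 3, b = 0.5
```

As with most iterative fitters, `nls()` needs starting values (`start`), and bad ones can keep it from converging.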
Posted by crshalizi at September 15, 2011 11:03 | permanent link
In which we make incremental improvements to our code for planning by incremental improvements.
Posted by crshalizi at September 15, 2011 11:02 | permanent link
Lecture 3: Conditioning the calculation on the data: if; what is truth?; Boolean operators again; switch. Iteration to repeat similar calculations: for and iterating over a vector; while and conditional iteration (reducing for to while); repeat and unconditional iteration, with break to exit loops (reducing while to repeat). Avoiding iteration with "vectorized" operations and functions: the advantages of the whole-object view; some examples and techniques: mathematical operators and functions, ifelse; generating arrays with repetitive structure.
Posted by crshalizi at September 15, 2011 11:01 | permanent link
In which we play around with basic data structures and convince ourselves that the laws of probability are, in fact, right. (Or perhaps that R's random number generator is pretty good.)
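In the spirit of the lab, here is a tiny R check of the law of large numbers against the random number generator (my own illustrative snippet, not the assignment's code):

```r
# Uniform(0,1) draws have mean 1/2 and variance 1/12; with enough
# samples, the empirical values should come very close.
set.seed(1)
x <- runif(1e5)
mean(x)   # should be close to 0.5
var(x)    # should be close to 1/12 (about 0.0833)
```

If these came out far from their theoretical values, we would suspect either the laws of probability or (more plausibly) the generator.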
Posted by crshalizi at September 15, 2011 11:00 | permanent link
With great data comes great responsibility:
Posted by crshalizi at September 01, 2011 13:20 | permanent link
Attention conservation notice: I have no taste.
I think transduction (in the modern sense of the word, perhaps not what Vovk et al discuss) is statistically distinct from induction. I'm not aware of any transductive sample complexity upper bounds that beat corresponding lower bounds for inductive sample complexity. However, transductive upper bounds often beat inductive ones, e.g., "Collaborative Filtering with the Trace Norm: Learning, Bounding, and Transducing".
My superficial impression from the paper Shiva points me to is that it deals with a finite set of objects (entries in a matrix), and the difference between the "inductive" and "transductive" set-ups comes from the former sampling entries with replacement, which is kind of silly in this context, while the latter does not. But clearly I need to read and think more deeply before being entitled to an opinion. (This concludes this edition of Shalizi Smackdown Watch.)
The reduction you posted doesn't work for matrix completion. By considering a hypothetical new missing entry, one eliminates a present entry, which could change the predicted values for the other missing entries.
Books to Read While the Algae Grow in Your Fur; Pleasures of Detection, Portraits of Crime; Scientifiction and Fantastica; Writing for Antiquity; Enigmas of Chance; Cthulhiana; The Collective Use and Evolution of Concepts; Minds, Brains, and Neurons; Complexity; Commit a Social Science; Kith and Kin; Philosophy; Networks
Posted by crshalizi at August 31, 2011 23:59 | permanent link
In which we practice working with data frames, and grapple with some of the subtleties of R's system of data types.
Assignment, due at the start of class, Wednesday, 7 September 2011
Posted by crshalizi at August 31, 2011 10:31 | permanent link
Matrices as a special type of array; functions for matrix arithmetic and algebra: multiplication, transpose, determinant, inversion, solving linear systems. Using names to make calculations clearer and safer: resource-allocation mini-example. Lists for combining multiple types of values; accessing sub-lists and individual elements; ways of adding and removing parts of lists. Lists as key-value pairs. Data frames: the data structure for classic tabular data, one column per variable, one row per unit; data frames as hybrids of matrices and lists. Structures of structures: using lists recursively to create complicated objects; example with eigen.
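A compact R sketch of the three structures in this lecture. The numbers and names (the 2x2 system, the `allocation` list, the city populations) are made up for illustration:

```r
# Matrices: arithmetic and solving a linear system.
A <- matrix(c(2, 0, 1, 3), nrow = 2)   # filled column by column
b <- c(5, 9)
x <- solve(A, b)                       # solves A %*% x == b; here x = (1, 3)

# Lists hold values of different types, accessed by name (key-value pairs):
allocation <- list(factory = "A", hours = 40, feasible = TRUE)
allocation$hours                       # 40

# Data frames: one column per variable, one row per unit.
d <- data.frame(city = c("Pittsburgh", "Santa Fe"),
                pop  = c(306, 68))     # populations in thousands, roughly
d$pop                                  # a data frame column is a vector
```

A data frame behaves like a list of equal-length columns (hence `d$pop`) but also like a matrix with one row per case, which is the "hybrid" point above.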
Posted by crshalizi at August 31, 2011 10:30 | permanent link
Introduction to the course: statistical programming for autonomy, honesty, and clarity of thought. The functional programming idea: write code by building functions to transform input data into desired outputs. Basic data types: Booleans, integers, characters, floating-point numbers. Subtleties of floating point numbers. Operators as basic functions. Variables and names. An example with resource allocation. Related pieces of data are bundled into larger objects called data structures. Most basic data structures: vectors. Some vector manipulations. Functions of vectors. Naming of vectors. Continuing the resource-allocation example. Building more complicated data structures on top of vectors. Arrays as a first vector structure.
Posted by crshalizi at August 29, 2011 10:31 | permanent link
Posted by crshalizi at August 29, 2011 10:30 | permanent link
There is a line in my review of Networks, Crowds, and Markets which I feel guilty about:
Nowadays, companies whose sole and explicit purpose is the formalization of social networks have hundreds of millions of active customers. (Although they are not often seen this way, these firms are massive exercises in centrally planned social engineering, inspired by sociological theories.)
The reason I feel guilty about this is that this is not my insight at all, but rather something I owed to a manuscript which Kieran Healy was kind enough to share with me some years ago. I have been bugging Kieran ever since to make it public, and he has now, for unrelated reasons, finally done so:
Kieran's paper is one of the most interesting things I've read about social networks in a long time. Even those of us who are interested in networks from the viewpoint of modeling complex systems should care about it, because it has implications for us: viz., we need to think about how much of what we're modeling reflects engineering decisions made in the South Bay or lower Manhattan, as opposed to (other) social processes. Go read.
Posted by crshalizi at August 25, 2011 09:30 | permanent link
Brad DeLong, contemplating the slide from bad things are happening because people do not act like economic models assume they do to people are irrational and deserve to be punished, remarks, sensibly enough, that "a system that for good outcomes requires that people act in ways people do not do is not a good system — and to blame the people rather than the system is to commit a major intellectual error." Somehow, this made me think of the following, which I offer with all due apologies to Brecht's memory:
Some economist decreed that the people
had lost the market's confidence
and could only regain it with redoubled effort.
If that is the case, would it not be simpler,
If the market simply dissolved the people
And purchased another?
(This is related, of course, to the way that hexapodia is the key insight into neutralizing the Dutch menace.)
Manual trackback: MetaFilter
Posted by crshalizi at August 24, 2011 23:59 | permanent link
Since the semester begins on Monday, I might as well admit to myself that I am, in fact, teaching a new class:
(For tedious reasons, this class has the same number as the data-mining class I've taught previously; that course is now numbered 36-462, and will be taught in the spring by somebody else, while I'll be returning to 36-402, advanced data analysis.)
Posted by crshalizi at August 24, 2011 23:58 | permanent link
I should perhaps add that there is no particular connection among these.
Update, later that day: John Kozak offers the example of French boeuf ("cow") -> English "beef" + "steak" -> French bifteck, and suggests that there are probably many more French -> English -> French loops.
Update, 16 August: The question about (what I learn from Scott Martens are properly called) re-borrowings seems to have struck a chord. Some suggestions from readers follow.
Ádám Tóth offers the chain froc (French) -> "frock" (English) -> frac (French); the mind boggles slightly at the idea of French borrowing an English word for clothing, but apparently so.
Continuing on the French-English-French loop, Matthieu Authier offers French pied de grue, "crane's foot", a drawing of which was apparently used to mark succession in family trees, hence English "pedigree", whence French pedigrée.
Shifting from going back and forth across the English Channel to going back and forth across the North Sea, Marius Nijhuis provides an interesting list for Dutch. I'll quote his e-mail (with permission) at some length:
In these cases the original word is still in use, mostly in roughly its original meaning, next to a reimported version with a clearly different meaning. They all involve French or English or both. German would be the obvious third language to look for loops. But German dialects and Dutch are historically too close, so words move back and forth too easily to get interesting changes in meaning.
- Mannequin, currently used in Dutch for 'runway model'. From 'manneken', still the Flemish word for 'little guy'.
- Boulevard, used in Dutch mostly for a road running parallel to a beach. The French word comes from 'bolwerk', Dutch for bastion.
- Sketch, used in Dutch only for a short comedy act. The English word comes from 'schets', meaning sketch.
- Drugs, used in Dutch only for narcotics. From English, through French 'drogue' meaning 'spices' in those days, from Dutch (or Old Dutch) 'droge waren', meaning 'dry goods' in shipping. Much earlier than 'drugs', 'drogue' had already returned as 'drogist', the Dutch word for 'drug store', from French 'droguiste', 'seller of spices'.
- Etappe, nowadays mostly used in Dutch for a stage in a cycling contest. From French 'etape' in the Tour de France, through a chain of military uses from old French 'estaple', meaning trade depot, in turn derived from Old Dutch 'stapel'. And stapel also moved to English, leading to "staple foods", that returned to Dutch as 'stapelvoedsel'. In this case, 'stapel' has nearly disappeared from Dutch in its original meaning; its current ordinary meaning is simply 'stack'.
- Dock, written with a c, is in modern Dutch a device to put an ipod in. Dok is still the word for the maritime structure.
- Cruise, used in Dutch for a luxury boat trip. From 'kruisen', tacking against the wind in a sailing boat. Cruisecontrol (one word) is the normal Dutch word for the speed control in cars. Cruising as done by aircraft got reattached to 'kruisen', eventually leading to 'kruisraket', cruise missile. That is a curious word Dutch might never have formed on its own, since it can also mean 'cross missile' or 'crotch missile'.
There are actually even more reader e-mails on this subject in the queue, but I don't have permission to quote from them yet.
Posted by crshalizi at August 13, 2011 11:00 | permanent link
Posted by crshalizi at August 04, 2011 12:05 | permanent link
Attention conservation notice: Almost 1000 words of follow-up to a post on an inter-blog dispute, complete with graphs and leaden sarcasm.
Some follow-up to the last post, in response to e-mails, and discussion elsewhere.
Now, continuing with the theme of harmonizing means and ends, if you want capitalism, but you find a state that powerful very scary (as you have every right to do), then you have a problem. You might, on reflection, favor some other economic system which does not require such a powerful state. (This is not a popular option, save among marginal advocates of rural poverty and idiocy.) You might, on reflection, decide that such power is perfectly A-OK, so long as it's used for ends you approve of and there's no danger of the people taking over. (Hence Hayek's anti-democratic political ideas, and viewing Pinochet's reign of terror as less damaging to [what he saw as] liberal values than the British National Health Service.) Or you might try to find ways of taming or domesticating state power, of civilizing it. (I think that has a pretty good track-record, but who knows how long we can keep it up?) What you cannot do, with any intellectual honesty or even hope of getting what you want, is pretend that capitalism can work without a powerful, competent and intrusive state. As Ernest Gellner once wrote, "Political control of economic life is not the consummation of world history, the fulfilment of destiny, or the imposition of righteousness; it is a painful necessity."
[Figure: Data from the St. Louis Federal Reserve Bank's FRED service: GDP from the GDPCA series, population from the POP series. The plot starts at 1952 because that's when the population series does.]
[Figure: Annual exponential growth rates from the previous figure: yearly values (dots) and an 11-year moving average (black line). (The correlation time of the growth rate series is about 3.5 years.)]
At a finer-grained level, you can look at performance over the business cycle, and again see that the new policy regime doesn't deliver any more aggregate growth. It certainly doesn't lead to faster productivity growth (again, despite claims to the contrary). But one thing which has changed is that aggregate growth does a lot less for most people than it used to. Again, you could tell counter-factual stories about how all of these would be much worse without those policies, but by this point you are claiming that your drumming repels not just tigers but also snow leopards, elephants, and Glyptodon.
Manual trackback: Agnostic Liberal
Posted by crshalizi at August 01, 2011 09:50 | permanent link
Attention conservation notice: I have no taste.
What can I offer you, lady?
A fig, perhaps? You are April
and morning and I would line
every street with blueberries
who would tip their tiny crowns
whenever you appeared,
border your life with trumpets
until your shadow was famous,
but I would still be filthy,
and you so starry and upturned,
so yes, perhaps a fig.
Posted by crshalizi at July 31, 2011 23:59 | permanent link
Attention conservation notice: 1900 dry, pedantic, abstract words about "theory of politics", and why it might matter to bringing about progressive political changes, from someone who is completely ineffective at actual politics, and not notably engaged in it either. None of it is at all original, and much of it is painfully obvious. Plus, it was provoked by a squabble among bloggers, and reading anything responding to a literary-controversy flame-war is usually a waste of time.
Some posts by Henry Farrell on "left neoliberalism" and "theory of politics" (1, 2, 3) provoked quite a huge response, which I will not try to catalog. (I will however point to a useful older post by Ben Alpers on the term "neoliberalism", and to Timothy Burke's reaction.) I tried to explain, in the comments at Unfogged, what I thought Henry was trying to say, and for want of other material I'll repost that here, with a few modifications. By way of disclaimer, Henry is a friend and collaborator, and we've spent a lot of time talking about related issues, but what follows is in no way endorsed by him.
With that out of the way: What Henry means when he talks about "a theory of politics" is a theory about how political change (or stasis) happens, not about what political ends are desirable, or just, or legitimate, which is much of what I take "political theory" to be. "What are the processes and mechanisms by which political change happens?" is, at least in part, a separate question from "What would a good polity look like?", and Henry is talking about the former, not the latter. Of course the answer to the first question will tend to be context-dependent, so specialize it to "in contemporary representative democracies", or even "America today" if you like.
The first importance of such a theory is instrumental: if you want to have policies that look like X, a good theory of politics would help you figure out how to achieve X-shaped policies. But the second importance is that the theory might change your evaluation of policies, because it would change your understanding of their effects. The U.S. tax deduction for mortgage interest is arguably economically inefficient, since it promotes buying housing over renting, for no very clear economic rationale. But in so doing it (along with massive government intervention in forming and sustaining the mortgage market, building roads, using zoning to limit the construction of rental property, etc.) helps create a large group of people who are, or think of themselves as, property owners, possessors of substantial capital assets and so with a stake in the system*. If the deduction were, for instance, means-tested, it would not be nearly so effective politically.
Or again, if, for instance, you like material prosperity, you might favor policy X because (you think) it promotes economic efficiency. (Some other time, we can and should have the conversation about "economic efficiency", and the difference between "allocating scarce resources to their most valuable uses" and "allocating resources to meet effective demand", i.e., about the inequity inherent in the market's social welfare function.) But if you are also egalitarian, and policy X would make it easier for a small group of already-privileged people to wield political influence, then you might decide that policy X is not, after all, worth it, because of its inegalitarian political effects. (At a guess, some, but not all**, of DeLong's reaction to Henry's posts is explained by letting X = "Clinton-era financial deregulation".) If you value a certain kind of distribution of political power as such (democracy, aristocracy, the vanguard party, rule by bankers, etc.), a theory of politics would be an important part of how you gauge the value of different policies, at least ones which you think would tend to change how much power different individuals, or groups of individuals, would have.
If you are more or less egalitarian about economic resources and political power, then you will want to see policies that not only contribute to material prosperity, and to distributing that prosperity, but also to making it easier and more feasible for those who are poorer and of lower social status to make their interests felt politically. (Rich, high-status people typically have little trouble on that score. Also, this presumes that interests are not completely homogeneous, but that's OK, because they're not.) Sometimes these goals will reinforce each other, sometimes they will conflict and one will need to make trade-offs. It is hard to make an intelligent trade-off, however, if you do not have any tools for recognizing they exist, or assessing what they are; this, again, is why Henry thinks achieving progressive goals needs a theory of politics.
Now, if I tried to back out a theory of politics from the practice of left neo-liberals, it would something like this: what matters most to the interest of voters is the over-all growth of the economy; as it grows, they will become more prosperous, and reward the political party which implemented those policies. They will also be willing to support unobtrusive welfare-state measures, especially if they look like they are run efficiently and go to the truly deserving, because prosperous people feel generous. So the most important thing is "the economy, stupid", and making sure the voters know who is responsible for good economic times.
I do not want to discount this completely, but, even if they're right about which policies will promote economic growth, it seems oddly naive about how any sort of representative democracy, yoked to capitalism, is going to work. We do indeed have lots of common interests (to give some innocuous ones: not being turned into smears of radioactive glass, not living amid pandemic or endemic communicable illness, having prosperous neighbors, etc.), but we also have diverging interests. Groups or classes of people often have systematically diverging interests. This is because whenever two or more parties have a positive-sum collaborative interaction, there is inevitably a zero-sum struggle over dividing the gains from cooperation. (Voluntary market exchange may be welfare-enhancing for everyone, but whenever you buy something and would still have done so for a penny more, your consumer surplus is the seller's failure of price discrimination.) In this struggle, as in all bargaining games, there is a natural advantage to the side which is already better off. Beyond and beside interests, there are of course also values, which may be unselfish but also diverge.
Capitalism seems to inevitably produce a small number of people who are extremely rich and command considerable economic power; this gives them very distinctive interests. (Often they will also identify themselves with their business enterprises, and their interests as on-going and growing bureaucracies.) Being human, many of them will try using that power to advance those interests and further enrich themselves, by dominating others and by bending the government to their will. (Capitalism needs a very high degree of internal peace and automatic obedience to uniform legal authority — when the courts decide whom disputed property belongs to, or what contracts require, it must stick — to say nothing of physical infrastructure and human resources, and so it always presumes a very powerful state.) They have the resources, and the incentives, to exert influence and to keep doing so. Rich and powerful people can be wrong about the effects of their actions, but when they are not, one should expect a positive feedback, with economic power being used to enhance political power, which in turn is exercised to enhance economic power.
Against this, there are the vast majority of ordinary people, who have their varying interests, but also pretty uniformly have interests which oppose those of the rich and powerful. (Again, they also have interests in common with the rich and powerful.) They are on the receiving, losing end of the feedback between wealth and political influence. Since they have fewer resources than the rich and powerful, it is simply harder for them to get the government to listen, or even to keep track of what it is doing that might affect them. If we want a society which is even close to equal politically and economically --- if we do not want the majestic equality of the law which forbids the rich and poor equally from stealing bread and sleeping under bridges --- then effective counter-vailing power must be organized, which means institutions for collective action. Of course, on the usual Logic of Collective Action grounds, this will be harder for large groups of people with few resources than for small, already advantaged classes...
I would also add --- and this is something Henry and I have been thinking about a lot --- that it is often not at all trivial to figure out what your interests are, or how to achieve them, and that (small-d) democrats should try to find ways to help people work that out. Actually having political clout is often going to depend on collective action, but this needs to be complemented by collective cognition, which is how people figure out what to want and how to achieve it. That, however, is part of a much larger and rather different story, for another time.
All of this can be boiled down to something much shorter (and perhaps should have been at the start): "When you tell us that (1) the important thing is to maximize economic growth, and never mind the distributional consequences because (2) we can always redistribute through progressive taxation and welfare payments, you are assuming a miracle in step 2." For where is the political power to enact that taxation and redistribution, and keep it going, going to come from? A sense of noblesse oblige is too much to hope for (especially given how many of our rich people have taken lots of economics courses), and, for better or worse, voluntary concessions will no longer come from fear of revolution***.
There are I think two reasonable defenses left neoliberals could make. One is to say that creating or strengthening any forms of countervailing power under modern American conditions would itself take a miracle. That goal would be futile and idle, but we could increase economic growth, which would at least benefit some people. The other would be to deny that anyone has a reliable theory of politics, in this sense, certainly none which could be used as a guide to action, and no hope of developing one; whereas we do know a bit about economics. I find neither of these convincing, but I've gone on long enough already. Have a cartoon:
Update: See next post for some follow-up.
*: I owe this argument to my father, lo these many years ago.
**: I esteem DeLong's writings very highly, have learned much from them, and think he is on balance very much a force for good, but there are times when I simply cannot understand how his mind works, and do not particularly want to.
***: It would be fascinating to know to what extent the development and decline of the welfare state tracks, not fears of Communism as such, but fears of other people finding Communism attractive. (By 1980, the USSR was a powerful state, but also an obviously unappealing model.) I have no idea how to study this.
Posted by crshalizi at July 25, 2011 13:20 | permanent link
Attention conservation notice: 1800+ words on yet more academic controversy over networks. (Why should those of us in causal inference have all the fun?) Contains equations, a plug for the work of a friend, and an unsatisfying, "it's more complicated than that" conclusion. Wouldn't you really rather listen to William Burroughs reading "Ah Pook Is Here"?
The following paper appeared a few months ago:
(I will hold my tongue over the philosophy of science in the first sentence of the abstract.)
Liu et al. looked specifically at systems of linear differential equations, with one (scalar) variable per node, and some number of outside control signals. Numbering the nodes/variables from 1 to N, the equation for the ith node is dx_i/dt = Σ_{j ≠ i} a_{ij} x_j(t) + Σ_k b_{ik} u_k(t), where the u_k(t) are the outside control signals.
Following the engineers, we say that the system is controllable if it can be moved from any state vector x to any other state vector x', in a finite time, by applying the proper input signal u(t). (This abstracts from questions about deciding what state to put it in, or for that matter about how we know what state it starts in ["observability"].) Liu et al. asked how the graph --- the pattern of non-zero links between nodes --- affects controllability. It's easy to see that it has to matter some: to give a trivial example, imagine that the nodes form a simple feed-forward chain, x1 -> x2 -> ... -> xN-1 -> xN, only the last of which gets input. This system cannot then be controlled, because there is no way for an input at the last node to alter the state at any earlier one. Liu et al. went through a very ingenious graph-theoretic argument to try to calculate how many distinct inputs such linear networks need, in order to be controlled.
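One standard way to check this notion of controllability is the Kalman rank condition: the pair (A, B) is controllable exactly when the matrix [B, AB, A²B, ..., A^{N-1}B] has rank N. Here is a quick R sketch applying it to the feed-forward chain example (the function and the toy chain are my own, not Liu et al.'s code):

```r
# Kalman rank test: controllable iff [B, AB, ..., A^{N-1}B] has full rank.
controllable <- function(A, B) {
  N <- nrow(A)
  K <- B
  M <- B
  for (i in seq_len(N - 1)) {
    M <- A %*% M          # next power of A applied to B
    K <- cbind(K, M)      # append as new columns
  }
  qr(K)$rank == N
}

N <- 5
A <- matrix(0, N, N)
for (i in 1:(N - 1)) A[i + 1, i] <- 1   # node i feeds node i+1: x1 -> ... -> xN

e_last  <- matrix(c(rep(0, N - 1), 1))  # input attached to the last node
e_first <- matrix(c(1, rep(0, N - 1)))  # input attached to the first node

controllable(A, e_last)    # FALSE: the input can't reach upstream nodes
controllable(A, e_first)   # TRUE: the input propagates down the chain
```

This reproduces the point in the text: where the input attaches to the graph decides everything, and an input at the downstream end of the chain controls nothing above it.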
Their conclusions are telegraphed in their abstract, which however does not play up one of their claims very much: namely, the minimum number of inputs needed is usually, they say, very large, a substantial fraction of the number of nodes in the network. This is, needless to say, bad news for anyone who actually has a dynamical system on a complex network which they want to control.
Before we start making too much of this (I can already imagine the mangled David Brooks rendition, if it hasn't appeared already), it's worth pointing out a slight problem: the Liu et al. result is irrelevant to any real-world network.
Look at the equation for the Liu et al. model: x, the state of the node in question, does not appear on the right-hand side. This means that, in their model, nodes have no internal dynamics --- they change only due to outside forces, otherwise they stay put wherever they happen to be. A more typical linear model, which does allow for internal dynamics, would be dx_i/dt = a_{ii} x_i(t) + Σ_{j ≠ i} a_{ij} x_j(t) + Σ_k b_{ik} u_k(t).
This seems like a very small change, but it has profound consequences for these matters. As Cowan et al. say, one can actually bring this case within the mathematical framework of Liu et al. by treating the internal dynamics of each node as a loop from the node to itself. Doing so has the immediate consequence (Proposition 1 in Cowan et al.) that any directed network could be controlled with only one input signal. To give a very rough analogy, in the Liu et al. model, a node moves only as long as it is being actively pushed on; as soon as the outside force is released, it stops. In the more general situation, nodes can and will move even without outside forcing --- since it's a linear model, the natural motions are combinations of sinusoidal oscillations and exponential return to equilibrium --- and this actually makes it easier to drive the system to a desired state. It is a little surprising that this always reduces the number of input signals needed to 1, but that does indeed follow very directly from Liu et al.'s theorems.
Now, constant readers may have been wondering about why I've not said anything about the linearity assumption. Despite appearances, I actually have nothing against linear models --- some of my best friends use nothing but linear models --- and it seemed perfectly reasonable to me that Liu et al. would work with a linear set-up, at least as a local approximation to the real nonlinear dynamics. Unfortunately, that turns out to be a really bad way to approximate this sort of qualitative property:
As Cowan et al. go on to observe, being controllable is an entirely qualitative property --- it says "there exists a control signal", not "there exists a control signal you could ever hope to apply". There are several ways of quantifying how hard it is to control a technically-controllable system, and this seems unavoidably to depend on much more information than just that provided by the network's degree distribution, or even the full graph of the network. This would be particularly true of nonlinear systems, which of course are most of the interesting ones.
So, to sum up, there were two very striking and interesting claims in the Liu et al. paper: (i) that the degree distribution alone of a network gives us deep insight into a specific aspect of its dynamics, and (ii) that this shows most complex networks are very hard to control. What both the follow-up papers show is that (ii) is wrong, that with this sense of "control", you can, generically, control an arbitrarily complex network by manipulating just a single input signal. But this, together with the recognition that we need to get beyond this very qualitative notion of control, also undermines (i). That to me is rather disappointing. It would have been great if we could have inferred so much from just the degree distribution. (It would have given us a good reason to care about the degree distribution!) Instead we're back to the messy situation where ignoring the network leads us into error, but merely knowing the network doesn't tell us enough to be useful, and non-network details matter. Back, I suppose, to the science.
Aside I may regret later: Barabási really does not have a great track record when it comes to Nature cover-stories, does he? But, if past trends hold good, neither the Cowan et al. nor the Wang et al. paper has any chance of appearing in that journal.
Manual trackback: Resilience Science
Update, 29 July 2011: I should have been clearer above that the paper by Wang et al. is not written as a comment on the original Nature paper, unlike that by Cowan et al.
Update, 30 August 2011: I haven't had a chance to read it, but I thought it only right to note the appearance of "Comment on 'Controllability of Complex Networks with Nonlinear Dynamics'," by Jie Sun, Sean P. Cornelius, William L. Kath, and Adilson E. Motter (arxiv:1108.5739).
Posted by crshalizi at July 13, 2011 19:35 | permanent link
Accidentally left in my drafts folder for two months. I still haven't looked at my student evaluations.
Now that most of the final exams are graded, but before I've gotten to see my student evaluations, it seems like a good time to reflect on the class. Also, I have had enough May wine, with woodruff from my garden, that the prospect of teaching it again next year can be greeted with equanimity.
First, and conditioning everything else, this was by far the largest class I've taught (70 students), and to the extent it went well it's entirely due to my teaching assistants, Gaia Bellone, Shuhei Okumura and Zachary Kurtz. I'd say I couldn't thank them enough, but clearly I'll have to do 30--40% better than that next year, when there will be between 90 and 100 students. (Memo to self: does the university allow me to pay bonuses to TAs in whiskey?)
Posted by crshalizi at July 11, 2011 17:15 | permanent link
Attention conservation notice: 1700-word Q-and-A on technical points of statistical theory, prompted by a tenuous connection to recent academic controversies.
Q: What is a statistical parameter?
A: The fundamental objects in statistical modeling are probability distributions, or random processes. A parameter is a function, or functional, of the distribution: anything we could, in principle, calculate from the distribution itself.
Think of these distributions as being like geometrical figures, and the parameters as various aspects of the figures: their volume, or area in some cross-section, or a certain linear dimension.
Q: So I'm guessing that whether a parameter is "identifiable" has something to do with whether it actually makes a difference to the distribution?
A: Yes, specifically whether it makes a difference to the observable part of the distribution.
Q: How can a probability distribution have observable and unobservable parts?
A: We specify models involving the variables we think are physically (biologically, psychologically, socially...) important. We don't get to measure all of these. Fixing what we can observe, each underlying distribution induces a distribution on the variables we do measure, the observables. In the analogy, we might only get to see the shadows cast by the geometric figures, or see what volume they displace when submerged in water.
Q: And how does this relate to identifiability?
A: Every (measurable) functional of the observable distribution is identifiable. A parameter of the underlying model is identifiable when it corresponds to such a functional, i.e., when any two underlying distributions which differ in that parameter also give different distributions for the observables.
In the analogy, if we know all the figures are boxes (i.e., rectangular prisms), but we only get to see their displacement, then volume is identifiable, but breadth, height and width are not. It is not a matter of not having enough data (not measuring the displacement precisely enough); even knowing box's volume exactly would not, by itself, tell us the height of the box.
Q: Are all identifiable parameters equally easy to estimate?
A: Not at all. For real-valued parameters, the natural quantification of identifiability is the Fisher information, i.e., the negative of the expected second derivative of the log-likelihood with respect to the parameter. (The expected first derivative is, in general, zero.) But this seems like, precisely, a second-order issue after identifiability as such. Of course, if a parameter is unidentifiable, the derivative of the log-likelihood with respect to it is zero. But at this point we are leaving the clear path of identifiability for the thickets of estimation theory, and had better get back on track.
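As a concrete check on the sign convention, here is a minimal numerical sketch (Python, with invented numbers; nothing here comes from any real data set). For a single observation from a Gaussian with unknown mean and unit variance, the log-likelihood is quadratic, so minus its second derivative is exactly 1 whatever the data happen to be, and no expectation is even needed:

```python
def loglik(mu, x):
    # log-likelihood of one draw x from N(mu, 1), constants dropped
    return -0.5 * (x - mu) ** 2

def fisher_info(f, mu, x, h=1e-4):
    # minus the (numerical) second derivative with respect to mu
    return -(f(mu + h, x) - 2.0 * f(mu, x) + f(mu - h, x)) / h ** 2

# For the Gaussian mean this is exactly 1 per observation,
# whatever values mu and x happen to take:
info = fisher_info(loglik, mu=0.3, x=1.7)
```

For a sample of n independent observations the log-likelihoods add, so the information is n, which is why larger Fisher information goes with easier estimation.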
Q: So is identifiability solely a function of what's observable?
A: No, it depends on the combination of what we can measure and what models we're willing to entertain. If we observe more, then we can identify more. Thus if we can measure the volume of a box and its area in horizontal cross-section, then we can identify its height (but not its breadth or width). But likewise, if we can rule out some possibilities a priori, then we can identify more. If we can only measure volume, but know the box is a cube, then we can find height (and all its other dimensions). Of course we could also identify height from volume and the assumption that the proportions are 1:4:9, like the monolith in 2001.
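The box arithmetic can be spelled out in a few lines of Python (a toy illustration with invented dimensions, not data of any kind):

```python
box1 = (2.0, 3.0, 6.0)   # (breadth, height, width) of one box
box2 = (4.0, 1.0, 9.0)   # a quite different box with the same volume

def volume(b, h, w):
    return b * h * w

def cross_section(b, h, w):
    # area of the horizontal cross-section
    return b * w

# Observing volume alone cannot distinguish the two boxes:
same_volume = volume(*box1) == volume(*box2)   # both 36.0

# Observing volume AND cross-section identifies height:
h1 = volume(*box1) / cross_section(*box1)
h2 = volume(*box2) / cross_section(*box2)
```

Here `h1` and `h2` recover each box's true height, while breadth and width remain unidentified even with both measurements.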
Q: I get why expanding the observables lets you identify more parameters, but restricting the set of models to get identification seems to have "all the benefits of theft over honest toil". Do people really report such results with a straight face?
A: Identifying parameters by restricting the models we entertain is just as secure as those restrictions. If we have good actual reasons for the restrictions, then it would be silly not to take advantage of that. On the other hand, restricting models simply to get identifiability seems quite contrary to goals of science, since it is as important to admit what we do not yet know as to mark out what we do. At the very least, these are the sorts of hypotheses which need to be checked — and which must be checked with other or different data, since, by non-identifiability, the data in question are silent about them. (If you are going to assume all boxes are cubes, you should check that; but looking at their volumes won't tell you whether or not they are cubes. That data is indifferent between your sensible cubical hypothesis and the idle fancies of the monolith-maniac.)
Q: Couldn't we get around non-identifiability by Bayesian methods?
A: Expressing "soft" restrictions by a prior distribution about the unidentified parameters doesn't actually make those parameters identified. Suppose, for instance, that you have a prior distribution over the dimensions of boxes, p(B,H,W). The three parameters B,H,W completely characterize boxes, and in this are equivalent to the three parameters of volume V = BHW and the two proportions or ratios h = H/B and w = W/B. Thus the prior p(B,H,W) is equivalent to an unconditional prior on volume multiplied by a conditional prior on the proportions, p(V) p(h, w|V). Since the likelihood is a function of V alone, Bayesian updating will change the posterior distribution over volumes, but leave the (volume-conditional) distribution over proportions alone. This reasoning applies more generally: the prior can be divided into one part which refers to the identifiable parameters, and another which refers to the purely-unidentifiable parameters, and learning only updates the former. (If a Bayesian agent's prior prejudices happen to link the identified parameters to the unidentified ones, its convictions about the latter will change, but strictly through those prior prejudices.) The prior over the identifiable parameters can and should be tested; that over the unidentified ones cannot. (Not with that data, anyway.)
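The factorization argument can be checked numerically on a small discrete grid (a sketch with made-up numbers; rows index the identifiable "volume" part, columns the unidentified "proportion" part):

```python
import numpy as np

rng = np.random.default_rng(0)

V = 3   # number of possible volume values
H = 3   # number of possible proportion values
prior = rng.random((V, H))
prior /= prior.sum()                   # joint prior p(V, h)

lik = np.array([0.2, 0.5, 0.3])        # likelihood, a function of V ONLY

posterior = prior * lik[:, None]       # Bayes' rule, up to normalization
posterior /= posterior.sum()

# conditional distribution p(h | V), before and after updating:
cond_prior = prior / prior.sum(axis=1, keepdims=True)
cond_post = posterior / posterior.sum(axis=1, keepdims=True)

# marginal distribution over V, before and after updating:
marg_prior = prior.sum(axis=1)
marg_post = posterior.sum(axis=1)
```

The conditional over the unidentified part comes out of the update untouched, while the marginal over the identified part genuinely moves.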
Q: If a parameter is unidentified, why bother with it at all? Why not just use Occam's Razor to shave it away?
A: That seems like an excess of positivism. (And I say this as someone who is sympathetic to positivism.) After all, which parameters are identifiable depends on what we can observe. It seems excessive to regard boxes as one-dimensional when we can only measure displaced volume, but then three-dimensional when we figure out how to use a ruler.
Q: Still, shouldn't there be a presumption against the existence or importance of unidentifiable parameters?
A: Not at all. It is very common in politics to simultaneously assert that the electorate leans towards certain parties in certain years; that people born in certain years have certain inclinations; and that people's political inclinations go through a certain sequence as they age. If we admit all three kinds of processes, we have to try to separate the effects on political opinions of people's age, the year they were born (their cohort), and the year in which they are voting (the period). But since the period is exactly the cohort plus the age, the three effects cannot all be identified from such data: any pattern attributed to one of them can be re-attributed to a combination of the other two.
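The age-period-cohort bind shows up mechanically as an exact collinearity in the design matrix. A toy check (invented years, no real survey data):

```python
import numpy as np

cohort = np.array([1950, 1960, 1970, 1950, 1960, 1970], dtype=float)
period = np.array([2000, 2000, 2000, 2010, 2010, 2010], dtype=float)
age = period - cohort        # age is exactly period minus cohort

# Design matrix with intercept, age, period, and cohort effects:
X = np.column_stack([np.ones_like(age), age, period, cohort])
rank = np.linalg.matrix_rank(X)
```

The four columns have rank three, so any regression trying to estimate all three effects plus an intercept has a one-dimensional family of equally good coefficient vectors.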
Q: I fail to see how this isn't actually an example in favor of my position — people think these are three different effects, but they're just wrong.
A: We can break this sort of impasse by specifying more detailed mechanisms (and hoping we get more data). For instance, suppose that people tend to become more politically conservative as they age, but that this is because they accumulate more property as they grow older. Then, with data on property holdings, we could separate the effects of cohort (were you born in 1967?) and age (are you 45?) from period (are you voting in 2012?), because aging influences political opinions not through a mysterious black box but through an observable mechanism. Or again, there are presumably mechanisms which lead to period effects, as in Hibbs's "Bread and Peace" election model. (Even if that model is wrong, it illustrates the kind of way a more elaborate theory can bring evidence to bear on otherwise-unidentifiable questions.) Of course these more elaborated, mechanistic theories need to be checked themselves, but that's science.
Q: So, what does all this have to do with the social-contagion debate?
A: What Andrew Thomas and I showed is that the distinction between the effects of homophily and those of social influence or contagion is unidentifiable in observational (as opposed to experimental) data. This, to my way of thinking, is a much more consequential problem for claims that such-and-such a trait is socially contagious than doubts about whether this-or-the-other significance test was really appropriate; it says that the observational data was all irrelevant to begin with. Instead, trying to attribute shares of the similarity between social-network neighbors to influence vs. pre-existing similarity is just like trying to say how much of the volume of a box is due to its height as opposed to its width — it's not really a question data could answer. It could be that we could use other evidence to show that most boxes are cubes, but that's a separate question. No amount of empirical evidence about the degree of similarity between network neighbors can tell us anything about whether the similarity comes from homophily or influence, just as no amount of measuring the volume of boxes can tell us about their proportions.
Q: Mightn't there be assumptions about how social influence works, or how social networks form, which let us estimate the relative strengths of social contagion and homophily?
A: There might indeed be; we hope to find them, and to find external checks on such assumptions. Discovering such cross-checks would be like finding ways of measuring the volume of a geometrical body and its horizontal cross-section. Andrew and I talk about some possibilities towards the end of our paper, and we're working on them. So I'm sure are others.
Q: I find your ideas intriguing; how may I subscribe to your newsletter?
A: For more, see Partial Identification of Parametric Statistical Models; my review of Manski on identification for prediction and decision; and Manski's book itself.
Posted by crshalizi at July 11, 2011 13:27 | permanent link
Posted by crshalizi at July 06, 2011 15:44 | permanent link
Attention conservation notice: I have no taste.
Posted by crshalizi at June 30, 2011 23:59 | permanent link
Three papers have appeared recently, critiquing methods which people have been using to try to establish social influence or social contagion: "The Spread of Evidence-Poor Medicine through Flawed Social Network Analysis" (arxiv:1007.2876) by Russell Lyons; "The Unfriending Problem" by Hans Noel and Brendan Nyhan (arxiv:1009.3243); and "Homophily and Contagion Are Generically Confounded in Observational Social Network Studies" (arxiv:1004.4704, blogged about here) by Andrew Thomas and myself. All three were of course inspired by the works of Nicholas Christakis, James Fowler and collaborators. This has led to a certain amount of chatter online, including rash statements about how social influence may not exist after all. That last is silly: to revert to my favorite example of accent, there is a reason that my Pittsburgh-raised neighbors say "yard" differently than my friends from Cambridge, and it's not the difference between drinking from the Monongahela rather than the Charles. Similarly, the reason my first impulse when faced with a causal inference problem is to write out a graphical model and block indirect paths, rather than tattooing counterfactual numbers in invisible ink on my experimental subjects, is the influence of my teachers. (Said differently: culture happens.) So, since we know social influence exists and matters, the question is how best to study it.
Fortunately, one consequence of this recent outbreak of drama is a very long and thoughtful message from Tom Snijders to the SOCNET mailing list. Since there is a public archive, I do not think it is out of line to quote parts of it, though I would recommend anyone interested in the subject to (as the saying goes) read the whole thing:
What struck me most in the paper by Lyons ... are the following two points. The argument for social influence proposed by Christakis and Fowler (C&F) that earlier I used to find most impressive, i.e., the greater effect of incoming than of outgoing ties, was countered: the difference is not significant and there are other interpretations of such a difference, if it exists; and the model used for analysis is itself not coherent. This implies that C&F's claims of having found evidence for social influence on several outcome variables, which they already had toned down to some extent after earlier criticism, have to be still further attenuated. However, they do deserve a lot of credit for having put this topic on the agenda in an imaginative and innovative way. Science advances through trial and error and through discussion. Bravo for the imagination and braveness of Nick Christakis and James Fowler.
...Our everyday experience is that social influence is a strong and basic aspect of our social life. Economists have found it necessary to find proof of this through experimental means, arguing (Manski) that other proofs are impossible. Sociologists tend to take its existence for granted and are inclined to study the "how" rather than the "whether". The arguments for the confoundedness of influence and homophilous selection of social influence (Shalizi & Thomas Section 2.1) seem irrefutable. Studying social influence experimentally, so that homophily can be ruled out by design, therefore is very important and Sinan Aral has listed in his message a couple of great contributions made by him and others in this domain. However, I believe that we should not restrict ourselves here to experiments. Humans (but I do not wish to exclude animals or corporate actors) are purposive, wish to influence and to be influenced, and much of what we do is related to achieve positions in networks that enable us to influence and to be influenced in ways that seem desirable to us. Selecting our ties to others, changing our behaviour, and attempting to have an influence on what others do, all are inseparable parts of our daily life, and also of our attempts to be who we wish to be. This cannot be studied by experimental assignment of ties or of exchanges alone: such a restriction would amount to throwing away the child (purposeful selection of ties) with the bathwater (strict requirements of causal inference).
The logical consequence of this is that we are stuck with imperfect methods. Lyons argues as though only perfect methods are acceptable, and while applauding such lofty ideals I still believe that we should accept imperfection, in life as in science. Progress is made by discussion and improvement of imperfections, not by their eradication.
A weakness and limitation of the methods used by C&F for analysing social influence in the Framingham data was that, to say it briefly, these were methods and not generative models. Their methods had the aim to be sensitive to outcomes that would be unlikely if there were no influence at all (a sensitivity refuted by Lyons), but they did not propose credible models expressing the operation of influence and that could be used, e.g., to simulate influence processes. The telltale sign that their methods did not use generative models is that in their models for analysis the egos are independent, after conditioning on current and lagged covariates; whereas the definition of social influence is that individuals are not independent....
Snijders goes on, very properly, to talk about the models he and his collaborators have been developing for quite a few years now (e.g.), which can separate influence from homophily under certain assumptions, and to aptly cite Fisher's dictum that the way to get causal conclusions from observational studies is to "Make your theories elaborate" --- not give up. Lyons's counsels of perfection and despair are "words of a knight riding in shining armour high above the fray, not of somebody who honours the muddy boots of the practical researcher". (Again, if this sounds interesting, read the full message.) I agree with pretty much everything Snijders says, but feel like adding a few extra points.
Posted by crshalizi at June 29, 2011 13:24 | permanent link
Within a year, Kanazawa will have a fellowship at the American Enterprise Institute (where he'll fit right in); he will not have learned anything about factor models, or data analysis, or indeed anything else. I suspect he will also have a book under way about how the politically correct hordes drove him from England, in which he will compare himself to Galileo, but that is not so securely supported by my model.
Oh, and as for Henry's point, I feel like I should offer a back-handed defense of evolutionary psychology. It's true that a field where Kanazawa could get away with so much for so long is nothing to be proud of, but it's not at all clear that evolutionary psychology is actually worse in this regard than other branches of psychology, in some of which the mistakes are much more pernicious and much more entrenched. Or that psychology is any worse than other fields; I will plug, once again, Hamilton's The Social Misconstruction of Reality: Validity and Verification in the Scholarly Community. (Nonetheless I would not be surprised if standards really were lower in evolutionary psychology than elsewhere.)
Posted by crshalizi at June 03, 2011 22:01 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Pleasures of Detection, Portraits of Crime; The Dismal Science; Writing for Antiquity; Networks; Power Laws; Complexity; The Beloved Republic Commit a Social Science; The Progressive Forces; The Collective Use and Evolution of Concepts
Posted by crshalizi at May 31, 2011 23:59 | permanent link
Why I am so unresponsive lately:
Working on both of these undertakings reminds me uncomfortably of the gambler making sure the dice know how much he desires a new pair of shoes, or rather, a shot at tenure down the road.
Update, 3 June: My laptop entered a cataleptic state as of noon on the 31st. I still managed to submit the six papers with co-authors. Blogging will continue to be sparse in the immediate future.
Posted by crshalizi at May 29, 2011 10:30 | permanent link
Attention conservation notice: 3200 words on a silly academic paper about popular music and narcissism. Contains complaints about bad data analysis, firm statements about writing poetry from someone who can't, and largely unsupported gloomy reflections about the condition of the house of intellect.
Let me begin with a quotation from one of my favorite books:
A good many years ago a neighbour whose sex chivalry forbids me to disclose exclaimed upon learning of my interest in philosophy: `Don't you just adore Pluto's Republic?'
Pluto's Republic has remained in my mind ever since as a superlatively apt description of [the] intellectual underworld.... We each populate Pluto's Republic according to our own prejudices: for me its most prominent citizens are IQ psychologists.... Other prominent citizens include all practitioners of `scientism', especially those who apply what they mistakenly believe to be the methods of science to the investigation of matters upon which science has no bearing whatsoever...
— Peter Medawar, Pluto's Republic, p. 1
I have no taste, and so a large part of my reading consists of what is frankly mind candy, and a large part of the candy consists of mystery series in which murders are solved by amateur sleuths. It is part of the norms of this genre or tradition that the heroine (and it is almost always a heroine) is a not-too-old woman who pursues a more or less genteel occupation in a small town or not-too-large city, which has an (unremarked-on) rate of violent death comparable to post-invasion Iraq. It is equally a norm of the genre that the novels are narrated in the first person (singular). Two of my other addictions are secondary-world fantasies and space opera science fiction, which use first-person narration far more sparingly*. While one might come up with functional-rhetorical rationales for this contrast (perhaps based on some of the experiments in Bortolussi and Dixon), the proximate explanation is simply genre norms. To argue on this basis that writers, or readers, of amateur-sleuth mysteries are less narcissistic than writers or readers of space opera and lap-breaker fantasies would be stupid.
More specifically, it would be to ignore the fact that, while mind-candy genre novels are perhaps very humble works of art, they are works of art, and, as the poet says, all art is artifice. They are things which are made to achieve certain ends (which may be vague), employing skills and traditions and what one might call internal norms. Even when writers pour their hearts out on to the page, to treat works of art as direct, unmediated expressions of their makers' personalities trembles on the border between utter philistinism and not-safe-to-be-let-outdoors-without-grownup-supervision naivete. And this is true not just of popular novels, but also of poems which demand musical accompaniment and take about three minutes to recite, bringing us to today's reading.
I want to add a few remarks to what Mark Liberman has already said ("Lyrical Narcissism?", "Vampirical Hypotheses", "Pop-culture narcissism again"), first about the methodological inadequacies, then about the statistics, and finally on the larger lessons.
The empirical basis for inferring narcissism from using first person singular pronouns appears to be Robert Raskin and Robert Shaw, "Narcissism and the Use of Personal Pronouns", Journal of Personality 56 (1988): 393--404. This shows that, over twenty years ago, there was a modest positive correlation (+0.26) between scores on a quiz intended to measure narcissism, and how often 48 UC Santa Cruz undergrads used first-person singular pronouns in extemporized five minute monologues. Top 100 songs are not spontaneous monologues by undergrads looking for a painless way to get $5 and/or check off a Psych. 1 requirement, and DeWall et al. offer no evidence that this correlation generalizes to any other context. In particular they offer no reason to think that differences over time, as language and culture changes, should be explained in the same way as these differences across people, at a single time and in a single school.
Let me sketch an analogy. You can measure the height of a building from the length of its shadow, using trigonometry. If you gather a data set of many building heights and shadow lengths taken at nearly the same time of day on the same day of the year, there will in fact be an excellent correlation between the two, and a genuinely linear relationship. (Indeed, the only reason the correlation would be even slightly less than 1 would be measurement noise.) But the relationship between the height of buildings and the length of their shadows depends on where the sun is in the sky. At a different time of day or a different day of the year, you will get a different linear relationship. If you just plug in to your formula blindly, you will get bad estimates of the height. If you were a morning person, and precisely operationalized your initial data as "length of shadow to the west of the building", you would get negative estimated heights in the afternoon, when shadows point to the east. (Sure, it's counterintuitive that buildings are actually sunk below the ground, but are you going to argue with the numbers?) On cloudy days, whatever you measured in place of shadows would just be noise.
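The trigonometry can be made concrete (a toy computation with a made-up shadow length; elevations are illustrative):

```python
import math

def height_from_shadow(shadow, sun_elevation_deg):
    # height = shadow length * tan(sun's elevation above the horizon)
    return shadow * math.tan(math.radians(sun_elevation_deg))

shadow = 20.0                              # meters, an invented measurement
morning = height_from_shadow(shadow, 30)   # sun low in the sky
noon = height_from_shadow(shadow, 60)      # sun high in the sky
```

The same 20-meter shadow implies a building three times taller at noon than in the morning: the "conversion factor" between shadow and height is a fact about the circumstances of measurement, not about buildings.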
To draw the moral explicitly, even if there is such a thing as a one-dimensional personality trait of narcissism**, and even if that was correlated with pronoun use in one particular historical population, in one particular social/rhetorical context, that tells us nothing at all about the correlation in other situations. I don't assert that it can't be true, but there is no psychological or statistical reason to presume that it is true, and so it needs to be established. In more psychological terms, thinking otherwise is not so much slipping into the fundamental attribution error as wallowing in it.
Composing a popular song is not coming up with a five-minute off-the-cuff monologue. Lyrics are in fact composed. They are deliberately made to achieve certain effects on the audience, including meshing in certain ways with the music (which is also being composed), they are stylized, and their composition is guided by inherited traditions and formulas of the genre and by individual habits of writing. Those guides and constraints are at once cognitive — it is computationally necessary to cut down the search space (see e.g. Lord or Simon or Boden) — and aesthetic — they are norms (see e.g. Wellek and Warren). The persona of the song or poem is not the personality of the song-writer or poet. (David Byrne is not actually a psycho-killer.) This is true no matter how strong the emotions which motivate the song-writer are, or how lacking the writer may be in self-conscious artistry.
Commercially successful popular songs are artistic compositions which have been filtered through a rather byzantine industry of gate-keepers and intermediaries. The songs which survive this filtration must then be bought by many thousands of people, for their own reasons. The song might succeed by appealing to a single very popular taste; or simultaneously appealing to many different tastes; or, indeed, merely by already being popular.
If the question is whether musicians have become more narcissistic, we need to ask whether more narcissistic musicians compose songs which use first person singular pronouns more often, and, if so, whether this signal survives the filtering process of the music industry. If the question is whether audiences have become more narcissistic, we need to ask whether more narcissistic people prefer songs which use such pronouns more often, and, if so, whether this signal survives the filtering process of the music industry. (Anyone who thinks individual preferences simply translate into aggregate outcomes has simply not been paying attention, and for a very long time at that.) We are so far from the laboratory situation of Raskin and Shaw that it's not even funny.
Let me turn to more specific weaknesses in the logic of the paper.
So, to sum up, we have basically no reason to think that changes in the use of first-person singular pronouns measure changes in narcissism (certainly not over time or in this context), and a slew of alternative explanations for any changes which might be found, other than "Americans are becoming narcissistic and this is reflected in their popular songs". One might, perhaps, write these off as the excessive scruples which come from over-indulging in skeptical philosophy. Let's have the courage to assume away all these inconvenient possibilities, and look at what the data show.
The centerpiece Figure 1 from DeWall et al. goes like so:
Many journalists seem to have found this very convincing. Fortunately, however, DeWall et al. also provide a table with the mean and standard deviation of the first person pronoun use for each year, and a 95% confidence interval. (They don't say how they calculated the latter, but I'll take them at their word and presume they did that properly.) This lets me plot the actual data, which looks like this:
(My code, in R.) The black dots, joined by lines to guide the eye, are the actual percentages. The dashed lines are the 95% confidence limits. The horizontal grey line is the over-all mean percentage, over the whole data set. The two colored lines are two smoothing spline fits, one with (purple) and one without (blue) giving extra weight to years with smaller standard deviations. Making the smoothing splines requires a little knowledge of statistics; everything else just needs the ability to draw the numbers DeWall et al. provide.
The flat horizontal line is inside the confidence limits in 27 of the 28 years. This is exactly what we would expect if there was no signal here whatsoever, and all fluctuations from year to year were just noise****. (95% coverage per year and 28 years yields 1.4 expected non-coverage events.) There is nothing here to explain; the appearance that there is something in their Figure 1 is one part bad data analysis to one part How to Lie with Statistics-level bad graphing*****.
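The arithmetic behind "1.4 expected non-coverage events" is just a binomial calculation, assuming independent 95% intervals in each of the 28 years:

```python
from math import comb

n, p = 28, 0.05
expected_misses = n * p     # expected number of years missing the flat line

# probability of at most one miss (what was actually observed),
# under the null of a constant true mean and independent intervals:
prob_at_most_one = sum(
    comb(n, k) * p ** k * (1 - p) ** (n - k) for k in (0, 1)
)
```

Seeing exactly one year outside the limits is, if anything, slightly fewer misses than chance alone would produce.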
While perhaps not a truly epic fail, this is not a creditable performance. The paper probes a hugely complex tangle of issues relating individual minds, communication, social norms, artistic expression, social change and cultural transformation. There is no shame in not unraveling the whole snarl at once, but between the incompetent data analysis, the failure of logical imagination, and the deep misunderstanding of how works of art are made and used, it does nothing to advance our knowledge of anything. I am not sure which is more needed here, remedial reading in Richard Berk and Denny Borsboom, or in Wellek and Warren and Erving Goffman, but they need something. Psychology of Aesthetics, Creativity, and the Arts does not seem to be a very highly ranked journal within psychology, but the authors of this paper have plenty of papers elsewhere, so this says something about the intellectual standards of the discipline.
Looking at the reception of the paper (see Liberman, again, for linkage), one finds dreary moralizing about how kids these days are selfish brutes and nobody makes decent music any more, given an unearned air of authority by the pretense to science. It should not, by this point, come as a surprise that many science journalists and pundits lack the numeracy, imagination and skepticism to avoid being taken in by such foolishness.
Public trust in scientists — that we generally know what we are talking about — is an extremely valuable resource for the scientific community. It is, I think, ultimately why people are willing to devote such vast resources to the scientific enterprise, to letting us gratify our curiosity. This trust has been painfully built up over many long years and generations and even centuries, by, among other things, taking great pains to be trustworthy. This trust is even a valuable resource for the public, when it is not misplaced. The more I see of this kind of thing, the more I wonder how well-founded that trust really is. This specific myth — that it has been scientifically proven that pop songs reflect increasing American narcissism — will persist as a minor vampirical hypothesis, occasionally draining the blood from graduate students in psychology. This kind of pointless myth-making and perversion of science will continue as long as the implicit goal of our institutions for cultivating knowledge is in fact to realize Pluto's Republic.
Update, later that day: Mark Liberman points out, by e-mail, that the famous 23rd psalm ("The Lord is my shepherd; I shall not want") clocks in at 14.3% first person pronouns in the King James Version, above the DeWall et al. confidence limits for all but seven years. I would add that "Rock of Ages" is a lower but still well-above-average 13.3%. On the other hand, "Rock of Ages" by Def Leppard (a top 100 song in 1983, and so part of the data) is between 4.6% and 6.2% first person singular pronouns (depending on how you want to count "gimme"). Clearly, the only thing saving American popular culture from epidemic narcissism in the early 1980s was preferring heavy metal to hymns.
Update, next day: John Emerson points out that "Like a Rolling Stone" contains plenty of instances of second person singular pronouns, and no first person singular pronouns, fully consistent with Bob Dylan's famed selflessness.
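For anyone who wants to replicate this sort of pronoun arithmetic on other lyrics, the calculation is a few lines of code. The word list below is a rough stand-in for a LIWC-style first-person-singular category, not necessarily the one DeWall et al. used, and (as the "gimme" example shows) the contraction decisions matter:

```python
import re

# First-person-singular pronouns: a rough stand-in for the LIWC "I"
# category; the exact list DeWall et al. used is an assumption here.
FPS = {"i", "me", "my", "mine", "myself", "i'm", "i've", "i'll", "i'd", "gimme"}

def fps_fraction(text):
    """Fraction of word tokens that are first-person-singular pronouns."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in FPS for w in words) / len(words) if words else 0.0

line = "The Lord is my shepherd; I shall not want"
print(fps_fraction(line))   # 2 of 9 tokens, about 0.22
```

Run on whole lyrics or psalms, this reproduces the kinds of percentages quoted above, modulo those counting decisions.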
*: Cue fannish nit-picking.
**: My skepticism about the "constructs" of correlational psychology is not limited to IQ/g, but that's another story for another time. For the present, I am willing to stipulate that narcissism exists and can be measured by the psychometric instruments which purport to do so.
***: Hole one: their four genre categories are incredibly crude, and have huge amounts of internal diversity. (Likewise, the genre norms of amateur-sleuth mysteries are rather different from private-eye detective stories, police procedurals, and serial-killer thrillers. Calling them all "mysteries", with one dummy variable, would not answer.) Hole two: "controlling for" variables this way only gets you an all-else-being-equal prediction if the regression model is actually well specified, which they hadn't the wit to check. Hole three: the counterfactual issue. (This is the only even slightly tricky one.) We have a certain distribution of the regressor variables in the training data, and so certain correlations among them. These correlations mean that each regressor can, to some extent, be linearly predicted from the others. The regression coefficients are basically the correlation between the response and the distinct, linearly-unpredictable part of each regressor. This means that when you change the distribution of regressors, the regression coefficients will, in general, change too. The regression coefficients can only be used to answer counterfactual questions ("what would the proportion of first-person pronouns be, if genre composition had stayed constant?") under very special assumptions, which we have no reason to think hold here. (See the notes for lectures 2, 22 and 23 for more.)
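A toy simulation makes the point of hole three concrete. The quadratic mechanism and all the numbers here are invented for illustration; the moral is only that, under misspecification, the fitted coefficient tracks where the regressors happen to sit:

```python
import numpy as np

rng = np.random.default_rng(42)

def ols_slope(x, y):
    """OLS slope of y on x, with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

n = 100_000
# The true relation is quadratic, but we insist on fitting a line
x1 = rng.uniform(0, 1, n)
x2 = rng.uniform(1, 2, n)   # same mechanism, different regressor distribution
s1 = ols_slope(x1, x1 ** 2 + rng.normal(0, 0.1, n))
s2 = ols_slope(x2, x2 ** 2 + rng.normal(0, 0.1, n))
print(s1, s2)   # about 1 and about 3: the "effect" depends on the x's
```

Same causal mechanism both times; very different "effects of x", because the coefficient is a property of the joint distribution, not of the mechanism.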
****: More exactly, this is what we would expect if the causes producing year-to-year shifts were so many, so various, and had such hard-to-describe inter-relations with each other that they cannot be effectively compressed or predicted from the past of the time series, and must simply be described in all their unique historical detail. As I tell my undergrads, "any signal distinguishable from noise is insufficiently complicated". If you want the full technical version of this idea, read Li and Vitanyi.
*****: Abbreviated scale on the vertical axis, visually exaggerating the change; inappropriate use of a linear model, which is guaranteed to give the impression of a steady and relentlessly one-directional march; no indication of uncertainty. I also can't figure out why they binned the values on the horizontal axis.
Posted by crshalizi at May 08, 2011 13:15 | permanent link
Attention conservation notice: I have no taste.
Books to Read While the Algae Grow in Your Fur; Scientifiction and Fantastica; Writing for Antiquity; The Commonwealth of Letters; The Progressive Forces; Islam; The Dismal Science; Minds, Brains and Neurons; Pleasures of Detection, Portraits of Crime
Posted by crshalizi at April 30, 2011 23:59 | permanent link
Like many people, when I was a student, I used to have nightmares where I found I had to take a final exam in a class I did not remember being in. (One particularly vivid one also involved discovering a new building on the Berkeley campus, between Evans Hall and Le Conte.) When I became a teacher, this flipped around, and I had sporadic nightmares where I was giving an examination in a class I didn't remember teaching. Thankfully, it's been several years since that one last visited me.
After watching this xtranormal movie from my friend Cris Moore, I suspect that I will have nightmares about examining students in subjects they did not know they were taking:
I hasten to add that, to the best of my knowledge, nothing like this has ever happened in this department.
Posted by crshalizi at April 29, 2011 15:49 | permanent link
DeLong asks for the best response to "My City Was Gone", itself a response to "Mountains beyond Mountains". Since the content of the game has clearly moved from "sprawl" to "one-upmanship", I claim that the best response is in fact "Stadium Love":
I actually prefer the official video, but concert footage seems to be an implicit rule of the game.
Had the conversation not strayed, "Nothing but Flowers", or "The Big Country" would have been admissible. (I am vexed by the fact that I cannot instantly call up high-quality video recordings of these songs performed when first released, several decades ago. Where's
Manual trackback: Grasping Reality with $Numerosity $Appendages
Posted by crshalizi at April 27, 2011 20:00 | permanent link
Very constant readers may recall having seen this line of research at various points down the years, most recently in "Our New Filtering Techniques Are Unstoppable!". Georg's goal is to make those methods work for continuous-valued fields, which was not needed for studying cellular automata but will be very handy for data analysis, and where he already has some preliminary results. Beyond that, the goal is to develop the statistical theory which would go along with it and let us get things like confidence intervals on statistical complexity.
I can say without any shame that I was quite pleased with Georg's presentation, because I really had no part in making it; all the credit goes to him in the first place, and to the advice provided by Larry, Chris Genovese, Cris Moore and Chad Schafer. Based on this experience, and Georg's publication record, I imagine he will have all the problems polished off by the NIPS deadline, with a monograph or two to follow by the end of the summer.
I will, however, try not to read any omens into my first Austrian student commencing a dissertation on automatic pattern discovery on the day Skynet declares war on humanity.
Posted by crshalizi at April 21, 2011 15:00 | permanent link
How do we get our causal graph? Comparing rival DAGs by testing selected conditional independence relations (or dependencies). The crucial difference between common causes and common effects. Identifying colliders, and using them to orient arrows. Inducing orientation to enforce consistency. The SGS algorithm for discovering causal graphs; why it works. Refinements of the SGS algorithm (the PC algorithm). What about latent variables? Software: TETRAD and pcalg. Limits to observational causal discovery: universal consistency is possible (and achieved), but uniform consistency is not.
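The collider trick is easy to see in simulation. This is not the SGS or PC algorithm itself (use TETRAD or pcalg for that), just a sketch of the conditional-(in)dependence pattern those algorithms exploit to orient arrows:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
# X -> Z <- Y: X and Y are independent causes of the collider Z
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + 0.5 * rng.normal(size=n)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly regressing out c."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(ra, rb)[0, 1]

marginal = np.corrcoef(x, y)[0, 1]   # near zero: X and Y are independent
collided = partial_corr(x, y, z)     # strongly negative: conditioning on
                                     # the collider induces dependence
print(marginal, collided)
```

A common cause would give the opposite signature (dependent marginally, independent after conditioning), which is exactly what lets constraint-based algorithms tell the two structures apart.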
Posted by crshalizi at April 21, 2011 12:04 | permanent link
Reprise of causal effects vs. probabilistic conditioning. "Why think, when you can do the experiment?" Experimentation by controlling everything (Galileo) and by randomizing (Fisher). Confounding and identifiability. The back-door criterion for identifying causal effects: condition on covariates which block undesired paths. The front-door criterion for identification: find isolated and exhaustive causal mechanisms. Deciding how many black boxes to open up. Instrumental variables for identification: finding some exogenous source of variation and tracing its effects. Critique of instrumental variables: vital role of theory, its fragility, consequences of weak instruments. Irremovable confounding: an example with the detection of social influence; the possibility of bounding unidentifiable effects. Matching and propensity scores as computational short-cuts in back-door adjustment. Summary recommendations for identifying and estimating causal effects.
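A minimal synthetic illustration of back-door adjustment; the confounder, treatment, and coefficients here are all made up, and real problems are of course not handed to you with the confounder labeled:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
w = rng.normal(size=n)                        # confounder: W -> X and W -> Y
x = w + rng.normal(size=n)                    # "treatment"
y = 2.0 * x + 3.0 * w + rng.normal(size=n)    # true causal effect of X is 2

def ols(design, resp):
    b, *_ = np.linalg.lstsq(design, resp, rcond=None)
    return b

naive = ols(np.column_stack([np.ones(n), x]), y)[1]        # confounded
adjusted = ols(np.column_stack([np.ones(n), x, w]), y)[1]  # conditions on W,
                                                           # blocking the back door
print(naive, adjusted)   # naive is biased upward; adjusted is close to 2
```

Here linear regression does the conditioning only because the toy model really is linear; the back-door criterion itself says which variables to condition on, not how.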
Posted by crshalizi at April 21, 2011 12:03 | permanent link
Statistical dependence, counterfactuals, causation. Probabilistic prediction (selecting a sub-ensemble) vs. causal prediction (generating a new ensemble). Graphical causal models, structural equation models. The causal Markov property. Faithfulness. Counterfactual prediction by "surgery" on causal graphical models. The d-separation criterion. Path diagram rules.
Posted by crshalizi at April 21, 2011 12:02 | permanent link
The second exam: in which we attempt to discover structure in ten-dimensional data of unknown origin.
Posted by crshalizi at April 21, 2011 12:01 | permanent link
In which we learn how to read and use diagrams full of circles and arrows and a paragraph on the back explaining what each one is.
Posted by crshalizi at April 21, 2011 12:00 | permanent link
In which we practice the art of principal components analysis on currency exchange rates, in the process discovering the Pacific and the Americas.
Posted by crshalizi at April 09, 2011 23:53 | permanent link
Conditional independence and dependence properties in factor models. The generalization to graphical models. Directed acyclic graphs. DAG models. Factor, mixture, and Markov models as DAGs. The graphical Markov property. Reading conditional independence properties from a DAG. Creating conditional dependence properties from a DAG. Statistical aspects of DAGs. Reasoning with DAGs: does asbestos whiten teeth? Appendix: undirected graphical models, the Gibbs-Markov theorem; directed but cyclic graphical models. Appendix: Some basic notions of graph theory; Guthrie diagrams.
Posted by crshalizi at April 09, 2011 23:52 | permanent link
Precipitation in Snoqualmie Falls revisited. Fitting a two-component Gaussian mixture; examining the fitted distribution; checking calibration. Using cross-validation to select the number of components to use. Examination of the selected mixture model. Suspicious patterns in the parameters of the selected model. Approximating complicated distributions vs. revealing hidden structure. Using bootstrap hypothesis testing to select the number of mixture components. The multivariate Gaussian distribution: definition, relation to the univariate or scalar Gaussian distribution; effect of linear transformations on the parameters; plotting probability density contours in two dimensions; using eigenvalues and eigenvectors to understand the geometry of multivariate Gaussians; estimation by maximum likelihood; computational aspects, specifically in R.
Posted by crshalizi at April 09, 2011 23:51 | permanent link
From factor analysis to finite mixture models by allowing the latent variable to be discrete. From kernel density estimation to mixture models by reducing the number of points with copies of the kernel. Probabilistic formulation of mixture models. Geometry: q+1 points define a q-dimensional plane. Clustering. Estimation of mixture models by maximum likelihood, and why it leads to a vicious circle. The expectation-maximization (EM, Baum-Welch) algorithm replaces the vicious circle with iterative approximation. More on the EM algorithm: convexity, Jensen's inequality, optimizing a lower bound, proving that each step of EM increases the likelihood. Mixtures of regressions. Other extensions.
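For concreteness, here is a bare-bones EM for a two-component univariate Gaussian mixture, on synthetic data; a real analysis would use a package (mclust or mixtools in R), but the E and M steps are just these few lines:

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic data: 30% from N(-2, 1), 70% from N(3, 1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

def em_two_gaussians(x, iters=200):
    """EM for a two-component univariate Gaussian mixture."""
    mu = np.array([x.min(), x.max()])      # crude initialization
    sd = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E step: responsibilities (posterior component probabilities)
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M step: weighted maximum-likelihood updates
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sd

pi, mu, sd = em_two_gaussians(x)
print(pi, mu, sd)   # near (0.3, 0.7), (-2, 3), (1, 1)
```

Each pass through the loop provably does not decrease the likelihood, which is the point of the Jensen's-inequality argument in the notes; what it does not do is guarantee the global optimum, hence the fuss about initialization.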
Posted by crshalizi at April 09, 2011 23:50 | permanent link
Let us cleanse the palate after that last unpleasant announcement. My brother Aryaman (the talented one) writes: "A colleague of mine who is interested in pursuing science education after her PhD was directed to a collection of (I think apocryphal) answers to science questions from 5th and 6th graders in Japan. I noticed many of them were almost little haikus. So I took the time to work some into form..."
Broke up molecules,
stuffed with atoms. Broke atoms,
stuffed with explosions
People run around
in circles they are crazy.
But planets orbit.
Our sun is a star.
But it changes back to a
Sun in the daytime.
A vibration is
motion that cannot decide
which way it should go.
Some know the time by
looking at the sun. I can't
make out the numbers
To some, solutions
are answers. To chemists they
are still all mixed up.
you are looking for some air
but finding water.
It is so hot some
places that the people there
have to live elsewhere.
Clouds circling the earth
round and round and round. There is
not much else to do.
is blamed when people forget
to put the top on.
Vacuums are nothings.
We mention them to let them
know we know they're there.
Some past animals
became fossils while others
prefer to be oil.
Cyanide is this
bad: one drop in a dog's tongue
kills the strongest man
Law of gravity
says no jumping up unless
you will come back down.
why you look like your father
or why you might not.
cold summers and hot winters;
somehow they manage.
What is it with biologists and poetry anyway?
Posted by crshalizi at April 08, 2011 19:25 | permanent link
Attention conservation notice: Of no use unless you care about mathematical statistics, and will be in Pittsburgh on Monday.
As I have had a number of occasions to tell the kids this semester, and will certainly repeat later, one of the most valuable things a data analyst can know is that some variables have nothing to do with each other. (Visions of the totality of interconnections making up the Cosmic All are for higher beings, like the Arisians, Marxist literary critics, and the Medium Lobster, not mere empiricists.) This is not at all easy when confronting high-dimensional data, and so I am especially pleased by the topic of next week's seminar.
As always, the talk is free and open to the public.
Those of you wishing to follow along at home may find it enlightening to read "Brownian distance covariance" (arxiv:1010.0297) by Székely and Rizzo, along with the commentaries linked there — all published, I can't resist pointing out, in the Annals of Applied Statistics.
Update, 8 April: Due to the looming uncertainty about whether we will have a functioning National Science Foundation, the talk has been canceled. So, this is another Bad Thing which I blame on the wingnuts' apocalyptic fear of poor women having contraceptives. (I do not of course speak for the statistics department, for CMU, or for Dr. Székely.)
Posted by crshalizi at April 07, 2011 20:30 | permanent link
Attention conservation notice: Self-promotional; only of interest if you care about theoretical statistics and will be in the Boston area on Monday.
A talk, based on the Bayes < Darwin-Wallace paper.
Posted by crshalizi at April 03, 2011 17:10 | permanent link
Attention conservation notice: I was going to follow Andy's example, and write a statistical April Fool's post on why parametric models are superior to non-parametric ones in every way; but this popped out instead.
Proposition: God only ever judges a creature for an offense against another creature, not against God.
Proof: Every being is either created or uncreated; but the only uncreated being is God. Every offense is therefore against another creature or against God. If God were to punish a creature for a sin against God, the deity would be at once plaintiff and judge (there being no division or disunity within God); yet even human justice recognizes that no one should decide their own case. Divine justice being perfect, the proposition follows.
Remark: this leaves open the possibility that a creature who sins against God could rightly be judged for this by another creature, if the latter could be impartial. This suggests that Satan's role has been misunderstood: the Adversary must be a disinterested gentleman with no love of his creator, because only such a being could judge impartially between God and God's creatures.
Rétrolien manuel: Anniceris
Posted by crshalizi at April 01, 2011 12:00 | permanent link
Attention conservation notice: I have no taste.
Posted by crshalizi at March 31, 2011 23:59 | permanent link
Adding noise to PCA to get a statistical model. The factor analysis model, or linear regression with unobserved independent variables. Assumptions of the factor analysis model. Implications of the model: observable variables are correlated only through shared factors; "tetrad equations" for one factor models, more general correlation patterns for multiple factors. (Our first look at latent variables and conditional independence.) Geometrically, the factor model says the data have a Gaussian distribution on some low-dimensional plane, plus noise moving them off the plane; and that is all. Estimation by heroic linear algebra; estimation by maximum likelihood. The rotation problem, and why it is unwise to reify factors. Other models which produce the same correlation patterns as factor models; in particular the Thomson sampling model, in which the appearance of factors arises from not knowing what the real variables are or how to measure them.
Update, 9 April: A correspondent points me to this tweet, in what I can only call a "let's you and him fight" spirit. While the implicit charge against me by Adams is not without some justice, if you don't want this to happen, you really shouldn't brag about how many beauty pageants your child has won, or for that matter dress the poor beast in such funny clothes.
Posted by crshalizi at March 30, 2011 23:06 | permanent link
Principal components: the simplest, oldest and most robust of dimensionality-reduction techniques. PCA works by finding the line (plane, hyperplane) which passes closest, on average, to all of the data points. This is equivalent to maximizing the variance of the coordinates of projections on to the line/plane/hyperplane. Actually finding those principal components reduces to finding eigenvalues and eigenvectors of the sample covariance matrix. Why PCA is a data-analytic technique, and not a form of statistical inference. An example with cars. PCA with words: "latent semantic analysis"; an example with real newspaper articles. Visualization with PCA and multidimensional scaling. Cautions about PCA; the perils of reification; illustration with genetic maps.
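The reduction of PCA to an eigenproblem is short enough to show in full; a sketch on synthetic two-dimensional data lying near the line y = 2x (the "car" and newspaper examples in the notes are the real thing, done in R):

```python
import numpy as np

rng = np.random.default_rng(3)
# 200 points near the line y = 2x, plus a little isotropic noise
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t]) + 0.1 * rng.normal(size=(200, 2))

Xc = X - X.mean(axis=0)                 # center the data
cov = Xc.T @ Xc / (len(X) - 1)          # sample covariance matrix
evals, evecs = np.linalg.eigh(cov)      # eigendecomposition (ascending order)
order = np.argsort(evals)[::-1]         # sort components by variance
evals, evecs = evals[order], evecs[:, order]

pc1 = evecs[:, 0]                       # first principal component
scores = Xc @ evecs                     # coordinates in the new basis
print(pc1, evals)   # pc1 is, up to sign, close to (1, 2)/sqrt(5)
```

The first eigenvalue dwarfs the second, which is the honest signal that one direction carries nearly all the variance; nothing in the linear algebra, however, tells you that direction *means* anything, which is the reification warning.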
Posted by crshalizi at March 30, 2011 23:05 | permanent link
Background on the 1969 Psychology Today survey and Fair's theory of optimal adultery. Logistic regression for counts vs. logistic regression for binary outcomes. Comparison of predictions on a qualitative level. Quantitative comparison of predicted probabilities of adultery across models. Checking calibration. Sanity-checking of the specification. Scientific evaluation of the models. Does it make sense to keep analyzing this data?
Posted by crshalizi at March 30, 2011 23:04 | permanent link
Background on the long-term study of diabetes among the Pima in Arizona. Data-set cleaning. (Note that while about half the records contain physically impossible values, this is routinely used as an example of a data set with no missing values in computer science.) Fitting logistic regression for a binary outcome. Calculations with fitted logistic regression models. Separating associations between variables from significant predictors. Model-checking.
Posted by crshalizi at March 30, 2011 23:03 | permanent link
Building a weather forecaster for Snoqualmie Falls, Wash., with logistic regression. Exploratory examination of the data. Predicting wet or dry days from the amount of precipitation the previous day. First logistic regression model. Finding predicted probabilities and confidence intervals for them. Comparison to spline smoothing and a generalized additive model. Model comparison test detects significant mis-specification. Re-specifying the model: dry days are special. The second logistic regression model and its comparison to the data. Checking the calibration of the second model.
Manual trackback: SnoValley Star (!)
Posted by crshalizi at March 30, 2011 23:02 | permanent link
Modeling conditional probabilities; using regression to model probabilities; transforming probabilities to work better with regression; the logistic regression model. Maximum likelihood for logistic regression; numerical maximum likelihood by Newton's method and by iteratively re-weighted least squares. Logistic-additive models as a non-parametric alternative (which you should probably use unless you have very definite reasons); bootstrap specification testing for logistic regression.
PDF notes, incorporating R examples
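The IRLS scheme from the notes can be sketched in a few lines; this is an illustration of the algorithm on synthetic data, not a substitute for R's glm:

```python
import numpy as np

def logistic_irls(x, y, iters=25):
    """Logistic regression by iteratively re-weighted least squares
    (equivalent to Newton's method on the log-likelihood)."""
    X = np.column_stack([np.ones(len(x)), x])   # add an intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ beta))         # current fitted probabilities
        w = p * (1 - p)                         # IRLS weights
        z = X @ beta + (y - p) / w              # working response
        # Weighted least squares step
        beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))
    return beta

# Synthetic data with known coefficients (-0.5, 1.5)
rng = np.random.default_rng(5)
x = rng.normal(size=5000)
p_true = 1 / (1 + np.exp(-(-0.5 + 1.5 * x)))
y = rng.binomial(1, p_true)

beta = logistic_irls(x, y)
print(beta)   # close to (-0.5, 1.5)
```

The weighted-least-squares step is exactly the Newton step on the log-likelihood, which is why convergence here takes a handful of iterations; the sketch omits the safeguards (step-halving, handling of separable data) that production code needs.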
Posted by crshalizi at March 30, 2011 23:01 | permanent link
It is only appropriate that a talk about influential outliers be held on an unusual day, at an unusual time and place:
As always, the talk is free and open to the public.
Posted by crshalizi at March 30, 2011 23:00 | permanent link
It has long been one of my ambitions to be denounced as a tentacle of a mysterious and shadowy conspiracy bent on global domination. Reading this therefore fills me with conflicting emotions. On the one hand, satisfaction of a cherished dream; on the other, is that all there is to a conspiracy? But predominating at the moment is regret — for I turned down my invitation to the conference at Bretton Woods because it conflicted with my teaching schedule. In retrospect, this was dumb. Surely the kids could have taken care of themselves for a week, while I joined the rest of the Immense Shadowy Global Conspiracy in hatching nefarious schemes? How could I have passed up an opportunity to commune with sinister intellects in the wild hills of New England in order to teach? How, when I spent days on end this month reviewing grant proposals for them, did I fail to spot the question on the evaluation forms about "Potential to advance the sinister designs of Mr. Soros and his associates who must not be named"? How could I have asked for so little in my grant application, when clearly any proper subversive conspiracy could have paid for so, so much more?
I take comfort only in the fact that I will, after all, be lecturing that week on how to detect the hidden common causes linking apparently disparate events — and there's always next year to go and explain how combining ergodic theory and statistical learning methods will let us take over the world.
Posted by crshalizi at March 30, 2011 22:30 | permanent link
Posted by crshalizi at March 16, 2011 19:00 | permanent link
Via Bill Tozier comes news of this blog post by Eric Hellman, which is part of a controversy over how libraries should pay publishers for electronic books. I have not thought about or studied this enough to have any sort of opinion, though since it seems to go very, very far from marginal cost pricing, I am naturally suspicious. Be that as it may, Hellman suggests that the specific proposal of Harper Collins needs to be seen in the light of the "long tail" of the circulation distribution of library books. That is, most books circulate very little, while a few circulate an awful lot, accounting for a truly disproportionate share of the circulation, and (says Hellman) the Harper Collins proposal would, compared to the status quo, shift library funds from the publishers of the large mass of low-circulation books to the publishers of the tail of high-circulation books, Harper Collins prominent among them.
To support this, Hellman uses some data from a very rich data set released by the libraries of the University of Huddersfield in England. (I have been meaning to look this up since Magistra et Mater mentioned the cool stuff Huddersfield is doing with library data-mining.) Hellman's analysis went as follows. (See his post for details.) He made a cumulative histogram of how often each book had circulated (binning counts over 100 by tens); plotted it on a log-log scale; fit a straight line by least squares; and declared the distribution a power law because the R2 was so high.
Constant readers can imagine my reaction.
Having gotten hold of the data, I plotted the cumulative distribution function (without binning), and including both Hellman's power law (purple), and a discrete log-normal distribution (red):
This is clearly heavy tailed and massively skewed to the right. It is equally clear that it is not a power law: there are simply orders of magnitude too few books which circulated 500 or 1000 or 2000 times. (Remember that the vertical axis here is on a log scale.) The difference in log-likelihoods is 200 in favor of the log-normal, i.e., the data were e^200 times less likely under the power law. Applying the non-nested model comparison test from my paper with Aaron and Mark, the chance of this big a difference in likelihoods arising through fluctuations when the power law is actually as good or better than the log-normal is about 10^-35. I have not attempted to see whether the deviations from the log-normal curve are significant, but it does look quite good over almost the whole range of the data. There could be some systematic departures at the far right, but over-all it looks like Gauss is not mocked at Huddersfield.
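For those who want the flavor of this model comparison without the full machinery: here is a simplified sketch on synthetic log-normal data, using continuous rather than discrete distributions; the real analysis used the discrete fits and the non-nested test from the paper with Aaron and Mark:

```python
import numpy as np

rng = np.random.default_rng(11)
# Synthetic "circulation counts": heavy-tailed and right-skewed,
# but log-normal, not a power law
x = rng.lognormal(mean=3.0, sigma=1.0, size=5000)
x = x[x >= 1.0]            # use xmin = 1 for both fits
n, logx = len(x), np.log(x)

# Power-law (continuous Pareto) fit by maximum likelihood, xmin = 1
alpha = 1 + n / logx.sum()
ll_pl = n * np.log(alpha - 1) - alpha * logx.sum()

# Log-normal fit by maximum likelihood
mu, s = logx.mean(), logx.std()
ll_ln = (-n * np.log(s * np.sqrt(2 * np.pi)) - logx.sum()
         - ((logx - mu) ** 2).sum() / (2 * s ** 2))

print(ll_ln - ll_pl)   # decisively positive: the log-normal wins
```

Note that both fits would produce a roughly straight stretch on a log-log plot with a high R^2, which is exactly why the regression-on-the-histogram procedure cannot distinguish them and the likelihood comparison can.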
I should say right away that Hellman was very gracious in our correspondence about this (I am after all a quibbling pedant quite unknown to him). More importantly, his analysis of the Harper Collins proposal does not, that I can see, depend at all on circulation following a power law; it just has to be strongly skewed to the right. That being the case, I hope this particular power law can be eradicated before it has a chance to become endemic in such permanent reservoirs of memetic infection as the business literature and Physica A.
Update, next day: In the interests of reproducibility, the circulation totals for the data (gzipped), and the R code for my figure. The latter needs the code from our paper, which I will turn into a proper R package Any Time Now.
Previously on "Those That Resemble Power Laws from a Distance": the link distribution of weblogs (and again); the distribution of time taken to reply to e-mail; the link distribution of biochemical networks; urban economies in the US.
Posted by crshalizi at March 16, 2011 18:45 | permanent link
A.k.a. the 10th International Conference on Complexity in Acute Illness, is happening in Bonn, 9--11 September. "Complexity and" or "Nonlinear dynamics and" conferences often have a lot of fluff, but one of the organizers of this is my friend Sven Zenker. Unsurprisingly, therefore, this actually looks interesting and substantive, and so perhaps of interest to readers. It also gives me an excuse to link to one of Sven's interesting papers about combining serious physiological modeling with modern statistical tools. Among other virtues, this is methodologically interesting in showing a way to learn not just from a non-identifiable model, but from the way in which it fails to be identified. I have been meaning to do this since first hearing him talk about it in July 2007...
Posted by crshalizi at March 16, 2011 18:30 | permanent link
Attention conservation notice: 350+ grumbled words about the price of academic papers, a topic of little moment even to scholars directly involved, let alone anyone with a sense of perspective.
One of the fundamental principles of economics is the virtue of marginal cost pricing: everything should be sold for just the cost of producing one extra unit. Prices below marginal cost are obviously bad for the sellers — you can't keep losing money on every sale and keep producing — but prices above marginal costs mean that the good will be consumed less than it ought to be, than its actual utility warrants.
With this in mind, let us look at the National Bureau of Economic Research, which is about as good an embodiment of the discipline's core as one could hope to find. One of its key outputs is working papers. To read these, you need either an institutional subscription, or you need to pay $5 per download. This price is orders of magnitude more than the marginal cost of serving a few hundred more kilobytes of PDF*. It is literally a textbook Econ. 1 result that NBER is ensuring its own economic research will be under-consumed. It isn't even recovering the fixed costs of production through average-cost pricing, since those costs are paid not by NBER (which is a non-profit largely funded by grants anyway) but by the authors of the papers. Rather, it confirms Healy's law, that each "discipline's organizational life inverts its core intellectual commitments".
(This rant was brought to you by wanting to read this paper, mentioned by Krugman. CMU has a subscription, so I can, but it is senseless that twenty years after the beginning of the arxiv, a central organization of the discipline of economics prices its preprints this way.)
Update, later that day: As several people have written me to point out, the authors of the paper in question have a free PDF version online. This does not make NBER's policy any more efficient or even sensible; quite the contrary. Fortunately, I am not the kind of man who goes around making revealed preference arguments.
Update, 23 March: Hopefully, the good example of the Brookings Institution will help establish a norm, and shame NBER into adopting marginal cost pricing.
*: Pair.com will sell you 240 GB/month for $50, which works out to something like 0.02 cents for a 1 MB paper, which would be quite large or graphics-heavy. I decline to believe that NBER is being ripped off by their Internet service provider by a factor of 25,000. (This is not an endorsement of Pair's hosting services, or even a claim that their prices are especially good.)
Posted by crshalizi at March 06, 2011 13:40 | permanent link
In statistics, we say that a high-dimensional model is "sparse" if most of its large number of variables do not actually contribute to the outcome --- the true set of relevant predictors is small compared to the number of covariates. Some of the most interesting work in statistics and machine learning over the last decade and a half has been about finding and using sparsity, often starting from ideas like the lasso, but becoming considerably more general and flexible, and connecting to ideas about compressed sensing. (I will probably never get around to writing a post about SpAM, but may yet turn it into a homework problem; I still have hopes about TESLA.) Exploiting sparsity is one of the principal ways of lifting the curse of dimensionality, which otherwise weighs on us more and more every year.
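As a concrete illustration of what "finding and using sparsity" means, here is a bare-bones lasso, fit by coordinate descent to synthetic data in which only three of fifty predictors matter; the penalty level is picked by hand for the example, where in practice one would cross-validate:

```python
import numpy as np

def lasso_cd(X, y, lam, iters=200):
    """Lasso by cyclic coordinate descent (soft-thresholding updates)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(iters):
        for j in range(p):
            # Partial residual: leave variable j out of the current fit
            r = y - X @ beta + X[:, j] * beta[j]
            rho = X[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
    return beta

rng = np.random.default_rng(9)
n, p = 200, 50
X = rng.normal(size=(n, p))
true = np.zeros(p)
true[:3] = [4.0, -3.0, 2.0]          # only three relevant predictors
y = X @ true + rng.normal(size=n)

beta = lasso_cd(X, y, lam=50.0)
print(np.round(beta[:5], 2))         # first three large, the rest (near) zero
```

The soft-thresholding step is what zeroes out the irrelevant coefficients exactly, rather than merely shrinking them; that exact-zero behavior is the sparsity being exploited.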
The aim of the workshop is to bring together theory and practice in modeling and exploring structure in high-dimensional data. Participation of researchers working on methodology, theory and applications, both from the frequentist and Bayesian point of view is strongly encouraged in order to discuss different approaches for tackling challenging high-dimensional problems. Furthermore, the workshop will link with the signal processing community, which has worked on similar topics and with whom exchanges of ideas will be very fruitful. We encourage genuine interaction between proponents of different approaches and hope to better understand possibilities for modeling of structure in high dimensional data. We invite submissions on various aspects of structured sparse modeling in high-dimensions. Here are two examples of key questions:
- How can we automatically learn the hidden structure from the data?
- Once the structure is learned or pre-given, how can we utilize the structure to conduct more effective inference?

See the full call for papers for more details and submission information.
(I remember when Han took stochastic processes from me --- how can he be organizing workshops?)
Posted by crshalizi at March 04, 2011 01:44 | permanent link
In which we compare the power-law scaling model of urban economies due to Bettencourt et al. to an alternative in which city size is actually irrelevant.
This was a one-week take-home exam, intended to use more or less everything taught so far.
Posted by crshalizi at March 02, 2011 17:10 | permanent link
In which we estimate and test the power-law scaling model of urban economies due to Bettencourt et al.
Posted by crshalizi at March 02, 2011 17:00 | permanent link
Attention conservation notice: I have no taste.
Posted by crshalizi at February 28, 2011 23:59 | permanent link
When I was a student at Madison, I was happy to be part of our union, the Teaching Assistants' Association. They are, naturally, deeply involved in the events in Wisconsin, and I am very proud. If you want to help the demonstrators materially, the TAA will take your money and put it to good use. (It is characteristic, and in a good way, that there is a fund especially for cleaning up the state capitol building afterwards.) And if you're not sure why the fight in Wisconsin matters, well, there are lots of people explaining the many reasons.
To add my little bit, and repeat myself: the single biggest thing which has gone wrong with America during my lifetime has been the economic stagnation for most of the country, accompanied by shifting risk from those who have resources and large organizations to individuals who don't have much. And that has gone hand in hand with the decline --- the repression --- of organized labor. Unions are not perfect, but no human institutions are, and to condemn unions, specifically, because they are sometimes hide-bound or self-serving is either folly or deceit. Unions are the only organized force in this country which seriously advocates, which pushes, for the material interests and dignity of ordinary working people. The fight in Wisconsin is about whether there is, finally, a limit to how far the dismantling of American labor can be pushed.
Manual trackback: Lisa Schweitzer
Posted by crshalizi at February 23, 2011 00:12 | permanent link
I'll let the abstract speak for me on this one:
Figures and calculations were done with this code and data. I realize that's not fully up to spec for reproducible computational science, but I'm getting there.
(Yes, this is the paper which I started because readers kept asking me questions, and yes, A Fermi Problem in Western Pennsylvania was spun off from the first draft, which was going to be just a blog post. It turns out that the journal is OK with putting submitted manuscripts on arxiv, or at least not too upset.)
Posted by crshalizi at February 22, 2011 00:05 | permanent link
Reading this interesting post on why protests can bring down authoritarian regimes, and a response distinguishing how long a regime happens to survive from how able it is to withstand crises, I can't help thinking of what Mr. Hume would say; or rather, had said:
NOTHING appears more surprizing to those, who consider human affairs with a philosophical eye, than the easiness with which the many are governed by the few; and the implicit submission, with which men resign their own sentiments and passions to those of their rulers. When we enquire by what means this wonder is effected, we shall find, that, as FORCE is always on the side of the governed, the governors have nothing to support them but opinion. It is therefore, on opinion only that government is founded; and this maxim extends to the most despotic and most military governments, as well as to the most free and most popular. The soldan of EGYPT, or the emperor of ROME, might drive his harmless subjects, like brute beasts, against their sentiments and inclination: But he must, at least, have led his mamalukes, or prætorian bands, like men, by their opinion.
Opinion is of two kinds, to wit, opinion of INTEREST, and opinion of RIGHT. By opinion of interest, I chiefly understand the sense of the general advantage which is reaped from government; together with the persuasion, that the particular government, which is established, is equally advantageous with any other that could easily be settled. When this opinion prevails among the generality of a state, or among those who have the force in their hands, it gives great security to any government.
Right is of two kinds, right to POWER and right to PROPERTY. What prevalence opinion of the first kind has over mankind, may easily be understood, by observing the attachment which all nations have to their ancient government, and even to those names, which have had the sanction of antiquity. Antiquity always begets the opinion of right; and whatever disadvantageous sentiments we may entertain of mankind, they are always found to be prodigal both of blood and treasure in the maintenance of public justice. There is, indeed, no particular, in which, at first sight, there may appear a greater contradiction in the frame of the human mind than the present. When men act in a faction, they are apt, without shame or remorse, to neglect all the ties of honour and morality, in order to serve their party; and yet, when a faction is formed upon a point of right or principle, there is no occasion, where men discover a greater obstinacy, and a more determined sense of justice and equity. The same social disposition of mankind is the cause of these contradictory appearances.
It is sufficiently understood, that the opinion of right to property is of moment in all matters of government. A noted author has made property the foundation of all government; and most of our political writers seem inclined to follow him in that particular. This is carrying the matter too far; but still it must be owned, that the opinion of right to property has a great influence in this subject.
Upon these three opinions, therefore, of public interest, of right to power, and of right to property, are all governments founded, and all authority of the few over the many. There are indeed other principles, which add force to these, and determine, limit, or alter their operation; such as self-interest, fear, and affection: But still we may assert, that these other principles can have no influence alone, but suppose the antecedent influence of those opinions above-mentioned. They are, therefore, to be esteemed the secondary, not the original principles of government.
For, first, as to self-interest, by which I mean the expectation of particular rewards, distinct from the general protection which we receive from government, it is evident that the magistrate's authority must be antecedently established, at least be hoped for, in order to produce this expectation. The prospect of reward may augment his authority with regard to some particular persons; but can never give birth to it, with regard to the public. Men naturally look for the greatest favours from their friends and acquaintance; and therefore, the hopes of any considerable number of the state would never center in any particular set of men, if these men had no other title to magistracy, and had no separate influence over the opinions of mankind. The same observation may be extended to the other two principles of fear and affection. No man would have any reason to fear the fury of a tyrant, if he had no authority over any but from fear; since, as a single man, his bodily force can reach but a small way, and all the farther power he possesses must be founded either on our own opinion, or on the presumed opinion of others. And though affection to wisdom and virtue in a sovereign extends very far, and has great influence; yet he must antecedently be supposed invested with a public character, otherwise the public esteem will serve him in no stead, nor will his virtue have any influence beyond a narrow sphere.
A Government may endure for several ages, though the balance of power, and the balance of property do not coincide. This chiefly happens, where any rank or order of the state has acquired a large share in the property; but from the original constitution of the government, has no share in the power. Under what pretence would any individual of that order assume authority in public affairs? As men are commonly much attached to their ancient government, it is not to be expected, that the public would ever favour such usurpations. But where the original constitution allows any share of power, though small, to an order of men, who possess a large share of the property, it is easy for them gradually to stretch their authority, and bring the balance of power to coincide with that of property.
This leaves open, of course, how anyone, subject or mamaluke, learns the opinions of their fellows regarding rights and interests; but this is one thing public political action is for.
Applications to other contemporary events, in which subjects cease to let themselves be led like brute beasts, will occur to my learned and sagacious readers, and so I will not belabor the obvious.
Posted by crshalizi at February 21, 2011 17:13 | permanent link
This coming Thursday (Feb. 24th), I'll be at the Blogs and Bullets 2011 conference at Stanford, being organized around the eponymous report for the United States Institute of Peace. (I imagine I'll have more to say about one than the other.) It's an invitation-only workshop, but if readers in the Palo Alto area would like to get in touch the next day, drop me a line; I will have free time but no car.
Posted by crshalizi at February 17, 2011 21:45 | permanent link
The curse of dimensionality limits the usefulness of fully non-parametric regression in problems with many variables: bias remains under control, but variance grows rapidly with dimensionality. (The number of points required to pin down a [hyper-]surface to within a given tolerance grows exponentially in the number of dimensions.) Parametric models do not have this problem, but have bias and do not let us discover anything about the true function. Structured or constrained non-parametric regression compromises, by adding some bias so as to reduce variance. Additive models are an example: each input variable gets a "partial response function", and these add together to give the total regression function; the partial response functions are otherwise arbitrary. Additive models include linear models as a special case, but still evade the curse of dimensionality. Visualization and interpretation of additive models by display of the partial response functions. Fitting additive models is done iteratively, starting with some initial guess about each partial response function and then doing one-dimensional smoothing, so that the guesses correct each other until a self-consistent solution is reached. Incorporation of parametric terms, and interactions by joint smoothing of subsets of variables. Examples in R using the California house-price data. Conclusion: there is hardly ever any reason to prefer linear models to additive ones, and the continued thoughtless use of linear regression is a scandal.
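As a concrete illustration of the iterative fitting just described, here is a bare-bones backfitting loop for two predictors, using smooth.spline as the one-dimensional smoother. This is my own sketch, not the class code, and it skips the convergence check a real implementation would have.

```r
# Backfitting sketch for y = alpha + f1(x1) + f2(x2) + noise:
# repeatedly smooth each variable's partial residuals until stable.
backfit2 <- function(x1, x2, y, sweeps = 20) {
  alpha <- mean(y)
  f1 <- f2 <- rep(0, length(y))
  for (s in 1:sweeps) {
    f1 <- predict(smooth.spline(x1, y - alpha - f2), x1)$y
    f1 <- f1 - mean(f1)          # center each partial response for identifiability
    f2 <- predict(smooth.spline(x2, y - alpha - f1), x2)$y
    f2 <- f2 - mean(f2)
  }
  list(alpha = alpha, f1 = f1, f2 = f2)
}
```

Plotting f1 against x1 and f2 against x2 gives exactly the partial-response displays mentioned above.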
PDF notes, incorporating R examples
Posted by crshalizi at February 17, 2011 21:30 | permanent link
An extended example of re-writing code to make it more powerful, flexible, and clear, based on in-class discussion.
Calculating a standard error for the median of a particular Gaussian sample by repeated simulation, "manually" at the R console. Writing a function to automate this task, with everything hard-coded. Adjusting the function to let the number of simulation runs be an argument. Writing a parallel function to do the same job for an exponential distribution. Since this is almost entirely the same, why have two functions? Putting in a logical switch between hard-coded options. Better approach: abstract out the simulation into a separate function, and make the simulator an argument to the standard-error-in-median function. Example of applying the latter function to a much more complicated simulator. Advantages of the modular approach: flexibility, clarity, ease of adjustment. Example: removing a for loop in favor of replicate in the find-the-standard-error function, without having to change any of the simulators. Writing parallel functions to find the interquartile range of the median, or the standard error of the mean. Repeating the process of abstraction: the common element is taking a simulator, estimating some property of the simulation, and summarizing the simulated distribution. All three tasks are logically distinct and should be performed by separate functions. Reduction of bootstrapping to a two-line function taking other functions as arguments.
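The endpoint of that refactoring, sketched in my own words rather than verbatim from the handout, looks something like this:

```r
# Everything is a function: a simulator makes fake data, an estimator
# computes the statistic, and a summarizer describes its distribution.
boot.dist <- function(simulator, estimator, B = 1000) {
  replicate(B, estimator(simulator()))
}
se.of <- function(simulator, estimator, B = 1000) {
  sd(boot.dist(simulator, estimator, B))
}

# Swapping simulators or estimators needs no new bootstrap code:
gauss.sim <- function() rnorm(100)
exp.sim   <- function() rexp(100)
```

se.of(gauss.sim, median) then estimates the standard error of the median of 100 standard Gaussians (about 0.125 in theory), and se.of(exp.sim, median) does the exponential case with no new machinery.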
PDF handout, incorporating R examples
Posted by crshalizi at February 16, 2011 01:48 | permanent link
Kernel regression controls the amount of smoothing indirectly by bandwidth; why not control the irregularity of the smoothed curve directly? The spline smoothing problem is a penalized least squares problem: minimize mean squared error, plus a penalty term proportional to average curvature of the function over space. The solution is always a continuous piecewise cubic polynomial, with continuous first and second derivatives. Altering the strength of the penalty moves along a bias-variance trade-off, from pure OLS at one extreme to pure interpolation at the other; changing the strength of the penalty is equivalent to minimizing the mean squared error under a constraint on the average curvature. To ensure consistency, the penalty/constraint should weaken as the data grows; the appropriate size is selected by cross-validation. An example with the data from homework 4, including confidence bands. Writing splines as basis functions, and fitting as least squares on transformations of the data, plus a regularization term. A brief look at splines in multiple dimensions. Splines versus kernel regression. Appendix: Lagrange multipliers and the correspondence between constrained and penalized optimization.
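In base R the whole procedure is nearly a one-liner; a hedged sketch with synthetic data (not the homework's):

```r
set.seed(4)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)

fit <- smooth.spline(x, y, cv = TRUE)   # pick the penalty by leave-one-out CV
grid <- seq(0.5, 9.5, length.out = 50)
yhat <- predict(fit, grid)$y            # the fitted piecewise-cubic curve
```

fit$lambda is the selected penalty strength; cranking it up by hand pushes the fit toward the OLS line, and shrinking it toward interpolation, tracing out the bias-variance trade-off described above.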
PDF notes, incorporating R examples
Posted by crshalizi at February 16, 2011 01:47 | permanent link
In which we attempt to weigh the heart of the cat.
Posted by crshalizi at February 16, 2011 01:46 | permanent link
Testing parametric model specifications against parametric alternatives imposes strong assumptions about how we can be wrong, and so is often dubious. Non-parametric smoothers can be used to test parametric models instead. Forms of tests: differences in in-sample performance; differences in generalization performance; whether the parametric model's residuals have expectation zero everywhere. Constructing a test statistic based on in-sample performance. Using bootstrapping from the parametric model to find the null distribution of the test statistic. An example where the parametric model is correctly specified, and one where it is not. Cautions on the interpretation of goodness-of-fit tests. Why use parametric models at all? Answers: speed of convergence when correctly specified; and the scientific interpretation of parameters, if the model actually comes from a scientific theory. Mis-specified parametric models can predict better, at small sample sizes, than either correctly-specified parametric models or non-parametric smoothers, because of their favorable bias-variance characteristics; an example.
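A compressed version of the recipe, with a linear null and a spline smoother standing in for the lecture's choices (an illustrative sketch of mine, not the lecture code):

```r
set.seed(5)
x <- runif(150); y <- x^2 + rnorm(150, sd = 0.05)   # truth is curved

# test statistic: how much better the smoother fits than the linear model
mse.gap <- function(x, y) {
  mean(residuals(lm(y ~ x))^2) -
    mean((y - predict(smooth.spline(x, y), x)$y)^2)
}
t.obs <- mse.gap(x, y)

# null distribution: simulate new data from the *fitted linear model*
null.fit <- lm(y ~ x); s <- summary(null.fit)$sigma
t.null <- replicate(200, mse.gap(x, fitted(null.fit) + rnorm(150, sd = s)))
p.value <- mean(t.null >= t.obs)
```

Because the truth here really is nonlinear, the observed gap sits far out in the right tail of the null distribution; re-running with a genuinely linear truth gives an unremarkable p-value, which is the correctly-specified example.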
PDF notes, incorporating R examples
Posted by crshalizi at February 16, 2011 01:45 | permanent link
Attention conservation notice: Only of interest if you are (1) in Pittsburgh and (2) want to spend seven and a half precious hours of your life hearing about the econometric analysis of group effects on individual behavior.
Steve Durlauf, visiting CMU from Madison, will be giving a series of workshops on the economics and econometrics of social interactions (i.e., ones not mediated through anonymous market exchange). The talks will be February 17--18 and 21--23, 9:00--10:20 am in Hamburg Hall 1502, except on the 18th when it will be in Hamburg Hall 2503. I strongly recommend this to anyone who finds things like this interesting; Durlauf has been thinking about these matters for a long time, and is a leading scholar in the field. I myself have cleverly arranged to have scheduling conflicts on every single one of those days, but will be re-reading Blume, Brock, Durlauf and Ioannides.
Disclaimer: Durlauf and I are both affiliated with Santa Fe, and I was friends with a couple of his students in graduate school.
Update: Prof. Durlauf will be giving the statistics department seminar, "On the Observational Implications of Taste-Based Discrimination in Racial Profiling", at 4 pm on Monday, 21 February, in Scaife Hall 125.
Posted by crshalizi at February 14, 2011 12:25 | permanent link
ITA was great, partly for the reasons visible at right, and partly for getting to enjoy the gracious hospitality of Doug White, but mostly for the scientific exchange. So, some links to my favorite talks. (Note "favorite" and not "best".) I will not attempt to explain any of these adequately, or to list everyone's co-authors. It's good that so many of the papers are on arxiv, but unfortunate that not all of them are.
In addition to the talks, and many enlightening conversations, Anand introduced Maxim and me to the Noble Experiment, surely the best cocktail lounge in which the wall opposite the bar is entirely covered in gilded skulls. At least one of the three of us should probably have done some memento mori blogging.
Manual trackback: Anand, "zombie-blogging" the workshop, which makes me fear for the future of ITA. (He says he was only sick with the flu, but by this point we all know how the rest of that story goes.)
Posted by crshalizi at February 13, 2011 17:30 | permanent link
As in some of my previous classes, there is a wide range of programming skill among the students in 402. The following notes are mostly intended to help those at the lower end of the scale catch up, but may be of some interest to others. (They presume familiarity with using R from the command line.) The last section largely incorporates Minimal Advice to Undergraduates on Programming.
Statisticians must be able to do basic programming; someone who only knows how to run canned routines is not a data analyst but a technician who tends a machine they do not understand. Programming in R is best organized around functions. Parts of a function and a function declaration. Writing functions to encapsulate repeated procedures. First example: calculating quantiles of Pareto distributions, by hand and by a function; checking the function. Extending the function. Writing functions which call other user-defined functions. Sanity-checking arguments, e.g., with stopifnot. More layering of functions: writing a Pareto random number generator. Our first bug. The debugging process; traceback as a useful utility. Checking the Pareto generator. Automating the checking process. Passing arguments from function to function with the ... pseudo-argument. More debugging. Contexts and "scope". Revising functions to work with each other. Avoiding iteration in R for speed and clarity. Returning lists and other complex data structures; writing a function to estimate a Gaussian. General programming advice: take a real programming class; comment your code; RTFM; start from the beginning and break it down; break your code into many short, meaningful functions; avoid writing the same thing twice; use meaningful names; check whether your code works; complain rather than giving up; avoid iteration.
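In the spirit of the running example, here are my own minimal versions (not the lecture's code) of the two layered Pareto functions, with the sanity-checking and the ... pseudo-argument mentioned above:

```r
# Quantile function of a Pareto distribution with density ~ x^(-exponent)
# above a positive threshold; sanity-check the arguments first.
qpareto <- function(p, exponent, threshold) {
  stopifnot(all(p >= 0), all(p <= 1), exponent > 1, threshold > 0)
  threshold * (1 - p)^(-1 / (exponent - 1))
}

# Layering: the generator just feeds uniforms through the quantile function
# (inverse-transform sampling), passing the distributional parameters
# along via the ... pseudo-argument.
rpareto <- function(n, ...) qpareto(runif(n), ...)
```

Checking the generator can itself be automated: compare sample quantiles of rpareto output against qpareto, which is exactly the sort of check-your-code habit the advice at the end urges.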
Posted by crshalizi at February 07, 2011 22:40 | permanent link
From (very, very late) Tuesday through the end of the week, I'll be at the Information Theory and Applications workshop at UCSD. Inexplicably, the organizers of the session in memory of the late, great David Blackwell on Thursday asked me to talk about Bayesian convergence under dependence and mis-specification; inexplicably, because the rest of the line-up is excellent. (And not just for that session.) Anand has already promised/threatened near-live-blogging.
(And considering my flights: "bust" is not an ignorable alternative.)
Update, 13 February: follow-up.
Posted by crshalizi at February 07, 2011 22:20 | permanent link
Attention conservation notice: I continue my efforts to make this unreadable by promising to post 15--30 pages of lecture notes twice a week, plus homework assignments.
My class this semester is 36-402, "Advanced Data Analysis", for 68 students, about half statistics majors, most in their junior year, and about half seniors from other majors. They've just come off 36-401, modern regression, as taught by the excellent Prof. Nugent, so there's nothing more about linear models that would be useful for me to teach them. Instead, I've decided to take the "advanced" part seriously, and present modern techniques and concepts in ways which, hopefully, well-prepared undergraduates can actually grasp. (By the time they get to me, our majors are very well-prepared — but they are still undergraduates.) On the theory that the course notes might be of more general interest, I'll be posting them here. When I've tried things like this in the past, I put them all together on a page I updated over the semester, but I've been told separate posts would be more convenient; this page will point to them all.
Some of the lectures are drafts for sections of STACS.
(Many of the notes are revisions of those for my data mining course. I confess I originally intended "data analysis" to just be data mining with the serial numbers filed off, but by the end of the semester I imagine the overlap will be no more than 50%.)
Posted by crshalizi at February 04, 2011 01:42 | permanent link
Getting comfortable with simulations and the bootstrap; and, in the hidden curriculum, writing functions.
Posted by crshalizi at February 04, 2011 01:41 | permanent link
Learning to estimate variances and conditional densities.
Posted by crshalizi at February 04, 2011 01:40 | permanent link
The "Get comfortable with cross-validation and kernels" problem set.
Posted by crshalizi at February 04, 2011 01:39 | permanent link
The "As you all learned in kindergarten last semester" problem set.
Posted by crshalizi at February 04, 2011 01:38 | permanent link
Statisticians quantify uncertainty in inference from random data to parameters through the sampling distributions of statistical functionals. These distributions are inaccessible in all but the simplest and most implausible cases. The bootstrap principle: sampling distributions under a good estimate of the truth are close to the true sampling distributions. Parametric bootstrapping: methods for finding standard errors, biases and confidence intervals, and for performing hypothesis tests. Double-bootstraps. Examples of parametric bootstrapping with Pareto's law of income inequality. Non-parametric bootstrapping: using the empirical distribution itself as our model. The Pareto distribution continued. Bootstrapping regressions: resampling data-points versus resampling residuals; resampling of residuals under heteroskedasticity. Examples with homework data. Cautions on bootstrapping with dependent data. When does the bootstrap fail?
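For instance, resampling data points (cases) to get a standard error for a regression slope; a hedged sketch with synthetic data, not the homework's:

```r
set.seed(7)
x <- runif(100); y <- 2 * x + rnorm(100, sd = 0.5)

resample.slope <- function() {
  idx <- sample(100, replace = TRUE)      # resample cases, not residuals
  coef(lm(y[idx] ~ x[idx]))[2]
}
se.boot <- sd(replicate(1000, resample.slope()))

se.formula <- summary(lm(y ~ x))$coefficients[2, 2]  # textbook standard error
```

Here the two agree, because the data really are homoskedastic; under heteroskedasticity, case resampling stays valid while resampling residuals (and the textbook formula) can mislead.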
Posted by crshalizi at February 04, 2011 01:37 | permanent link
"Simulation" means: implementing the story encoded in the model, step by step, to produce something data-like. Stochastic models have random components and so their simulation requires some random steps. Stochastic models specified through conditional distributions are simulated by chaining together random numbers; the importance of conditional independence structures. Methods of generating random numbers with specified distributions. Simulation shows us what a model predicts (expectations, higher moments, correlations, regression functions, sampling distributions); analytical probability calculations are short-cuts for exhaustive simulation. Simulation lets us check aspects of the model: does the data look like typical simulation output? if we repeat our exploratory analysis on the simulation output, do we get the same results? If not, how specifically does the model fail? Simulation-based estimation: the method of simulated moments. Indirect inference, left as an exercise for the reader.
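A toy sketch of the method of simulated moments, estimating an exponential rate; my own example, not the lecture's, using common random numbers so the objective is a smooth function of the parameter:

```r
set.seed(8)
obs <- rexp(500, rate = 2)     # pretend this is the data

u <- runif(1e4)                # fixed draws: common random numbers
simulate <- function(rate) qexp(u, rate = rate)

# choose the rate whose simulated mean and variance match the data's
msm.obj <- function(rate) {
  s <- simulate(rate)
  (mean(s) - mean(obs))^2 + (var(s) - var(obs))^2
}
rate.hat <- optimize(msm.obj, interval = c(0.1, 10))$minimum
```

Re-drawing fresh random numbers at every objective evaluation would make the minimization needlessly jumpy; freezing u is the standard trick.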
Posted by crshalizi at February 04, 2011 01:36 | permanent link
The desirability of estimating not just conditional means, variances, etc., but whole distribution functions. Parametric maximum likelihood is a solution, if the parametric model is right. Histograms and empirical cumulative distribution functions are non-parametric ways of estimating the distribution: do they work? The Glivenko-Cantelli law on the convergence of empirical distribution functions, a.k.a. "the fundamental theorem of statistics". More on histograms: they converge on the right density, if bins keep shrinking but the number of samples per bin keeps growing. Kernel density estimation and its properties; some error analysis. An example with data from the homework. Estimating conditional densities; another example with homework data. Some issues with likelihood, maximum likelihood, and non-parametric estimation.
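A quick base-R sketch of kernel density estimation (synthetic data, not the homework's):

```r
set.seed(9)
x <- rnorm(1000)
kde <- density(x, bw = "SJ")   # Sheather-Jones bandwidth selection

# read the estimated density off at a point and compare with the truth
f0 <- approx(kde$x, kde$y, xout = 0)$y   # estimate of the density at 0
```

f0 should come out close to dnorm(0), about 0.399; letting the bandwidth shrink as n grows (which the selector does automatically) is what makes the estimator consistent, in parallel with the shrinking-bins condition for histograms.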
Posted by crshalizi at February 04, 2011 01:35 | permanent link
Average predictive comparisons. Weighted least squares estimates. Heteroskedasticity and the problems it causes for inference. How weighted least squares gets around the problems of heteroskedasticity, if we know the variance function. Estimating the variance function from regression residuals. An iterative method for estimating the regression function and the variance function together. Locally constant and locally linear modeling. Lowess.
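One round of the iterative scheme, as a sketch with synthetic heteroskedastic data (a full implementation would alternate steps 2 and 3 until the estimates settle down):

```r
set.seed(10)
x <- runif(300, 1, 10)
y <- 3 + 2 * x + rnorm(300, sd = x)     # noise sd grows with x

fit0 <- lm(y ~ x)                        # step 1: unweighted fit
vhat <- predict(smooth.spline(x, residuals(fit0)^2), x)$y  # step 2: smooth r^2
vhat <- pmax(vhat, 1e-6)                 # guard against negative variance estimates
fit1 <- lm(y ~ x, weights = 1 / vhat)    # step 3: weighted re-fit; iterate
```

The weights are inverse estimated variances, so observations in the noisy right-hand region count for less, which is exactly how weighted least squares sidesteps heteroskedasticity.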
Comment: Predictive comparisons were really a held-over topic from the previous lecture, and I am not quite happy with putting local polynomials here.
Posted by crshalizi at February 04, 2011 01:34 | permanent link
The bias-variance trade-off tells us how much we should smooth; introduction to the Oracle. Our ignorance of both bias and variance, now that the Oracles have fallen silent. Estimating the sum of bias and variance with cross-validation. Adaptation as a substitute for knowledge. Adapting to unknown roughness with cross-validation; detailed examples. Using kernel regression with multiple inputs: multivariate kernels, product kernels. Using smoothing to automatically discover interactions. Plots to help interpret multivariate smoothing results. Appendix: the multivariate Gaussian distribution.
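A sketch of adapting to unknown roughness by 5-fold cross-validation over bandwidths, using base R's ksmooth (my own toy, not the lecture's example):

```r
set.seed(11)
x <- runif(300, 0, 10); y <- sin(x) + rnorm(300, sd = 0.5)

cv.mse <- function(h, folds = 5) {
  fold <- sample(rep(1:folds, length.out = length(x)))
  mean(sapply(1:folds, function(f) {
    out <- ksmooth(x[fold != f], y[fold != f], kernel = "normal",
                   bandwidth = h, x.points = x[fold == f])
    truth <- y[fold == f][order(x[fold == f])]   # ksmooth sorts x.points
    mean((truth - out$y)^2, na.rm = TRUE)
  }))
}

hs <- seq(0.2, 5, by = 0.2)
h.best <- hs[which.min(sapply(hs, cv.mse))]
```

Too-small bandwidths lose on variance, too-large ones on bias; the cross-validated minimum sits in between, with no Oracle required.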
Posted by crshalizi at February 04, 2011 01:33 | permanent link
The three big uses of statistical models: as summaries of data; as predictive instruments; as scientific models. Evaluation depends on the use. Prediction is the goal which admits of the most definite evaluations; reducing the evaluation of scientific models to checking predictions (without necessarily becoming an instrumentalist). Evaluating predictions by their average errors: in-sample error distinguished from generalization error; the latter is what really needs to be controlled. A gesture in the direction of statistical learning theory. Over-fitting defined and illustrated. Cross-validation for estimating generalization error and for model selection. Forms of cross-validation; k-fold CV generally preferable to leave-one-out CV.
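For instance, choosing a polynomial degree by 5-fold CV (a sketch with synthetic data; in-sample MSE always falls as the degree grows, but the cross-validated MSE does not):

```r
set.seed(12)
x <- runif(100, -2, 2); y <- x^3 - x + rnorm(100, sd = 0.5)

cv.err <- function(d, folds = 5) {
  fold <- sample(rep(1:folds, length.out = length(x)))
  mean(sapply(1:folds, function(f) {
    fit <- lm(y ~ poly(x, d), subset = fold != f)   # fit on the other folds
    mean((y[fold == f] - predict(fit, data.frame(x = x[fold == f])))^2)
  }))
}

errs <- sapply(1:8, cv.err)
d.best <- which.min(errs)   # lands near the true degree, not the largest tried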
Posted by crshalizi at February 04, 2011 01:32 | permanent link
Using Taylor's theorem to justify linear regression locally. Collinearity. Consistency of ordinary least squares estimates under weak conditions. Linear regression coefficients will change with the distribution of the input variables: examples. Why R^2 is usually a distraction. Linear regression coefficients will change with the distribution of unobserved variables (omitted variable effects). Errors in variables. Transformations of inputs and of outputs. Utility of probabilistic assumptions; the importance of looking at the residuals. What "controlled for in a linear regression" really means.
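The omitted-variable point in a few lines (a synthetic example of my own):

```r
set.seed(13)
z <- rnorm(1000)        # a covariate we might fail to measure
x <- z + rnorm(1000)    # x is correlated with z
y <- x + 2 * z + rnorm(1000)

b.full  <- coef(lm(y ~ x + z))["x"]   # near the true coefficient, 1
b.short <- coef(lm(y ~ x))["x"]       # absorbs part of z's effect
```

The "short" coefficient is not wrong as a predictive summary of how y varies with x, but it is not the coefficient in the data-generating equation; this is the gap between what "controlled for" promises and what a regression can deliver when relevant variables are unobserved.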
Posted by crshalizi at February 04, 2011 01:31 | permanent link
Statistics is the science which studies methods for learning from imperfect data. Regression is a statistical model of functional relationships between variables. Getting relationships right means being able to predict well. The least-squares optimal prediction is the expectation value; the conditional expectation function is the regression function. The regression function must be estimated from data; the bias-variance trade-off controls this estimation. Ordinary least squares revisited as a smoothing method. Other linear smoothers: nearest-neighbor averaging, kernel-weighted averaging.
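A nearest-neighbor smoother in a few lines (my own sketch, to make "linear smoother" concrete):

```r
# Predict at each point of x0 by averaging the responses of the k nearest
# observations. OLS, kernel averaging, and k-NN are all linear smoothers:
# each prediction is a weighted sum of the observed y's; they differ only
# in how the weights are chosen.
knn.smooth <- function(x, y, x0, k = 20) {
  sapply(x0, function(pt) mean(y[order(abs(x - pt))[1:k]]))
}
```

Growing k smooths harder (more bias, less variance); shrinking it does the reverse, which is the bias-variance trade-off in its simplest form.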
Posted by crshalizi at February 04, 2011 01:30 | permanent link
Attention conservation notice: I have no taste.
I'm not sure why I read so many mysteries this month.
Books to Read While the Algae Grow in Your Fur; Pleasures of Detection, Portraits of Crime; The Dismal Science; Scientifiction and Fantastica; Writing for Antiquity; Enigmas of Chance; The Collective Use and Evolution of Concepts; Commit a Social Science; Philosophy; The Running Dogs of Reaction
Posted by crshalizi at January 31, 2011 23:59 | permanent link
For the first statistics seminar of 2011, we are very happy to welcome —
Posted by crshalizi at January 19, 2011 14:50 | permanent link
Attention conservation notice: Obvious reflections on a tired question, written down to get them out of my head while I work on other stuff, and posted to amuse connoisseurs with their naive presumption.
1. Obviously, macroeconomic phenomena are the aggregated (or, if you like, the emergent) consequences of microeconomic interactions. What else could they be? Analogously, the macroscopic physical properties of condensed matter all ultimately emerge from molecular interactions.
2. Macroeconomic theories which do not derive such phenomena from microscopic interactions are thus incomplete, and intellectually unsatisfying. Analogously, theories of condensed matter which do not derive the phenomena from molecular interactions are incomplete.
So: the true and complete theory of macroeconomics must emerge from the true and complete theory of microeconomics.
3. Incomplete theories are not (necessarily) false, or even lacking in value. (A model of the bulk properties of steel, or plastic, or bone, which doesn't include a derivation from molecular dynamics can be accurate, precise and useful. It could even be more accurate than a micro-founded model, if, e.g., we lack a precise understanding of the microscopic structure, or we can only calculate macroscopic consequences through crude approximations.)
4. If a well-established macro-level theory does not, currently, have any micro-foundations, the scientific approach would seem to be to dig those foundations, not to pull down the theory.
5. If a good macro-level theory cannot be founded on our current micro-level theory, this could be due to: (a) defects or weaknesses in our techniques for calculating aggregate consequences of micro-level interactions; (b) specifying the wrong sort of initial/boundary conditions, or interaction structures, in the microscopic models; (c) errors in our understanding of micro-level interactions and dynamics; (d) errors in our formulation of the macro-level theory.
There will certainly be some situations where (d) is right, e.g., because there is no possible way to derive the macro theory from any micro-level one. But it is hard for me to see why (d) should always be the preferred option in economics. Some adjustment of our various theories, models and techniques is required, but it seems mere prejudice that it should always be macro which adjusts. Even if one thought that standard microeconomic theory was very securely established and successful (which is dubious), even well-established and successful theories can contain systematic mistakes, which might (e.g.) only be detectable when one looks at aggregate consequences. (If nothing else, statistical power grows with sample size!) Or again: perhaps there's nothing wrong with the specification of what individuals are like and how they interact, but the simplification of always solving for the equilibrium is wrong, and while it's not very wrong for any one market over a short period, the error accumulates when one goes to the level of whole economies over years and decades.
To continue the analogy, circa 1900 classical mechanics and electromagnetism were extremely well-confirmed theories, in much better shape than microeconomics is. Nonetheless, any attempt to explain condensed matter physics on that basis, starting from molecular interactions, was doomed. (For instance, classical physics predicts that matter should be unstable.)
6. Even if one has a true microscopic theory, the best way to develop theories of macroscopic phenomena is not, necessarily, to start from a microscopic model. Sometimes it will be, but there's no reason to think that's a general rule for all problems. The answer might even vary from theorist to theorist, depending on skills, experience, etc.
7. Micro-founded models would be more suitable for policy-making only if it is easier to develop an accurate causal model of how individuals and their interactions respond to policy changes, and to aggregate the results, than it is to develop a macro-level causal model. Why should we think this?
For "economics", read "sociology", "political science", "ecology", etc., etc., as appropriate.
Update, 24 January: Both J. W. Mason at the Slack Wire and Tom Slee at Whimsley have written substantial responses which deserve detailed replies, but are unlikely to get them soon. Since I can't stand to work on tomorrow's lecture any more tonight, I will just say a few words about Slee's, which is easier for me to reply to.
First, effective evolutionary explanations resolve themselves into causal ones. This is because they contain feedback mechanisms which make selection operative. This in turn means that the current properties of organisms and populations are the result of the causal interactions of their ancestors with past environments, and so causes do indeed precede effects. To use a distinction which I believe originated with Monod, these explanations are teleonomic, not teleological.
Second, the issue Slee raises, of when descriptions and explanations in terms of coarse-grained macro-variables are not just more familiar but actually in some sense more effective than ones in terms of fine-grained micro-variables, is a very deep one. (Mason also brings this up, but invokes additional considerations which I don't feel I have time to go into now.) To my mind this is the heart of emergence, at least in the sense in which I can make sense of the word and don't find it trivial. I have tried to tackle this at length elsewhere, by giving an information-theoretic account of when a set of macroscopic variables, or more precisely the states defined by them, emerge from microstates by enabling more efficient and self-contained prediction at the higher level. My paper with Cris Moore (here or here) has details, though we presumed some familiarity with stat. mech. (I also took a stab at connecting this to cognitive science; and you could always read the last chapter of my dissertation, if you want to chance death by boredom.) I am not, obviously, well-placed to judge my own efforts in this line, but if it's even roughly right, then there is, indeed, no contradiction between insisting on reductionist accounts for higher-level phenomena, and pursuing autonomous (or nearly autonomous) causal explanations in terms of higher-level variables. Whether the variables used in current macroeconomics have the right properties is, of course, a different question. I should also mention, in this connection, Clark Glymour's "When Is a Brain Like the Planet?"; I believe Clark's arguments mesh very well with those in my papers, though we've never hashed that out and he might disagree.
Third, I suspect anyone who likes Slee's examples will also enjoy Wolfgang Beirl's notes on the predictability of physicists, and vice versa.
Mason next time, inshallah.
Manual trackback: The Slack Wire; Beyond Microfoundations; Blake Riley; Grasping Reality with $Numerosity $Instrumentalities; Critiques of Libertarianism; Whimsley; Critiques of Collectivism; the blog formerly known as The Statistical Mechanic; D2 Digest (I agree; see point 5 above); Unfogged [I have explained why I do not find "supervenience" a useful notion elsewhere]
Posted by crshalizi at January 18, 2011 20:45 | permanent link
I'm speaking at the CMU philosophy department's colloquium this week. I do not pretend to fully understand how this happened, but no doubt by the end of the day I will enjoy a simultaneously higher and more profound level of puzzlement about many matters.
Posted by crshalizi at January 17, 2011 12:45 | permanent link
Attention conservation notice: A 2000-word attempt to reduce decades of painstaking empirical work and careful theorizing in economic geography to a back-of-the-envelope calculation; includes a long quotation from a 19th century textbook of political economy. An outtake from a post that turned into a paper-in-progress, posted now because I'm stuck on a proof in another paper, and don't want to work on writing the next problem set for 402.
Physicists are fond of a kind of rough estimation exercise they call "Fermi problems", since our folklore attributes them to the great Enrico Fermi. A classic instance is the one I first encountered, as a physics undergrad at Berkeley: how many piano tuners are there in the East Bay? Well, there are about a million people living around the eastern shore of the San Francisco Bay, i.e., on the order of 10^6. How many people are there per piano? 10 per piano seems high, but 10,000 per piano seems low, say 10^3 per piano. How often does a piano need to be tuned? Clearly not every day, or even every week, but also not once a decade, so something like once a year. Thus the East Bay needs about 10^3 piano-tunings per year. How quickly can a piano be tuned? Probably in less than a week but more than an hour, so something like a day, or about 10^-2 years. So there should be about 10 piano tuners in the East Bay. The professor, having elicited these numbers, then told us to "look it up in the phone book"; having pulled the same stunt myself since then, I can tell you that any number between 5 and 50 will be declared "the right order of magnitude".
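The chain of round numbers above can be written out as a few lines of arithmetic (here in Python; every quantity is a guess from the estimate, not data):

```python
# Fermi estimate of piano tuners in the East Bay, using the round numbers above.
population = 10**6               # people in the East Bay
people_per_piano = 10**3         # between 10 and 10^4, so call it 10^3
tunings_per_piano_per_year = 1   # not weekly, not once a decade
years_per_tuning = 10**-2        # about a day of work, as a fraction of a year

pianos = population / people_per_piano                   # ~10^3 pianos
tunings_per_year = pianos * tunings_per_piano_per_year   # ~10^3 tunings demanded
tuners = tunings_per_year * years_per_tuning             # ~10 full-time tuners
print(tuners)  # 10.0
```

Anything between 5 and 50 would, as the professor said, count as "the right order of magnitude".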
Suppose we were interested not in greater San Francisco but Stewart Township, Pennsylvania, the site of Fallingwater: how many piano tuners does it have? Stewart Township has a population of 7*10^2, and our reasoning above says it's got something like one piano, and so demands one day of piano-tuning per year. What does the piano tuner do the rest of the time? They could be an ordinary citizen, who only becomes a piano tuner once a year when it's called for. Or it could be that Stewart Township shares a specialist piano-tuner (or three) with the 6*10^5 other people of the Laurel Highlands. Since tuning a piano is a reasonably demanding skill, it's much more likely that it's done by a specialist.
What goes for piano tuners goes for other specialists. Most people need their skills rarely, or need only a small fractional share of their output, or need it only indirectly. (You want to hear piano music, so the pianist needs to find a tuner.) Small settlements cannot keep them occupied full time. But there are fixed costs to specialist services --- tools, of course, but more essentially the time and effort needed to acquire, maintain and develop the specialist's skills. It is more efficient for one specialist to serve many people, thereby spreading the fixed costs over many customers, which rules out the part-time amateur in each village. (More exactly, since the local amateurs lack the skills to do the job well, they can only compete with the specialists by being much cheaper, or if customers can't tell the difference.) This will tend to divide up a dispersed population into regions served by one or another specialist; increasingly specialized skills will require increasingly large population bases.
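The same round numbers give the population base needed to keep one tuner busy full time (a sketch; the ~10^2 tunings per working year is implied by the day-per-tuning figure in the Fermi estimate):

```python
# Minimum population needed to support one full-time piano tuner,
# using the same round numbers as the Fermi estimate.
people_per_piano = 10**3              # guessed people per piano
tunings_per_piano_per_year = 1        # each piano tuned about once a year
tunings_per_tuner_per_year = 10**2    # a day per tuning, ~10^2 working days a year

min_population = (people_per_piano * tunings_per_tuner_per_year
                  / tunings_per_piano_per_year)
print(min_population)  # 100000.0, i.e. ~10^5 people
# Stewart Township (7*10^2 people) is far below this threshold;
# the Laurel Highlands (6*10^5) can support a handful of specialists.
```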
It is not required by this argument that the specialists be located near each other; but it tends to happen. After all, they need each other's services, and being located near each other reduces transport costs for them, and there will often be economies of scope in setting up specialists near each other. (If everyone needs to take or make freight deliveries, they can share one set of loading docks, etc.) If demand is high enough to support multiple specialists, there can be "agglomeration economies": they can begin to benefit from each other by sharing information and knowledge, creating a local market for their specialist suppliers, etc. There is a famous passage from Alfred Marshall (in 1890) which is traditionally trotted out on these occasions, and far be it from me to break with tradition:
When an industry has thus chosen a locality for itself, it is likely to stay there long: so great are the advantages which people following the same skilled trade get from near neighbourhood to one another. The mysteries of the trade become no mysteries; but are as it were in the air, and children learn many of them unconsciously. Good work is rightly appreciated, inventions and improvements in machinery, in processes and the general organization of the business have their merits promptly discussed: if one man starts a new idea, it is taken up by others and combined with suggestions of their own; and thus it becomes the source of further new ideas. And presently subsidiary trades grow up in the neighbourhood, supplying it with implements and materials, organizing its traffic, and in many ways conducing to the economy of its material.
Again, the economic use of expensive machinery can sometimes be attained in a very high degree in a district in which there is a large aggregate production of the same kind, even though no individual capital employed in the trade be very large. For subsidiary industries devoting themselves each to one small branch of the process of production, and working it for a great many of their neighbours, are able to keep in constant use machinery of the most highly specialized character, and to make it pay its expenses, though its original cost may have been high, and its rate of depreciation very rapid.
Again, in all but the earliest stages of economic development a localized industry gains a great advantage from the fact that it offers a constant market for skill. Employers are apt to resort to any place where they are likely to find a good choice of workers with the special skill which they require; while men seeking employment naturally go to places where there are many employers who need such skill as theirs and where therefore it is likely to find a good market. The owner of an isolated factory, even if he has access to a plentiful supply of general labour, is often put to great shifts for want of some special skilled labour; and a skilled workman, when thrown out of employment in it, has no easy refuge. Social forces here co-operate with economic: there are often strong friendships between employers and employed: but neither side likes to feel that in case of any disagreeable incident happening between them, they must go on rubbing against one another: both sides like to be able easily to break off old associations should they become irksome. These difficulties are still a great obstacle to the success of any business in which special skill is needed, but which is not in the neighbourhood of others like it: they are however being diminished by the railway, the printing-press and the telegraph.
What we have argued ourselves into, on the basis of little more than a realization that comparatively high fixed costs matter, is to think that there should be spatial clumps of economic activity, where we find a lot of specialists, and that these clumps should come in grades, with more clumps containing less-specialized enterprises with less-increasing returns, and fewer clumps containing the more-specialized, more-increasing-returns enterprises. We call the clumps "towns" and "cities". (And indeed, if I can trust my searching, the nearest piano tuner to Fallingwater is located in the town of Connellsville, population 9*10^3.) The gradations of the clumps form the "hierarchy of urban places", an idea which has been familiar to economic geographers since at least the work of Christaller and Lösch in the 1930s. It implies that there isn't just quantitatively more economic activity in a bigger settlement, but generally different kinds of activity. Stewart Township is not a scaled-down version of Connellsville, which is not a scaled-down Pittsburgh, which is not a scaled-down Chicago or New York.
Moreover, the argument applies more generally than to specialized services alone. It turns on having low marginal costs (a day of a heart surgeon's time to do an operation) compared to high fixed costs (ten years of training to become a heart surgeon). But the fixed costs don't have to be time, and similar logic will work for just about any industry with increasing returns, if transport costs are not prohibitive. So as we move up the hierarchy of urban places, we should find not only more, and more specialized, service providers, but also more industries with increasing returns, and, you should forgive the expression, increasingly increasing returns at that. One way industries come to have increasing returns is by being relatively capital- (as opposed to labor-) intensive, which will tend to increase the output per worker.
All of the above applies with great force to creating and disseminating new abstract, formalized, discursive knowledge. It is highly specialized, the fixed costs of entering are very high, economies of scope are important, the effects of agglomeration are important, and the cost of transporting the finished product is zero. All else being equal, we should expect knowledge production to be concentrated towards the top of the urban hierarchy.
All of this is, as I said, very standard stuff in economic geography and urban and regional economics. I learned much of it at (pretty literally) my father's knee, and it was old when he learned it from his teachers. (There is even a version of it in ibn Khaldun's Muqaddimah, from 1377: see ch. 5, sec. 15--22 [pp. 314--318 of the Rosenthal/Dawood translation] on the crafts, and again ch. 6, sec. 7--8 [pp. 340--343] on the sciences.) Of course the version I gave above was a bit of a cheat, in at least two ways. First, it was a story about how a certain outcome would be efficient, but that efficiency rested on a lot of unspoken or hinted-at premises about the relative sizes of different sorts of costs and values. (How many camel caravans are there in the East Bay?) Second, even granting the efficiency, would it really be brought about by the acts of interacting decision-makers, in the absence of a super-detailed coordinating plan?
Both of these questions, but especially the latter, have been the focus of a lot of very interesting work in economics over the last few decades. (Filial piety requires me to recommend this paper as an overview, but it's good, so that's easy to do.) One of those involved in this has been none other than Paul Krugman, who was one of the people who realized that new techniques for modeling imperfect competition with increasing returns could be used to attack the origin of cities and of industrial clusters. One of the things he also realized is that the problem of where the specialists should locate themselves is one of symmetry breaking, just like many kinds of pattern formation from physics — and named it as such, in a lovely little book from 1996, The Self-Organizing Economy. A later book, Fujita, Krugman and Venables's The Spatial Economy, elaborated on that analysis, showing how mixing increasing returns with the logic of comparative advantage leads naturally to spatial patterns of what can only be called combined and uneven development, again through symmetry breaking. (The nucleation of a high-productivity center not only inhibits the growth of other centers near it, it de-industrializes its periphery.) In my humble supremely arrogant opinion, this is one of the few places where interesting ideas from physics have been productively used in the social sciences.
Update, next day: typo fixed, thanks to Cris Moore.
Posted by crshalizi at January 15, 2011 17:45 | permanent link
Posted by crshalizi at January 14, 2011 12:25 | permanent link
Posted by crshalizi at January 11, 2011 10:30 | permanent link
<pomposity level="more than usual"> As may be verified from the date-stamps (and confirmed, in case there should be any question, through the Wayback Machine), I posted my neutral model of scientific inquiry days and days before the appearance of the risible ESP paper, and the nearly equally risible New Yorker piece on something being wrong with "the scientific method". (I shall not dignify either with a direct link.) Before, yet not so long before! Clearly, this is no mere coincidence! A vulgar mind, bound to what it misleadingly regards as material "realities", might suggest that I was led to write about a long-standing pet idea when several acquaintances who had been exposed to the pre-publication publicity for the ESP paper asked me what I thought of it. Higher beings, on the contrary, will clearly perceive that I now possess powers of prediction which allow me to see through the mist of time itself as though it were clear mountain air. (As those links suggest, I attribute the development of my powers to rigorously following the secret ascetic practices transmitted to me by the ascended masters of the turquoise trail, whom I sought out in the demon-haunted western deserts many years ago.) While it is gratifying to have so many people bring these proofs of my pre-cognitive abilities to my attention (gratifying, yet, in the nature of the case, quite unsurprising), you may now cease to do so. My spiritual energies are currently fully devoted to helping my students achieve enlightenment, and I must leave the pre-refuted to their fates. </pomposity>
Manual trackback: An Ergodic Walk
Posted by crshalizi at January 07, 2011 16:52 | permanent link
Papers finished during 2010: 7
Papers written in response to "what's up with this?" e-mails from readers: 1
Papers accepted: 1
Papers in refereeing limbo: 2
Papers where I am grumbling about the third referee: 3
Papers which will be submitted next week, after giving typos a chance to ripen and become obvious: 1
Papers rejected: 0
Papers with co-authors waiting on my contributions: 2
"We should totally write a paper about this" conversations with non-trivial follow-up: 7
Manuscripts refereed: 34, for 14 journals and conferences
Manuscripts waiting for me to referee: 2
Manuscripts for which I was the responsible associate editor at Annals of Applied Statistics: 6
Grant proposals submitted: 4
Proposals funded: 1
Proposals in refereeing limbo: 2
Proposals rejected: 1
Weblog posts written: 80
Substantive posts written: 34, counting algal growths
Books started: 202
Books finished: 186
Books bought: 366
Books sold: 304
Books donated: 750
Book manuscripts completed: 0
Major life changes: 1
Posted by crshalizi at January 01, 2011 16:45 | permanent link