Attention conservation notice: Puffery about a new manuscript on the statistical theory of some mathematical models of networks. In the staggeringly unlikely event this is actually of interest to you, why not check back later, and see if peer review has exposed it all as a tissue of fallacies?
A new paper, which I flatter myself is of some interest to those who care about network models, or exponential families, or especially about exponential families of network models:
Obligatory Disclaimer: Ale didn't approve this post.
This started because Ale and I shared an interest in exponential family random graph models (ERGMs), whose basic idea is sheer elegance in its simplicity. You want to establish some distribution over graphs or networks; you decree some set of functions of the graph to be the sufficient statistics; and then you make the log probability of any given graph proportional to a weighted sum of these statistics. The weights are the parameters, and this is an exponential family.1. They inherit all of the wonderfully convenient mathematical and statistical properties of exponential families in general, e.g., finding the maximum likelihood estimator by equating expected and observed values of the sufficient statistics. (This is also the maximum entropy distribution, though I set little store by that.) They are also, with judicious choices of the statistics, quite spiffy-looking network models. This paper by Goodreau et al., for instance, is exemplary in using them to investigate teenage friendship networks and what they can tell us about general social mechanisms, and deserves a post of its own. (Indeed, a half-written post sits in my drafts folder.) This is probably the best class of statistical models of networks now going, which I have happily taught and recommended to students, with a special push for statnet.
What Ale and I wanted to do was to find conditions under which maximum likelihood estimation would be consistent --- when we saw more and more data from the same source, our estimates of the parameters would come closer and closer to each other, and to the truth. The consistency of maximum likelihood estimates for independent observations is classical, and but networks, of course, are full of dependent data. People have proved the consistency of maximum likelihood for some kinds of models of time series and of spatial data, but those proofs (at least the ones we know) mostly turned on ordering or screening-off properties of time and space, lacking in arbitrary graphs. Those which didn't turned on the "blocking" trick, where one argues that widely-separated events are nearly independent, and so approximates the dependent data by independent surrogates, plus weak corrections. This can work with random fields on networks, as in this excellent paper by Xiang and Neville, but it doesn't seem to work for models of networks, where distance itself is endogenous.
I remember very distinctly sitting in Ale's office on a sunny October afternoon just over a year ago2, trying to find some way of making the blocking trick work, when it occurred to us that maybe the reason we couldn't show that estimates converged as we got more and more data from the same ERGM was that the very idea of "more and more data from the same ERGM" did not, in general, make sense. What exactly prompted this thought I do not recall, though I dare say the fact that we had both recently read Lauritzen's book on sufficiency, with its emphasis on repetitive and projective structures, had something to do with it.
The basic point is this. Suppose we observe a social network among (say) a
sample of 500 students at a high school, but know there are 2000 students in
all. We might think that the whole network should be described by some ERGM or
other. How, however, are we to estimate it from the mere sample? Any graph
for the whole network implies a graph for the sampled 500 students, so the
toilsome and infeasible, but correct, approach would be to enumerate all
whole-network graphs compatible with the observed sample graph, and take the
likelihood to be the sum of their probabilities in the whole-network ERGM. (If
you do not strictly know how large the whole network is, then I believe you are
strictly out of luck.) This is not, of course, what people actually do.
Rather, guided by experience with problems of survey sampling, regression, time
series, etc., they have assumed that the same ERGM, with the same
sufficient statistics and the same parameter values, applies to both the whole
network and to the sample. They have assumed, in other words, that the ERGMs
Once you recognize this, it turns out to be straightforward to show that projectibility imposes very strong restrictions on the sufficient statistics --- they have to obey a condition about how they "add up" across sub-graphs which we called3 "having separable increments". This condition is "physically" reasonable but not automatic, and I will not attempt to write it out in HTML. (Read the paper!) Conversely, so long as the statistics have such "separable increments", the exponential family is projectible. (Pinning down the converse was the tricky bit.) Once we have this, conditions for consistency of maximum likelihood turn out to be straightforward, as all the stuff about projectibility implies the change to the statistics when adding new data must be unpredictable from the old data. The sufficient statistics themselves form a stochastic process with independent increments, something for which there is a lot of convergence theory. (This does not mean the data must be independent, as we show by example.) All of these results prove to be perfectly general facts about exponential families of dependent variables, with no special connection to networks.
The punch-line, though, is that the most commonly used specifications for ERGMs all include — for good reasons! — statistics which break projectibility. Models with "dyadic independence", including the models implicit or explicit in a lot of community discovery work, turn out to be spared. Anything more sophisticated, however, has got a very real, though admittedly somewhat subtle, mathematical pathology. Consistency of estimation doesn't even make sense, because there is no consistency under sampling.
We have some thoughts on where this leaves statistical models of networks, and especially about how to actually move forward constructively, but I will let you read about them in the paper.
Update, next day: fixed typos, clarified a sentence and added a reference.
1: Or if, like me, you were brought up in statistical mechanics, a Boltzmann-Gibbs ensemble, with the statistics being the extensive thermodynamic variables (think "volume" or "number of oxygen molecules"), and the parameters their conjugate intensive variables (think "pressure" or "chemical potential of oxygen"). If this line of thought intrigues you, read Mandelbrot.
2: With merely a year between the idea and the submission, this project went forward with what is, for me, unseemly haste.
3: We couldn't find a name for the property the statistics needed to have, so we made one up. If you have encountered it before, please let me know.
Posted by crshalizi at November 14, 2011 22:00 | permanent link