There is an old Soviet-era joke about a nail factory which was assigned a
target, under the five year plan, of 1600 tons of nails, and spent the whole
five years producing a *single* gargantuan nail weighing (of course)
1600 tons. The joke illustrates not only the follies of "actually existing
socialism", but a broader problem with using quantitative performance targets,
namely that people will tend to adjust their efforts to meet the quantitative
criteria, which can be only very poorly
related to the real job they are supposed to be doing. This is not to say
that objective performance criteria are always bad, because often the
alternative is subjective evaluations by superiors, i.e., prejudice and
caprice; but it does point to the need to carefully design those criteria, so
that, as far as possible, they track what you actually want to have happen, and
not just what's easy to measure or to calculate.

One place where easy calculation threatens to overwhelm substantive validity is in "bibliometrics", or the use of numerical methods to study patterns of scientific publication. For many years now, scientific journals have been advertising their "impact factor", as determined by ISI/Thomson Scientific, which is roughly the number of citations (as tracked by ISI/Thomson) to that journal, divided by the number of papers published in the journal. The idea is that journals with high impact factors are ones which publish articles people take note of, and go on to cite. Now, leaving to one side the big gap between "is cited a lot" and "is good science", there are huge, glaring holes in this as a way of measuring the quality or influence of a journal. An obvious one is that a citation from the World Journal of Cartesian Snooker and Even More Obscure Problems means much less than one from Nature. But another problem, perhaps even larger, is that different fields have different patterns of citation.

A stereotypical math paper, for example, will use a huge number of previously existing results, but contain very few citations, on the presumption that most of those results are assimilated background which its readers have already absorbed from any number of standard sources. If I write a paper on stochastic processes, I might well use the ergodic theorem for Markov chains, which says (roughly) that there is a way of assigning probabilities to states which is invariant under the chain's dynamics, and moreover the fraction of time any sufficiently long trajectory spends in any one state is approximately equal to that state's probability. This is a result with a very intricate history, going back to Markov himself in his struggles with his arch-enemy, but I'd look ridiculous if I cited any of this history, or even a textbook like Grimmett and Stirzaker. On the other hand, sociologists have a reputation for providing as many citations as possible for absolutely everything, and a pious habit of referring back to the 19th and early 20th century Masters. A leading sociology journal, then (say, American Journal of Sociology) might have an impact factor of around 5, while a leading mathematics journal (say, Annals of Probability) would have one significantly lower, even though both are near the top of their respective prestige hierarchies.

Now, you *could* say this is just another reason why we shouldn't try
to rank journals. But there *are* times when doing things like this is
going to be very helpful, e.g. when trying to decide which journals to spend a
limited subscription budget on. So it would be nice if there were a way of
doing something *like* this, which corrected for problems like the
differences in citation customs across academic tribes.

One way to imagine doing this is as follows. Pick a completely random
journal, and a random article from that journal. Now pick one of its
references, again completely at random, and follow it up. Repeat this process
by following a random reference in *that* paper, until you come to a
dead end, namely a citation to something outside of your data set. Pick
another random starting point and repeat, many times. Looking back over your
random walks through the scientific literature, how much time did you spend in
any given journal? It's not hard to convince yourself that you will spend more
time in journals whose papers are highly cited by papers in other journals
which are themselves highly cited. If you come to a paper with many
references, you are that much less likely to follow any one of them, and so you
will spend less time, all else being equal, on each of its references than you
will on the references of papers which are more sparing of citation. Saying
"influential journals are ones which are often cited by influential journals"
makes the definition *sound* hopelessly circular, but the random walk
procedure makes it clear that it's not, or at least not *hopelessly* so.
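
To make the procedure concrete, here is a minimal sketch of those random walks in Python. Everything in it is made up for illustration: the `PAPERS` table, the journal names, the function name and the `max_steps` guard are my assumptions, not anything from a real citation database. It also makes one choice the description above leaves open, namely that only papers reached by following a reference count as "time spent", not the random starting points.

```python
import random
from collections import Counter

# Hypothetical toy data: paper -> (journal, list of cited papers).
# A reference to something not in PAPERS is a dead end (outside the data set).
PAPERS = {
    "a1": ("A", ["b1", "c1"]),
    "a2": ("A", ["c1", "outside"]),
    "b1": ("B", ["c1", "c2"]),
    "c1": ("C", ["outside"]),
    "c2": ("C", []),
}


def journal_visit_shares(papers, n_walks=10_000, max_steps=1_000, seed=0):
    """Estimate the share of walk steps that land in each journal."""
    rng = random.Random(seed)
    by_journal = {}
    for paper, (journal, _) in papers.items():
        by_journal.setdefault(journal, []).append(paper)
    journals = sorted(by_journal)

    visits = Counter()
    for _ in range(n_walks):
        # Pick a random journal, then a random article from that journal.
        paper = rng.choice(by_journal[rng.choice(journals)])
        for _ in range(max_steps):  # guard against citation cycles
            refs = papers[paper][1]
            if not refs:
                break  # no references at all: dead end
            paper = rng.choice(refs)
            if paper not in papers:
                break  # reference leaves the data set: dead end
            visits[papers[paper][0]] += 1  # count the journal we arrived in

    total = sum(visits.values()) or 1
    return {j: visits[j] / total for j in journals}


shares = journal_visit_shares(PAPERS)
for journal, share in shares.items():
    print(journal, round(share, 3))
```

In this toy network nobody cites journal A, so the walkers never arrive there, while C, which is cited by both A and B, soaks up most of the visits.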

It turns out that the random walk scheme is computationally very demanding (you need a lot of random walkers, taking a lot of very long walks, to get good results), but there is a short cut. The random process I've described is a well-behaved Markov chain. The ergodic theorem now tells us that a time average (how often does the walk hit a given journal?) can be replaced with a "space" average (what is the probability of being at a given journal?), where the probability weights are left unchanged by the action of the Markov chain. Finding this invariant distribution is an exercise in linear algebra; specifically, it is the leading eigenvector of the chain's transition matrix. (One of the beauties of the theory of Markov processes is how it lets us replace nasty nonlinear problems about individual trajectories with clean linear problems about probabilities.) And there are very nice, very fast algorithms for finding eigenvectors, even of very large matrices.
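
As a rough illustration of the short cut, here is a sketch of power iteration, one of the simplest ways of finding a leading eigenvector, applied to a made-up journal-to-journal citation table. The `CITES` counts and the function name are hypothetical. So is the way dead ends are handled: I assume a walker at a journal with no outgoing citations restarts at a uniformly random journal, and that with probability `1 - alpha` any walker does the same, as in PageRank. That is my assumption to keep the chain well-behaved, not a description of how eigenfactor.org does it.

```python
def stationary_distribution(citations, alpha=0.85, tol=1e-12, max_iter=10_000):
    """Invariant distribution of a journal-level walk, by power iteration.

    citations[i][j] is the number of citations from journal i to journal j.
    """
    n = len(citations)

    # Build the row-stochastic transition matrix P.
    P = []
    for row in citations:
        total = sum(row)
        if total == 0:
            # Dead end: restart at a uniformly random journal (assumption).
            P.append([1.0 / n] * n)
        else:
            # Follow a citation with prob. alpha, else jump uniformly.
            P.append([alpha * c / total + (1 - alpha) / n for c in row])

    # Repeatedly apply pi <- pi P until it stops changing.
    pi = [1.0 / n] * n
    for _ in range(max_iter):
        new = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, pi)) < tol:
            return new
        pi = new
    return pi


# Hypothetical counts for journals A, B, C (rows cite columns).
CITES = [
    [0, 1, 2],  # A cites B once and C twice
    [0, 0, 2],  # B cites C twice
    [0, 0, 0],  # C cites nothing inside the data set
]
weights = stationary_distribution(CITES)
print([round(w, 3) for w in weights])
```

The vector it returns satisfies pi = pi P up to the tolerance, which is the "left unchanged by the action of the Markov chain" condition. Each pass costs one matrix-vector product, with no random walkers anywhere.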

Thus the reasoning behind eigenfactor.org, the latest brainstorm from Carl Bergstrom's lab — most of the actual code and elbow-grease being provided by Jevin West and Ben Althouse. It covers all the journals that impact factor would, but also gives an estimate of the impact of citations to non-journals (which lets us see that some software is more influential than some journals). Plus you get to see all kinds of useful things about how much the journals cost (something Carl's been interested in for some time), and how that breaks down by paper or by citation. All in all, it's a very fun and potentially very useful tool for anyone interested in the academic publishing system, and/or applications of Markov chains.

*Disclaimer*: Rumors that Carl arranged for
me to publicize everything his lab does in this weblog in exchange for beers
from his private collection whenever I'm in Seattle are, sadly,
exaggerated.

*Manual trackback*: Geomblog; Muck and Mystery; Outsider; Structure+Strangeness;
Flags and Lollipops; Dan O'Huiginn; MetaFilter; Yorkshire Ranter

(Thanks to Owen "Vlorbik" Thomas for typo correction.)

Networks; Learned Folly; Enigmas of Chance; Incestuous Amplification

Posted by crshalizi at March 20, 2007 21:08