October 27, 2011

Bayesianism Not Banned in Britain

Attention conservation notice: 4600 words on a legal ruling in another country, from someone who knows nothing about the law even in his own country. Contains many long quotations from the ruling, plus unexplained statistical jargon; written while trapped in an airport trying to get home, and so probably excessively peevish.

Back at the beginning of the month, as constant readers will recall, there was a bit of a kerfuffle over newspaper reports (starting with this story in the Guardian, by one Angela Saini) to the effect that a judge had ruled the application of Bayes's rule was inadmissible in British courts. This included much wailing and gnashing of teeth over the innumeracy of lawyers and the courts, anti-scientific obscurantism and injustice, etc., etc. At the time, I was skeptical that anything like this had actually happened, but had no better information than the newspaper reports themselves. A reader kindly sent me a copy of the judgment by the court of appeals [PDF], and US Airlines kindly provided me with time to read it.

To sum up what follows, the news reports were thoroughly misleading: the issue in the case was the use not of Bayes's rule but of likelihood ratios; the panel of three judges (not one judge) affirmed existing law, rather than making new law; the existing law allows for the use of likelihood ratios and of Bayes's theorem when appropriate; and the court gave sound reasons for thinking that their use in cases like this one would be mere pseudo-science. We are, then, listening to the call-in show on Radio Yerevan:

Question to Radio Yerevan: Is it correct that Grigori Grigorievich Grigoriev won a luxury car at the All-Union Championship in Moscow?

Answer: In principle, yes. But first of all it was not Grigori Grigorievich Grigoriev, but Vassili Vassilievich Vassiliev; second, it was not at the All-Union Championship in Moscow, but at a Collective Farm Sports Festival in Smolensk; third, it was not a car, but a bicycle; and fourth he didn't win it, but rather it was stolen from him.

Taking advantage again of the generous opportunities provided to me by US Airlines, I will try to explain the case before the court, and what it decided and why. [Square brackets will indicate the numbered paragraphs of the judgment.] I will not fisk the news story (you can go back and read it for yourself), but I will offer some speculations about who found this eminently sensible ruling so upsetting, and why, that we got treated to this story.

The Judgment

The case (Regina vs. T.) was an appeal of a murder conviction. The appeal apparently raised three issues; the only one not redacted in the public judgment is "the extent to which evaluative expert evidence of footwear marks is reliable and the way in which it was put before the jury" [1]. One of the pieces of evidence claimed to identify T. as the murderer (in fact, it seems to be the main one) was the match between shoe marks found at the scene of the murder and those of a pair of "trainers" (what I believe we'd call "sneakers") "found in the appellant's house after his arrest" [19]. A forensic technician, one Mr. Ryder, compared the prints and concluded, in a written report, that there was "a moderate degree of scientific evidence to support the view that the [Nike trainers recovered from the appellant] had made the footwear marks" [24]. This report was entirely qualitative and contained no statistical formulas or results of any kind. This, however, did not reflect how the conclusion was actually reached, as I will come to shortly.

Statistics were mentioned during the trial. T.'s lawyers (who seem rather hapless and were not retained on appeal) cross-examined Ryder about

figures in the UK over 7--8 years for the distribution of Nike trainers of the same model as that found in the appellant's house; some figures had been supplied to him by the defence lawyers the day before. Mr. Ryder gave evidence that there were 1,200 different sole patterns of Nike trainers; the pattern of Nike trainers that made the marks on the floor was encountered frequently and had been available since 1995; distribution figures for the pattern were only available from 1999. In the period 1996--2006 there would have been 786,000 pairs of trainers distributed by Nike. On those figures some 3% were size 11 [like those in question: CRS]. The pattern could also have been made by shoes distributed by Foot Locker and counterfeits of Nike shoes for which there were no figures. In answer to the suggestion that the pattern on the Nike trainers found at the appellant's house was of a common type, he said: "It is just one example of the vast number of different shoes that are available and to put the figures into context, there are around 42 million pairs of shoes sold every year so if you put that back over the previous 7 or 8 years, sports shoes alone, that multiplies up to nearly 300 million pairs of sports shoes so that particular number of shoes, produced which is a million, based on round numbers, is a very small proportion." [42]
These figures were repeated, with emphasis, by the trial judge in his instructions to the jury [44].

I said a moment ago that Ryder's written report, pre-trial, was entirely qualitative. This turns out to not really reflect what he did. In addition to looking at the shoes and the shoe-prints, he also worked through a likelihood ratio calculation, as follows [34--38]. The two hypotheses he considered were, as nearly as I can make out, "These prints were made by these shoes", and "These prints were made by some other shoe, randomly selected from all of the UK". (I will come back to these alternatives.) He considered that there were four variables he could work with: the pattern of the print, the size, the amount of wear, and the amount of damage to the shoe.

Pattern
The pattern of the marks at the scene matched that of the shoes recovered from T.'s house. Presumably this had probability (close to) 1 if those shoes left those prints. What probability did they have under the alternative? Ryder took this to be the frequency of that pattern in a database maintained by the Forensic Science Service (FSS), which contained "shoes received by the FSS" [36i] and was not intended to be a representative sample. This pattern was the most common one in the FSS database, and in fact had a frequency of 20%. So this gave a contribution to the likelihood ratio of 1/0.2 = 5.
Size
Both the shoes and the prints were size 11 (roughly, for the prints), and 3% of the shoes in a database run by a shoe trade association were of that size. (It is not clear to me if this was conditional on the pattern, or if Ryder assumed independence between pattern and size.) Ryder used a likelihood ratio of not 1/0.03 but merely 1/0.10, apparently to allow for imprecision in guessing the size of a shoe from a print.
Wear
"Ryder considered that the wear on the trainers meant that he could exclude half of the trainers of this pattern type and approximate size/configuration. He therefore calculated the likelihood ratio ... as 1/0.5" [36iii].
Damage
"He concluded that he could exclude very few pairs of shoes that could not previously have been excluded by the other factors" [36iv].
Putting this together, Ryder came up with a likelihood ratio of 5*10*2=100 in favor of the shoes at the crime scene being those from T.'s house.
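
As a minimal sketch of that arithmetic: the variable and function names below are mine, not the judgment's; the damage factor is treated as contributing nothing, which is an assumption based on the "very few" exclusions quoted above; and multiplying the factors presumes they are independent under each hypothesis, which is exactly what is unclear.

```python
from math import prod

# Probability of each observed feature under the alternative hypothesis
# ("some other shoe made the marks"), as reported in the judgment [36].
# Under the hypothesis that T.'s shoes made the marks, each is taken as 1.
p_under_alternative = {
    "pattern": 0.20,  # frequency of the pattern in the FSS database
    "size": 0.10,     # Ryder's more cautious figure, rather than 0.03
    "wear": 0.50,     # half of the remaining shoes excluded
    "damage": 1.00,   # assumption: "very few" further exclusions, so none
}

def likelihood_ratio(p_alt, p_same=1.0):
    """Product of per-feature ratios p_same / p_alt (assumes independence)."""
    return prod(p_same / p for p in p_alt.values())

lr = likelihood_ratio(p_under_alternative)
print(round(lr))  # 100
```

Swapping 0.03 back in for the size factor would roughly triple the ratio, which gives some sense of how much the answer depends on these loosely sourced inputs.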

He then turned to a scale which had been plucked from the air (to put it politely) by some forensics policy entrepreneurs a few years before, which runs as follows [31]:

Likelihood ratio       Verbal equivalent
>1--10                 Weak or limited support
10--100                Moderate support
100--1,000             Moderately strong support
1,000--10,000          Strong support
10,000--1,000,000      Very strong support
>1,000,000             Extremely strong support
This is where Ryder's phrase "a moderate degree of scientific evidence" came from. Or, sort of:
In Mr Ryder's reports for the trial... there was no reference at all to any of these statistics, the formula [for the likelihood ratio], or to the use of a likelihood ratio or to the scale of numerical values set out [above]. The conclusion in his first report, which was supported by the statistics, formula, and resulting likelihood ratio, was expressed solely in terms of the verbal scale... this was dated one day after the notes in which he had recorded his calculations. Mr Ryder's explanation for the omission was that it was not standard practice for the detail relating to the statistics and likelihood ratios to be included in a report. He made clear that the data were not available to an exact and precise level and it was only used to confirm an opinion substantially based on his experience and so that it could be expressed in a standardised form. [38]
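
For concreteness, the verbal scale can be written as a simple lookup. This is only a sketch: the function name is hypothetical, and because the published bands share their endpoints (100 sits in both "10--100" and "100--1,000"), I have assumed that each upper endpoint belongs to the lower band, which is the reading under which a ratio of exactly 100 comes out as "moderate".

```python
# Bands of the verbal scale [31], as (upper bound, wording).
# Assumption: each upper bound is inclusive, so a ratio of exactly 100
# reads as "moderate"; the scale itself does not settle this.
VERBAL_SCALE = [
    (10, "Weak or limited support"),
    (100, "Moderate support"),
    (1_000, "Moderately strong support"),
    (10_000, "Strong support"),
    (1_000_000, "Very strong support"),
]

def verbal_equivalent(lr):
    """Map a likelihood ratio greater than 1 to the scale's wording."""
    if lr <= 1:
        raise ValueError("the scale only covers likelihood ratios above 1")
    for upper, wording in VERBAL_SCALE:
        if lr <= upper:
            return wording
    return "Extremely strong support"

print(verbal_equivalent(100))     # Moderate support
print(verbal_equivalent(13_200))  # Very strong support
```

Under the opposite boundary convention the same ratio of 100 would read "moderately strong", so the wording depends on a choice the scale leaves open.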

There are a couple of things to note about this, not all of which the court did.

First, the numbers Ryder used were vastly different from those mentioned during the trial. "He made clear that the pattern was the one that was encountered most frequently in the laboratory, but he did not give the actual figures used by him... even though the figures in the database which he used in his formula were more favorable to the appellant". With those numbers, the likelihood ratio would be not 100:1 but 13,200:1 in favor of T.'s shoes having left the marks. But what's two orders of magnitude in a murder trial between friends?

Second, neither set of numbers is anything like a reliable basis for calculation:

It is evident from the way in which Mr Ryder identified the figures to be used in the formula for pattern and size that none has any degree of precision. The figure for pattern could never be accurately known. For example, there were only distribution figures for the UK of shoes distributed by Nike; these left out of account the Footlocker shoes and counterfeits. The figure for size again could not be any more than a rough approximation because of the factors specified by Mr Ryder. Indeed, as Mr Ryder accepted, there is no certainty as to the data for pattern and size.

More importantly, the purchase and use of footwear is also subject to numerous other factors such as fashion, counterfeiting, distribution, local availability and the length of time footwear is kept. A particular shoe might be very common in one area because a retailer has bought a large number or because the price is discounted or because of fashion or choice by a group of people in that area. There is no way in which the effect of these factors has presently been statistically measured; it would appear extremely difficult to do so, but it is an issue that can no doubt be explored for the future. [81--82]

(The Guardian, incidentally, glossed this as "The judge complained that he couldn't say exactly how many of one particular type of Nike trainer there are in the country", which is not the point at all.)

Third, the use of the likelihood ratio and statistical evidence is more than a bit of a bureaucratic fiction.

Mr Lewis [the "principal scientist at the FSS responsible for Case Assessment and Interpretation"] explained that in relation to footwear the first task of the examiner was to decide whether the mark could have been made by the shoe. If it could have been made, then what the FSS tried to do was to use the likelihood ratio to convey to the court the meaning of "could have been made" and how significant that was.

As Mr Lewis accepted, numbers were not put into reports because there was a concern about the accuracy and robustness of the data, given the small size of the data set and factors such as distribution, purchasing patterns and the like. It was therefore important that the emphasis on the use of a numerical approach was to achieve consistency; the judgment on likelihood was based on experience. [57--58]

Or, shorter: the examiners go by their trained judgments, but then work backwards to the desired numbers to satisfy bureaucratic mandates, even though everyone realizes the numbers don't bear scrutiny.

Fourth, to the extent that likelihood ratios and related statistics actually are part of the forensic process, they need to be presented during the trial, so that they can be assessed like any other evidence. Using them internally for the prosecution, but then sweeping them away, is a recipe for mischief. "It is simply wrong in principle for an expert to fail to set out the way in which he has reached his conclusion in his report.... [T]he practice of using a Bayesian approach and likelihood ratios to formulate opinions placed before a jury without that process being disclosed and debated in court is contrary to principles of open justice." [108] This, ultimately, was the reason for granting the appeal.

So where do we get to the point where (to quote The Guardian again) "a mathematical formula was thrown out of court"? Well, nowhere, because, to the extent that the court limited the use of Bayes's rule and likelihood ratios, it was re-affirming long-settled British law. As the judgment makes plain, "the Bayesian approach" and this sort of use of likelihood ratios were something "which this court had robustly rejected for non-DNA evidence in a number of cases" starting with R. vs. Dennis Adams in 1996 [46]. The basis for this "robust rejection" is also old, and in my view sound:

The principles for the admissibility of expert evidence [are that] the court will consider whether there is a sufficiently reliable scientific basis for the evidence to be admitted, but, if satisfied that there is a sufficiently reliable scientific basis for the evidence to be admitted, then it will leave the opposing views to be tested in the trial before the jury. [70]

In the case of DNA evidence, "there has been for some time a sufficient statistical basis that match probabilities can be given" [77]. But for footwear,

In accordance with the approach to expert evidence [laid down by previous judgments], we have concluded that there is not a sufficiently reliable basis for an expert to be able to express an opinion based on the use of a mathematical formula. There are no sufficiently reliable data on which an assessment based on data can properly be made... An attempt to assess the degrees of probability where footwear could have made a mark based on figures relating to distribution is inherently unreliable and gives rise to a verisimilitude of mathematical probability based on data where it is not possible to build that data in a way which enables this to be done; none in truth exists for the reasons we have explained. We are satisfied that in the area of footwear evidence, no attempt can realistically be made in the generality of cases to use a formula to calculate the probabilities. The practice has no sound basis.

It is of course regrettable that there are, at present, insufficient data for a more certain and objective basis for expert opinion on footwear marks, but it cannot be right to seek to achieve objectivity by reliance on data which does not enable this to be done. We entirely understand the desire of the experts to try and achieve the objectivity in relation to evidence of footwear marks, but the work done has never before, as we understand it, been subject to open scrutiny by a court. [86--87]

It is worth repeating that, despite the newspapers, this is not new law: "It is quite clear therefore that outside the field of DNA (and possibly other areas where there is a firm statistical base), this court has made it clear that Bayes theorem and likelihood ratios should not be used" [90]. Nonetheless, this does not amount to an obscurantist rejection of Bayes's theorem:

It is not necessary for us to consider ... how likelihood ratios and Bayes theorem should be used where there is a sufficient database. If there were a sufficient database in footwear cases an expert might be able to express a view reached through a statistical calculation of the probability of the mark being made by the footwear, very much in the same way as in the DNA cases subject to suitable qualification, but whether the expert should be permitted to go any further is, in our view, doubtful. [91]
The judgment goes on [91--95] to make clear that experts can have a sound scientific basis for their opinions even if these cannot be expressed as statistical calculations from a database. The objection rather is to spurious precision, and spurious claims to a scientific status [96].

There is a legitimate criticism to make of the court here, which is that it is not very specific about what would count as a "sufficient database", or "firm" statistics. It may be that the earlier cases cited fill this in; I haven't read them. This didn't matter for DNA, because people other than the police had other reasons for assembling the relevant data, but for something like shoes it's hard to see who would ever do it other than something like the FSS, and they are not likely to do so without guidance about what would be acceptable to the courts. On the other hand, the judges might have felt that articulating a specific standard simply went beyond what was needed to decide this case.

There is more in the judgment, including a discussion of what the court thought footwear examiners legitimately can and cannot generally say based on the evidence (drawing heavily on how this is done in the US). Rather than go into that, I will mention some more technical issues suggested by, but not discussed in, the judgment.

Some Statistical Commentary

  1. Nobody involved in the case used Bayes's rule. The unfortunate1 Mr. Ryder simply calculated a likelihood ratio. A properly Bayesian approach would have required at least the posterior odds, which would have meant putting a prior probability on the hypothesis that the shoes taken from T.'s house made the marks. (The prior probability of the alternative hypothesis would presumably have been one minus this.) What probability, though, would that have been?
    The current population of the UK is about 60 million. If we thus took the prior odds of T. being the murderer as 60 million to 1 against, then after Ryder's calculation of the likelihood ratio, the posterior odds come to 600,000 to 1 against. If one calculates the likelihood ratio from the numbers mentioned at the trial, it comes to 13,200, pushing the posterior odds all the way to about 4500 to 1 against. Presumably the prosecutors would say that the prior odds were a lot better than that, but that hardly helps the case for using Bayes's rule. Two Bayesians, seeing the same evidence and using the same likelihood function, can have posterior odds which are arbitrarily far apart, if their priors are sufficiently different.
    Without those prior probabilities, however, this use of the likelihood ratio is in fact a classic case of base-rate neglect, which is one of the things Bayes's rule is supposed to guard us against2. Of course, one can treat the prior as a testable part of the model, but doing so means giving up on the simple "probability that the hypothesis is true given the evidence" ideology at play here.
  2. Wishing this away, there is still an issue about specifying the alternatives whose likelihood ratio is to be calculated. In this case, the two hypotheses were that the marks were made by the pair of shoes from T.'s house and that the marks were made by some other pair of shoes. This was the source of the 100:1 likelihood ratio. If the second hypothesis were that the marks were made by some other, equally worn pair of shoes of the same pattern and size, the likelihood ratio would presumably have been pretty close to 1. (Close, because there might be differences due to damage, or people carving "for a good time, follow me" or "hah hah, coppers, you'll never prove it!" into their soles, etc.) In the terms used by American footwear examiners [65], the (semi-fictional) likelihood calculation would bear on "class" characteristics, not "identifying" characteristics. Yet one doubts, somehow, that any prosecutor would be inclined to state that "there is extremely weak scientific support for the print having been made by these shoes, rather than others of the same type", which is what the general formulas would entail. But perhaps that is unfair: "It is important to emphasise that the evidence [in DNA cases] is not directed to whether DNA came from the suspect, but the probability of obtaining a match that came from an unknown person who is unrelated to the suspect but has the same profile" [77].
  3. The issue the court correctly raises, about all the factors which could alter the local frequency of shoes, and the difficulty of measuring them, is related to the classic "reference class problem". This is a difficulty confronting simple relative-frequency theories of probability, namely, relative frequency in which "reference class" of instances: shoes sold this year in Britain? shoes sold over the last eight years in Britain? shoes in Bristol? Shoes within a mile of the Clifton Bridge? Shoes worn by respectable Cliftonians? By disreputable Cliftonians?3 Etc.
    Bayesians solve the reference class problem by fiat modeling assumptions. As Aris Spanos points out, so do modern frequentists4. In both cases, though, one then has to justify the model. (Andy is right to keep saying that thinking the likelihood function is just given and beyond question is a serious mistake.) This is not impossible in principle — it's been pretty much done with DNA, for instance — but it would plainly be very hard, for all the reasons the judges list and more besides.
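
The arithmetic in point 1 is just Bayes's rule in odds form: posterior odds equal prior odds times the likelihood ratio. A minimal sketch follows, with a hypothetical function name, and with the 60-million-to-1 prior taken purely for illustration, as in the text, not as anything a court or expert proposed.

```python
def posterior_odds_against(prior_odds_against, likelihood_ratio):
    """Bayes's rule in odds form, stated as odds *against* the hypothesis.

    Odds in favour are multiplied by the likelihood ratio, so odds
    against are divided by it.
    """
    return prior_odds_against / likelihood_ratio

prior = 60_000_000  # illustrative prior: one suspect among ~60 million people

for lr in (100, 13_200):
    odds = posterior_odds_against(prior, lr)
    prob = 1 / (1 + odds)  # posterior probability that the hypothesis is true
    print(f"LR {lr:>6}: about {odds:,.0f} to 1 against (probability {prob:.1e})")
```

Changing `prior` moves the posterior odds by exactly the same factor, which is the point about two Bayesians with the same likelihood function but different priors.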

So, we have a situation where the "Bayesian approach" supposedly being taken by the forensic specialists was not noticeably Bayesian, in addition to being based on hopelessly vague numbers and more than a bit of an administrative fiction.

Where Did This Story Come From?

The verbal scale for likelihoods I mentioned above was the brain-child of a trade organization of British forensic specialists [52--53] in the 2000s. It grew out of a movement to formalize the evaluation of forensic evidence through likelihood ratios, which participants described as "the Bayesian approach". "On the evidence before us this development occurred in the late 1990s and was based on the approach to expert evidence on DNA. It was thought appropriate to translate [that] approach... to other areas of forensic evidence" [49]. Several of the leading participants in this movement were evidently employees of the FSS, or otherwise closely affiliated with it. They seem to have been the ones responsible for insisting that all evaluative opinions be justified for internal consumption by a likelihood ratio calculation, and then expressed on that verbal scale.

That they started pushing for that just a few years after the British courts had ruled that such calculations were inadmissible when based on unreliable (or no) data might explain why these calculations were kept internal, rather than being exposed to scrutiny. That they pushed such calculations at all seems to be explained by a very dogmatic case of Bayesian ideology, expressed, e.g., in an extraordinary statement of subjectivism [75] that out-Savages Savage. Why they thought likelihood ratios were the Bayesian approach, though, I couldn't begin to tell you. (It would certainly be news to, say, Neyman and Pearson.) It would be extraordinary if these people were confusing likelihood ratios and Bayes factors, but that's the closest I can come to rationalizing this.

Sociologically considered, "forensic science", so called, is a relatively new field which is attempting to establish itself as a profession, with legitimate and recognized claims to authority over certain matters, what Abbott, in the book linked to just now, calls "jurisdiction". Part of professionalization is convincing outsiders that they really do need the specialized knowledge of the professionals, and it's very common, in attempts at this, for people to try to borrow authority from whatever forms of knowledge are currently prestigious. I suppose it's a good thing for us statisticians that Bayesian inference currently seems, to a would-be profession, like a handy club with which to beat down those who would claim its desired territory.

Still, if this aspect of professionalization often seems like aping the external forms of real science, while missing everything which gives those forms meaning, I think that's because it is. Forensics people making a fetish of the probability calculus when they have no basis for calculation is thus of a piece with attempts to turn cooking into applied biochemistry, or eliminate personality conflicts through item response theory. One has to hope that if a profession does manage to establish itself, it grows out of such things; sometimes they don't.

Naturally, being comprehensively smacked down by the court is going to smart for these people. I imagine prosecutors are unhappy as well, as this presumably creates grounds for appeals in lots of convictions. Expert witnesses (such as those quoted in the Guardian story) are probably not best pleased at having to admit that when they give precise probabilities, it is because their numbers are largely made up. I can sympathize with these people as human beings in an awkward and even, in some cases, deeply unenviable position, and certainly understand why they'd push back. (If I had to guess why a decision dated October 2010 got written up, in a thoroughly misleading way, in a newspaper in October 2011, it would be that it took them a while to find a journalist willing to spin it for them.) But this doesn't change the fact that they are wrong, and the judges were right. If they really want to use these formulas, they need to get better data, not complain that they're not allowed to give their testimony — in criminal trials, no less! — a false air of scientific authority.

Update, next day: Typo fixes, added name and link for the journalist.

Update, 29 October: Scott Martens points to a very relevant paper, strikingly titled "Is it a Crime to Belong to a Reference Class?" (Mark Colyvan, Helen M. Regan and Scott Ferson, Journal of Political Philosophy 9 (2001): 168--181; PDF via Prof. Colyvan). This concerns a US case (United States vs. Shonubi). There, the dispute was not about whether Shonubi was smuggling drugs (he was), or had been convicted fairly (he had), but about whether his sentence could be based on a statistical model of how much he might have smuggled on occasions when he was not caught. The appeals court ruled that this was not OK, leading to a parallel round of lamentations about "the legal system's failure to appreciate statistical evidence" and the like. The paper by Colyvan et al. is a defense of the appeals court's decision, largely on the grounds of the reference class problem, or, as they equivalently note (p. 179 n. 27) of model uncertainty (as well as crappy figures), though they also raise some interesting points about utilities.

Manual trackback: Abandoned Footnotes

1: I say "unfortunate", because, while the court makes clear he was just following standard procedure as set by his bosses and is not to be blamed in any way, he cannot be a popular man with those bosses after all this.

2: To drive home the difference between more likely and more probable, recall Kahneman and Tversky's famous example of Linda the feminist bank teller:

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable? Linda is a bank teller, or Linda is a bank teller and is active in the feminist movement.
The trick is that while Linda is more likely to be as described if she is a feminist bank teller than if she is a bank teller with unknown views on feminism, she is nonetheless more probable to be a bank teller. Of course in the legal case the alternatives are not nested (as here) but mutually exclusive.
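
A toy calculation makes the distinction concrete. Every number here is hypothetical, invented only for illustration and not drawn from Kahneman and Tversky or any data; the variable names are likewise mine.

```python
# Entirely hypothetical numbers, chosen only to illustrate the point.
# Prior probabilities for a woman of Linda's age:
p_feminist_teller = 0.001   # bank teller active in the feminist movement
p_other_teller = 0.009      # bank teller not active in the movement
# Probability of fitting Linda's description within each group:
p_desc_given_feminist_teller = 0.50
p_desc_given_other_teller = 0.01

p_teller = p_feminist_teller + p_other_teller

# Joint probabilities of (group, description).
joint_feminist_teller = p_feminist_teller * p_desc_given_feminist_teller
joint_teller = joint_feminist_teller + p_other_teller * p_desc_given_other_teller

# Likelihoods: how well each hypothesis predicts the description.
lik_feminist_teller = p_desc_given_feminist_teller
lik_teller = joint_teller / p_teller

# "More likely": the narrower hypothesis predicts the description better.
print(lik_feminist_teller > lik_teller)      # True
# "More probable": both posteriors share the denominator P(description),
# so comparing the joint probabilities is enough, and the nested
# hypothesis can never come out ahead.
print(joint_feminist_teller < joint_teller)  # True
```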

3: I have no reason to think this murder case had anything to do with Bristol in general or Clifton in particular, both of which I remember fondly from a year ago.

4: I think one could do more with notions like ergodicity, and algorithmic, Martin-Löf randomness, than Spanos is inclined to, but in practice of course one simply uses a model.

Enigmas of Chance; Bayes, anti-Bayes

Posted by crshalizi at October 27, 2011 23:50

Three-Toed Sloth:   Hosted, but not endorsed, by the Center for the Study of Complex Systems