- Robert A.J. Matthews: Facts versus Factions: the use and abuse of subjectivity in scientific research
(Bayesian inference / Bayesian statistics)
- At
http://ourworld.compuserve.com/homepages/rajm/openesef.htm
- "Summary:
This paper explores the use and abuse of subjectivity in science, and the ways in which the scientific community has attempted to explain away its curiously persistent
presence in the research process. This disingenuousness is shown to be not only unconvincing but also unnecessary, as the axioms of probability reveal subjectivity to
be a mathematically ineluctable feature of the quest for knowledge. As such, concealing or explaining away its presence in research makes no more sense than
concealing or explaining away uncertainty in quantum theory. The need to acknowledge the ineluctability of subjectivity transcends issues of intellectual honesty,
however. It has profound implications for the assessment of new scientific claims, requiring that their inherent plausibility be taken explicitly into account. Yet as I
show, the statistical methods currently used throughout the scientific community lack this crucial feature. As such, they grossly exaggerate both the size of implausible
effects and their statistical significance, and lend misleading support to entirely spurious "discoveries". These fundamental flaws in conventional statistical methods
have long been recognised within the statistics community, but repeated warnings about their implications have had little impact on the practices of working scientists.
The result has been an ever-growing number of spurious claims in fields ranging from the paranormal to cancer epidemiology, and continuing disappointment as
supposed "breakthroughs" fail to live up to expectations. The failure of the scientific community to take decisive action over the flaws in standard statistical methods,
and the resulting waste of resources spent on futile attempts to replicate claims based on them, constitutes a major scientific scandal."
- Robert Millikan is widely regarded as one of the founders of modern American science, his determination of the charge on the electron winning him the 1923 Nobel
Prize for physics. In a now-famous study, the physicist and historian Gerald Holton examined the log-books for Millikan’s experiments with the electron, and
revealed that he repeatedly rejected data that he deemed "unacceptable" (Holton 1978). The criteria he used were blatantly subjective, as revealed by the comments
in the log-books, such as "Very low - something wrong" and "This is almost exactly right". Throughout, Millikan appears to have been driven partly by a desire to
get results that were self-consistent, broadly in agreement with other methods, and consistent with his personal view that the electron is the fundamental and
indivisible unit of electric charge.
While these criteria may seem reasonable enough, they carry inherent dangers. Even today a fundamental explanation of the precise numerical value of the charge on
the electron remains lacking, so Millikan was hardly in a position to decide objectively which values were high and which ones low. Previous results may have been
fundamentally flawed, while the demand for self-consistent results may mask the existence of subtle but genuine properties of the electron. Millikan could also have
been proved wrong in his belief that the electron was fundamental.
However, it is also clear that Millikan had another powerful motivation for using all means to obtain a convincing determination of the electronic charge: he was in a
race against another researcher, Felix Ehrenhaft at the University of Vienna. Ehrenhaft had obtained similar results to those of Millikan, but they were interspersed
with much lower values that suggested that the electron was not, in fact, the fundamental unit of charge. Millikan had no such doubts, published his results, and went
on to win the Nobel Prize.
- Apologists for Millikan’s hand-picking of data also point out that the numerical result he obtained, -1.592 x 10^-19 coulombs, is just 0.6 per cent below the modern
value of -1.6021892 x 10^-19 C (Weinberg 1993 p 99). At first sight, this does indeed seem impressive. However, Millikan’s stated result was based on a faulty
value for the viscosity of air, which when corrected changes Millikan’s result to -1.616 x 10^-19 C, increasing the discrepancy with the modern value by over 40 per
cent. More importantly, however, it puts the latter well outside the error-bounds of Millikan’s central estimate. Indeed, the discrepancy is so large that the probability
of generating it by chance alone is less than 1 in 10^3. Millikan’s "remarkable ability" to scent out the correct answer was clearly not as great as his apologists would
have us believe. Rather more remarkable is Millikan’s ability, almost half a century after his death, to evade recognition as an insouciant scientific fraudster who won
the Nobel Prize by deception.
The dangers of the injudicious use of subjective criteria are further highlighted by the aftermath of Millikan’s experiments. In the decades following his work and Nobel
Prize, other investigators made determinations of the electronic charge. The values they obtained show a curious trend, creeping further and further away from
Millikan’s "canonical" value, until finally settling down at the modern figure with which, as we have seen, it is wholly incompatible. Why was this figure not reached
sooner? The Nobel Prizewinning physicist Richard Feynman has given the answer in his own inimitable style (Feynman 1988, p 382):
"It’s apparent that people did things like this: when they got a number that was too high above Millikan’s, they thought something was wrong - and they
would look for and find a reason why something might be wrong. When they got a number closer to Millikan’s value they didn’t look so hard. And so
they eliminated the numbers that were too far off"
- Semmelweis’s long and unsuccessful struggle during the 1840s to introduce antiseptic practices into hospitals (Asimov 1975 p 348). Despite the existence of
a dramatic fall in the numbers of cases of childbed fever produced by the use of antiseptics, the practice was rejected because of resentment by the doctors
that they could be causing so many deaths, nationalistic prejudice against a Hungarian working in a Viennese hospital, and annoyance at the way the
antiseptics eliminated the "professional odour" on their hands after returning to the wards from working in the mortuary.
- The rejection and ridiculing of Francis Peyton Rous’s evidence for the existence of viruses capable of transmitting cancer (Williams 1994, p422). First put
forward in 1911, Rous’s evidence came at a time when the existence of viruses was still controversial - they were beyond the reach of contemporary
microscopy - and when cancer was thought to be caused by "tissue irritation". Rous’s claim was finally vindicated 25 years later. In 1966 he was awarded the
Nobel Prize - at the age of 87.
- The vociferous response of geologists to the proposal by Alfred Wegener, a German astronomer and meteorologist, that the continents moved across the face
of the Earth. Wegener had found considerable evidence for the phenomenon but was unable to propose a physical mechanism for it, and his proposal was dismissed
as a "fairy tale", the product of "auto-intoxication in which the subjective idea comes to be considered as an objective fact" (Hellman 1998 p150). His claims
were subsequently vindicated in the 1960s, 50 years after he first proposed them, and 30 years after his death.
- In the early 1980s, the Australian physician Barry Marshall encountered derision and hostility for his claim that a previously unknown bacterium, Helicobacter
pylori, was responsible for stomach ulcers. Marshall’s evidence went against the prevailing view that bacteria were incapable of thriving within the acidic
conditions of the stomach. H. pylori is now accepted as the principal cause of stomach ulcers, and has also been implicated in gastric cancer.
- Subjectivity in the testing of theories
The value of any scientific theory, no matter how theoretically elegant or plausible, is ultimately tested by experiment. Conventionally, this crucial element of the
scientific process involves extracting a clear and unequivocal prediction from the theory, investigating this prediction experimentally, and assessing the outcome
objectively. Exactly how this comparison is performed, and what conclusions are drawn, has long been a subject of debate among scientists and philosophers. Many
scientists consider themselves to be followers of Karl Popper and the concept of falsifiability (Popper 1963): that to be considered scientific, a theory must be
capable of being proved wrong. On this view, the experiment and the analysis of data should be performed to discover if the theory is falsified, and if it is, it must be
abandoned. As such, theories are never proved "correct": they merely survive until the next experimental attempt at falsification.
There are a great many fundamental problems with Popper’s widely-held - and admittedly appealing - view of the scientific process (see especially Howson &
Urbach 1993). Put simply, these problems boil down to the fact that the concept of falsification is supported neither in principle nor in practice. Over 90 years ago
the French physicist and philosopher Pierre Duhem pointed out that the testable consequences of scientific theories are not a pure reflection of the theory itself, but
are based on many extra assumptions. As a result, if an experiment appears to falsify a theory, this does not automatically imply that the theory must be false: it is
always possible to blame one of the auxiliary assumptions.
- This "curious" fact, combined with the many problems and pitfalls associated with frequentist measures of "significance", raises an obvious question: is there a better
way? As I now show, the answer is yes.
- Facts versus Factions: the use and abuse of subjectivity in scientific
research - PART 2 :
http://ourworld.compuserve.com/homepages/rajm/twooesef.htm
- The classical frequentist techniques of inference are not, in fact, "classical" at all, but relative newcomers in the long history of statistical inference. Before the 1920s,
another approach to statistical inference was in general use, based on a result that flows directly from the axioms of probability. As such, this approach has solid
theoretical foundations, produces intuitive, readily-understood measures of "significance", and remains as valid today as it did before it was eclipsed by the flawed
attempts of Fisher et al. to create an objective theory of statistical inference. It is known as Bayesian inference, after the 18th Century English cleric Thomas Bayes
who first published the key theorem behind it: Bayes’s theorem.
The power and importance of this theorem is immediately apparent in its solution to one of the central problems of standard statistical inference. As we have seen,
frequentist methods do not tell us Prob(theory | data); that is, they do not tell us what our belief in a theory should be, given the data we actually saw. To answer that
question, we must turn to the axioms of probability theory, from which we find that (see, e.g. Feller 1968 Ch 5):
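The equation that followed this sentence is not reproduced in these notes. As a reminder, and assuming the paper used the standard textbook form, Bayes's theorem can be written in the notation of the passage above as:

```latex
\mathrm{Prob}(\text{theory} \mid \text{data})
  = \frac{\mathrm{Prob}(\text{data} \mid \text{theory}) \, \mathrm{Prob}(\text{theory})}
         {\mathrm{Prob}(\text{data})}
```

Here Prob(theory) is the prior probability, the quantity that carries the explicit judgement of plausibility, and Prob(data | theory) is the likelihood. The layout and symbol names are a reconstruction, not a copy of the paper's own equation.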
- In short, Bayesian inference provides a coherent, comprehensive and strikingly intuitive alternative to the flawed frequentist methods of statistical inference. It leads to
results that are more easily interpreted, more useful, and which more accurately reflect the way science actually proceeds. It is, moreover, unique in its ability to deal
explicitly and reliably with the provably ineluctable presence of subjectivity in science.
- For over 250 years, Princeton students have attended Commencement on a Tuesday in late May or early June, an outdoor event for which good weather is vital.
According to local folklore, good weather does usually prevail, prompting claims that those attending may "wish" good weather into existence. By analysing local
weather records spanning many decades, Nelson found that Princeton’s weather was generally no different from that of its surroundings. However, he did find some
evidence that the town was less likely to be rained on during the outdoor events. The phenomenon gave z-scores as high as 1.996, which on a frequentist basis gives
a "significant" P-value of 0.046. Properly mindful of the implausibility of the phenomenon, however, Nelson was reluctant to take this "objective" finding at face value,
and instead reached a more subjective conclusion: "These intriguing results certainly aren’t strong enough to compel belief, but the case presents a very challenging
possibility".
A Bayesian analysis allows a far more concrete assessment of plausibility to be made. Clearly, with such a bizarre claim, there is little one can say about the precise
value of a sensible prior probability for the null hypothesis of no real effect, other than to say that the probability is likely to be pretty high. In such cases, Bayesian
inference still gives valuable insight, as it allows one to estimate the level of prior probability necessary to sustain a belief that the effect is illusory, even in the light of
Nelson’s data. Using (4) and (3) and z = 1.996, this inverse Bayesian inference shows that Prob(Null | data) > 0.5 for all Prob(Null) > 0.88. In other words, for anyone
whose prior scepticism about the effectiveness of "wishful thinking" exceeds 90 per cent, the balance of probabilities is that the effect is illusory, despite Nelson’s
data.
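The inverse inference described in this paragraph can be sketched in a few lines of Python. Equations (3) and (4) are not reproduced in these notes, so the sketch rests on two assumptions: that (3) is Bayes's theorem in odds form, and that (4) is the lower bound exp(-z^2/2) on the Bayes factor in favour of the null. That bound is chosen here because it reproduces the quoted threshold of 0.88; the function names are illustrative only.

```python
import math

def min_bayes_factor(z):
    """Assumed lower bound on the Bayes factor for the null: exp(-z^2 / 2).
    The notes omit the paper's equation (4); this form is an assumption."""
    return math.exp(-z * z / 2.0)

def posterior_null(prior_null, bayes_factor):
    """Bayes's theorem in odds form: posterior odds = prior odds * BF."""
    prior_odds = prior_null / (1.0 - prior_null)
    post_odds = prior_odds * bayes_factor
    return post_odds / (1.0 + post_odds)

def break_even_prior(bayes_factor):
    """Prior Prob(Null) at which Prob(Null | data) equals exactly 0.5."""
    return 1.0 / (1.0 + bayes_factor)

z = 1.996  # z-score reported for the Princeton weather data
bf = min_bayes_factor(z)
threshold = break_even_prior(bf)
print(f"minimum BF = {bf:.3f}, break-even Prob(Null) = {threshold:.2f}")
```

Under these assumptions the break-even prior comes out at roughly 0.88, so any prior scepticism above that level leaves the null more probable than not.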
As this example shows, frequentist methods greatly exaggerate the "significance" of intrinsically implausible data. However, as we shall now see, frequentist methods
can also seriously exaggerate both the size and significance of effects in much more important mainstream areas of research, such as clinical trials.
- Misleading "significance" of clinical trial results
Misleading P-values
The most common methods for investigating the efficacy of a new drug or therapy, or the impact of exposure to some risk-factor, are the so-called randomised
clinical trials and case-control studies, in which a group of people given the new treatment or known to have the disease are compared with a "control" group. One
common frequentist method of analysing the outcome is to reduce the results to a test-statistic (such as χ²), which is then turned into a P-value; as before, if this is
less than 0.05, then the difference between the two groups is deemed to be significant. Again, however, a Bayesian analysis reveals that the real "significance" of such
a finding is typically much less impressive than the P-values imply.
As before, I shall demonstrate this by taking a real-life case. During the early 1990s, research emerged to suggest that the risk of coronary heart disease (CHD) is
associated with childhood poverty (Elford et al. 1991). Following the discovery that infection with the bacterium H. pylori is also linked to poverty, some
researchers suspected that the bacterium may form the "missing link" between the two. Precisely how a bacterium in the stomach might cause heart disease is less
than clear - raising the key issue of plausibility, to which we shall return shortly. Nevertheless, a number of studies were undertaken to investigate the link between
CHD and H. pylori. In one of the first such studies (Mendall et al. 1994), 60 per cent of patients who suffered CHD were found to be infected with H. pylori,
compared with 39 per cent of normal controls. When the effects of age, CHD risk factors and current social class had been controlled for, the results led to a χ²
value of 4.73. Using frequentist methods, this leads to a P-value of 0.03, implying that the rate of CHD among those infected with H. pylori is "significantly" higher
than those without.
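The step from the test statistic to the P-value can be checked with a short Python sketch. It assumes the statistic has one degree of freedom, which the excerpt does not state; in that case chi-squared equals z squared, and the upper tail equals the two-sided normal tail.

```python
import math

def p_value_chi2_1df(chi2):
    """Upper-tail probability of a chi-squared statistic with 1 degree of freedom.
    With 1 df, chi2 = z^2, so the tail is the two-sided normal tail of z."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Statistic quoted for Mendall et al. (1994); 1 df is an assumption.
p = p_value_chi2_1df(4.73)
print(f"P = {p:.3f}")
```

Under that assumption the result rounds to the 0.03 quoted in the text. The same function applied to z = 1.996, that is chi2 = 1.996 ** 2, gives the 0.046 quoted earlier for the Princeton weather data.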
On the face of it, this finding raises the intriguing prospect of being able to tackle one of the major killers of the western world using nothing more than antibiotics. Yet
while the evidence that both CHD and H. pylori infection are more common among the poor is suggestive of a link between the two, it is hardly unequivocal. Such
scepticism is underscored by the lack of any convincing mechanism by which a gastric bacterium could trigger heart disease. The frequentist P-value, however,
cannot reflect any of these justifiable qualms; sceptics of the link have no option but to say that on this occasion they are just going to ignore the supposed
"significance" of Mendall et al.’s finding.
In contrast, Bayesian inference requires no such arbitrary "moving of the goalposts": it allows explicit account to be taken of the plausibility of the findings. In the case
of the supposed link between CHD and H. pylori, the lack of any convincing mechanism balanced against the socio-economic evidence of a link suggests that an
agnostic prior probability of Prob(Null) = 0.5 would be a reasonable starting-point for assessing results like those found by Mendall et al.
[TEXT DELETED]
Inserting the value of χ² = 4.73 found by Mendall et al. into (6) shows that the BF is at least 0.337. Putting this in (6) we find that Prob(Null | data), the probability
that Mendall et al.’s results are due to nothing more than chance, is at least 0.25. In other words, even using an agnostic prior, the frequentist P-value has
over-estimated the real "significance" of the findings by almost an order of magnitude.
Those taking a more sceptical view of a link between a gastric bacterium and CHD would, of course, set Prob(Null) somewhat higher. Applying the concept of
inverse Bayesian inference used earlier, it emerges that even a relatively modest sceptical prior of just Prob(Null) = 0.75 is enough to lead to a balance of
probabilities that Mendall et al.’s findings are entirely illusory.
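The figures in the last two paragraphs can be reproduced with the following Python sketch. The equations they rely on fall in the deleted text, so the bound used here is an assumption: BF >= sqrt(chi2) * exp(-(chi2 - 1) / 2), valid for chi2 > 1. It is chosen because it yields the quoted value of 0.337, and it is not confirmed to be the paper's own equation. The function names are illustrative.

```python
import math

def min_bayes_factor_chi2(chi2):
    """Assumed lower bound on the Bayes factor for the null, for chi2 > 1:
    sqrt(chi2) * exp(-(chi2 - 1) / 2). Not confirmed as the paper's equation."""
    return math.sqrt(chi2) * math.exp(-(chi2 - 1.0) / 2.0)

def posterior_null(prior_null, bayes_factor):
    """Bayes's theorem in odds form: posterior odds = prior odds * BF."""
    prior_odds = prior_null / (1.0 - prior_null)
    post_odds = prior_odds * bayes_factor
    return post_odds / (1.0 + post_odds)

bf = min_bayes_factor_chi2(4.73)  # statistic reported by Mendall et al.
print(f"minimum BF = {bf:.3f}")
for prior in (0.5, 0.75):  # agnostic prior, then a modestly sceptical one
    post = posterior_null(prior, bf)
    print(f"Prob(Null) = {prior:.2f} -> Prob(Null | data) >= {post:.2f}")
```

With the agnostic prior the lower bound on Prob(Null | data) is about 0.25, and with Prob(Null) = 0.75 it just exceeds 0.5, in line with the balance of probabilities described above.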
- Reputable researchers would no doubt feel more confident defending evidence for an anomalous phenomenon by applying at least a mild level of scepticism in their
assessment of significance. In this case, a P-value of no more than around 2 x 10^-4 is appropriate, a value 250 times more demanding than the conventional 0.05
criterion. These technical results can be stated much more succinctly, however: extraordinary claims require extraordinary evidence. This is a well-attested and
widely-accepted principle, yet it is noticeable by its absence in the mathematics of frequentist inference.
- There is a dangerous irony in the continuing reluctance of the scientific community to adopt Bayesian inference. For this reluctance stems largely from a deep-rooted
fear that adopting methods that embrace subjectivity is tantamount to conceding that the scientific enterprise really is a social construct, as claimed by the
post-modern advocates of the "anti-science" movement. The central lesson of Bayes’s theorem is, however, quite the opposite. It shows, with full mathematical
rigour, that while evidence for a specific theory may indeed start out vague and subjective, the accumulation of data progressively drives the evidence towards a
single, objective reality about which all can agree.
It is ironic indeed that by failing to recognise this, the scientific
community continues to use techniques of inference whose unreliability
undermines confidence in the scientific process, and which thus
threatens to deliver science into the hands of its enemies.
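The claim that accumulating data drives differing starting beliefs towards agreement can be illustrated with a toy Python sketch. It is not taken from the paper: the two observers, their Beta priors and the 70 per cent success rate are all hypothetical, and the conjugate Beta-binomial model is used only because its posterior mean has a simple closed form.

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Mean of the Beta posterior after binomial data (conjugate update)."""
    return (prior_a + successes) / (prior_a + prior_b + trials)

# Two hypothetical observers with opposite prior beliefs about a success rate.
sceptic = (1.0, 9.0)      # Beta(1, 9), prior mean 0.1
enthusiast = (9.0, 1.0)   # Beta(9, 1), prior mean 0.9

gaps = []
for trials in (10, 100, 1000, 10000):
    successes = 7 * trials // 10  # illustrative data: a steady 70 per cent rate
    m1 = posterior_mean(*sceptic, successes, trials)
    m2 = posterior_mean(*enthusiast, successes, trials)
    gaps.append(m2 - m1)
    print(f"n = {trials:5d}: sceptic {m1:.3f}, enthusiast {m2:.3f}, gap {m2 - m1:.4f}")
```

In this toy model the gap between the two posterior means is 8 / (10 + n), so it shrinks steadily as the number of trials n grows, whatever the observers believed at the outset.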