What is your work about?
The “statistics wars” reflect disagreements on one of the deepest and oldest philosophical questions: How do humans learn about the world despite threats of error due to incomplete and variable data? At the same time, the statistics wars are the engine behind current controversies surrounding high-profile failures of replication in the social and biological sciences. A fundamental debate concerns the role of probability in inference. Should probability enter to ensure we will not reach mistaken interpretations of data too often in the long run of experience (performance), or to capture degrees of belief or support about claims (probabilism)? For over 100 years, the field of statistics has been marked by disagreements between frequentists and Bayesians so contentious that everyone wants to believe we are long past them. But the truth is that these long-standing battles still simmer below the surface. They show up unannounced amid the current problems of scientific integrity, irreproducibility, and questionable research practices.
In this book, I identify a third role for probability (beyond probabilism and performance). It grows out of the key source of today’s handwringing: that high-powered methods make it easy to uncover impressive-looking findings even when they are false, because spurious correlations and other errors have not been severely probed. I set sail with a simple tool: if little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test. In the severe testing view, probability arises in scientific contexts to assess and control how capable methods are at uncovering and avoiding erroneous interpretations of data. That is what it means to view statistical inference as severe testing. A claim is severely tested to the extent that it has been subjected to, and passes, a test that probably would have found flaws, were they present. The probability that a method commits an erroneous interpretation of data is an error probability. Statistical methods based on error probabilities I call error statistics. It is not probabilism or performance we seek to quantify, but how severely probed claims are.
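As a minimal sketch of how such an assessment can be reported numerically (assuming the simplest case, a one-sided Normal test of H0: mu <= mu0 with known sigma, in the spirit of the severity assessments developed in the book; the function name and the numbers below are illustrative choices, not code from the book): after a statistically significant result, the severity with which the claim mu > mu1 passes is the probability that the test would have produced a smaller observed mean were mu equal to mu1.

```python
from math import sqrt
from statistics import NormalDist

def severity_mu_greater(x_bar, mu1, sigma, n):
    """Severity for the claim mu > mu1, given a statistically significant
    observed mean x_bar in a one-sided Normal test of H0: mu <= mu0
    (known sigma): the probability of a result no larger than x_bar
    were mu equal to mu1."""
    se = sigma / sqrt(n)                      # standard error of the mean
    return NormalDist(mu=mu1, sigma=se).cdf(x_bar)

# Illustrative (hypothetical) numbers: sigma = 10, n = 100, observed mean 2.0,
# testing H0: mu <= 0; the result is significant (z = 2.0, p ~ 0.02).
sigma, n, x_bar = 10.0, 100, 2.0
for mu1 in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"SEV(mu > {mu1:.1f}) = {severity_mu_greater(x_bar, mu1, sigma, n):.3f}")
```

On these illustrative numbers, the claim mu > 0 passes with severity about 0.98, while mu > 1.5 reaches only about 0.69: the same significant result warrants some discrepancies from the null well and others poorly, which is the kind of magnitude-specific report a severity assessment supplies.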
How do you relate your work to other well-known philosophies?
The severe testing perspective substantiates, using modern statistics, an idea Karl Popper promoted but never adequately cashed out. While Popper’s account rests on severe tests (tests that would probably falsify claims if they were false), Popper cannot warrant saying a method is reliable, because that would appear to concede that he has a “whiff of induction” after all. Thus Popper defined severity in terms of novel predictive success, which is inadequate. Corroboration by severe testing is indeed inductive, in the sense of being ampliative, but unlike other inductive accounts it is not a probabilism. Nevertheless, I argue that it is an epistemological use of probability: to determine which claims are well and which poorly warranted. It is not purely formal, but depends on developing a repertoire of errors and methods for probing them. Solving the relevant problem of induction, in my view, becomes demonstrating that severe tests exist and showing that we may use them to capture the capabilities of methods to probe severely. Rather than ask what distinguishes scientific from unscientific theories (as in Popper’s demarcation), the severe tester asks: when is an inquiry into a theory, or an appraisal of claim H, unscientific? We want to distinguish meritorious modes of inquiry from those that violate the minimal requirement for evidence.
Why did you feel the need to write this work?
First, the arguments between rival statistical schools have changed radically from just 10 or 20 years ago and cry out for fresh analysis. Bayesian practitioners have adopted “non-subjective” or default Bayesian positions, where prior probabilities, rather than expressing beliefs, are intended to allow the data to be dominant in some sense. These default Bayesianisms are supposed to unify or reconcile frequentist and Bayesian accounts, but there is disagreement as to which default priors to use and how to interpret them. Attaining agreement on numbers that measure different things leads only to chaos and confusion.
Second, the statistics wars are increasingly being brought to bear on some very public controversies. Once replication failures spread beyond the social sciences to biology and medicine, people became serious about reforms. While many reforms are welcome, others are quite radical. Many methods advocated by data experts do not stand up to severe scrutiny, and some are even in tension with successful strategies for blocking or accounting for cherry picking and selective reporting. If statistical consumers are unaware of the assumptions behind rival evidence reforms, they cannot scrutinize the consequences that affect them (in personalized medicine, psychology, and so on). I felt it was of urgent importance to lay these assumptions bare.
Which of your insights or conclusions do you find most exciting?
One of the most exciting features of the severe testing philosophy is that it brings into focus the crux of the issues around today’s statistics wars without presupposing any one side of the debate. This was the central challenge I faced in writing the book. Identifying a minimal thesis about “bad evidence, no test” (BENT) enables scrutiny of any statistical inference account, at least at the meta-level. One need not accept the severity principle to use it to this end. But by identifying when methods violate minimal severity, we can pull back the veil on a key source of disagreement behind the battles. The concept of severe testing is sufficiently general to apply to any of the methods now in use, whether for exploration, estimation, or prediction.
It also provides a way to get around a long-standing problem in frequentist statistical foundations: showing the relevance of error probabilities for inference in the case at hand. What bothers us about the cherry-picker, data-dredger, and multiple tester is not that their methods will often be wrong in the long run (although that is true). It is that they have done a poor job of blocking known fallacies in the case at hand. The severity principle captures this intuition. In so doing, it links the error probabilities of methods to an assessment of what is and is not warranted in the case at hand. We should oust the mechanical, recipe-like uses of statistical methods that have long been lampooned, and that are doubtless made easier by Big Data mining. They should be supplemented with tools to report the magnitudes of effects that have and have not been warranted with severity. But simple significance tests have their uses, and should not be banned simply because some people are liable to violate the very strictures set out by Fisher and Neyman-Pearson. They should be seen as part of a conglomeration of error statistical tools for distinguishing genuine from spurious effects. Instead of rehearsing the same criticisms over and over again, challengers on all sides should now begin by grappling with the arguments we trace within.
Deborah G. Mayo
Deborah Mayo is Professor Emerita in the Department of Philosophy at Virginia Tech. She is the author of Error and the Growth of Experimental Knowledge (1996, Chicago), which won the 1998 Lakatos Prize, awarded to the most outstanding contribution to the philosophy of science during the previous six years. She co-edited, with Aris Spanos, Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science (2010, CUP), and co-edited, with Rochelle Hollander, Acceptable Evidence: Science and Values in Risk Management (1991, Oxford). Her most recent work is Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars (2018, CUP). Other publications are available here.