The futility of Null Hypothesis Significance Testing (NHST): Pillorying the indefensible

Written by Muhammad Irfan bin Abdul Jalal

(This is the first article of a two-part discussion on Bayesian Statistics)

Bayesian statistics has recently gained traction in the medical community as an alternative to frequentist statistics1. This has stemmed from the difficulties in understanding and utilising the theoretical underpinnings of frequentist (classical) statistics to interpret medical data in a coherent fashion without sacrificing mathematical rigour2. A fairly recent study by Polack et al. (2020) on the efficacy of the BNT162b2 Covid-19 vaccine (Pfizer) in a phase III clinical trial utilised a Bayesian framework that produced robust and rigorous results in a timely manner, results that withstood methodical scientific scrutiny from the medical community3. As a result, the vaccine entered the market in a timely fashion, helping to bring the formidable onslaught of the Covid-19 pandemic to a halt.

Before further exposition is made on Bayesian statistics, further elaboration on the Null Hypothesis Significance Testing (NHST) framework and its fallacies is warranted. The first pillar of NHST lies in the Popperian hypothetico-deductive method and falsificationist paradigm as solutions to the problem of demarcation (i.e. the problem of distinguishing between scientific (testable) and non-scientific (non-testable) conjectures) and the problem of inductive reasoning (i.e. the problem of drawing a conclusion by going from specific observations (e.g. experimental data) to the construction of a general theory explaining such observations)4,5. Sir Karl Raimund Popper (1902-1994) theorized that advancements in scientific discovery could only be made by making bold, sophisticated conjectures (hypotheses) followed by attempts at falsifying such conjectures, not by verifying them (verificationism)6. For instance, when dealing with the falsifiable scientific proposition “All swans are white”, we can never prove the veracity of such a statement based on our current scientific observations, since we would need to demonstrate that the colour of all swans at all global locations and across all time points (i.e. from the beginning of time until our very last day on Earth) is uniformly white before we could substantiate such a proposition6,7. However, by showing the existence of a single black swan (a piece of contradictory evidence), we can satisfactorily demonstrate that such a proposition is false. This hence becomes one of the mantras of NHST: we can either reject or fail to reject a null hypothesis based on our experimental observations (data). Accepting an alternative hypothesis when a null hypothesis is rejected is therefore an absolute anathema in Popperian falsificationism, since this is only logically plausible if verificationism is our guiding principle in evaluating the truth of a conjecture (hypothesis).

However, subsequent philosophers of science such as Thomas Kuhn, Paul Feyerabend and Larry Laudan criticised the falsifiability of a hypothesis as the sole concept and demarcating criterion for testing the truth of a hypothesis8-10. This criticism arose from the indisputable fact that no scientific theory or conjecture can be falsified with complete certainty. An accepted scientific theory can be considered an amalgamation of different critical and auxiliary hypotheses. Hence, a scientific theory might be discredited due to problems with its auxiliary hypotheses, not the critical foundations of the theory per se. With slight modifications to the auxiliary hypotheses or assumptions, such scientific theories might be salvaged later on. As an example, if we want to evaluate the law of gravity acting on an object on Earth, we may hypothesise that there are no exogenous forces exerting their influence on the object other than gravitation (an auxiliary hypothesis). However, on several occasions we might observe leaves flying upwards due to the effects of wind, and without modifying our auxiliary hypothesis, we may erroneously discredit the law of gravity11. Besides, observations might be incongruent with a scientific theory due to serious measurement errors committed during the data collection process. As a result, the universal truth of falsificationism is not beyond reproach, and these shortcomings indirectly enfeeble the first pillar of NHST.

The frequentist paradigm can be considered the second pillar of NHST. It was propounded by the works of three eminent statisticians: Sir Ronald Aylmer Fisher (1890-1962), Jerzy Neyman (1894-1981) and Egon Sharpe Pearson (1895-1980)12. RA Fisher introduced the null hypothesis as the falsifiable hypothesis and p<0.05 (i.e. fixing the type I error (α) rate at 0.05) as the threshold at which statistical significance is declared13. Later, in 1933, Jerzy Neyman and Egon S. Pearson introduced the type II error rate, the alternative hypothesis and study power as additions to the concepts of the null hypothesis and fixed significance threshold introduced by RA Fisher previously12. In this case, both statisticians sought to contrast two rival hypotheses, namely the null and alternative hypotheses, based on prespecified type I and type II error rates and study power. It is therefore sensible to regard NHST as a hybrid of two contrasting statistical paradigms, for the following two reasons. First, RA Fisher consistently detested the two notions that a fixed, routine significance threshold (e.g. p<0.05) should be used for every scientific experiment and that an explicit alternative hypothesis should be specified as the opposing hypothesis to the null hypothesis14. Second, both Jerzy Neyman and Egon S. Pearson rejected the sole use of the null hypothesis as the tested hypothesis in favour of evaluating two competing hypotheses based on binary statistical decision criteria (the Neyman-Pearson Lemma)12. The confusion between these two contrasting theoretical frameworks resulted in the birth and subsequent abuse of NHST.
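To make the Neyman-Pearson quantities concrete, here is a minimal Python sketch (an illustration, not taken from any of the cited works) that computes the power and type II error rate of a one-sided z-test; the significance threshold, effect size and sample size are assumed values chosen purely for demonstration.

```python
# Illustrative sketch of the Neyman-Pearson quantities for a one-sided z-test.
# alpha, effect size and sample size are assumed values, not taken from the article.
import numpy as np
from scipy import stats

alpha = 0.05            # prespecified type I error rate (significance threshold)
effect = 0.3            # assumed standardised effect size under the alternative hypothesis
n = 100                 # assumed sample size

z_crit = stats.norm.ppf(1 - alpha)                          # rejection cutoff under H0
power = 1 - stats.norm.cdf(z_crit - effect * np.sqrt(n))    # Pr(reject H0 | H1 is true)
beta = 1 - power                                            # type II error rate

print(f"Critical z-value: {z_crit:.3f}")
print(f"Study power     : {power:.3f}")
print(f"Type II error   : {beta:.3f}")
```

With these assumed values the study power comes out at roughly 0.91, i.e. the probability of correctly rejecting the null hypothesis when the alternative is in fact true.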

Hence, what are the common misconceptions of NHST? The most commonly abused notion within the NHST paradigm is that the p-value represents the probability of the null hypothesis being true or false given the data. The actual definition of the p-value is the probability of obtaining the data, or more extreme data, given that the null hypothesis is true, Pr(Data | H0 = TRUE)15-16. In this case, we are merely testing the compatibility of our observations (data) with a null hypothesis that has already been assumed to be true, not the truth of a hypothesis relative to another competing hypothesis given our data (Pr(H0 = TRUE | Data)). Consequently, this defeats the purpose of hypothesis testing itself, since under the NHST paradigm we are not testing two competing hypotheses but the compatibility of our data with a single “seemingly” true null hypothesis. The only way to circumvent this problem is to use Bayes’ theorem to compute the inverse probability of a hypothesis being true given the data.
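The numerical gap between the two quantities can be surprisingly large. Below is a minimal Python sketch (an illustration, not a result from the article or its references) contrasting the two-sided p-value for an observed z-statistic with the posterior probability Pr(H0 | data) obtained from Bayes’ theorem under a point alternative; the 50/50 prior, the effect size under H1 and the sample size are all assumed values.

```python
# Illustrative sketch contrasting the p-value, Pr(data or more extreme | H0),
# with the posterior probability Pr(H0 | data) from Bayes' theorem.
# The 50/50 prior, the point alternative and the sample size are assumed values.
import numpy as np
from scipy import stats

z_obs = 2.0              # observed z-statistic from a hypothetical study
n = 100                  # hypothetical sample size
prior_h0 = 0.5           # assumed prior probability that H0 is true
effect_h1 = 0.3          # assumed standardised effect size if H1 is true

# Frequentist p-value: probability of a statistic at least this extreme, assuming H0 is true.
p_value = 2 * (1 - stats.norm.cdf(abs(z_obs)))

# Likelihood of the observed z-statistic under each hypothesis
# (under the point alternative the z-statistic is centred at effect * sqrt(n)).
lik_h0 = stats.norm.pdf(z_obs, loc=0)
lik_h1 = stats.norm.pdf(z_obs, loc=effect_h1 * np.sqrt(n))

# Bayes' theorem: Pr(H0 | data) = Pr(data | H0) Pr(H0) / Pr(data)
post_h0 = lik_h0 * prior_h0 / (lik_h0 * prior_h0 + lik_h1 * (1 - prior_h0))

print(f"Two-sided p-value          : {p_value:.3f}")   # ~0.046
print(f"Posterior probability of H0: {post_h0:.3f}")   # ~0.18
```

Under these assumptions the p-value is about 0.05, yet the posterior probability that the null hypothesis is true remains about 0.18, underscoring that Pr(Data | H0) and Pr(H0 | Data) are not interchangeable.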

A similar controversy surrounds the frequentist confidence interval, an inferential measure of a parameter estimate obtained using the experimental data. A confidence interval (CI) is actually a measure of our confidence in the performance of the method used to construct the interval, not the probability that the true population parameter (e.g. the population mean difference, median survival time, etc.) lies between the lower and upper limits of the interval15. As an illustration, let’s say we construct a single 95% CI for the prevalence of diabetes in country X. Upon collecting the relevant data and constructing the 95% CI based on a Gaussian (normal) distribution assumption, we discover that the point estimate of diabetes prevalence in country X is 23.8% (95% CI: 12.7%, 38.9%). In this case, it is tempting to interpret the 95% CI as an interval within which there is a 95% probability that the prevalence of diabetes lies (i.e. between 12.7% and 38.9%). However, this is conceptually wrong, since the 95% CI is constructed based on the repeated sampling property of frequentist statistics. The correct interpretation is that if we constructed 100 different CIs using 100 different repeated samples from the population of country X, we would expect about 95 of those 100 CIs to contain the true population prevalence of diabetes. For a single 95% CI, however, the probability that the true population prevalence of diabetes lies between 12.7% and 38.9% is either 0 or 1, since we are now making our inference from a single interval. As a result, our single 95% CI measures neither our statistical uncertainty about the obtained estimate nor the probabilistic coverage of that single interval. Only the Bayesian 95% credible interval satisfactorily addresses both issues.
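The repeated-sampling interpretation can be demonstrated directly by simulation. Below is a minimal Python sketch (an illustration only; the true prevalence of 23.8%, the survey size and the number of repeats are assumed values chosen to mirror the hypothetical example above) that repeatedly draws samples, builds a normal-approximation 95% CI from each, and counts how often the intervals contain the truth.

```python
# Illustrative simulation of the repeated-sampling meaning of a 95% confidence interval.
# The true prevalence (23.8%), survey size and number of repeats are assumed values.
import numpy as np

rng = np.random.default_rng(2024)
true_prev = 0.238        # assumed true population prevalence of diabetes in country X
n = 400                  # assumed sample size per survey
n_repeats = 10_000       # number of hypothetical repeated surveys
z = 1.96                 # normal quantile for a 95% interval

covered = 0
for _ in range(n_repeats):
    sample = rng.binomial(1, true_prev, size=n)      # one simulated survey
    p_hat = sample.mean()
    se = np.sqrt(p_hat * (1 - p_hat) / n)            # Wald (normal-approximation) standard error
    lower, upper = p_hat - z * se, p_hat + z * se
    covered += (lower <= true_prev <= upper)

print(f"Proportion of intervals containing the true prevalence: {covered / n_repeats:.3f}")
# Close to 0.95: the 95% describes the long-run performance of the interval-building
# procedure, not the probability that any single interval contains the truth.
```

A Wald (normal-approximation) interval is used here to match the Gaussian assumption in the example; exact or Bayesian intervals would behave somewhat differently, but the coverage interpretation is the same.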

The discussion will be continued in Part 2 of this article, Bayesian Statistics: A Paradigm Shift.

 

References

  1. Adamina M, Tomlinson G, Guller U. Bayesian statistics in oncology: a guide for the clinical investigator. Cancer. 2009;115(23):5371-81. doi: 10.1002/cncr.24628. 
  2. Spiegelhalter DJ, Abrams KR, Myles JP. Bayesian Approaches to Clinical Trials and Health Care Evaluations. Chichester, England: Wiley; 2004.
  3. Polack FP, Thomas SJ, Kitchin N, Absalon J, Gurtman A, Lockhart S, et al. Safety and Efficacy of the BNT162b2 mRNA Covid-19 Vaccine. New England Journal of Medicine. 2020;383(27):2603-2615. doi: 10.1056/NEJMoa2034577.
  4. Wilkinson M. Testing the null hypothesis: the forgotten legacy of Karl Popper? Journal of Sports Science. 2013;31(9):919-20. doi: 10.1080/02640414.2012.753636.
  5. Orlitzky M. How Can Significance Tests Be Deinstitutionalized? Organizational Research Methods. 2012;15(2):199-228. doi:10.1177/1094428111428356
  6. Popper K. The logic of scientific discovery. New York, NY: Basic Books; 1959.
  7. Hakan T. Philosophy of science and black swan. Child’s Nervous System. 2021. https://doi.org/10.1007/s00381-020-05009-3
  8. Kuhn T. The Structure of Scientific Revolutions. Chicago, IL: University of Chicago Press; 1970.
  9. Feyerabend P. Against Method. London: Verso; 1975.
  10. Laudan L. Progress and its Problems: Towards a Theory of Scientific Growth. Berkeley, CA: University of California Press; 1977.
  11. Pearce N, Crawford-Brown D. Critical discussion in epidemiology: problems with the Popperian approach. Journal of Clinical Epidemiology. 1989;42(3):177-84. doi: 10.1016/0895-4356(89)90053-x.
  12. Gigerenzer G. Mindless statistics. The Journal of Socio-Economics. 2004;33(5):587-606.
  13. Fisher RA. Statistical Methods for Research Workers. Edinburgh, UK: Oliver and Boyd; 1925. p. 47.
  14. Fisher RA. Statistical Methods and Scientific Inference. Edinburgh, UK: Oliver and Boyd; 1956. pp. 44-45.
  15. Shaddick G, Zidek J. Spatio-Temporal Methods in Environmental Epidemiology. 1st Edition. Boca Raton, FL: CRC Press; 2016. pp. 61, 109-110.
  16. Goodman S. A dirty dozen: twelve p-value misconceptions. Seminars in Hematology. 2008;45(3):135-140.