American Journal of Epidemiology Advance Access originally published online on July 10, 2008
American Journal of Epidemiology 2008 168(4):384-388; doi:10.1093/aje/kwn148
Invited Commentary: Evidence-based Evaluation of p Values and Bayes Factors
From the Division of Cancer Epidemiology and Genetics, National Cancer Institute, Rockville, MD
Correspondence to Dr. Hormuzd A. Katki, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 6120 Executive Blvd., Room 8014, Executive Plaza South, MSC 7244, Rockville, MD 20852-4910 (e-mail: katkih{at}mail.nih.gov).
Received for publication February 1, 2008. Accepted for publication March 13, 2008.
ABSTRACT
Despite clear deficiencies of the p value as a summary of statistical evidence, compelling alternatives with strong theoretical justification, such as the Bayes factor and the related likelihood ratio, are rarely presented in epidemiologic publications. Comparison of the historical performance of the p value with that of its competitors in the epidemiologic literature may help epidemiologists evaluate whether Bayes factors or likelihood ratios lead to conclusions more quickly and reliably than a p value, given the same data. Empirical evidence presented by Ioannidis (Am J Epidemiol 2008;168:374–83) demonstrates that findings with p values near 0.05 tend not to be confirmed in future studies. Similarly, Bayes factors interpret p values near 0.05 as having, at best, promising evidence against the null hypothesis. However, the different types of Bayes factors require empirical evaluation of their performance in practice. P values remain popular because minuscule p values are unlikely to mislead and p values do not require alternative hypotheses. Publishing p values near 0.05 could be considered a low-threshold screen to allow many (possibly null) results to be published for follow-up consideration. Meta-analyses and studies meant to decisively convince skeptics require a stronger standard (p values much below 0.05) and a Bayes factor to interpret the p value and to facilitate incorporation of background expertise necessary for drawing comprehensive conclusions.
Bayes theorem; empirical research; epidemiologic methods; meta-analysis; observation; statistics
Abbreviations: FDR, false discovery rate; SNP, single nucleotide polymorphism
INTRODUCTION
The ultimate judge of statistical procedures is performance in practice. Different research fields use different statistical frameworks, probably for a combination of historical and empirical performance reasons. For example, in quality control testing, the frequentist accept-reject hypothesis-testing framework seems a reasonable model for identifying defective items under repeated sampling of identical widgets, but scientific studies are not identical widgets cranked off an assembly line. On the one hand, physicists, motivated by their abundant prior knowledge of physical theory, do not seem to object to objective Bayesian inference (1, 2). On the other hand, geneticists routinely specify models under the null hypothesis of no genetic linkage between a marker and a trait but seem less comfortable specifying models under the complicated alternative hypotheses of linkage, ancestry, or population stratification (3), making Bayesian inference impossible. For inference on a null hypothesis without alternative hypotheses, R. A. Fisher, the geneticist who founded modern statistics, advocated the p value (4). The p value is the probability, under the null hypothesis, that one would observe data at least as extreme from the null as the observed data.
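As a concrete illustration of the definition above, the following minimal sketch computes a two-sided p value for a test statistic that is assumed to be standard normal under the null hypothesis. The function name `two_sided_p_value` and the choice of a z statistic are illustrative assumptions, not something specified in the commentary.

```python
from statistics import NormalDist


def two_sided_p_value(z_statistic: float) -> float:
    """Two-sided p value for a z statistic, assuming a standard normal null."""
    # Probability, under the null, of a draw at least as far from zero
    # (in either direction) as the observed statistic.
    return 2.0 * NormalDist().cdf(-abs(z_statistic))


# Illustrative z statistics (hypothetical values).
for z in (1.0, 1.96, 3.29):
    print(f"z = {z:.2f}  two-sided p = {two_sided_p_value(z):.4f}")
```

Note that the calculation uses only the null distribution; no alternative hypothesis enters anywhere, which is the property the rest of the commentary examines.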
The p value does not describe the extent of evidence in the data against a null hypothesis. The p value is merely hypothetical and cannot represent an observable error rate unless the null holds exactly; and because the p value assumes that the null is true, it cannot be seen as the probability that the null is true. Intuitively, the evidence in the data ought to rely on the data alone, not on "more extreme" data that one could have observed but did not. Finally, a small p value cannot be considered evidence against the null if the alternatives are even less likely. The interested reader should read the provocative and engaging book Statistical Evidence: A Likelihood Paradigm (5) for more about proper evidential reasoning.
Is there a better statistical framework for epidemiology and medical research than p values and confidence intervals? There has been much theoretical debate on the benefits of two similar alternative measures: likelihood ratios (5) and Bayes factors (see the article by Goodman (6), an excellent, accessible, and entertaining reference). In brief, a Bayes factor or likelihood ratio formally represents the evidence in the data about a null hypothesis versus a specified alternative hypothesis, quantifying the factor by which the odds that the null hypothesis is true change in light of the data. For example, a Bayes factor of 0.5 means that the posterior odds that the null hypothesis is true are half the prior odds. To compute the posterior odds that the null hypothesis is true, multiply the Bayes factor by the prior odds that the null hypothesis is true. This prior exploits biologic knowledge, medical judgments, and previous studies, in at least an informal way, to draw conclusions about hypotheses. Unlike a p value, the Bayes factor or likelihood ratio can be coupled with a scientist's expert knowledge to draw conclusions based on the data. There are many examples of situations where likelihood ratios and Bayes factors can help investigators intuitively resolve p value quandaries (5).
In addition, as I show below, a p value of 0.05 corresponds to a Bayes factor no smaller than 0.05 or 0.47, depending on the type of Bayes factor. Bayes factors this large provide surprisingly little evidence against the null hypothesis: A Bayes factor of 0.05 reduces a skeptical prior probability of the null from 75 percent to 13 percent; a Bayes factor of 0.47 reduces it from 75 percent to merely 59 percent. A 13 percent posterior probability of the null is not low enough for most people to conclude definitively that the null hypothesis is false, much less 59 percent. Bayes factors suggest that a p value of 0.05 is, at best, a promising result, so a claim of definitiveness requires a much smaller p value.
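The arithmetic behind these percentages can be sketched as follows: convert the prior probability of the null to odds, multiply by the Bayes factor, and convert back to a probability. The function name `posterior_probability` is an illustrative assumption; the 75 percent prior and the two Bayes factors are the values quoted in the text.

```python
def posterior_probability(prior_probability: float, bayes_factor: float) -> float:
    """Posterior probability of the null, given its prior probability and a Bayes factor."""
    if not 0.0 < prior_probability < 1.0:
        raise ValueError("prior_probability must lie strictly between 0 and 1")
    prior_odds = prior_probability / (1.0 - prior_probability)
    # The Bayes factor multiplies the prior odds of the null.
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Skeptical 75 percent prior on the null, with the two Bayes factors from the text.
for bayes_factor in (0.05, 0.47):
    posterior = posterior_probability(0.75, bayes_factor)
    print(f"Bayes factor {bayes_factor}: posterior probability of null = {posterior:.2f}")
```

With a 75 percent prior, the prior odds are 3, so a Bayes factor of 0.05 gives posterior odds of 0.15 and a Bayes factor of 0.47 gives posterior odds of 1.41.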
To date, these conceptual arguments have had little impact on the practice of epidemiology and medical research. Empirical studies evaluating the historical performance of p values and Bayes factors in epidemiology may help epidemiologists decide whether Bayes factors lead to conclusions more quickly and reliably than p values. The paper by Ioannidis (7) in this issue of the Journal is a first step in this direction.
OPERATING CHARACTERISTICS OF STATISTICAL PROCEDURES
Any statistical procedure, regardless of its theoretical motivation, can be regarded simply as an abstract set of rules whose operating characteristics must be evaluated with data (8). Regardless of whether a procedure is frequentist, Fisherian, or Bayesian, we can all agree that we want to use the procedure that is most often on the right side of the truth. Without strong empirical support, scientists lack practical reasons to fundamentally change how they do business.
Gathering this empirical support requires considering many important research questions from the past for which we now know the "truth" and asking whether the p value led to conclusions more slowly and less reliably than a Bayes factor would have, had it been used. This is a hard question to formulate and answer. Ioannidis (7) addressed a slightly easier question: How often are p values around 0.05 misleading, and would consideration of the Bayes factor help prevent mistaken understandings of the degree of evidence in the data? Ioannidis provides empirical support for believing that p values around 0.05 are rather weak evidence against the null (7).
In his table 4, Ioannidis notes that meta-analyses that had p values near 0.05, and thus a weak Bayes factor, tended not to retain significance in subsequent meta-analyses (7). Some of this effect is to be expected as regression to the mean (9, 10): Any study achieving a certain level of significance, when repeated, is more likely to be less rather than more significant the second time around. Regression to the mean also occurred for meta-analyses with decisive support. However, these meta-analyses usually retained strong Bayes factor support, providing empirical evidence that a p value far lower than 0.05 is needed to indicate definitive results.
BAYES FACTORS
Explaining Ioannidis's Bayes factor
While Ioannidis's Bayes factor (reference 7, equation 4) looks complicated, it is derived from a simple framework (11) with intuition that can be graphically illustrated. The framework behind the Bayes factor is that the risk estimate, $\hat{\theta}$, is distributed as $N(0, \sigma^2/m)$ under the null hypothesis, and under the alternative it is distributed as $N(\theta, \sigma^2/m)$ ($m$ is the sample size). For a fixed $\theta$, this simple Bayes factor (2, 11–14) is the ratio of the two densities evaluated at $\hat{\theta}$.
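The simple Bayes factor described above can be sketched in a few lines, assuming the risk estimate is normal with mean 0 under the null and a fixed mean under the alternative, with a common variance equal to a variance parameter divided by the sample size m. The function and argument names (`simple_bayes_factor`, `estimate`, `alternative_mean`, `sigma2`) are illustrative assumptions, as are the numbers in the usage lines.

```python
import math


def normal_pdf(x: float, mean: float, variance: float) -> float:
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)


def simple_bayes_factor(estimate: float, alternative_mean: float, sigma2: float, m: int) -> float:
    """Null density divided by alternative density, both evaluated at the estimate."""
    variance = sigma2 / m  # sampling variance shared by both hypotheses
    return normal_pdf(estimate, 0.0, variance) / normal_pdf(estimate, alternative_mean, variance)


# Hypothetical estimate lying 1.96 standard errors from the null (sigma2 = 1, m = 100).
print(simple_bayes_factor(0.196, 0.196, 1.0, 100))  # alternative centered on the estimate
print(simple_bayes_factor(0.196, 0.5, 1.0, 100))    # alternative far beyond the estimate
```

Values below 1 favor the alternative and values above 1 favor the null. Centering the alternative on the estimate itself gives the smallest value this ratio can take for that estimate.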
The three example risk estimates
Ioannidis's Bayes factor (reference 7, equation 4) does not force specification of $\theta$ but instead puts a prior distribution of $N(0, \sigma^2/n_0)$ on $\theta$. Ioannidis's normal-prior Bayes factor is shown graphically in figure 1, part B. Although the mean $\theta$ of the alternative is not specified, the distribution of the mean must be specified.
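A normal-prior Bayes factor of this kind can be sketched under the framework stated above. If the estimate is normal around the alternative mean with variance sigma2/m, and the alternative mean itself has a normal prior centered at 0 with variance sigma2/n0, then the estimate is marginally normal with mean 0 and variance sigma2/m + sigma2/n0 under the alternative. The code below takes the ratio of the null density to that marginal density. It is a sketch derived from those assumptions and is not claimed to reproduce equation 4 of reference 7 term for term; all names are illustrative.

```python
import math


def normal_prior_bayes_factor(estimate: float, sigma2: float, m: float, n0: float) -> float:
    """Null density over the prior-averaged alternative density at the estimate."""
    null_variance = sigma2 / m
    # Sampling variance plus prior variance of the alternative mean.
    alternative_variance = sigma2 / m + sigma2 / n0
    log_bayes_factor = (
        0.5 * math.log(alternative_variance / null_variance)
        - 0.5 * estimate ** 2 * (1.0 / null_variance - 1.0 / alternative_variance)
    )
    return math.exp(log_bayes_factor)


# Hypothetical estimate 1.96 standard errors from the null (sigma2 = 1, m = 100),
# evaluated under a tight, a moderate, and a very diffuse prior.
for n0 in (1000.0, 35.0, 0.01):
    print(f"n0 = {n0}: Bayes factor = {normal_prior_bayes_factor(0.196, 1.0, 100, n0):.3f}")
```

The result depends on n0: a very tight prior makes the two hypotheses nearly indistinguishable (Bayes factor near 1), while a very diffuse prior spreads the alternative so thinly that the null is favored.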
Another form of the Bayes factor
Some other popular measures used for determining whether an observed association is real are the false-positive report probability (12) (which was previously used by Ioannidis (15)), LOD (logarithm of odds) scores for genome-wide linkage studies, and the FDR. These measures have the same Bayesian interpretation and thus use the same Bayes factor (I call it the FDR Bayes factor) to weigh evidence (14). However, this FDR Bayes factor does not use the observed data but only considers whether the data fall into a "rejection region" (14). The FDR Bayes factor (figure 1, part C) is simply the p value divided by the "power" under the alternative hypothesis (14). The FDR Bayes factor explicitly incorporates the p value and may have other advantages (14). It has been rightly criticized for not using only the observed data (13, 14, 16). The FDR Bayes factor and the simple Bayes factor can differ (13, 14, 16), though not always (14). While the FDR Bayes factor has theoretical deficiencies, it has a few advantages, and given the popularity of the false-positive report probability and the FDR, it merits consideration in future empirical studies of the operational characteristics of p values and Bayes factors.
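The statement that the FDR Bayes factor is the p value divided by the power can be sketched as below. The power calculation is an added assumption for illustration: it treats the p value as the significance threshold of a two-sided z test and takes the alternative mean in standard-error units. The names `two_sided_power` and `fdr_bayes_factor` are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def two_sided_power(threshold: float, alternative_in_se: float) -> float:
    """Chance that a two-sided z test at the given threshold rejects under the alternative."""
    critical_value = STANDARD_NORMAL.inv_cdf(1.0 - threshold / 2.0)
    upper_tail = STANDARD_NORMAL.cdf(alternative_in_se - critical_value)
    lower_tail = STANDARD_NORMAL.cdf(-alternative_in_se - critical_value)
    return upper_tail + lower_tail


def fdr_bayes_factor(threshold: float, power: float) -> float:
    """Probability of landing in the rejection region under the null, over that under the alternative."""
    if not 0.0 < power <= 1.0:
        raise ValueError("power must lie in (0, 1]")
    return threshold / power


# Hypothetical alternative 2.8 standard errors from the null, threshold 0.05.
power = two_sided_power(0.05, 2.8)
print(f"power = {power:.3f}, FDR Bayes factor = {fdr_bayes_factor(0.05, power):.4f}")
```

Because power cannot exceed 1, this quantity can never fall below the p value itself.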
The Bayes factor is sensitive to the choice of alternative
The superiority of the Bayes factor over p values for measuring evidence is a consequence of considering alternative hypotheses. The difficulty in using the Bayes factor is the difficulty of specifying this alternative. An extreme choice of alternative can lead to uncomfortable results.
Table 5 in the paper by Wakefield (13) presents results on assessing whether single nucleotide polymorphisms (SNPs) in a study are associated with lung cancer. In this table, SNP D had by far the strongest effect size (relative risk = 0.2) and by far the smallest p value (p = 0.00001), yet the Bayes factor (calculable from the author's Bayesian false discovery probability) suggests little evidence against the null. Furthermore, SNPs A, B, and C, which had p values no smaller than 0.001 and effect sizes no stronger than a relative risk of 0.7, were rated as having stronger evidence against the null! This anomaly occurred because the author's chosen alternative placed 95 percent of its mass on relative risks between 1/3 and 3. A relative risk of 0.2 is rather unlikely under this alternative, so neither hypothesis is supported by observing a relative risk of 0.2. Although the truth is unknown, SNP D's having little evidence against the null cautions us that alternatives matter.
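To see how unlikely a relative risk of 0.2 is under such an alternative, the sketch below assumes the prior is normal on the log relative risk, centered at 0, and scaled so that 95 percent of its mass lies between relative risks of 1/3 and 3. The normal-on-log-scale form is an assumption made for illustration; the text states only the 95 percent range.

```python
import math
from statistics import NormalDist

# Prior standard deviation on the log relative risk such that the central
# 95 percent of the prior lies between log(1/3) and log(3).
prior_sd = math.log(3.0) / NormalDist().inv_cdf(0.975)
log_rr_prior = NormalDist(0.0, prior_sd)


def prior_probability_at_least_as_extreme(relative_risk: float) -> float:
    """Prior probability of a relative risk at least this far from 1 in either direction."""
    return 2.0 * log_rr_prior.cdf(-abs(math.log(relative_risk)))


print(f"prior sd of log relative risk: {prior_sd:.3f}")
print(f"prior probability of a relative risk at least as extreme as 0.2: "
      f"{prior_probability_at_least_as_extreme(0.2):.4f}")
```

Under this assumed prior, a relative risk of 0.2 sits nearly three prior standard deviations from the center, which illustrates why an alternative of this shape offers little support for so strong an effect.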
For another example, instead of a normal prior on the mean of the alternative, a flat prior could be chosen (17, 18). The flat prior is reasonable when risks cannot be extreme, allowing specification of minimum and maximum risks, and when it is undesirable to consider certain risks to be more likely than others. The Bayes factor from a flat prior can differ from the normal-prior Bayes factor in simple situations (18).
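A flat-prior Bayes factor can be sketched under the same normal sampling model, assuming the alternative mean is uniform between a stated minimum and maximum. The density of the estimate under the alternative is then the average of the normal density over that interval. This is a sketch under those assumptions and may differ in detail from the forms used in references 17 and 18; the names and numbers are illustrative.

```python
from statistics import NormalDist


def flat_prior_bayes_factor(estimate: float, standard_error: float,
                            lower: float, upper: float) -> float:
    """Null density over the alternative density averaged across a uniform prior on [lower, upper]."""
    if not lower < upper:
        raise ValueError("lower must be smaller than upper")
    sampling_error = NormalDist(0.0, standard_error)
    null_density = sampling_error.pdf(estimate)
    # Integral of the normal density over alternative means in [lower, upper],
    # divided by the interval width (the uniform prior density).
    alternative_density = (
        sampling_error.cdf(estimate - lower) - sampling_error.cdf(estimate - upper)
    ) / (upper - lower)
    return null_density / alternative_density


# Hypothetical estimate 1.96 standard errors from the null, with a wide and a narrow flat prior.
print(flat_prior_bayes_factor(0.196, 0.1, -1.0, 1.0))
print(flat_prior_bayes_factor(0.196, 0.1, -0.4, 0.4))
```

Widening the interval spreads the alternative more thinly and pushes the Bayes factor toward the null, so the stated minimum and maximum risks matter as much as the choice of a flat shape.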
Can the minimum Bayes factor over all alternatives help?
To avoid specifying alternatives, one could compute the minimum possible Bayes factor over all alternatives in the class that each Bayes factor considers (11, 19). For example, the minimum normal-prior Bayes factor for each p value is the lower "border" that one's eye picks out in Ioannidis's figures (7). The minimum Bayes factors for certain p values are presented in table 1. Not surprisingly, the minimum Bayes factors associated with p = 0.05 are rather modest: none is smaller than 0.05, and the minimum for Ioannidis's normal-prior Bayes factor is 0.47. For p = 0.001, the minimum Bayes factors of 0.02 and 0.001 reduce a skeptical 75 percent prior probability to about 6 percent and 0.3 percent, respectively, much closer to a standard of definitiveness.
While the minimum simple Bayes factor and the FDR Bayes factor are reasonably close, they differ, typically by factors of 3–6, from the minimum normal-prior Bayes factor and the symmetric-decreasing-prior Bayes factor. These differences seem important enough that, although there may be theoretical reasons to prefer one over the other, empirical studies may shed light on the operational characteristics of the minimum Bayes factors in epidemiologic practice.
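Three of the minimum Bayes factors discussed in this section can be computed directly from a p value under the normal framework described earlier. The sketch assumes a two-sided p value from a z statistic. The closed forms are derived from that assumption: the simple Bayes factor is smallest when the alternative mean equals the estimate, the normal-prior Bayes factor is smallest when the prior variance is chosen optimally, and the FDR Bayes factor is smallest when power equals 1. The function names are illustrative, and the symmetric-decreasing-prior Bayes factor is not covered here.

```python
import math
from statistics import NormalDist


def z_from_p(p_value: float) -> float:
    """Absolute z statistic corresponding to a two-sided p value."""
    return NormalDist().inv_cdf(1.0 - p_value / 2.0)


def min_simple_bayes_factor(p_value: float) -> float:
    """Smallest simple Bayes factor: alternative mean set equal to the estimate."""
    z = z_from_p(p_value)
    return math.exp(-z * z / 2.0)


def min_normal_prior_bayes_factor(p_value: float) -> float:
    """Smallest normal-prior Bayes factor over all prior variances."""
    z = z_from_p(p_value)
    if z <= 1.0:
        return 1.0  # no normal prior centered at the null can favor the alternative
    return math.sqrt(math.e) * z * math.exp(-z * z / 2.0)


def min_fdr_bayes_factor(p_value: float) -> float:
    """Smallest FDR Bayes factor: p value divided by a power of 1."""
    return p_value


for p in (0.05, 0.01, 0.001):
    print(f"p = {p}: simple {min_simple_bayes_factor(p):.3f}, "
          f"normal-prior {min_normal_prior_bayes_factor(p):.3f}, "
          f"FDR {min_fdr_bayes_factor(p):.3f}")
```

For p = 0.05 the normal-prior minimum comes out at about 0.47, and for p = 0.001 at about 0.02, matching the values quoted in the text.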
FINAL THOUGHTS
There are good reasons for the continued popularity of p values. The p value does not require a specified alternative. More importantly, utterly minuscule p values can represent strong evidence, except against unrealistically extreme alternatives (14). Although a p value of 0.05 provides rather weak evidence, demanding a stronger publication cutoff will exacerbate publication bias. The low p = 0.05 standard could be considered a "low-threshold" screen for publishing (ideally, research should be publishable and publicly available regardless of p value). For studies that aim to be definitive even for skeptics, such as large trials or meta-analyses, investigators should demand a higher standard than p = 0.05 to claim definitiveness.
P values that are not minuscule are difficult to interpret. When the p value is moderate, supplementing the p value with a Bayes factor is especially important for demonstrating the level of evidence against realistic alternative hypotheses. Furthermore, reporting a Bayes factor facilitates the incorporation, at least informally, of subject-matter expertise as prior beliefs to understand the overall impact of a study, or the entire body of literature, on the issue. Encouraging and facilitating this incorporation is a major advantage of using the Bayes factor.
The Bayes factor isn't perfect, but it suffers from fewer conceptual drawbacks than the p value. If scientists are a pragmatic lot who prefer to use whatever "appears to work" empirically, the Bayes factor could gain widespread acceptance when empirical studies show that it has better operational characteristics than p values. The paper by Ioannidis (7) is a step in this direction.
ACKNOWLEDGMENTS
This research was supported by the Intramural Research Program of the National Institutes of Health/National Cancer Institute.
The author thanks Dr. Sholom Wacholder for his comments on and careful reading of earlier versions of this manuscript.
Conflict of interest: none declared.
REFERENCES
1. Jaynes ET. Probability theory: the logic of science. (2003) Cambridge, United Kingdom: Cambridge University Press.
2. Jeffreys H. Theory of probability. (1961) 3rd ed. Oxford, United Kingdom: Clarendon Press.
3. Weinberg CR. It's time to rehabilitate the P value. Epidemiology (2001) 12:288–90.
4. Fisher RA. Statistical methods and scientific inference. (1956) 42. Edinburgh, United Kingdom: Oliver and Boyd.
5. Royall RM. Statistical evidence: a likelihood paradigm. (1997) London, United Kingdom: Chapman and Hall Ltd.
6. Goodman SN. Introduction to Bayesian methods I: measuring the strength of evidence. Clin Trials (2005) 2:282–90.
7. Ioannidis JPA. Effect of formal statistical significance on the credibility of observational associations. Am J Epidemiol (2008) 168:374–83.
8. Alderson NE, Campbell G, D'Agostino R, et al. Statistical issues: a roundtable discussion. Clin Trials (2005) 2:364–72.
9. Friedman M. Do old fallacies ever die? J Econ Lit (1992) 30:2129–32.
10. Yu K, Chatterjee N, Wheeler W, et al. Flexible design for following up positive findings. Am J Hum Genet (2007) 81:540–51.
11. Edwards W, Lindman H, Savage LJ. Bayesian statistical inference for psychological research. Psychol Rev (1963) 70:193–242.
12. Wacholder S, Chanock S, Garcia-Closas M, et al. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J Natl Cancer Inst (2004) 96:434–42.
13. Wakefield J. A Bayesian measure of the probability of false discovery in genetic epidemiology studies. Am J Hum Genet (2007) 81:208–27.
14. Katki HA. Addressing a controversy about the meaning of p values by using false discovery rates. In: 2007 Proceedings of the American Statistical Association, Section on Bayesian Statistical Science. (CD-ROM). (2007) Alexandria, VA: American Statistical Association. 1199–205.
15. Ioannidis JPA. Why most published research findings are false. PLoS Med (2005) 2:e124. (Electronic article).
16. Goodman S, Greenland S. Assessing the unreliability of the medical literature: a response to "Why most published research findings are false." (2007) Baltimore, MD: Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University. (Department of Biostatistics Working Paper 135). (http://www.bepress.com/jhubiostat/paper135).
17. Lindley D. A statistical paradox. Biometrika (1957) 44:187–92.
18. Katki HA. Addressing a controversy about the meaning of p values by using false discovery rates. In: Presented at the Joint Statistical Meetings. Salt Lake City, Utah. July 29–August 2, 2007.
19. Goodman SN. Of P values and Bayes: a modest proposal. Epidemiology (2001) 12:295–7.
20. Sellke T, Bayarri MJ, Berger JO. Calibration of p values for testing precise null hypotheses. Am Stat (2001) 55:62–71.