Author Affiliations: Harborview Injury Prevention and Research Center (Dr Cummings), and School of Public Health (Drs Cummings and Koepsell), University of Washington, Seattle.
Copyright 2010 American Medical Association. All Rights Reserved. Applicable FARS/DFARS Restrictions Apply to Government Use.
Since 1988, the International Committee of Medical Journal Editors has used this language in their guidelines for authors: “When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid relying solely on statistical hypothesis testing, such as the use of P values, which fails to convey important quantitative information.”1 - 2 Hundreds of biomedical journals, including the Archives,3 endorse these guidelines. What concerns do editors have about P values and hypothesis testing?
Consider the hypothetical data in Table 1. Two randomized trials were conducted among hospital patients who had a urinary catheter inserted. Each trial compared an ordinary catheter with a catheter impregnated with an antibiotic drug. The study outcome was a new urinary infection while in the hospital. Usually (though not always) authors report 2-sided P values that test the hypothesis of no association between an exposure (such as a treatment) and an outcome (such as infection). Both trials followed this convention and reported P values of about .8. A 2-sided P of .8 for the null hypothesis means that if there were no treatment-related outcome difference in the population from which the study subjects were drawn,4 the probability of drawing subjects with the observed test statistic (a χ2 statistic in this example), or a more extreme test statistic, is 8 in 10. To put this more plainly, if less precisely, a large P value tells us the observed risk ratios of 0.86 and 1.03 may easily differ from the null risk ratio of 1.0 (no treatment difference) owing to what we loosely call chance—variation in infection frequency from one random sample to the next.
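As a rough illustration of how such a P value arises, the sketch below computes a risk ratio and a 2-sided P value from an uncorrected Pearson χ2 test for a 2 × 2 table. The counts are hypothetical and were chosen only so that the result resembles the P value of about .8 described above; they are not the Table 1 data, and the function name is ours.

```python
import math

def risk_ratio_and_p(events_treated, n_treated, events_control, n_control):
    """Return (risk ratio, 2-sided P) for a 2 x 2 table using Pearson's
    chi-square test (1 degree of freedom, no continuity correction)."""
    a, b = events_treated, n_treated - events_treated
    c, d = events_control, n_control - events_control
    n = a + b + c + d
    risk_ratio = (a / n_treated) / (c / n_control)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi-square > x) equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return risk_ratio, p_value

# Hypothetical counts (NOT the Table 1 data): 6 of 100 treated patients and
# 7 of 100 controls develop an infection.
rr, p = risk_ratio_and_p(6, 100, 7, 100)
print(f"risk ratio = {rr:.2f}, 2-sided P = {p:.2f}")
```

With these assumed counts the risk ratio differs from 1.0, yet the P value is large: a difference of this size could easily arise from sampling variation alone.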
Authors sometimes use wording that suggests that a large P value means that there is no exposure-related difference in the outcomes of the observed subjects and/or the unobserved people in the population from which study subjects were drawn. Such wording confuses lack of a statistically significant difference with lack of any difference.4 - 12
In Table 1, the investigators observed differences in cumulative infection incidence in both trials; the risk ratios were 0.86 and 1.03, not 1.0. This is a matter of description and has nothing to do with P values. While a large P value indicates that any observed difference may be due entirely to chance, it cannot tell us that it is due entirely or partly to chance. It is possible that both risk ratios in Table 1 are accurate estimates of the true association. We can never prove the null hypothesis; no study can exclude the possibility of some true, albeit possibly small, difference between 2 groups. A claim of no difference should not be based on a P value. Incorrect use of P values is not a fault of P values, but because P values are not needed for understanding most study results, misuse can be remedied by avoiding them entirely or not using them to interpret results.
If all else is equal, a P value will be smaller when there is (1) a larger observed difference between 2 groups, (2) a larger sample size, or (3) less variation within treatment group for a continuous exposure or outcome variable.4 ,13 P value size is also affected by the proportion of subjects who are exposed and the proportion with the outcome.
Imagine hypothetical data (Table 2) for 4 trials of a drug for high blood pressure. In trial 1, the average systolic pressure was 10 mm Hg lower among those treated compared with controls: P = .12. Compared with trial 1, the P value was smaller in trial 2 because the mean blood pressure difference was larger; the P value was smaller in trial 3 because the number of subjects was larger; and the P value was smaller in trial 4 because blood pressure varied less in that trial. Because P value size depends partly on sample size and within-group variation in exposure or outcome, it cannot be expected to measure the strength of an association, here reflected by the difference in average systolic blood pressure between treatment groups. Trials 1, 3, and 4 all found the same association between treatment and mean blood pressure, a 10 mm Hg difference, despite having different P values.
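The dependence of P on these 3 factors can be sketched with a simple normal-approximation test for a difference in means between 2 equal-sized groups. The inputs below are hypothetical values chosen by us for illustration; they are not the Table 2 data, and a real analysis might use a t test instead.

```python
import math

def two_sided_p(mean_diff, sd, n_per_group):
    """2-sided P for a difference in means between 2 equal-sized groups,
    using a normal (z) approximation with a common within-group SD."""
    se = sd * math.sqrt(2 / n_per_group)
    z = abs(mean_diff) / se
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))

# Hypothetical inputs (NOT the Table 2 data); differences and SDs in mm Hg.
base = two_sided_p(mean_diff=10, sd=20, n_per_group=20)
larger_difference = two_sided_p(mean_diff=15, sd=20, n_per_group=20)
larger_sample = two_sided_p(mean_diff=10, sd=20, n_per_group=80)
less_variation = two_sided_p(mean_diff=10, sd=10, n_per_group=20)

for label, p in [("base case", base),
                 ("larger difference", larger_difference),
                 ("larger sample", larger_sample),
                 ("less variation", less_variation)]:
    print(f"{label:>18}: P = {p:.4f}")
```

The last 2 scenarios keep the same 10 mm Hg difference as the base case but yield smaller P values, which is why P cannot serve as a measure of the strength of an association.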
Some limitations and misuses of P values can be avoided if authors instead report and interpret data using estimates of association with intervals that reflect the precision of those estimates.4 ,14 - 19 To describe the direction and size of an exposure-outcome association, authors can use risk ratios, rate ratios, hazard ratios, risk differences, rate differences, or mean differences. Odds ratios can be added to this list when they come from a case-control study; their value in studies in which they lack an interpretation as either a risk ratio or a rate ratio is a topic of debate.20 - 21
To account for chance (outcome variation between finite subject samples), confidence intervals can be used. P values and confidence intervals are related.4 ,22 - 24 In a randomized trial of drug D, death was less common among treated persons than controls (Table 3): risk ratio, 0.5; 95% confidence interval, 0.22-1.13; P = .09. Using the data from this trial, we can compute P for the hypothesis that any particular risk ratio is true in the population from which the study subjects came. In a plot of these P values (Figure 1), P is 1.0 for the hypothesis that the true risk ratio is 0.5; these data are perfectly compatible with this risk ratio. For the hypothesis that the true risk ratio is the null value of 1.0, P is .09. Risk ratios from 0.22 to 1.13 have 2-sided P values of .05 or greater; these are the 95% confidence limits for the effect of drug D.
Figure 1. Plot of 2-sided P values for a set of risk ratios based on data from the trial of drug D (Table 3). Each P value is for a hypothesis test that each risk ratio from 0.125 to 2.0 is true in the population from which the study subjects came. The vertical line marks the null risk ratio of 1.0 and the dashed horizontal line marks both the 95% confidence level and the P value of .05. The 95% confidence limits are risk ratios 0.22 and 1.13.
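A P value curve of this kind can be approximated from the reported estimate and interval alone. The sketch below assumes a Wald (normal approximation) test on the log risk ratio scale, with the standard error back-calculated from the reported 95% limits; this is our assumption, not necessarily the method used for the figure, so the results agree with the reported values only approximately.

```python
import math

def p_value_function(rr_observed, ci_low, ci_high, rr_hypothesized):
    """2-sided Wald P value for the hypothesis that the true risk ratio
    equals rr_hypothesized, given an estimate and its 95% interval."""
    # Standard error of log(RR), back-calculated from the 95% limits.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
    z = (math.log(rr_observed) - math.log(rr_hypothesized)) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Drug D: risk ratio 0.5, 95% confidence interval 0.22-1.13.
for rr0 in (0.125, 0.22, 0.5, 1.0, 1.13, 2.0):
    p = p_value_function(0.5, 0.22, 1.13, rr0)
    print(f"hypothesized risk ratio {rr0:>5}: P = {p:.3f}")
```

The curve peaks at P = 1.0 for the observed risk ratio of 0.5 and falls to roughly .05 at the 2 confidence limits, with a value near .1 at the null risk ratio of 1.0.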
When we see a confidence interval, what should we be confident about? If we test drug D (Table 3) an infinite number of times, drawing subjects randomly from the same population each time, the 95% confidence interval will include the true effect in 95% of the trials.4 For any 1 trial, we cannot be completely confident that the true value falls within the stated bounds; in 5% of the trials, the true risk ratio will lie outside of the 95% interval.
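This repeated-sampling interpretation can be checked by simulation. The sketch below draws many trials from a population in which the true risk ratio is known and counts how often a 95% Wald interval contains it. The sample size, control-group risk, true risk ratio, and random seed are all assumptions chosen for illustration; they are not taken from Table 3.

```python
import math
import random

def ci_covers_truth(n_per_arm, risk_control, true_rr, rng):
    """Simulate 1 trial; return True if the 95% Wald interval for the
    risk ratio contains true_rr."""
    risk_treated = risk_control * true_rr
    a = sum(rng.random() < risk_treated for _ in range(n_per_arm))
    c = sum(rng.random() < risk_control for _ in range(n_per_arm))
    if a == 0 or c == 0:
        return False  # interval undefined; counted as a miss (rare here)
    log_rr = math.log(a / c)  # equal arm sizes, so the denominators cancel
    se = math.sqrt(1 / a - 1 / n_per_arm + 1 / c - 1 / n_per_arm)
    low, high = log_rr - 1.96 * se, log_rr + 1.96 * se
    return low <= math.log(true_rr) <= high

def coverage(n_trials=2000, n_per_arm=500, risk_control=0.2,
             true_rr=0.5, seed=1):
    """Proportion of simulated trials whose interval contains true_rr."""
    rng = random.Random(seed)
    hits = sum(ci_covers_truth(n_per_arm, risk_control, true_rr, rng)
               for _ in range(n_trials))
    return hits / n_trials

print(f"proportion of intervals containing the true risk ratio: {coverage():.3f}")
```

Under these assumptions the proportion should be close to, though not exactly, 0.95, because the number of simulated trials is finite and the Wald interval is itself an approximation.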
We should not consider all risk ratios within the 95% interval to be equally compatible with the data and all estimates outside of the interval as excluded by the data.4 ,22 - 23 Confidence limits are just 2 points on a continuum and 95% limits are just a convention. A P value plot (Figure 1) peaks at the observed effect estimate. The confidence interval helps us visualize the continuous curves that fall from that peak to effect estimates with progressively less support from the data.
The studies of drugs A and B had the same P value (Table 1), but the evidence was quite different for the 2 drugs. For drug A, we might summarize by writing:
The cumulative incidence of infection was less among treated persons compared with controls, with a risk ratio 0.86, a 14% reduction. But the 95% confidence interval extended from a beneficial 0.31 to a hazardous 2.37. This trial provides little information about the utility of impregnating urinary catheters with drug A. If there is reason to think that A may be beneficial, a larger trial is needed. Our data can help to estimate the sample size for a larger study by furnishing estimates of the infection incidence to be expected among subjects with an ordinary catheter.
Note, however, that sample size or power calculations for the larger study should be based on the smallest difference in outcomes that would be of practical or theoretical importance, rather than on an imprecise preliminary estimate of association observed in a small pilot study.25 For drug B, we could write:
The risk ratio of 1.03 is consistent with a small harmful effect. The 95% confidence interval (0.79-1.34) suggests that a strong benefit is unlikely. B is probably not useful for prevention of urinary infection.
Imagine a larger, second trial of drug A (Table 4) with a statistically significant reduction in new urinary tract infections (risk ratio, 0.970; 95% confidence interval, 0.945-0.996; P = .02). The risk reductions within the 95% interval (Figure 2), 0.4% to 5.5%, are small. The 14% risk reduction in the first trial (Table 1) now seems an unlikely estimate of the true effect, because it lies far outside the 95% confidence interval observed in the second trial. In the new trial, 333 patients were treated for each prevented infection. The number needed to treat = 1/(risk among controls − risk among treated) = 1/(.1 − .097) = 333. Unless drug A is easy to administer, free of adverse effects, and nearly free of cost, it probably has little clinical utility.
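The arithmetic in this example can be reproduced with a few lines of code. The sketch below uses the risks quoted above (.1 among controls and .097 among treated patients); the function name is ours.

```python
def number_needed_to_treat(risk_control, risk_treated):
    """Number needed to treat = 1 / (risk among controls - risk among treated)."""
    risk_difference = risk_control - risk_treated
    if risk_difference <= 0:
        raise ValueError("treatment does not lower risk; NNT is undefined")
    return 1 / risk_difference

risk_control = 0.1
risk_treated = 0.097  # 0.1 multiplied by the risk ratio of 0.970
nnt = number_needed_to_treat(risk_control, risk_treated)
print(f"number needed to treat: {nnt:.0f}")

# Relative risk reductions at the 95% confidence limits of the risk ratio.
for limit in (0.945, 0.996):
    print(f"risk ratio {limit}: {100 * (1 - limit):.1f}% reduction")
```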
Plots of 2-sided P values for risk ratios from 2 studies. The solid P value curve is based on data from a small study of catheters impregnated with drug A (Table 1) and the dashed curve is based on data from a large study of drug A catheters (Table 4). The vertical line marks the null risk ratio of 1.0 and the dashed horizontal line marks both the 95% confidence level and the P value of .05.
How would you interpret the trials for drugs C and D (Table 3) for patients in septic shock? Both had a P value of .09. Imagine these drugs are inexpensive, free of adverse effects, and available for other purposes. If a patient had septic shock today, should C be given? Are additional trials of C indicated? What about D?
Sometimes a difference is not quite statistically significant (Table 3, for example). Authors or reviewers then may wonder: What power did the study have to detect the reported result? Power calculations are appropriate in the study design stage, but they are no longer relevant once results are known.24 ,26 - 30 More can be learned from the confidence interval, which reveals the range of possible effects that are reasonably compatible with the observed data.24 ,26 - 27
Meta-analysis requires that each study report the effect estimate and its standard error, or information that can be used to calculate these. Routinely providing estimates of association with confidence intervals makes future meta-analyses possible.
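To show why a reported interval is sufficient, the sketch below recovers the standard error of the log risk ratio from each 95% interval and then pools the 2 hypothetical trials of drug A (Table 1 and Table 4). The fixed-effect inverse-variance method is our choice for illustration, 1 common approach among several, and the function names are ours.

```python
import math

def log_rr_and_se(rr, ci_low, ci_high):
    """Recover log(RR) and its standard error from a 95% interval."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * 1.96)
    return math.log(rr), se

def pool_fixed_effect(studies):
    """Inverse-variance (fixed-effect) pooled risk ratio with 95% interval.
    studies is a list of (rr, ci_low, ci_high) tuples."""
    total_weight, weighted_sum = 0.0, 0.0
    for rr, low, high in studies:
        log_rr, se = log_rr_and_se(rr, low, high)
        weight = 1 / se ** 2  # more precise studies get more weight
        total_weight += weight
        weighted_sum += weight * log_rr
    pooled = weighted_sum / total_weight
    se_pooled = math.sqrt(1 / total_weight)
    return (math.exp(pooled),
            math.exp(pooled - 1.96 * se_pooled),
            math.exp(pooled + 1.96 * se_pooled))

# Drug A: small trial (Table 1) and large trial (Table 4).
drug_a_trials = [(0.86, 0.31, 2.37), (0.970, 0.945, 0.996)]
rr, low, high = pool_fixed_effect(drug_a_trials)
print(f"pooled risk ratio = {rr:.3f} (95% CI, {low:.3f}-{high:.3f})")
```

Because the large trial is far more precise, it dominates the pooled estimate; none of this would be possible had the trials reported only P values.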
Confidence intervals have limitations. First, confidence intervals should not be used to simply judge estimates as statistically significant or not. Second, confidence intervals tell us about how large a role chance may play, but they reveal nothing about bias. Interpretations of point estimates should consider possible bias. Last, confidence intervals are not the only way to estimate precision; Bayesian and likelihood intervals are available.4 ,31 - 34
Most Archives articles present effect estimates and confidence intervals, but some still use P values for interpretation. We suggest that articles would benefit by omitting most P values,35 reserving a few only for specialized purposes such as testing for a trend in outcome across several ordered exposure levels or testing the significance of differences in associations across levels of a third factor. An alternative would be to present some P values in tables, but not use them for interpretation or present them in the text. We also suggest that authors consider framing their research aim in their article's introduction in terms of estimating the size of an association rather than in terms of testing for the presence or absence of an association.
Correspondence: Dr Cummings, Department of Epidemiology, University of Washington, 250 Grandview Dr, Bishop, CA 93514 (peterc@u.washington.edu).
Author Contributions: Study concept and design: Cummings and Koepsell. Drafting of the manuscript: Cummings. Critical revision of the manuscript for important intellectual content: Cummings and Koepsell. Statistical analysis: Cummings and Koepsell. Administrative, technical, and material support: Cummings and Koepsell.
Financial Disclosure: None reported.