Commentary

P Values vs Estimates of Association With Confidence Intervals

Peter Cummings, MD, MPH; Thomas D. Koepsell, MD, MPH

Author Affiliations: Harborview Injury Prevention and Research Center (Dr Cummings), and School of Public Health (Drs Cummings and Koepsell), University of Washington, Seattle.


Copyright 2010 American Medical Association. All Rights Reserved. Applicable FARS/DFARS Restrictions Apply to Government Use.

Arch Pediatr Adolesc Med. 2010;164(2):193-196. doi:10.1001/archpediatrics.2009.266

Since 1988, the International Committee of Medical Journal Editors has used this language in their guidelines for authors: “When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid relying solely on statistical hypothesis testing, such as the use of P values, which fails to convey important quantitative information.”1,2 Hundreds of biomedical journals, including the Archives,3 endorse these guidelines. What concerns do editors have about P values and hypothesis testing?

Consider the hypothetical data in Table 1. Two randomized trials were conducted among hospital patients who had a urinary catheter inserted. Each trial compared an ordinary catheter with a catheter impregnated with an antibiotic drug. The study outcome was a new urinary infection while in the hospital. Usually (though not always) authors report 2-sided P values that test the hypothesis of no association between an exposure (such as a treatment) and an outcome (such as infection). Both trials followed this convention and reported P values of about .8. A 2-sided P of .8 for the null hypothesis means that if there were no treatment-related outcome difference in the population from which the study subjects were drawn,4 the probability of drawing subjects with the observed test statistic (a χ² statistic in this example), or a more extreme test statistic, is 8 in 10. To put this more plainly, if less precisely, a large P value tells us the observed risk ratios of 0.86 and 1.03 may easily differ from the null risk ratio of 1.0 (no treatment difference) owing to what we loosely call chance: variation in infection frequency from one random sample to the next.

Table 1. Hypothetical Data for 2 Randomized Trials of Urinary Catheter Type and the Outcome of a New Urinary Tract Infection While Hospitalized for Other Illness

Authors sometimes use wording that suggests that a large P value means that there is no exposure-related difference in the outcomes of the observed subjects and/or the unobserved people in the population from which study subjects were drawn. Such wording confuses lack of a statistically significant difference with lack of any difference.4-12

In Table 1, the investigators observed differences in cumulative infection incidence in both trials; the risk ratios were 0.86 and 1.03, not 1.0. This is a matter of description and has nothing to do with P values. While a large P value indicates that any observed difference may be due entirely to chance, it cannot tell us that the difference is in fact due, entirely or partly, to chance. It is possible that both risk ratios in Table 1 are accurate estimates of the true association. We can never prove the null hypothesis; no study can exclude the possibility of some true, albeit possibly small, difference between 2 groups. A claim of no difference should not be based on a P value. Incorrect use of P values is not a fault of P values, but because P values are not needed for understanding most study results, misuse can be remedied by avoiding them entirely or by not using them to interpret results.

If all else is equal, a P value will be smaller when there is (1) a larger observed difference between 2 groups, (2) a larger sample size, or (3) less variation within treatment group for a continuous exposure or outcome variable.4,13 P value size is also affected by the proportion of subjects who are exposed and the proportion with the outcome.

Imagine hypothetical data (Table 2) for 4 trials of a drug for high blood pressure. In trial 1, the average systolic pressure was 10 mm Hg lower among those treated compared with controls: P = .12. Compared with trial 1, the P value was smaller in trial 2 because the mean blood pressure difference was larger; the P value was smaller in trial 3 because the number of subjects was larger; and the P value was smaller in trial 4 because blood pressure varied less in that trial. Because P value size depends partly on sample size and within-group variation in exposure or outcome, it cannot be expected to measure the strength of an association, here reflected by the difference in average systolic blood pressure between treatment groups. Trials 1, 3, and 4 all found the same association between treatment and mean blood pressure, a 10 mm Hg difference, despite having different P values.
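The dependence of P on difference size, sample size, and within-group variation can be sketched numerically. The example below is only an illustration: the standard deviations and group sizes are invented for the purpose (they are not the Table 2 data), the groups are assumed to be of equal size with a common standard deviation, and a normal approximation stands in for whatever test the hypothetical trials used.

```python
import math


def two_sided_p(mean_diff, sd, n_per_group):
    """Two-sided P for a difference in means between 2 equal-sized groups.

    Uses a normal approximation with a common within-group SD.
    """
    se = sd * math.sqrt(2.0 / n_per_group)  # standard error of the difference
    z = mean_diff / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))


# Illustrative inputs only; these are NOT the values in Table 2.
baseline = two_sided_p(mean_diff=10, sd=20, n_per_group=20)
larger_difference = two_sided_p(mean_diff=20, sd=20, n_per_group=20)
larger_sample = two_sided_p(mean_diff=10, sd=20, n_per_group=80)
less_variation = two_sided_p(mean_diff=10, sd=10, n_per_group=20)

for label, p in [
    ("baseline", baseline),
    ("larger difference", larger_difference),
    ("larger sample", larger_sample),
    ("less variation", less_variation),
]:
    print(f"{label}: P = {p:.3f}")
```

Three of the four scenarios share the same 10 mm Hg difference yet give different P values, which is the reason P cannot serve as a measure of the strength of an association.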

Table 2. Hypothetical Data for 4 Clinical Trials of Drugs to Reduce Systolic Blood Pressure

Some limitations and misuses of P values can be avoided if authors instead report and interpret data using estimates of association with intervals that reflect the precision of those estimates.4,14-19 To describe the direction and size of an exposure-outcome association, authors can use risk ratios, rate ratios, hazard ratios, risk differences, rate differences, or mean differences. Odds ratios can be added to this list when they come from a case-control study; their value in studies in which they lack an interpretation as either a risk ratio or a rate ratio is a topic of debate.20,21

To account for chance (outcome variation between finite subject samples), confidence intervals can be used. P values and confidence intervals are related.4,22-24 In a randomized trial of drug D, death was less common among treated persons than controls (Table 3): risk ratio, 0.5; 95% confidence interval, 0.22-1.13; P = .09. Using the data from this trial, we can compute P for the hypothesis that any particular risk ratio is true in the population from which the study subjects came. In a plot of these P values (Figure 1), P is 1.0 for the hypothesis that the true risk ratio is 0.5; these data are perfectly compatible with this risk ratio. For the hypothesis that the true risk ratio is the null value of 1.0, P is .09. Risk ratios from 0.22 to 1.13 have 2-sided P values of .05 or greater; these are the 95% confidence limits for the effect of drug D.
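The link between the interval and the P values can be sketched from the reported numbers alone. The code below is a minimal approximation, not the method used for the article's figure: it assumes a Wald (normal) approximation on the log risk ratio scale and recovers the standard error from the rounded 95% limits, so its results agree with the reported values only approximately. The function names are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a 2-sided 95% interval


def se_log_rr(lower, upper):
    """Approximate standard error of log(RR) recovered from 95% limits."""
    return (math.log(upper) - math.log(lower)) / (2 * Z_95)


def p_for_hypothesis(rr_observed, se, rr_hypothesized):
    """Two-sided P for the hypothesis that the true risk ratio is rr_hypothesized."""
    z = (math.log(rr_observed) - math.log(rr_hypothesized)) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


# Reported results for drug D: risk ratio 0.5, 95% CI 0.22-1.13.
se_d = se_log_rr(0.22, 1.13)
p_curve = {rr: p_for_hypothesis(0.5, se_d, rr) for rr in (0.22, 0.5, 1.0, 1.13)}

for rr, p in p_curve.items():
    print(f"hypothesized RR {rr}: P = {p:.3f}")
```

Under these assumptions, P is 1.0 at the observed risk ratio of 0.5, close to .05 at each confidence limit, and near the reported .09 at the null value of 1.0. Evaluating the function over a grid of risk ratios would trace a curve of the kind described for Figure 1.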

Figure 1.

Plot of 2-sided P values for a set of risk ratios based on data from the trial of drug D (Table 3). Each P value is for a hypothesis test that each risk ratio from 0.125 to 2.0 is true in the population from which the study subjects came. The vertical line marks the null risk ratio of 1.0 and the dashed horizontal line marks both the 95% confidence level and the P value of .05. The 95% confidence limits are risk ratios 0.22 and 1.13.

Table 3. Hypothetical Data From the First 2 Randomized Controlled Trials of Treatment and the Outcome of Death Among Patients With Septic Shock

When we see a confidence interval, what should we be confident about? If we test drug D (Table 3) an infinite number of times, drawing subjects randomly from the same population each time, the 95% confidence interval will include the true effect in 95% of the trials.4 For any 1 trial, we cannot be completely confident that the true value falls within the stated bounds; in 5% of the trials, the true risk ratio will lie outside of the 95% interval.
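This repeated-sampling interpretation can be checked with a small simulation. The sketch below is illustrative only: the true risk ratio, the control-group risk, the arm size, and the number of simulated trials are all assumed values chosen for the example (they are not taken from Table 3), and the interval is a Wald interval on the log risk ratio scale. A finite number of simulated trials stands in for the infinite number in the definition.

```python
import math
import random

Z_95 = 1.959964  # standard normal quantile for a 2-sided 95% interval


def simulate_events(rng, n_per_arm, risk):
    """Number of outcome events among n_per_arm subjects with the given risk."""
    return sum(rng.random() < risk for _ in range(n_per_arm))


def wald_interval(treated_events, control_events, n_per_arm):
    """95% Wald interval for the risk ratio with equal arm sizes."""
    log_rr = math.log(treated_events / control_events)  # arm sizes cancel
    se = math.sqrt(
        1 / treated_events - 1 / n_per_arm + 1 / control_events - 1 / n_per_arm
    )
    return math.exp(log_rr - Z_95 * se), math.exp(log_rr + Z_95 * se)


def coverage(true_rr=0.5, risk_control=0.2, n_per_arm=500, n_trials=2000, seed=1):
    """Fraction of simulated trials whose 95% interval contains true_rr."""
    rng = random.Random(seed)
    covered = 0
    usable = 0
    for _ in range(n_trials):
        treated = simulate_events(rng, n_per_arm, true_rr * risk_control)
        control = simulate_events(rng, n_per_arm, risk_control)
        if treated == 0 or control == 0:
            continue  # the log risk ratio is undefined; skip this trial
        usable += 1
        lower, upper = wald_interval(treated, control, n_per_arm)
        covered += lower <= true_rr <= upper
    return covered / usable


observed_coverage = coverage()
print(f"Share of intervals containing the true risk ratio: {observed_coverage:.3f}")
```

The share should fall close to 0.95, with the remainder being trials whose interval missed the true risk ratio. Nothing in any single simulated trial reveals whether its own interval is one of the misses.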

We should not consider all risk ratios within the 95% interval to be equally compatible with the data and all estimates outside of the interval as excluded by the data.4,22,23 Confidence limits are just 2 points on a continuum and 95% limits are just a convention. A P value plot (Figure 1) peaks at the observed effect estimate. The confidence interval helps us visualize the continuous curves that fall from that peak to effect estimates with progressively less support from the data.

The studies of drugs A and B had the same P value (Table 1), but the evidence was quite different for the 2 drugs. For drug A, we might summarize by writing:

The cumulative incidence of infection was less among treated persons compared with controls, with a risk ratio of 0.86, a 14% reduction. But the 95% confidence interval extended from a beneficial 0.31 to a hazardous 2.37. This trial provides little information about the utility of impregnating urinary catheters with drug A. If there is reason to think that A may be beneficial, a larger trial is needed. Our data can help to estimate the sample size for a larger study by furnishing estimates of the infection incidence to be expected among subjects with an ordinary catheter.

Note, however, that sample size or power calculations for the larger study should be based on the smallest difference in outcomes that would be of practical or theoretical importance, rather than on an imprecise preliminary estimate of association observed in a small pilot study.25 For drug B, we could write:

The risk ratio of 1.03 is consistent with a small harmful effect. The 95% confidence interval (0.79-1.34) suggests that a strong benefit is unlikely. B is probably not useful for prevention of urinary infection.
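The contrast between the 2 summaries can be reproduced from the reported estimates and intervals. The sketch below assumes a Wald approximation on the log risk ratio scale and recovers each standard error from the rounded confidence limits, so the P values are approximate. The `summarize` helper and the `limit_ratio` measure of precision are hypothetical names introduced for this illustration.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a 2-sided 95% interval


def summarize(rr, lower, upper):
    """Approximate null P value and interval width for a reported risk ratio."""
    se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    z = math.log(rr) / se  # test of the null risk ratio of 1.0
    p_null = math.erfc(abs(z) / math.sqrt(2.0))
    # Ratio of upper to lower limit: larger values mean a less precise estimate.
    return {"p_null": p_null, "limit_ratio": upper / lower}


drug_a = summarize(0.86, 0.31, 2.37)
drug_b = summarize(1.03, 0.79, 1.34)

for name, result in [("drug A", drug_a), ("drug B", drug_b)]:
    print(f"{name}: P = {result['p_null']:.2f}, "
          f"upper/lower limit ratio = {result['limit_ratio']:.1f}")
```

Both P values come out at roughly .8, but the limits for drug A span a far wider range than those for drug B. The intervals make the difference in precision visible, while the P values hide it.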

Imagine a larger, second trial of drug A (Table 4) with a statistically significant reduction in new urinary tract infections (risk ratio, 0.970; 95% confidence interval, 0.945-0.996; P = .02). The risk reductions within the 95% interval (Figure 2), 0.4% to 5.5%, are small. The 14% risk reduction in the first trial (Table 1) now seems an unlikely estimate of the true effect, because it lies far outside the 95% confidence interval observed in the second trial. In the new trial, 333 patients were treated for each prevented infection. The number needed to treat = 1/(risk among controls − risk among treated) = 1/(.1 − .097) = 333. Unless drug A is easy to administer, free of adverse effects, and nearly free of cost, it probably has little clinical utility.
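The arithmetic in this paragraph can be laid out step by step. The first 2 results below restate figures given in the text. The last is an extension that is not in the article: it applies the confidence limits to the control-group risk of 0.1, treating that risk as fixed, to show the range of numbers needed to treat implied by the interval.

```python
def number_needed_to_treat(risk_control, risk_treated):
    """Patients who must be treated to prevent 1 outcome event."""
    return 1.0 / (risk_control - risk_treated)


risk_control = 0.1  # risk among controls, from the text
rr, rr_lower, rr_upper = 0.970, 0.945, 0.996  # estimate and 95% limits

# Point estimate: 1 / (0.1 - 0.097)
nnt = number_needed_to_treat(risk_control, risk_control * rr)

# Percentage risk reductions at the upper and lower confidence limits.
reduction_pct = [(1 - limit) * 100 for limit in (rr_upper, rr_lower)]

# Assumption for illustration: control risk held at 0.1 while the
# risk ratio is set to each confidence limit in turn.
nnt_range = [
    number_needed_to_treat(risk_control, risk_control * limit)
    for limit in (rr_lower, rr_upper)
]

print(f"NNT at the point estimate: {nnt:.0f}")
print(f"Risk reduction across the interval: "
      f"{reduction_pct[0]:.1f}% to {reduction_pct[1]:.1f}%")
print(f"NNT across the interval: {nnt_range[0]:.0f} to {nnt_range[1]:.0f}")
```

Even at the more favorable confidence limit, well over 100 patients would need treatment to prevent 1 infection under these assumptions.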

Figure 2.

Plots of 2-sided P values for risk ratios from 2 studies. The solid P value curve is based on data from a small study of catheters impregnated with drug A (Table 1) and the dashed curve is based on data from a large study of drug A catheters (Table 4). The vertical line marks the null risk ratio of 1.0 and the dashed horizontal line marks both the 95% confidence level and the P value of .05.

Table 4. Hypothetical Data for a Randomized Trial of Urinary Catheters Impregnated With Drug A and the Outcome of a New Urinary Tract Infection While Hospitalized for Other Illness

How would you interpret the trials for drugs C and D (Table 3) for patients in septic shock? Both had a P value of .09. Imagine these drugs are inexpensive, free of adverse effects, and available for other purposes. If a patient had septic shock today, should C be given? Are additional trials of C indicated? What about D?

Sometimes a difference is not quite statistically significant (Table 3, for example). Authors or reviewers then may wonder: What power did the study have to detect the reported result? Power calculations are appropriate in the study design stage, but they are no longer relevant once results are known.24,26-30 More can be learned from the confidence interval, which reveals the range of possible effects that are reasonably compatible with the observed data.24,26,27

Meta-analysis requires that each study report the effect estimate and its standard error, or information that can be used to calculate estimates of these. Routinely providing estimates of association with confidence intervals makes future meta-analyses possible.
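As an illustration of why a reported estimate and interval suffice, the sketch below pools the 2 hypothetical trials of drug A using only the figures quoted in the text. It is one possible approach, not one the article prescribes: it assumes a fixed-effect, inverse-variance model on the log risk ratio scale and recovers each standard error from the rounded 95% limits. The function names are hypothetical.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a 2-sided 95% interval


def log_rr_and_se(rr, lower, upper):
    """Log risk ratio and its approximate standard error from 95% limits."""
    se = (math.log(upper) - math.log(lower)) / (2 * Z_95)
    return math.log(rr), se


def pool_fixed_effect(studies):
    """Inverse-variance pooled risk ratio with a 95% interval."""
    total_weight = 0.0
    weighted_sum = 0.0
    for rr, lower, upper in studies:
        log_rr, se = log_rr_and_se(rr, lower, upper)
        weight = 1.0 / se ** 2  # more precise studies get more weight
        total_weight += weight
        weighted_sum += weight * log_rr
    pooled_log = weighted_sum / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return (
        math.exp(pooled_log),
        math.exp(pooled_log - Z_95 * pooled_se),
        math.exp(pooled_log + Z_95 * pooled_se),
    )


# (risk ratio, lower limit, upper limit) for the first and second trials of drug A
drug_a_trials = [(0.86, 0.31, 2.37), (0.970, 0.945, 0.996)]
pooled_rr, pooled_lower, pooled_upper = pool_fixed_effect(drug_a_trials)

print(f"Pooled risk ratio: {pooled_rr:.3f} "
      f"(95% CI, {pooled_lower:.3f}-{pooled_upper:.3f})")
```

The pooled estimate stays close to that of the large trial, because the small trial's wide interval gives it very little weight. A report that gave only a P value for each trial would not allow this calculation.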

Confidence intervals have limitations. First, confidence intervals should not be used to simply judge estimates as statistically significant or not. Second, confidence intervals tell us about how large a role chance may play, but they reveal nothing about bias. Interpretations of point estimates should consider possible bias. Last, confidence intervals are not the only way to estimate precision; Bayesian and likelihood intervals are available.4,31-34

Most Archives articles present effect estimates and confidence intervals, but some still use P values for interpretation. We suggest that articles would benefit by omitting most P values,35 reserving a few only for specialized purposes such as testing for a trend in outcome across several ordered exposure levels or testing the significance of differences in associations across levels of a third factor. An alternative would be to present some P values in tables, but not use them for interpretation or present them in the text. We also suggest that authors consider framing their research aim in their article's introduction in terms of estimating the size of an association rather than in terms of testing for the presence or absence of an association.

Correspondence: Dr Cummings, Department of Epidemiology, University of Washington, 250 Grandview Dr, Bishop, CA 93514 (peterc@u.washington.edu).

Author Contributions: Study concept and design: Cummings and Koepsell. Drafting of the manuscript: Cummings. Critical revision of the manuscript for important intellectual content: Cummings and Koepsell. Statistical analysis: Cummings and Koepsell. Administrative, technical, and material support: Cummings and Koepsell.

Financial Disclosure: None reported.

References

1. International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals. JAMA. 1997;277(11):927-934.
2. International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals: writing and editing for biomedical publications. http://www.icmje.org/. Accessed May 14, 2009.
3. Archives of Pediatrics & Adolescent Medicine. Instructions for Authors. http://archpedi.ama-assn.org/misc/ifora.dtl. Accessed May 22, 2009.
4. Rothman KJ, Greenland S, Lash TL. Precision and statistics in epidemiologic studies. In: Rothman KJ, Greenland S, Lash TL. Modern Epidemiology. 3rd ed. Philadelphia, PA: Lippincott Williams & Wilkins; 2008:148-167.
5. Freiman JA, Chalmers TC, Smith H Jr, Kuebler RR. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: survey of 71 “negative” trials. N Engl J Med. 1978;299(13):690-694.
6. Edwards AWF. Likelihood: Expanded Edition. Baltimore, MD: The Johns Hopkins University Press; 1992:179-180.
7. Altman DG, Bland JM. Absence of evidence is not evidence of absence. BMJ. 1995;311(7003):485.
8. Sterne JA, Davey Smith G. Sifting the evidence-what's wrong with significance tests? BMJ. 2001;322(7280):226-231.
9. Alderson P, Chalmers I. Survey of claims of no effect in abstracts of Cochrane reviews. BMJ. 2003;326(7387):475.
10. Alderson P. Absence of evidence is not evidence of absence. BMJ. 2004;328(7438):476-477.
11. Hauer E. The harm done by tests of significance. Accid Anal Prev. 2004;36(3):495-500.
12. Gigerenzer G. Mindless statistics. J Socio-Economics. 2004;33(5):587-606.
13. Lang JM, Rothman KJ, Cann CI. That confounded P-value [editorial]. Epidemiology. 1998;9(1):7-8.
14. Rothman KJ. A show of confidence. N Engl J Med. 1978;299(24):1362-1363.
15. Rothman KJ. Significance questing. Ann Intern Med. 1986;105(3):445-447.
16. Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. Br Med J (Clin Res Ed). 1986;292(6522):746-750.
17. Savitz DA. Is statistical significance testing useful in interpreting data? Reprod Toxicol. 1993;7(2):95-100.
18. Altman DG, Machin D, Bryant TN, Gardner MJ. Statistics with Confidence. 2nd ed. London, England: BMJ Publishing Group; 2000.
19. Altman D, Bland JM. Confidence intervals illuminate absence of evidence [letter]. BMJ. 2004;328(7446):1016-1017.
20. Greenland S. Interpretation and choice of effect measures in epidemiologic analyses. Am J Epidemiol. 1987;125(5):761-768.
21. Cummings P. The relative merits of risk ratios and odds ratios. Arch Pediatr Adolesc Med. 2009;163(5):438-445.
22. Poole C. Confidence intervals exclude nothing. Am J Public Health. 1987;77(4):492-493.
23. Poole C. Beyond the confidence interval. Am J Public Health. 1987;77(2):195-199.
24. Smith AH, Bates MN. Confidence limit analyses should replace power calculations in the interpretation of epidemiologic studies. Epidemiology. 1992;3(5):449-452.
25. Kraemer HC, Mintz J, Noda A, Tinklenberg J, Yesavage JA. Caution regarding the use of pilot studies to guide power calculations for study proposals. Arch Gen Psychiatry. 2006;63(5):484-489.
26. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994;121(3):200-206.
27. Hoenig JM, Heisey DM. The abuse of power: the pervasive fallacy of power calculations for data analysis. Am Stat. 2001;55(1):19-24.
28. Bacchetti P. Peer review of statistics in medical research: the other problem. BMJ. 2002;324(7348):1271-1273.
29. Bacchetti P. Author's thoughts on power calculations [letter]. BMJ. 2002;325(7362):491.
30. Senn SJ. Power is indeed irrelevant in interpreting completed studies [letter]. BMJ. 2002;325(7375):1304.
31. Goodman SN. P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol. 1993;137(5):485-501.
32. Goodman SN. Toward evidence-based medical statistics. 1: the P value fallacy. Ann Intern Med. 1999;130(12):995-1004.
33. Goodman SN. Toward evidence-based medical statistics. 2: the Bayes factor. Ann Intern Med. 1999;130(12):1005-1013.
34. Royall R. Statistical Evidence: A Likelihood Paradigm. Boca Raton, FL: Chapman & Hall/CRC; 1997.
35. The value of P. Epidemiology. 2001;12(3):286.
