American Journal of Epidemiology Advance Access originally published online on March 28, 2007
American Journal of Epidemiology 2007 165(10):1119-1121; doi:10.1093/aje/kwm072

American Journal of Epidemiology Copyright © 2007 by the Johns Hopkins Bloomberg School of Public Health All rights reserved; printed in U.S.A.

Invited Commentary

Invited Commentary: Advancing Propensity Score Methods in Epidemiology

J. Michael Oakes and Timothy R. Church

Correspondence to Dr. J. Michael Oakes, Division of Epidemiology and Community Health, Minnesota Population Center, University of Minnesota, Minneapolis, MN 55454 (e-mail: oakes@epi.umn.edu).

Received for publication December 13, 2006. Accepted for publication December 15, 2006.


    ABSTRACT
 
Every epidemiologist knows that unmeasured confounding is a serious analytic problem, but practically speaking, there seems to be little one can do about it. In this issue of the Journal, Stürmer et al. (Am J Epidemiol 2007;165:1110–18) offer a novel solution that combines propensity score matching methods with measurement error regression models. They call this technique "propensity score calibration" (PSC) and assess its strengths and limitations with simulated data. Their analyses demonstrate that PSC greatly improves inference when the critical assumption of surrogacy holds, but when surrogacy does not hold, PSC estimation can exacerbate bias relative to uncorrected propensity score models. The benefits of propensity score methods (and PSC) lie not only with potentially improved effect estimation but with conceptualization and practice as well.

bias (epidemiology); cohort studies; confounding factors (epidemiology); epidemiologic methods; models, statistical; propensity score calibration; research design


Abbreviations: IV, instrumental variable; OR, odds ratio; PSC, propensity score calibration


    INTRODUCTION
 
In this issue of the Journal, Stürmer et al. (1) have contributed yet another important paper on the use of propensity score methods in epidemiology. They use simulation to reveal some of the benefits and pitfalls of adjusting a propensity score model to account for unmeasured covariates/confounders. The paper is noteworthy for its novelty, methodological insight, and clarity. Here we highlight central aspects of the paper and offer some thoughts on the use and advancement of propensity score methods in epidemiology. Our target audience is not expert methodologists but rather those who wonder whether this alternative technique may be helpful in their own research.


    THE PROBLEM
 
Adjustment by stratification, restriction, and especially regression is frequently used to control for confounding by measured variables. But what of the unmeasured confounders (and/or mismeasured confounders that yield residual confounding)? While discussing instrumental-variable (IV) methods, Hernán and Robins (2, p. 360) recently asked and answered this dreaded question with abundant clarity: "Can you guarantee that the results from your observational study are unaffected by unmeasured confounding? The only answer an epidemiologist can provide is ‘no’."

Beyond hope and disregard, improved theory, better measurement, sensitivity analysis, bounding, imputation, IV techniques, marginal structural models, and more recently Bayesian methods have been the principal tools for addressing unmeasured confounding. Stürmer et al. (1) have advanced propensity score methods to do the same, and in their paper they show when and why the new technique could be useful.


    THE ADVANCE
 
Though use is rapidly increasing, epidemiologists have been slow to adopt propensity score methods (3, 4). Technical details may be found elsewhere, but the basic idea is to 1) use standard logistic regression to estimate the probability of exposure (i.e., the propensity score) for each subject in your data, 2) construct sets of subjects who have similar propensity score estimates in hopes that the sets (often matched pairs) are exchangeable in every way but for the exposure itself, and 3) estimate the causal effects of exposure by, say, conditional logistic regression contrasting persons actually exposed and those not actually exposed. The key is getting the propensity score model right, since errors yield improper matches and thus biased effect estimates. Effect estimates will be biased if there are omitted or poorly measured predictors of exposure that undermine the exchangeability requirement; the same holds true if the predictors themselves are affected by the exposure. For this reason, economists have long considered propensity score methods useful only for estimates with "observable selection," relying instead on IV estimation for overcoming the far more common problem of "unobservable selection" (5).
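The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used by Stürmer et al.: the data are simulated with a single measured confounder and a true odds ratio of 1, the helper names (`fit_logistic`, `match_pairs`, `matched_pair_or`) and the caliper of 0.05 are hypothetical choices, and the logistic model is fitted by plain gradient ascent rather than a statistics package. For 1:1 matched pairs with exposure as the only term, the conditional logistic regression estimate reduces to the ratio of discordant pairs, which is what step 3 computes.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, a, steps=300, lr=1.0):
    """Step 1: fit P(A=1 | x) = sigmoid(b0 + b1*x) by gradient ascent."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, ai in zip(x, a):
            r = ai - sigmoid(b0 + b1 * xi)  # score residual
            g0 += r
            g1 += r * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def match_pairs(scores, a, caliper=0.05):
    """Step 2: greedy 1:1 nearest-neighbor matching without replacement."""
    exposed = [i for i, ai in enumerate(a) if ai == 1]
    unexposed = [j for j, aj in enumerate(a) if aj == 0]
    used = set()
    pairs = []
    for i in exposed:
        best = None
        for j in unexposed:
            if j in used:
                continue
            d = abs(scores[i] - scores[j])
            if d <= caliper and (best is None or d < best[0]):
                best = (d, j)
        if best is not None:  # exposed subjects with no match are dropped
            used.add(best[1])
            pairs.append((i, best[1]))
    return pairs

def matched_pair_or(pairs, y):
    """Step 3: matched-pair odds ratio = ratio of discordant pairs."""
    exposed_only = sum(1 for i, j in pairs if y[i] == 1 and y[j] == 0)
    unexposed_only = sum(1 for i, j in pairs if y[i] == 0 and y[j] == 1)
    return exposed_only / unexposed_only if unexposed_only else float("inf")

# Simulated cohort (assumption): x confounds exposure a and outcome y,
# and exposure has no effect on the outcome (true OR = 1).
rng = random.Random(0)
n = 1000
x = [rng.gauss(0, 1) for _ in range(n)]
a = [1 if rng.random() < sigmoid(xi) else 0 for xi in x]
y = [1 if rng.random() < sigmoid(-1 + xi) else 0 for xi in x]

b0, b1 = fit_logistic(x, a)
scores = [sigmoid(b0 + b1 * xi) for xi in x]
pairs = match_pairs(scores, a)
or_matched = matched_pair_or(pairs, y)
print(len(pairs), "matched pairs; matched-pair OR =", round(or_matched, 2))
```

If the confounder `x` were left out of step 1, the matched sets would no longer be exchangeable and the estimate would drift away from 1, which is the failure the text describes.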

Stürmer et al. have shown the extent to which this limitation of propensity score methods can be overcome by combining standard measurement error models and subsample validation techniques. The basic idea is to treat the estimated propensity scores as a mismeasured variable—mismeasured because of the omission of important predictors of exposure. Adjustment of the error-prone propensity score is done by exploiting better information in a validation subsample and, in Stürmer et al.'s paper (1), the application of regression-calibration techniques to adjust the propensity score estimates in the full set of data. The corrected, or calibrated, propensity score is then used for analyses of outcome measures as usual. Standard errors of effect estimates are typically calculated through resampling techniques (e.g., bootstrapping), but other approaches may be considered (6).
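The calibration step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Stürmer et al.: it assumes a linear calibration model in which the better ("gold-standard") propensity score is regressed on exposure and the error-prone score in the validation subsample, and the fitted model is then used to predict a calibrated score for every main-study subject. The function names, the toy records, and the coefficients 0.10, 0.05, and 0.80 are all hypothetical.

```python
def solve_linear(m, v):
    """Solve m * beta = v by Gauss-Jordan elimination with partial pivoting."""
    k = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [p - f * q for p, q in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]

def fit_calibration(validation):
    """Least squares fit of ps_gold on (1, exposure, ps_error_prone)."""
    rows = [(1.0, a, ps_ep) for a, ps_ep, _ in validation]
    targets = [ps_gs for _, _, ps_gs in validation]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(3)]
    return solve_linear(xtx, xty)

def calibrate(main, gamma):
    """Predict the calibrated score for each main-study subject."""
    g0, g1, g2 = gamma
    return [g0 + g1 * a + g2 * ps_ep for a, ps_ep in main]

# Hypothetical validation records: (exposure, error-prone PS, gold-standard PS).
# The gold-standard score is constructed to follow the linear model exactly.
validation = [(a, ps, 0.10 + 0.05 * a + 0.80 * ps)
              for a, ps in [(0, 0.2), (1, 0.3), (0, 0.5),
                            (1, 0.6), (0, 0.7), (1, 0.9)]]
gamma = fit_calibration(validation)

# Hypothetical main-study records: (exposure, error-prone PS) only.
main = [(0, 0.25), (1, 0.40), (1, 0.80)]
calibrated = calibrate(main, gamma)
print([round(g, 3) for g in gamma], [round(c, 3) for c in calibrated])
```

The calibrated scores would then replace the error-prone ones in the outcome analysis. Any uncertainty in `gamma` carries through to the effect estimate, which is one reason resampling is used for standard errors.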

In their paper (1), Stürmer et al. assess the performance of the propensity score calibration (PSC) technique by simulation. They begin with reasonable values for some arbitrary disease-exposure-confounder relations and then alter the parameters to assess the impact. Importantly, they relax the key assumption of surrogacy (discussed below).

Their table 2 presents results obtained when surrogacy holds. For point estimate evaluation, one should appreciate that the true effect (the odds ratio (OR) for the exposure-disease association AY) is OR_AY = 1, and then compare the column for median OR_AY adjusted for X1, X2 (the first value is 0.65) to the PSC column "OR_AY median" (the first value is 1.08). The former column presents the estimate obtained from the error-prone propensity score analysis; the latter column presents the estimate obtained from the calibrated propensity score analysis when surrogacy holds. The simulations demonstrate that, given assumptions, PSC greatly improves inference when surrogacy holds. As expected, inferential errors are more likely when disease incidence is not rare and validation samples are proportionately small. There is one anomaly: The precision of estimates is low when the main cohort size is 1,000, but this may result from the very small size of the validation sample. Overall, one must conclude that, when surrogacy holds, PSC is a substantial improvement over uncalibrated propensity score methods that do not account for unmeasured confounding.
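To make the comparison concrete, the two first-row values quoted above (0.65 and 1.08) can be set against the true odds ratio of 1. The short sketch below only restates that arithmetic; the dictionary labels are ours, and no other values from table 2 are assumed.

```python
import math

true_or = 1.0
estimates = {"error-prone PS": 0.65, "PSC": 1.08}

# Bias on the log odds ratio scale and as a percentage of the true OR.
log_bias = {k: math.log(v / true_or) for k, v in estimates.items()}
pct_bias = {k: 100.0 * (v - true_or) / true_or for k, v in estimates.items()}

for name in estimates:
    print(name, round(log_bias[name], 3), round(pct_bias[name], 1))
```

On the log scale the error-prone estimate is off by about 0.43 and the calibrated estimate by about 0.08, in opposite directions.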

Stürmer et al.'s table 3 presents results obtained when surrogacy may not hold. Results show that most inferential problems occur in this situation, which may be observed when the surrogacy percentage is less than 50. As Stürmer et al. noted (1), other cases exist, but all are easily explained, save for the same anomalous N_main = 1,000 row. Most importantly, when surrogacy does not hold, the PSC estimate can exacerbate bias relative to the uncorrected propensity score model.

What is surrogacy and how might it be assessed? Stürmer et al. and Carroll et al. (7, p. 36) offer technical definitions and useful examples. Surrogacy is the state of conditional independence: f(y | X, W) = f(y | X). In what may be more familiar terms, given a traditional linear model E(Y) = Xb, a covariate W is a surrogate for covariate X when W is not correlated with Y after adjustment for X. That is, when it comes to Y, there is no information in W beyond that contained in X. Surrogacy implies nondifferential measurement error between an outcome (which in this case is the probability of exposure) and covariates. It is a vital assumption. Such a state will be found in idealized randomized trials but may be difficult to assess in observational studies (8). In Stürmer et al.'s simulations (1), surrogacy is violated when the direction of confounding in the measured covariates differs from the direction of confounding in the unmeasured covariates. This is because such directional difference tends to introduce differential measurement error, a condition where errors are correlated with the estimated propensity score. Surrogacy would seem to be routinely violated, and effect estimates exaggerated, when estimating contextual effects through aggregate proxies of microlevel data, as in the recent stream of papers on neighborhood effects (9). Indeed, surrogacy is at risk whenever survey responses are employed (10). It is noteworthy that surrogacy appears to be related to the "exclusion restrictions" found in IV estimation.
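The linear-model description of surrogacy can be checked numerically. The sketch below is an illustration under stated assumptions rather than a diagnostic proposed by Stürmer et al.: it simulates a covariate X, an outcome Y, and two error-prone versions of X, then computes the partial correlation of each with Y after adjustment for X. The variable names, the noise variances, and the factor 0.8 that makes the second measurement error depend on Y are hypothetical.

```python
import random

def residuals(v, x):
    """Residuals of v after simple linear regression on x (with intercept)."""
    n = len(x)
    mx, mv = sum(x) / n, sum(v) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (vi - mv) for xi, vi in zip(x, v)) / sxx
    return [(vi - mv) - slope * (xi - mx) for xi, vi in zip(x, v)]

def corr(u, v):
    """Pearson correlation."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((p - mu) * (q - mv) for p, q in zip(u, v))
    suu = sum((p - mu) ** 2 for p in u)
    svv = sum((q - mv) ** 2 for q in v)
    return suv / (suu * svv) ** 0.5

def partial_corr(w, y, x):
    """Correlation of w and y after both are adjusted for x."""
    return corr(residuals(w, x), residuals(y, x))

rng = random.Random(1)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
e = [rng.gauss(0, 1) for _ in range(n)]  # outcome noise
u = [rng.gauss(0, 1) for _ in range(n)]  # measurement noise
y = [xi + ei for xi, ei in zip(x, e)]

# Surrogate: error in W is unrelated to Y once X is known.
w_surrogate = [xi + ui for xi, ui in zip(x, u)]
# Not a surrogate: error in W shares a component with Y (differential error).
w_differential = [xi + ui + 0.8 * ei for xi, ui, ei in zip(x, u, e)]

r_surrogate = partial_corr(w_surrogate, y, x)
r_differential = partial_corr(w_differential, y, x)
print(round(r_surrogate, 3), round(r_differential, 3))
```

The first partial correlation should sit near zero and the second well away from it. In a real study X is unobserved in the main data, so a check of this kind is possible only where a validation subsample measures both versions.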


    CONCLUSION
 
PSC may be a helpful technique for addressing unmeasured confounding, especially when IV techniques are not feasible or desired. Yet we trust that Stürmer et al. agree that it is no panacea, and more work is needed before researchers should routinely adopt the technique. Setting aside the need for direct comparisons with other techniques such as IV estimation and multiple bias modeling (11, 12), at least five questions about PSC merit immediate attention.

First, why the anomalous results when the main cohort size is 1,000? What sort of "specification error" might cause this? We presume that better results would be obtained if the proportionate size of the validation study exceeded 100 subjects (a default level of 10 percent of the main study size).

Second, what is the best measure or measures with which to assess surrogacy? We agree with Stürmer et al. that the pseudo-R² value may be best, but given the critical nature of the assumption, confirmatory work, especially with respect to meaningful cutpoints, is a must. It seems that the methods employed in assessing identification in IV techniques might also be useful here.

Third, how closely must validation subsamples represent main study subjects in order for benefits to obtain? There is a great deal of room between perfectly representative subsamples (as employed here) and disparate useless ones. Practicing epidemiologists must know the degree of exchangeability (a stronger term than the "transportability" used by some) required in subsamples before they can be confident in the PSC estimates. Some recent work in economics (13) may turn out to be a useful addition to the epidemiologic literature.

Fourth and relatedly, we have some concern that the reported coverage probability of the bootstrap confidence interval obtained using PSC may not be accurate, especially in situations where the true odds ratio is not equal to 1. This possibly stems from the fact that these intervals appear to have been constructed by bootstrapping only the full study estimation procedure instead of including the validation study estimation inside the bootstrap loop. The concern is not great, since the reported coverage probabilities are not wildly inaccurate, and they certainly exceed those of the conventional estimator using only the full study data. Nevertheless, attention seems appropriate.
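The difference between the two bootstrap schemes can be illustrated with a toy two-stage estimator. The sketch below is not PSC itself: it assumes a simple linear setting in which a naive slope from the main study is divided by a calibration slope from a validation sample, and all names, sample sizes, and the number of replicates are hypothetical. The point is only that refitting the validation stage inside the loop lets its sampling variability reach the interval.

```python
import random

def slope(xs, ys):
    """Least squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sum((x - mx) ** 2 for x in xs)

def corrected_slope(main, validation):
    """Toy two-stage estimator: naive slope divided by calibration slope."""
    naive = slope([w for w, _ in main], [y for _, y in main])
    lam = slope([w for w, _ in validation], [x for _, x in validation])
    return naive / lam

def bootstrap_ci(main, validation, rng, n_boot=200, resample_validation=True):
    """Percentile interval; optionally re-estimate the validation stage."""
    stats = []
    for _ in range(n_boot):
        main_b = rng.choices(main, k=len(main))
        if resample_validation:
            val_b = rng.choices(validation, k=len(validation))
        else:
            val_b = validation  # validation stage held fixed
        stats.append(corrected_slope(main_b, val_b))
    stats.sort()
    return stats[int(0.025 * n_boot)], stats[int(0.975 * n_boot) - 1]

def simulate(n, rng):
    """Return (x, w, y): true covariate, error-prone version, outcome."""
    out = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        out.append((x, x + rng.gauss(0, 1), x + rng.gauss(0, 1)))
    return out

rng = random.Random(2)
main = [(w, y) for _, w, y in simulate(1000, rng)]       # x not observed
validation = [(w, x) for x, w, _ in simulate(100, rng)]  # x observed

ci_full = bootstrap_ci(main, validation, rng, resample_validation=True)
ci_main_only = bootstrap_ci(main, validation, rng, resample_validation=False)
print("both stages resampled:", [round(v, 2) for v in ci_full])
print("main study only:      ", [round(v, 2) for v in ci_main_only])
```

With a validation sample this small relative to the main study, the interval that resamples only the main study is noticeably narrower, which is the direction of inaccuracy the coverage concern anticipates.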

Finally, how does PSC perform when the overlap between exposed and unexposed subjects is incomplete and/or imperfect? Consideration of unmatched subjects is essential if we are to minimize off-support inferences, which are akin to comparing apples to oranges (14–16). Does PSC exacerbate or mitigate the problem, which seems especially common in areas such as social epidemiology?
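One simple way to see which subjects fall off support is to compare the ranges of the estimated propensity scores in the two exposure groups. The sketch below uses the minimum/maximum overlap rule, which is only one of several conventions, and the scores and function names are hypothetical.

```python
def common_support(scores, exposed):
    """Range of scores observed in both the exposed and unexposed groups."""
    exp_scores = [s for s, a in zip(scores, exposed) if a == 1]
    unexp_scores = [s for s, a in zip(scores, exposed) if a == 0]
    low = max(min(exp_scores), min(unexp_scores))
    high = min(max(exp_scores), max(unexp_scores))
    return low, high

def off_support(scores, exposed):
    """Indices of subjects whose score lies outside the common range."""
    low, high = common_support(scores, exposed)
    return [i for i, s in enumerate(scores) if s < low or s > high]

# Hypothetical estimated propensity scores for eight subjects.
scores = [0.05, 0.20, 0.35, 0.50, 0.30, 0.55, 0.80, 0.95]
exposed = [0, 0, 0, 0, 1, 1, 1, 1]

print(common_support(scores, exposed))  # (0.3, 0.5)
print(off_support(scores, exposed))     # [0, 1, 5, 6, 7]
```

Here five of the eight subjects have no counterpart in the other group. How a calibrated score would shift such boundaries is exactly the open question raised above.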

Another issue, more logical in nature, concerns the design of the validation subsample. For PSC to work, confounders not measured in the main study need to be measured with precision in the subsample. But how are we to know what to measure? The case of variables that are merely too expensive to measure in a main study is straightforward, but better etiologic understanding is needed if we are to know what we do not know.

In summary, although propensity score methods (and PSC) are certainly no panacea, their benefits lie not only with potentially improved effect estimation but with conceptualization and practice as well. First, setting aside outcome measures and predicting exposure is vastly superior to naïve modeling of outcomes directly, since the latter invites regression screening and inflated type I errors (17, 18). Second, propensity score matching methods follow naturally from counterfactual reasoning (19). Getting from counterfactuals to regression estimates is far more complicated and thus less transparent. Propensity score methods tend to remind analysts that models, especially less transparent regression models, substitute assumptions for data. Recall that if we had excellent data (on counterfactuals), simple cross-tabulations would suffice. Third, propensity score matching methods permit the use of diagnostics for assessing inferential support. Although high-dimensional matching through propensity scores requires strong assumptions, excluding aberrant (nonmatched) cases is virtually impossible in conventional regression approaches, at least as typically practiced. Finally, the fact that the method requires investigators to consider the experimental analog to their observational study may be benefit enough. By requiring them to appreciate that exchangeability is obtained with both idealized experiments and idealized propensity scores, the method makes the link between experiments and observational designs abundantly clear: namely, ignorable treatment/exposure assignment mechanisms.


    ACKNOWLEDGMENTS
 
Conflict of interest: none declared.


    References
 

  1. Stürmer T, Schneeweiss S, Rothman KJ, et al. Performance of propensity score calibration—a simulation study. Am J Epidemiol (2007) 165:1110–18.
  2. Hernán MA, Robins JM. Instruments for causal inference: an epidemiologist's dream? Epidemiology (2006) 17:360–72.
  3. Joffe MM, Rosenbaum PR. Invited commentary: propensity scores. Am J Epidemiol (1999) 150:327–33.
  4. Stürmer T, Schneeweiss S, Avorn J, et al. Adjusting effect estimates for unmeasured confounding with validation data using propensity score calibration. Am J Epidemiol (2005) 162:279–89.
  5. Murray MP. Avoiding invalid instruments and coping with weak instruments. J Econ Perspect (2006) 20:111–32.
  6. Hill J, Reiter JP. Interval estimation for treatment effects using propensity score matching. Stat Med (2006) 25:2230–56.
  7. Carroll RJ, Ruppert D, Stefanski LA, et al. Measurement error in nonlinear models: a modern perspective. (2006) New York, NY: Chapman and Hall/CRC Press.
  8. Greenland S, Gustafson P. Accounting for independent nondifferential misclassification does not increase certainty that an observed association is in the correct direction. Am J Epidemiol (2006) 164:63–8.
  9. Oakes JM. The (mis)estimation of neighborhood effects: causal inference for a practicable social epidemiology. Soc Sci Med (2004) 58:1929–52.
  10. Bound J, Brown C, Mathiowetz N. Measurement error in survey data. In: Handbook of econometrics—Heckman JJ, Leamer E, eds. (2001) 5. New York, NY: Elsevier Publishing Company. 3706–843.
  11. Greenland S. Multiple-bias modelling for analysis of observational data. J R Stat Soc Ser A (2005) 168:267–306.
  12. Steenland K, Greenland S. Monte Carlo sensitivity analysis and Bayesian analysis of smoking as an unmeasured confounder in a study of silica and lung cancer. Am J Epidemiol (2004) 160:384–92.
  13. Chen X, Hong H, Tamer E. Measurement error models with auxiliary data. Rev Econ Stud (2005) 72:343–66.
  14. King G, Zeng L. When can history be our guide? The pitfalls of counterfactual inference. Int Stud Q (2007) 51:183–210. (http://gking.harvard.edu/files/counterf.pdf).
  15. Oakes JM. Commentary: advancing neighbourhood-effects research—selection, inferential support, and structural confounding. Int J Epidemiol (2006) 35:643–7.
  16. Vandenbroucke JP. Should we abandon statistical modeling altogether? Am J Epidemiol (1987) 126:10–13.
  17. Austin PC, Mamdani MM, Juurlink DN, et al. Testing multiple statistical hypotheses resulted in spurious associations: a study of astrological signs and health. J Clin Epidemiol (2006) 59:964–9.
  18. Freedman DA. A note on screening regression equations. Am Stat (1983) 37:152–5.
  19. Oakes JM, Johnson PJ. Propensity score matching methods for social epidemiology. In: Methods in social epidemiology—Oakes JM, Kaufman JS, eds. (2006) San Francisco, CA: Jossey-Bass Publishers. 370–92.
