American Journal of Epidemiology Advance Access originally published online on March 28, 2007
American Journal of Epidemiology 2007 165(10):1119-1121; doi:10.1093/aje/kwm072
Invited Commentary: Advancing Propensity Score Methods in Epidemiology
Correspondence to Dr. J. Michael Oakes, Division of Epidemiology and Community Health, Minnesota Population Center, University of Minnesota, Minneapolis, MN 55454 (e-mail: oakes{at}epi.umn.edu).
Received for publication December 13, 2006. Accepted for publication December 15, 2006.
| ABSTRACT |
|---|
Every epidemiologist knows that unmeasured confounding is a serious analytic problem, but practically speaking, there seems to be little one can do about it. In this issue of the Journal, Stürmer et al. (Am J Epidemiol 2007;165:1110–1118) offer a novel solution that combines propensity score matching methods with measurement error regression models. They call this technique "propensity score calibration" (PSC) and assess its strengths and limitations with simulated data. Their analyses demonstrate that PSC greatly improves inference when the critical assumption of surrogacy holds, but when surrogacy does not hold, PSC estimation can exacerbate bias relative to uncorrected propensity score models. The benefits of propensity score methods (and PSC) lie not only with potentially improved effect estimation but with conceptualization and practice as well.
bias (epidemiology); cohort studies; confounding factors (epidemiology); epidemiologic methods; models, statistical; propensity score calibration; research design
Abbreviations: IV, instrumental variable; OR, odds ratio; PSC, propensity score calibration
| INTRODUCTION |
|---|
In this issue of the Journal, Stürmer et al. (1) have contributed yet another important paper on the use of propensity score methods in epidemiology. They use simulation to reveal some of the benefits and pitfalls of adjusting a propensity score model to account for unmeasured covariates/confounders. The paper is noteworthy for its novelty, methodological insight, and clarity. Here we highlight central aspects of the paper and offer some thoughts on the use and advancement of propensity score methods in epidemiology. Our target audience is not expert methodologists but rather those who wonder whether this alternative technique may be helpful in their own research.
| THE PROBLEM |
|---|
Adjustment by stratification, restriction, and especially regression is frequently used to control for confounding by measured variables. But what of the unmeasured confounders (and/or mismeasured confounders that yield residual confounding)? While discussing instrumental-variable (IV) methods, Hernán and Robins (2, p. 360) recently asked and answered this dreaded question with abundant clarity: "Can you guarantee that the results from your observational study are unaffected by unmeasured confounding? The only answer an epidemiologist can provide is no."
Beyond hope and disregard, improved theory, better measurement, sensitivity analysis, bounding, imputation, IV techniques, marginal structural models, and more recently Bayesian methods have been the principal tools for addressing unmeasured confounding. Stürmer et al. (1) have advanced propensity score methods to do the same, and in their paper they show when and why the new technique could be useful.
| THE ADVANCE |
|---|
Though use is rapidly increasing, epidemiologists have been slow to adopt propensity score methods (3, 4). Technical details may be found elsewhere, but the basic idea is to 1) use standard logistic regression to estimate the probability of exposure (i.e., the propensity score) for each subject in your data, 2) construct sets of subjects who have similar propensity score estimates in hopes that the sets (often matched pairs) are exchangeable in every way but for the exposure itself, and 3) estimate the causal effects of exposure by, say, conditional logistic regression contrasting persons actually exposed and those not actually exposed. The key is getting the propensity score model right, since errors yield improper matches and thus biased effect estimates. Effect estimates will be biased if there are omitted or poorly measured predictors of exposure that undermine the exchangeability requirement; the same holds true if the predictors themselves are affected by the exposure. For this reason, economists have long considered propensity score methods useful only for estimates with "observable selection," relying instead on IV estimation for overcoming the far more common problem of "unobservable selection" (5).
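The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in any of the cited papers: the simulated data, the function names, the greedy nearest-neighbor matching rule, and the caliper of 0.05 are all hypothetical choices made here. For 1:1 matched pairs with exposure as the only term in the model, the conditional logistic regression estimate of the odds ratio reduces to the ratio of discordant pairs, which is what the last function computes.

```python
import numpy as np

def expit(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, a, n_iter=25):
    """Newton-Raphson fit of a logistic model for exposure; an intercept is added here."""
    Z = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = expit(Z @ beta)
        grad = Z.T @ (a - p)
        hess = Z.T @ (Z * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def propensity_scores(X, a):
    """Step 1: estimated probability of exposure for each subject."""
    Z = np.column_stack([np.ones(len(X)), X])
    return expit(Z @ fit_logistic(X, a))

def match_pairs(ps, a, caliper=0.05):
    """Step 2: greedy 1:1 nearest-neighbor matching, without replacement."""
    unexposed = list(np.flatnonzero(a == 0))
    pairs = []
    for i in np.flatnonzero(a == 1):
        if not unexposed:
            break
        dist = np.abs(ps[unexposed] - ps[i])
        j = int(np.argmin(dist))
        if dist[j] <= caliper:  # exposed subjects with no close match stay unmatched
            pairs.append((int(i), int(unexposed.pop(j))))
    return pairs

def matched_pair_or(pairs, y):
    """Step 3: matched-pair odds ratio, the ratio of the two kinds of discordant pairs."""
    exposed_only = sum(1 for i, j in pairs if y[i] == 1 and y[j] == 0)
    unexposed_only = sum(1 for i, j in pairs if y[i] == 0 and y[j] == 1)
    return exposed_only / unexposed_only if unexposed_only else float("nan")

# Hypothetical simulated cohort: exposure has no effect on disease, so the true OR is 1.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
a = rng.binomial(1, expit(-0.5 + 0.8 * X[:, 0] - 0.5 * X[:, 1]))
y = rng.binomial(1, expit(-1.0 + 0.7 * X[:, 0]))

ps = propensity_scores(X, a)
pairs = match_pairs(ps, a)
print(len(pairs), "matched pairs; matched-pair OR =", matched_pair_or(pairs, y))
```

If a predictor of exposure that also affects disease were left out of `X`, the matched sets would no longer be exchangeable, which is the failure the paragraph above describes.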
Stürmer et al. have shown the extent to which this limitation of propensity score methods can be overcome by combining standard measurement error models and subsample validation techniques. The basic idea is to treat the estimated propensity scores as a mismeasured variable, mismeasured because of the omission of important predictors of exposure. Adjustment of the error-prone propensity score is done by exploiting better information in a validation subsample and, in Stürmer et al.'s paper (1), the application of regression-calibration techniques to adjust the propensity score estimates in the full set of data. The corrected, or calibrated, propensity score is then used for analyses of outcome measures as usual. Standard errors of effect estimates are typically calculated through resampling techniques (e.g., bootstrapping), but other approaches may be considered (6).
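A rough sketch of the calibration idea follows. It assumes one simple form of regression calibration: a linear model, fitted in the validation subsample only, that predicts the better-informed ("gold-standard") propensity score from exposure and the error-prone score, and is then applied to everyone in the main study. The exact model used by Stürmer et al. may differ, and the data, coefficients, and function names below are hypothetical.

```python
import numpy as np

def fit_calibration(ps_ep_val, ps_gs_val, a_val):
    """Least-squares fit, in the validation subsample only, of
    gold-standard PS ~ intercept + exposure + error-prone PS."""
    design = np.column_stack([np.ones(len(ps_ep_val)), a_val, ps_ep_val])
    gamma, *_ = np.linalg.lstsq(design, ps_gs_val, rcond=None)
    return gamma

def calibrate(ps_ep_main, a_main, gamma):
    """Predict a calibrated PS for every main-study subject, kept inside [0, 1]."""
    design = np.column_stack([np.ones(len(ps_ep_main)), a_main, ps_ep_main])
    return np.clip(design @ gamma, 0.0, 1.0)

# Hypothetical main study with an error-prone PS for everyone.
rng = np.random.default_rng(1)
n_main, n_val = 1000, 100
a_main = rng.integers(0, 2, size=n_main)
ps_ep_main = rng.uniform(0.1, 0.9, size=n_main)

# Hypothetical validation subsample (10 percent of the main study) in which the
# gold-standard PS is also available; it is simulated here with a little noise.
val_idx = rng.choice(n_main, size=n_val, replace=False)
ps_gs_val = np.clip(
    0.1 + 0.05 * a_main[val_idx] + 0.8 * ps_ep_main[val_idx]
    + rng.normal(0.0, 0.03, size=n_val),
    0.01, 0.99,
)

gamma = fit_calibration(ps_ep_main[val_idx], ps_gs_val, a_main[val_idx])
ps_cal = calibrate(ps_ep_main, a_main, gamma)
print("calibration coefficients (intercept, exposure, error-prone PS):", gamma)
```

The calibrated score `ps_cal` would then replace the error-prone score in the outcome analysis.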
In their paper (1), Stürmer et al. assess the performance of the propensity score calibration (PSC) technique by simulation. They begin with reasonable values for some arbitrary disease-exposure-confounder relations and then alter the parameters to assess the impact. Importantly, they relax the key assumption of surrogacy (discussed below).
Their table 2 presents results obtained when surrogacy holds. For point estimate evaluation, one should appreciate that the true effect (the odds ratio (OR) for the exposure-disease association AY) is OR_AY = 1, and then compare the column for median OR_AY adjusted for X1, X2 (the first value is 0.65) to the PSC column "OR_AY median" (the first value is 1.08). The former column presents the estimate obtained from the error-prone propensity score analysis; the latter column presents the estimate obtained from the calibrated propensity score analysis when surrogacy holds. The simulations demonstrate that, given assumptions, PSC greatly improves inference when surrogacy holds. As expected, inferential errors are more likely when disease incidence is not rare and validation samples are proportionately small. There is one anomaly: The precision of estimates is low when the main cohort size is 1,000, but this may result from the very small size of the validation sample. Overall, one must conclude that PSC is a substantial improvement over uncalibrated propensity score methods that do not account for unmeasured confounding.
Stürmer et al.'s table 3 presents results obtained when surrogacy may not hold. Results show that most inferential problems occur in this situation, which may be observed when the surrogacy percentage is less than 50. As Stürmer et al. noted (1), other cases exist, but all are easily explained, save for the same anomalous N_main = 1,000 row. Most importantly, when surrogacy does not hold, the PSC estimate can exacerbate bias relative to the uncorrected propensity score model.
What is surrogacy and how might it be assessed? Stürmer et al. and Carroll et al. (7, p. 36) offer technical definitions and useful examples. Surrogacy is the state of conditional independence: f(y | X, W) = f(y | X). In what may be more familiar terms, given a traditional linear model E(Y) = Xb, a covariate W is a surrogate for covariate X when W is not correlated with Y after adjustment for X. That is, when it comes to Y, there is no information in W beyond that contained in X. Surrogacy implies nondifferential measurement error between an outcome (which in this case is the probability of exposure) and covariates. It is a vital assumption. Such a state will be found in idealized randomized trials but may be difficult to assess in observational studies (8). In Stürmer et al.'s simulations (1), surrogacy is violated when the direction of confounding in the measured covariates differs from the direction of confounding in the unmeasured covariates. This is because such directional difference tends to introduce differential measurement error, a condition where errors are correlated with the estimated propensity score. Surrogacy would seem to be routinely violated, and effect estimates exaggerated, when estimating contextual effects through aggregate proxies of microlevel data, as in the recent stream of papers on neighborhood effects (9). Indeed, surrogacy is at risk whenever survey responses are employed (10). It is noteworthy that surrogacy appears to be related to the "exclusion restrictions" found in IV estimation.
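Written out, the surrogacy condition and its linear-model reading look as follows. The coefficient symbol g for W is introduced here purely for illustration and is not notation from the cited sources.

```latex
\[
  f(y \mid X, W) = f(y \mid X)
\]
\[
  E(Y \mid X, W) = Xb + Wg \quad \text{with} \quad g = 0
\]
```

The second line restates the first for the traditional linear model: once X is in the model, W contributes nothing further to the prediction of Y.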
| CONCLUSION |
|---|
PSC may be a helpful technique for addressing unmeasured confounding, especially when IV techniques are not feasible or desired. Yet we trust that Stürmer et al. agree that it is no panacea, and more work is needed before researchers should routinely adopt the technique. Setting aside the need for direct comparisons with other techniques such as IV estimation and multiple bias modeling (11, 12), at least five questions about PSC merit immediate attention.
First, why the anomalous results when the main cohort size is 1,000? What sort of "specification error" might cause this? We presume that better results would be obtained if the proportionate size of the validation study exceeded 100 subjects (a default level of 10 percent of the main study size).
Second, what is the best measure or measures with which to assess surrogacy? We agree with Stürmer et al. that the pseudo-R² value may be best, but given the critical nature of the assumption, confirmatory work, especially with respect to meaningful cutpoints, is a must. It seems that the methods employed in assessing identification in IV techniques might also be useful here.
Third, how closely must validation subsamples represent main study subjects in order for benefits to obtain? There is a great deal of room between perfectly representative subsamples (as employed here) and disparate useless ones. Practicing epidemiologists must know the degree of exchangeability (a stronger term than the "transportability" used by some) required in subsamples before they can be confident in the PSC estimates. Some recent work in economics (13) may turn out to be a useful addition to the epidemiologic literature.
Fourth and relatedly, we have some concern that the reported coverage probability of the bootstrap confidence interval obtained using PSC may not be accurate, especially in situations where the true odds ratio is not equal to 1. This possibly stems from the fact that these intervals appear to have been constructed by bootstrapping only the full study estimation procedure instead of including the validation study estimation inside the bootstrap loop. The concern is not great, since the reported coverage probabilities are not wildly inaccurate, and they certainly exceed those of the conventional estimator using only the full study data. Nevertheless, attention seems appropriate.
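The distinction raised in this fourth question can be illustrated with a generic percentile bootstrap that resamples the validation subjects inside the same loop as the main-study subjects. This is only a sketch: `bootstrap_ci`, the toy estimator, and the simulated arrays are hypothetical and stand in for a full PSC analysis.

```python
import numpy as np

def bootstrap_ci(estimator, main, validation, n_boot=500, alpha=0.05, seed=0):
    """Percentile interval that resamples BOTH data sources in every replicate.

    `estimator(main, validation)` must return a single number and must redo the
    validation-based calibration step each time it is called.
    """
    rng = np.random.default_rng(seed)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        main_idx = rng.integers(0, len(main), size=len(main))
        val_idx = rng.integers(0, len(validation), size=len(validation))
        estimates[b] = estimator(main[main_idx], validation[val_idx])
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return lower, upper

# Toy stand-in for a calibrated estimate: a main-study mean shifted by a
# correction term that is itself estimated from the validation data.
def toy_estimator(main, validation):
    return main.mean() + validation.mean()

rng = np.random.default_rng(2)
main = rng.normal(0.0, 1.0, size=1000)
validation = rng.normal(0.5, 1.0, size=100)

lower, upper = bootstrap_ci(toy_estimator, main, validation)
print("interval with validation resampled inside the loop:", (lower, upper))
```

Holding the validation-based correction fixed across replicates would ignore its sampling variability, so an interval built that way would tend to be too narrow.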
Finally, how does PSC perform when the overlap between exposed and unexposed subjects is incomplete and/or imperfect? Consideration of unmatched subjects is essential if we are to minimize off-support inferences, which is akin to comparing apples to oranges (14–16). Does PSC exacerbate or mitigate the problem, which seems especially common in areas such as social epidemiology?
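One simple way to make the overlap question concrete is a minimum-maximum common-support check on the estimated propensity scores. The rule and the function name `common_support` are assumptions chosen for illustration; other overlap diagnostics exist, and nothing here is specific to PSC.

```python
import numpy as np

def common_support(ps, a):
    """Return the range of propensity scores observed in BOTH exposure groups,
    plus a boolean flag marking the subjects who fall inside that range."""
    lower = max(ps[a == 1].min(), ps[a == 0].min())
    upper = min(ps[a == 1].max(), ps[a == 0].max())
    on_support = (ps >= lower) & (ps <= upper)
    return lower, upper, on_support

# Hypothetical scores: four unexposed subjects followed by four exposed subjects.
ps = np.array([0.05, 0.20, 0.40, 0.60, 0.30, 0.50, 0.80, 0.95])
a = np.array([0, 0, 0, 0, 1, 1, 1, 1])

lower, upper, on_support = common_support(ps, a)
print("common support:", (lower, upper))
print("subjects off support:", np.flatnonzero(~on_support))
```

Subjects flagged as off support have no counterpart in the other exposure group, so any effect estimate for them rests on model extrapolation rather than on data.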
Another issue, more logical in nature, concerns the design of the validation subsample. For PSC to work, confounders not measured in the main study need to be measured with precision in the subsample. But how are we to know what to measure? The case of variables that are merely too expensive to measure in a main study is straightforward, but better etiologic understanding is needed if we are to know what we do not know.
In summary, while certainly no panacea, the benefits of propensity score methods (and PSC) lie not only with potentially improved effect estimation but with conceptualization and practice as well. First, setting aside outcome measures and predicting exposure is vastly superior to naïve modeling of outcomes directly, since the latter invites regression screening and inflated type I errors (17, 18). Second, propensity score matching methods follow naturally from counterfactual reasoning (19). Getting from counterfactuals to regression estimates is far more complicated and thus less transparent. Propensity score methods tend to remind analysts that models, especially less transparent regression models, substitute assumptions for data. Recall that if we had excellent data (on counterfactuals), simple cross-tabulations would suffice. Third, propensity score matching methods permit the use of diagnostics for assessing inferential support. Although high-dimensional matching through propensity scores requires strong assumptions, the ability to exclude aberrant (nonmatched) cases is virtually impossible in conventional regression approaches, at least as typically practiced. Finally, the fact that the method requires investigators to consider the experimental analog to their observational study may be benefit enough. By requiring them to appreciate that exchangeability is obtained with both idealized experiments and idealized propensity scores, the method makes the link between experiments and observational designs abundantly clear: namely, ignorable treatment/exposure assignment mechanisms.
| ACKNOWLEDGMENTS |
|---|
Conflict of interest: none declared.
| References |
|---|
1. Stürmer T, Schneeweiss S, Avorn J, et al. Performance of propensity score calibration: a simulation study. Am J Epidemiol (2007) 165:1110–18.
2. Hernán MA, Robins JM. Instruments for causal inference: an epidemiologist's dream? Epidemiology (2006) 17:360–72.
3. Joffe MM, Rosenbaum PR. Invited commentary: propensity scores. Am J Epidemiol (1999) 150:327–33.
4. Stürmer T, Schneeweiss S, Avorn J, et al. Adjusting effect estimates for unmeasured confounding with validation data using propensity score calibration. Am J Epidemiol (2005) 162:279–89.
5. Murray MP. Avoiding invalid instruments and coping with weak instruments. J Econ Perspect (2006) 20:111–32.
6. Hill J, Reiter JP. Interval estimation for treatment effects using propensity score matching. Stat Med (2006) 25:2230–56.
7. Carroll RJ, Ruppert D, Stefanski LA, et al. Measurement error in nonlinear models: a modern perspective. (2006) New York, NY: Chapman and Hall/CRC Press.
8. Greenland S, Gustafson P. Accounting for independent nondifferential misclassification does not increase certainty that an observed association is in the correct direction. Am J Epidemiol (2006) 164:63–8.
9. Oakes JM. The (mis)estimation of neighborhood effects: causal inference for a practicable social epidemiology. Soc Sci Med (2004) 58:1929–52.
10. Bound J, Brown C, Mathiowetz N. Measurement error in survey data. In: Handbook of econometrics. Heckman JJ, Leamer E, eds. (2001) 5. New York, NY: Elsevier Publishing Company. 3706–843.
11. Greenland S. Multiple-bias modelling for analysis of observational data. J R Stat Soc Ser A (2005) 168:267–306.
12. Steenland K, Greenland S. Monte Carlo sensitivity analysis and Bayesian analysis of smoking as an unmeasured confounder in a study of silica and lung cancer. Am J Epidemiol (2004) 160:384–92.
13. Chen X, Hong H, Tamer E. Measurement error models with auxiliary data. Rev Econ Stud (2005) 72:343–66.
14. King G, Zeng L. When can history be our guide? The pitfalls of counterfactual inference. Int Stud Q (2007) 51:183–210. (http://gking.harvard.edu/files/counterf.pdf).
15. Oakes JM. Commentary: advancing neighbourhood-effects research: selection, inferential support, and structural confounding. Int J Epidemiol (2006) 35:643–7.
16. Vandenbroucke JP. Should we abandon statistical modeling altogether? Am J Epidemiol (1987) 126:10–13.
17. Austin PC, Mamdani MM, Juurlink DN, et al. Testing multiple statistical hypotheses resulted in spurious associations: a study of astrological signs and health. J Clin Epidemiol (2006) 59:964–9.
18. Freedman DA. A note on screening regression equations. Am Stat (1983) 37:152–5.
19. Oakes JM, Johnson PJ. Propensity score matching methods for social epidemiology. In: Methods in social epidemiology. Oakes JM, Kaufman JS, eds. (2006) San Francisco, CA: Jossey-Bass Publishers. 370–92.