American Journal of Epidemiology Advance Access originally published online on June 18, 2007
American Journal of Epidemiology 2007 166(4):465-471; doi:10.1093/aje/kwm107
ORIGINAL CONTRIBUTIONS
A Note on Correlated Errors in Exposure and Outcome in Logistic Regression
From the Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, Oslo, Norway
Correspondence to Dr. Magne Thoresen, Department of Biostatistics, Institute of Basic Medical Sciences, University of Oslo, P.O. Box 1122 Blindern, N-0317 Oslo, Norway (e-mail: magne.thoresen{at}medsin.uio.no).
Received for publication October 27, 2005. Accepted for publication February 28, 2007.
| ABSTRACT |
|---|
In cross-sectional studies or studies based on questionnaires, errors in exposures and misclassification of health status may be related. The reason may be that some subjects tend to over- or underreport both exposure and disease. The author investigated the effects of such dependent misclassification from a threshold-model point of view, in that an assumption was made of an underlying linear relation between a continuous exposure and response, both measured with error, and where these errors are correlated. Allowance is also made for covariates measured without error. This approach enables the derivation of explicit expressions for bias in the estimated association between exposure and outcome in different situations. It is shown that, dependent on the true effect of the exposure, the effect of the errors can be both an over- and an underestimation of the true relation. In addition, a study design from which the true effect can be consistently estimated is also provided.
correlated errors; logistic regression; measurement error; misclassification; threshold model
Abbreviations: HSCL-10, 10-item Hopkins Symptoms Checklist
| INTRODUCTION |
|---|
Measurement errors are inevitable in epidemiologic studies and are generally recognized as a common source of bias. Measurement errors are often divided into two types: differential and nondifferential errors. Differential measurement errors occur when the degree of error in the exposure variable depends on the response variable or the other way around. Nondifferential errors, on the other hand, occur when the error in the exposure variable is independent of the response variable and when the error in the response is independent of the exposure. Traditionally, nondifferential errors have been considered less serious, as the general understanding has been that the effect is most often bias in the direction of the null, while the effects of differential errors are more difficult to foretell.
A related problem is that of dependent, or correlated, errors. To make the distinction between this problem and the more familiar scenarios of differential and nondifferential errors, we note that dependent errors occur when there are measurement errors in both the exposure and the response variable, and these errors are correlated. That is, the error in the exposure depends on the error in the response. The error could still be either differential or nondifferential. Assume that x is the true exposure, y is the true response, and X is the surrogate exposure and Y the surrogate response. Formally, when there are nondifferential dependent errors, Pr(X|x,y,Y – y) = Pr(X|x,Y – y). However, Pr(X|x,Y) ≠ Pr(X|x). That is, the observed outcome does contain information about the observed exposure. When Pr(X|x,y,Y – y) ≠ Pr(X|x,Y – y), we may refer to this as differential dependent errors. Both of these cases of dependent errors behave similarly to the more familiar differential error case, where Pr(X|x,y) ≠ Pr(X|x), as bias in the estimated exposure-response association can be away from the null.
Dependent errors are typically a problem in cross-sectional studies and in studies based on questionnaires. An example may be the following. In a paper in the British Medical Journal, a study investigating the association between self-perceived psychological stress and cardiovascular disease was reported (1). The study showed an association between self-perceived stress and incident angina based on a questionnaire, through an odds ratio of 2.66 (95 percent confidence interval: 1.61, 4.41), while there was no association between self-perceived stress and incident ischemia as measured by electrocardiogram: odds ratio = 0.67 (95 percent confidence interval: 0.36, 1.26). Although there may be several reasons for this finding, one possible explanation is obviously dependent misclassification.
The problem is widely acknowledged in behavioral sciences (2), and it also comes up from time to time in epidemiologic literature (3). It has typically been a subject of concern within fields of epidemiology that rely on the use of validation studies, such as nutritional epidemiology, where a series of papers have explored the effects of different types of correlated errors and possible ways to deal with the problem (4–9). The effects have also been studied in a pure categorical setting (categorical response and exposure) (10–12). Lately, the effects on estimates of attributable risk have also been studied (13).
In epidemiology, the situation will often be that one has a response variable that is binary (diseased/not diseased) because of a categorization of a continuous or nearly continuous variable. Examples of such variables may be the categorization of patients having some sort of psychiatric disorder or not, based on some screening instrument/symptom score, or the diagnosis of diabetic patients based on their level of fasting blood sugar. If this categorization is based upon some subjective measurements and is then related to another subjectively measured exposure variable, we will often have a logistic regression situation with dependent errors. In this report, we will investigate the effects of such errors from a threshold-model point of view.
| MODEL |
|---|
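Assume that we have a linear relation between an exposure x and a response y, satisfying the linear regression equation,

$$y_i = \alpha + \beta x_i + \varepsilon_i.$$

Further, assume that we categorize y according to some prespecified threshold value c, such that

$$\tilde{y}_i = \begin{cases} 1 & \text{if } y_i > c, \\ 0 & \text{otherwise.} \end{cases} \qquad (1)$$

Then $\Pr(\tilde{y}_i = 1 \mid x_i) = F(a + b x_i)$, where F denotes the standard normal distribution function in cases where $\varepsilon_i/\sigma$ is standard normally distributed and where F denotes the logistic distribution function in cases where $\varepsilon_i/\sigma$ is distributed according to the logistic distribution. Defining $\sigma$ as the scale of $\varepsilon_i$, we have $a = (\alpha - c)/\sigma$ and $b = \beta/\sigma$. In the following, we assume that $\varepsilon_i$ is normally distributed with variance $\sigma^2$, leading to the probit regression model. That is, $\Pr(\tilde{y}_i = 1 \mid x_i) = \Phi(a + b x_i)$, where $\Phi$ denotes the standard normal distribution function. It is, however, an easy task to approximate the logistic model from this one. We should mention that, although our main focus is on the coefficients in the binary regression model, the qualitative properties are most often seen already when studying the coefficients in the linear regression model. This means that we will discuss properties of the naive estimators (the estimators that ignore measurement error) in both linear and binary regression models.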
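As a brief sketch of why thresholding the linear model gives a probit model, assume that the regression error $\varepsilon_i$ is normal with mean zero and variance $\sigma^2$, and write $\tilde{y}_i$ for the dichotomized response (the tilde is a labeling choice made for this sketch):

```latex
\begin{aligned}
\Pr(\tilde{y}_i = 1 \mid x_i)
  &= \Pr(y_i > c \mid x_i)
   = \Pr(\varepsilon_i > c - \alpha - \beta x_i) \\
  &= 1 - \Phi\!\left(\frac{c - \alpha - \beta x_i}{\sigma}\right)
   = \Phi\!\left(\frac{\alpha - c}{\sigma} + \frac{\beta}{\sigma}\, x_i\right).
\end{aligned}
```

Under these assumptions the probit slope is $\beta/\sigma$, and the threshold c enters only through the intercept.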
We will now look at the effect of measurement errors in two different situations.
Situation 1
We assume that we have a situation where, at least for some of the subjects in the study, we are not able to observe x and y. Instead, we have to relate to the error-prone observations X and Y, where $X_i = x_i + r_i$ and $Y_i = y_i + s_i$. Here, $r_i$ and $s_i$ denote person-specific bias components of the two variables, respectively. We model r and s as random effects with corresponding variance components and, as long as we are interested in only the estimated effect of the exposure variable (b), we can assume that they both have expected values of zero, as any deviations from this assumption will affect only the constant term a. Further, we assume that r and s have variances $\sigma_r^2$ and $\sigma_s^2$, respectively, that they are independent of x and y, and that they have a nonzero correlation between them, denoted by $\rho$. This means that we have assumed nondifferential errors, as f(X|x,y) = f(X|x) and f(Y|x,y) = f(Y|y), but dependent errors, as f(X,Y|x,y) ≠ f(X|x,y) f(Y|x,y), where f(.) denotes the relevant density functions.
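If we ignore the measurement error in x and y, the linear regression equation with which we in effect work (the naive model) is

$$Y_i = \alpha^* + \beta^* X_i + \varepsilon_i^*.$$

Now, with an assumption about normality of x, y, r, and s, it can be shown (14) that

$$\beta^* = \frac{\beta\sigma_x^2 + \rho\sigma_r\sigma_s}{\sigma_x^2 + \sigma_r^2}, \qquad (2)$$

where $\sigma_x^2$ denotes the variance of x, and that $\varepsilon_i^*$ is normally distributed with expectation zero and variance $\sigma^{*2}$, where

$$\sigma^{*2} = \beta^2\sigma_x^2 + \sigma^2 + \sigma_s^2 - \beta^{*2}(\sigma_x^2 + \sigma_r^2). \qquad (3)$$

Here, $\beta^*$ and $\sigma^{*2}$ are expressions for the expected values of the least-squares estimators based on the naive model.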
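One way to see where limits of this kind come from is to write out the second moments of the observed pair (X, Y). The sketch below assumes, as in this section, that $X = x + r$ and $Y = \alpha + \beta x + \varepsilon + s$, that r and s are independent of x and $\varepsilon$, and that $\operatorname{Cov}(r, s) = \rho\sigma_r\sigma_s$:

```latex
\begin{aligned}
\operatorname{Var}(X) &= \sigma_x^2 + \sigma_r^2, \\
\operatorname{Cov}(X, Y) &= \operatorname{Cov}(x + r,\ \alpha + \beta x + \varepsilon + s)
                          = \beta\sigma_x^2 + \rho\sigma_r\sigma_s, \\
\operatorname{Var}(Y) &= \beta^2\sigma_x^2 + \sigma^2 + \sigma_s^2, \\
\beta^* &= \frac{\operatorname{Cov}(X, Y)}{\operatorname{Var}(X)}, \qquad
\sigma^{*2} = \operatorname{Var}(Y) - \frac{\operatorname{Cov}(X, Y)^2}{\operatorname{Var}(X)}.
\end{aligned}
```

The last line is the usual least-squares slope and residual variance expressed in population moments.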
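So, by defining $\tilde{Y}_i$ from $Y_i$ in the same way as $\tilde{y}_i$ is defined from $y_i$ in equation 1, the naive coefficient in the probit regression of $\tilde{Y}_i$ on $X_i$ is

$$b^* = \frac{\beta^*}{\sigma^*} = \frac{\beta\sigma_x^2 + \rho\sigma_r\sigma_s}{\sqrt{(\sigma_x^2 + \sigma_r^2)\left[(\beta^2\sigma_x^2 + \sigma^2 + \sigma_s^2)(\sigma_x^2 + \sigma_r^2) - (\beta\sigma_x^2 + \rho\sigma_r\sigma_s)^2\right]}}, \qquad (4)$$

and the asymptotic bias is the difference between $b = \beta/\sigma$ and equation 4.

From this expression, it is not intuitively clear what is going on. However, one striking observation is clear. For $\beta = 0$ (and thereby also b = 0), any $\rho \neq 0$ implies bias away from zero.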
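In order to shed some light on equation 4, we follow the analysis by Schaalje and Butts (14) and define

$$\delta = \beta - \rho\,\frac{\sigma_s}{\sigma_r}.$$

We can then reformulate equation 4 into

$$b^* = \frac{\beta - \delta\,\sigma_r^2/(\sigma_x^2 + \sigma_r^2)}{\sqrt{\sigma^2 + (1 - \rho^2)\sigma_s^2 + \delta^2\sigma_x^2\sigma_r^2/(\sigma_x^2 + \sigma_r^2)}}. \qquad (5)$$

From the numerator, it is seen that $\beta$ is overestimated if $\delta$ is negative, that is, if $\rho\sigma_s/\sigma_r > \beta$. In addition, it is easily seen that the denominator of equation 5 can become very close to $\sigma$ in situations with $|\rho|$ close to 1 and $\delta$ close to 0. This means that b* is, in fact, asymptotically unbiased in situations where $|\rho| = 1$ and $\beta = \rho\sigma_s/\sigma_r$.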
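The two ways of writing the naive probit slope can be checked against each other numerically. The sketch below is illustrative only: the function names are hypothetical, and both formulas are written out from the model assumptions of this section (errors independent of x and of the regression error, error correlation rho) rather than copied from the typeset equations.

```python
import math


def naive_slope_from_moments(beta, sigma, sd_x, sd_r, sd_s, rho):
    """Naive probit slope b* = beta*/sigma*, built from Var(X), Cov(X, Y), Var(Y)."""
    var_X = sd_x**2 + sd_r**2
    cov_XY = beta * sd_x**2 + rho * sd_r * sd_s
    var_Y = beta**2 * sd_x**2 + sigma**2 + sd_s**2
    beta_star = cov_XY / var_X
    residual_var = var_Y - cov_XY**2 / var_X
    return beta_star / math.sqrt(residual_var)


def naive_slope_from_delta(beta, sigma, sd_x, sd_r, sd_s, rho):
    """Same quantity, written in terms of delta = beta - rho * sd_s / sd_r."""
    var_X = sd_x**2 + sd_r**2
    delta = beta - rho * sd_s / sd_r
    numerator = beta - delta * sd_r**2 / var_X
    denominator_sq = (
        sigma**2
        + (1.0 - rho**2) * sd_s**2
        + delta**2 * sd_x**2 * sd_r**2 / var_X
    )
    return numerator / math.sqrt(denominator_sq)


if __name__ == "__main__":
    # Arbitrary illustrative parameter values (not taken from the paper).
    args = dict(beta=0.3, sigma=0.9, sd_x=1.0, sd_r=0.4, sd_s=0.5, rho=0.6)
    print(naive_slope_from_moments(**args), naive_slope_from_delta(**args))
```

With perfectly correlated errors and delta equal to zero (for example beta = 0.5, sd_r = 0.4, sd_s = 0.2, rho = 1), both functions return beta / sigma, which matches the unbiasedness condition stated in the text.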
Situation 2
Further, we can obviously extend the model to also include covariates measured without error. It is well known from the situation with independent random errors that the effect of the error in x to some extent depends on the correlation between x and the covariate measured without error, z, and it is also known that the estimated relation between the response and z will be biased because of the error in x. We will now investigate these relations under the assumption of dependent errors.
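Let the true underlying linear model be

$$y_i = \alpha + \beta_x x_i + \beta_z z_i + \varepsilon_i,$$

and, as before, let $X_i = x_i + r_i$ and $Y_i = y_i + s_i$. Assume that $z_i$ is measured without error. If we ignore measurement error, our regression equation becomes

$$Y_i = \alpha^* + \beta_x^* X_i + \beta_z^* z_i + \varepsilon_i^*,$$

where $\beta_x^*$ and $\beta_z^*$ equal

$$\beta_x^* = \frac{\beta_x\sigma_{x|z}^2 + \rho\sigma_r\sigma_s}{\sigma_{x|z}^2 + \sigma_r^2},$$

$$\beta_z^* = \beta_z + \gamma\,\frac{\beta_x\sigma_r^2 - \rho\sigma_r\sigma_s}{\sigma_{x|z}^2 + \sigma_r^2}.$$

Here, $\gamma$ denotes the coefficient of z in the regression of x on z, $\sigma_{x|z}^2$ denotes the residual variance from the regression of x on z, and $\varepsilon_i^*$ is normally distributed with variance $\sigma^{*2}$, where

$$\sigma^{*2} = \sigma_y^2 + \sigma_s^2 - \beta_x^*(\sigma_{xy} + \rho\sigma_r\sigma_s) - \beta_z^*\sigma_{zy}.$$

Here, $\sigma_y^2$ denotes the variance of y, and $\sigma_{xz}$, $\sigma_{xy}$, $\sigma_{zy}$ denote the covariances between (x, z), (x, y), and (z, y), respectively; $\sigma_{xz}$ enters through $\gamma = \sigma_{xz}/\sigma_z^2$, where $\sigma_z^2$ is the variance of z.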
Define $\tilde{y}_i$ as in equation 1. We are again interested in the binary regression of $\tilde{y}_i$ on (x, z). Let the coefficients in this regression be denoted $(b_x, b_z)$. We define $\tilde{Y}_i$ in the same way as above. The naively estimated (and biased) coefficients in the probit regression of $\tilde{Y}_i$ on $X_i$ and $z_i$ are again given by the ratio between $(\beta_x^*, \beta_z^*)$ and the square root of the residual variance from the naive linear model, $\sigma^{*2}$. These are denoted $(b_x^*, b_z^*)$.
Unfortunately, these expressions are not easily interpreted, but some basic features are easily seen. It is again easy to see that, for $\beta_x = 0$, any $\rho \neq 0$ implies bias away from the null in the naive estimate $\beta_x^*$ and thereby also in the naive estimate $b_x^*$. The degree of bias depends on $\sigma_r$, $\sigma_s$, and $\sigma_{x|z}^2$. It is also well known from linear regression with random measurement error that, when there is a high correlation between x and z (i.e., $\sigma_{x|z}^2 \to 0$), the attenuation in $\beta_x^*$ gets extreme. In our situation, this attenuation effect is more limited, as $\beta_x^*$ will approach $\rho\sigma_r\sigma_s/\sigma_r^2$.
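As in situation 1, we can define a new parameter $\delta = \beta_x - \rho\sigma_s/\sigma_r$. Then $\beta_x^*$ becomes

$$\beta_x^* = \beta_x - \delta\,\frac{\sigma_r^2}{\sigma_{x|z}^2 + \sigma_r^2}.$$

Again, it can be seen that $\beta_x$ is overestimated if $\delta$ is negative, that is, if $\rho\sigma_s/\sigma_r > \beta_x$, and $\beta_x^*$ is consistent when $\beta_x = \rho\sigma_s/\sigma_r$.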
The most striking observation is probably that, even in the situation where there is no effect of the exposure x (i.e., $b_x = 0$), any $\rho \neq 0$ implies bias in $b_z^*$ as long as x and z are correlated. If x and z are positively correlated, we will in this situation have an underestimation of the effect of z, given a positive correlation between the errors. This is the opposite effect of what we experience with independent random errors, where the effect of z will be overestimated. For $\beta_x \neq 0$, $\beta_z^*$ is also asymptotically unbiased when $\beta_x = \rho\sigma_s/\sigma_r$, that is, when $\beta_x$ equals the coefficient in the regression of s on r, and $\beta_z$ is underestimated if $\rho\sigma_s/\sigma_r > \beta_x$.
Further, it can be shown that $b_x^*$ and $b_z^*$ are both consistent when, in addition to $\beta_x = \rho\sigma_s/\sigma_r$, $|\rho| = 1$. This parallels the finding from situation 1.
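The behavior of the naive coefficients in the two-covariate case can be explored numerically. The following sketch is an illustration under the assumptions of this section (z measured without error and independent of the error components r and s); the function names are hypothetical. It computes the naive limits twice, once from a closed form in terms of the residual variance of x given z, and once by solving the 2 × 2 normal equations directly.

```python
import math


def naive_limits_closed_form(beta_x, beta_z, var_x, var_z, cov_xz, sd_r, sd_s, rho):
    """Limits of the naive coefficients of X and z, via the regression of x on z."""
    gamma = cov_xz / var_z                      # slope of x on z
    resid_var_x = var_x - cov_xz**2 / var_z     # residual variance of x given z
    cov_rs = rho * sd_r * sd_s
    denom = resid_var_x + sd_r**2
    bx_star = (beta_x * resid_var_x + cov_rs) / denom
    bz_star = beta_z + gamma * (beta_x * sd_r**2 - cov_rs) / denom
    return bx_star, bz_star


def naive_limits_normal_equations(beta_x, beta_z, var_x, var_z, cov_xz, sd_r, sd_s, rho):
    """Same limits, from Var((X, z))^{-1} Cov((X, z), Y) written out by hand."""
    a11, a12, a22 = var_x + sd_r**2, cov_xz, var_z
    c1 = beta_x * var_x + beta_z * cov_xz + rho * sd_r * sd_s   # Cov(X, Y)
    c2 = beta_x * cov_xz + beta_z * var_z                       # Cov(z, Y)
    det = a11 * a22 - a12**2
    return (a22 * c1 - a12 * c2) / det, (a11 * c2 - a12 * c1) / det


if __name__ == "__main__":
    # Arbitrary illustrative values: no true effect of x, positively correlated errors.
    params = dict(beta_x=0.0, beta_z=0.5, var_x=1.0, var_z=1.0,
                  cov_xz=0.5, sd_r=0.5, sd_s=0.5, rho=0.5)
    print(naive_limits_closed_form(**params))
```

In this illustrative setting the coefficient of X moves away from zero while the coefficient of z falls below its true value, in line with the qualitative statements above.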
| ILLUSTRATION |
|---|
Although our calculations are based on the probit regression model, we will from an applied point of view obviously be more interested in the effects in the logistic regression model. Fortunately, the logistic and the probit functions are very similar, except for a scale difference. This means that, by multiplying the regression coefficients from the probit regression model by a given factor, one will approximate the coefficients from the corresponding logistic regression model very well. The factor most often used in the measurement error literature is 1.7, which we will also use here (15, p. 64).
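The closeness of the two link functions after rescaling can be checked directly. This is a small sketch using only the Python standard library; the function names are hypothetical and the grid range is an arbitrary choice.

```python
import math
from statistics import NormalDist


def logistic_cdf(t):
    """Standard logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-t))


def scaled_probit_cdf(t, factor=1.7):
    """Standard normal distribution function evaluated at t / factor."""
    return NormalDist().cdf(t / factor)


def max_abs_gap(factor=1.7, lo=-8.0, hi=8.0, steps=3200):
    """Largest absolute difference between the two curves on an even grid."""
    grid = (lo + (hi - lo) * k / steps for k in range(steps + 1))
    return max(abs(logistic_cdf(t) - scaled_probit_cdf(t, factor)) for t in grid)


if __name__ == "__main__":
    print("max gap with factor 1.7:", max_abs_gap(1.7))
    print("max gap without rescaling:", max_abs_gap(1.0))
```

On this grid the rescaled curves differ by roughly 0.01 at most, which is why multiplying a probit coefficient by 1.7 gives a usable approximation to the logistic coefficient.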
To illustrate the effect of correlated errors on the observed effect of x, we compute the approximated observed odds ratio between the first and the fifth quintile in a logistic regression model, for two different values of the true effect. The computation is based on equation 4 and use of the factor 1.7 to go from the probit to the logistic model. Figures 1 and 2 give these results. In both figures, we have assumed that x and y have the same variance = 1. We have also assumed $\sigma_r = \sigma_s$, and we have used values 0.3 and 0.5 for these parameters. This gives us observed X distributions with variances 1.09 and 1.25, respectively. In figure 1, the true effect of x is zero (odds ratio = 1), while in figure 2, the true effect corresponds to an odds ratio = 2.5. In figure 1, we see that with $\sigma_r = \sigma_s = 0.3$, the effect of the correlated errors is modest. For $\rho = 0.4$, the observed odds ratio is 1.15. The corresponding odds ratio for $\sigma_r = \sigma_s = 0.5$ is 1.4. In figure 2, we see that for low correlations, the effect of x is still underestimated. Overestimation occurs when the correlation is above 0.2. With $\sigma_r = \sigma_s = 0.5$, the overestimation becomes substantial for higher correlations.
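The figure 1 values quoted above can be approximately reproduced with a few lines of code. The sketch below rests on two assumptions that are not stated in the text: the quintile contrast is taken as the distance between the 10th and 90th percentiles (the midpoints of the first and fifth quintiles) of the observed X distribution, and the naive slope and residual variance are computed from the second moments of (X, Y) under the model of situation 1. The function name is hypothetical.

```python
import math
from statistics import NormalDist

PROBIT_TO_LOGIT = 1.7  # scale factor used in the text


def observed_quintile_odds_ratio(beta, sigma, sd_x, sd_r, sd_s, rho):
    """Approximate naive odds ratio, fifth versus first quintile of observed X."""
    var_X = sd_x**2 + sd_r**2
    cov_XY = beta * sd_x**2 + rho * sd_r * sd_s
    var_Y = beta**2 * sd_x**2 + sigma**2 + sd_s**2
    beta_star = cov_XY / var_X
    sigma_star = math.sqrt(var_Y - cov_XY**2 / var_X)
    logit_slope = PROBIT_TO_LOGIT * beta_star / sigma_star
    # ASSUMPTION: contrast = gap between 10th and 90th percentiles of X.
    contrast = 2.0 * NormalDist().inv_cdf(0.9) * math.sqrt(var_X)
    return math.exp(logit_slope * contrast)


if __name__ == "__main__":
    # True effect zero, Var(x) = Var(y) = 1, error correlation 0.4.
    for sd_err in (0.3, 0.5):
        odds_ratio = observed_quintile_odds_ratio(0.0, 1.0, 1.0, sd_err, sd_err, 0.4)
        print(sd_err, round(odds_ratio, 2))
```

Under these assumptions the two printed odds ratios come out close to the 1.15 and 1.4 reported for figure 1; a different definition of the quintile contrast would shift them slightly.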
| EXAMPLE |
|---|
The Oslo Health Study is a cross-sectional study conducted in the city of Oslo, Norway, from May 2000 until September 2001. The youth part of the study included all youth aged 15 and 16 years in Oslo. The response rate here was 88.3 percent (7,343 persons). Publications based on these data include those by Lien et al. (16, 17). The participants answered two four-page questionnaires, which included the 10-item Hopkins Symptoms Checklist (HSCL-10), used as a measure of psychological distress. An average score of 1.85 or above was used as an indicator of being mentally distressed. The study also asked about physical activity with the following question: "How many hours per day do you exercise?" The answers were given on a six-point scale ranging from "0 hours" to "11 hours or more." Obviously, both of these measurements may be disturbed by person-specific bias. We will look at the association between physical activity and mental distress.
The distribution of the HSCL-10 score is somewhat skewed, and so the residuals in the regression of HSCL-10 on physical activity are not normally distributed. Neither is the physical activity score truly continuous, but even so, we believe that this example may serve as a useful illustration.
Carrying out the linear regression of HSCL-10 on physical activity, we find an estimated regression coefficient of –0.043, and the residual variance is estimated as 0.24. This gives a calculated regression coefficient of –0.15 in a logistic regression model based on the cutoff value of HSCL-10, using the factor 1.7 to go from the probit to the logistic model. Carrying out the actual logistic regression analysis gives an estimated coefficient of –0.2. This gives estimated odds ratios of 0.86 versus 0.82, which are not too far apart. This shows an association between the degree of physical activity and the risk of being mentally distressed, with a higher level of physical activity corresponding to a lower risk of being mentally distressed. We are interested in the degree to which this estimated association may be disturbed by correlated errors.
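The arithmetic behind these numbers is short enough to spell out. The sketch below uses only the values quoted in the text (linear coefficient –0.043, residual variance 0.24, factor 1.7, fitted logistic coefficient –0.2); the function name is hypothetical.

```python
import math


def approximate_logit_slope(linear_slope, residual_variance, factor=1.7):
    """Probit slope (linear slope / residual SD) rescaled to the logistic scale."""
    return factor * linear_slope / math.sqrt(residual_variance)


if __name__ == "__main__":
    approx = approximate_logit_slope(-0.043, 0.24)
    print(round(approx, 2))            # -0.15
    print(round(math.exp(approx), 2))  # 0.86, odds ratio per scale step
    print(round(math.exp(-0.2), 2))    # 0.82, from the fitted logistic model
```

The odds ratios here are per one-step increase on the physical activity scale.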
Assume now that there are correlated errors in the measurements of physical activity and psychological distress. On the basis of these data, there is no way that we can assess either the magnitude of any possible errors or the correlation between them, but we assume that the errors are relatively small. We assume a reliability coefficient $\lambda$ equal to 0.8 and 0.9 for both X and Y, that is, $\sigma_x^2/(\sigma_x^2 + \sigma_r^2) = \sigma_y^2/(\sigma_y^2 + \sigma_s^2) = (0.8, 0.9)$, where $\sigma_y^2$ denotes the variance of y. We can then calculate "corrected" regression coefficients and odds ratios for different values of the correlation coefficient $\rho$, by solving equations 2 and 3 for $\beta$ and $\sigma^2$, respectively. Figure 3 gives these corrected odds ratios for varying $\rho$, for $\lambda$ = 0.8 and 0.9, respectively. It is immediately seen that, for all positive correlations, the effect is attenuated. However, in this situation, there may obviously be a negative correlation between the bias components r and s, which gives an overestimation of the true association if the correlation is stronger than about –0.1.
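A correction of this kind can be sketched as a small inversion routine. The code below is an illustration under the situation 1 model, not the author's implementation: the function names are hypothetical, and the observed variance of X, which the text does not report, is treated as an input. The demonstration uses made-up true values and checks only that the inversion recovers them.

```python
import math


def naive_limits(beta, sigma2, var_x, var_r, var_s, rho):
    """Forward step: limits of the naive slope and residual variance."""
    var_X = var_x + var_r
    cov_XY = beta * var_x + rho * math.sqrt(var_r * var_s)
    var_Y = beta**2 * var_x + sigma2 + var_s
    beta_star = cov_XY / var_X
    return beta_star, var_Y - beta_star**2 * var_X


def corrected_parameters(beta_star, sigma2_star, var_X, reliability_x, reliability_y, rho):
    """Backward step: solve for beta and sigma^2 given reliabilities and rho."""
    var_Y = sigma2_star + beta_star**2 * var_X       # observed variance of Y
    var_x = reliability_x * var_X                    # true-exposure variance
    var_r = (1.0 - reliability_x) * var_X            # error variance in X
    var_y = reliability_y * var_Y                    # true-response variance
    var_s = (1.0 - reliability_y) * var_Y            # error variance in Y
    beta = (beta_star * var_X - rho * math.sqrt(var_r * var_s)) / var_x
    sigma2 = var_y - beta**2 * var_x                 # can be negative if inputs are inconsistent
    return beta, sigma2


if __name__ == "__main__":
    # Hypothetical true values, both reliabilities equal to 0.8, rho = 0.3.
    beta_star, sigma2_star = naive_limits(-0.3, 0.5, 1.0, 0.25, 0.1475, 0.3)
    print(corrected_parameters(beta_star, sigma2_star, 1.25, 0.8, 0.8, 0.3))
```

The corrected probit slope is then beta / sqrt(sigma2), and multiplying by 1.7 gives the approximate logistic coefficient for a chosen value of rho.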
| DISCUSSION |
|---|
We have looked at the effects of correlated errors on the observed odds ratios from a logistic regression model, from a threshold-model point of view. The model is of interest, as medical diagnoses are often based on a cutoff of an underlying continuous scale.
We have investigated two quite simple situations: first, the very simple model with one exposure variable measured with error, and second, a model where a covariate measured without error was added. These models could of course be extended. One interesting situation is when the errors in (x, y) depend also on the true values of (x, y). This can be modeled by letting $X_i = x_i + \mathrm{bias}_{x,i}$ and $Y_i = y_i + \mathrm{bias}_{y,i}$, where we let $\mathrm{bias}_{x,i} = \theta_0 + \theta_1 x_i + r_i$, and correspondingly for $y_i$. Again, we can calculate the naively observed coefficients and look at their properties. However, for illustrative purposes, this situation does not give very different results from what is already reported. Further, one can also, of course, extend the regression model to include more covariates, with or without error; one can let the errors depend on other covariates as well; and one can add random noise to the person-specific bias components.
We believe that our calculations can be used to gain some insight into the problem with correlated errors in epidemiologic data. Although the expressions seem rather complicated, it is possible to extract some main features that describe the effects of the correlated errors.
It is interesting to observe that the effect of the correlated errors does not depend on the prevalence of the response (disease), as neither the true regression coefficient b nor the observed coefficient b* is a function of the cutoff value c. (It should be noted that the value of c will affect the estimated intercept in the logistic model, though.) This is in contrast to what is shown in the purely categorical setting, with dependent misclassification (10–12). The reason for this is that the calculations in the categorical setting are based on the sensitivity and the specificity of the error-prone variables. In our situation, sensitivity and specificity for $\tilde{Y}$, for a given measurement error, are functions of the cutoff value, which determines the prevalence.
It should also be noted that this implies that we no longer have nondifferential errors, as the sensitivity and the specificity of $\tilde{Y}$ are functions of y, which is again a function of x as long as $\beta \neq 0$. This is a general result, first pointed out by Flegal et al. (18), although they were working with misclassified exposure variables. If a misclassified dichotomous variable is created by thresholding a continuous variable subject to nondifferential measurement error, then in general the misclassification will be differential.
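The point about thresholding can be illustrated numerically. The sketch below computes the sensitivity of the dichotomized error-prone response, Pr(Y > c | y > c, x), by simple trapezoidal integration, assuming a normal regression error and a normal measurement error s that is independent of y and x. The function name and all parameter values are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def sensitivity_given_x(x, alpha, beta, sigma, sd_s, c, steps=4000):
    """Pr(Y > c | y > c, x) for y = alpha + beta*x + eps and Y = y + s."""
    mu = alpha + beta * x
    upper = max(c, mu) + 10.0 * sigma  # density of y is negligible beyond this point
    h = (upper - c) / steps
    total = 0.0
    for k in range(steps + 1):
        y = c + k * h
        density = STD_NORMAL.pdf((y - mu) / sigma) / sigma
        detected = 1.0 - STD_NORMAL.cdf((c - y) / sd_s)  # Pr(s > c - y)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * density * detected
    joint = total * h                                    # Pr(Y > c, y > c | x)
    truly_positive = 1.0 - STD_NORMAL.cdf((c - mu) / sigma)
    return joint / truly_positive


if __name__ == "__main__":
    for x in (-1.0, 1.0):
        print(x, sensitivity_given_x(x, alpha=0.0, beta=0.5, sigma=1.0, sd_s=0.5, c=1.0))
```

With a nonzero slope the sensitivity differs between exposure levels even though the error in the continuous response is nondifferential; with beta = 0 the two values coincide.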
In our calculations, we have assumed that we have normally distributed variables and errors. This is obviously not realistic in most applications, but we believe that the more qualitative findings will also apply under other distributions.
The reasons for the dependent errors may be several (refer to reference 2 for a discussion), but one main reason is obviously the tendency for respondents to have either a generally positive or a generally negative attitude to the world around them. This may lead to high, respectively low, thresholds for reporting problems of any kind. Another reason for the dependent errors may be that some respondents tend to agree or disagree with questionnaire items independent of their content, and yet another reason may be respondents' tendency to answer in agreement with socially accepted standards. All these reasons are based on respondents' personality and are for that reason difficult to eliminate.
Spiegelman et al. (9) investigated possible study designs for estimation in situations with correlated errors in validation studies in nutritional epidemiology. Unfortunately, none of these results can be applied in our situation, as we will not be able to single out the person-specific bias component si from the measurement Yi. However, with some additional information and with proper restrictions on the parameters, the ideas in the report by Spiegelman et al. (9) can be applied, and a proper correction for measurement error can be performed. The idea and the solution are outlined in appendix B.
| APPENDIX A |
|---|
Here, we will give the calculations leading to the expressions for ß*x and ß*z under situation 2.
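Following, for example, the discussion by Carroll et al. (15, p. 26), we can write

$$\begin{pmatrix}\beta_x^* \\ \beta_z^*\end{pmatrix} = \begin{pmatrix}\sigma_x^2 + \sigma_r^2 & \sigma_{xz} \\ \sigma_{xz} & \sigma_z^2\end{pmatrix}^{-1}\left[\begin{pmatrix}\sigma_x^2 & \sigma_{xz} \\ \sigma_{xz} & \sigma_z^2\end{pmatrix}\begin{pmatrix}\beta_x \\ \beta_z\end{pmatrix} + \begin{pmatrix}\sigma_{rs} \\ 0\end{pmatrix}\right],$$

where $\sigma_{rs} = \rho\sigma_r\sigma_s$ denotes the covariance between $r_i$ and $s_i$. This gives

$$\beta_x^* = \frac{(\sigma_x^2\sigma_z^2 - \sigma_{xz}^2)\beta_x + \sigma_z^2\sigma_{rs}}{(\sigma_x^2 + \sigma_r^2)\sigma_z^2 - \sigma_{xz}^2}, \qquad \beta_z^* = \beta_z + \frac{\sigma_{xz}(\sigma_r^2\beta_x - \sigma_{rs})}{(\sigma_x^2 + \sigma_r^2)\sigma_z^2 - \sigma_{xz}^2}.$$

Defining $\gamma = \sigma_{xz}/\sigma_z^2$ (the coefficient of z in the regression of x on z), $\sigma_{x|z}^2 = \sigma_x^2 - \sigma_{xz}^2/\sigma_z^2$ (the residual variance from the regression of x on z), and $\lambda = \sigma_{x|z}^2/(\sigma_{x|z}^2 + \sigma_r^2)$, we get

$$\beta_x^* = \lambda\beta_x + \frac{\sigma_{rs}}{\sigma_{x|z}^2 + \sigma_r^2}, \qquad \beta_z^* = \beta_z + \gamma\left[(1 - \lambda)\beta_x - \frac{\sigma_{rs}}{\sigma_{x|z}^2 + \sigma_r^2}\right].$$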
| APPENDIX B |
|---|
Assume that, in addition to X and Y, we are able to measure some other variable W that is correlated with x and uncorrelated with the error components in both X and Y. Assume also that we are able to take replicated measurements of W on some subgroup. We then have an overall model
|
|
|
|
|
|
is independent and identically distributed with expected value 0 and variance 
and uncorrelated with x
,r
,s
, and 
.
This model has 11 model parameters, and the design, with replicates on W, permits estimation of 10 first- and second-order moments. This means that we will have to put some restrictions on the parameters in order to make the model identifiable. We will assume that 
/
= k, where k is some known constant. Appendix table 1 gives the identification of the model. By equating the theoretical moments to the sample moments, we obtain the standard method-of-moments estimators. Based on these, it is now an easy task to obtain consistent estimators of $\beta$ and $\sigma$. The ratio between these two estimators will then form a consistent estimator of b in the probit regression model.
|
| ACKNOWLEDGMENTS |
|---|
The data collection for the example was conducted as part of the Oslo Health Study, 2000–2001, in collaboration with the National Health Screening Service of Norway (now the Norwegian Institute of Public Health).
Conflict of interest: none declared.
| References |
|---|
1. Macleod J, Smith GD, Heslop P, et al. Psychological stress and cardiovascular disease: empirical demonstration of bias in a prospective observational study of Scottish men. BMJ (2002) 324:1247–51.
2. Podsakoff PM, MacKenzie SB, Lee JY, et al. Common method bias in behavioral research: a critical review of the literature and recommended remedies. J Appl Psychol (2003) 88:879–903.
3. Kristensen P. Information bias from dependent measurement error in observational studies. (In Norwegian). Tidsskr Nor Laegeforen (2005) 125:173–5.
4. Wacholder S, Armstrong B, Hartge P. Validation studies using an alloyed gold standard. Am J Epidemiol (1993) 137:1251–8.
5. Kaaks R, Riboli E, Estève J, et al. Estimating the accuracy of dietary questionnaire assessments: validation in terms of structural equation models. Stat Med (1994) 13:127–42.
6. Spiegelman D, Schneeweiss S, McDermott A. Measurement error correction for logistic regression models with an "alloyed gold standard." Am J Epidemiol (1997) 145:184–96.
7. Kipnis V, Carroll RJ, Freedman LS, et al. Implications of a new dietary measurement error model for estimation of relative risk: application to four calibration studies. Am J Epidemiol (1999) 150:642–51.
8. Kipnis V, Midthune D, Freedman LS, et al. Empirical evidence of correlated biases in dietary assessment instruments and its implications. Am J Epidemiol (2001) 153:394–403.
9. Spiegelman D, Zhao B, Kim J. Correlated errors in biased surrogates: study designs and methods for measurement error correction. Stat Med (2005) 24:1657–82.
10. Kristensen P. Bias from nondifferential but dependent misclassification of exposure and outcome. Epidemiology (1992) 3:210–15.
11. Chavance M, Dellatolas G, Lellouch J. Correlated nondifferential misclassifications of disease and exposure: application to a cross-sectional study of the relation between handedness and immune disorders. Int J Epidemiol (1992) 21:537–46.
12. Brenner H, Savitz DA, Gefeller O. The effects of joint misclassification of exposure and disease on epidemiologic measures of association. J Clin Epidemiol (1993) 46:1195–202.
13. Vogel C, Brenner H, Pfahlberg A. The effects of joint misclassification of exposure and disease on the attributable risk. Stat Med (2005) 24:1881–96.
14. Schaalje GB, Butts RA. Some effects of ignoring correlated measurement errors in straight line regression and prediction. Biometrics (1993) 49:1262–7.
15. Carroll RJ, Ruppert D, Stefanski LA. Measurement error in nonlinear models (1995) London, United Kingdom: Chapman & Hall.
16. Lien L, Tambs K, Oppedal B, et al. Is relatively young age within a school year a risk factor for mental health problems and poor school performance? A population-based cross-sectional study of adolescents in Oslo, Norway. BMC Public Health (2005) 5:102–9.
17. Lien L, Claussen B, Hauff E, et al. Bodily pain and associated mental distress among immigrant adolescents. A population-based cross-sectional study. Eur Child Adolesc Psychiatry (2005) 14:371–5.
18. Flegal KM, Keyl PM, Nieto FJ. Differential misclassification arising from nondifferential errors in exposure measurement. Am J Epidemiol (1991) 134:1233–44.