American Journal of Epidemiology Advance Access originally published online on November 5, 2008
American Journal of Epidemiology 2009 169(1):113-121; doi:10.1093/aje/kwn292
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PRACTICE OF EPIDEMIOLOGY |
Fractional Polynomials and Model Selection in Generalized Estimating Equations Analysis, With an Application to a Longitudinal Epidemiologic Study in Australia
Correspondence to Dr. Jisheng Cui, World Health Organization Collaborating Centre for Obesity Prevention, Deakin University, 221 Burwood Highway, Melbourne, VIC 3125, Australia (e-mail: jisheng.cui{at}deakin.edu.au).
Received for publication January 28, 2008. Accepted for publication August 18, 2008.
| ABSTRACT |
|---|
|
|
|---|
In epidemiologic studies, researchers often need to establish a nonlinear exposure-response relation between a continuous risk factor and a health outcome. Furthermore, periodic interviews are often conducted to take repeated measurements from an individual. The authors proposed to use fractional polynomial models to jointly analyze the effects of 2 continuous risk factors on a health outcome. This method was applied to an analysis of the effects of age and cumulative fluoride exposure on forced vital capacity in a longitudinal study of lung function carried out among aluminum workers in Australia (1995–2003). Generalized estimating equations and the quasi-likelihood under the independence model criterion were used. The authors found that the second-degree fractional polynomial models for age and fluoride fitted the data best. The best model for age was robust across different models for fluoride, and the best model for fluoride was also robust. No evidence was found to suggest that the effects of smoking and cumulative fluoride exposure on change in forced vital capacity over time were significant. The trend 1 model, which included the unexposed persons in the analysis of trend in forced vital capacity over tertiles of fluoride exposure, did not fit the data well, and caution should be exercised when this method is used.
fractional polynomial; generalized estimating equation; longitudinal studies; model selection; occupational exposure
Abbreviations: AIC, Akaike's Information Criterion; BIC, Bayesian Information Criterion; FVC, forced vital capacity; GEE, generalized estimating equations; QIC, quasi-likelihood under the independence model criterion
| INTRODUCTION |
|---|
|
|
|---|
In epidemiologic studies, especially occupational epidemiologic studies, we often need to analyze the effect of a continuous exposure risk factor on a health outcome. However, the statistical distribution of the exposure variable is often highly skewed—that is, the level of exposure in a small group of people can be much higher than that in the bulk of the people in a study (1, 2). This causes difficulties in establishing a smoothed exposure-response relation between the health outcome and the continuous variable. Conventional linear regression methods are usually inappropriate in this situation because of the nonlinear relation (3). Furthermore, periodic interviews are often conducted over a period of time, or even over several years in some studies (4, 5), in order to make repeated measurements of the outcome and relevant risk factors in each individual. Such a study design enables investigators to have sufficient statistical power to detect the long-term effect of a risk factor on an uncommon chronic disease or on a small change in a continuous health outcome over time (6, 7). The possible correlation between repeated measurements in an individual is a major issue in any analysis of longitudinal data, and it needs to be accounted for by appropriate statistical models.
Although the analysis of such longitudinal data is complicated, a number of studies, especially in occupational epidemiology, have been published. These studies have included analyses of the effect of exposure to dust and tobacco smoking on decline in forced expiratory volume in 1 second among aluminum potroom workers (8–10); the influence of dietary intake on lung function in middle-aged men (11); the impact of weight gain on lung function (12); the risk of exposure to dust on lung function decline among tunnel workers (13); lung function changes in workers exposed to cobalt compounds (14); lung function changes in coke oven workers (15); and neurotoxic effects on the central nervous system in solvent-exposed workers (16). However, in most of these studies, investigators did not establish a smoothed exposure-response relation, which is important in finding the causes of a disease (17, 18).
In the analysis of cross-sectional data, several statistical methods have been proposed for establishing the exposure-response relation. A popular method is to transform the continuous exposure data into categories with approximately equal subjects in each category (tertiles, quartiles, quintiles, etc.) and then perform a trend test on the response variable over the ordinal scores (i.e., 0, 1, 2, ...) or over the midpoint of each category (19, 20), where the number 0 represents the unexposed category and 1 represents the first category of exposure, and so on. Robertson et al. (21) proposed a method to test the trend of the response variable over scores of the exposed subjects only (i.e., over 1, 2, ...), which does not include the unexposed subjects. Greenland and Poole (22) published a critical evaluation of this method. However, it has been recognized that categorization or even dichotomization of a continuous variable is not generally a good idea (23–25). This method does not utilize all information contained in the exposure data, and it can result in reduced statistical power and less precise estimates of relevant parameters (24, 26).
Various other statistical methods have been proposed for establishing a general nonlinear exposure-response relation. Among these methods, fractional polynomials have been shown to have advantages over the above-mentioned categorization method (24, 27–29). The fractional polynomial method was first proposed by Royston and Altman (30), and further details about the computing algorithm were published in the Stata Technical Bulletin (31). Use of fractional polynomial models can increase the flexibility of conventional polynomial models. The lower-order polynomial models are often limited in the range of the shape of the curves for modeling a nonlinear exposure-response relation. The higher-order polynomial models often produce an undesirable shape of the curve at the tails of the curve. The fractional polynomial models combine the low- and high-order polynomials into 1 function and produce a family of curves with flexible shapes.
The fractional polynomial method has also been extended to multivariable analyses (32–34). However, application of the fractional polynomial method to the analysis of longitudinal data has been rare, although this method has been frequently used in the analysis of cross-sectional data. Possible reasons for this include lack of relevant computing software with which to calculate the fractional polynomial functions for the longitudinal data and difficulty of selecting the best-fitting fractional polynomial model using appropriate model selection criteria (35, 36).
Information criteria have been used for model selection in statistical analyses. The most commonly used criteria in the analysis of cross-sectional data include Akaike's Information Criterion (AIC) (37) and the Bayesian Information Criterion (BIC) (38). Other information criteria have also been developed but have been less frequently used (39, 40). However, in the analysis of longitudinal data, especially when the generalized estimating equations (GEE) method (41) is used, these information criteria are no longer valid, because the GEE method is based on the quasi-likelihood theory (42), instead of the maximum likelihood theory upon which the AIC and BIC methods were developed. There is no likelihood function available for the GEE method (43).
Although the AIC and BIC methods cannot be used directly in GEE analyses, a modified criterion has been proposed (44) and has been referred to as the "quasi-likelihood under the independence model criterion" (QIC). The QIC method can be used to compare different GEE models in the analysis of longitudinal data. The model with the smallest QIC value is usually chosen as the best model. Cui (35) and Cui and Qian (36) implemented this method and have developed a general computing program with which to calculate the QIC value for all of the commonly used statistical distributions, link functions, and correlation structures available in Stata software (45).
Our main objective in this analysis was to apply the fractional polynomial models to establish a smoothed exposure-response relation between fluoride exposure and change in lung function over time. The effect of age on change in lung function was also modeled as a fractional polynomial function. The joint fractional polynomial models for age and fluoride exposure were compared with the traditional categorical methods using the QIC method and the newly developed computing program. These methods were applied to a longitudinal study of occupational health in Australian aluminum workers.
| MATERIALS AND METHODS |
|---|
|
|
|---|
The data
The data used in this study came from an occupational longitudinal study (Healthwise) of change in lung function among new employees at an aluminum processing plant in Australia (46). A total of 446 persons were recruited into this study from 1995 to the end of 2003. The cutoff date of 2003 was used because this was the latest year for which complete exposure data were available at the time of the analysis. All participants were invited to participate in a baseline assessment of relevant risk factors and then annual follow-up assessments, using standardized questionnaires and spirometry lung function tests. A rolling seal spirometer (Graseby Andersen, Smyrna, Georgia) was used to measure the forced vital capacity (FVC) of each participant according to American Thoracic Society guidelines (47). If a participant was taking medication which could influence lung function, the measurement of FVC was postponed (48). Cumulative exposures to several contaminants, including fluoride, were calculated from the start of employment to each follow-up interview. The fluoride exposure data were estimated using routinely collected air monitoring data, based on at least 5 air samples from the most exposed full shift tasks of each person. A task exposure matrix was developed based on these routinely collected exposures and the participant's job history (49).
The main outcome variable in this analysis was the FVC. The explanatory variables included age, smoking status, educational level, marital status, workplace, and height. Except for height, which was measured at the baseline interview only, all other risk factors were time-dependent variables and were measured at each follow-up interview. The original variable for age was centered at 16 years to make the new variable, age-16, have positive values, as required by the fractional polynomial method. The original variable for height was centered at 180 cm for ease of interpretation of the intercept term in the regression model. Stratified analyses were conducted for males and females separately; only the results from the 305 male participants with complete data for the outcome and explanatory variables are reported in this article, for illustrative purposes.
Seventy male employees dropped out of the study before 2003. No evidence was found to suggest that their FVCs at baseline and the last interview or their annual decline in FVC differed significantly from those of the men who remained in the study. The median follow-up period was 1.9 years for the 47 employees who were lost to follow-up and had more than just the baseline interview. The main reason for dropping out of the study was leaving the company (63%), with only a few participants being lost to contact or deceased (4%). The remaining 33% of these 70 employees refused to continue participation in the study, usually because of lack of interest in lung function tests.
Fractional polynomial
The general m-degree fractional polynomial function of a continuous variable x is given by (30)
|
| (1) |
are fractional powers and the round bracket notation denotes the Box-Tidwell transformation—that is,
if
and
if
. It differs from the conventional polynomial in that the power p can be a noninteger number. Equation 1 also includes the situation of repeated powers, where
for at least 1 pair of indices
,
. In this situation, a term of
is used in the fractional polynomial function, instead of the term
itself.
Although many different combinations of powers can be made in fractional polynomial models, it has been suggested that it is often adequate to consider only a subset of the power, S = {–2, –1, –0.5, 0, 0.5, 1, 2, 3}, in practice (30), where
denotes ln(x). This subset of fractional polynomials provides rich and flexible shapes of curves for most practical model-fitting purposes. Curves in this subset include linear, reciprocal, logarithmic, square root, and square transformation of x. If the values of the powers
are known, fitting a fractional polynomial model is similar to conducting a conventional linear regression analysis. However, the powers of a fractional polynomial model are usually unknown and need to be estimated from the data. In this article, we use the notation FP1(power) and FP2(power1, power2) to represent the first- and second-degree fractional polynomial models, respectively (50). We compared our results with those from the multivariable fractional polynomial models (51, 52) derived using the Stata program "mfp."
The change in FVC over time among persons who were exposed to fluoride was estimated. However, persons who were unexposed to fluoride were also included in the analysis for estimation of the effect of fluoride exposure, which was treated as a dichotomous nuisance parameter. For example, let E represent the dichotomous variable for fluoride exposure status (E = 1 for fluoride exposure and E = 0 otherwise); then the effect of fluoride exposure can be estimated by means of the β coefficient associated with the variable E. The change in FVC among exposed persons can be estimated by means of the β coefficient associated with the interaction term FP1 x E or FP2 x E, depending on which fractional polynomial model was used (21).
Categorical methods
The fractional polynomial models were also compared with 4 categorical methods. In the first method (denoted the "binary model"), the fluoride exposure data were dichotomized into 2 categories (1 for exposed and 0 for unexposed). In the second method (denoted the "tertile model"), the exposure data were categorized into tertiles with approximately equal numbers of exposed persons in all tertiles, while retaining unexposed persons as a separate category. In the third method (denoted the "trend 1 model"), the tertile scores (0, 1, 2, and 3) were treated as a continuous variable and the change in FVC was analyzed over these 4 values. In the fourth method (denoted the "trend 2 model"), the change in FVC was analyzed over the 3 scores 1, 2, and 3. The major difference between the trend 1 and trend 2 models was the inclusion of unexposed persons in estimation of the change in FVC. The trend 2 model has been used in the analysis of dietary factors in pancreatic cancer (21) and has also been described by Clayton and Hills (53).
Splines
We also fitted a stepped spline model for fluoride when the best-fitting fractional polynomial model for age was fixed. Deciles of the cumulative fluoride exposure were used to define 9 knots for the spline function, which was constant between 2 consecutive knots. Indicator variables were also generated for the intervals between 2 consecutive knots, and they were included in a GEE model to obtain the estimated values of the splines from the associated β coefficients.
GEE and random-effects models
Parameters in the categorical and fractional polynomial models were estimated using the GEE method. The GEE method is an extension of the generalized linear model (54), and it is used to account for possible correlation among the repeated measurements taken in an individual. Because the outcome in this study was a continuous variable, we used the normal distribution, the identity link function, and the exchangeable working correlation structure in the analysis (6, 7, 43). As Liang and Zeger (41) pointed out, when the mean response function is correctly specified, the estimates of the parameters from the GEE model are consistent even if the correlation structure is misspecified (i.e., converges eventually in probability to the true parameter being estimated). Therefore, a robust variance estimator was also used in the analysis to account for the uncertainty about the correlation structure. We also fitted random-effects models (55) for comparison with the GEE models used in this study.
Model selection
The best-fitting fractional polynomial model needs to be chosen using appropriate model selection criteria. In this analysis, we used an iterative approach to select the final model. First, a cubic function, which is a special case of the fractional polynomial function, was assumed for the age variable. Then each of the 8 first-order and 36 second-order fractional polynomial models and 4 categorical models was fitted to the fluoride exposure data. Their corresponding QIC values were calculated, and the model with the smallest QIC value was selected as the best model for fluoride. The second step was to fix the best fractional polynomial model for fluoride, and we then fitted each of the first- and second-order fractional polynomial models for the age variable. Similar to the procedure used in the first step, the model with the smallest QIC value was chosen as the best model for age. Then we fixed the fractional polynomial for age and repeated this iterative procedure until both fractional polynomial models were the same as in the previous step. For ease of comparison, we calculated the relative QIC value, which was the difference between the QIC values for the binary model and a specific model. The statistical analyses in this study were conducted using Stata software (45).
| RESULTS |
|---|
|
|
|---|
Descriptive statistics
The demographic and other characteristics of the 305 male participants at the baseline interview are summarized in Table 1. FVC ranged from 3,200 mL to 7,740 mL, with a mean of 5,534 mL (standard deviation, 803). Because this was a study of new employees at an aluminum processing plant, most participants were young or middle-aged, and their mean age was 30.5 years (standard deviation, 8.6). The mean height was 178.5 cm (standard deviation, 7.1). Slightly more than half of the participants were nonsmokers, and the rest were smokers. The number of interviews conducted each year among these 305 persons is also shown in Table 1. A total of 1,125 interviews were conducted from 1995 to the end of 2003; more than half were conducted in the last 3 years. Excluding the 40 participants who attended only the baseline interview, the total follow-up period for the rest of the 265 participants was 848.7 person-years; the median duration of follow-up was 2.7 years.
|
Figure 1 shows a histogram of cumulative fluoride exposure among the 305 male participants over the duration of the study. Approximately one-fourth of the participants were unexposed. Among those who were exposed, the median exposure level was approximately 1.0 mg/m3-year, the interquartile range was 0.06–2.6 mg/m3-year, and the 95th percentile was 6.4 mg/m3-year. Only a small group of people (<3.8%) had fluoride exposures above 6.4 mg/m3-year; the majority (81%) were either unexposed or had an exposure level less than 2.6 mg/m3-year. These highly skewed exposure data were analyzed by means of the different methods outlined below.
|
Categorical and fractional polynomial models
Table 2 shows the comparison between the best-fitting fractional polynomial models and the categorical models based on the relative QIC values. First, the best-fitting fractional polynomial model for age was robust across different models for fluoride. No matter which model was assumed for fluoride (binary, tertile, trends, or fractional polynomial), the best-fitting first-order fractional polynomial model for age had a power of 2.0, and the best-fitting second-order fractional polynomial model had powers of 0.5 and 1.0. Secondly, the best-fitting model for fluoride was also consistent for different models for age. The best FP1 model for fluoride had a power of –0.5, and the best FP2 model had powers of –0.5 and –1.0. Finally, among all models, the model with FP2(–0.5, –1.0) for fluoride and FP2(0.5, 1.0) for age had the smallest relative QIC value and thus was chosen as the final model. The difference in the relative QIC values between the final model and the binary model suggested that the final model fitted the data substantially better than the binary model.
|
There was substantial improvement when using the second-degree fractional polynomials for age as compared with the first-degree fractional polynomials. The magnitude of the improvement depended on which model was used for fluoride. For example, when fractional polynomial models were used for fluoride (no matter which degree), the improvement in the QIC value was approximately 65. However, when the trend 2 model was used for fluoride, the improvement was approximately 101. Among the categorical models for fluoride, the trend 1 model had the largest QIC value, suggesting that this model did not fit the data very well. The tertile model and the trend 2 model fitted the data reasonably well.
Final model
Regression coefficients and 95% confidence intervals for the final best-fitting fractional polynomial model are shown in Table 3. The effect of height on FVC was significant, and for every 1-cm increase in height at the baseline interview, the value of FVC increased by nearly 70 mL, on average. The effect of age was also significant, although it had a nonlinear relation with FVC. For example, when values of other covariates were fixed, for a 1-year increase in age the value of FVC increased by 22.2 mL for a man aged 20 years and decreased by 29.8 mL for a man aged 40 years.
|
No significant effect was found for smoking (P = 0.81), and the difference in FVC between exposed and unexposed persons was also not significant (P = 0.15). Because these 2 variables were of major interest in this study, they were included in the final model. No significant effect of cumulative fluoride exposure on change in FVC was found. For a moderate increase in fluoride exposure, such as 1.0 mg/m3-year, the annual decline in FVC was approximately 29.0 mL for a man aged 40 years and 35.0 mL for a man aged 50 years. The Wald test showed that the 2 β coefficients associated with the fractional polynomials in fluoride were both not significantly different from 0 (
2 = 0.65, P = 0.72). No significant interaction effects were found between age and height, age and smoking, height and smoking status, and age and fluoride exposure, respectively. However, smoking status modified the effect of fluoride; the 2 β coefficients of the fractional polynomial function were –7.4 (95% confidence interval: –12.5, –2.3) and 0.051 (95% confidence interval: 0.008, 0.093), respectively. This single significant interaction effect was also included in the final model in Table 3. Other nonsignificant risk factors, such as educational level, marital status, and workplace, were not included in the final model.
Other models
We also investigated the polynomial effect of height at the baseline interview. The effect of height2 was not significant (β = 0.108; standard error, 0.439), but the effect of height3 was significant (β = –0.088; standard error, 0.040). However, the trivial difference obtained by adding the cubic term to the linear model for height was reflected only in the right tail of the FVC curve.
We also fitted the multivariable fractional polynomial model using the "mfp" command in Stata software. The best-fitting first- and second-order fractional polynomial models for age were FP1(2.0) and FP2(0.5, 1.0), respectively, which were consistent with the results in Table 2, and they were also robust across different models for fluoride. However, the best-fitting model for fluoride obtained using the mfp command was different from those in Table 2. Nevertheless, the effect of fluoride was also not significant using the mfp command, which was consistent with the results derived by means of the GEE model.
Similar results were derived using the random-effects models as compared with the GEE models. We found that the fractional polynomial function for age was robust across these 2 methods. The best-fitting random-effects FP1 model had a power of 2.0, and the random-effects FP2 model had powers of 0.5 and 1.0. The effects of fluoride exposure under the random-effects model were also not significant, which was consistent with the GEE model, although their fractional powers differed.
The best-fitting fractional polynomial models for men who were nonsmokers and unexposed to fluoride are shown in Figure 2, where the connected points are longitudinal observations of given individuals. The 3 solid lines represent the change in FVC among people of different heights (170 cm, 180 cm, and 190 cm, respectively) at the baseline interview. The observed trajectories of each individual's FVC are also shown, where they are linked with dotted lines. The observed data were also categorized by individual height at the baseline interview, and different symbols are used to represent the data in different categories. The fitted curves show the trend of increase and then decline in FVC at different ages, which is consistent with the trajectories of the observed longitudinal data.
|
Figure 2 shows that, on average, an individual's FVC increased up to the age of 23 years and then gradually declined after this age. The maximum FVC value that an individual could reach depended on the person's height at the baseline interview. Similarly, how far a person's FVC could decline at a particular age also depended on his height at baseline. Although the fitted curves for smokers are not shown in this figure, their corresponding curves can be derived by moving down the fitted curves by 16 mL for a person with the same height.
The fitted spline model for men aged 30 years with a height of 180 cm at baseline is shown in Figure 3, where the connected points are longitudinal observations of given individuals. The corresponding best-fitting fractional polynomial model and the observed data from a small group of people aged 25–35 years with heights of 175–185 cm are also presented. The flatness of the splines and the fractional polynomial curve suggested that the cumulative exposure to fluoride had little effect on the change in FVC over time, which again confirmed the nonsignificance of the effect of fluoride, as shown in Table 3. The fractional polynomial model was very close to the stepped splines, except that the fractional polynomial model indicated a small decline in FVC immediately after the initial exposure assessment, while the spline model did not.
|
| DISCUSSION |
|---|
|
|
|---|
In this paper, we have proposed a new method of using fractional polynomial models to establish the nonlinear exposure-response relations between age, fluoride exposure, and FVC in an occupational health longitudinal study in Australia. We compared the performance of these fractional polynomial models with that of the categorical methods. We also applied the newly developed QIC method and computing program to select the best-fitting GEE model among a large number of candidate models.
We found that the best-fitting model for age was a second-order fractional polynomial model with powers of 0.5 and 1.0 and the best-fitting model for cumulative fluoride exposure was also a second-order fractional polynomial model with powers of –0.5 and –1.0. The tertile method and the trend 2 method also fitted the data reasonably well. However, the trend 1 method did not fit the data very well. The best model for age was robust to models for fluoride and random-effects or multivariable models for age. The effect of age and height on FVC was significant. No evidence was found to suggest that the effects of smoking and cumulative fluoride exposure on FVC were significant, although their interaction was significant. The best-fitting fractional polynomial model for fluoride was close to the spline model, and the flatness of these 2 curves reassured us of the nonsignificance of the fluoride effect.
In this study, we found that the fractional polynomial models fitted the data better than the categorical methods. The fractional polynomial models have a range of flexible shapes of curves to fit a nonlinear exposure-response relation. It was not surprising that the second-order fractional polynomial model fitted the data best, because it uses more information in the data than other methods. It has been shown that dichotomization of a continuous variable at the median value reduces the power to a similar extent as discarding one-third of the data (23, 25, 56, 57). Because the fractional polynomial method requires that the exposure variable be positive or be made positive by shifting the measurement origin, we used the method proposed by Robertson et al. (21) to avoid the 0 exposure values in unexposed persons. This method is essentially similar to the "trend 2" method in terms of separating the effect of the dichotomous fluoride exposure as a nuisance parameter. The analysis results showed that the increase in FVC reached the peak when a man was approximately 23 years of age and then gradually declined at a variable rate, depending on his age. For example, the annual rate of decline was 35 mL for a man aged 50 years but 17 mL for a man aged 30 years. In general, the older a person the more rapid is the decline. Furthermore, smoking can increase the rate of decline in FVC.
It was not surprising to find that the "trend 1" model did not fit the data very well. Actually, the estimated β coefficients were virtually of the same magnitude, approximately 85–90 mL per mg/m3-year, under the "tertile" model (i.e., 83.8, 90.0, and 86.2). If we fit a "trend 2" model through these tertiles, we obtain a nonsignificant slope of 0.8 (95% confidence interval: –22.5, 24.0). However, if we fit a straight line through the 4 values 0, 83.8, 90.0, and 86.2, which is the same as fitting a "trend 1" model, the goodness of fit of this model will be poor, because of their nonlinear relation. In fact, they belong to the "J-shaped" exposure-response relation (58). Therefore, it is not sensible to include the unexposed persons in a trend analysis of the effect of a continuous exposure variable. Instead, the method proposed by Robertson et al. (21) can be considered in such situations.
Interpretation of regression coefficients from the random-effects model is different from that of coefficients from the GEE model (6). The goodness of fit of a random-effects model is usually assessed using the likelihood ratio test or the AIC, while the goodness of fit of a GEE model may be assessed by means of the QIC method. However, there is no formal statistical test available with which to evaluate whether a new GEE model represents a significant improvement over an old GEE model based on the QIC method. As a rule of thumb related to the well-known AIC method, 2 models are said to be significantly different if the difference between their AIC values is larger than 2 (39, 40). However, the QIC method is based on quasi-likelihood theory, while the AIC method is based on likelihood theory.
The best-fitting fractional polynomial model for age was also confirmed by using the independent Stata program "mfp." A significance level of 0.157 can be used with this method to select an additional variable, which is approximately equivalent to minimizing the AIC value (52). The GEE model specifies the working correction structure directly. However, the mfp command in Stata essentially assumes an independent correlation structure. On the other hand, our new computing program has the advantage of using the programming language facility in Stata to estimate the magnitude of the change in FVC only among those persons who were exposed to fluoride.
The fact that the fractional polynomial function for age was robust suggests that it may not be necessary to use the iterative approach to find the best-fitting fractional polynomial model for a significant risk factor, such as age, in this study. A simple solution is to fit a fractional polynomial model for the risk factor while adjusting for other explanatory variables. If the risk factor is significant, this fractional polynomial model is probably the final best-fitting model. However, our experience in this study suggests that this practice does not apply to a nonsignificant risk factor, such as cumulative exposure to fluoride. For a nonsignificant risk factor, the final form of the fractional polynomial function depends on the underlying model of this variable and the function of other risk factors in the model.
There were some limitations in this study. We did not conduct a simulation study to formally investigate the performance of the fractional polynomial method as compared with the categorical methods. The fractional polynomial models were also limited within the first 2 degrees, without further investigation of higher-degree fractional polynomial models. The treatment of smoking status in 2 simple categories (smoker vs. nonsmoker) was also crude. Further investigation using more precise information about smoking amount, such as pack-years, is warranted. Although we conducted an investigation of the polynomial effect of height, we did not include the cubic term for height in our final model, for ease of interpretation of the final model and demonstration of the new method. Further investigation of the best-fitting fractional polynomial model for height, when fractional polynomial functions for age and fluoride are incorporated, is also needed.
In addition, we did not take account of the uncertainty in estimation of the powers in fractional polynomial models as separate parameters. This implies that the derived QIC values in Table 2 may be slightly less than what they should have been after the adjustment. We are not aware of any formal statistical test with which to adjust for the extra parameters in QIC. However, a simple method of adjustment is to increase the value of QIC by 2 for an FP1 model and increase the value by 4 for an FP2 model, according to the rule of thumb for the AIC method. The QIC method is an extension of the AIC method to the longitudinal data, and the penalty terms in their formulae are similar. Because the differences in QIC values between different models were greater than 10 in this study, such an adjustment would not change the main conclusion drawn in this article.
In conclusion, we proposed to use fractional polynomial models to jointly analyze the effects of 2 continuous risk factors on a health outcome, and we successfully applied this method to an analysis of the effects of age and cumulative fluoride exposure on FVC in a longitudinal study of lung function in Australia. We found that the second-degree fractional polynomial models for age and fluoride fitted the data best. The best model for age was robust to different specifications of models for fluoride and random-effects or multivariable models for age. We found no evidence to suggest that the effects of smoking and cumulative fluoride exposure on change in FVC over time were significant. The fractional polynomial model utilizes more information in the data and thus increases statistical power in comparison with the categorical methods. The trend 1 model, which included the unexposed persons in the analysis, did not fit the data well, and caution should be exercised when this method is used.
| ACKNOWLEDGMENTS |
|---|
Author affiliations: Department of Epidemiology and Preventive Medicine, Faculty of Medicine, Nursing and Health Sciences, Monash University, Melbourne, Victoria, Australia (Jisheng Cui, Michael Abramson, Anthony Del Monaco, Geza Benke, Martine Dennekamp, Malcolm Sim); World Health Organization Collaborating Centre for Obesity Prevention, Deakin University, Melbourne, Victoria, Australia (Jisheng Cui); and School of Population Health, University of Western Australia, Crawley, Western Australia, Australia (Nick de Klerk, Arthur W. Musk).
The authors thank Alcoa World Alumina Australia (Booragoon, Western Australia, Australia) for funding the Healthwise study, from which the longitudinal exposure and lung function data were obtained for use as an illustrative example in this article. The authors are grateful to members of the Healthwise scientific advisory board for helpful comments and suggestions about the analysis of the Healthwise study data and to Judy Hankin and Alison Hayes for collecting the Healthwise data.
Conflict of interest: none declared.
| References |
|---|
|
|
|---|
- Checkoway H, Pearce N, Kriebel D. Research Methods in Occupational Epidemiology (2004) New York, NY: Oxford University Press.
- Steenland K. Case Studies in Occupational Epidemiology (1993) New York, NY: Oxford University Press.
- Rosner B. Fundamentals of Biostatistics (1995) 4th ed. Belmont, CA: Duxbury Press.
- D'Agostino RB Sr, Vasan RS, Pencina MJ, et al. General cardiovascular risk profile for use in primary care: The Framingham Heart Study. Circulation (2008) 117(6):743–753.
[Abstract/Free Full Text] - Kannel WB. The Framingham study [letter]. Br Med J (1976) 2(6046):1255.
[Free Full Text] - Fitzmaurice G, Laird N, Ware J. Applied Longitudinal Analysis (2004) Hoboken, NJ: Wiley-Interscience.
- Diggle P, Heagerty P, Liang K-Y, et al. Analysis of Longitudinal Data (2002) 2nd ed. New York, NY: Oxford University Press.
- Søyseth V, Kongerud J, Kjuus H, et al. Bronchial responsiveness and decline in FEV1 in aluminium potroom workers. Eur Respir J (1994) 7(5):888–894.[Abstract]
- Søyseth V, Boe J, Kongerud J. Relation between decline in FEV1 and exposure to dust and tobacco smoke in aluminium potroom workers. Occup Environ Med (1997) 54(1):27–31.
[Abstract/Free Full Text] - Kongerud J, Boe J, Søyseth V, et al. Aluminium potroom asthma: the Norwegian experience. Eur Respir J (1994) 7(1):165–172.[Abstract]
- Butland BK, Fehily AM, Elwood PC. Diet, lung function, and lung function decline in a cohort of 2512 middle aged men. Thorax (2000) 55(2):102–108.
[Abstract/Free Full Text] - Chinn S, Jarvis D, Melotti R, et al. Smoking cessation, lung function, and weight gain: a follow-up study. Lancet (2005) 365(9471):1629–1635.[CrossRef][Web of Science][Medline]
- Ulvestad B, Bakke B, Eduard W, et al. Cumulative exposure to dust causes accelerated decline in lung function in tunnel workers. Occup Environ Med (2001) 58(10):663–669.
[Abstract/Free Full Text] - Verougstraete V, Mallants A, Buchet JP, et al. Lung function changes in workers exposed to cobalt compounds: a 13-year follow-up. Am J Respir Crit Care Med (2004) 170(2):162–166.
[Abstract/Free Full Text] - Wu J, Griffiths D, Kreis IA, et al. Lung function changes in coke oven workers during 12 years of follow up. Occup Environ Med (2004) 61(8):686–691.
[Abstract/Free Full Text] - Ihrig A, Dietz MC, Bader M, et al. Longitudinal study to explore chronic neuropsychologic effects on solvent exposed workers. Ind Health (2005) 43(3):588–596.[CrossRef][Web of Science][Medline]
- Phillips CV, Goodman KJ. The missed lessons of Sir Austin Bradford Hill [electronic article]. Epidemiol Perspect Innov (2004) 1(1):3.[CrossRef][Medline]
- Hill AB. The environment and disease: association or causation? Proc R Soc Med (1965) 58(5):295–300.[Web of Science][Medline]
- Maclure M, Greenland S. Tests for trend and dose response: misinterpretations and alternatives. Am J Epidemiol (1992) 135(1):96–104.
[Abstract/Free Full Text] - Zhao LP, Kolonel LN. Efficiency loss from categorizing quantitative exposures into qualitative exposures in case-control studies. Am J Epidemiol (1992) 136(4):464–474.
[Abstract/Free Full Text] - Robertson C, Boyle P, Hsieh CC, et al. Some statistical considerations in the analysis of case-control studies when the exposure variables are continuous measurements. Epidemiology (1994) 5(2):164–170.[Web of Science][Medline]
- Greenland S, Poole C. Interpretation and analysis of differential exposure variability and zero-exposure categories for continuous exposures. Epidemiology (1995) 6(3):326–328.[Web of Science][Medline]
- Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med (2006) 25(1):127–141.[CrossRef][Web of Science][Medline]
- Weinberg CR. How bad is categorization? Epidemiology (1995) 6(4):345–347.[Web of Science][Medline]
- Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ (2006) 332(7549):1080.
[Free Full Text] - Brown CC, Kipnis V, Freedman LS, et al. Energy adjustment methods for nutritional epidemiology: the effect of categorization. Am J Epidemiol (1994) 139(3):323–338.
[Abstract/Free Full Text] - Greenland S. Dose-response and trend analysis in epidemiology: alternatives to categorical analysis. Epidemiology (1995) 6(4):356–365.[Web of Science][Medline]
- Greenland S. Avoiding power loss associated with categorization and ordinal scores in dose-response and trend analysis. Epidemiology (1995) 6(4):450–454.[Web of Science][Medline]
- Govindarajulu US, Spiegelman D, Thurston SW, et al. Comparing smoothing techniques in Cox models for exposure-response relationships. Stat Med (2007) 26(20):3735–3752.[CrossRef][Web of Science][Medline]
- Royston P, Altman DG. Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling (with discussion). Appl Stat (1994) 43(3):429–467.[CrossRef]
- Royston P, Altman DG. Using fractional polynomials to model curved regression relationships. Stata Tech Bull. (1994) 21(5):11–23.
- Sauerbrei W, Royston P. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. J R Stat Soc Ser A (1999) 162(1):71–94.[CrossRef]
- Royston P, Sauerbrei W, Altman DG. Modeling the effects of continuous risk factors. J Clin Epidemiol (2000) 53(2):219–221.[CrossRef][Web of Science][Medline]
- Sauerbrei W, Meier-Hirmer C, Benner A, et al. Multivariable regression model building by using fractional polynomials: description of SAS, STATA and R programs. Comput Stat Data Anal (2006) 50(12):3464–3485.[CrossRef]
- Cui J. QIC program and model selection in GEE analyses. Stata J (2007) 7(2):209–220.
- Cui J, Qian G. Selection of working correlation structure and best model in GEE analyses of longitudinal data. Commun Stat Simul Comput. (2007) 36(5):987–996.[CrossRef]
- Akaike H. A new look at the statistical model identification. IEEE Trans Automat Contr (1974) 19(6):716–723.[CrossRef]
- Schwarz G. Estimating the dimension of a model. Ann Stat (1978) 6(2):461–464.[CrossRef][Web of Science]
- Burnham K, Anderson D. Multimodal inference: understanding AIC and BIC in model selection. Sociol Methods Res. (2004) 33(2):261–304.[Abstract]
- Burnham K, Anderson D. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach (2002) 2nd ed. New York, NY: Springer-Verlag New York, Inc.
- Liang K-Y, Zeger S. Longitudinal data analysis using generalized linear models. Biometrika (1986) 73(1):13–22.
[Abstract/Free Full Text] - Wedderburn R. Quasi-likelihood functions, generalized linear models, and the Gauss-Newton method. Biometrika (1974) 61(3):439–447.
[Abstract/Free Full Text] - Hardin J, Hilbe J. Generalized Estimating Equations (2003) Boca Raton FL: Chapman & Hall, Inc.
- Pan W. Akaike's information criterion in generalized estimating equations. Biometrics (2001) 57(1):120–125.[CrossRef][Web of Science][Medline]
- Stata Corporation. Stata Statistical Software: Release 9 (2005) College Station, TX: Stata Corporation.
- Abramson M, Benke G, Cui J, et al. Potroom asthma may be due more to sulphur dioxide than fluoride [abstract]. Respirology (2007) 12(s4):A135.
- American Thoracic Society. Standardization of spirometry, 1994 update. Am J Respir Crit Care Med (1995) 152(3):1107–1136.[Web of Science][Medline]
- Fritschi L, Sim MR, Forbes A, et al. Respiratory symptoms and lung-function changes with exposure to five substances in aluminium smelters. Int Arch Occup Environ Health (2003) 76(2):103–110.[Web of Science][Medline]
- Benke G, Sim MR, McKenzie DP, et al. Comparison of first, last, and longest-held jobs as surrogates for all jobs in estimating cumulative exposure in cross-sectional studies of work-related asthma. Ann Epidemiol (2008) 18(1):23–27.[CrossRef][Web of Science][Medline]
- Royston P, Ambler G, Sauerbrei W. The use of fractional polynomials to model continuous risk variables in epidemiology. Int J Epidemiol (1999) 28(5):964–974.
[Abstract/Free Full Text] - Royston P, Ambler G. Multivariable fractional polynomials. Stata Tech Bull. (1998) 43(3):24–32.
- Sauerbrei W, Royston P, Binder H. Selection of important variables and determination of functional form for continuous predictors in multivariable model building. Stat Med (2007) 26(30):5512–5528.[CrossRef][Web of Science][Medline]
- Clayton D, Hills M. Statistical Models in Epidemiology (1996) Oxford, United Kingdom: Oxford Science Publications.
- Nelder J, Wedderburn R. Generalized linear models. J R Stat Soc Ser A (1972) 135(3):370–384.[CrossRef]
- Laird NM, Ware JH. Random-effects models for longitudinal data. Biometrics (1982) 38(4):963–974.[CrossRef][Web of Science][Medline]
- MacCallum RC, Zhang S, Preacher KJ, et al. On the practice of dichotomization of quantitative variables. Psychol Methods (2002) 7(1):19–40.[CrossRef][Web of Science][Medline]
- Cohen J. The cost of dichotomization. Appl Psychol Meas (1983) 7(3):249–253.[CrossRef]
- Marschner IC, Simes RJ, Keech A. Biases in the identification of risk factor thresholds and J-curves. Am J Epidemiol (2007) 166(7):824–831.
[Abstract/Free Full Text]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


