Skip Navigation


American Journal of Epidemiology Advance Access originally published online on November 16, 2006
American Journal of Epidemiology 2007 165(3):325-333; doi:10.1093/aje/kwk011
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
165/3/325    most recent
kwk011v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Right arrow Disclaimer
Google Scholar
Right arrow Articles by Perkins, N. J.
Right arrow Articles by Vexler, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Perkins, N. J.
Right arrow Articles by Vexler, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

American Journal of Epidemiology Copyright © 2006 by the Johns Hopkins Bloomberg School of Public Health All rights reserved; printed in U.S.A.

ORIGINAL CONTRIBUTIONS

Receiver Operating Characteristic Curve Inference from a Sample with a Limit of Detection

Neil J. Perkins1,2, Enrique F. Schisterman1 and Albert Vexler1

1 Division of Epidemiology, Statistics, and Prevention Research, National Institute of Child Health and Human Development, National Institutes of Health, Department of Health and Human Services, Bethesda, MD
2 Department of Mathematics and Statistics, American University, Washington, DC

Correspondence to Dr. Enrique F. Schisterman, Division of Epidemiology, Statistics, and Prevention Research, National Institute of Child Health and Human Development, 6100 Executive Boulevard, Bethesda, MD 20852 (e-mail: schistee{at}mail.nih.gov).

Received for publication February 2, 2006. Accepted for publication June 20, 2006.


    ABSTRACT
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 VARIANCE AND CONFIDENCE...
 SIMULATION STUDY
 EXAMPLE
 DISCUSSION
 APPENDIX
 References
 
The receiver operating characteristic curve is a commonly used tool for evaluating biomarker usefulness in clinical diagnosis of disease. Frequently, biomarkers being assessed have immeasurable or unreportable samples below some limit of detection. Ignoring observations below the limit of detection leads to negatively biased estimates of the area under the curve. Several correction methods are suggested in the areas of mean estimation and testing but nothing regarding the receiver operating characteristic curve or its summary measures. In this paper, the authors show that replacement values below the limit of detection, including those suggested, result in the same biased area under the curve when properly accounted for, but they also provide guidance on the usefulness of these values in limited situations. The authors demonstrate maximum likelihood techniques leading to asymptotically unbiased estimators of the area under the curve for both normally and gamma distributed biomarker levels. Confidence intervals are proposed, the coverage probability of which is scrutinized by simulation study. An example using polychlorinated biphenyl levels to classify women with and without endometriosis illustrates the potential benefits of these methods.

censoring; curve estimation; detection; maximum likelihood; receiver operating characteristic


Abbreviations: A, area under the curve; LOD, limit of detection; MLE, maximum likelihood estimate; p(c), specificity; q(c), sensitivity; RMSE, root mean square error; ROC, receiver operating characteristic


    INTRODUCTION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 VARIANCE AND CONFIDENCE...
 SIMULATION STUDY
 EXAMPLE
 DISCUSSION
 APPENDIX
 References
 
Biomarkers are of ever increasing importance to the diagnosis and treatment of existing and emerging diseases. The magnitude of this importance is directly linked to the ability of a researcher to identify and demonstrate an association between biomarker levels and disease status. For example, diseased individuals may have higher levels of a biomarker than do nondiseased individuals. Therefore, the biomarker could be used as an indicator for disease. Levels above some cutpoint result in individuals being labeled as diseased and individuals with levels below the cutpoint being labeled as nondiseased.

For a cutpoint c, sensitivity (q(c)) is the probability of individuals having a positive test result given that he or she is truly diseased, and specificity (p(c)) is the probability of a negative test result given that he or she is truly healthy/nondiseased. Unless the biomarker perfectly identifies disease status, moving c up or down results in a tradeoff of increased sensitivity for decreased specificity or visa versa.

The receiver operating characteristic (ROC) curve, a mapping of sensitivity as a function of 1 – specificity, is a frequently used device that facilitates easy comparison of the overall effectiveness, effectiveness at specified levels of false positives rates, and optimal diagnostic ability of different biomarkers (13). These comparisons take place through the summary measures, the most common of which is the area under the curve (A) (16). The curve itself originates at the point (0, 0), terminates at (1, 1), and is formed by plotting pairs of 1 – specificity and sensitivity across all possible cutpoints. A curve that hugs the diagonal is indicative of a marker with little ability to classify disease status, as the diagonal indicates pure chance at any cutpoint.

The summary measure A reflects how well the biomarker performs across all cutpoints. A ranges from 0.5 (no better than chance) to 1 (perfectly determining disease status). An extensive body of literature exists on ROC curve methods, allowing researchers to better evaluate all types of biomarkers.

Estimates of a ROC curve can be either parametric or nonparametric (figure 1) (1, 3). Nonparametric curves are calculated by the ordering of observed diseased and nondiseased biomarker levels. Parametric curves are estimated on the basis of parameter estimates that govern assumed distributions of diseased and nondiseased biomarker levels. Estimates of summary measures are then constructed from these curves, with those based on a parametric curve also being functions of parameter estimates.


Figure 1
View larger version (8K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
FIGURE 1. Empirical (solid) and parametric (dashed) receiver operating characteristic curves based on levels of polychlorinated biphenyl 114, classifying women with and without endometriosis. The data are affected by a limit of detection where measurements are unquantifiable below some level. The parametric curve is based on gamma distributions, with parameters estimated using maximum likelihood.

 
One obstacle to biomarker evaluation that researchers often face is where "nondetects" or missing data exist below some limit of detection (LOD), quantified as d. These could be due to a non-"gold standard," equipment limitations, or measurement error among other reasons. Regardless of the origin, they must be addressed when estimating a ROC curve and/or any subsequent summary measures for biomarker assessment.

The simplest way to address them is to omit the observations and assess the biomarker naively by use of only the truncated data set of quantified observations. Surely, this will yield a bias assessment of biomarker effectiveness, as we would be throwing away information.

Substituting a replacement value for unobservable data is a proposed alternative to provide a less biased assessment. This would create an artificial mass, and the question, "What value?", still exists. We could use the lower or upper bound of the interval of missing values, usually 0 and d, respectively. Approximations to the expected value of the interval, d/2 and Formula have also been proposed for distributions that are highly skewed and symmetric or slightly skewed, respectively (e.g., lognormal and normal) (7). It has been shown that all of these methods lead to biased estimation of the mean and standard deviation parameters, regression coefficients, and odds ratio, but their usage has yet to be explored in the context of the ROC curve (810).

It is clear that conventional parametric ROC methods (i.e., binormal curve) do not account for the mass and cannot be used here. Schisterman et al. (11) proposed a method for estimating the ROC curve and A for a biomarker with a natural mass or repeated observation at the same value. A LOD creates a mass that is not natural to the biomarker. We will demonstrate that any replacement value below the LOD is arbitrary in the sense that the estimated ROC curve and A are unaffected when properly accounting for the mass. It is important to note that, although the current discussion focuses on A, the same concerns apply to other ROC curve summary statistics.

Instead, in Materials and Methods we propose using maximum likelihood to estimate A. We demonstrate this technique for the binormal curve and also for gamma assumptions. In the section Variance and Confidence Intervals, confidence intervals are developed for each distributional assumption by applying the delta method, the simulated coverage probabilities of which are displayed in the section Simulation Study. An example is given in the section entitled Example using polychlorinated biphenyl levels to classify women with endometriosis to illustrate replacement and maximum likelihood techniques. The Discussion section provides a brief discussion of issues surrounding estimation based on data affected by a LOD.


    MATERIALS AND METHODS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 VARIANCE AND CONFIDENCE...
 SIMULATION STUDY
 EXAMPLE
 DISCUSSION
 APPENDIX
 References
 
Let Xi and Yj denote the biomarker levels of the ith diseased and jth nondiseased individuals for i = 1,...,nX and j = 1,...,nY. Assume that these levels are independent observations from continuous distributions

Formula
where f(x) and g(y) are the probability density functions for diseased and nondiseased, respectively. On the basis of these two distributions, A can be written as the probability that a nondiseased individual has a lower biomarker value than a diseased individual or

Formula (1)

Now suppose that, for all i and j, biomarker values are subject to a LOD and are only quantifiable above d such that

Formula (2)
where not applicable (N/A) is observed when a biomarker level is unquantifiable. Hence, biomarker values below d are lumped together creating an artificial mass of values that is noninformative except for being less than d. In order to assess this biomarker's effectiveness, generating a ROC curve and A, we are forced to address these observations with nonquantified values.

Replacement values
One remedy for analysis based on data with a LOD is to replace the missing values with a number a such that

Formula
for all i and j. This creates a mass of observations at a specific number a that are not intrinsic to the continuous nature of the biomarker but allow missing observations to be included in estimation procedures.

The A based on data with such a mass, denoted by AZ, can be written (11)

Formula
If a < d, as common replacement values d/2, Formula, and the expected value below d are, then

Formula
where

Formula
Simplifying yields

Formula
which is independent of the choice of a, below d, and thus a does not affect P(ZX > ZY) or whether AZ = P(ZX > ZY) = P(X > Y) = A even when parameters governing the probability are known. The ROC curve-associated AZ would be consistent from the points (0, 0) to (1 – p(d), q(d)) and then change to a straight line to (1, 1). So, using a replacement value a < d should have less bias than ignoring unquantified observations, but it is still bias.

Maximum likelihood
Another approach for handling LOD data is to utilize the maximum likelihood method. Consider a random sample {yj; j = 1, ..., n} from a random variable (Y) with distribution G(y; {theta}Y) and unknown parameter {theta}Y. Let {zj; j = 1, ..., n} denote observed values subject to the transformation in equation 2 for a fixed d, where k values are quantified and nk observations are censored below d. The likelihood function for parameter {theta}Y given the sample of all zj can be viewed as a Bernoulli, where an observation's value is missing or not. If missing, we can obtain no additional information beyond its value being below d. If present, we can assess the likelihood of {theta}Y given the value zj above d. Putting the zj in ascending order with N/A values leading, we can write the likelihood function as

Formula
where g(y; {theta}Y) is the probability density function of Y, and I is an indicator function taking 1 if true, 0 otherwise. To calculate the maximum likelihood estimate (MLE) Formula of {theta}Y, simply maximize L with respect to the parameter. In the case where {theta}Y is a vector parameter, maximize L with respect to each parameter while fixing all others.

By use of this approach, MLEs Formula and Formula are obtained for the unknown parameters {theta}X and {theta}Y that govern distributions F(x; {theta}X) and G(y; {theta}Y) for diseased and nondiseased biomarker levels, respectively. Substituting Formula and Formula for their respective parameters yields the estimate Formula = A(Formula,Formula), which is the MLE of A through the invariance property of MLEs.

Normal case.
Cohen (12) in 1950 and Gupta (13) in 1952 independently considered the case where a population was normally distributed and values were censored above and below, respectively, some value. Both developed MLEs for µ and {sigma}, and Harter and Moore (14) provided similar developments for samples censored above and below simultaneously.

With adaptation of these techniques to a biomarker that is censored solely below by a fixed d, the log-likelihood function for the controls, zY, using Gupta's notation, is

Formula
where {eta}Y =(d – µY)/{sigma}Y, and {Phi} is the standard normal distribution function. Differentiating the log-likelihood function with respect to µY and {sigma}Y yields equations that can be combined such that

Formula
where

Formula
(10, 13). Rewriting in the form

Formula
allows us to solve for Formula numerically and, subsequently, Formula by substitution.

A, the ROC curve formed by diseased and nondiseased biomarker levels that follow normal distributions, is commonly called a binormal curve with

Formula
So, performing these steps similarly for cases and controls yields MLEs for all four parameters necessary to calculate

Formula

Gamma case.
Similarly, consider the scenario where biomarker levels follow gamma distributions. The log-likelihood equation for these controls, zY, is

Formula
where C is a constant,

Formula
(15). Unlike the normal, the two equations formed by differentiating with respect to alpha and beta cannot be combined in such a way that would allow us to solve for a single parameter. So, we must maximize the log likelihood with respect to both. Software is readily available that can perform this maximization yielding Formula and Formula MLEs for {alpha} and ß.

Now consider the area under a ROC curve formed from the gamma distributions of diseased and nondiseased biomarker levels with shape and scale parameters {alpha}X, ßX and {alpha}Y, ßY, respectively. As in the "normal case" section, here we can simplify the area such that

Formula
where

Formula
(16). Thus, Formula can be calculated by substituting the MLEs Formula, Formula, Formula, and Formula for their respective parameters.


    VARIANCE AND CONFIDENCE INTERVALS
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 VARIANCE AND CONFIDENCE...
 SIMULATION STUDY
 EXAMPLE
 DISCUSSION
 APPENDIX
 References
 
The estimator Formula is asymptotically normally distributed such that Formula ~ Normal(0, {sigma}Formula), where N = nX + nY, and {sigma}Formula is the variance parameter. This result is left to the Appendix as well as the development of {sigma}Formula. This relation holds for not only biomarker values following normal or gamma distributions but also any Formula found via maximum likelihood estimation.

By use of this distributional development, {alpha}-level two-tailed confidence intervals for A are constructed from

Formula
When {sigma}Formula is unknown, using the estimator Formula in lieu of {sigma}Formula leads to approximate {alpha}-level confidence intervals for A.


    SIMULATION STUDY
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 VARIANCE AND CONFIDENCE...
 SIMULATION STUDY
 EXAMPLE
 DISCUSSION
 APPENDIX
 References
 
Our results rely on asymptotics, especially the confidence intervals. For this reason, we performed a simulation study consisting of B = 2,000 independent samples of nx diseased and ny nondiseased (nx = ny = 50, 100, 200, 1,000) individuals to assess the maximum likelihood techniques over various sample sizes and shape scenarios.

The normal case was evaluated first with nondiseased values generated from a normal distribution with µY = 0 and {sigma}Formula = 1. Diseased values were obtained from a normal distribution with variance {sigma}Formula = 1,3 and mean µX corresponding to A = 0.6, 0.7, 0.8, 0.9.

Next, nondiseased biomarker levels were generated from a gamma distribution with {alpha}Y = 1.5 and ßY = 1. Samples of diseased levels were from a gamma with {alpha}X = 1.5,2,3 and ßX set to achieve A = 0.6, 0.7, 0.8, 0.9.

For each simulated group of samples, maximum likelihood techniques were used to find parameter MLEs, which were subsequently used to calculate Formula and 95 percent confidence intervals for A. These estimators were found before a LOD effect was introduced, no d, and then again for various d, resulting in increasing levels of unquantifiable data. The d selected corresponded to the 20, 40, 60, 80th quantiles of the nondiseased distribution. For clarity, the same d value determined the unquantifiable portions of both diseased and nondiseased values.

Summary measures bias and root mean square error (RMSE) were calculated over the 2,000 groups. The coverage probability of the confidence interval was also found by assessing the proportion of confidence intervals that contained the true A. Tables 1 and 2 display excerpts of these simulation results for the normal and gamma cases, respectively. The summaries of the other simulated scenarios are consistent with tables 1 and 2 but have been omitted for spatial considerations. Generally, as the sample sizes increase, the bias and RMSE decrease for normal and gamma distributions. The highest percentages of bias over both table 1 and table 2 are 1.3, 0.6, 0.4, and 0.1 percent, respectively, for nx = ny = 50, 100, 200, 1,000. The coverage probabilities of the 95 percent confidence intervals are nominal or within random departure from nominal with samples of 50 observations and improve as the number of observations increases. The only exceptions are slightly less than nominal coverage for A = 0.9 for the gamma, nx = ny = 50, 100, and possibly for the normal with nx = ny = 50.


View this table:
[in this window]
[in a new window]

 
TABLE 1. Simulation summary of maximum likelihood estimate Formula based on normal distributions affected by limit of detection

 

View this table:
[in this window]
[in a new window]

 
TABLE 2. Simulation summary of maximum likelihood estimate Formula based on gamma distributions affected by limit of detection

 
As an aside to our simulation study of MLEs that account for a LOD, we considered using replacement values and conventional area estimation techniques that ignore a LOD for the same 2,000 iterations of every case of both the normal and gamma. We replaced with the values d/2 and Formula and recorded bias and RMSE. As expected, Formula performed slightly better than d/2 for normal, but d/2 was better for the gamma case because of the skewed shape of the gamma. Naïve replacement yields the percentages of bias that are roughly 0.3, 0.7, 1.4, and 4.3 percent for 20, 40, 60, and 80 percent missing for both the normal and gamma cases used in tables 1 and 2 and across sample sizes. For all cases, this bias was substantially greater, often from five to 10 times as large, than that yielded by using MLEs. The RMSE was generally comparable to that from MLEs with the exception that, for nx = ny = 1,000 when 60 and 80 percent are missing, the RMSE from replacement can be from 30 percent to three times larger than that of the MLEs. Also, for the gamma case, bias and RMSE increased as {alpha}X increased, whereas they were relatively constant for MLEs.


    EXAMPLE
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 VARIANCE AND CONFIDENCE...
 SIMULATION STUDY
 EXAMPLE
 DISCUSSION
 APPENDIX
 References
 
Endometriosis is a gynecologic disorder that occurs primarily in women of reproductive age. It is a disease exclusive to species that menstruate, such as humans and primates (17). The weight of evidence underlying an association between dioxin and polychlorinated biphenyls and endometriosis includes data from experimental animal, primate, and human studies.

Using an incident case-control study of 28 cases and 51 controls, we evaluated how various polychlorinated biphenyls classified cases and controls with endometriosis. Investigators were interested specifically in polychlorinated biphenyl 114. The LOD experimentally set at d = 0.005 resulted in a censoring of 76 percent of the controls and 36 percent of the cases. Empirical analysis leads to the nonsmooth ROC curve in figure 1. The straight line in the upper portion of the empirical curve corresponds to the mass in the data created by censoring. This ROC curve led to an area point estimate of 0.701 (95 percent confidence interval: 0.576, 0.827). We also applied the maximum likelihood techniques developed here. Quantile plots showed that normal distributions would be very poor choices and that gamma distributions with parameters equal to MLEs fit well. Moreover, polychlorinated biphenyl levels are naturally restricted above zero, which is intrinsic to the gamma and not to the normal distribution. The smooth ROC curve in figure 1 is based on cases and controls following gamma distributions with MLEs substituted for parameters. The area under this curve is Formula = 0.777 (95 percent confidence interval: 0.655, 0.899).

The fact that Formula is greater than the empirical estimate reflects a greater perceived discriminating capability of polychlorinated biphenyl 114. The difference is a result of the empirical estimate disregarding possible differentiation below d, while Formula includes the possibility. Including the possibility that levels below d behave in a manner consistent with the levels above d yields a substantial increase in the overall ability of polychlorinated biphenyl 114 to classify women with and without endometriosis. This increase may warrant additional resources for a larger sample size for more precise estimation or improved laboratory methods for better measurement of samples.


    DISCUSSION
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 VARIANCE AND CONFIDENCE...
 SIMULATION STUDY
 EXAMPLE
 DISCUSSION
 APPENDIX
 References
 
In this paper, we considered the task of estimating the area under the ROC curve of a biomarker whose measurement is unattainable below a limit of detection for some reason. A review of the literature shows that estimation (distributional parameters, odds ratio, regression) based on such data focuses primarily on replacement values substituted for unquantified observations and maximum likelihood techniques. Replacement values across topics were shown to yield biased estimation. Maximum likelihood techniques, though more computationally intensive, result in asymptotically unbiased estimation. We evaluated both techniques in the context of biomarker evaluation via the area under the ROC curve, specifically A, but other summary measures could be handled similarly. We proved analytically that any replacement value chosen causes bias in a proper estimate of A and that Formula is asymptotically unbiased and normally distributed. Our assessment of the asymptotics through simulation study of varying sample sizes confirmed the unbiasedness of Formula and nominal coverage probability of 95 percent confidence intervals for A.

To use this method and calculate Formula, one can write the likelihood functions described in the "normal case" and "gamma case" sections herein and maximize (or minimize – log likelihood) fairly easily in SAS (SAS Institute, Inc., Cary, North Carolina), S-PLUS (e.g., nlminb) (Insightful Corporation, Seattle, Washington), R (e.g., optim) (Comprehensive R Archive Network, Department of Statistics and Mathematics, Vienna University of Economics and Business Administration, Vienna, Austria), or other software. The calculation of the asymptotic variance and a confidence interval will require additional programming; this also is straightforward using this same software and the information in our Appendix.

We also assessed the use of replacement values naively in standard methods. This type of estimation is theoretically incorrect, as distributional assumptions do not account for the mass. Simply, the area calculated is under the wrong curve. However, replacement values used in this manner yielded estimates of A comparable to Formula in both the normal and gamma cases that we evaluated here for small degrees of censoring, 20 and 40 percent missing. In these cases, this technique could provide a practical solution to computational limitations of equipment or available software.

The calculation of MLEs and, subsequently, Formula requires distributional assumptions. Considering both normal and gamma distributions gives flexibility in this assumption. In addition, while not constructed here, other continuous distributions (i.e., lognormal, Weibull) could be handled similarly if necessary.

We assessed the robustness of our methods to departure from normality and gamma by misspecifying Student's t distributed data (5, 10, and 25 df) as if they were normally distributed and lognormally distributed data as if they were gamma. The means and variances of the alternative distributions were matched to those of the normals and gammas in our original simulations. As expected, misspecifying the true distribution of the data does introduce bias; however, the percentage of bias remained below 3 percent. The bias of Formula ranged from about –0.02 to 0.02, with the vast majority in both normal and gamma cases having bias from –0.01 to 0.01. The RMSE was comparable, larger by approximately 5 percent, to that based on data from actual normal and gamma distributions. Coverage probability ranged from nominal to nearly nominal when nx = ny = 50 but decreased as nx and ny increased. Coverage probability is less sensitive to misspecification of distributional assumptions when for data subject to LOD than for data unaffected by a LOD when MLEs are used. The gamma case was less affected than the normal case in this manner. Overall, Formula proved not to be overly sensitive to departures from reasonable distributional assumptions.

It is important to note that, when measurements are subject to a LOD due to instrument sensitivity, observations above the LOD might also be subject to the measurement error. Although we do not address it here, in practice this measurement error should also be considered and, if present, corrected for.

Accounting for a biomarker's potential discriminating ability is important when comparing biomarkers based on overall discriminating ability. Consider biomarkers affected and unaffected by a LOD, where underestimation of the affected marker may lead to choosing the less discriminatory biomarker with no LOD. Improved measurement techniques may have been warranted if the biomarker's potential had been considered through MLEs.

With biomarkers becoming more and more integral to disease diagnosis and treatment, their assessment is linked to effective, accurate statistical techniques. Limits of detection arise for a variety of reasons, resulting in unquantifiable observations. The maximum likelihood estimators developed here properly account for their "missingness" and provide investigators with consistent estimates of biomarkers' true discriminating capabilities.


    APPENDIX
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 VARIANCE AND CONFIDENCE...
 SIMULATION STUDY
 EXAMPLE
 DISCUSSION
 APPENDIX
 References
 
Let us consider biomarker levels of diseased, X, and nondiseased, Y, individuals as independent random variables that have continuous distributions F(x; {theta}X) and G(y; {theta}Y), respectively, where {theta}X and {theta}Y are vector parameters. Hence, the ROC curve formed has an area under the curve A({theta}X, {theta}Y) = P(X > Y; {theta}X, {theta}Y).

Suppose random samples {Xi, i = 1, ..., nX} and {Yj, j = 1, ..., nY} yield kX and kY measured, quantified values and nX kX and nYkY biomarker levels that are unquantifiable (N/A) below some fixed limit of detection, d. Thus, the observations are transformed such that

Formula
for all i and j.

Let Formula and Formula be MLEs based on these censored data, {ZX1,...,ZXnX} and {ZY1,...,ZYnY}, respectively. In order to investigate the asymptotic distribution of Formula , we apply the usual Taylor quadratic expansion of A around the parameters

Formula (A1)
where o(T) is a function such that o(T)/T->0 as T -> {infty}. Therefore, since Formula and Formula are MLEs, one can show Formula where Formula denotes asymptotically distributed, N = nX + nY, and the variance {sigma}Formula is obtained by

Formula (A2)
with

Formula
as nX -> {infty} and nY -> {infty}. The covariance matrices of MLEs of unknown vectors of parameters are evaluated by the inverse of the Fisher information matrices, for example, IX, IY, i.e., Formula and Formula

Consider the case where F and G are normal distributions with vector parameters of mean and standard deviation {theta}X = {µX, {sigma}X} and {theta}Y = {µY{sigma}Y}, respectively. Let {Phi} and {phi} be the standard normal distribution and density functions, respectively. The MLEs Formula and Formula are calculated from a sample with censoring below a fixed d and lead to the estimator of the area under a binormal curve

Formula (A3)
Following the previous asymptotic developments, Formula is consistent and normally distributed. The variance of Formula is found by substituting the partial derivatives

Formula (A4)
and covariance matrices of Formula and Formula into A2. Here, the covariance matrices of diseased and nondiseased MLEs follow the general form

Formula (A5)
where Formula , and p = 1 – {Phi}({eta}) (13). By applying equations A4 and A5 with Formula and Formula to equation A2, we obtain the estimator Formula

Now, consider the case where F and G are gamma distributions with vector parameters {theta}X = {{alpha}X, ßX} and {theta}Y = {{alpha}Y, ßY}, respectively, where {alpha} denotes the shape parameter and ß denotes scale. MLEs Formula and Formula based on a sample with left censoring are substituted into the area under the bigamma curve

Formula (A6)
where Formula and {Gamma}(n) = (n – 1)! to form the estimator Formula(16). This Formula is consistent and asymptotically normally distributed. The variance of Formula in equation A6 is formed by substituting the partial derivatives

Formula (A7)
where Formula and covariance matrices of Formula and Formula in equation A2. Again, both covariances are found by the inverse of the Fisher information matrices

Formula (A8)
Based on a censored gamma distribution,

Formula (A9)
where Formula , and Formula are the appropriate entries in equation A8 (15). By applying equations A7A9 with Formula and Formula to equation A2, we obtain the estimator Formula


    ACKNOWLEDGMENTS
 
This research was supported by the Intramural Research Program of the Epidemiology Branch, Division of Epidemiology, Statistics, and Prevention Research, National Institute of Child Health and Human Development, National Institutes of Health.

Conflict of interest: none declared.


    References
 TOP
 ABSTRACT
 INTRODUCTION
 MATERIALS AND METHODS
 VARIANCE AND CONFIDENCE...
 SIMULATION STUDY
 EXAMPLE
 DISCUSSION
 APPENDIX
 References
 

  1. Kramer CK. (1992) Evaluating medical tests. (Sage Publications, Inc, Newbury Park, CA).
  2. Zweig MH and Campbell G. (1993) Reciever operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 39:561–77.[Abstract/Free Full Text]
  3. Zhou XH, Obuchowski NA, McClish DK. (2002) Statistical methods in diagnostic medicine. (Wiley & Sons Interscience, New York, NY).
  4. Faraggi D. (2003) Adjusting ROC curves and realted indices for covariates. J R Stat Soc (D) 52:179–92.
  5. Schisterman EF, Faraggi D, Reiser B. (2004) Adjusting the generalized ROC curve for covariates. Stat Med 23:3319–31.[CrossRef][ISI][Medline]
  6. Youden WJ. (1950) Index for rating diagnostic tests. Cancer 3:32–5.[CrossRef][ISI][Medline]
  7. Hornung RW and Reed LD. (1990) Estimation of average concentration in the presence of nondetectable values. Appl Occup Environ Hyg 5:46–51.
  8. Haas CH and Scheff PA. (1990) Estimation of averages in truncated samples. Environ Sci Technol 24:912–19.[CrossRef]
  9. Singh A and Nocerino J. (2002) Robust estimation of mean and variance using environmental data sets with below detection limit observations. Chemometrics Intell Lab Syst 60:69–86.[CrossRef]
  10. Lynn HS. (2001) Maximum likelihood inference for left-censored HIV RNA data. Stat Med 20:33–45.[CrossRef][ISI][Medline]
  11. Schisterman EF, Reiser B, Faraggi D. (2006) ROC analysis for markers with mass at zero. Stat Med 25:623–38.[CrossRef][ISI][Medline]
  12. Cohen AC. (1950) Estimating the mean and variance of normal populations from singly truncated and doubly truncated samples. Ann Math Stat 21:557–69.
  13. Gupta AK. (1952) Estimation of the mean and standard deviation of a normal population from a censored sample. Biometrika 39:260–73.[Free Full Text]
  14. Harter HL and Moore AH. (1966) Iterative maximum-likelihood estimation of the parameters of normal populations from singly and doubly censored samples. Biometrika 53:205–13.[Abstract/Free Full Text]
  15. Harter HL and Moore AH. (1967) Asymptotic variances and covariances of maximum-likelihood estimators, from censored samples, of the parameters of Weibull and gamma populations. Ann Math Stat 38:557–70.
  16. Faraggi D, Reiser B, Schisterman E. (2003) ROC curve analysis for biomarkers based on pooled assessments. Stat Med 22:2515–27.[CrossRef][ISI][Medline]
  17. Louis GM, Weiner JM, Whitcomb BW, et al. (2005) Environmental PCB exposure and risk of endometriosis. Hum Reprod 20:279–85.[Abstract/Free Full Text]

Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (PDF) Freely available
Right arrow All Versions of this Article:
165/3/325    most recent
kwk011v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Right arrow Disclaimer
Google Scholar
Right arrow Articles by Perkins, N. J.
Right arrow Articles by Vexler, A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Perkins, N. J.
Right arrow Articles by Vexler, A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?