American Journal of Epidemiology Advance Access originally published online on January 12, 2006
American Journal of Epidemiology 2006 163(7):670-675; doi:10.1093/aje/kwj063
Original Contribution
The Inconsistency of "Optimal" Cutpoints Obtained using Two Criteria based on the Receiver Operating Characteristic Curve
1 Division of Epidemiology, Statistics and Prevention Research, National Institute of Child Health and Human Development, Bethesda, MD
2 Department of Mathematics and Statistics, American University, Washington, DC
Correspondence to Dr. Enrique F. Schisterman, Division of Epidemiology, Statistics and Prevention Research, National Institute of Child Health and Human Development, 6100 Executive Blvd., Bethesda, MD 20852 (e-mail: schistee{at}mail.nih.gov).
Received for publication June 17, 2005. Accepted for publication October 13, 2005.
| ABSTRACT |
|---|
The use of biomarkers is of ever-increasing importance in clinical diagnosis of disease. In practice, a cutpoint is required for dichotomizing naturally continuous biomarker levels to distinguish persons at risk of disease from those who are not. Two methods commonly used for establishing the "optimal" cutpoint are the point on the receiver operating characteristic curve closest to (0,1) and the Youden index, J. Both have sound intuitive interpretations (the point closest to perfect differentiation and the point farthest from none, respectively) and are generalizable to weighted sensitivity and specificity. Under the same weighting of sensitivity and specificity, these two methods identify the same cutpoint as "optimal" in certain situations but different cutpoints in others. In this paper, the authors examine situations in which the two criteria agree or disagree and show that J is the only "optimal" cutpoint for given weighting with respect to overall misclassification rates. A data-driven example is used to clarify and demonstrate the magnitude of the differences. The authors also demonstrate a slight alteration in the (0,1) criterion that retains its intuitive meaning while resulting in consistent agreement with J. In conclusion, the authors urge that great care be taken when establishing a biomarker cutpoint for clinical use.
area under curve; biological markers; cutpoints; data interpretation, statistical; epidemiologic methods; ROC curve; statistics; Youden index
Abbreviations: AUC, area under the curve; ROC, receiver operating characteristic
| INTRODUCTION |
|---|
The proper diagnosis of disease and treatment administration is a task that requires a variety of tools. Through advancements in biology and laboratory methods, a multitude of biomarkers are available as clinical tools for such diagnosis. These biomarkers are usually measured on a continuous scale with overlapping levels for diseased and nondiseased persons. Cutpoints dichotomize biomarker levels, providing benchmarks that label people as diseased or not diseased on the basis of "positive" or "negative" test results. Biomarker levels of persons with known disease status are used to evaluate potential cutpoint choices and, hopefully, identify a cutpoint that is "optimal" under some criterion.
Such a data set would comprise biomarker levels for persons classified as coming from the diseased ($D$) or nondiseased ($\bar{D}$) population. These levels could then be classified in terms of positive (+) or negative (−) test results on the basis of whether the biomarker levels were above or below a given cutpoint. In most instances, some persons will be misclassified, truly belonging to a population other than the one indicated by their test results. The sensitivity (q(c)) and specificity (p(c)) of that biomarker for a given cutpoint, c, are the probabilities of correctly identifying a person's disease status (i.e., identifying true positives and true negatives):

$$q(c) = P(+ \mid D)$$

$$p(c) = P(- \mid \bar{D})$$
A receiver operating characteristic (ROC) curve plots this sensitivity against 1 minus specificity over all possible cutpoints. The ROC curve has become a useful tool in comparing the effectiveness of different biomarkers (1–3). This comparison takes place through summary measures such as the area under the curve (AUC) and the partial AUC, with higher area values indicating higher levels of diagnostic ability (1, 2, 4). A biomarker with an AUC of 1 differentiates perfectly between diseased persons (sensitivity = 1) and healthy persons (specificity = 1). An AUC of 0.5 means that, overall, there is a 50-50 chance that the biomarker will correctly identify diseased or healthy persons as such.
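As a minimal sketch of these definitions, the following computes empirical sensitivity, specificity, ROC points, and a trapezoidal AUC. The data values are made up for illustration, the function names are hypothetical, and the convention that a level at or above the cutpoint counts as a positive test is an assumption.

```python
def sens_spec(diseased, healthy, c):
    """Empirical sensitivity q(c) and specificity p(c); positive means level >= c."""
    q = sum(x >= c for x in diseased) / len(diseased)
    p = sum(x < c for x in healthy) / len(healthy)
    return q, p


def roc_points(diseased, healthy):
    """ROC points (1 - specificity, sensitivity) over all observed cutpoints."""
    points = {(0.0, 0.0), (1.0, 1.0)}
    for c in set(diseased) | set(healthy):
        q, p = sens_spec(diseased, healthy, c)
        points.add((1.0 - p, q))
    return sorted(points)


def auc(points):
    """Area under the ROC points by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up biomarker levels, for illustration only.
diseased = [2.1, 2.9, 3.4, 3.8, 4.6]
healthy = [1.0, 1.7, 2.3, 3.0, 3.1]

area = auc(roc_points(diseased, healthy))
print(sens_spec(diseased, healthy, 3.0), round(area, 3))
```

With no tied values, the trapezoidal area equals the proportion of diseased-healthy pairs in which the diseased person has the higher level.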
Though useful for biomarker evaluation, these measures do not inherently lead to benchmark "optimal" cutpoints with which clinicians and other health-care professionals can differentiate between diseased and nondiseased persons. Several methods for identifying "optimal" cutpoints using sensitivity, specificity, and the ROC curve have been proposed and applied (4–8). Confidence intervals and corrections for measurement error are some of the supporting statistical developments accompanying cutpoint estimation (9). Applications of these techniques have been demonstrated in several fields, including nuclear cardiology, epidemiology, and genetics (7, 10, 11).
In the "Criteria" section of this article, we describe two criteria for locating this cutpoint that have similar intuitive justifications. In describing the mathematical mechanisms behind these criteria, we demonstrate that one of the criteria retains the intended meaning, while the other inherently depends on quantities that may differ from an investigator's intentions. In the "Example" section, we use data from a nested case-control study carried out in the Calcium for Pre-Eclampsia Prevention cohort (12) to demonstrate how these two criteria identify different cutpoints for the classification of 120 preeclampsia cases and 120 controls based on levels of placenta growth factor, a biomarker of angiogenesis. Next, we discuss the appropriateness of the term "optimal" as it applies to each criterion. This is handled first with equally weighted sensitivity and specificity. Consideration of differing disease prevalences and costs due to misclassification is also presented as a practical generalization (5, 13). We end with a brief discussion.
| CRITERIA |
|---|
The closest-to-(0,1) criterion
If a biomarker perfectly differentiates persons with disease from those without disease on the basis of a single cutpoint, where q(c) = 1 and p(c) = 1, the ROC curve is a vertical line from (0,0) to (0,1) joined with a line from (0,1) to (1,1) with an AUC of 1. However, for a less-than-perfect biomarker, where q(c) < 1 and/or p(c) < 1, the ROC curve does not touch the (0,1) point. Here the choice of an "optimal" cutpoint is less straightforward. A criterion by which the point on the curve closest to (0,1) is identified and the corresponding cutpoint is labeled "optimal" has been suggested and utilized (6). Under this criterion, the cutpoint c* is the one that minimizes the Euclidean distance between (0,1) and the ROC point (1 − p(c), q(c)):

$$c^* = \arg\min_c \sqrt{(1 - q(c))^2 + (1 - p(c))^2} \tag{1}$$
This criterion can be viewed as searching for the shortest radius originating at the (0,1) point and terminating on the ROC curve. Reference arcs can be used to visually compare radial distances, with the arc corresponding to c* being tangent to the ROC curve and thus the minimum and interior of any of the concentric arcs possible. Figure 1 demonstrates this point at which the dotted arc is completely interior to, and thus closer to (0,1) than, the arc formed by the distance to an alternate point on the curve.
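The search for the shortest radius can be sketched numerically. The example below is illustrative only: it assumes two hypothetical normal distributions for the biomarker, assumes that levels at or above the cutpoint are positive, and uses a simple grid search in place of any analytic solution.

```python
from math import hypot
from statistics import NormalDist

# Hypothetical biomarker distributions (assumed for illustration only).
healthy = NormalDist(mu=0.0, sigma=1.0)
diseased = NormalDist(mu=1.0, sigma=1.0)


def sensitivity(c):
    """q(c): probability that a diseased person has a level at or above c."""
    return 1.0 - diseased.cdf(c)


def specificity(c):
    """p(c): probability that a nondiseased person has a level below c."""
    return healthy.cdf(c)


def distance_to_01(c):
    """Euclidean distance from (0,1) to the ROC point (1 - p(c), q(c))."""
    return hypot(1.0 - specificity(c), 1.0 - sensitivity(c))


# Candidate cutpoints from -3.00 to 4.00 in steps of 0.01.
grid = [i / 100 for i in range(-300, 401)]
c_star = min(grid, key=distance_to_01)
print(c_star, round(distance_to_01(c_star), 4))
```

Because the two assumed distributions have equal variances, the search settles midway between the two means, where sensitivity and specificity are equal.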
The Youden index
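Another measure for evaluating biomarker effectiveness is the Youden index (J), first introduced in the medical literature by Youden (14). J is the maximum, over all cutpoints c, of sensitivity plus specificity minus 1, and cJ denotes the cutpoint at which that maximum is attained:

$$J = \max_c \{ q(c) + p(c) - 1 \} \tag{2}$$

On the ROC plot, J is the maximum vertical distance between the curve and the chance line.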
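For sample data, the Youden index can be sketched as a scan over the observed levels. The values below are made up, the function name is hypothetical, and the convention that a level at or above the cutpoint is a positive test is an assumption.

```python
def youden(diseased, healthy):
    """Return (J, cutpoint) maximizing q(c) + p(c) - 1 over observed levels."""
    best_j, best_c = -1.0, None
    for c in sorted(set(diseased) | set(healthy)):
        q = sum(x >= c for x in diseased) / len(diseased)  # sensitivity
        p = sum(x < c for x in healthy) / len(healthy)     # specificity
        j = q + p - 1.0
        if j > best_j:
            best_j, best_c = j, c
    return best_j, best_c


# Made-up biomarker levels, for illustration only.
diseased = [2.1, 2.9, 3.4, 3.8, 4.6]
healthy = [1.0, 1.7, 2.3, 3.0, 3.1]

J, c_J = youden(diseased, healthy)
print(round(J, 3), c_J)
```

Only observed levels need to be scanned, because empirical sensitivity and specificity change only at those values.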
Agreement/disagreement
The above criteria agree with respect to intuition: one maximizes the rate at which people are classified correctly, and the other minimizes the rate at which they are classified incorrectly. The question that remains is whether they agree on the same "optimal" cutpoint.
Suppose the biomarker of interest follows continuous distributions for both diseased and nondiseased populations that are known completely, leading to a true ROC curve. Our only distributional restriction is that a ROC curve is generated that is differentiable everywhere. This is intrinsic to the case where diseased and nondiseased persons are assumed to follow any number of common continuous densities (e.g., normal, lognormal, gamma). Through differentiation, Appendix 1 shows that the two criteria agree, c* = cJ = c, only when q(c*) = p(c*) and q(cJ) = p(cJ). When either criterion identifies a point on the curve such that q(c*) ≠ p(c*) or q(cJ) ≠ p(cJ), the criteria disagree on what cutpoint is "optimal," that is, c* ≠ cJ.
An investigator with complete knowledge of a biomarker's data distribution could be faced with two different cutpoints labeled "optimal" under two criteria that are intuitively the same. Our motivation here is simply to show that they are different and address the appropriateness of the label "optimal."
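A small numerical sketch can make the agreement and disagreement concrete. It assumes hypothetical normal distributions (not the study data), a positive test at or above the cutpoint, and a grid search; the helper name `cutpoints` is illustrative.

```python
from math import hypot
from statistics import NormalDist


def cutpoints(healthy, diseased, lo=-3.0, hi=6.0, step=0.01):
    """Grid-search the closest-to-(0,1) cutpoint and the Youden cutpoint."""
    n = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n + 1)]

    def q(c):  # sensitivity
        return 1.0 - diseased.cdf(c)

    def p(c):  # specificity
        return healthy.cdf(c)

    c_star = min(grid, key=lambda c: hypot(1.0 - q(c), 1.0 - p(c)))
    c_j = max(grid, key=lambda c: q(c) + p(c) - 1.0)
    return c_star, c_j


# Equal variances: sensitivity equals specificity at the optimum.
equal = cutpoints(NormalDist(0, 1), NormalDist(1, 1))
# Unequal variances: sensitivity and specificity differ at the optimum.
unequal = cutpoints(NormalDist(0, 1), NormalDist(2, 2))
print(equal, unequal)
```

In the equal-variance case both searches return essentially the same cutpoint, while in the unequal-variance case the two cutpoints separate.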
| EXAMPLE |
|---|
Preeclampsia affects approximately 5 percent of pregnancies, resulting in substantial maternal and neonatal morbidity and mortality (16
"Optimality"
When attempting to classify people on the basis of biomarker levels, it is always one's intent to do so "optimally." However, the event of interest may intrinsically involve constraints which must, for ethical or fiscal reasons, be considered. These constraints commonly account for the prevalence of the event in both populations and the costs of misclassification, both monetary and physiologic. Thus, mathematical techniques of optimality must now operate within these constraints, but the idea of an "optimal" cutpoint should remain; one still wishes to choose a point that classifies the most people correctly and the fewest incorrectly.
First let us assume the simplest scenario, absent of constraints or weighting. By definition, the cJ found by equation 2 succeeds ideologically by maximizing the overall rate of correct classification, q(cJ) + p(cJ). As a result, the overall rate of misclassification, (1 − q(cJ)) + (1 − p(cJ)), is minimized. Thus, we can say that cJ is "optimal" with respect to the total correct and incorrect classification rates and any cutpoint that deviates from it is not.
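Under the same scenario, the closest-to-(0,1) criterion in equation 1 minimizes the total squared misclassification rates, quadratic terms for which no clear rationale seems to exist beyond geometric intuition. Equation 1 can be expanded and rewritten as

$$c^* = \arg\min_c \left\{ (1 - q(c)) + (1 - p(c)) + \frac{q(c)^2 + p(c)^2}{2} \right\} \tag{3}$$

The first two terms are the misclassification rates that J minimizes; the third, quadratic term is what allows c* ≠ cJ.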
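The expansion can be followed step by step. This sketch assumes that equation 1 is the Euclidean distance from (0,1) to the ROC point (1 − p(c), q(c)), writes q and p for q(c) and p(c), and uses the fact that minimizing a nonnegative distance is the same as minimizing half its square.

```latex
\begin{align*}
(1-q)^2 + (1-p)^2
  &= 2 - 2q - 2p + q^2 + p^2 \\
  &= 2\left[(1-q) + (1-p)\right] + q^2 + p^2 - 2 \\
\tfrac{1}{2}\left[(1-q)^2 + (1-p)^2\right]
  &= (1-q) + (1-p) + \tfrac{1}{2}\left(q^2 + p^2\right) - 1
\end{align*}
```

The constant −1 does not affect where the minimum occurs, which leaves the two misclassification rates plus a quadratic term.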
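Now, let us consider the circumstance in which cost and prevalence are thought to be factors, as they usually are in practice. Using decision theory, a generalized J can be formed where these factors are represented as a weighting of sensitivity and specificity. The cutpoint that minimizes expected loss in classifying a subject is the one that attains

$$\max_c \{ a\pi\, q(c) + (1 - \pi)\, p(c) \} \tag{4}$$

where a is the relative cost of a false negative as compared with a false positive and π is the proportion of diseased persons in the population of interest (prevalence) (17). Dividing by aπ and subtracting 1 gives the generalized Youden index

$$J_r = \max_c \{ q(c) + r\, p(c) - 1 \} \tag{5}$$

where r = (1 − π)/aπ. For r = 1, this is equivalent to J.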
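The effect of the weight r can be sketched numerically. This example rests on several assumptions that are not taken from the study: the weight is taken to be r = (1 − π)/(aπ), with π the prevalence and a the cost of a false negative relative to a false positive; the generalized index is taken to be the maximum of q(c) + r·p(c) − 1; and the biomarker distributions are hypothetical normals with a positive test at or above the cutpoint.

```python
from math import log
from statistics import NormalDist


def weight(prevalence, cost_ratio):
    """Assumed form r = (1 - pi) / (a * pi)."""
    return (1.0 - prevalence) / (cost_ratio * prevalence)


def generalized_youden(healthy, diseased, r, grid):
    """Return (value, cutpoint) maximizing q(c) + r * p(c) - 1 on the grid."""
    def objective(c):
        q = 1.0 - diseased.cdf(c)  # sensitivity
        p = healthy.cdf(c)         # specificity
        return q + r * p - 1.0

    c_best = max(grid, key=objective)
    return objective(c_best), c_best


# Hypothetical distributions and candidate cutpoints.
healthy, diseased = NormalDist(0, 1), NormalDist(1, 1)
grid = [i / 100 for i in range(-300, 401)]

r_one = weight(prevalence=0.05, cost_ratio=19)  # approximately 1
r_two = weight(prevalence=0.2, cost_ratio=2)    # approximately 2

_, c_one = generalized_youden(healthy, diseased, r_one, grid)
_, c_two = generalized_youden(healthy, diseased, r_two, grid)
print(c_one, c_two)
```

For these unit-variance normals the weighted optimum satisfies f_D(c)/f_D̄(c) = r, which gives c = 0.5 + ln r, so a weight of 2 on specificity moves the cutpoint upward from 0.5.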
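Weighting of the (0,1) criterion occurs similarly,

$$c^*_r = \arg\min_c \sqrt{(1 - q(c))^2 + r\,(1 - p(c))^2} \tag{6}$$

which, expanded in the same way as equation 3, becomes

$$c^*_r = \arg\min_c \left\{ (1 - q(c)) + r\,(1 - p(c)) + \frac{q(c)^2 + r\,p(c)^2}{2} \right\} \tag{7}$$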
Example revisited
To demonstrate this unnecessary misclassification and its possible magnitude, we revisit the example in which placenta growth factor levels are used to differentiate preeclamptic women from those without the disease. Sensitivity and specificity at the cutpoints previously identified are q(c*) = 0.592, p(c*) = 0.558 and q(cJ) = 0.817, p(cJ) = 0.358, respectively. The overall correct classification rate (q + p) is 1.150 for c* and 1.175 for cJ out of a possible 2, with a difference of 0.025. Without the justification for the third term in equation 3 and without weighting, this difference can be thought of as one person out of 100 being unnecessarily misclassified. Relative cost and disease prevalence are often difficult to assess, as discussed by Greiner et al. (18) and the references cited therein. Thus, we will not attempt adjustment in this example.
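The arithmetic behind these figures can be checked directly. The sketch below uses only the sensitivities and specificities quoted above; the dictionary keys and variable names are illustrative.

```python
from math import hypot

# Sensitivity and specificity reported at each cutpoint: (q, p).
reported = {
    "closest-to-(0,1)": (0.592, 0.558),
    "Youden": (0.817, 0.358),
}

summary = {}
for name, (q, p) in reported.items():
    summary[name] = {
        "correct": q + p,                      # overall correct classification rate
        "youden": q + p - 1.0,                 # value of q + p - 1 at this cutpoint
        "distance": hypot(1.0 - p, 1.0 - q),   # distance from (0,1) to the ROC point
    }

difference = summary["Youden"]["correct"] - summary["closest-to-(0,1)"]["correct"]
for name, row in summary.items():
    print(name, {key: round(value, 3) for key, value in row.items()})
print("difference in q + p:", round(difference, 3))
```

The cutpoint c* lies closer to (0,1), yet cJ has the higher overall correct classification rate, which is the disagreement the example is meant to show.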
| DISCUSSION |
|---|
In this paper, we demonstrated the intuitive similarity of two criteria used to choose an "optimal" cutpoint. We then showed that the criteria agree in some instances and disagree in others. Placenta growth factor levels used to classify women as preeclamptic or not preeclamptic were used to demonstrate this point and quantify the extent of disagreement.
We addressed both criteria in the context of what an investigator might view as "optimal," with and without attention to misclassification cost and prevalence. Mathematically, J reflects the intention of maximizing overall correct classification rates and thus minimizing misclassification rates, while choosing the point closest to (0,1) involves a quadratic term for which the clinical meaning is unknown. It is for this reason that we advocate for the use of J to find the "optimal" cutpoint.
Since the (0,1) criterion is visually intuitive, we have provided an amended (0,1) criterion in Appendix 2 that is likewise geometrically satisfying while consistently identifying the same "optimal" cutpoint as J. This criterion relies on a ratio of radii originating at (0,1).
Additional motivation for using J is an ever-increasing body of supporting literature (9, 15, 19). Topics such as confidence intervals and correcting the estimate for measurement error have been considered, whereas the (0,1) criterion lacks such support.
Most importantly, cutpoints chosen through less than "optimal" criteria or criteria that are "optimal" in some arbitrary sense can lead to unnecessary misclassifications, resulting in needlessly missed opportunities for disease diagnosis and intervention. We showed above that J is "optimal" when equal weight is given to sensitivity and specificity, r = 1, and a generalized J is "optimal" when cost and prevalence lead to weighted sensitivity and specificity, r ≠ 1. Thus, when the point closest to (0,1) differs from the point resulting in J, using this criterion to establish an "optimal" cutpoint unnecessarily introduces an increased rate of misclassification.
| APPENDIX 1 |
|---|
For continuous receiver operating characteristic (ROC) curves, we make no distributional assumptions beyond the assumption that the probability density functions $f_D$ and $f_{\bar{D}}$ for biomarker levels of diseased and nondiseased persons, respectively, form a ROC curve that is differentiable everywhere. This is the case when $f_D$ and $f_{\bar{D}}$ are assumed to be any common continuous parametric distributions (e.g., normal, gamma, lognormal).
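In order to locate the cutpoints that minimize and maximize equations 1 and 2, respectively, it is first necessary to locate critical values. Because the square root is monotone, the squared distance can be differentiated in place of the distance itself. Thus, differentiating equation 1,

$$\frac{d}{dc}\left[(1 - q(c))^2 + (1 - p(c))^2\right] = -2\,(1 - q(c))\,q'(c) - 2\,(1 - p(c))\,p'(c) = 0 \tag{A1.1}$$

so that, at the critical value c*,

$$-\frac{q'(c^*)}{p'(c^*)} = \frac{f_D(c^*)}{f_{\bar{D}}(c^*)} = \frac{1 - p(c^*)}{1 - q(c^*)} \tag{A1.2}$$

Here q and p are cumulative probabilities of $f_D$ and $f_{\bar{D}}$ taken in opposite directions, so −q′(c)/p′(c) equals the density ratio. Likewise, differentiating equation 2,

$$\frac{d}{dc}\left[q(c) + p(c) - 1\right] = q'(c) + p'(c) = 0 \tag{A1.3}$$

so that, at the critical value cJ,

$$-\frac{q'(c_J)}{p'(c_J)} = \frac{f_D(c_J)}{f_{\bar{D}}(c_J)} = 1 \tag{A1.4}$$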
Equations A1.2 and A1.4 show us that the (0,1) and J methods agree, c* = cJ = c, only when q(c*) = p(c*) and thus (1 − p(c*))/(1 − q(c*)) = 1. When q(c*) ≠ p(c*), the criteria disagree on what point is optimal (c* ≠ cJ).
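These critical-point conditions can be checked numerically. The sketch below assumes hypothetical normal densities with unequal variances, a positive test at or above the cutpoint, and the density-ratio form of the conditions: the ratio of the diseased to the nondiseased density equals (1 − p)/(1 − q) at c* and equals 1 at cJ.

```python
from math import hypot
from statistics import NormalDist

# Hypothetical densities, chosen so that the two criteria disagree.
healthy, diseased = NormalDist(0, 1), NormalDist(2, 2)


def q(c):
    """Sensitivity: diseased level at or above c."""
    return 1.0 - diseased.cdf(c)


def p(c):
    """Specificity: nondiseased level below c."""
    return healthy.cdf(c)


def density_ratio(c):
    """Diseased density divided by nondiseased density at c."""
    return diseased.pdf(c) / healthy.pdf(c)


# Fine grid from -3.000 to 6.000 in steps of 0.001.
grid = [-3.0 + i * 0.001 for i in range(9001)]
c_star = min(grid, key=lambda c: hypot(1.0 - q(c), 1.0 - p(c)))
c_j = max(grid, key=lambda c: q(c) + p(c) - 1.0)

lhs_star = density_ratio(c_star)
rhs_star = (1.0 - p(c_star)) / (1.0 - q(c_star))
lhs_j = density_ratio(c_j)
print(round(c_star, 3), round(lhs_star, 3), round(rhs_star, 3))
print(round(c_j, 3), round(lhs_j, 3))
```

Up to grid resolution, the density ratio matches (1 − p)/(1 − q) at c* and is close to 1 at cJ, and the two cutpoints differ because sensitivity and specificity are unequal there.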
| APPENDIX 2 |
|---|
Equation 1 identifies the point closest to perfection irrespective of the possibilities of imperfection. In other words, this criterion minimizes the distance from (0,1) to the curve but fails to take into account the possible distance to the chance line. To obtain a weighted criterion that accounts for this deficiency, minimize the proportion of the smaller radius (r2) to the larger radius (r1), as displayed in appendix figure 1, such that
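$$\min_c \frac{r_2}{r_1} = \min_c \left\{ (1 - q(c)) + (1 - p(c)) \right\} \tag{A2.1}$$

with r2 the distance from (0,1) to the ROC point (1 − p(c), q(c)) and r1 the distance from (0,1), along the same ray, to the chance line.

The relation in equation A2.1 can be derived algebraically or by using the proportionality of the triangles in appendix figure 1, such that

$$\frac{r_2}{r_1} = \frac{1 - p(c)}{x_0} = (1 - q(c)) + (1 - p(c)), \qquad x_0 = \frac{1 - p(c)}{(1 - q(c)) + (1 - p(c))}$$

where x0 is the horizontal coordinate at which the ray meets the chance line. It is now easily seen that the differentiation

$$\frac{d}{dc}\,\frac{r_2}{r_1} = -\left(q'(c) + p'(c)\right) = 0$$

is the same condition as equation A1.3, so the amended criterion identifies the same cutpoint as J.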
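The ratio of radii can be checked with coordinate geometry. This sketch assumes that r2 is the distance from (0,1) to the ROC point (1 − p, q) and that r1 is the distance from (0,1) along the same ray to the chance line y = x; that reading of the figure, and the function name, are assumptions.

```python
from math import hypot


def radius_ratio(q, p):
    """Ratio r2 / r1 for the ROC point (1 - p, q), assuming q < 1 or p < 1."""
    dx, dy = (1.0 - p) - 0.0, q - 1.0  # direction from (0,1) to the ROC point
    # Points on the ray are (t * dx, 1 + t * dy); the ray meets y = x
    # when 1 + t * dy = t * dx.
    t = 1.0 / (dx - dy)
    r2 = hypot(dx, dy)           # distance from (0,1) to the ROC point
    r1 = hypot(t * dx, t * dy)   # distance from (0,1) to the chance line
    return r2 / r1


q, p = 0.817, 0.358
ratio = radius_ratio(q, p)
print(round(ratio, 3), round((1.0 - q) + (1.0 - p), 3))
```

Under these assumptions the ratio equals (1 − q) + (1 − p), that is, 1 minus the value of q + p − 1, so minimizing the ratio is the same as maximizing the quantity that defines J.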
| ACKNOWLEDGMENTS |
|---|
This research was supported by the National Institutes of Health Intramural Research Program, National Institute of Child Health and Human Development.
The authors thank Dr. Richard Levine for allowing them to use the data from the Calcium for Pre-Eclampsia Prevention Study.
Conflict of interest: none declared.
| References |
|---|
1. Zhou XH, Obuchowski NA, McClish DK. Statistical methods in diagnostic medicine. New York, NY: John Wiley and Sons, Inc, 2002.
2. Faraggi D. Adjusting ROC curves and related indices for covariates. J R Stat Soc Ser D Statistician 2003;52:179–92.
3. Schisterman EF, Faraggi D, Reiser B. Adjusting the generalized ROC curve for covariates. Stat Med 2004;23:3319–31.
4. Pepe M. The statistical evaluation of medical tests for classification and prediction. New York, NY: Oxford University Press, 2003.
5. Zweig MH, Campbell G. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. Clin Chem 1993;39:561–77.
6. Coffin M, Sukhatme S. Receiver operating characteristic studies and measurement errors. Biometrics 1997;53:823–37.
7. Sharir T, Berman DS, Waechter PB, et al. Quantitative analysis of regional motion and thickening by gated myocardial perfusion SPECT: normal heterogeneity and criteria for abnormality. J Nucl Med 2001;42:1630–8.
8. van Belle G. Statistical rules of thumb. New York, NY: John Wiley and Sons, Inc, 2002:98.
9. Perkins NJ, Schisterman EF. The Youden index and the optimal cut-point corrected for measurement error. Biom J 2005;47:428–41.
10. Schisterman EF, Faraggi D, Brown R, et al. TBARS and cardiovascular disease in a population-based sample. J Cardiovasc Risk 2001;8:219–25.
11. Chen R, Rabinovitch PS, Crispin DA, et al. DNA fingerprinting abnormalities can distinguish ulcerative colitis patients with dysplasia and cancer from those who are dysplasia/cancer-free. Am J Pathol 2003;16:665–72.
12. Levine RJ, Maynard SE, Qian C, et al. Circulating angiogenic factors and the risk of preeclampsia. N Engl J Med 2004;350:672–83.
13. Barkan N. Statistical inference on r * specificity + sensitivity. (Doctoral dissertation). Haifa, Israel: University of Haifa, 2001:69–74.
14. Youden WJ. An index for rating diagnostic tests. Cancer 1950;3:32–5.
15. Schisterman EF, Perkins NJ, Aiyi L, et al. Optimal cutpoint and its corresponding Youden index to discriminate individuals using pooled blood samples. Epidemiology 2005;16:73–81.
16. Chmura Kraemer H. Evaluating medical tests: objective and quantitative guidelines. Newbury Park, CA: Sage Publications, 1992.
17. Geisser S. Comparing two tests used for diagnostic or screening processes. Stat Prob Lett 1998;40:113–19.
18. Greiner M, Pfeiffer D, Smith RM. Principles and practical application of the receiver-operating characteristic analysis for diagnostic tests. Prev Vet Med 2000;45:23–41.
19. Hilden J, Glasziou P. Regret graphs, diagnostic uncertainty and Youden's index. Stat Med 1996;15:969–86.