American Journal of Epidemiology Advance Access originally published online on September 19, 2007
American Journal of Epidemiology 2008 167(1):86-89; doi:10.1093/aje/kwm257
American Journal of Epidemiology © The Author 2007. Published by the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.

ORIGINAL CONTRIBUTIONS

Simple Formulas for Gauging the Potential Impacts of Population Stratification Bias

Wen-Chung Lee and Liang-Yi Wang

From the Research Center for Genes, Environment and Human Health and the Graduate Institute of Epidemiology, College of Public Health, National Taiwan University, Taipei, Taiwan

Correspondence to Dr. Wen-Chung Lee, Room 536, College of Public Health, National Taiwan University, No. 17 Xuzhou Road, Taipei 100, Taiwan (e-mail: wenchung@ntu.edu.tw).

Received for publication May 8, 2007. Accepted for publication August 13, 2007.


    ABSTRACT
 
The case-control study design is popular for genetic association studies of complex human diseases. However, case-control studies may suffer from bias due to population stratification. In this paper, the authors present simple formulas that can set a limit to the havoc population stratification bias can wreak (the lower and upper bounds of the confounding rate ratio and the upper bound of the type I error rate). The authors demonstrate applications of these formulas using two examples. The formulas can help researchers make more prudent interpretations of their (potentially biased) results.

bias (epidemiology); case-control studies; data interpretation, statistical; epidemiologic methods; genetics


Abbreviations: CRR, confounding rate ratio; ER-α, estrogen receptor alpha; IL, interleukin


    INTRODUCTION
 
The case-control study design is popular for genetic association studies of complex human diseases (1, 2). However, case-control studies may suffer from bias due to population stratification, if the study is conducted in a population comprising two or more strata in which allele frequencies and disease rates differ across the strata (3, 4). Population stratification bias may manifest itself in a gene-characterization study, causing over- or underestimation of a genotype relative risk; it may also appear in a gene-mapping study as inflation of the type I error rate, causing an excess number of false-positive findings (5).

Unfortunately, it is not always possible to adjust for or remove population stratification bias using conventional methods of matching or stratified analysis, because the possible strata in a population are not readily identifiable. (Race and ethnicity are difficult to characterize using questionnaires, and there may be subethnicities within ethnic groups that make correct identification even more difficult. Worse, population stratification is not limited to known racial/ethnic groups; it may simply result from incomplete admixing of populations due to subtle geographic, social, or political barriers.) Recent methods of "genomic control" (6–9) and "structured association" (10–14) may offer hope for correcting the bias. However, these methods require typing of a panel of markers across the genome, which may prove too costly for a candidate gene study.

In this paper, we present simple formulas for gauging the potential impacts of population stratification bias. We demonstrate the applications of these formulas using two examples.


    POTENTIAL IMPACTS OF POPULATION STRATIFICATION BIAS
 
In a population composed of several strata (indexed by j), let $p_j$ denote the frequency of the susceptibility genotype (with $g_j = p_j/(1 - p_j)$ the corresponding frequency odds), $b_j$ the background disease rate (the disease rate for persons who do not carry the susceptibility genotype), and $m_j$ the total number of subjects (or, more precisely, the total person-time) in the jth stratum. Furthermore, let $RR_G$ denote the relative rate of disease for persons who carry the susceptibility genotype as compared with those who do not (we assume $RR_G$ to be constant across the strata). In this stratified population as a whole, the disease rate for persons who carry the susceptibility genotype is

$$\frac{\sum_j m_j p_j b_j RR_G}{\sum_j m_j p_j},$$

and for those who do not, it is

$$\frac{\sum_j m_j (1 - p_j) b_j}{\sum_j m_j (1 - p_j)}.$$

Therefore, population stratification bias as quantified by the "confounding rate ratio" (CRR), that is, the ratio of these two rates divided by $RR_G$, is

$$CRR = \frac{\sum_j w_j g_j b_j \big/ \sum_j w_j g_j}{\sum_j w_j b_j \big/ \sum_j w_j},$$

where

$$w_j = m_j (1 - p_j).$$
Among all of the strata in the population, let G (G ≥ 1) denote the ratio of the largest to the smallest frequency odds of the susceptibility genotype, and let B (B ≥ 1) denote the ratio of the largest to the smallest background disease rate. It can be shown that the CRR is always bounded below by L and above by U, where L = 1/U ≤ 1 and

$$U = \left(\frac{\sqrt{GB} + 1}{\sqrt{G} + \sqrt{B}}\right)^2$$

(see Appendix). In a case-control study with $n_1$ cases and $n_0$ controls for a susceptibility genotype having frequency p, the type I error rate at a significance level of α is bounded above by $A = \Pr[\chi^2_{df=1}(\delta^2) > \chi^2_{df=1;\,1-\alpha}]$, where $\chi^2_{df=1}(\delta^2)$ is a noncentral chi-squared variable with 1 degree of freedom and a noncentrality parameter

$$\delta^2 = \frac{(\log U)^2}{\dfrac{1}{n_1 p(1-p)} + \dfrac{1}{n_0 p(1-p)}},$$

and $\chi^2_{df=1;\,1-\alpha}$ is the upper α quantile of a (central) chi-squared distribution with 1 degree of freedom (see Appendix). Note that U = L = 1 (no bias) and A = α (no inflation of the type I error rate) when either G = 1 (no variation in the frequency odds of the susceptibility genotype) or B = 1 (no variation in the background disease rates).
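The bounds described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the closed form used for U is $((\sqrt{GB}+1)/(\sqrt{G}+\sqrt{B}))^2$, which reproduces the roughly 80 percent bias quoted in the text for G = B = 5; the noncentrality parameter is taken to be $(\log U)^2$ divided by the approximate variance of the log odds ratio; and the function names are hypothetical. The tail probability relies on the fact that a 1-degree-of-freedom noncentral chi-squared variable is the square of a unit-variance normal variable with mean δ.

```python
import math
from statistics import NormalDist


def crr_upper_bound(G, B):
    """Upper bound U of the confounding rate ratio; the lower bound is 1/U."""
    return ((math.sqrt(G * B) + 1) / (math.sqrt(G) + math.sqrt(B))) ** 2


def noncentrality(U, p, n1, n0):
    """delta^2 = (log U)^2 / [1/(n1 p (1-p)) + 1/(n0 p (1-p))] (assumed form)."""
    variance = 1 / (n1 * p * (1 - p)) + 1 / (n0 * p * (1 - p))
    return math.log(U) ** 2 / variance


def type1_upper_bound(G, B, p, n1, n0, alpha=0.05):
    """Upper bound A of the type I error rate for a two-sided 1-df test."""
    std_normal = NormalDist()
    delta = math.sqrt(noncentrality(crr_upper_bound(G, B), p, n1, n0))
    # Square root of the chi-squared critical value (upper-alpha quantile).
    z = std_normal.inv_cdf(1 - alpha / 2)
    # Pr[(Z + delta)^2 > z^2] for a standard normal Z.
    return std_normal.cdf(delta - z) + std_normal.cdf(-delta - z)
```

When G = 1 or B = 1, `crr_upper_bound` returns 1 and `type1_upper_bound` returns α, matching the no-bias case noted in the text.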

The above bounds are not intended to be close to the true magnitude of population stratification bias. Rather, they serve to accommodate the magnitude of bias under every conceivable population stratification scenario that satisfies the G and B constraints. G and B themselves are to be determined in an ad hoc manner. To err on the safe side, one can deliberately overstate the values of G and B (to obtain more conservative bounds for population stratification bias) on the basis of one's best (but perhaps scant) knowledge of the stratified population under study.

Table 1 presents the lower and upper bounds of the CRR and the upper bounds of the type I error rate under various conditions when p = 0.3 and α = 0.05. When the variation in the frequency odds of the susceptibility genotype and the background disease rates is not large (G = B = 1.5), the bias is approximately 5 percent at most, and the inflation of the type I error is negligible even at a large sample size of n1 = n0 = 1,000. As the variation gets larger (e.g., G = B = 5), the bias (up to approximately 80 percent) and the inflation of the type I error rate (up to approximately 0.5, even at a small sample size of n1 = n0 = 100) become intolerable.
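The scenarios mentioned in the paragraph above can be evaluated with a short script. This is a hedged sketch, not a reproduction of Table 1: the grid (G = B of 1.5 or 5, with 100 or 1,000 subjects per group) contains only the values quoted in the text, the closed form for U and the noncentrality parameter are assumptions consistent with those quoted figures, and the helper name `bounds` is hypothetical.

```python
import math
from statistics import NormalDist


def bounds(G, B, p, n, alpha=0.05):
    """Return (L, U, A) for a study with n cases and n controls."""
    U = ((math.sqrt(G * B) + 1) / (math.sqrt(G) + math.sqrt(B))) ** 2
    # With n1 = n0 = n, the assumed noncentrality is (log U)^2 * p(1-p) * n/2.
    delta = math.sqrt(math.log(U) ** 2 * p * (1 - p) * n / 2)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    A = NormalDist().cdf(delta - z) + NormalDist().cdf(-delta - z)
    return 1 / U, U, A


# Each row: (G = B, group size, lower bound, upper bound, type I error bound).
rows = [(G, n, *bounds(G, G, p=0.3, n=n)) for G in (1.5, 5.0) for n in (100, 1000)]
for G, n, L, U, A in rows:
    print(f"G=B={G:<4} n1=n0={n:<5} L={L:.3f} U={U:.3f} A={A:.3f}")
```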


TABLE 1. Lower and upper bounds of the confounding rate ratio and upper bounds of the type I error rate under various conditions when p = 0.3 and α = 0.05

 

    EXAMPLES
 
Hefler et al. (15) conducted a case-control study of interleukin (IL) gene polymorphisms and breast cancer risk. The cases (n1 = 269) and controls (n0 = 227) were Caucasian women in Germany and Austria. Hefler et al. found that the odds ratio for the interleukin-6 (IL-6) gene (–174C/C and –174G/C vs. –174G/G) was 1.64 and was statistically significant (15). However, could this be nothing but population stratification bias in disguise? The breast cancer rates for various countries in Europe range from approximately 33 per 100,000 to 97 per 100,000 (16), and IL-6 genotype frequencies range from approximately 0.30 to 0.45 (17). Thus, it is reasonable to expect that, among the potential hidden strata in Hefler et al.'s study population, B will be no more than 2.94 (i.e., 97.01/32.98) and G no more than 1.91 (i.e., (0.45/0.55)/(0.30/0.70)). Using the formula in the preceding section, we find that the upper bound for the bias is

$$U = \left(\frac{\sqrt{1.91 \times 2.94} + 1}{\sqrt{1.91} + \sqrt{2.94}}\right)^2 = 1.18,$$

which is less than 1.64, the estimated odds ratio for IL-6. This suggests that Hefler et al.'s finding (15) cannot be explained away by population stratification bias alone (and perhaps should be taken more seriously).
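The arithmetic of this first example can be checked with a few lines of Python. The sketch assumes the closed form $U = ((\sqrt{GB}+1)/(\sqrt{G}+\sqrt{B}))^2$ for the upper bound; the helper `odds` and the variable names are illustrative only.

```python
import math


def odds(p):
    """Frequency odds of a genotype with frequency p."""
    return p / (1 - p)


G = odds(0.45) / odds(0.30)  # largest / smallest frequency odds
B = 97.01 / 32.98            # largest / smallest background rate
U = ((math.sqrt(G * B) + 1) / (math.sqrt(G) + math.sqrt(B))) ** 2

observed_or = 1.64  # odds ratio reported for IL-6
print(f"G = {G:.2f}, B = {B:.2f}, U = {U:.2f}")  # G = 1.91, B = 2.94, U = 1.18
print("Bias alone could explain the finding:", observed_or <= U)  # False
```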

As another example, Hsiao et al. (18) conducted a case-control study in southern Taiwan to examine whether estrogen receptor alpha (ER-α) polymorphisms are related to breast cancer risk. Hsiao et al. found that the frequency of a silent single nucleotide polymorphism in the ER-α gene (allele 1 of codon 10) was significantly lower among breast cancer patients (32.0 percent; n1 = 189) than among controls (40.4 percent; n0 = 177) at α = 0.05 (18). To determine whether population stratification bias could have seriously damaged this study, we first obtain the following data. The range of the allele frequencies of codon 10 is approximately 0.33–0.38 among Asian populations (19), and the range of the age-standardized breast cancer incidence rates is approximately 34–47 per 100,000 in the northern, central, southern, and eastern parts of Taiwan and the offshore islands (20). From these figures, we expect that B will be no more than 1.38 (i.e., 47.30/34.18) and G no more than 1.24 (i.e., (0.38/0.62)/(0.33/0.67)). Thus,

$$U = \left(\frac{\sqrt{GB} + 1}{\sqrt{G} + \sqrt{B}}\right)^2 \approx 1.02,$$

$$\delta^2 = \frac{(\log U)^2}{\dfrac{1}{2n_1 p(1-p)} + \dfrac{1}{2n_0 p(1-p)}} = 0.0138$$

(note that 2n is used here because of an allele-based approach), and $A = \Pr[\chi^2_{df=1}(0.0138) > \chi^2_{df=1;\,1-0.05}] = 0.0516$. Because this level of inflation of the type I error rate is inconsequential, the above positive finding in Hsiao et al.'s study (18) is unlikely to be a false alarm due to population stratification bias.
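The figures in this second example can be retraced numerically. The sketch below rests on several assumptions that the text does not state outright: the closed form for U, the noncentrality parameter written as $(\log U)^2 \, p(1-p) \, N_1 N_0/(N_1+N_0)$ with allele counts $N = 2n$, and, in particular, the choice p = 0.404 (the allele frequency among controls), which is the value that happens to reproduce the reported 0.0138.

```python
import math
from statistics import NormalDist

G = (0.38 / 0.62) / (0.33 / 0.67)  # ratio of extreme allele frequency odds
B = 47.30 / 34.18                  # ratio of extreme incidence rates
U = ((math.sqrt(G * B) + 1) / (math.sqrt(G) + math.sqrt(B))) ** 2

N1, N0 = 2 * 189, 2 * 177  # allele counts (2n) among cases and controls
p = 0.404                  # ASSUMPTION: allele frequency among controls

# Noncentrality parameter under the worst-case bias.
delta_sq = math.log(U) ** 2 * p * (1 - p) * N1 * N0 / (N1 + N0)

# Tail probability of a 1-df noncentral chi-squared beyond the 5% critical value.
z = NormalDist().inv_cdf(0.975)
delta = math.sqrt(delta_sq)
A = NormalDist().cdf(delta - z) + NormalDist().cdf(-delta - z)

print(f"U = {U:.3f}, delta^2 = {delta_sq:.4f}, A = {A:.4f}")
```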


    DISCUSSION
 
Half a century ago, Cornfield et al. (21) also studied the potential impact of an omitted variable on a positive finding. They demonstrated that the accumulated epidemiologic findings to date on the relation of tobacco smoking to lung cancer could not be explained away by the "constitutional hypothesis." At that time, the omitted variable was the yet-unknown and unmeasurable (but thought to exist) "gene." It is interesting to learn that in this postgenomic era, 50 years after Cornfield et al., the gene itself becomes the easy part. It is now the nonconstitutional factor (i.e., population stratification) that proves elusive.

Population stratification bias was previously studied by Wacholder et al. (22) and Wang et al. (23) (the magnitude of bias) and by Heiman et al. (24) and Gorroochurn et al. (25) (the magnitude of the false-positive rate). They used computer simulation to demonstrate the impacts of bias in many different situations, assuming that the total number of strata, as well as the population number, the frequency of the susceptibility genotype, and the background disease rate in any given stratum, were known in advance. Researchers may have difficulty applying their results, because it is often not possible to obtain such detailed knowledge of the anatomy of a stratified population in real practice. At best, investigators may make an educated guess as to what the variation of the frequency odds of the susceptibility genotype (G) and the variation of the background disease rate (B) might be among the hidden strata. By deliberately overstating those two values, conservative bounds for population stratification bias can be obtained. As is demonstrated in the two examples presented above, this would help researchers make more prudent interpretations of their (potentially biased) results.


    APPENDIX
 
Subject to the G and B constraints, population stratification bias is largest when there are two strata in the study population, the first stratum having frequency odds of the susceptibility genotype g and background disease rate b and the second stratum having frequency odds gG and background disease rate bB. For this two-stratum population, writing $w_j = m_j(1 - p_j)$ for the person-time of noncarriers in the jth stratum, the confounding rate ratio (CRR) can be calculated as follows:

$$CRR = \frac{(w_1 + w_2 GB)(w_1 + w_2)}{(w_1 + w_2 G)(w_1 + w_2 B)}.$$

It is straightforward to see that the CRR achieves its maximum of

$$U = \left(\frac{\sqrt{GB} + 1}{\sqrt{G} + \sqrt{B}}\right)^2$$

at $w_2/w_1 = 1/\sqrt{GB}$. Similarly, by setting the first stratum as having frequency odds of susceptibility genotype g and background disease rate bB and the second stratum as having gG and b, at the outset, it is easy to show that L = 1/U.
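The claim that the bounds hold for any number of strata can be probed numerically. The following sketch draws random stratified populations, computes the CRR directly from its definition (disease rate ratio in the pooled population with the genotype effect divided out), and counts how often it falls outside [1/U, U]. The closed form for U, the sampling ranges, and the function names are all assumptions made for illustration.

```python
import math
import random


def crr(strata):
    """CRR for strata given as (m_j, p_j, b_j) triples."""
    carriers = sum(m * p for m, p, b in strata)
    noncarriers = sum(m * (1 - p) for m, p, b in strata)
    rate_carriers = sum(m * p * b for m, p, b in strata) / carriers
    rate_noncarriers = sum(m * (1 - p) * b for m, p, b in strata) / noncarriers
    return rate_carriers / rate_noncarriers


def upper_bound(strata):
    """U computed from the observed ranges of frequency odds and rates."""
    odds = [p / (1 - p) for _, p, _ in strata]
    rates = [b for _, _, b in strata]
    G, B = max(odds) / min(odds), max(rates) / min(rates)
    return ((math.sqrt(G * B) + 1) / (math.sqrt(G) + math.sqrt(B))) ** 2


rng = random.Random(0)
violations = 0
for _ in range(10_000):
    k = rng.randint(2, 6)  # number of strata
    strata = [(rng.uniform(1, 1000), rng.uniform(0.05, 0.95), rng.uniform(1, 100))
              for _ in range(k)]
    U = upper_bound(strata)
    if not (1 / U - 1e-12 <= crr(strata) <= U + 1e-12):
        violations += 1

# Two strata with w2/w1 = 1/sqrt(GB) (here G = B = 4) attain the bound exactly.
worst_case = [(3.0, 1 / 3, 1.0), (1.5, 2 / 3, 4.0)]
print(violations, crr(worst_case), upper_bound(worst_case))
```

A random search of this kind cannot prove the bound; it only offers a quick sanity check of the algebra.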

Under the null hypothesis that the gene under study is not associated with the disease, its true value of log relative rate is 0 and its maximally biased log relative rate is ±log U. In a 2 × 2 table testing a null gene under the worst-case (maximally biased) scenario, the sample log odds ratio has an expected value of ±log U and a variance of

$$\frac{1}{n_1 p(1-p)} + \frac{1}{n_0 p(1-p)},$$

approximately. Therefore, the test follows a noncentral chi-squared distribution with 1 degree of freedom and a noncentrality parameter

$$\delta^2 = \frac{(\log U)^2}{\dfrac{1}{n_1 p(1-p)} + \dfrac{1}{n_0 p(1-p)}}.$$
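For readers who wish to evaluate the bound A without noncentral chi-squared tables, the tail probability can be written in terms of the standard normal distribution. The short derivation below is a sketch that uses the standard representation of a 1-degree-of-freedom noncentral chi-squared variable as $(Z + \delta)^2$; the symbols $Z$ (a standard normal variable), $\Phi$ (its distribution function), and $z_{1-\alpha/2}$ (its upper $\alpha/2$ quantile) are introduced here and do not appear in the original text.

```latex
\begin{align*}
A &= \Pr\!\left[(Z+\delta)^2 > z_{1-\alpha/2}^2\right] \\
  &= \Pr\!\left[Z > z_{1-\alpha/2} - \delta\right]
   + \Pr\!\left[Z < -z_{1-\alpha/2} - \delta\right] \\
  &= \Phi\!\left(\delta - z_{1-\alpha/2}\right)
   + \Phi\!\left(-\delta - z_{1-\alpha/2}\right).
\end{align*}
```

Setting $\delta = 0$ gives $A = 2\Phi(-z_{1-\alpha/2}) = \alpha$, the case of no inflation.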


    ACKNOWLEDGMENTS
 
This study was supported by grants from the National Science Council of Taiwan, Republic of China (NSC 95-2314-B-002-242, NSC 95-3114-P-002-005-Y, and NSC 96-2314-B-002-143).

Conflict of interest: none declared.


    References
 

  1. Khoury MJ, Yang Q. The future of genetic studies of complex human diseases: an epidemiologic perspective. Epidemiology (1998) 9:350–4.
  2. Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science (1996) 273:1516–17.
  3. Ewens WJ, Spielman RS. The transmission/disequilibrium test: history, subdivision, and admixture. Am J Hum Genet (1995) 57:455–64.
  4. Witte JS, Gauderman WJ, Thomas DC. Asymptotic bias and efficiency in case-control studies of candidate genes and gene-environment interactions: basic family designs. Am J Epidemiol (1999) 149:693–705.
  5. Thomas DC, Witte JS. Point: population stratification: a problem for case-control studies of candidate-gene associations? Cancer Epidemiol Biomarkers Prev (2002) 11:505–12.
  6. Devlin B, Roeder K. Genomic control for association studies. Biometrics (1999) 55:997–1004.
  7. Bacanu SA, Devlin B, Roeder K. The power of genomic control. Am J Hum Genet (2000) 66:1933–44.
  8. Reich DE, Goldstein DB. Detecting association in a case-control study while correcting for population stratification. Genet Epidemiol (2001) 20:4–16.
  9. Lee WC. Case-control association studies with matching and genomic controlling. Genet Epidemiol (2004) 27:1–13.
  10. Pritchard JK, Stephens M, Rosenberg NA, et al. Association mapping in structured populations. Am J Hum Genet (2000) 67:170–81.
  11. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics (2000) 155:945–59.
  12. Satten GA, Flanders WD, Yang Q. Account for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am J Hum Genet (2001) 68:466–77.
  13. Zhu X, Zhang SL, Zhao H, et al. Association mapping, using a mixture model for complex traits. Genet Epidemiol (2002) 23:181–96.
  14. Hoggart CJ, Parra EJ, Shriver MD, et al. Control of confounding of genetic associations in stratified populations. Am J Hum Genet (2003) 72:1492–504.
  15. Hefler LA, Grimm C, Lantzsch T, et al. Interleukin-1 and interleukin-6 gene polymorphisms and the risk of breast cancer in Caucasian women. Clin Cancer Res (2005) 11:5718–21.
  16. Parkin DM, Whelan SL, Ferlay J, et al. Cancer incidence in five continents. (2002) Lyon, France: International Agency for Research on Cancer.
  17. Berger FG. The interleukin-6 gene: a susceptibility factor that may contribute to racial and ethnic disparities in breast cancer mortality. Breast Cancer Res Treat (2004) 88:281–5.
  18. Hsiao WC, Young KC, Lin SL, et al. Estrogen receptor-alpha polymorphism in a Taiwanese clinical breast cancer population: a case-control study. Breast Cancer Res (2004) 6:R180–6.
  19. National Center for Biotechnology Information, US National Library of Medicine. Reference SNP cluster report: rs2077647. (NCBI Single Nucleotide Polymorphism database) (2007) Washington, DC: National Library of Medicine. (http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=2077647).
  20. Bureau of Health Promotion, Department of Health, Republic of Taiwan. Cancer registry annual report, 1995–2002. (In Chinese) (2007) Taipei, Taiwan: Taiwan Department of Health. (http://crs.cph.ntu.edu.tw/crs_c/annual.html).
  21. Cornfield J, Haenszel W, Hammond EC, et al. Smoking and lung cancer: recent evidence and a discussion of some questions. J Natl Cancer Inst (1959) 22:173–203.
  22. Wacholder S, Rothman N, Caporaso N. Population stratification in epidemiologic studies of common genetic variants and cancer: quantification of bias. J Natl Cancer Inst (2000) 92:1151–8.
  23. Wang Y, Localio R, Rebbeck TR. Evaluating bias due to population stratification in case-control association studies of admixed populations. Genet Epidemiol (2004) 27:14–20.
  24. Heiman GA, Hodge SE, Gorroochurn P, et al. Effect of population stratification on case-control association studies. I. Elevation in false positive rates and comparison to confounding risk ratios (a simulation study). Hum Hered (2004) 58:30–9.
  25. Gorroochurn P, Hodge SE, Heiman G, et al. Effect of population stratification on case-control association studies. II. False-positive rates and their limiting behavior as number of subpopulations increases. Hum Hered (2004) 58:40–8.
