American Journal of Epidemiology Advance Access originally published online on September 19, 2007
American Journal of Epidemiology 2008 167(1):86-89; doi:10.1093/aje/kwm257
American Journal of Epidemiology © The Author 2007. Published by the Johns Hopkins Bloomberg School of Public Health. All rights reserved. For permissions, please e-mail: journals.permissions@oxfordjournals.org.
Simple Formulas for Gauging the Potential Impacts of Population Stratification Bias
Wen-Chung Lee and
Liang-Yi Wang
From the Research Center for Genes, Environment and Human Health and the Graduate Institute of Epidemiology, College of Public Health, National Taiwan University, Taipei, Taiwan
Correspondence to Dr. Wen-Chung Lee, Room 536, College of Public Health, National Taiwan University, No. 17 Xuzhou Road, Taipei 100, Taiwan (e-mail: wenchung{at}ntu.edu.tw).
Received for publication May 8, 2007.
Accepted for publication August 13, 2007.
ABSTRACT
The case-control study design is popular for genetic association
studies of complex human diseases. However, case-control studies
may suffer from bias due to population stratification. In this
paper, the authors present simple formulas that can set a limit
to the havoc population stratification bias can wreak (the lower
and upper bounds of the confounding rate ratio and the upper
bound of the type I error rate). The authors demonstrate applications
of these formulas using two examples. The formulas can help
researchers make more prudent interpretations of their (potentially
biased) results.
bias (epidemiology); case-control studies; data interpretation, statistical; epidemiologic methods; genetics
Abbreviations: CRR, confounding rate ratio; ER-α, estrogen receptor alpha; IL, interleukin
INTRODUCTION
The case-control study design is popular for genetic association studies of complex human diseases (1, 2). However, case-control studies may suffer from bias due to population stratification if the study is conducted in a population comprising two or more strata in which allele frequencies and disease rates differ across the strata (3, 4). Population stratification bias may manifest itself in a gene-characterization study, causing over- or underestimation of a genotype relative risk; it may also appear in a gene-mapping study as inflation of the type I error rate, causing an excess number of false-positive findings (5).
Unfortunately, it is not always possible to adjust for or remove population stratification bias using conventional methods of matching or stratified analysis, because the possible strata in a population are not readily identifiable. (Race and ethnicity are difficult to characterize using questionnaires, and there may be subethnicities within ethnic groups that make correct identification even more difficult. Worse, population stratification is not limited to known racial/ethnic groups; it may simply result from incomplete admixing of populations due to subtle geographic, social, or political barriers.) Recent methods of "genomic control" (6–9) and "structured association" (10–14) may offer hope for correcting the bias. However, these methods require typing of a panel of markers across the genome, which may prove too costly for a candidate gene study.
In this paper, we present simple formulas for gauging the potential impacts of population stratification bias. We demonstrate the applications of these formulas using two examples.
POTENTIAL IMPACTS OF POPULATION STRATIFICATION BIAS
In a population composed of several strata (indexed by j), let $p_j$ denote the frequency of the susceptibility genotype [with $g_j = p_j/(1 - p_j)$ the frequency odds], $b_j$ the background disease rate (the disease rate for persons who do not carry the susceptibility genotype), and $m_j$ the total number of subjects (or more precisely, the total person-time) in the jth stratum. Furthermore, let $RR_G$ denote the relative rate of disease for persons who carry the susceptibility genotype as compared with those who do not (we assume $RR_G$ to be constant across the strata). In this stratified population as a whole, the disease rate for persons who carry the susceptibility genotype is

$$\frac{\sum_j m_j p_j b_j \, RR_G}{\sum_j m_j p_j},$$

and for those who do not, it is

$$\frac{\sum_j m_j (1 - p_j) b_j}{\sum_j m_j (1 - p_j)}.$$

Therefore, population stratification bias as quantified by the "confounding rate ratio" (CRR) is

$$CRR = \frac{\left(\sum_j w_j g_j b_j\right)\left(\sum_j w_j\right)}{\left(\sum_j w_j g_j\right)\left(\sum_j w_j b_j\right)},$$

where $w_j = m_j(1 - p_j)$.

Among all of the strata in the population, let G (G ≥ 1) denote the ratio of the largest frequency odds and the smallest frequency odds of the susceptibility genotype, and let B (B ≥ 1) denote the ratio of the largest background disease rate and the smallest background disease rate. It is easy to show mathematically that the CRR is always bounded below by L and above by U, where L = (1/U) ≤ 1 and

$$U = \frac{\left(\sqrt{GB} + 1\right)^2}{\left(\sqrt{G} + \sqrt{B}\right)^2} \geq 1$$

(see Appendix). In a case-control study with $n_1$ cases and $n_0$ controls for a susceptibility genotype having frequency p, the type I error rate with a significance level of α is bounded above by $A = \Pr[\chi^2_{df=1}(\delta^2) > \chi^2_{df=1;1-\alpha}]$, where $\chi^2_{df=1}(\delta^2)$ is a noncentral chi-squared distribution with 1 degree of freedom and a noncentrality parameter

$$\delta^2 = \frac{(\log U)^2 \, p(1 - p)}{1/n_1 + 1/n_0},$$

and $\chi^2_{df=1;1-\alpha}$ is the upper α quantile of a (central) chi-squared distribution with 1 degree of freedom (see Appendix). Note that U = L = 1 (no bias) and A = α (no inflation of the type I error rate) when either G = 1 (no variation in the frequency odds of the susceptibility genotype) or B = 1 (no variation in the background disease rates).
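The bounds described above can be evaluated with a few lines of code. The following is a minimal sketch, not the authors' own implementation: the function names `crr_bounds` and `type1_upper_bound` are hypothetical, and the noncentrality parameter is assumed to be the squared logarithm of U divided by the usual large-sample variance of a log odds ratio, (1/n1 + 1/n0)/[p(1 − p)], as outlined in the Appendix. The noncentral chi-squared tail with 1 degree of freedom is computed from the standard normal distribution, so only the Python standard library is needed.

```python
from math import log, sqrt
from statistics import NormalDist


def crr_bounds(G, B):
    """Return (L, U), the lower and upper bounds of the confounding rate ratio."""
    U = (sqrt(G * B) + 1) ** 2 / (sqrt(G) + sqrt(B)) ** 2
    return 1 / U, U


def type1_upper_bound(G, B, p, n1, n0, alpha=0.05):
    """Upper bound A of the type I error rate of a 1-df chi-squared test."""
    _, U = crr_bounds(G, B)
    # Assumed noncentrality: (log U)^2 divided by the variance of the log odds ratio.
    delta = log(U) * sqrt(p * (1 - p) / (1 / n1 + 1 / n0))
    std = NormalDist()
    # Square root of the central chi-squared critical value with 1 df.
    z = std.inv_cdf(1 - alpha / 2)
    # A noncentral chi-squared variable with 1 df is (Z + delta)^2, Z standard normal.
    return std.cdf(-z - delta) + 1 - std.cdf(z - delta)


if __name__ == "__main__":
    print(crr_bounds(5, 5))
    print(type1_upper_bound(5, 5, p=0.3, n1=100, n0=100))
```

When G = 1 or B = 1, `crr_bounds` returns (1, 1) and `type1_upper_bound` returns the nominal level, in line with the remark that there is then neither bias nor inflation.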
The above bounds are not intended to be close to the true magnitude of population stratification bias. Rather, they accommodate the magnitude of bias under every conceivable population stratification scenario that satisfies the G and B constraints. G and B themselves are to be determined in an ad hoc manner. To err on the safe side, one can exaggerate the values of G and B (to obtain more conservative bounds for population stratification bias) on the basis of one's best (but perhaps scant) knowledge of the stratified population under study.
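The claim that the bounds cover every scenario under the G and B constraints can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than a proof: it draws random stratified populations (the ranges for person-time, genotype frequency, and background rate are arbitrary choices made here), computes the CRR directly as the ratio of the crude background rates among carriers and noncarriers, and counts how often it falls outside [1/U, U]. The helper names `crr` and `upper_bound` are hypothetical.

```python
import random
from math import sqrt


def crr(strata):
    """Confounding rate ratio for strata given as (m, p, b) tuples."""
    carriers = sum(m * p for m, p, b in strata)
    noncarriers = sum(m * (1 - p) for m, p, b in strata)
    rate_carriers = sum(m * p * b for m, p, b in strata) / carriers
    rate_noncarriers = sum(m * (1 - p) * b for m, p, b in strata) / noncarriers
    return rate_carriers / rate_noncarriers


def upper_bound(strata):
    """U computed from the G and B observed among the given strata."""
    odds = [p / (1 - p) for _, p, _ in strata]
    rates = [b for _, _, b in strata]
    G = max(odds) / min(odds)
    B = max(rates) / min(rates)
    return (sqrt(G * B) + 1) ** 2 / (sqrt(G) + sqrt(B)) ** 2


rng = random.Random(0)  # fixed seed so the run is repeatable
violations = 0
for _ in range(10_000):
    k = rng.randint(2, 6)  # number of hidden strata
    strata = [
        (rng.uniform(1, 100), rng.uniform(0.05, 0.95), rng.uniform(1e-4, 1e-2))
        for _ in range(k)
    ]
    U = upper_bound(strata)
    if not (1 / U - 1e-9 <= crr(strata) <= U + 1e-9):
        violations += 1

print("scenarios outside the bounds:", violations)
```

In practice G and B are not observed and must be guessed; here they are read off the simulated strata only so that the check is exact.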
Table 1 presents the lower and upper bounds of the CRR and the upper bounds of the type I error rate under various conditions when p = 0.3 and α = 0.05. When the variation in the frequency odds of the susceptibility genotype and the background disease rates is not large (G = B = 1.5), the bias is approximately 5 percent at most, and the inflation of the type I error is negligible even at a large sample size of n1 = n0 = 1,000. As the variation gets larger (e.g., G = B = 5), the bias (up to approximately 80 percent) and the inflation of the type I error rate (up to approximately 0.5, even at a small sample size of n1 = n0 = 100) become intolerable.
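The two conditions discussed above can be worked through by hand. The arithmetic below is a sketch that assumes the upper bound takes the form U = (√(GB) + 1)²/(√G + √B)² and that the noncentrality parameter δ² equals (log U)² p(1 − p)/(1/n1 + 1/n0); the symbol δ² is notation chosen here for illustration.

```latex
% G = B = 1.5, p = 0.3, n_1 = n_0 = 1000
U = \frac{(1.5 + 1)^2}{(2\sqrt{1.5})^2} = \frac{6.25}{6} \approx 1.04,
\qquad
\delta^2 = (\log 1.0417)^2 \times 0.3 \times 0.7 \times \frac{1000}{2} \approx 0.17

% G = B = 5, p = 0.3, n_1 = n_0 = 100
U = \frac{(5 + 1)^2}{(2\sqrt{5})^2} = \frac{36}{20} = 1.8,
\qquad
\delta^2 = (\log 1.8)^2 \times 0.3 \times 0.7 \times \frac{100}{2} \approx 3.63
```

When n1 = n0 = n, the factor 1/(1/n1 + 1/n0) reduces to n/2, which is why the noncentrality grows linearly with sample size for a fixed U.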
TABLE 1. Lower and upper bounds of the confounding rate ratio and upper bounds of the type I error rate under various conditions when p = 0.3 and α = 0.05
EXAMPLES
Hefler et al. (15) conducted a case-control study of interleukin (IL) gene polymorphisms and breast cancer risk. The cases (n1 = 269) and controls (n0 = 227) were Caucasian women in Germany and Austria. Hefler et al. found that the odds ratio for the interleukin-6 (IL-6) gene (–174C/C and –174G/C vs. –174G/G) was 1.64 and was statistically significant (15). However, could this be nothing but population stratification bias in disguise? The breast cancer rates for various countries in Europe range from approximately 33 per 100,000 to 97 per 100,000 (16), and IL-6 genotype frequencies range from approximately 0.30 to 0.45 (17). Thus, it is reasonable to expect that, among the potential hidden strata in Hefler et al.'s study population, B will be no more than 2.94 (i.e., 97.01/32.98) and G no more than 1.91 [i.e., (0.45/0.55)/(0.30/0.70)]. Using the formula in the preceding section, we find that the upper bound for the bias is

$$U = \frac{\left(\sqrt{1.91 \times 2.94} + 1\right)^2}{\left(\sqrt{1.91} + \sqrt{2.94}\right)^2} \approx 1.18,$$

which is less than 1.64, the estimated odds ratio for IL-6. This suggests that Hefler et al.'s finding (15) cannot be explained away by population stratification bias alone (and perhaps should be taken more seriously).
As another example, Hsiao et al. (18) conducted a case-control study in southern Taiwan to examine whether estrogen receptor alpha (ER-α) polymorphisms are related to breast cancer risk. Hsiao et al. found that the frequency of a silent single nucleotide polymorphism in the ER-α gene (allele 1 of codon 10) was significantly lower among breast cancer patients (32.0 percent; n1 = 189) than among controls (40.4 percent; n0 = 177) at α = 0.05 (18). To determine whether population stratification bias could have seriously damaged this study, we first obtain the following data. The range of the allele frequencies of codon 10 is approximately 0.33–0.38 among Asian populations (19), and the range of the age-standardized breast cancer incidence rates is approximately 34–47 per 100,000 in the northern, central, southern, and eastern parts of Taiwan and the offshore islands (20). From these figures, we expect that B will be no more than 1.38 (i.e., 47.30/34.18) and G no more than 1.24 [i.e., (0.38/0.62)/(0.33/0.67)]. Thus, using the unrounded values of G and B,

$$U = \frac{\left(\sqrt{GB} + 1\right)^2}{\left(\sqrt{G} + \sqrt{B}\right)^2} \approx 1.018,$$

$$\delta^2 = \frac{(\log U)^2 \times 0.404 \times (1 - 0.404)}{1/(2 \times 189) + 1/(2 \times 177)} \approx 0.0138$$

(note that 2n is used here because of an allele-based approach), and $A = \Pr[\chi^2_{df=1}(0.0138) > \chi^2_{df=1;1-0.05}] = 0.0516$. Because this level of inflation of the type I error rate is inconsequential, the above positive finding in Hsiao et al.'s study (18) is unlikely to be a false alarm due to population stratification bias.
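Both examples can be rechecked numerically. The following sketch is an illustration, not the authors' code. It assumes that U = (√(GB) + 1)²/(√G + √B)², that the noncentrality parameter is (log U)² p(1 − p)/(1/n1 + 1/n0), and, for the second example, that p is taken as the allele frequency among controls (0.404); this last choice is an assumption made here because it reproduces the reported value of 0.0138. The helper names are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def odds(p):
    """Frequency odds of a genotype or allele with frequency p."""
    return p / (1 - p)


def upper_bound(G, B):
    """Upper bound U of the confounding rate ratio."""
    return (sqrt(G * B) + 1) ** 2 / (sqrt(G) + sqrt(B)) ** 2


# Example 1: IL-6 and breast cancer (Hefler et al.)
B1 = 97.01 / 32.98               # ratio of extreme breast cancer rates in Europe
G1 = odds(0.45) / odds(0.30)     # ratio of extreme genotype frequency odds
U1 = upper_bound(G1, B1)
print("Example 1: G=%.2f B=%.2f U=%.2f" % (G1, B1, U1))

# Example 2: ER-alpha codon 10 and breast cancer (Hsiao et al.)
B2 = 47.30 / 34.18
G2 = odds(0.38) / odds(0.33)
U2 = upper_bound(G2, B2)
p = 0.404                        # assumption: allele frequency among controls
n1, n0 = 2 * 189, 2 * 177        # allele-based analysis, so 2n alleles per group
ncp = log(U2) ** 2 * p * (1 - p) / (1 / n1 + 1 / n0)

std = NormalDist()
z = std.inv_cdf(1 - 0.05 / 2)    # critical value for alpha = 0.05
# Tail of a noncentral chi-squared (1 df) via the standard normal distribution.
A = std.cdf(-z - sqrt(ncp)) + 1 - std.cdf(z - sqrt(ncp))
print("Example 2: G=%.2f B=%.2f U=%.3f ncp=%.4f A=%.4f" % (G2, B2, U2, ncp, A))
```

Using the unrounded ratios for G and B matters in the second example, because U is so close to 1 that rounding G and B to two decimals visibly changes the noncentrality parameter.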
DISCUSSION
Half a century ago, Cornfield et al. (
21) also studied the potential
impact of an omitted variable on a positive finding. They demonstrated
that the accumulated epidemiologic findings to date on the relation
of tobacco smoking to lung cancer could not be explained away
by the "constitutional hypothesis." At that time, the omitted
variable was the yet-unknown and unmeasurable (but thought to
exist) "gene." It is interesting to learn that in this postgenomic
era, 50 years after Cornfield et al., the gene itself becomes
the easy part. It is now the nonconstitutional factor (i.e.,
population stratification) that proves elusive.
Population stratification bias was previously studied by Wacholder et al. (22) and Wang et al. (23) (the magnitude of bias) and by Heiman et al. (24) and Gorroochurn et al. (25) (the magnitude of the false-positive rate). They used computer simulation to demonstrate the impacts of bias in many different situations, assuming that the total number of strata, as well as the population number, the frequency of the susceptibility genotype, and the background disease rate in any given stratum, were known in advance. Researchers may have difficulty applying their results, because it is often not possible to obtain such detailed knowledge of the anatomy of a stratified population in real practice. At best, investigators may make an educated guess as to what the variation of the frequency odds of the susceptibility genotype (G) and the variation of the background disease rate (B) might be among the hidden strata. By exaggerating those two values, conservative bounds for population stratification bias can be obtained. As is demonstrated in the two examples presented above, this would help researchers make more prudent interpretations of their (potentially biased) results.
APPENDIX
Subject to the G and B constraints, population stratification bias is largest when there are two strata in the study population, the first stratum having frequency odds of susceptibility genotype g and background disease rate b and the second stratum having frequency odds of susceptibility genotype gG and background disease rate bB. For this two-stratum population, the confounding rate ratio (CRR) can be calculated as follows:

$$CRR = \frac{(w_1 + w_2 GB)(w_1 + w_2)}{(w_1 + w_2 G)(w_1 + w_2 B)},$$

where $w_j = m_j(1 - p_j)$ is the person-time of noncarriers in the jth stratum. It is straightforward to see that the CRR achieves its maximum of

$$U = \frac{\left(\sqrt{GB} + 1\right)^2}{\left(\sqrt{G} + \sqrt{B}\right)^2}$$

at $w_2/w_1 = 1/\sqrt{GB}$. Similarly, by setting the first stratum as having frequency odds of susceptibility genotype g and background disease rate bB and the second stratum as having gG and b, at the outset, it is easy to show that L = (1/U).
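The maximization step can be spelled out as follows. This derivation is a sketch added for illustration; r denotes the ratio of noncarrier person-time in the second stratum to that in the first, a symbol introduced here and not necessarily the authors' notation.

```latex
% Two-stratum CRR as a function of r (noncarrier person-time ratio, stratum 2 : stratum 1)
\mathrm{CRR}(r) = \frac{(1 + rGB)(1 + r)}{(1 + rG)(1 + rB)}
               = 1 + \frac{(G - 1)(B - 1)\, r}{(1 + rG)(1 + rB)}
               = 1 + \frac{(G - 1)(B - 1)}{\dfrac{1}{r} + rGB + G + B}

% By the arithmetic-geometric mean inequality, 1/r + rGB >= 2 sqrt(GB),
% with equality if and only if r = 1/sqrt(GB). Hence
\max_r \mathrm{CRR}(r) = 1 + \frac{(G - 1)(B - 1)}{\left(\sqrt{G} + \sqrt{B}\right)^2}
                       = \frac{\left(\sqrt{GB} + 1\right)^2}{\left(\sqrt{G} + \sqrt{B}\right)^2}
```

The last equality follows because (√(GB) + 1)² − (√G + √B)² = GB + 1 − G − B = (G − 1)(B − 1).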
Under the null hypothesis that the gene under study is not associated with the disease, its true value of log relative rate is 0 and its maximally biased log relative rate is ±log U. In a 2 × 2 table testing a null gene under the worst-case (maximally biased) scenario, the sample log odds ratio has an expected value of ±log U and a variance of

$$\frac{1}{p(1 - p)}\left(\frac{1}{n_1} + \frac{1}{n_0}\right)$$

approximately. Therefore, the test follows a noncentral chi-squared distribution with 1 degree of freedom and a noncentrality parameter

$$\delta^2 = \frac{(\log U)^2 \, p(1 - p)}{1/n_1 + 1/n_0}.$$
ACKNOWLEDGMENTS
This study was supported by grants from the National Science
Council of Taiwan, Republic of China (NSC 95-2314-B-002-242,
NSC 95-3114-P-002-005-Y, and NSC 96-2314-B-002-143).
Conflict of interest: none declared.
References
1. Khoury MJ, Yang Q. The future of genetic studies of complex human diseases: an epidemiologic perspective. Epidemiology (1998) 9:350–4.
2. Risch N, Merikangas K. The future of genetic studies of complex human diseases. Science (1996) 273:1516–17.
3. Ewens WJ, Spielman RS. The transmission/disequilibrium test: history, subdivision, and admixture. Am J Hum Genet (1995) 57:455–64.
4. Witte JS, Gauderman WJ, Thomas DC. Asymptotic bias and efficiency in case-control studies of candidate genes and gene-environment interactions: basic family designs. Am J Epidemiol (1999) 149:693–705.
5. Thomas DC, Witte JS. Point: population stratification: a problem for case-control studies of candidate-gene associations? Cancer Epidemiol Biomarkers Prev (2002) 11:505–12.
6. Devlin B, Roeder K. Genomic control for association studies. Biometrics (1999) 55:997–1004.
7. Bacanu SA, Devlin B, Roeder K. The power of genomic control. Am J Hum Genet (2000) 66:1933–44.
8. Reich DE, Goldstein DB. Detecting association in a case-control study while correcting for population stratification. Genet Epidemiol (2001) 20:4–16.
9. Lee WC. Case-control association studies with matching and genomic controlling. Genet Epidemiol (2004) 27:1–13.
10. Pritchard JK, Stephens M, Rosenberg NA, et al. Association mapping in structured populations. Am J Hum Genet (2000) 67:170–81.
11. Pritchard JK, Stephens M, Donnelly P. Inference of population structure using multilocus genotype data. Genetics (2000) 155:945–59.
12. Satten GA, Flanders WD, Yang Q. Account for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am J Hum Genet (2001) 68:466–77.
13. Zhu X, Zhang SL, Zhao H, et al. Association mapping, using a mixture model for complex traits. Genet Epidemiol (2002) 23:181–96.
14. Hoggart CJ, Parra EJ, Shriver MD, et al. Control of confounding of genetic associations in stratified populations. Am J Hum Genet (2003) 72:1492–504.
15. Hefler LA, Grimm C, Lantzsch T, et al. Interleukin-1 and interleukin-6 gene polymorphisms and the risk of breast cancer in Caucasian women. Clin Cancer Res (2005) 11:5718–21.
16. Parkin DM, Whelan SL, Ferlay J, et al. Cancer incidence in five continents. (2002) Lyon, France: International Agency for Research on Cancer.
17. Berger FG. The interleukin-6 gene: a susceptibility factor that may contribute to racial and ethnic disparities in breast cancer mortality. Breast Cancer Res Treat (2004) 88:281–5.
18. Hsiao WC, Young KC, Lin SL, et al. Estrogen receptor-alpha polymorphism in a Taiwanese clinical breast cancer population: a case-control study. Breast Cancer Res (2004) 6:R180–6.
19. National Center for Biotechnology Information, US National Library of Medicine. Reference SNP cluster report: rs2077647. (NCBI Single Nucleotide Polymorphism database) (2007) Washington, DC: National Library of Medicine. (http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=2077647).
20. Bureau of Health Promotion, Department of Health, Republic of Taiwan. Cancer registry annual report, 1995–2002. (In Chinese) (2007) Taipei, Taiwan: Taiwan Department of Health. (http://crs.cph.ntu.edu.tw/crs_c/annual.html).
21. Cornfield J, Haenszel W, Hammond EC, et al. Smoking and lung cancer: recent evidence and a discussion of some questions. J Natl Cancer Inst (1959) 22:173–203.
22. Wacholder S, Rothman N, Caporaso N. Population stratification in epidemiologic studies of common genetic variants and cancer: quantification of bias. J Natl Cancer Inst (2000) 92:1151–8.
23. Wang Y, Localio R, Rebbeck TR. Evaluating bias due to population stratification in case-control association studies of admixed populations. Genet Epidemiol (2004) 27:14–20.
24. Heiman GA, Hodge SE, Gorroochurn P, et al. Effect of population stratification on case-control association studies. I. Elevation in false positive rates and comparison to confounding risk ratios (a simulation study). Hum Hered (2004) 58:30–9.
25. Gorroochurn P, Hodge SE, Heiman G, et al. Effect of population stratification on case-control association studies. II. False-positive rates and their limiting behavior as number of subpopulations increases. Hum Hered (2004) 58:40–8.
