American Journal of Epidemiology Advance Access originally published online on September 12, 2008
American Journal of Epidemiology 2008 168(8):878-889; doi:10.1093/aje/kwn208
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
PRACTICE OF EPIDEMIOLOGY |
Estimating the Single Nucleotide Polymorphism Genotype Misclassification From Routine Double Measurements in a Large Epidemiologic Sample
Correspondence to Dr. Iris M. Heid, Helmholtz Zentrum München–German Research Center for Environmental Health, Institute of Epidemiology, Ingolstädter Landstraße 1, 85764 Neuherberg, Germany (e-mail: heid{at}helmholtz-muenchen.de).
Received for publication January 14, 2008. Accepted for publication June 13, 2008.
| ABSTRACT |
|---|
|
|
|---|
Previously, estimation of genotype misclassification of single nucleotide polymorphisms (SNPs) as encountered in epidemiologic practice and involving thousands of subjects was lacking. The authors collected representative data on approximately 14,000 subjects from 8 studies and 646,558 genotypes assessed in 2005 by means of matrix-assisted laser desorption ionization time-of-flight mass spectrometry. Overall discordance among 57,805 double genotypes from routine quality control was 0.36%. Fitting different misclassification models by maximum likelihood assuming identical misclassification for all SNPs, the estimated misclassification probabilities ranged from 0.0000 to 0.0035. When applying the misclassification simulation and extrapolation (MC-SIMEX) method for the first time to genetic data to account for the misclassification in a reanalysis of adiponectin-encoding (APM1) gene SNP associations with plasma adiponectin in 1,770 subjects, the authors found no impact of this small error on association estimates but increased estimates for a more substantial error. This study is the first to provide large-scale epidemiologic data on SNP genotype misclassification. The estimated misclassification in this example was small and negligible for association estimates, which is reassuring and essential for detecting SNP associations. In situations with more substantial error, the presented approach using duplicate genotyping and the MC-SIMEX method is practical and helpful for quantifying the genotyping error and its impact.
bias (epidemiology); genetics; genotype; likelihood functions; polymorphism, single nucleotide
Abbreviations: HWE, Hardy-Weinberg equilibrium; MALDI-TOF MS, matrix-assisted laser desorption ionization time-of-flight mass spectrometry; MC-SIMEX, misclassification simulation and extrapolation; SNP, single nucleotide polymorphism
| INTRODUCTION |
|---|
|
|
|---|
Because high-throughput single nucleotide polymorphism (SNP) genotyping is technically feasible today and is readily applied in large epidemiologic studies with thousands of subjects, there is currently a focus in genetic epidemiology on the analysis of SNPs and their associations with diseases or disease markers. However, consistent replication of SNP association signals is a concern. One possible source of bias is error in the genotype (see the Appendix for a short introduction to genetic terminology).
The general effect of errors in predictor variables of regression models is to bias estimates and decrease power (1–4). While nondifferential misclassification in a dichotomous covariate usually induces a bias towards the null (5), the trichotomous covariate case is usually not as predictable (6). There have been numerous studies of the effect of genotyping error on linkage (7, 8), linkage disequilibrium (9), tagging SNPs (10), multiple dimension reduction methods (11), genotype and haplotype distribution (12–15), haplotype assignment (16), and family-based association (17–20). The investigations on how genotyping error affects population-based association have pertained mostly to case-control studies and have applied restricted association models like the chi-squared test or the Armitage trend test that do not allow for covariate adjustment (21–27). There have been few studies for logistic (28) or linear (29) regression models, which apply restricted error models.
The sources of genotyping error are manifold and have already received a great deal of attention (30). When the error cannot be estimated from pedigrees but needs to be derived for unrelated subjects, assumptions or validation/replication data are needed. The use of Hardy-Weinberg equilibrium (HWE) is much debated (31, 32). Validation data implying the availability of a gold standard would be ideal (33), and this approach has already been proposed (25), but it may not be advisable because of the lack of a perfect gold standard genotype and the potential for overcorrecting when a nonperfect standard is used (34). Use of replication from multiple genotype assessments (>2) has been illustrated with small-scale experimental (30) or simulation (28) data, but in practice duplicate genotyping is usually only available for 5–10% of the subjects from routine quality control. Previous attempts to estimate the error from duplicate genotypes involved a limited number of discordant genotype pairs such as 2 (27) or 30 (29) and a restricted error model. To our knowledge, estimation of genotyping error has never been based on routine data from a set of representative epidemiologic studies.
One reason for the lack of previous studies might be that the genotyping error was expected to be small. A situation that is the pride of the laboratory and the joy of the epidemiologist is a problem for the statistician, for a number of reasons: 1) a likelihood with the maximum close to 0 and steep in the vicinity of the maximum is a challenge for robust estimation; 2) huge genotype data sets are required in order to obtain sufficient numbers of discordant repetitions; and 3) the impact of such a small error on association estimates is expected to be negligible. So why bother? Well, the error cannot be deemed to be small in routine genetic epidemiologic association studies before the error has been estimated in such studies. It could well be that rather small experiments in which investigators know the purpose of error assessment contain completely different errors than large studies in which thousands of subjects are routinely genotyped. Methodological investigations of the impact of genotyping error have often assumed large error sizes of 1%–10%—an error size possibly stemming from former times, when sophisticated standard operating procedures or robotics were not available.
We aimed to gather a representative set of large epidemiologic studies with double SNP genotypes to provide an approach to estimation of genotype misclassification in these routine data, and to characterize the model and the size of the error as it can be expected in practice. It was a further objective to elucidate the impact of such misclassification on genetic association estimates in a real example by applying a practical method: the recently developed misclassification simulation and extrapolation (MC-SIMEX) approach (35), which has not yet been used for genetic data.
| MATERIALS AND METHODS |
|---|
|
|
|---|
Collecting double genotypes
We collected genotype information on all studies with at least 1,000 subjects and 5% double genotypes that had been assessed by laboratory personnel of the Genome Analysis Center of the Helmholtz Zentrum München by matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) during 2004–2005. There were 2 possible sources of double genotypes: either the DNA of 1 subject was put on 2 positions of the same microtiter plate for routine quality control (routine doubles) or a microtiter plate was processed a second time because of an insufficient call rate in the first run (trouble-shooting doubles). The SNPs in our analysis had met laboratory quality-control requirements (sufficient call rate, polymorphic, and clear spectrometer signals), as they do to be cleared for association analysis. The final data set comprised 8 studies, including 5 distinct samples from the KORA (Kooperative Gesundheitsforschung in der Region Augsburg) studies (36), a Utah study (37), the SAPHIR Study (Salzburg Atherosclerosis Study to Identify Persons with High Individual Risk) (38), and the German part of the AIRGENE (Air Pollution and Inflammatory Response in Myocardial Infarction Survivors) Study (39). All of the studies included had been conducted according to the principles expressed in the Declaration of Helsinki. The investigators in these individual studies had either the written informed consent of all participants for genetic analyses or approval from their institutional review boards for genetic analyses.
Genotypes
For the kth SNP, k = 1, ..., K,
is a subject's true genotype, and
is the firstly and
the secondly (if available) observed genotype (omitting indices for the subjects). We denote true and error-prone genotype probabilities by
with
and
with
, respectively. For the latter, the observed genotype frequencies,
, are a consistent maximum likelihood estimate based on the likelihood
|
|
Discordance matrix
For each SNP k, k = 1, ..., K, we derived the number of concordant or discordant observed genotype pairs
(discordance matrix) with
being the number of subjects with
and
. Summing the
over i and j yields the total number of observed genotype pairs, for each k = 1, ..., K, giving rise to the restriction
. Without ordering of measurements, the matrix is triangular (Table 1). The overall discordance was computed as the number of discordant pairs across all SNPs relative to the total number of genotype pairs. The SNP-wise discordance was computed accordingly per SNP.
|
Misclassification matrix and the problem of identifiability
The misclassification problem can be represented by a 3 x 3 matrix for each SNP containing the misclassification probabilities,
, which are the probabilities of misclassifying a true genotype
as
,
, for SNP k, k = 1, ..., K. Solving this problem in general requires more than 2 measurements, but repeated genotyping for routine quality control is not usually performed more than twice. Thus, statistical procedures requiring more than 2 measurements cannot be applied (28, 30). This leaves us with the problem of making this 3 x 3 misclassification problem identifiable with double measurements. "Not identifiable" means there are more parameters to estimate than information available: In the case of K SNPs, the presence of 9 parameters per SNP in the matrix minus 3 due to each column summing up to unity leaves 6K parameters to estimate. The observed number of subjects with genotype pairs i and j (i, j = 0, 1, 2, i < j),
, and the restriction
for each SNP k leave 5K independent observations.
We achieved identifiability by assuming the misclassification probabilities to be the same for all SNPs and thus to be independent of k. The general misclassification matrix
|
|
,
, thus involved 6 unknown parameters
(Table 2).
|
Error models
The automated high-throughput MALDI-TOF MS platform (Sequenom, San Diego, California) was used with an example genotyping signal (shown in Figure 1). One or 2 signals are detected when the amplitude exceeds a prespecified detection level for 2 equal (homozygous) alleles or 2 different (heterozygous) alleles. The unavoidable white noise gives rise to specific genotyping error models:
Model A: A true signal falls short of the detection level resulting in allelic dropout, which implies that 1) a heterozygous subject is more likely to be misclassified as homozygous (1 of the 2 signals vanished) than the other way around, 2) a subject homozygous for 1 allele is unlikely to be misclassified as homozygous for the other, and 3) a homozygous subject is more likely to be coded as missing than a heterozygous subject (18).
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
Figure 1. Genotyping signal from the matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-TOF MS) genotyping platform for 1 single nucleotide polymorphism (SNP) of 1 person. The x-axis displays the mass of the extension primer product and the y-axis the relative intensity of the product. Each of the 2 alleles refers to a product with a specific mass. Therefore, a signal is detected at either of these 2 positions on the x-axis (homozygous genotype of either of the alleles) or at both (heterozygous as shown in the figure). The other signals are white noise.
Model B: The "zero-corner model" (19) assumes 0 probability for a homozygous genotype's being misclassified as the other homozygous genotype (an extreme case of model A2 above).Model C: The "symmetrical model" assumes no systematic ordering of the major and minor alleles in the assay. It implies that the probability of misclassifying a true homozygous genotype as heterozygous or of falsely classifying a heterozygous genotype as homozygous does not depend upon whether an allele is the minor or major allele of the homozygous genotype.
Model D: The "allele-independent model" assumes that the probability of misclassifying 1 allele for the other is the same as the other way around (9).
The other genotyping error models described in the literature are closely related to these 4, except for the "uniform error model" (8), which assumes equal misclassification probabilities and is mathematically appealing but rather unrealistic for this genotyping setting.
The 4 error models correspond to restrictions in the misclassification matrix:
- The allelic dropout model:
and
, still involving 6 parameters.
- The zero-corner model:
and
, reducing to 4 parameters.
- The symmetrical model:
,
, and
, reducing to 3 parameters.
- The allele-independent model: Prob(allele A is misclassified into allele a) = Prob(allele a is misclassified into allele A) = :
, reducing to 1 parameter (Table 3).
|
Estimating the misclassification matrix via maximum likelihood
The discordance probabilities,
,
,
—that is, the probabilities of observing genotypes i and j for the 2 measurements for SNP k—relate to the misclassification probabilities
and the true genotype probabilities
via
![]() | (1) |
are known, the likelihood for
given
,
, is
![]() | (2) |
When the true genotype probabilities
are unknown, either they can be estimated together with the misclassification probabilities ("extended likelihood") or assumptions need to be made. Applying the latter approach, we assumed that 1) the observed genotype probabilities
reasonably approximated the truth (
) and 2)
was estimated by
with negligible sampling error. Therefore, we estimated the misclassification probabilities by maximizing
with
("small misclassification assumption"). Again based on this assumption, exact P values for HWE (see Appendix) were computed using
.
Maximum likelihood estimates were computed by applying the Nelder-Mead simplex algorithm (Mathematica, version 5.0; Wolfram Research, Champaign, Illinois), and their variances were derived by means of the Fisher matrix. Genotype misclassification was estimated on the basis of error models A–D. Likelihood ratio tests were conducted to compare model fits.
In sensitivity analyses, we evaluated the robustness of error estimation upon violation of the assumption
, upon exclusion of SNPs with HWE violation, or upon exclusion of SNPs with sparse data in 1 genotype category. Furthermore, we explored what our results would have looked like had we not opted for the small misclassification assumption but rather estimated the true genotype probabilities simultaneously with the misclassification probabilities (2K + 6 parameters to estimate) via the "extended likelihood" given by the product of equation 2 and
|
|
,
, and
.
Method of correcting association analysis for genotype misclassification and real data example
To elucidate the impact of the SNP genotype misclassification on association analysis, we applied the MC-SIMEX method (see brief description in Appendix) in a real data example: We reanalyzed the association of 13 adiponectin-encoding (APM1) gene SNPs with adiponectin plasma levels in the SAPHIR Study (38), including 1,770 unrelated healthy subjects. For each SNP, we computed linear regression association estimates based on log(adiponectin + 1), adjusted for body mass index (weight (kg)/height (m)2), sex, and age, without and with application of the MC-SIMEX method (using the log-linear extrapolation function). We assumed a realistic scenario based on the general misclassification matrix as estimated and an extreme scenario created by multiplying the nondiagonal elements of this matrix by 10.
| RESULTS |
|---|
|
|
|---|
The analyzed sample
The data set contained 646,558 genotypes with 160,454 doubles involving 283 SNPs from over 10,000 subjects in 8 projects. Among these were 70,539 routine doubles. For 62,318 routine doubles, both genotype measurements were nonmissing; 57,805 of these corresponded to 225 "3-level" SNPs. Table 4 summarizes data on these samples.
|
Discordance
Table 5 shows the discordance matrix, including routine as well as trouble-shooting doubles, summarizing over all 283 SNPs. This matrix also highlights that the proportion of missing genotypes among the first measurements is 15.65% as opposed to 8.15% among the second, indicating an undesirable informative ordering of the measurements. Restricting the data set to the routine doubles, now including 262 SNPs, yielded symmetry (7.56% vs. 7.60%, respectively; Table 5), indicating that missingness was now independent of the measurement order.
|
Our main analysis was based on the 225 3-level SNPs with 57,805 routine double genotypes, both nonmissing, which yielded 210 discordant pairs and thus an overall discordance of 0.36%. Table 6 depicts the discordance across all SNPs. The scatterplots of the SNP-wise discordance versus P values from testing for HWE violation (Figure 2, part A) or versus the minor allele frequency (Figure 2, part B) show that some of the larger discordances occurred together with smaller HWE P values, but not all HWE violations implicated large discordance (Spearman correlation coefficient (r) = –0.1362, P = 0.0313). There was no dependency of the discordance on the minor allele frequency (r = 0.0826, P = 0.1927).
|
|
Estimation of misclassification probabilities
Table 7 summarizes the misclassification matrices from maximizing the likelihood
(equation 2) for the various misclassification models. For the general model, the estimated misclassification probabilities ranged between 0.0001 and 0.0024, and for the allele-independent model, they ranged between 0.0000 and 0.0020; the other models yielded a similarly small dimension of the error.
|
The estimated parameters and 95% confidence intervals indicated that the allelic dropout characteristics held (
and
); the symmetric model deviated the least from the general model, as the 95% confidence interval from only 1 misclassification probability was disjoint with the corresponding confidence intervals of the general model. This was supported not only by a comparison of the number of discordant genotype pairs observed with the number expected under the various models but also by the likelihood ratio test of model fit (Table 8), which yielded no formal rejection of the symmetrical model (though a "borderline" P value of 0.07), but for the "zero-corner" and "allele-independent" models (P < 10–3).
|
Robustness of estimation
Firstly, we explored the impact of violation of the small misclassification assumption (i.e., deviation of
from
) on the misclassification probability estimates. We chose the deviation such that
would have been observed, given an allele-independent error with
, to be in the ballpark of a realistic error. The new
were derived from
,
, and
. The misclassification probabilities were again estimated, maximizing
given the new
. For the general model, the estimated parameters
were similar ((0.0024, 0.0016, 0.0004, 0.0002, 0.0000, 0.0016) instead of (0.0024, 0.0014, 0.0004, 0.0002, 0.0001, 0.0015)); the
parameter for the allele-independent model remained basically unchanged at 0.0010. Secondly, when we excluded the 29 SNPs with HWE violation (Ps < 0.05), the results did not change markedly. Neither did they change when we excluded SNPs for which fewer than 30 subjects were minor-allele homozygous (leaving 152 SNPs), with the general model parameters being estimated as (0.0018, 0.0013, 0.0009, 0.0002, 0.0002, 0.0018) and the
estimate being 0.0012. Finally, when estimating true genotype probabilities together with the misclassification probabilities, we had to restrict the data to the 152 SNPs with enough observations in the third genotype category; this yielded (0.0020, 0.0011, 0.0008, 0.0061, 0.0002, 0.0000) and an
estimate of 0.0012.
Impact of genotyping error in a real data example using the MC-SIMEX method
Figure 3 summarizes the uncorrected and MC-SIMEX-corrected β estimates in the example of the 13 3-level APM1 SNPs and their association with plasma adiponectin concentrations. A clear change of the β estimates of up to 15% when correcting for the misclassification was seen only under the extreme scenario for the 2 SNPs with already-high uncorrected estimates (SNP4 and SNP13).
|
| DISCUSSION |
|---|
|
|
|---|
In this study, we collected 646,558 SNP genotypes derived by MALDI-TOF MS from large epidemiologic studies including approximately 14,000 subjects altogether. On the basis of 57,805 double genotypes from routine quality control, estimated genotype misclassification probabilities were 0.001 and below. These data thus underscore the validity of SNP genotypes in situations comparable to that of our study. Note, however, that such a small genotyping error cannot be expected when quality control is relaxed, when the DNA quality is inferior, or when more error-prone genotyping technologies are applied. Furthermore, double genotyping by the same genotyping platform, using the same primer, DNA, and aliquot, does not enable one to grasp all possible sources of genotyping error or the potential mismatch of subject identifiers. Additionally, the error here does not reflect all of the genotypes produced in the laboratory, but rather reflects only the quality-controlled SNPs presented to the data analyst in routine practice. Thus, we can only make conclusions about some aspects of the genotyping error.
While our example of 13 APM1 gene SNPs (38), applying the MC-SIMEX correction method (35), pinpointed only marginal bias in association estimates induced by an error as estimated by our repeated genotype data, it was also illustrated that increased genotyping error would decrease association estimates and that the MC-SIMEX approach can effectively remove this bias. Because the MC-SIMEX method can handle a wide range of error and association models for this trichotomous variable situation also and because it allows adjustment for other covariates, it can be recommended for utilization in future genetic association studies, particularly for conditions that are problematic and when the error is nonnegligible. If it is not possible to conduct repeated genotyping which provides independent replications, re-genotyping by employing a "gold standard" method or formulating a mechanistically motivated error model are further options for obtaining error estimates. If none of these 3 options are possible, the sensitivity of association results to different error sizes can be explored (e.g., using the MC-SIMEX method). Note, however, that when replicate genotypes indicate an extremely low error, as observed in our data, methods such as MC-SIMEX are probably not needed.
It was a great challenge to us to collect a sufficiently large data set with routinely performed repeated genotyping to estimate the genotyping error in epidemiologic practice. The second challenge was to achieve identifiability of the 3 x 3 genotype misclassification problem with double observed genotypes. We did not want a method requiring more than 2 genotype repetitions, nor did we want to restrict the genotyping error model. The first type of method would have prevented our approach from being applicable to routine double genotype data; the second would have omitted the possibility of exploring the misclassification model fit. We thus assumed the same misclassification for all SNPs. Our data supported this assumption, since the discordance did not depend upon the minor allele frequency (see Figure 2, part B). This assumption is also practical when one is interested in the overall error across a full set of SNPs rather than the error of a specific SNP.
We were able to estimate the genotyping error under the most general error model, while the literature covers rather restricted models (Table 9). Note that estimation of all 6 parameters was not as robust as desirable, most likely because of the small error giving rise to only 210 discordant genotype pairs despite the large sample size. Nevertheless, the misclassification probability estimates remained very small throughout, and the fact that we observed discordant genotype pairs with both opposite homozygous genotypes argues against the "zero-corner model." Our data suggested the allelic dropout model to be appropriate and the symmetrical model to fit reasonably well while being at the same time more parsimonious (involving 3 instead of 6 parameters), and provided evidence against further restrictions.
|
It was a strength of our work that we were able to collect a large representative set of epidemiologic studies with double genotypes as would be encountered in practice. Our sample was not an experimental data set, and laboratory personnel were unaware of this project at the time of genotyping. We present an approach as it could be applied in practice: estimating genotyping error from routine double genotypes and a correction method applicable for linear and logistic regression, allowing for covariate adjustment, all genetic effects, and most misclassification models.
It must be considered a limitation that we used categorized genotypes instead of genotype probability scores, which are more sophisticated from a methodological perspective; however, routine association analyses currently use categories, and the epidemiologic practice was our focus here. Our conclusions cannot be transferred to differential error in case-control studies (40) or to more complex genetic variants such as microsatellite markers implying a higher-dimensional misclassification problem.
To our knowledge, this study is the first to provide epidemiologic data with which to estimate and characterize SNP genotype misclassification as it can be expected in practice. For the first time in the genetic context, we have applied the MC-SIMEX method and elucidated it as a method well-suited to account for misclassification in genetic association analysis.
We conclude that SNP genotyping error as presented in our example data—derived from a high-quality laboratory, with experienced personnel, using an established genotyping method, and with quality control before association analysis—is small and possibly negligible for many association studies. This is reassuring and is essential for detecting SNP associations in genetic epidemiology. In cases of very small genotyping error, as in our data, the particular choice of error model is not a concern, and a correction of association estimates applying methods like MC-SIMEX would not be needed. Situations may arise in which more substantial error is encountered. Then the implementation of an allele-independent error model might be appropriate for a first simplified approach, but extension to more complex models might be desirable. In addition, our approach to estimating genotype misclassification from double genotyping and accounting for this misclassification in the association analysis is useful and practical for quantifying the genotyping error and its impact.
| APPENDIX |
|---|
|
|
|---|
A. Genetic terminology
A (human) single nucleotide polymorphism (SNP) is a position in the DNA where human beings exhibit variation in just 1 nucleotide as opposed to other polymorphisms involving a series of nucleotides. Usually, this SNP variation across subjects involves 2 different nucleotides out of the 4 possible (cytosine, thymine, adenine, guanine) for the 2 alleles that each subject possesses (1 inherited from the mother and 1 from the father). For example, 1 subject could have a cytosine (C) and a thymine (T) at 1 SNP position (subject with a heterozygous genotype); another subject could exhibit 2 C's and again another 2 T's (homozygous genotypes). If T is the nucleotide that is less often found in a population sample, T is considered the minor allele. Thus, a subject's SNP genotype can be coded as 0 (major-allele homozygous), 1 (heterozygous), or 2 (minor-allele homozygous).
Genotype frequencies are typically assumed to be in Hardy-Weinberg equilibrium based on the hypothesis that humans mate randomly and the genotype of the father does not depend upon the genotype of the mother. Hardy-Weinberg equilibrium is thus given when the probability of the heterozygous genotype equals the product of the probability of the 2 involved nucleotides. Violation of Hardy-Weinberg equilibrium is therefore considered a possible indication of genotyping error with allelic dropout.
B. Misclassification simulation and extrapolation approach
Use of the misclassification simulation and extrapolation (MC-SIMEX) approach to account for misclassification in association analysis involves a simulation and an extrapolation step: Starting from the naïve estimate
naive (i.e., the estimate that does not account for the misclassification) and assuming that this was derived with underlying misclassification
, data with higher levels of misclassification are simulated. From the estimates resulting from these simulated data, a function (linear, quadratic, or log-linear) is extrapolated back to the case of no misclassification, yielding the corrected estimator
SIMEX(
). This estimate has already been shown to be consistent, and variance estimates have been developed (35).
| ACKNOWLEDGMENTS |
|---|
Author affiliations: Institute of Epidemiology, Helmholtz Zentrum München, Neuherberg, Germany (Iris M. Heid, Claudia Lamina, Guido Fischer, Norman Klopp, Melanie Kolz, Harald Grallert, Caren Vollmert, Stefanie Wagner, Cornelia Huth, Martina Müller, Annette Peters, H.-Erich Wichmann, Thomas Illig); Ludwig-Maximilians-Universität München, Munich, Germany (Iris M. Heid, Melanie Kolz, Harald Grallert, Cornelia Huth, Julia Müller, Martina Müller, H.-Erich Wichmann, Thomas Illig); Statistical Consulting Unit, Ludwig-Maximilians-Universität München, Munich, Germany (Helmut Küchenhoff); Cardiovascular Genetics Division, University of Utah School of Medicine, Salt Lake City, Utah (Steven Hunt); First Department of Internal Medicine, St. Johann Spital, Paracelsus Private Medical University Salzburg, Salzburg, Austria (Bernhard Paulweber); Division of Genetic Epidemiology, Department of Medical Genetics, Molecular and Clinical Pharmacology, Innsbruck Medical University, Innsbruck, Austria (Florian Kronenberg).
This study was supported by a grant from the Deutsche Forschungsgemeinschaft (SFB 386) to Dr. Iris Heid, by a grant within the framework of the German National Genome Research Net to Dr. H.-Erich Wichmann, by the Munich Center of Health Sciences of Ludwig-Maximilians University, by US National Institutes of Health grants DK55006 and HL21088 and National Center for Research Resources grant M01-RR00064 to Dr. Steven Hunt, and by a Genomics of Lipid-associated Disorders (GOLD) grant from the Austrian Genome Research Program (GEN-AU) to Dr. Florian Kronenberg.
The last 3 authors contributed equally to the study.
Conflict of interest: none declared.
| References |
|---|
|
|
|---|
- Fuller WA. Measurement Error Models (1987) New York, NY: John Wiley and Sons, Inc.
- Willett W. An overview of issues related to the correction of non-differential exposure measurement error in epidemiologic studies. Stat Med (1989) 8(9):1031–1040.[Web of Science][Medline]
- Thomas D, Stram D, Dwyer J. Exposure measurement error: influence on exposure-disease relationships and methods of correction. Annu Rev Public Health (1993) 14:69–93.[CrossRef][Web of Science][Medline]
- Carroll RJ, Ruppert D, Stefanski LA. Measurement Error in Nonlinear Models (1995) 1st ed. Boca Raton, FL: Chapman & Hall/CRC.
- Bross I. Misclassification in 2 x 2 tables. Biometrics (1954) 10:478–486.[CrossRef][Web of Science]
- Dosemeci M, Wacholder S, Lubin JH. Does nondifferential misclassification of exposure always bias a true effect toward the null value? Am J Epidemiol (1990) 132(4):746–748.
[Abstract/Free Full Text] - Sobel E, Papp JC, Lange K. Detection and integration of genotyping errors in statistical genetics. Am J Hum Genet. (2002) 70(2):496–508.[CrossRef][Web of Science][Medline]
- Lincoln SE, Lander ES. Systematic detection of errors in genetic linkage data. Genomics (1992) 14(3):604–610.[CrossRef][Web of Science][Medline]
- Akey JM, Zhang K, Xiong M, et al. The effect that genotyping errors have on the robustness of common linkage-disequilibrium measures. Am J Hum Genet. (2001) 68(6):1447–1456.[CrossRef][Web of Science][Medline]
- Liu W, Zhao W, Chase GA. The impact of missing and erroneous genotypes on tagging SNP selection and power of subsequent association tests. Hum Hered (2006) 61(1):31–44.[CrossRef][Web of Science][Medline]
- Ritchie MD, Hahn LW, Moore JH. Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol (2003) 24(2):150–157.[CrossRef][Web of Science][Medline]
- Moskvina V, Schmidt KM. Susceptibility of biallelic haplotype and genotype frequencies to genotyping error. Biometrics (2006) 62(4):1116–1123.[Medline]
- Govindarajulu US, Spiegelman D, Miller KL, et al. Quantifying bias due to allele misclassification in case-control studies of haplotypes. Genet Epidemiol (2006) 30(7):590–601.[CrossRef][Web of Science][Medline]
- Zhu WS, Fung WK, Guo J. Incorporating genotyping uncertainty in haplotype frequency estimation in pedigree studies. Hum Hered (2007) 64(3):172–181.[CrossRef][Web of Science][Medline]
- Quade SR, Elston RC, Goddard KA. Estimating haplotype frequencies in pooled DNA samples when there is genotyping error [electronic article]. BMC Genet. (2005) 6(1):25.[CrossRef][Medline]
- Lamina C, Bongardt F, Kuchenhoff H, et al. Haplotype reconstruction error as a classical misclassification problem: introducing sensitivity and specificity as error measures [electronic article]. PLoS ONE (2008) 3(3):e1853.[CrossRef]
- Gordon D, Heath SC, Liu X, et al. A transmission/disequilibrium test that allows for genotyping errors in the analysis of single-nucleotide polymorphism data. Am J Hum Genet. (2001) 69(2):371–380.[CrossRef][Web of Science][Medline]
- Mitchell AA, Cutler DJ, Chakravarti A. Undetected genotyping errors cause apparent overtransmission of common alleles in the transmission/disequilibrium test. Am J Hum Genet. (2003) 72(3):598–610.[CrossRef][Web of Science][Medline]
- Morris RW, Kaplan NL. Testing for association with a case-parents design in the presence of genotyping errors. Genet Epidemiol (2004) 26(2):142–154.[CrossRef][Web of Science][Medline]
- Seaman SR, Holmans P. Effect of genotyping error on type-I error rate of affected sib pair studies with genotyped parents. Hum Hered (2005) 59(3):157–164.[CrossRef][Web of Science][Medline]
- Rice KM, Holmans P. Allowing for genotyping error in analysis of unmatched case-control studies. Ann Hum Genet. (2003) 67(pt 2):165–174.[CrossRef][Web of Science][Medline]
- Kang SJ, Gordon D, Finch SJ. What SNP genotyping errors are most costly for genetic association studies? Genet Epidemiol (2004) 26(2):132–141.[CrossRef][Web of Science][Medline]
- Gordon D, Ott J. Assessment and management of single nucleotide polymorphism genotype errors in genetic association analysis. Pac Symp Biocomput (2001) 18–29.
- Gordon D, Finch SJ, Nothnagel M, et al. Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Hum Hered (2002) 54(1):22–33.[CrossRef][Web of Science][Medline]
- Gordon D, Yang Y, Haynes C, et al. Increasing power for tests of genetic association in the presence of phenotype and/or genotype error by use of double-sampling [electronic article]. Stat Appl Genet Mol Biol. (2004) 3. article 26.
- Gordon D, Haynes C, Yang Y, et al. Linear trend tests for case-control genetic association that incorporate random phenotype and genotype misclassification error. Genet Epidemiol (2007) 31(8):853–870.[CrossRef][Web of Science][Medline]
- Tintle NL, Gordon D, McMahon FJ, et al. Using duplicate genotyped data in genetic analyses: testing association and estimating error rates [electronic article]. Stat Appl Genet Mol Biol. (2007) 6. article 4.
- Lai R, Zhang H, Yang Y. Repeated measurement sampling in genetic association analysis with genotyping errors. Genet Epidemiol (2007) 31(2):143–153.[CrossRef][Web of Science][Medline]
- Wong MY, Day NE, Luan JA, et al. Estimation of magnitude in gene-environment interactions in the presence of measurement error. Stat Med (2004) 23(6):987–998.[CrossRef][Web of Science][Medline]
- Pompanon F, Bonin A, Bellemain E, et al. Genotyping errors: causes, consequences and solutions. Nat Rev Genet. (2005) 6(11):847–859.[CrossRef][Web of Science][Medline]
- Leal SM. Detection of genotyping errors and pseudo-SNPs via deviations from Hardy-Weinberg equilibrium. Genet Epidemiol (2005) 29(3):204–214.[CrossRef][Web of Science][Medline]
- Cox DG, Kraft P. Quantification of the power of Hardy-Weinberg equilibrium testing to detect genotyping error. Hum Hered (2006) 61(1):10–14.[CrossRef][Web of Science][Medline]
- Carroll RJ, Ruppert D, Stefanski LA, et al. Measurement Error in Nonlinear Models. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC; 2006.
- Wacholder S, Armstrong B, Hartge P. Validation studies using an alloyed gold standard. Am J Epidemiol (1993) 137(11):1251–1258.
[Abstract/Free Full Text] - Kuchenhoff H, Mwalili SM, Lesaffre E. A general method for dealing with misclassification in regression: the misclassification SIMEX. Biometrics (2006) 62(1):85–96.[Medline]
- Wichmann HE, Gieger C, Illig T. KORA-gen—resource for population genetics, controls and a broad spectrum of disease phenotypes. Gesundheitswesen (2005) 67(suppl 1):S26–S30.[Web of Science][Medline]
- Schoenborn V, Heid IM, Vollmert C, et al. The ATGL gene is associated with free fatty acids, triglycerides, and type 2 diabetes. Diabetes (2006) 55(5):1270–1275.
[Abstract/Free Full Text] - Heid IM, Wagner SA, Gohlke H, et al. Genetic architecture of the APM1 gene and its influence on adiponectin plasma levels and parameters of the metabolic syndrome in 1,727 healthy Caucasians. Diabetes (2006) 55(2):375–384.
[Abstract/Free Full Text] - Ruckerl R, Greven S, Ljungman P, et al. Air pollution and inflammation (interleukin-6, C-reactive protein, fibrinogen) in myocardial infarction survivors. Environ Health Perspect (2007) 115(7):1072–1080.[Web of Science][Medline]
- Moskvina V, Craddock N, Holmans P, et al. Effects of differential genotyping error rate on the type I error probability of case-control studies. Hum Hered (2006) 61(1):55–64.[CrossRef][Web of Science][Medline]
- Hao K, Wang X. Incorporating individual error rate into association test of unmatched case-control design. Hum Hered (2004) 58(3-4):154–163.[CrossRef][Web of Science][Medline]
- Zou G, Pan D, Zhao H. Genotyping error detection through tightly linked markers. Genetics (2003) 164(3):1161–1173.
[Abstract/Free Full Text] - Kirk KM, Cardon LR. The impact of genotyping error on haplotype reconstruction and frequency estimation. Eur J Hum Genet. (2002) 10(10):616–622.[CrossRef][Web of Science][Medline]
- Becker T, Valentonyte R, Croucher PJ, et al. Identification of probable genotyping errors by consideration of haplotypes. Eur J Hum Genet. (2006) 14(4):450–458.[CrossRef][Web of Science][Medline]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




