American Journal of Epidemiology Advance Access originally published online on August 21, 2006
American Journal of Epidemiology 2006 164(8):794-804; doi:10.1093/aje/kwj269
Practice of Epidemiology
Candidate Single Nucleotide Polymorphism Selection using Publicly Available Tools: A Guide for Epidemiologists
1 Radiation Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Department of Health and Human Services, Bethesda, MD
2 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services, Bethesda, MD
3 National Institute on Drug Abuse, National Institutes of Health, Department of Health and Human Services, Bethesda, MD
4 Laboratory of Population Genetics, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Department of Health and Human Services, Bethesda, MD
Correspondence to Dr. Alice Sigurdson, Radiation Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Department of Health and Human Services, 6120 Executive Boulevard, EPS 7060, MSC 7238, Bethesda, MD 20892-7238 (e-mail: sigurdsa{at}mail.nih.gov).
Received for publication October 13, 2005. Accepted for publication March 31, 2006.
ABSTRACT
Single nucleotide polymorphisms (SNPs) are the most common form of human genetic variation, with millions present in the human genome. Because only 1% might be expected to confer more than modest individual effects in association studies, the selection of predictive candidate variants for complex disease analyses is formidable. Technologic advances in SNP discovery and the ever-changing annotation of the genome have led to massive informational resources that can be difficult to master across disciplines. A simplified guide is needed. Although methods for evaluating nonsynonymous coding SNPs are known, several other publicly available computational tools can be utilized to assess polymorphic variants in noncoding regions. As an example, the authors applied multiple methods to select SNPs in DNA double-strand break repair genes. They chose to evaluate SNPs that occurred among a preexisting set of 57 validated assays and to justify new assay development for 83 potential SNPs in the DNA-dependent protein kinase catalytic subunit. Of the 140 SNPs, the authors eliminated 119 variants with low or neutral predictions. The existing computational methods they used and the semiquantitative relative ranking strategy they developed can be adapted to a priori SNP selection or post hoc evaluation of variants identified in whole genome scans or within haplotype blocks associated with disease. The authors show a "real world" application of some existing bioinformatics tools for use in large epidemiologic studies and genetic analyses. They also reviewed alternative approaches that provide related information.
amino acid sequence; base sequence; epidemiologic methods; genetic predisposition to disease; polymorphism, single nucleotide
Abbreviations: DNAPKcs, catalytic subunit of the DNA protein kinase; ESE, exonic splicing enhancer; mRNA, messenger RNA; NCBI, National Center for Biotechnology Information; SIFT, sorting intolerant from tolerant; SNP, single nucleotide polymorphism; TFBS, transcription factor binding site; UCSC, University of California, Santa Cruz; UTR, untranslated region
INTRODUCTION
A major goal in molecular epidemiology and other disciplines is to identify elements of genetic variation that associate with a disease outcome of interest. Some human diseases are caused by rare mutations exhibiting Mendelian inheritance, but complex diseases such as cancer are thought to be polygenic disorders in which several variants with modest effects together produce significant disease risks for many individuals (1
T (11
Earlier reports have substantiated the efficacy of in silico tools for SNP selection or function prediction in coding regions (15–18) based on evolutionarily conserved sequence homology (19) or consideration of protein domains and structural features of the amino acid substitutions (20). These methods provide a means by which to select SNPs for testing in epidemiologic studies, particularly when only certain assays might be available from a core facility or vendor, or when only a limited number of SNPs can be assayed. In addition, if investigators wish to calculate the false-positive report probability as suggested by Wacholder et al. (21), this approach could be used for an objective assignment of the prior probability. These methods can also be adapted for use in hierarchical models, which have been suggested as a solution to problems of multiple inference and model uncertainty that occur in SNP association studies (22, 23). We acknowledge that other selection approaches could be used, such as linkage disequilibrium methods (haplotype tagging) (24), but here we focus on uncovering evidence for possible function among many potential candidate SNPs however they might have been identified. Selection of SNPs to assay based on relatively agnostic selection schemes such as linkage disequilibrium is inexact, and selection is limited to common variants. Although haplotypes may interrogate some level of "unobserved" variation, the methods (SNP/haplotype-tagging approaches and candidate selection) can be complementary. For example, instead of arbitrarily selecting among the set of variants that "tag" a group of correlated SNPs, one could select those most likely to be biologically relevant. Eventually, one needs to use both in silico and biochemical characterizations to confirm the biologically relevant variation when a signal is detected in a clinical-epidemiologic study.
Different in silico procedures are appropriate for coding SNPs (exonic nonsynonymous amino acid SNPs, exonic synonymous amino acid SNPs) and noncoding SNPs (intronic SNPs and exonic 5'- and 3'-untranslated region (UTR) SNPs). We describe our approach to SNP selection for coding and noncoding SNPs, with an emphasis on methods to evaluate noncoding SNPs, particularly since a "major challenge of future studies will be to identify and characterize regulatory variants" (6, p. 682). In addition, we review several alternative and sometimes equivalent approaches that provide similar or related information. Table 1 provides a summary of the resources that we reviewed. There may be other ways in which the bioinformatics tools described herein can be implemented for SNP selection; however, our purpose was to raise awareness of these tools and to show a practical example of their use.
MATERIALS AND METHODS
Identification of in silico tools
MEDLINE (US National Library of Medicine) search terms "SNP algorithm" and "SNP resource" were used to search for relevant publications. References listed in reviewed publications were also retrieved, and PubMed (US National Library of Medicine) searches were augmented by evaluating publications featured from "related articles." The search was last updated in February 2006. We restricted the search to English-language publications.
SNPs considered for evaluation
We were interested in genes involved with homologous recombination or nonhomologous end joining and identified them by use of recent literature reviews of DNA double-strand break repair (25–27). For illustration purposes, we limited our DNA repair SNP selection to those available in the US National Cancer Institute's SNP500Cancer database (http://snp500cancer.nci.nih.gov). Because the population to be genotyped was predominantly Caucasian, minor allele frequencies for Caucasians from the SNP500Cancer database were used, and we required their frequency to be greater than 3 percent for inclusion. Fifty-seven SNPs qualified for evaluation (Web table 1). (This information is described in the first of two supplementary tables; each is referred to as "Web table" in the text and is posted on the Journal's website (http://aje.oxfordjournals.org/).) Because an important double-strand break repair gene in the nonhomologous end-joining pathway, the catalytic subunit of the DNA protein kinase (DNAPKcs), was not contained in the SNP500Cancer list, we included 83 SNPs in DNAPKcs that exceeded a minor allele frequency of 3 percent (based on information from the US National Institute of Environmental Health Sciences' website, http://egp.gs.washington.edu/) so that assay development requests could be made for high ranking gene variants. Ranking was based on in silico database information that was available as of August 28, 2005, and using human genome build 35 (http://www.ncbi.nlm.nih.gov/genome/guide/human/release_notes.html). Accessing and navigating the databases are described in the following sections.
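The frequency criterion described above amounts to a simple filter. The following Python sketch illustrates it under stated assumptions: the record layout, the placeholder SNP and gene names, and the frequencies are hypothetical and are not drawn from the SNP500Cancer or Environmental Genome Project data; only the 3 percent cutoff comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    snp_id: str               # hypothetical identifier (placeholder, not a real rs number)
    gene: str                 # hypothetical gene label
    minor_allele_freq: float  # proportion, so 0.03 means 3 percent


MAF_THRESHOLD = 0.03  # the 3 percent cutoff stated in the text


def eligible(snps, threshold=MAF_THRESHOLD):
    """Keep SNPs whose minor allele frequency is strictly greater than the threshold."""
    return [s for s in snps if s.minor_allele_freq > threshold]


# Made-up records for illustration only.
candidates = [
    Snp("snp_A", "GENE1", 0.25),
    Snp("snp_B", "GENE1", 0.03),   # exactly at the cutoff, so excluded
    Snp("snp_C", "GENE2", 0.041),
]

selected = eligible(candidates)
print([s.snp_id for s in selected])
```

Because the text requires a frequency "greater than" 3 percent, the comparison is strict and a SNP at exactly 3 percent is dropped.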
Location and downloading of the sequence containing the SNP of interest
The annotated sequence is required to determine the functional class (e.g., intron, exon, exonic UTR) of the SNP so that appropriate in silico tools can be selected for analyses. Nucleotide coordinates (numerical position of a nucleotide in a gene sequence) are also obtained from this sequence so that SNPs can be located in the output generated by in silico tools.
There are three major sources of annotated gene sequences: National Center for Biotechnology Information (NCBI) Entrez Gene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene), University of California, Santa Cruz (UCSC), Genome Bioinformatics (http://genome.ucsc.edu/), and Ensembl (http://www.ensembl.org/). Gene annotations from these sources may differ. Thus, in silico analyses may produce varied results depending on the source of annotated gene sequences that is utilized. For our analyses, we used NCBI annotations. Refer to a supplementary document available on the Journal's website (http://aje.oxfordjournals.org/) for instructions to obtain NCBI annotated sequences and SNP information.
For those preferring a more visual or graphical approach, resources such as Genewindow (http://genewindow.nci.nih.gov/Welcome) and GeneSNPs (http://www.genome.utah.edu/genesnps/) are available. By use of NCBI and UCSC gene annotations, respectively, Genewindow and GeneSNPs provide simplified representations of gene sequences with clear delineations of SNP, intron, exon, and exonic UTR locations (28).
Certain tools examining nonsynonymous amino acid changes require FASTA-formatted amino acid or nucleotide sequences (FASTA is a standard formatting of sequences with a single line of description followed by lines of sequence data). A supplementary document providing further details on methods to download NCBI annotated sequences in FASTA format is available on the Journal's website (http://aje.oxfordjournals.org/).
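To make the FASTA layout concrete, the following Python sketch reads FASTA-formatted text into description and sequence pairs. It is a minimal illustration rather than the procedure described in the supplementary document; the function name `parse_fasta` and the example record are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical amino acid record, for illustration only.
example = """>example_protein hypothetical record for illustration
MKTAYIAKQR
QISFVKSHFS
"""

records = parse_fasta(example)
print(records)
```

The single description line begins with ">" and everything up to the next ">" is joined into one sequence string.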
Evaluating coding SNPs
Coding SNPs are located in the exons of genes. Sets of three nucleotides in these exons (codons) "code" for the amino acids that are used to build proteins. A SNP may change (nonsynonymous) or not change (synonymous) the amino acid, with functional consequences more and less likely, respectively. If the SNP is in an exonic portion of a splice junction, then a noncoding intron may be retained (where an intron is not spliced out of the transcribed sequence), or the exon may be skipped (where the exon is removed from a transcribed sequence) (29). This may result in the loss or gain of amino acids in the protein or result in an unstable messenger RNA (mRNA) transcript. Nonsynonymous and synonymous coding SNPs may also occur within an exonic splicing enhancer (ESE). ESEs are nucleotide regions where various components of the splicing machinery localize to splice the pre-mRNA (30). As with splice site SNPs, those within ESEs may result in intron retention or exon skipping.
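The distinction between synonymous and nonsynonymous changes can be illustrated with a short Python sketch that translates a codon before and after a single-base substitution using the standard genetic code. The function name and the example codons are hypothetical and chosen only for illustration; they are not taken from the annotated sequences used in this study.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon, position, alt_base):
    """Classify a single-base change within one codon.

    codon: three-letter reference codon (DNA alphabet, coding strand)
    position: 0-based index of the changed base within the codon
    alt_base: the variant base
    Returns (class, reference amino acid, variant amino acid).
    """
    ref_aa = CODON_TABLE[codon]
    alt_codon = codon[:position] + alt_base + codon[position + 1:]
    alt_aa = CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "nonsynonymous"
    return kind, ref_aa, alt_aa


# A threonine codon changed to a methionine codon (a T-to-M type substitution).
print(classify_coding_snp("ACG", 1, "T"))
# A third-position change that leaves glycine unchanged.
print(classify_coding_snp("GGT", 2, "C"))
```

A change to a stop codon ("*") is also reported as nonsynonymous by this sketch; a fuller treatment would flag it separately.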
We used Genewindow's visual delineation of intron-exon boundaries to determine whether a SNP occurred within the exonic portion of splice junction sequences. We considered the occurrence of a SNP within an identified splice junction sequence as strong evidence for potential functional impact.
Web-based resources of known or predicted ESEs may be used to evaluate whether a SNP is located in an ESE. We used two such resources, ESEfinder (http://rulai.cshl.edu/tools/ESE/) (31) and RESCUE-ESE (http://genes.mit.edu/burgelab/rescue-ese/) (32). Refer to the online supplementary material for a description of ESEfinder and RESCUE-ESE. For us to consider the SNP potentially functional, both ESEfinder and RESCUE-ESE had to identify the SNP of interest as lying within a possible ESE sequence. We were more stringent for ESEs because these sequences are relatively short (6–7 nucleotides) and thus are more likely to occur at random throughout the genome.
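The reasoning that short motifs are likely to occur by chance can be made concrete with a back-of-the-envelope calculation. The Python sketch below is a simplification under stated assumptions: it considers exact matches to one fixed motif in a sequence whose four bases are independent and equally frequent, whereas the ESE resources use weight matrices or sets of motifs. The function name and the 10,000-nucleotide sequence length are hypothetical.

```python
def expected_chance_matches(motif_length, sequence_length):
    """Expected number of exact matches to one fixed motif in a random sequence.

    Assumes independent, equally frequent bases, so each start position
    matches with probability 1 / 4**motif_length.
    """
    positions = max(sequence_length - motif_length + 1, 0)
    return positions / 4 ** motif_length


# Compare short motifs (6 and 7 nucleotides) with a much longer one.
for k in (6, 7, 20):
    print(k, expected_chance_matches(k, 10_000))
```

Under these assumptions, each additional nucleotide in the motif reduces the expected number of chance matches roughly fourfold, which is why agreement between two independent resources was required for short elements.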
There are many Web-based resources available that allow one to predict whether nonsynonymous coding SNPs may have functional effects on proteins. We chose two complementary algorithms for functional impact prediction of nonsynonymous coding SNPs: sorting intolerant from tolerant (SIFT) (http://blocks.fhcrc.org/sift/SIFT.html) and polymorphism phenotyping (PolyPhen) (http://www.bork.embl-heidelberg.de/PolyPhen/) (19, 20). Refer to the online supplementary materials for descriptions of SIFT and PolyPhen and for four other programs (PARSESNP (http://www.proweb.org/parsesnp), SNPeffect (http://snpeffect.vib.be/), LS-SNP (http://alto.compbio.ucsf.edu/LS-SNP/), and PMUT (http://mmb2.pcb.ub.es:8080/PMut/)) that also evaluate nonsynonymous coding SNPs.
We based our interpretation of SIFT and PolyPhen scores on the previously published criteria of Xi et al. (17) that allow for a functional range more relevant to disease susceptibility (table 2). Moreover, as recommended by Xi et al., we altered the SIFT default sequence identity cutoff of 90 percent (meaning that homologous sequences more than 90 percent identical were excluded from analysis) to 95 percent, because DNA repair genes have a high level of evolutionary conservation. Because SIFT and PolyPhen evaluate slightly different predictive in silico aspects of protein function, we considered the strongest evidence for protein function to have occurred when SIFT and PolyPhen interpretations "agreed" when ranking nonsynonymous SNPs.
Evaluating noncoding SNPs
SNPs affecting transcription processing (e.g., splice site recognition) may also occur within the intronic portion of a splice site or within the 5'- or 3'-UTR of an exon. Introns and the regions around the exons that do not code for proteins are also important, because they contain sequences that dictate other attributes of how the protein is made. Introns, exonic UTRs, and noncoding regions upstream and downstream of genes are known to contain various regulatory elements important for transcription and translation (33
Continued advancements in the annotation of the genome have made it clear that noncoding regions are far from unimportant. A powerful tool that compares similar genetic regions in multiple animal species may uncover nearly identical sequences (homology) that could represent transcriptional elements or previously uncharacterized coding regions. For example, a region considered noncoding could harbor a "new" exon and, thus, a noncoding SNP could be "converted" to a coding SNP.
As with exonic SNPs, we used Genewindow's visual delineation of intron-exon boundaries to determine whether a SNP occurred within the intronic portion of splice junction sequences. We considered the occurrence of a SNP within an identified splice junction sequence as strong evidence for likely functional impact.
By comparing gene sequences between species that diverged approximately 40–80 million years ago, noncoding regions with potential regulatory function (or undiscovered exons) can be identified, because conserved elements are most likely to have been actively retained due to functional constraints rather than a lack of divergence time (35). PipMaker (http://pipmaker.bx.psu.edu/cgi-bin/pipmaker?advanced), Vista (http://genome.lbl.gov/vista/index.shtml), and the evolutionary conserved regions (ECR) browser (http://ecrbrowser.dcode.org/) are Web-based programs that can be used to align genomic sequences from different species and to determine conservation between gap-free segments of those sequences (36–38).
We used PipMaker to discover regions of conservation. Because false-positive findings are reduced as more species are examined, we conducted alignments using both mouse and rat. A detailed description of PipMaker is in the online supplementary materials. We used the criteria of Frazer et al. (35) for "strong-hits" (regions ≥100 base pairs in length with ≥70 percent sequence similarity) to evaluate the PipMaker output and to determine the likely functional impact of the SNP. We ranked the SNP as high impact if it was located in one of these "strong-hits" regions in both mouse and rat alignments. We assigned a lower ranking for functional impact if a SNP was found in any region with greater than 50 percent identity in both mouse and rat. If a SNP was present in a region with less than 50 percent identity in either mouse or rat, then we classified the SNP as without likely functional impact.
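The conservation rules described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the exact procedure applied: the class and function names are hypothetical, the thresholds of at least 100 base pairs and at least 70 percent are the "strong-hits" criteria, and a region at exactly 50 percent identity, which the text does not address, is treated here as not conserved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedRegion:
    """Gap-free aligned segment containing the SNP (hypothetical layout)."""
    length_bp: int
    percent_identity: float


def is_strong_hit(region):
    """Strong hit: at least 100 base pairs with at least 70 percent identity."""
    return region.length_bp >= 100 and region.percent_identity >= 70


def conservation_rank(mouse, rat):
    """Rank a SNP from its human-mouse and human-rat aligned regions."""
    if is_strong_hit(mouse) and is_strong_hit(rat):
        return "high"
    # Assumption: exactly 50 percent is treated as below the cutoff.
    if mouse.percent_identity > 50 and rat.percent_identity > 50:
        return "lower"
    return "none"


# Made-up alignment values for illustration only.
print(conservation_rank(AlignedRegion(150, 82), AlignedRegion(120, 75)))
print(conservation_rank(AlignedRegion(80, 90), AlignedRegion(120, 75)))
print(conservation_rank(AlignedRegion(150, 82), AlignedRegion(60, 45)))
```

Requiring the criterion to hold in both species reflects the stated aim of reducing false-positive findings as more species are examined.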
Databases of known or predicted transcription regulatory regions can be used to complement evolutionary conservation analyses in that a SNP within a conserved region corresponding to a known transcription factor binding site (TFBS) constitutes additional evidence for functional impact. Although there are several tools to assess a TFBS (refer to the supplementary online section for rVISTA (http://rvista.dcode.org), multiTF (http://multitf.dcode.org/), Promolign (http://polly.wustl.edu/promolign/main.html), HGVbase (http://hgvbase.cgb.ki.se), and MAPPER (http://mapper.chip.org/)), we used the public version of the Web-based program Match to search introns and upstream and downstream noncoding regions to locate transcription regulatory regions (http://www.gene-regulation.com/pub/programs.html#match) (39). Refer to the online supplementary material for a description of Match. Because transcription regulatory regions are usually short nucleotide sequences (4–8 base pairs) that may occur at random, Match findings alone were not considered as evidence for SNP functional impact.
As with ESEs, Web-based resources of known or predicted UTRs may be used to identify potential UTRs in which a SNP may be located. We used UTRscan (http://www.ba.itb.cnr.it/BIG/UTRScan/) (40) and the Istituto Technologie Biomediche BLAST (http://www.ba.itb.cnr.it/BIG/Blast/BlastUTR.html) to locate UTRs. UTRscan uses the UTRsite collection that contains regulatory elements in UTRs whose function and structure have been experimentally determined and published, while BLAST utilizes the UTRdb collection that contains UTR eukaryotic mRNA sequences derived from several sources of data (41). Similar to our ESE analysis, both UTRscan and BLAST had to identify SNPs as residing in regulatory sequences to be considered potentially functional.
Semiquantitative relative ranking procedures
We considered multiple ways a SNP could impact function, such that multiple rankings could be generated for a single SNP. For the present study, we categorized each SNP with the highest assigned ranking from among all the applicable in silico assessments by which that individual SNP could affect function. Figure 1 shows the four category-ranking processes we used: category "1" SNPs have the most potential for functional significance, and category "4" SNPs have the least potential for functional significance.
As shown in figure 1, if a synonymous coding SNP occurred within a splice site, it was ranked "1" or high likelihood of functional impact. If both ESEfinder and RESCUE-ESE indicated that the SNP occurred within a potential ESE, the SNP was ranked "2" (even with agreement between ESEfinder and RESCUE-ESE, the potential for false positives remains high because of the short length of ESEs). Splice site and ESE functional impact rankings for nonsynonymous coding SNPs were identical to the synonymous coding SNPs. For amino acid substitutions, figure 1 shows how we ranked the various combinations of SIFT and PolyPhen results. Also displayed in figure 1, UTRscan and BLAST results were treated analogously to results from ESE analyses. Intronic SNPs occurring within a splice site were placed in category "1" regardless of the PipMaker score within the region (figure 1). For the nonsplicing, noncoding SNPs, the ranking depended mainly on the PipMaker score and could have been modified by Match analysis results. However, as mentioned previously, Match results on their own were considered as weak evidence for SNP functional impact.
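The rule of assigning each SNP its highest ranking across all applicable assessments can be sketched in Python as follows. Only the rules stated in the text are encoded (a splice site gives category 1; agreement between ESEfinder and RESCUE-ESE gives category 2); the categories for SIFT, PolyPhen, and PipMaker combinations are defined in figure 1 and are passed in as plain numbers here. The function names are hypothetical, and treating a SNP with no supporting evidence as category 4 is an assumption made for this illustration.

```python
def splice_site_category(in_splice_site):
    """Category 1 if the SNP lies within a splice site, otherwise no evidence."""
    return 1 if in_splice_site else None


def ese_category(esefinder_hit, rescue_ese_hit):
    """Category 2 only when both ESE resources place the SNP within an ESE."""
    return 2 if (esefinder_hit and rescue_ese_hit) else None


def overall_category(categories, default=4):
    """Return the highest ranking, that is, the numerically smallest category.

    None entries mean an assessment gave no evidence or did not apply.
    Assumption: a SNP with no evidence at all receives the default category 4.
    """
    applicable = [c for c in categories if c is not None]
    return min(applicable) if applicable else default


# Hypothetical SNP: not in a splice site, ESE agreement, and another
# assessment (for example, a conservation analysis) placing it in category 3.
print(overall_category([splice_site_category(False), ese_category(True, True), 3]))
```

Because category "1" denotes the most potential for functional significance, the highest ranking corresponds to the minimum category number among the applicable assessments.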
Comprehensive in silico tools
Several of the in silico tools we used were not originally designed for SNP evaluation. Thus, applying these methods was time consuming. An easily accessible, automated, and intuitive system incorporating the various tools would be invaluable. PolyMAPr (available upon request from the authors) and PupasView (http://pupasview.bioinfo.ochoa.fib.es/) integrated many of the in silico methods we described and can assist in prioritizing SNPs for disease association analyses (42, 43). PolyMAPr mines multiple public databases to obtain current data. The SNPs identified during this "mining procedure" are then evaluated for their functional effects by use of various third-party programs such as PolyPhen and ESEfinder. PolyMAPr localizes SNPs in the potential TFBS or splice sites by utilizing the JASPAR and the ASD (Alternative Splicing Database) databases, respectively (44, 45). However, the use of ESEfinder, PolyPhen, JASPAR, and ASD on their own may result in false-positive findings, as we noted previously. The addition of programs for evolutionary conservation analyses for noncoding and coding regions (e.g., PipMaker and SIFT), as well as programs to examine exonic UTRs (e.g., UTRscan), would increase the versatility of PolyMAPr. The program also involves several complexities, such as the manual construction and submission of gene annotation files. Nevertheless, PolyMAPr combines several features of our own searches within one package and can be customized by adding program modules (42).
PupasView, an Ensembl-based tool, provides precompiled analyses that examined if SNPs were located in potential splice sites, ESE sequences (using weight matrices similar to ESEfinder), TFBSs (using Match), human/rat conserved regions (using BLASTZ), and triplex-forming oligonucleotide sequences. PupasView also provides precompiled analysis of the potential pathological effects of nonsynonymous amino acid changes. These various analyses can be combined with each other and population frequency and linkage disequilibrium data such that users can select "optimal" SNPs to assay. For instance, one can screen for ESE and TFBS SNPs that occur in "highly conserved" human/rat sequences. Although PupasView is very easy to use, it does not provide raw output data for interpretation; the user must rely on interpretations based on the program's predefined criteria. Furthermore, when analyzing splice site SNPs, PupasView considers only the single intronic and exonic nucleotides forming the splice junction; it does not consider the surrounding consensus splice site nucleotides. PupasView also does not use Match for downstream regulatory regions nor does it use UTR resources to analyze UTR regulatory regions. Because PupasView was so easy to use, we compared its results with those of our more "manual" method.
There are two additional comprehensive SNP selection tools: SNPselector (http://primer.duhs.duke.edu/ (currently unavailable)) (uses gene sequences from Ensembl) and TAMAL (http://neoref.ils.unc.edu/tamal/) (uses gene sequences from UCSC Bioinformatics) (46, 47). The primary focus of these tools is in the selection of SNPs based on linkage disequilibrium (haplotype-tagging) methods. Both resources, however, also identify SNPs that alter splice sites and that occur in potential regulatory regions using a variety of methods including multiple species alignments (refer to online supplementary materials for further details). As with PupasView, raw outputs of these analyses are not provided for individualized interpretation. Furthermore, neither resource considers SNPs in potential ESEs. SNPselector does not include algorithms to predict functionality of nonsynonymous coding SNPs; however, TAMAL uses the limited LS-SNP-based predictions for nonsynonymous coding SNPs.
RESULTS
For SNPs that occurred in pathways of interest among a preexisting set of 57 validated SNP assays (Web table 1), we found 14 SNPs that were highly ranked (either a "1" or "2"; Web table 2). We identified seven with possible functional consequence for new assay development out of 83 potential SNPs in DNAPKcs (Web tables 1 and 2). We eliminated 119 of the 140 variants with low or neutral predictions based on these in silico methods. UTRscan and BLAST did not identify any UTR regulatory regions; thus, UTRscan and BLAST results are not provided in Web table 2. Table 3 is an excerpt from Web table 2, providing results for the breast cancer 1, early onset, gene BRCA1 and the breast cancer 2, early onset, gene BRCA2 as examples.
One SNP, the x-ray repair cross-complementing gene XRCC3 T241M (rs861539), was placed in category "3" based on in silico assessment, but this is one of the few SNPs with assays in previous studies that have demonstrated a functional effect (48
Web table 2 (refer to table 3 for examples with BRCA1 and BRCA2) also displays results from PupasView analyses of the SNPs. Of the four SNPs that we found to potentially affect splice sites, PupasView identified two. PupasView identified 13 SNPs that potentially alter ESEs. When restricting the query to those ESE SNPs that lie within highly conserved regions, we found that all 13 SNPs met the criteria. Of these 13 SNPs, our methodology identified three (ranked "2"). In all 13 instances, however, either ESEfinder or RESCUE-ESE identified the SNPs occurring within ESEs. PupasView identified three nonsynonymous amino acid SNPs as "pathological"; our methodology identified 10 SNPs in addition to these three (ranked "1" or "2"). Of the seven SNPs that PupasView identified as occurring within highly conserved regions, we found two in highly conserved regions (≥70 percent similarity and ≥100 nucleotides in mouse and rat), three in medium conserved regions (<70 percent and/or <100 nucleotides and ≥50 percent similarity in mouse or rat), and two in low conserved regions (<50 percent similarity in mouse and rat). There was one SNP that we found to be in a highly conserved region that PupasView considered to be in a conserved region.
DISCUSSION
We have described an approach to address a vexing issue in candidate SNP studies, namely, prioritizing SNPs with the highest likelihood of being functionally relevant and therefore most important to interrogate. None of the methods described herein are new, but their utility is somewhat underappreciated outside the realm of bioinformatics and genetics. Although we do not explicitly incorporate haplotype-tagging SNPs or selection of SNPs based on linkage disequilibrium methods (24
The ranking scheme that we developed to evaluate the in silico information is semiquantitative and primarily intended for use in the absence of biochemical characterization. We found functional data for the XRCC3 T241M (rs861539) SNP, but our ranking cutoff did not include this SNP among the group to be evaluated in our epidemiologic study. It is possible that this SNP, while itself having a low ranking, is near a more functional but undiscovered SNP. Future experimentation may show our "high ranking" SNPs to have functional effects.
PupasView is an easier to use alternative to the laborious methodology that we have outlined. Discrepancies between our results and the results from PupasView were, in part, due to differences in the annotated sequences used (NCBI vs. Ensembl) and inherent differences in the resources used (e.g., SIFT and PolyPhen vs. PMUT). For the most part, however, discrepancies were due to differences in the stringency of the analyses. For instance, PupasView utilized a single resource for ESE SNP identification; we utilized two resources. PupasView utilized PMUT's default cutoffs for the identification of pathological nonsynonymous amino acid SNPs; we utilized cutoffs for SIFT and PolyPhen deemed appropriate for epidemiologic studies (17). Although it is difficult to determine what criteria or cutoffs are optimal for choosing SNPs, a strength of our methodology is that individual users can define their own. Our methodology can be used to prioritize SNPs that have not been validated or are new. In Web table 2, there are two SNPs that could not be examined by PupasView because they have not yet been assigned Entrez SNP identification numbers ("rs numbers"). This is a major limitation with precompiled resources; the user must wait until the resource has been updated to analyze recently discovered SNPs.
We sought to simplify and practically apply in silico queries of the existing bioinformatic databases to rank SNPs in importance for large epidemiologic studies and genetic analyses. We presented a methodology born from the necessity to prioritize SNP selection in our own studies; individual investigators may choose to utilize alternative tools or to interpret the output from these tools in an alternative manner. The existing in silico methods that we used and the semiquantitative relative ranking strategy that we developed can also be adapted by any investigator to a priori SNP selection or post hoc evaluation of variants identified in whole-genome scans or within haplotype blocks associated with disease. Continued improvement of computer programs that eliminate the tedious nature of the in silico searches would vastly improve the accessibility and use of these genetics databases for epidemiologists and other researchers.
ACKNOWLEDGMENTS
This research was supported in part by the Intramural Research Program of the Division of Cancer Epidemiology and Genetics and the Center for Cancer Research, National Cancer Institute, and by the National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Department of Health and Human Services.
The authors would like to thank Dr. Andrew W. Bergen for his thoughtful comments on the manuscript.
Conflict of interest: none declared.
REFERENCES
1. Collins FS, Guyer MS, Chakravarti A. Variations on a theme: cataloging human DNA sequence variation. Science 1997;278:1580–1.
2. Lander ES. The new genomics: global views of biology. Science 1996;274:536–9.
3. Mohrenweiser HW. Genetic variation and exposure related risk estimation: will toxicology enter a new era? DNA repair and cancer as a paradigm. Toxicol Pathol 2004;32(suppl 1):136–45.
4. Cargill M, Altshuler D, Ireland J, et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat Genet 1999;22:231–8.
5. Pharoah PD, Dunning AM, Ponder BA, et al. Association studies for finding cancer-susceptibility genetic variants. Nat Rev Cancer 2004;4:850–60.
6. Rebbeck TR, Ambrosone CB, Bell DA, et al. SNPs, haplotypes, and cancer: applications in molecular epidemiology. Cancer Epidemiol Biomarkers Prev 2004;13:681–7.
7. Johnson GC, Esposito L, Barratt BJ, et al. Haplotype tagging for the identification of common disease genes. Nat Genet 2001;29:233–7.
8. Ferrer-Costa C, Orozco M, de la Cruz X. Use of bioinformatics tools for the annotation of disease-associated mutations in animal models. Proteins 2005;61:878–87.
9. Siepel A, Bejerano G, Pedersen JS, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 2005;15:1034–50.
10. Chamary JV, Hurst LD. Biased codon usage near intron-exon junctions: selection on splicing enhancers, splice-site recognition or something else? Trends Genet 2005;21:256–9.
11. Duan J, Wainwright MS, Comeron JM, et al. Synonymous mutations in the human dopamine receptor D2 (DRD2) affect mRNA stability and synthesis of the receptor. Hum Mol Genet 2003;12:205–16.
12. Costa LG, Vitalone A, Cole TB, et al. Modulation of paraoxonase (PON1) activity. Biochem Pharmacol 2005;69:541–50.
13. Rantala M, Silaste ML, Tuominen A, et al. Dietary modifications and gene polymorphisms alter serum paraoxonase activity in healthy women. J Nutr 2002;132:3012–17.
14. Kim VN, Nam JW. Genomics of microRNA. Trends Genet 2006;22:165–73.
15. Rebbeck TR, Martinez ME, Sellers TA, et al. Genetic variation and cancer: improving the environment for publication of association studies. Cancer Epidemiol Biomarkers Prev 2004;13:1985–6.
16. Savas S, Kim D, Ahmad M, et al. Identifying functional genetic variants in DNA repair pathway using protein conservation analysis. Cancer Epidemiol Biomarkers Prev 2004;13:801–7.
17. Xi T, Jones IM, Mohrenweiser HW. Many amino acid substitution variants identified in DNA repair genes during human population screenings are predicted to impact protein function. Genomics 2004;83:970–9.
18. Zhu Y, Spitz MR, Amos CI, et al. An evolutionary perspective on single-nucleotide polymorphism screening in molecular cancer epidemiology. Cancer Res 2004;64:2251–7.
19. Ng PC, Henikoff S. Predicting deleterious amino acid substitutions. Genome Res 2001;11:863–74.
20. Ramensky V, Bork P, Sunyaev S. Human non-synonymous SNPs: server and survey. Nucleic Acids Res 2002;30:3894–900.
[Abstract/Free Full Text] - Wacholder S, Chanock S, Garcia-Closas M, et al. Assessing the probability that a positive report is false: an approach for molecular epidemiology studies. J Natl Cancer Inst 2004;96:43442.
[Abstract/Free Full Text] - Thomas DC. The need for a systematic approach to complex pathways in molecular epidemiology. Cancer Epidemiol Biomarkers Prev 2005;14:5579.
[Free Full Text] - Witte JS. Genetic analysis with hierarchical models. Genet Epidemiol 1997;14:113742.[CrossRef][Web of Science][Medline]
- Stram DO. Tag SNP selection for association studies. Genet Epidemiol 2004;27:36574.[CrossRef][Web of Science][Medline]
- Dudas A, Chovanec M. DNA double-strand break repair by homologous recombination. Mutat Res 2004;566:13167.[CrossRef][Web of Science][Medline]
- Jackson SP. Sensing and repairing DNA double-strand breaks. Carcinogenesis 2002;23:68796.
[Abstract/Free Full Text] - Lees-Miller SP, Meek K. Repair of DNA double strand breaks by non-homologous end joining. Biochimie 2003;85:116173.[Medline]
- Staats B, Qi L, Beerman M, et al. Genewindow: an interactive tool for visualization of genomic variation. Nat Genet 2005;37:10910.[CrossRef][Web of Science][Medline]
- Strachan T, Read AP. Human molecular genetics 2. 2nd ed. Oxford, United Kingdom: BIOS Scientific Publishers, 1999.
- Blencowe BJ. Exonic splicing enhancers: mechanism of action, diversity and role in human genetic diseases. Trends Biochem Sci 2000;25:10610.[CrossRef][Web of Science][Medline]
- Cartegni L, Wang J, Zhu Z, et al. ESEfinder: a Web resource to identify exonic splicing enhancers. Nucleic Acids Res 2003;31:356871.
[Abstract/Free Full Text] - Fairbrother WG, Yeo GW, Yeh R, et al. RESCUE-ESE identifies candidate exonic splicing enhancers in vertebrate exons. Nucleic Acids Res 2004;32(Web server issue):W18790.
[Abstract/Free Full Text] - Mattick JS. Introns: evolution and function. Curr Opin Genet Dev 1994;4:82331.[CrossRef][Medline]
- Sonenberg N. mRNA translation: influence of the 5' and 3' untranslated regions. Curr Opin Genet Dev 1994;4:31015.[CrossRef][Medline]
- Frazer KA, Elnitski L, Church DM, et al. Cross-species sequence comparisons: a review of methods and available resources. Genome Res 2003;13:112.
[Abstract/Free Full Text] - Mayor C, Brudno M, Schwartz JR, et al. VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics 2000;16:10467.
[Abstract/Free Full Text] - Ovcharenko I, Nobrega MA, Loots GG, et al. ECR browser: a tool for visualizing and accessing data from comparisons of multiple vertebrate genomes. Nucleic Acids Res 2004;32(Web server issue):W2806.
[Abstract/Free Full Text] - Schwartz S, Zhang Z, Frazer KA, et al. PipMakera Web server for aligning two genomic DNA sequences. Genome Res 2000;10:57786.
[Abstract/Free Full Text] - Kel AE, Gossling E, Reuter I, et al. MATCH: a tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res 2003;31:35769.
[Abstract/Free Full Text] - Pesole G, Liuni S. Internet resources for the functional analysis of 5' and 3' untranslated regions of eukaryotic mRNAs. Trends Genet 1999;15:378.[CrossRef][Web of Science][Medline]
- Mignone F, Grillo G, Licciulli F, et al. UTRdb and UTRsite: a collection of sequences and regulatory motifs of the untranslated regions of eukaryotic mRNAs. Nucleic Acids Res 2005;33(database issue):D1416.
[Abstract/Free Full Text] - Freimuth RR, Stormo GD, McLeod HL. PolyMAPr: programs for polymorphism database mining, annotation, and functional analysis. Hum Mutat 2005;25:11017.[CrossRef][Web of Science][Medline]
- Conde L, Vaquerizas JM, Ferrer-Costa C, et al. PupasView: a visual tool for selecting suitable SNPs, with putative pathological effect in genes, for genotyping purposes. Nucleic Acids Res 2005;33(Web server issue):W5015.
[Abstract/Free Full Text] - Sandelin A, Alkema W, Engstrom P, et al. JASPAR: an open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res 2004;32(database issue):D914.
[Abstract/Free Full Text] - Thanaraj TA, Stamm S, Clark F, et al. ASD: the alternative splicing database. Nucleic Acids Res 2004;32(database issue):D649.
[Abstract/Free Full Text] - Hemminger BM, Saelim B, Sullivan PF. TAMAL: an integrated approach to choosing SNPs for genetic studies of human complex traits. Bioinformatics 2006;22:6267.
[Abstract/Free Full Text] - Xu H, Gregory SG, Hauser ER, et al. SNPselector: a Web tool for selecting SNPs for genetic association studies. Bioinformatics 2005;21:41816.
[Abstract/Free Full Text] - Aka P, Mateuca R, Buchet JP, et al. Are genetic polymorphisms in OGG1, XRCC1 and XRCC3 genes predictive for the DNA strand break repair phenotype and genotoxicity in workers exposed to low dose ionising radiations? Mutat Res 2004;556:16981.[Web of Science][Medline]
- Au WW, Salama SA, Sierra-Torres CH. Functional characterization of polymorphisms in DNA repair genes using cytogenetic challenge assays. Environ Health Perspect 2003;111:184350.[Web of Science][Medline]
- Figueiredo JC, Knight JA, Briollais L, et al. Polymorphisms XRCC1-R399Q and XRCC3-T241M and the risk of breast cancer at the Ontario site of the Breast Cancer Family Registry. Cancer Epidemiol Biomarkers Prev 2004;13:58391.
[Abstract/Free Full Text] - Han J, Colditz GA, Samson LD, et al. Polymorphisms in DNA double-strand break repair genes and skin cancer risk. Cancer Res 2004;64:300913.
[Abstract/Free Full Text] - Jacobsen NR, Nexo BA, Olsen A, et al. No association between the DNA repair gene XRCC3 T241M polymorphism and risk of skin cancer and breast cancer. Cancer Epidemiol Biomarkers Prev 2003;12:5845.
[Free Full Text] - Kuschel B, Auranen A, McBride S, et al. Variants in DNA double-strand break repair genes and breast cancer susceptibility. Hum Mol Genet 2002;11:1399407.
[Abstract/Free Full Text] - Smith TR, Levine EA, Perrier ND, et al. DNA-repair genetic polymorphisms and breast cancer risk. Cancer Epidemiol Biomarkers Prev 2003;12:12004.
[Abstract/Free Full Text] - Smith TR, Miller MS, Lohman K, et al. Polymorphisms of XRCC1 and XRCC3 genes and susceptibility to breast cancer. Cancer Lett 2003;190:18390.[CrossRef][Web of Science][Medline]
- Webb PM, Hopper JL, Newman B, et al. Double-strand break repair gene polymorphisms and risk of breast or ovarian cancer. Cancer Epidemiol Biomarkers Prev 2005;14:31923.
[Abstract/Free Full Text] - Carlson CS, Eberle MA, Rieder MJ, et al. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am J Hum Genet 2004;74:10620.[CrossRef][Web of Science][Medline]
- Hinds DA, Stuve LL, Nilsen GB, et al. Whole-genome patterns of common DNA variation in three human populations. Science 2005;307:10729.
[Abstract/Free Full Text]