ASSESSING THE FUNCTION OF GENETIC VARIANTS IN CANDIDATE GENE ASSOCIATION STUDIES
Timothy R. Rebbeck*, Margaret Spitz‡ and Xifeng Wu‡
Knowledge of inherited genetic variation has a fundamental impact on understanding human disease. Unfortunately, our understanding of the functional significance of many inherited genetic variants is limited. New approaches to assessing functional significance of inherited genetic variation, which combine molecular genetics, epidemiology and bioinformatics, promise to enhance reproducibility and plausibility of associations between genotypes and disease.
*Department of Biostatistics and Epidemiology, and Abramson Cancer Center, University of Pennsylvania School of Medicine, 904 Blockley Hall, 423 Guardian Drive, Philadelphia, Pennsylvania 19104, USA.
‡Department of Epidemiology, M. D. Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, Texas 77030, USA.
Correspondence to T.R.R. e-mail: trebbeck@
The human genome contains about ten million SNPs, with an estimated two common missense variants per gene1. At least five million SNPs have already been reported in public databases2,3. Many of these variants might be involved in human disease aetiology, but it is often difficult to assess their function on the basis of nucleotide sequence alone. This is particularly true when variants do not alter an amino acid or do not disrupt a well-characterized motif that affects protein function or structure. In addition, only a small subset of variants that affect the phenotype will confer small to moderate effects on phenotypes that are causally related to disease risk. So, an important challenge that faces molecular epidemiological association studies of candidate disease-susceptibility genes is to define the variants that are functionally implicated in disease. This is particularly urgent because the amount of genomic information that is available greatly exceeds the information about the function of variants that are used in human disease studies. There is insufficient guidance for molecular epidemiologists to optimally select variants for an epidemiology study, highlighting the need for methods that prioritize the choice of genetic variants to be genotyped in molecular epidemiological studies4. Here, we bring together examples of experimental, population genetics and evolutionary approaches to assessing variant function, and evaluate the potential for using this information to improve inferences from disease association studies.
Genomic variation and molecular epidemiology
The protein coding sequences of the human genome contain approximately 100,000–300,000 common SNPs, and additional SNPs lie within putative regulatory regions of genes that might be relevant for studies of human health and disease1. Regulatory and coding SNPs are of particular interest to molecular epidemiological association studies. Non-synonymous SNPs (nsSNPs), or missense variants, translate into amino-acid polymorphisms in the proteins they encode. Regulatory SNPs (rSNPs) can affect the expression, tissue-specificity or function of relevant proteins. It seems that both nsSNPs and rSNPs are relatively rare compared with the total number of SNPs in the human genome5. The rarity of nsSNPs might be a consequence of selection against the functional disruptions of amino-acid variation. Some molecular functional diversity is attributable to the effects on protein function caused by nsSNPs. For example, the kinetic parameters of enzymes, the DNA-binding properties of proteins that regulate transcription, the signal transduction activities of transmembrane receptors, and the architectural roles of structural proteins are all susceptible to perturbation by nsSNPs and their associated amino-acid polymorphisms.
In addition to ongoing efforts to identify and characterize these genetic variants, molecular epidemiological disease association studies are under way to better understand the role of inherited genetic variation in disease.
VOLUME 5 | AUGUST 2004 | 5 8 9
A major challenge for epidemiologists undertaking candidate gene–disease association studies is to choose target SNPs that are most likely to affect the phenotype and that ultimately contribute to disease development. Variants in biologically plausible candidate genes are usually selected for study on the basis of both variant allele frequency and the functional effect of the variant on relevant traits. Although there is often sufficient information to assess the allele frequency of a candidate variant, understanding the functional significance of genetic [...]
Knowledge of gene and SNP function is crucial to direct the appropriate design and interpretation of candidate gene association studies. This is in contrast to genome-wide scans and other approaches that rely on linkage disequilibrium between loci to identify genotype–disease associations, for which knowledge of function is not required. Most epidemiologists are not trained in experimental laboratory methods and do not maintain laboratories in which detailed laboratory assessment of variant function in target tissues or individuals of interest can be made. They must therefore often rely on making an assessment of the function of a candidate gene or SNP from the published literature. However, published experimental evidence might not be adequate to guide the design or interpretation of a molecular epidemiological study. For example, inadequate information about the function of a genetic variant impedes the ability to evaluate whether the association is consistent with a causative event that is consistent with a biological mechanism, whether the association is a reflection of linkage disequilibrium with a truly causative variant, or whether the association represents a false positive result. The lack of reproducibility of many association studies might reflect the number of studies that involve genetic variants with no functional significance6,7.
Experimental approaches
Inferences about function of a genetic variant can be made using experimental systems, including in vitro systems and in vivo animal models. These include the effect of variants on regulatory-region control of DNA expression, RNA stability or degradation, protein structure, protein denaturing, protein expression, other measures of in vivo and in vitro control of protein levels, and tissue specificity8–10. Because the scope of experimental approaches for assessing SNP function is large, the purpose of this section is not to provide a comprehensive review of experimental approaches for evaluating SNP function, but instead to provide a context in which epidemiological studies can use experimental data in the design and interpretation of association studies that [...]
These approaches can provide the strongest evidence for the functional role of a genetic variant, but can also be difficult to interpret in the context of studies of complex human traits. The extent of the effects of genetic variants on relevant phenotypes involved in the disease process is probably small. These effects may be highly dependent on the context in which the genetic variant is acting, including influence from environmental exposure, tissue-specific effects or experimental conditions. For example, the functional effect of a variant might be small in magnitude in an experimental system, but might become more important in a specific human tissue or on exposure to relevant environmental agents. Similarly, large experimental effects might be observed that reflect negligible in vivo effects in humans. Likewise, effects that are small in magnitude in a single time-point experiment might be amplified or only have a phenotypically relevant effect over long time periods. Indeed, most genetic variants that are studied in complex traits and disease would be expected to have small effects on function. So, the use of model systems that are appropriate to detect large effects (for example, as might be observed in high penetrance 'inborn errors of metabolism' settings) might [...]
As outlined above, experimental evidence of genetic variant function might not be consistent with results of epidemiological studies. For example, CYP3A4 metabolizes drugs and other compounds, including steroid hormones, that are important in the aetiology of many common diseases11. A variant has been identified in the CYP3A4 promoter (denoted CYP3A4*1B) that consists of an A→G nucleotide substitution at position –290 (denoted A–290G) in the nifedipine-specific element (NFSE)12. Subsequently, the CYP3A4*1B variant was epidemiologically associated with various characteristics, including a more advanced stage of prostate tumours13,14, decreased risk for treatment-related [...] plasma levels of insulin-like growth factor-I among users of oral contraception18. Although these data provide evidence for functionally significant effects of CYP3A4*1B, the basic science literature has not consistently supported the hypothesis that CYP3A4*1B has a functionally significant effect. Hashimoto et al.12 identified several regulatory elements, including a putative repressor fragment and an NFSE element in the CYP3A4 promoter, which indicates that variants in this promoter region might affect CYP3A4 transcription (FIG. 1). Lamba et al.19 reported that CYP3A4*1B alleles were found significantly more frequently in Caucasians with low CYP3A4 protein levels than in those with higher [...]
Many authors20–26 have studied the relationship of CYP3A4 expression or function to the CYP3A4*1B promoter (FIG. 1). Most of them concluded that there were no biologically meaningful effects, given the small magnitude of effects that were observed. However, most studies reported consistently elevated expression associated with CYP3A4*1B (a 20%–200% increase over the [...]
On the basis of these data, it is possible that CYP3A4*1B has only a small to moderate phenotypic effect. These small phenotypic effects will probably not have a clinically meaningful impact on drug disposition. It is not clear, however, whether this magnitude of phenotypic perturbation is sufficient to alter metabolism on environmental exposure to steroid hormones or other agents that might confer disease risk over the lifetime of an individual. For example, would a 20%
Figure 1 | In vitro studies of the effect of CYP3A4*1B compared with CYP3A4*1A. At the top left of the figure is a schematic representation of the structure of the CYP3A4 region. Blue bars represent the genomic regions that are used in the CYP3A4-containing constructs for each study. The positive and negative numbers on each of the bars represent the position of the terminal base pairs of these regions relative to the position of the SNP. The studies summarized here encompass substantial variability in experimental conditions. Nonetheless, there is a small but consistent effect of increased CYP3A4 expression in the presence of CYP3A4*1B. NFSE, nifedipine-specific element.
Effect of CYP3A4*1B versus CYP3A4*1A, by cell system (figure panel):
• 190% increase in testosterone 6β-oxidation
• 40% increase in nifedipine oxidase activity
• 110% increase in CYP3A4 protein expression
• 20–90% increase in luciferase expression
In response to xenobiotics:
• 20–117% increase in transcriptional activation
• 75–147% increase in transcriptional activation
• 4–19% increase in transcriptional activation
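As a rough reading aid for the 'small but consistent' pattern summarized in Figure 1, the reported ranges can be tabulated and checked for direction. This is only a minimal sketch: the percentages are those listed in the figure, whereas the tuple layout, the 'study 1/2/3' labels and the function name `summarize_effects` are assumptions introduced here, and single values are treated as ranges with equal bounds.

```python
# Percent increases for CYP3A4*1B versus CYP3A4*1A as listed in Figure 1.
# Each entry is (assay, low %, high %); single values repeat the same bound.
# The "study 1/2/3" labels are hypothetical placeholders, not from the source.
FIGURE_1_EFFECTS = [
    ("testosterone 6-beta-oxidation", 190, 190),
    ("nifedipine oxidase activity", 40, 40),
    ("CYP3A4 protein expression", 110, 110),
    ("luciferase expression", 20, 90),
    ("transcriptional activation, xenobiotics (study 1)", 20, 117),
    ("transcriptional activation, xenobiotics (study 2)", 75, 147),
    ("transcriptional activation, xenobiotics (study 3)", 4, 19),
]


def summarize_effects(effects):
    """Return the smallest and largest reported increase and whether
    every reported range lies above zero (same direction of effect)."""
    lows = [low for _, low, _ in effects]
    highs = [high for _, _, high in effects]
    return {
        "min_increase": min(lows),
        "max_increase": max(highs),
        "all_positive": all(low > 0 for low in lows),
    }


summary = summarize_effects(FIGURE_1_EFFECTS)
print(summary)
```

The summary says nothing about clinical relevance; it only makes explicit that the direction of the effect agrees across assays while its magnitude varies widely.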
greater metabolism of testosterone by CYP3A4*1B over the course of a man's lifetime (as indicated by the data in FIG. 1) sufficiently increase the risk of prostate cancer to the extent that it could be detected in an epidemiological context? If so, the apparent discrepancy between epidemiological associations of this genetic variant and the functional effects of this variant in experimental [...]
Variation in experimental approaches used to assess the function of genetic variants in complex metabolic systems could affect inferences about function (see FIG. 1 and BOX 1). Results might vary depending on the specific experimental conditions, unrecognized effects of regulatory elements, and post-transcriptional or post-translational processing. Hashimoto et al.12 suggest that removing the repressor region upstream of the CYP3A4*1B variant sequence might unmask a promoter effect. Studies that evaluate regulatory region genetic variants in CYP3A4 might differ substantially depending on whether this region is present or absent in the experimental systems. In fact, various constructs were used to evaluate in vitro functional effects of CYP3A4*1B (FIG. 1), and this experimental variability could have influenced the inferences made in these studies. The inclusion of enhancer elements that might be required in expression assays can similarly affect expression. Regulation of expression might only become apparent under exposure to the relevant compounds in the proper cellular context. The use of primary cells versus cell culture systems can also affect inferences about function. The use of different cell or tissue types (for example, see FIG. 1) might reflect different regulatory influences that affect the functional assessment of a genetic variant. These effects might be further compounded if the magnitude of functional effects of individual variants is small, leading to a greater potential for artificial masking or enhancement of functional effects in experimental systems. Therefore, the ability to detect functionally relevant effects of genetic variants might be highly dependent on the context of the experimental system used, and experimental systems might not reflect relevant in vivo effects in humans. Given the differences between the relevant phenotypes in humans and experimental systems, information obtained from the latter might not be consistent with results of epidemiological association studies. Therefore, it might not be possible to make clear inferences about in vivo genotype function in humans [...]
Inconsistencies between experimental data and epidemiological studies can also reflect a potential study bias that is inherent to some epidemiological investigations. Poor study design, including insufficient statistical power, could influence the results of an epidemiological study to produce false positive or [...]
In addition to more traditional experimental approaches that are used to assess SNP function, a wide variety of novel approaches for assessing gene function have been proposed, including gene tagging27, gene trapping28, N-ethyl-N-nitrosourea (ENU) mutagenesis29, proteomics methods30 and evaluation of epigenetic mechanisms31. The HaploCHIP method32 is an illustrative example. HaploCHIP analyzes SNPs that affect gene regulation in vivo using chromatin immunoprecipitation and mass spectrometry to identify differential protein–DNA binding in vivo that is associated with
allelic variants of a gene as a surrogate measure of transcriptional activity. This allele-specific quantification method uses haplotype-specific chromatin immunoprecipitation (CHIP) to measure the amount of phosphorylated RNA polymerase II that is bound to different alleles, thereby estimating the differences in protein binding between the alleles. This assay can be adapted for high-throughput analysis because of its sensitivity and ability to quantify the relative abundance of two different alleles in a sample of immunoprecipitated chromatin.
Using this approach, Knight et al.32 showed that there was a close correlation between the level of bound phosphorylated (active) RNA polymerase II at the imprinted small nuclear ribonucleoprotein polypeptide N locus and allele-specific expression. They also used this method to identify an SNP of the cytokine tumour necrosis factor (TNF), which plays a pivotal role in inflammation, immunity and apoptosis, although the involvement of this SNP in modulating TNF transcription is yet to be shown. Application of the HaploCHIP approach to the TNF/lymphotoxin-α (LTA) locus identified functionally important haplotypes that correlate with allele-specific transcription of LTA32.
Despite reports of both traditional and new ways of assessing the function of a genetic variant, little has been done to determine whether genotypic differences that lead to small phenotypic perturbations are functionally significant in the context of complex human disease. No criteria have been established to evaluate the importance of genetic variants with small but potentially pleiotropic effects on disease risk. This should be a main focus of future studies attempting to assess the functional significance of genetic variants.
Box 1 | Factors influencing consistency of gene–disease associations
Variables affecting inferences from experimental studies:
• In vitro or in vivo system studied
• Cell type studied
• Cultured versus fresh cells studied
• Genetic background of the system
DNA constructs
• DNA segments that are included in functional (for example, expression) constructs
• Use of additional promoter or enhancer elements
Exposures
• Use of compounds that induce or repress expression
• Influence of diet or other exposures on animal studies
Variables affecting epidemiological inferences:
• Inclusion/exclusion criteria for study subject selection
• Sample size and statistical power
Candidate gene choice
• A biologically plausible candidate gene
• Functional relevance of the candidate genetic variant
• Frequency of allelic variant
Statistical analysis
• Consideration of confounding variables, including ethnicity, gender or age
• Whether an appropriate statistical model was applied (for example, were interactions considered in addition to main effects of genes?)
• Violation of model assumptions
Population genetics approaches
Knowledge of population genetic structure might provide insight into the functional relevance of a genetic variant on a disease trait. For example, genetic variants with a functionally significant impact on relevant phenotypes, including disease endpoints, might be more likely to deviate from expected allele and genotype frequencies compared with alleles that are functionally neutral. It remains unclear whether there is evidence of selection for or against low penetrance alleles over evolutionary time, although it is clear that selection has influenced the pattern of genetic variability in the human genome33,34. However, a number of diseases that could confer selective pressures on genes of interest are relatively new with respect to evolutionary genetics history. For example, diabetes and obesity might confer disease risks that could lead to selective pressure on allele or genotype frequencies, but these effects still occur relatively late in life and have been a problem to human health on a population level for only a few generations. So, whilst substantial new information about evolutionary genetics history and low penetrance SNPs is becoming available, the link between this knowledge and the functional importance of specific SNPs has yet to be established.
Deviations from expected allele or genotype frequencies across relevant phenotypic groups could be used to identify alleles that are more likely to be functionally associated with disease aetiology. For example, alleles that are truly causative of a disease state would be expected to be disproportionately over-represented among cases with the disease but under-represented among the disease-free controls35,36. As a result, deviations from Hardy-Weinberg proportions or other measures of population frequency might provide clues to the role of genes in the aetiology of this disease. For example, Hoh et al.37 developed a so-called 'set association' approach that evaluates sets of SNPs at various positions in the genome. Information about allelic association and Hardy-Weinberg equilibrium is combined over multiple markers in the genome. A genome-wide test statistic is generated by summing contributions from many SNPs located in different genomic regions. The method performs a significance test on several sets of loci simultaneously, while using conservative measures of inference to control for the potential of false positive associations. Hoh et al.37 applied their approach to an SNP association study of restenosis. Among 779 patients with heart disease, 342 showed restenosis (cases) 6 months after angioplasty. The remaining patients, on whom angioplasty was not performed, were the controls. Eighty-nine SNP markers were genotyped in 62 candidate genes for each individual. The algorithm identified nine genes that conferred susceptibility to the disease. Unfortunately, despite promising results, the method is sensitive to genotyping errors. Population admixture, in which cases and controls belong to different
ethnic groups with different SNP allele frequencies, can also adversely distort the test results. Therefore, although incorporating population genetics information can provide further information to a genotype–disease association study, additional research is required to determine the conditions under which these approaches [...]
Evolutionary and structural approaches
Natural selection shapes patterns of genetic variation in populations such that favourable variants increase in frequency relative to less favourable variants over time. Mutations in functionally important sites can be eliminated or kept at low frequencies. So, nucleotide or amino-acid residues that have been conserved across species or within a gene family are more likely to be involved in regulation of vital functions, expression or tissue specificity than residues that are not conserved. Knowledge of the functional domains can therefore be useful when assessing the functional impact of an amino-acid change. The value of this knowledge was demonstrated early on for the effects of mutations in the primary sequence of haemoglobin38. In this case, the molecular basis of the clinical effects caused by mutations could be inferred as soon as the structural information became available. These pioneering studies recognized crucial links between the putative functional motifs and potential effects of mutations on function.
Recently, a number of computational algorithms have been developed to predict the impact of nucleotide or amino-acid substitutions on protein structure, expression and function. Evolutionary conservation of the amino-acid sequence can be determined by aligning amino-acid sequences of related proteins from unrelated organisms or across gene families. A number of algorithms have been proposed that use DNA or amino-acid sequence data to identify potentially functional residues or domains in a comparison with sequences in the public databases. These include methods that are based on protein three-dimensional (3D) structure39–44, methods that are based on evolutionary considerations45,46, and machine-learning approaches47. As examples of these approaches, we present two algorithms, SIFT ('sorting intolerant from tolerant'; REFS 48–50; see Further information for website) and PolyPhen ('polymorphism phenotyping'; REF. 51; see Further information for website). Both are based entirely around amino-acid sequence, and are useful for predicting the impact of exonic amino-acid substitutions on disease risk. A third algorithm, known as CODDLE ('choosing codons to optimize discovery of deleterious lesions'; see Further information for website), uses nucleotide sequence (for example, a PCR amplicon) to identify all potentially deleterious nucleotide substitutions, including nonsense, frameshift, missense and splicing mutations.
The SIFT algorithm. This algorithm is based on the assumption that amino-acid positions that are important for the correct biological function of the protein are conserved across the protein family and/or across evolutionary history. SIFT uses protein sequence and multiple alignment information of these sequences to estimate 'tolerance indices' that predict tolerated and deleterious (that is, intolerant) substitutions for every position of the query sequence. Substitutions at each position with normalized tolerance indices that are below a chosen cut-off point are predicted to be deleterious. Substitutions with indices that are greater than or equal to the cut-off point are predicted as being tolerated (that is, putatively non-functional). Using three examples (LacI, HIV-1 protease and bacteriophage T4 lysozyme), Ng et al.48 showed that a high proportion of substitutions that are predicted to be deleterious by SIFT did affect phenotypes in experimental assays. However, some positions that were predicted to be intolerant by SIFT were tolerated in experimental assays. In these cases, the positions are usually involved in an unknown function that the assay does not detect.
There is also evidence that SIFT fails to identify residues that are vital for protein function but that have not been conserved throughout the family48. SIFT predictions are based on sequence data alone and do not require knowledge of protein structure or function. So, substitutions in uncharacterized proteins can be evaluated by SIFT only when homologous sequences are provided.
Recently, Zhu et al.6 compared the relationship of tolerance to amino-acid change, as predicted by SIFT, with the reported association with SNPs in cancer-related genes (FIG. 2). For the study, 166 published case-control studies that reported associations in 46 SNPs, in 39 different genes, in 16 different cancer sites, were chosen. All of these SNPs are located in biologically plausible cancer-related genes, such as those involved in DNA repair, carcinogen metabolism or cell cycle checkpoints. The putative functional significance of these affected SNPs was calculated using tolerance indices from SIFT by comparing sequences from different species. The analyses showed a significant inverse correlation between estimates of cancer risk as assessed by the odds ratio (OR) and the tolerance indices of the amino-acid variants. These findings indicate that alterations in conserved amino acids in SNPs are more likely to be associated with cancer susceptibility. So, using a molecular evolutionary approach might help identify SNPs to be genotyped in future [...]
Bayesian phylogenetic analysis is another comparative evolutionary method that identifies potential functionally important amino-acid sites that disrupt gene function52. This approach aligns amino-acid sequences for different species and constructs a phylogenetic tree. It then obtains maximum likelihood estimates of nucleotide substitution rates, identifies conserved regions and calculates ancestral sequences. Finally, regions that evolve under positive selection are identified and compared with the distribution of conserved sites with missense changes that have been reported in public databases. Fleming et al.52 used this approach, as well as the SIFT program, to identify missense changes in exon 11 of the BRCA1 gene. The phylogenetic approach inferred that 38 of 139 missense changes affected function in exon 11. By contrast, SIFT
identified 36 of these 38 changes, and an additional 34. Fleming et al.52 hypothesized that SIFT predicted more changes because substitutions in all taxa are given the identical weight in SIFT. By contrast, under the assumption of the phylogenetic method, if substitutions are not shared by sister taxa, they are more likely to be sequencing errors. In addition, non-conservative substitutions maintained at sites in one pair of sister taxa are unlikely to be functionally significant. Applying this approach, Fleming et al.52 successfully predicted >85% of the known detrimental missense changes in several domains of BRCA1, showing the utility of this approach in identifying genetic variants that could be prioritized in further association studies.
Concerted efforts to further understand the functional role of genomic elements, including SNPs, are also underway. For example, the National Institutes of Health has set up a public research consortium known as 'Encode' ('encyclopaedia of DNA elements'; REF. 53), with the goal of identifying and categorizing DNA functional elements including transcriptional regulatory sequences and determinants of chromosome structure and function. This initiative involves comparing existing computational and experimental approaches, and developing new methods for the identification and [...]
Although these and other approaches can be used to assist molecular epidemiologists in identification of genetic variants in specific nucleotides or residues that are most likely to be functionally significant, they will probably provide only limited information about the true function of a genetic variant. The methods described above generally cannot identify functional effects in non-coding regions, and cannot evaluate the effect of other factors on function, such as exposures, that might be involved in the aetiology of the disease.
Recently, Rogan et al.54 proposed a computational approach to evaluate the putative effect of splicing mutations. They used information theory-based models to evaluate the relationship between variants and predicted splice sites and relevant phenotypes. The results presented by Rogan et al.54 corresponded to known functions of these genes and serve as a model for additional studies that evaluate the effect of both coding and non-coding variation on functionality (see also REF. 55).
Having interpreted the results from SIFT and related approaches, it might not be possible to identify whether a single variant within a putative functional domain is sufficient to disrupt function. Such approaches require sequence data and might therefore be limited by the sequence information that is available in public databases. Therefore, inferences of conservation (or the lack thereof) might simply reflect data limitations. Inconsistencies of inference using different classes of sequence data might also arise. For example, an analysis of sequences among human gene families might provide inferences that are different from across-species sequence analysis. Although potentially helpful in uncovering the role of a particular genetic region or variant, such differences might also obscure inferences about putative functional significance.
Figure 2 | Relationship of in silico indices of SNP function and associations (measured by log-transformed odds ratios) taken from the literature. a | The relationship between the position-specific independent count (PSIC) score difference for studied SNPs from PolyPhen ('polymorphism phenotyping') analyses and log-transformed odds ratios (log ORs) from published associations. b | The relationship between the tolerance index for studied SNPs from SIFT ('sorting intolerant from tolerant') analysis and log ORs from associations. The data indicate that SNPs that are inferred as functional are more likely to be associated with higher odds-ratio effects. By inference, they are more likely to be causative of disease. Modified, with permission, from REF. 6 © (2004) The American Association for Cancer Research.
The PolyPhen algorithm. Missense variants might affect protein folding, binding or interaction sites, as well as the solubility or stability of the protein. These effects can be estimated from physical considerations and from the context of an amino-acid replacement within the family of homologous proteins. Sunyaev et al.56 demonstrated that a significant fraction of missense variants (nsSNPs) are likely to affect protein structure or function. Their approach, implemented in the PolyPhen algorithm, uses
amino-acid sequence, phylogenetic and structural difference, and between tolerance index and PSIC information to characterize the potential functional score difference. So, the more probable it is that an SNP is functionally relevant, the higher the OR effect and the PolyPhen was developed to identify functionally more likely it is that causative associations will be identi- important SNPs by predicting whether an amino-acid fied. These results imply that using a method to assess change is likely to be deleterious for the protein on the functional significance, such as PolyPhen, can optimize basis of 3D structure and multiple alignment of homol- the ability to identify meaningful and reproducible ogous sequences57. The possibly damaging effect of vari- molecular epidemiological associations.
ants in an amino-acid sequence can bedetermined if the In addition to identifying important genetic variants substitution is in an annotated active or binding site, for research prioritization, genotyping efforts could be affects interaction with ligands present in the crystallo- reduced by eliminating amino-acid substitutions that graphic structure, leadsto hydrophobicity or electrostatic have been deemed ‘neutral’ by algorithms such as charge change in a buriedsite, destroys a disulfide bond, PolyPhen. In some cases, some genes could be removed affects the protein’s solubility, inserts proline in an from further consideration if all of the variant alleles α-helix,or is incompatiblewith the profile of amino-acid were deemed to be non-functional. Because prediction substitutions observed at this site in the set of homolo- algorithms provide numerical data, it is feasible to fur- gous proteins. Mapping the amino-acid substitution tothe ther subdivide the variant alleles of a gene into those known 3D structure reveals if the change is likely to that might have a moderate, modest or no impact. In destroy the hydrophobic core of a protein, electrostatic particular, when a gene is known to have many genetic interactions, interactions with ligands, or other features variants, prediction algorithms can also reduce the effective number of alleles that need to be considered in Briefly, PolyPhen uses protein sequence and variant association studies. Therefore, the use of these app- data to search homologous sequence data from publicly roaches should limit the number of genotypes that available protein databases. Only sequences with 30% need to be considered, thereby limiting the potential for or more identity to the protein of interest are consid- false positive associations that might result from per- ered. On the basis of the alignment of these homolo- forming unnecessary hypothesis tests. 
The drawback of gous sequences, profile scores can be computed for these approaches is that potential interactions among allelic variants. Profile scores, known as position-specific combinations of substitutions at a gene might be independent counts (PSICs), are logarithmic ratios of ignored, and the dependence of functional impact on the likelihood that a given amino acid occurs at a par- genotype at other genes or on exposure might not be ticular site to the background probability of the amino addressed. Nonetheless, the algorithms for predicting acid occurring at random at a given position. Large dif- genotype impact on allele function provide an initial ferences in PSIC values for specific genetic variants and important step towards simplifying the seemingly might indicate that the substitution of interest is rarely overwhelming complexity of the genotype data. The or never observed in the protein family. Finally, concept of equivalent alleles provides a basis for addi- PolyPhen maps the amino-acid substitution to the tional steps towards reducing the complexity of the known 3D structure of the protein to examine whether data for molecular epidemiological studies.
the substitution might destroy the protein’s hydropho-bic core, electrostatic interactions or interactions with Optimizing molecular epidemiological studies
ligands, or other important features of a protein based To improve the probability of obtaining biologically on the analysis of several structural parameters, and and clinically meaningful associations between geno- also on the analysis of several contact parameters.
types and disease outcomes, the design and interpreta- Therefore, Polyphen can provide information about tion of molecular epidemiological studies should whether an nsSNP is probably damaging, possibly include an assessment of functional significance. As damaging, benign or whether its function is unknown.
outlined above and in TABLE 1, a number of criteria Using this approach, Sunyaev et al.57 estimated that and/or approaches can be used to assess functional sig- 20% of human nsSNPs might affect protein function, nificance of a genetic variant. These include genetic although the proportion of SNPs that are truly delete- characteristics such as the type of genetic change (mis- rious is probably substantially lower. Sunayev et al.57 sense, frameshift, nonsense, regulatory, splicing, disrup- further estimated that there are 20,000–60,000 func- tion of a known functional motif, etc.), population tionally relevant nsSNPs in the human genome. By genetics characteristics and evolutionary conservation comparison, Fay et al.33 used a different model-based of nucleotide sequences, experimental evidence from in approach to estimate that 20–45% of nsSNPs are vitro or animal model studies that consider repro- slightly deleterious and reach population frequencies of ducibility of functional findings, experimental condi- 1–10%, although as many as 80% of nsSNPs might tions (such as cell/animal system, inducers/repressors) that are used to evaluate a functional effect, the magni- Zhu et al.6applied PolyPhen to the 166 molecular epi- tude of the inferred biological effects, and the relevance demiological studies mentioned above to examine the of the experimental model to human systems, includ- correlation between PSIC score and the OR associated ing knowledge of the effect of the variant in the target with a particular nsSNP (FIG. 2). They found a significant tissue (that is, tissue specificity) and consideration of inverse correlation between the ORs and PSIC score timing, dose or duration of relevant exposures.
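As a rough illustration of the profile scores described for PolyPhen above, the following Python sketch computes a log ratio of observed to background amino-acid frequency for one alignment column, and the score difference between the wild-type and variant residues. It is a simplified stand-in, not the PSIC method itself: the pseudocount, the uniform background, the natural logarithm and the function names `profile_score` and `score_difference` are all assumptions made for this example, and no sequence weighting is applied.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def profile_score(column, residue, background=None, pseudocount=1.0):
    """Log ratio of the (smoothed) frequency of `residue` in one alignment
    column to its background frequency. Simplified; not the real PSIC."""
    if background is None:
        # Assumption: uniform background over the 20 standard amino acids.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    # Count only recognised residues; gaps and unknown symbols are ignored.
    counts = Counter(aa for aa in column if aa in background)
    total = sum(counts.values()) + pseudocount * len(background)
    observed = (counts[residue] + pseudocount) / total
    return math.log(observed / background[residue])


def score_difference(column, wild_type, variant, **kwargs):
    """Difference between wild-type and variant profile scores.
    Large positive values mean the variant residue is rarely observed."""
    return (profile_score(column, wild_type, **kwargs)
            - profile_score(column, variant, **kwargs))


if __name__ == "__main__":
    # Hypothetical column from an alignment of ten homologous sequences.
    column = "LLLLLLLLIL"
    print(score_difference(column, wild_type="L", variant="P"))
```

A substitution to a residue that never appears in the column (here proline) gives a large difference, whereas a substitution to a residue that is common at that position gives a difference near zero.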
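The argument that testing fewer genotypes limits false positive associations can be made concrete with a short calculation. Assuming independent tests that are each carried out at the same significance level α (an idealization that is not stated in the text), the probability of at least one false positive among m tests of true null hypotheses is 1 − (1 − α)^m. The function name and the example numbers below are hypothetical.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among `n_tests`
    independent tests of true null hypotheses, each at level `alpha`."""
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Hypothetical comparison: all variants versus a prioritized subset.
    for m in (50, 10):
        print(m, round(familywise_error_rate(m), 3))
```

Under these assumptions, restricting a study to the variants predicted to be functional lowers the chance of a spurious association, although real genotypes are often correlated and the independence assumption will not hold exactly.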
Table 1 | Criteria for assessing the functional significance of a genetic variant in candidate gene association studies
Strong support for functional significance | Moderate support for functional significance | Evidence against functional significance
Variant is a missense change or disrupts a putative functional motif; changes to protein Evidence for conservation across species In the absence of laboratory error, strong In the absence of laboratory error, moderate Population genetics data indicates to small deviations from expected population no deviations from expected frequencies in cases and/or controls in a frequencies in cases and/or controls; effects proportions Consistent effects from multiple lines of Some (possibly inconsistent) evidence for function from experimental data; effect in context is established; effect in target human context or target tissue is unclear Exposures (for example, Variant is known to affect the genotype–environment metabolism of the exposure in effect in target tissue might not be known moderate-to-large magnitude associations replication studies are not available
Information from previously published epidemiological investigations indicating the effect of an SNP can also be considered for studies that attempt to replicate association findings. For example, an ad hoc measure of support for the functional significance of a particular variant could be created by weighting in favour of a functionally significant effect if the criteria for ‘strong support’ or ‘moderate support’ for functional significance are available, and therefore might be prioritized for association studies. Similarly, variants for which there is no or neutral functional information might be ranked lower in priority for association studies. Variants for which there is evidence against functional significance should not be considered in association studies unless the hypothesis and association study methods consider linkage disequilibrium approaches to identify candidate regions of interest. In this case, the study would assume that the variants under investigation are in linkage disequilibrium with the causative allele(s). Variants would be chosen for analysis on the basis of polymorphic content and haplotype or population
How would this approach be applied for a specific candidate SNP? For example, applying the criteria outlined in TABLE 1 for CYP3A4*1B, we learn first that CYP3A4*1B disrupts a regulatory motif (NFSE)12. Because the function of NFSE is not clear, we might therefore conclude that there is ‘moderate support’ for function on the basis of the ‘nucleotide sequence’
The regulatory region of CYP3A4 is similar, at the sequence level, to many members of the CYP3A multigene family, so we might conclude that there is strong support for function on the basis of the ‘evolutionary conservation’ criterion. However, because the role of putative regulatory domains in CYP3A genes58 is still debatable, the assessment of ‘evolutionary conservation’
The population genetics of CYP3A4*1B has been well described59, but no data have been reported about deviation of frequencies from Hardy-Weinberg proportions in cases versus controls, so this criterion provides little support for or against function.
The ‘experimental evidence’ about CYP3A4*1B function (FIG. 1) has been controversial, but there seems to be a small but consistent increase in expression associated with CYP3A4*1B. This relatively small effect might be insufficient to confer major effects on drug metabolism, but indicates possible associations with altered disease risk. We could therefore conclude that the ‘experimental evidence’ criterion suggests ‘moderate support’ for functional significance.
It is well known that CYP3A4 metabolizes compounds that are strongly associated with disease risk, such as testosterone metabolism in prostate cancer aetiology11. Therefore, there is strong support that the product of this gene metabolizes relevant aetiological exposures, including steroid hormones and carcinogens11.
Finally, there are numerous epidemiological studies that report an association of CYP3A4*1B with various phenotypes, thereby providing strong support for function on the basis of the ‘epidemiological evidence’ criterion. So, these criteria suggest moderate to strong support for the hypothesis that CYP3A4*1B is a functionally
By applying these or similar criteria, molecular epidemiologists might be able to maximize the chance that association studies involve functionally significant genetic variants, and therefore could reduce the likelihood of TYPE I ERRORS, increase reproducibility of candidate gene association studies, and facilitate interpretation of positive associations. Even in the absence of a definitive function of a genetic variant, candidate gene association studies should not be undertaken without some consideration of
hypothesis is correct. Similarly, the false positive rate.
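The ad hoc weighting of criteria described above can be sketched in a few lines of Python. The ratings for CYP3A4*1B are taken from the worked example in the text; the numeric weights, the label ‘neutral’ for a criterion that gives little support for or against function, and the names `WEIGHTS`, `support_score` and `mean_support` are assumptions introduced only for illustration and are not part of TABLE 1.

```python
# Hypothetical weights for each level of support (not from the article).
WEIGHTS = {"strong": 2, "moderate": 1, "neutral": 0, "against": -1}

# Ratings for CYP3A4*1B as described in the text.
CYP3A4_1B = {
    "nucleotide sequence": "moderate",
    # Rated strong in the text, which also notes that the role of the
    # putative regulatory domains is still debatable.
    "evolutionary conservation": "strong",
    "population genetics": "neutral",
    "experimental evidence": "moderate",
    "exposures": "strong",
    "epidemiological evidence": "strong",
}


def support_score(ratings, weights=WEIGHTS):
    """Sum of the weights over all rated criteria."""
    return sum(weights[level] for level in ratings.values())


def mean_support(ratings, weights=WEIGHTS):
    """Average weight per criterion, on the same scale as WEIGHTS."""
    return support_score(ratings, weights) / len(ratings)


if __name__ == "__main__":
    print(support_score(CYP3A4_1B), round(mean_support(CYP3A4_1B), 2))
```

With these assumed weights the mean falls between the values chosen for ‘moderate’ and ‘strong’, which is in line with the qualitative conclusion reached in the text. A different weighting scheme could rank variants differently, so any such score is a prioritization aid rather than a test of function.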
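The population genetics criterion refers to deviation of genotype frequencies from Hardy-Weinberg proportions in cases and/or controls. One common way to quantify such a deviation for a biallelic marker is a Pearson chi-square statistic with one degree of freedom; the sketch below assumes that choice, which is not necessarily the test used in the studies cited, and the function name `hwe_chi_square` and the genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Chi-square statistic and p-value (1 degree of freedom) for deviation
    of biallelic genotype counts from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype must be observed")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Classes with zero expected count (monomorphic marker) are skipped.
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical genotype counts in cases and in controls.
    print(hwe_chi_square(30, 40, 30))
    print(hwe_chi_square(25, 50, 25))
```

Running the function separately on cases and on controls gives the two quantities that the criterion compares; exact tests are often preferred when genotype counts are small.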
et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genet. 22, 231–238 (1999).
Salisbury, B. A. et al. SNP and haplotype variation in the human genome. Mutat. Res. 526, 53–61 (2003).
Schneider, J. A. et al. DNA variability of human genes. Mech. Ageing Dev. 124, 17–25 (2003).
Schork, N. J., Fallin, D. & Lanchbury, J. S. Single nucleotide polymorphisms and the future of genetic epidemiology. Clin. Genet. 58, 250–264 (2000).
sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933 (2001).
et al. An evolutionary perspective on SNP screening in molecular cancer epidemiology. Cancer Res. 64,
The first comprehensive evaluation and comparison of SIFT and PolyPhen algorithms in molecular epidemiological association studies.
Lohmueller, K. E., Pearce, C. L., Pike, M., Lander, E. S. & Hirschhorn, J. N. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nature Genet. 33,
A comprehensive evaluation of the consistency of association studies that demonstrates the need for functional correlates in achieving consistency in association study results.
Botstein, D. & Risch, N. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nature Genet. 33
et al. Defects in pre-mRNA processing as causes of and predisposition to diseases. DNA Cell Biol. 21,
10. Knight, J. C. Functional implications of genetic variation in non-coding DNA for disease susceptibility and gene regulation. Clin. Sci. (Lond.) 104, 493–501 (2003).
11. Li, A. P., Kaminski, D. L. & Rasmussen, A. R. Substrates of human hepatic cytochrome P450 3A4. Toxicology 104, 1–8
et al. Gene structure of CYP3A4, an adult-specific form of cytochrome P450 in human livers and its transcriptional control. Eur. J. Biochem. 218, 585–595 (1993).
13. Rebbeck, T. R., Jaffe, J. M., Walker, A. H., Wein, A. J. & Malkowicz, S. B. Modification of clinical presentation of prostate tumors by a novel genetic variant in CYP3A4. J. Natl Cancer Inst. 90, 1225–1229 (1998).
et al. Association between a CYP3A4 genetic variant and clinical presentation in African-American prostate cancer patients. Cancer Epidemiol. Biomarkers Prev. 8,
15. Felix, C. A., et al. Association of CYP3A4 genotype with treatment-related leukemia. Proc. Natl Acad. Sci. 95,
16. Kadlubar, F. F. et al. The putative high activity variant, CYP3A4*1B, predicts the onset of puberty in young girls. Cancer Epidemiol. Biomarkers Prev. 12, 327–331 (2003).
17. Lai, J., Vesprini, D., Chu, W., Jernstrom, H. & Narod, S. A. CYP gene polymorphisms and early menarche. Mol. Genet. Metab. 74, 449–457 (2001).
et al. Genetic factors related to racial variation in plasma levels of insulin-like growth factor-1: implications for premenopausal breast cancer risk. Mol. Genet. Metab. 72, 144–154 (2001).
19. Lamba, J. K. et al. Common allelic variants in cytochrome P4503A4 and their prevalence in different populations. Pharmacogenetics 12, 121–132 (2002).
20. Westlind, A., Lofberg, L., Tindberg, N., Andersson, T. B. & Ingelman-Sundberg, M. Interindividual differences in hepatic expression of CYP3A4: relationship to genetic polymorphism in the 5′-upstream regulatory region. Biochem. Biophys. Res. Commun. 259, 201–205 (1999).
21. Amirimani, B., Walker, A. H., Weber, B. L. & Rebbeck, T. R. Response: re: modification of clinical presentation of prostate tumors by a novel genetic variant in CYP3A4. J. Natl Cancer Inst. 91, 1588–1590 (1999).
et al. Re: modification of clinical presentation of prostate tumors by a novel genetic variant in CYP3A4. J. Natl Cancer Inst. 91, 1587–1590 (1999).
23. Spurdle, A. B. et al. The CYP3A4*1B polymorphism has no functional significance and is not associated with risk of breast or ovarian cancer. Pharmacogenetics 12, 355–366
D. et al. Genotype-phenotype associations for common CYP3A4 and CYP3A5 variants in the basal and induced metabolism of midazolam in European- and African-American men and women. Pharmacogenetics 13,
et al. Transcriptional activity effects of a CYP3A4 promoter variant. Environ. Mol. Mutagen. 42,
26. Hamzeiy, H., Bombail, V., Plant, N., Gibson, G. & Goldfarb, P. Transcriptional regulation of cytochrome P4503A4 gene expression: effects of inherited mutations in the 5′-flanking region. Xenobiotica 33, 1085–1095 (2003).
27. Jeon, J. & An, G. Gene tagging in rice: a high throughput system for functional genomics. Plant Sci. 161, 211–219
28. Cecconi, F. & Meyer, B. I. Gene trap: a way to identify novel genes and unravel their biological function. FEBS Lett. 480,
29. Adams, M. D. ENU mutagenesis for pharma. Drug Discov. Today 8, 199–200 (2003)
30. Lee, Y. S. & Mrksich, M. Protein chips: from concept to practice. Trends Biotechnol. 20 (Suppl.), S14–18 (2002).
et al. EICO (expression-based imprint candidate organizer): finding disease-related imprinted genes. Nucleic Acids Res. 32 (database issue), D548–551 (2004).
32. Knight, J. C., Keating, B. J., Rockett, K. A. & Kwiatkowski, D. P. In vivo characterization of regulatory polymorphisms by allele-specific quantification of RNA polymerase loading. Nature Genet. 33, 469–475 (2003).
The authors report a new method and application of experimental approaches to assessing genotype function.
33. Fay, J. C., Wyckoff, G. J. & Wu, C. I. Positive and negative selection on the human genome. Genetics 158, 1227–1234
34. Akey, J. M., Zhang, G., Zhang, K., Jin, L. & Shriver, M. D. Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 12, 1805–1814 (2002).
35. Feder, J. N. et al. A novel MHC class I-like gene is mutated in patients with hereditary haemochromatosis. Nature Genet. 13, 399–408 (1996).
36. Nielsen, D. M., Ehm, M. G. & Weir, B. S. Detecting marker-disease association by testing for Hardy-Weinberg disequilibrium at a marker locus. Am. J. Hum. Genet. 63,
37. Hoh, J., Wille, A. & Ott, J. Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res. 11, 2115–2119 (2001).
The authors propose a novel approach to association studies that incorporates both association and population genetics information in identifying disease genes, including the possibility of genome-wide associations.
38. Perutz, M. F. Structure and function of haemoglobin. I. A tentative atomic model of horse oxyhaemoglobin. J. Mol. Biol. 13, 646–668 (1965).
39. Wang, Z. & Moult, J. Three-dimensional structural location and molecular functional effects of missense SNPs in the T cell receptor Vβ domain. Proteins 53, 748–757
40. Wang, Z. & Moult, J. SNPs, protein structure, and disease. Hum. Mutat. 17, 263–270 (2001).
41. Chasman, D. & Adams, R. M. Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J. Mol. Biol. 307, 683–706 (2001).
42. Ferrer-Costa, C., Orozco, M. & de la Cruz, X. Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties. J. Mol. Biol. 315, 771–786 (2002).
43. Saunders, C. T. & Baker, D. Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J. Mol. Biol. 322, 891–901 (2002).
et al. Prediction of deleterious functional effects of amino acid mutations using a library of structure-based function descriptors. Proteins 53, 806–816 (2003).
45. Miller, M. P. & Kumar, S. Understanding human disease mutations through the use of interspecific genetic variation. Hum. Mol. Genet. 10, 2319–2328 (2001).
46. Koref, M. E. S., Gangeswaran, R., Koref, I. P. S., Shanahan, N. & Hancock, J. M. A phylogenetic approach to assessing the significance of missense mutations in disease genes. Hum. Mutat. 22, 51–58 (2003).
47. Krishnan, V. G. & Westhead, D. R. A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function. Bioinformatics 19, 2199–2209 (2003).
48. Ng, P. C. & Henikoff, S. Predicting deleterious amino acid substitutions. Genome Res. 11, 863–874 (2001).
An outline of the SIFT approach to assessing missense variant function using evolutionary similarity.
49. Ng, P. C. & Henikoff, S. Accounting for human polymorphisms predicted to affect protein function. Genome Res. 12, 436–446 (2002).
50. Ng, P. C. & Henikoff, S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 31,
51. Ramensky, V., Bork, P. & Sunyaev, S. Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 30, 3894–3900 (2002).
An outline of the PolyPhen methodology for using evolutionary and structure data to assess SNP function.
52. Fleming, M. A., Potter, J. D., Ramirez, C. J., Ostrander, G. K. & Ostrander, E. A. Understanding missense mutations in the BRCA1 gene: an evolutionary approach. Proc. Natl Acad. Sci. USA 100, 1151–1156 (2003).
53. National Institutes of Health. The ENCODE Project: ENCyclopedia Of DNA Elements [online], <> (2003).
54. Rogan, P. K., Svojanovsky, S. & Leeder, J. S. Information theory-based analysis of CYP2C19, CYP2D6 and CYP3A5 splicing mutations. Pharmacogenetics 13, 207–218 (2003).
55. Pagani, F. & Baralle, F. E. Genomic variants in exons and introns: identifying the splicing spoilers. Nature Rev. Genet. 5, 389–396 (2004).
56. Sunyaev, S., Ramensky, V. & Bork, P. Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet. 16, 198–200 (2000).
et al. Prediction of deleterious human alleles. Hum. Mol. Genet. 10, 591–597 (2001).
58. Schuetz, E. G. Lessons from the CYP3A4 promoter. Mol. Pharmacol. 65, 279–281 (2004).
Zeigler-Johnson, C. M. et al. Ethnic differences in the frequency of prostate cancer susceptibility alleles at SRD5A2 and CYP3A4. Hum. Hered. 54, 13–21 (2002).
Some of the work discussed in this review was supported by grants from the Public Health Service and the University of
The authors declare that they have no competing financial interests.

