ASSESSING THE FUNCTION OF GENETIC VARIANTS IN CANDIDATE GENE ASSOCIATION STUDIES
Timothy R. Rebbeck*, Margaret Spitz‡ and Xifeng Wu‡
Knowledge of inherited genetic variation has a fundamental impact on understanding human
disease. Unfortunately, our understanding of the functional significance of many inherited genetic
variants is limited. New approaches to assessing functional significance of inherited genetic
variation, which combine molecular genetics, epidemiology and bioinformatics, promise to enhance
reproducibility and plausibility of associations between genotypes and disease.
*Department of Biostatistics and Epidemiology, and Abramson Cancer Center, University of Pennsylvania School of Medicine, 904 Blockley Hall, 423 Guardian Drive, Philadelphia, Pennsylvania 19104, USA.
‡Department of Epidemiology, M. D. Anderson Cancer Center, 1515 Halcombe Boulevard, Houston, Texas 77030, USA.
Correspondence to T.R.R.
e-mail: trebbeck@cceb.med.upenn.edu

The human genome contains about ten million SNPs, with an estimated two common missense variants per gene1. At least five million SNPs have already been reported in public databases2,3. Many of these variants might be involved in human disease aetiology, but it is often difficult to assess their function on the basis of nucleotide sequence alone. This is particularly true when variants do not alter an amino acid or do not disrupt a well-characterized motif that affects protein function or structure. In addition, only a small subset of variants that affect the phenotype will confer small to moderate effects on phenotypes that are causally related to disease risk. So, an important challenge that faces molecular epidemiological association studies of candidate disease-susceptibility genes is to define the variants that are functionally implicated in disease. This is particularly urgent because the amount of genomic information that is available greatly exceeds the information about the function of variants that are used in human disease studies. There is insufficient guidance for molecular epidemiologists to optimally select variants for an epidemiology study, highlighting the need for methods that prioritize the choice of genetic variants to be genotyped in molecular epidemiological studies4. Here, we bring together examples of experimental, population genetics and evolutionary approaches to assessing variant function, and evaluate the potential for using this information to improve inferences from disease association studies.

Genomic variation and molecular epidemiology
The protein coding sequences of the human genome contain approximately 100,000–300,000 common SNPs, and additional SNPs lie within putative regulatory regions of genes that might be relevant for studies of human health and disease1. Regulatory and coding SNPs are of particular interest to molecular epidemiological association studies. Non-synonymous SNPs (nsSNPs), or missense variants, translate into amino-acid polymorphisms in the proteins they encode. Regulatory SNPs (rSNPs) can affect the expression, tissue-specificity or function of relevant proteins. It seems that both nsSNPs and rSNPs are relatively rare compared with the total number of SNPs in the human genome5. The rarity of nsSNPs might be a consequence of selection against the functional disruptions of amino-acid variation. Some molecular functional diversity is attributable to the effects on protein function caused by nsSNPs. For example, the kinetic parameters of enzymes, the DNA-binding properties of proteins that regulate transcription, the signal transduction activities of transmembrane receptors, and the architectural roles of structural proteins are all susceptible to perturbation by nsSNPs and their associated amino-acid [...]
In addition to ongoing efforts to identify and characterize these genetic variants, molecular epidemiological disease association studies are under way to better understand the role of inherited genetic variation in disease.
VOLUME 5 | AUGUST 2004 | 589
A major challenge for epidemiologists undertaking candidate gene–disease association studies is to choose target SNPs that are most likely to affect the phenotype and that ultimately contribute to disease development. Variants in biologically plausible candidate genes are usually selected for study on the basis of both variant allele frequency and the functional effect of the variant on relevant traits. Although there is often sufficient information to assess the allele frequency of a candidate variant, understanding the functional significance of genetic [...]
Knowledge of gene and SNP function is crucial to direct the appropriate design and interpretation of candidate gene association studies. This is in contrast to genome-wide scans and other approaches that rely on LINKAGE DISEQUILIBRIUM between loci to identify genotype-disease associations, for which knowledge of function is not required. Most epidemiologists are not trained in experimental laboratory methods and do not maintain laboratories in which detailed laboratory assessment of variant function in target tissues or individuals of interest can be made. They must therefore often rely on making an assessment of the function of a candidate gene or SNP from the published literature. However, published experimental evidence might not be adequate to guide the design or interpretation of a molecular epidemiological study. For example, inadequate information about the function of a genetic variant impedes the ability to evaluate whether the association is consistent with a causative event that is consistent with a biological mechanism, whether the association is a reflection of linkage disequilibrium with a truly causative variant, or whether the association represents a false positive result. The lack of reproducibility of many association studies might reflect the number of studies that involve genetic variants with no functional significance6,7.

[...] chromosome, are not inherited independently but are observed [...]
[...] metabolized by CYP3A4, and for which a regulatory element [...]

Experimental approaches
Inferences about function of a genetic variant can be made using experimental systems, including in vitro systems and in vivo animal models. These include the effect of variants on regulatory-region control of DNA expression, RNA stability or degradation, protein structure, protein denaturing, protein expression, other measures of in vivo and in vitro control of protein levels, and tissue specificity8–10. Because the scope of experimental approaches for assessing SNP function is large, the purpose of this section is not to provide a comprehensive review of experimental approaches for evaluating SNP function, but instead to provide a context in which epidemiological studies can use experimental data in the design and interpretation of association studies that [...]
These approaches can provide the strongest evidence for the functional role of a genetic variant, but can also be difficult to interpret in the context of studies of complex human traits. The extent of the effects of genetic variants on relevant phenotypes involved in the disease process is probably small. These effects may be highly dependent on the context in which the genetic variant is acting, including influence from environmental exposure, tissue-specific effects or experimental conditions. For example, the functional effect of a variant might be small in magnitude in an experimental system, but might become more important in a specific human tissue or on exposure to relevant environmental agents. Similarly, large experimental effects might be observed that reflect negligible in vivo effects in humans. Similarly, effects that are small in magnitude in a single-time-point experiment might be amplified or only have a phenotypically relevant effect over long time periods. Indeed, most genetic variants that are studied in complex traits and disease would be expected to have small effects on function. So, the use of model systems that are appropriate to detect large effects (for example, as might be observed in high penetrance ‘inborn errors of metabolism’ settings) might [...]
As outlined above, experimental evidence of genetic variant function might not be consistent with results of epidemiological studies. For example, CYP3A4 metabolizes drugs and other compounds, including steroid hormones, that are important in the aetiology of many common diseases11. A variant has been identified in the CYP3A4 promoter (denoted CYP3A4*1B) that consists of an A→G nucleotide substitution at position –290 (denoted A–290G) in the NIFEDIPINE-specific element (NFSE)12. Subsequently, the CYP3A4*1B variant was epidemiologically associated with various characteristics, including a more advanced stage of prostate tumours13,14, decreased risk for treatment-related [...] plasma levels of insulin-like growth factor-I among users of oral contraception18. Although these data provide evidence for functionally significant effects of CYP3A4*1B, the basic science literature has not consistently supported the hypothesis that CYP3A4*1B has a functionally significant effect. Hashimoto et al.12 identified several regulatory elements, including a putative repressor fragment and an NFSE element in the CYP3A4 promoter, which indicates that variants in this promoter region might affect CYP3A4 transcription (FIG. 1). Lamba et al.19 reported that CYP3A4*1B alleles were found significantly more frequently in Caucasians with low CYP3A4 protein levels than in those with higher [...]
Many authors20–26 have studied the relationship of CYP3A4 expression or function to the CYP3A4*1B promoter (FIG. 1). Most of them concluded that there were no biologically meaningful effects, given the small magnitude of effects that were observed. However, most studies reported consistently elevated expression associated with CYP3A4*1B (a 20%–200% increase over the [...]
On the basis of these data, it is possible that CYP3A4*1B has only a small to moderate phenotypic effect. These small phenotypic effects will probably not have a clinically meaningful impact on drug disposition. It is not clear, however, whether this magnitude of phenotypic perturbation is sufficient to alter metabolism on environmental exposure to steroid hormones or other agents that might confer disease risk over the lifetime of an individual. For example, would a 20%
Cell system | Effect of CYP3A4*1B versus CYP3A4*1A
190% Increase in testosterone 6β-oxidation
40% Increase in nifedipine oxidase activity
110% Increase in CYP3A4 protein expression
20–90% Increase in luciferase expression
In response to xenobiotics:
20–117% Increase in transcriptional activation
75–147% Increase in transcriptional activation
4–19% Increase in transcriptional activation
Figure 1 | In vitro studies of the effect of CYP3A4*1B compared with CYP3A4*1A. At the top of the figure on the left is a schematic representation of the structure of the CYP3A4 region. Blue bars represent the genomic regions that are used in the CYP3A4-containing constructs for each study. The positive and negative numbers on each of the bars represent the position of the terminal base pairs of these regions relative to the position of the SNP. The studies summarized here encompass substantial variability in experimental conditions. Nonetheless, there is a small but consistent effect of increased CYP3A4 expression in the presence of CYP3A4*1B. NFSE, nifedipine-specific element.
greater metabolism of testosterone by CYP3A4*1B over the course of a man’s lifetime (as indicated by the data in FIG. 1) sufficiently increase the risk of prostate cancer to the extent that it could be detected in an epidemiological context? If so, the apparent discrepancy between epidemiological associations of this genetic variant and the functional effects of this variant in experimental [...]
Variation in experimental approaches used to assess the function of genetic variants in complex metabolic systems could affect inferences about function (see FIG. 1 and BOX 1). Results might vary depending on the specific experimental conditions, unrecognized effects of regulatory elements, and post-transcriptional or post-translational processing. Hashimoto et al.12 suggest that removing the repressor region upstream of the CYP3A4*1B variant sequence might unmask a promoter effect. Studies that evaluate regulatory region genetic variants in CYP3A4 might differ substantially depending on whether this region is present or absent in the experimental systems. In fact, various constructs were used to evaluate in vitro functional effects of CYP3A4*1B (FIG. 1), and this experimental variability could have influenced the inferences made in these studies. The inclusion of enhancer elements that might be required in expression assays can similarly affect expression. Regulation of expression might only become apparent under exposure to the relevant compounds in the proper cellular context. The use of primary cells versus cell culture systems can also affect inferences about function. The use of different cell or tissue types (for example see FIG. 1) might reflect different regulatory influences that affect the functional assessment of a genetic variant. These effects might be further compounded if the magnitude of functional effects of individual variants is small, leading to a greater potential for artificial masking or enhancement of functional effects in experimental systems. Therefore, the ability to detect functionally relevant effects of genetic variants might be highly dependent on the context of the experimental system used, and experimental systems might not reflect relevant in vivo effects in humans. Given the differences between the relevant phenotypes in humans and experimental systems, information obtained from the latter might not be consistent with results of epidemiological association studies. Therefore, it might not be possible to make clear inferences about in vivo genotype function in humans [...]
Inconsistencies between experimental data and epidemiological studies can also reflect a potential study bias that is inherent to some epidemiological investigations. Poor study design, including insufficient statistical power, could influence the results of an epidemiological study to produce false positive or [...]
In addition to more traditional experimental approaches that are used to assess SNP function, a wide variety of novel approaches for assessing gene function have been proposed, including gene tagging27, gene trapping28, N-ethyl-N-nitrosourea (ENU) mutagenesis29, proteomics methods30 and evaluation of epigenetic mechanisms31. The HaploCHIP method32 is an illustrative example. HaploCHIP analyzes SNPs that affect gene regulation in vivo using chromatin immunoprecipitation and mass spectrometry to identify differential protein–DNA binding in vivo that is associated with
allelic variants of a gene as a surrogate measure of transcriptional activity. This allele-specific quantification method uses haplotype-specific chromatin immunoprecipitation (CHIP) to measure the amount of phosphorylated RNA polymerase II that is bound to different alleles, thereby estimating the differences in protein binding between the alleles. This assay can be adapted for high-throughput analysis because of its sensitivity and ability to quantify the relative abundance of two different alleles in a sample of immunoprecipitated chromatin.
Using this approach, Knight et al.32 showed that there was a close correlation between the level of bound phosphorylated (active) RNA polymerase II at the imprinted small nuclear ribonucleoprotein polypeptide N locus and allele-specific expression. They also used this method to identify an SNP of the cytokine tumour necrosis factor (TNF), which plays a pivotal role in inflammation, immunity and apoptosis, although the involvement of this SNP in modulating TNF transcription is yet to be shown. Application of the HaploCHIP approach to the TNF/lymphotoxin-α (LTA) locus identified functionally important haplotypes that correlate with allele-specific transcription of LTA32.
Despite reports of both traditional and new ways of assessing the function of a genetic variant, little has been done to determine whether genotypic differences that lead to small phenotypic perturbations are functionally significant in the context of complex human disease. No criteria have been established to evaluate the importance of genetic variants with small but potentially pleiotropic effects on disease risk. This should be a main focus of future studies attempting to assess the functional significance of genetic variants.

Box 1 | Factors influencing consistency of gene–disease associations
Variables affecting inferences from experimental studies:
• In vitro or in vivo system studied
• Cell type studied
• Cultured versus fresh cells studied
• Genetic background of the system
• DNA constructs
• DNA segments that are included in functional (for example, expression) constructs
• Use of additional promoter or enhancer elements
• Exposures
• Use of compounds that induce or repress expression
• Influence of diet or other exposures on animal studies
Variables affecting epidemiological inferences:
• Inclusion/exclusion criteria for study subject selection
• Sample size and statistical power
• Candidate gene choice
• A biologically plausible candidate gene
• Functional relevance of the candidate genetic variant
• Frequency of allelic variant
• Statistical analysis
• Consideration of confounding variables, including ethnicity, gender or age
• Whether an appropriate statistical model was applied (for example, were interactions considered in addition to main effects of genes?)
• Violation of model assumptions

[...] allele A, and q is the frequency of allele a) that result in a [...]
[...] after an initial treatment such as angioplasty aimed at removing [...]
[...] repair a damaged blood vessel or unblock a coronary artery.

Population genetics approaches
Knowledge of population genetic structure might provide insight into the functional relevance of a genetic variant on a disease trait. For example, genetic variants with a functionally significant impact on relevant phenotypes, including disease endpoints, might be more likely to deviate from expected allele and genotype frequencies compared with alleles that are functionally neutral. It remains unclear whether there is evidence of selection for or against low penetrance alleles over evolutionary time, although it is clear that selection has influenced the pattern of genetic variability in the human genome33,34. However, a number of diseases that could confer selective pressures on genes of interest are relatively new with respect to evolutionary genetics history. For example, diabetes and obesity might confer disease risks that could lead to selective pressure on allele or genotype frequencies, but these effects still occur relatively late in life and have been a problem to human health on a population level for only a few generations. So, whilst substantial new information about evolutionary genetics history and low penetrance SNPs is becoming available, the link between this knowledge and the functional importance of specific SNPs has yet [...]
Deviations from expected allele or genotype frequencies across relevant phenotypic groups could be used to identify alleles that are more likely to be functionally associated with disease aetiology. For example, alleles that are truly causative of a disease state would be expected to be disproportionately over-represented among cases with the disease but under-represented among the disease-free controls35,36. As a result, deviations from HARDY-WEINBERG PROPORTIONS or other measures of population frequency might provide clues to the role of genes in the aetiology of this disease. For example, Hoh et al.37 developed a so-called ‘set association’ approach that evaluates sets of SNPs at various positions in the genome. Information about allelic association and Hardy-Weinberg equilibrium is combined over multiple markers in the genome. A genome-wide TEST STATISTIC is generated by summing contributions from many SNPs located in different genomic regions. The method performs a significance test on several sets of loci simultaneously, while using conservative measures of inference to control for the potential of false positive associations. Hoh et al.37 applied their approach to an SNP association study of RESTENOSIS. Among 779 patients with heart disease, 342 showed restenosis (cases) 6 months after ANGIOPLASTY. The remaining patients, on whom angioplasty was not performed, were the controls. Eighty-nine SNP markers were genotyped in 62 candidate genes for each individual. The algorithm identified nine genes that conferred susceptibility to the disease. Unfortunately, despite promising results, the method is sensitive to genotyping errors. Population ADMIXTURE, in which cases and controls belong to different
ethnic groups with different SNP allele frequencies, can also adversely distort the test results. Therefore, although incorporating population genetics information can provide further information to a genotype–disease association study, additional research is required to determine the conditions under which these approaches [...]

[...] populations into a single group. Combining two populations has [...]
[...] cell. Disruption can lead to uncontrolled cell growth, and [...]
[...] usually estimated from case-control studies.

Evolutionary and structural approaches
Natural selection shapes patterns of genetic variation in populations such that favourable variants increase in frequency relative to less favourable variants over time. Mutations in functionally important sites can be eliminated or kept at low frequencies. So, nucleotide or amino-acid residues that have been conserved across species or within a gene family are more likely to be involved in regulation of vital functions, expression or tissue specificity than residues that are not conserved. Knowledge of the functional domains can therefore be useful when assessing the functional impact of an amino-acid change. The value of this knowledge was demonstrated early on for the effects of mutations in the primary sequence of haemoglobin38. In this case, the molecular basis of the clinical effects caused by mutations could be inferred as soon as the structural information became available. These pioneering studies recognized crucial links between the putative functional motifs and potential effects of mutations on function.
Recently, a number of computational algorithms have been developed to predict the impact of nucleotide or amino-acid substitutions on protein structure, expression and function. Evolutionary conservation of the amino-acid sequence can be determined by aligning amino-acid sequences of related proteins from unrelated organisms or across gene families. A number of algorithms have been proposed that use DNA or amino-acid sequence data to identify potentially functional residues or domains in a comparison with sequences in the public databases. These include methods that are based on protein three-dimensional (3D) structure39–44, methods that are based on evolutionary considerations45,46, and machine-learning approaches47. As examples of these approaches, we present two algorithms, SIFT (‘sorting intolerant from tolerant’; REFS 48–50; see Further information for website) and PolyPhen (‘polymorphism phenotyping’; REF. 51; see Further information for website). Both are based entirely around amino-acid sequence, and are useful for predicting the impact of exonic amino-acid substitutions on disease risk. A third algorithm, known as CODDLE (‘choosing codons to optimize discovery of deleterious lesions’; see Further information for website) uses nucleotide sequence (for example, a PCR amplicon) to identify all potentially deleterious nucleotide substitutions, including nonsense, frameshift, missense and splicing mutations.

The SIFT algorithm. This algorithm is based on the assumption that amino-acid positions that are important for the correct biological function of the protein are conserved across the protein family and/or across evolutionary history. SIFT uses protein sequence and multiple alignment information of these sequences to estimate ‘tolerance indices’ that predict tolerated and deleterious (that is, intolerant) substitutions for every position of the query sequence. Substitutions at each position with normalized tolerance indices that are below a chosen cut-off point are predicted to be deleterious. Substitutions with indices that are greater than or equal to the cut-off point are predicted as being tolerated (that is, putatively non-functional). Using three examples (LacI, HIV-1 protease and bacteriophage T4 lysozyme), Ng et al.48 showed that a high proportion of substitutions that are predicted to be deleterious by SIFT did affect phenotypes in experimental assays. However, some positions that were predicted to be intolerant by SIFT were tolerated in experimental assays. In these cases, the positions are usually involved in an unknown function that the assay does not detect.
There is also evidence that SIFT fails to identify residues that are vital for protein function but that have not been conserved throughout the family48. SIFT predictions are based on sequence data alone and do not require knowledge of protein structure or function. So, substitutions in uncharacterized proteins can be evaluated by SIFT only when homologous sequences are provided.
Recently, Zhu et al.6 compared the relationship of tolerance to amino-acid change, as predicted by SIFT, with the reported association with SNPs in cancer-related genes (FIG. 2). For the study, 166 published case-control studies that reported associations in 46 SNPs, in 39 different genes, in 16 different cancer sites, were chosen. All of these SNPs are located in biologically plausible cancer-related genes, such as those involved in DNA repair, carcinogen metabolism or CELL CYCLE CHECKPOINTS. The putative functional significance of these affected SNPs was calculated using tolerance indices from SIFT by comparing sequences from different species. The analyses showed a significant inverse correlation between estimates of cancer risk as assessed by the ODDS RATIO (OR) and the tolerance indices of the amino-acid variants. These findings indicate that alterations in conserved amino acids in SNPs are more likely to be associated with cancer susceptibility. So, using a molecular evolutionary approach might help identify SNPs to be genotyped in future [...]
Bayesian phylogenetic analysis is another comparative evolutionary method that identifies potential functionally important amino-acid sites that disrupt gene function52. This approach aligns amino-acid sequences for different species and constructs a phylogenetic tree. It then obtains maximum likelihood estimates of nucleotide substitution rates, identifies conserved regions and calculates ancestral sequences. Finally, regions that evolve under positive selection are identified and compared with the distribution of conserved sites with missense changes that have been reported in public databases. Fleming et al.52 used this approach, as well as the SIFT program, to identify missense changes in exon 11 of the BRCA1 gene. The phylogenetic approach inferred that 38 of 139 missense changes affected function in exon 11. By contrast, SIFT
identified 36 of these 38 changes, and an additional 34. Fleming et al.52 hypothesized that SIFT predicted more changes because substitutions in all taxa are given the identical weight in SIFT. By contrast, under the assumption of the phylogenetic method, if substitutions are not shared by sister taxa, they are more likely to be sequencing errors. In addition, non-conservative substitutions maintained at sites in one pair of sister taxa are unlikely to be functionally significant. Applying this approach, Fleming et al.52 successfully predicted >85% of the known detrimental missense changes in several domains of BRCA1, showing the utility of this approach in identifying genetic variants that could be prioritized in further association studies.
Concerted efforts to further understand the functional role of genomic elements, including SNPs, are also underway. For example, the National Institutes of Health has set up a public research consortium known as ‘Encode’ (‘encyclopaedia of DNA elements’; REF. 53), with the goal of identifying and categorizing DNA functional elements including transcriptional regulatory sequences and determinants of chromosome structure and function. This initiative involves comparing existing computational and experimental approaches, and developing new methods for the identification and [...]
Although these and other approaches can be used to assist molecular epidemiologists in identification of genetic variants in specific nucleotides or residues that are most likely to be functionally significant, they will probably provide only limited information about the true function of a genetic variant. The methods described above generally cannot identify functional effects in non-coding regions, and cannot evaluate the effect of other factors on function, such as exposures, that might be involved in the aetiology of the disease. Recently, Rogan et al.54 proposed a computational approach to evaluate the putative effect of splicing mutations. They used information theory-based models to evaluate the relationship between variants and predicted splice sites and relevant phenotypes. The results presented by Rogan et al.54 corresponded to known functions of these genes and serve as a model for additional studies that evaluate the effect of both coding and non-coding variation on functionality (see also REF. 55).
Having interpreted the results from SIFT and related approaches, it might not be possible to identify whether a single variant within a putative functional domain is sufficient to disrupt function. Such approaches require sequence data and might therefore be limited by the sequence information that is available in public databases. Therefore, inferences of conservation (or the lack thereof) might simply reflect data limitations. Inconsistencies of inference using different classes of sequence data might also arise. For example, an analysis of sequences among human gene families might provide inferences that are different from across-species sequence analysis. Although potentially helpful in uncovering the role of a particular genetic region or variant, such differences might also obscure inferences about putative functional significance.

Figure 2 | Relationship of in silico indices of SNP function and associations (measured by log-transformed odds ratios) taken from the literature. a | The relationship between the position-specific independent count (PSIC) score difference for studied SNPs from PolyPhen (‘polymorphism phenotyping’) analyses and log-transformed odds ratios (Log ORs) from published associations. b | The relationship between the tolerance index for studied SNPs from SIFT (‘sorting intolerant from tolerant’) analysis and log ORs from associations. The data indicate that SNPs that are inferred as functional are more likely to be associated with higher odds-ratio effects. By inference, they are more likely to be causative of disease. Modified, with permission, from REF. 6 (2004) The American Association for Cancer Research.

The PolyPhen algorithm. Missense variants might affect protein folding, binding or interaction sites, as well as the solubility or stability of the protein. These effects can be estimated from physical considerations and from the context of an amino-acid replacement within the family of homologous proteins. Sunyaev et al.56 demonstrated that a significant fraction of missense variants (nsSNPs) are likely to affect protein structure or function. Their approach, implemented in the PolyPhen algorithm, uses
5 9 4 | AUGUST 2004 | VOLUME 5
amino-acid sequence, phylogenetic and structural
difference, and between tolerance index and PSIC
information to characterize the potential functional
score difference. So, the more probable it is that an SNP
is functionally relevant, the higher the OR effect and the
PolyPhen was developed to identify functionally
more likely it is that causative associations will be identi-
important SNPs by predicting whether an amino-acid
fied. These results imply that using a method to assess
change is likely to be deleterious for the protein on the
functional significance, such as PolyPhen, can optimize
basis of 3D structure and multiple alignment of homol-
the ability to identify meaningful and reproducible
ogous sequences57. The possibly damaging effect of vari-
molecular epidemiological associations.
ants in an amino-acid sequence can bedetermined if the
In addition to identifying important genetic variants
substitution is in an annotated active or binding site,
for research prioritization, genotyping efforts could be
affects interaction with ligands present in the crystallo-
reduced by eliminating amino-acid substitutions that
graphic structure, leadsto hydrophobicity or electrostatic
have been deemed ‘neutral’ by algorithms such as
charge change in a buriedsite, destroys a disulfide bond,
PolyPhen. In some cases, some genes could be removed
affects the protein’s solubility, inserts proline in an
from further consideration if all of the variant alleles
α-helix,or is incompatiblewith the profile of amino-acid
were deemed to be non-functional. Because prediction
substitutions observed at this site in the set of homolo-
algorithms provide numerical data, it is feasible to fur-
gous proteins. Mapping the amino-acid substitution tothe
ther subdivide the variant alleles of a gene into those
known 3D structure reveals if the change is likely to
that might have a moderate, modest or no impact. In
destroy the hydrophobic core of a protein, electrostatic
particular, when a gene is known to have many genetic
interactions, interactions with ligands, or other features
variants, prediction algorithms can also reduce the
effective number of alleles that need to be considered in
Briefly, PolyPhen uses protein sequence and variant
association studies. Therefore, the use of these app-
data to search homologous sequence data from publicly
roaches should limit the number of genotypes that
available protein databases. Only sequences with 30%
need to be considered, thereby limiting the potential for
or more identity to the protein of interest are consid-
false positive associations that might result from per-
ered. On the basis of the alignment of these homolo-
forming unnecessary hypothesis tests. The drawback of
gous sequences, profile scores can be computed for
these approaches is that potential interactions among
allelic variants. Profile scores, known as position-specific
combinations of substitutions at a gene might be
independent counts (PSICs), are logarithmic ratios of
ignored, and the dependence of functional impact on
the likelihood that a given amino acid occurs at a par-
genotype at other genes or on exposure might not be
ticular site to the background probability of the amino
addressed. Nonetheless, the algorithms for predicting
acid occurring at random at a given position. Large dif-
genotype impact on allele function provide an initial
ferences in PSIC values for specific genetic variants
and important step towards simplifying the seemingly
might indicate that the substitution of interest is rarely
overwhelming complexity of the genotype data. The
or never observed in the protein family. Finally,
concept of equivalent alleles provides a basis for addi-
PolyPhen maps the amino-acid substitution to the
tional steps towards reducing the complexity of the
known 3D structure of the protein to examine whether
data for molecular epidemiological studies.
the substitution might destroy the protein’s hydropho-bic core, electrostatic interactions or interactions with
Optimizing molecular epidemiological studies
ligands, or other important features of a protein based
To improve the probability of obtaining biologically
on the analysis of several structural parameters, and
and clinically meaningful associations between geno-
also on the analysis of several contact parameters.
types and disease outcomes, the design and interpreta-
Therefore, Polyphen can provide information about
tion of molecular epidemiological studies should
whether an nsSNP is probably damaging, possibly
include an assessment of functional significance. As
damaging, benign or whether its function is unknown.
outlined above and in TABLE 1, a number of criteria
Using this approach, Sunyaev et al.57 estimated that
and/or approaches can be used to assess functional sig-
20% of human nsSNPs might affect protein function,
nificance of a genetic variant. These include genetic
although the proportion of SNPs that are truly delete-
characteristics such as the type of genetic change (mis-
rious is probably substantially lower. Sunayev et al.57
sense, frameshift, nonsense, regulatory, splicing, disrup-
further estimated that there are 20,000–60,000 func-
tion of a known functional motif, etc.), population
tionally relevant nsSNPs in the human genome. By
genetics characteristics and evolutionary conservation
comparison, Fay et al.33 used a different model-based
of nucleotide sequences, experimental evidence from in
approach to estimate that 20–45% of nsSNPs are
vitro or animal model studies that consider repro-
slightly deleterious and reach population frequencies of
ducibility of functional findings, experimental condi-
1–10%, although as many as 80% of nsSNPs might
tions (such as cell/animal system, inducers/repressors)
that are used to evaluate a functional effect, the magni-
Zhu et al.6applied PolyPhen to the 166 molecular epi-
tude of the inferred biological effects, and the relevance
demiological studies mentioned above to examine the
of the experimental model to human systems, includ-
correlation between PSIC score and the OR associated
ing knowledge of the effect of the variant in the target
with a particular nsSNP (FIG. 2). They found a significant
tissue (that is, tissue specificity) and consideration of
inverse correlation between the ORs and PSIC score
timing, dose or duration of relevant exposures.
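The profile-score idea described for PolyPhen (a logarithmic ratio of how often an amino acid is seen at an aligned site to its background probability) can be illustrated with a deliberately simplified sketch. This is not PolyPhen's PSIC implementation: the function names, the uniform background, the pseudocount smoothing and the example alignment column are all assumptions made for illustration, and real PSIC scores also weight sequences by their relatedness.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def profile_score(column, residue, background=None, pseudocount=1.0):
    """Log ratio of the smoothed frequency of `residue` in one alignment
    column to its background frequency (a simplified, PSIC-like score)."""
    if background is None:
        # Assumption: uniform background over the 20 amino acids.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    counts = Counter(aa for aa in column if aa in background)
    total = sum(counts.values()) + pseudocount * len(background)
    freq = (counts[residue] + pseudocount) / total
    return math.log(freq / background[residue])


def score_difference(column, wild_type, variant):
    """Absolute difference between the wild-type and variant profile scores."""
    return abs(profile_score(column, wild_type) - profile_score(column, variant))


# Hypothetical alignment column: residues seen at one site in ten homologues.
column = "LLLLLLLLIL"
conservative = score_difference(column, "L", "I")  # Ile is observed in the family
radical = score_difference(column, "L", "P")       # Pro is never observed
print(f"L->I: {conservative:.3f}  L->P: {radical:.3f}")
```

In this toy column the substitution that is never observed among the homologues receives the larger score difference, which mirrors the statement that large differences suggest a substitution is rarely or never seen in the protein family.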
Table 1 | Criteria for assessing the functional significance of a genetic variant in candidate gene association studies
Columns: Criteria; Strong support for functional significance; Moderate support for functional significance; Evidence against functional significance
Variant is a missense change or disrupts a
putative functional motif; changes to protein
Evidence for conservation across species
In the absence of laboratory error, strong
In the absence of laboratory error, moderate Population genetics data indicates
to small deviations from expected population no deviations from expected
frequencies in cases and/or controls in a
frequencies in cases and/or controls; effects proportions
Consistent effects from multiple lines of
Some (possibly inconsistent) evidence for
function from experimental data; effect in
context is established; effect in target
human context or target tissue is unclear
Exposures (for example, Variant is known to affect the
genotype–environment metabolism of the exposure in
effect in target tissue might not be known
moderate-to-large magnitude associations replication studies are not available
Information from previously published epidemiological investigations indicating the effect of an SNP can also be considered for studies that attempt to replicate association findings. For example, an ad hoc measure of support for the functional significance of a particular variant could be created by weighting in favour of a functionally significant effect if the criteria for ‘strong support’ or ‘moderate support’ for functional significance are available, and therefore might be prioritized for association studies. Similarly, variants for which there is no or neutral functional information might be ranked lower in priority for association studies. Variants for which there is evidence against functional significance should not be considered in association studies unless the hypothesis and association study methods consider linkage disequilibrium approaches to identify candidate regions of interest. In this case, the study would assume that the variants under investigation are in linkage disequilibrium with the causative allele(s). Variants would be chosen for analysis on the basis of polymorphic content and haplotype or population
How would this approach be applied for a specific candidate SNP? For example, applying the criteria outlined in TABLE 1 for CYP3A4*1B, we learn first that CYP3A4*1B disrupts a regulatory motif (NFSE)12. Because the function of NFSE is not clear, we might therefore conclude that there is ‘moderate support’ for function on the basis of the ‘nucleotide sequence’
The regulatory region of CYP3A4 is similar, at the sequence level, to many members of the CYP3A multigene family, so we might conclude that there is strong support for function on the basis of the ‘evolutionary conservation’ criterion. However, because the role of putative regulatory domains in CYP3A genes58 is still debatable, the assessment of ‘evolutionary conservation’
The population genetics of CYP3A4*1B has been well described59, but no data have been reported about deviation of frequencies from Hardy-Weinberg proportions in cases versus controls, so this criterion provides little support for or against function.
The ‘experimental evidence’ about CYP3A4*1B function (FIG. 1) has been controversial, but there seems to be a small but consistent increase in expression associated with CYP3A4*1B. This relatively small effect might be insufficient to confer major effects on drug metabolism, but indicates possible associations with altered disease risk. We could therefore conclude that the ‘experimental evidence’ criterion suggests ‘moderate support’ for functional significance.
It is well known that CYP3A4 metabolizes compounds that are strongly associated with disease risk, such as testosterone metabolism in prostate cancer aetiology11. Therefore, there is strong support that the product of this gene metabolizes relevant aetiological exposures, including steroid hormones and carcinogens11.
Finally, there are numerous epidemiological studies that report an association of CYP3A4*1B with various phenotypes, thereby providing strong support for function on the basis of the ‘epidemiological evidence’ criterion. So, these criteria suggest moderate to strong support for the hypothesis that CYP3A4*1B is a functionally
By applying these or similar criteria, molecular epidemiologists might be able to maximize the chance that association studies involve functionally significant genetic variants, and therefore could reduce the likelihood of TYPE I ERRORS, increase reproducibility of candidate gene association studies, and facilitate interpretation of positive associations. Even in the absence of a definitive function of a genetic variant, candidate gene association studies should not be undertaken without some consideration of
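The ad hoc weighting of TABLE 1 criteria described above could be sketched as follows. The numerical weights, the function names and the three priority labels are hypothetical choices made only for illustration; the text does not prescribe any particular scoring scheme. The CYP3A4*1B ratings follow the worked example, except that ‘evolutionary conservation’ is entered as ‘moderate’ on the assumption that the debated role of CYP3A regulatory domains tempers the initial ‘strong’ rating.

```python
# Hypothetical weights for the support levels used in TABLE 1.
SUPPORT_WEIGHTS = {"strong": 2, "moderate": 1, "neutral": 0, "against": -1}


def support_score(assessments):
    """Sum the weights of the per-criterion support levels for one variant."""
    return sum(SUPPORT_WEIGHTS[level] for level in assessments.values())


def priority(assessments):
    """Map per-criterion assessments to a coarse priority label."""
    if "against" in assessments.values():
        # Evidence against function: consider only in an LD-based design.
        return "exclude unless a linkage-disequilibrium design is used"
    if support_score(assessments) > 0:
        return "higher priority"
    return "lower priority"  # no or neutral functional information


# Ratings taken from the CYP3A4*1B example in the text (see the note above
# about the assumed 'evolutionary conservation' rating).
cyp3a4_1b = {
    "nucleotide sequence": "moderate",
    "evolutionary conservation": "moderate",  # assumption
    "population genetics": "neutral",
    "experimental evidence": "moderate",
    "exposures": "strong",
    "epidemiological evidence": "strong",
}

print(support_score(cyp3a4_1b), priority(cyp3a4_1b))
```

A simple additive score like this ignores interactions between criteria, so it is best read as a bookkeeping aid for ranking candidate variants rather than as a validated measure.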
hypothesis is correct. Similarly, the false positive rate.
1. et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genet. 22, 231–238 (1999).
2. Salisbury, B. A. et al. SNP and haplotype variation in the human genome. Mutat. Res. 526, 53–61 (2003).
3. Schneider, J. A. et al. DNA variability of human genes. Mech. Ageing Dev. 124, 17–25 (2003).
4. Schork, N. J., Fallin, D. & Lanchbury, J. S. Single nucleotide polymorphisms and the future of genetic epidemiology. Clin. Genet. 58, 250–264 (2000).
5. sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928–933 (2001).
6. et al. An evolutionary perspective on SNP screening in molecular cancer epidemiology. Cancer Res. 64,
The first comprehensive evaluation and comparison of SIFT and PolyPhen algorithms in molecular epidemiological association studies.
7. Lohmueller, K. E., Pearce, C. L., Pike, M., Lander, E. S. & Hirschhorn, J. N. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nature Genet. 33,
A comprehensive evaluation of the consistency of association studies that demonstrates the need for functional correlates in achieving consistency in association study results.
8. Botstein, D. & Risch, N. Discovering genotypes underlying human phenotypes: past successes for mendelian disease, future approaches for complex disease. Nature Genet. 33
9. et al. Defects in pre-mRNA processing as causes of and predisposition to diseases. DNA Cell Biol. 21,
10. Knight, J. C. Functional implications of genetic variation in non-coding DNA for disease susceptibility and gene regulation. Clin. Sci. (Lond.) 104, 493–501 (2003).
11. Li, A. P., Kaminski, D. L. & Rasmussen, A. R. Substrates of human hepatic cytochrome P450 3A4. Toxicology 104, 1–8
12. et al. Gene structure of CYP3A4, an adult-specific form of cytochrome P450 in human livers and its transcriptional control. Eur. J. Biochem. 218, 585–595 (1993).
13. Rebbeck, T. R., Jaffe, J. M., Walker, A. H., Wein, A. J. & Malkowicz, S. B. Modification of clinical presentation of prostate tumors by a novel genetic variant in CYP3A4. J. Natl Cancer Inst. 90, 1225–1229 (1998).
14. et al. Association between a CYP3A4 genetic variant and clinical presentation in African-American prostate cancer patients. Cancer Epidemiol. Biomarkers Prev. 8,
15. Felix, C. A. et al. Association of CYP3A4 genotype with treatment-related leukemia. Proc. Natl Acad. Sci. 95,
16. Kadlubar, F. F. et al. The putative high activity variant, CYP3A4*1B, predicts the onset of puberty in young girls. Cancer Epidemiol. Biomarkers Prev. 12, 327–331 (2003).
17. Lai, J., Vesprini, D., Chu, W., Jernstrom, H. & Narod, S. A. CYP gene polymorphisms and early menarche. Mol. Genet. Metab. 74, 449–457 (2001).
18. et al. Genetic factors related to racial variation in plasma levels of insulin-like growth factor-1: implications for premenopausal breast cancer risk. Mol. Genet. Metab. 72, 144–154 (2001).
19. Lamba, J. K. et al. Common allelic variants in cytochrome P4503A4 and their prevalence in different populations. Pharmacogenetics 12, 121–132 (2002).
20. Westlind, A., Lofberg, L., Tindberg, N., Andersson, T. B. & Ingelman-Sundberg, M. Interindividual differences in hepatic expression of CYP3A4: relationship to genetic polymorphism in the 5′-upstream regulatory region. Biochem. Biophys. Res. Commun. 259, 201–205 (1999).
21. Amirimani, B., Walker, A. H., Weber, B. L. & Rebbeck, T. R. Response: re: modification of clinical presentation of prostate tumors by a novel genetic variant in CYP3A4. J. Natl Cancer Inst. 91, 1588–1590 (1999).
22. et al. Re: modification of clinical presentation of prostate tumors by a novel genetic variant in CYP3A4. J. Natl Cancer Inst. 91, 1587–1590 (1999).
23. Spurdle, A. B. et al. The CYP3A4*1B polymorphism has no functional significance and is not associated with risk of breast or ovarian cancer. Pharmacogenetics 12, 355–366
24. D. et al. Genotype-phenotype associations for common CYP3A4 and CYP3A5 variants in the basal and induced metabolism of midazolam in European- and African-American men and women. Pharmacogenetics 13,
25. et al. Transcriptional activity effects of a CYP3A4 promoter variant. Environ. Mol. Mutagen. 42,
26. Hamzeiy, H., Bombail, V., Plant, N., Gibson, G. & Goldfarb, P. Transcriptional regulation of cytochrome P4503A4 gene expression: effects of inherited mutations in the 5′-flanking region. Xenobiotica 33, 1085–1095 (2003).
27. Jeon, J. & An, G. Gene tagging in rice: a high throughput system for functional genomics. Plant Sci. 161, 211–219
28. Cecconi, F. & Meyer, B. I. Gene trap: a way to identify novel genes and unravel their biological function. FEBS Lett. 480,
29. Adams, M. D. ENU mutagenesis for pharma. Drug Discov. Today 8, 199–200 (2003).
30. Lee, Y. S. & Mrksich, M. Protein chips: from concept to practice. Trends Biotechnol. 20 (Suppl.), S14–18 (2002).
31. et al. EICO (expression-based imprint candidate organizer): finding disease-related imprinted genes. Nucleic Acids Res. 32 (database issue), D548–551 (2004).
32. Knight, J. C., Keating, B. J., Rockett, K. A. & Kwiatkowski, D. P. In vivo characterization of regulatory polymorphisms by allele-specific quantification of RNA polymerase loading. Nature Genet. 33, 469–475 (2003).
The authors report a new method and application of experimental approaches to assessing genotype function.
33. Fay, J. C., Wyckoff, G. J. & Wu, C. I. Positive and negative selection on the human genome. Genetics 158, 1227–1234
34. Akey, J. M., Zhang, G., Zhang, K., Jin, L. & Shriver, M. D. Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 12, 1805–1814 (2002).
35. Feder, J. N. et al. A novel MHC class I-like gene is mutated in patients with hereditary haemochromatosis. Nature Genet. 13, 399–408 (1996).
36. Nielsen, D. M., Ehm, M. G. & Weir, B. S. Detecting marker-disease association by testing for Hardy-Weinberg disequilibrium at a marker locus. Am. J. Hum. Genet. 63,
37. Hoh, J., Wille, A. & Ott, J. Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res. 11, 2115–2119 (2001).
The authors propose a novel approach to association studies that incorporates both association and population genetics information in identifying disease genes, including the possibility of genome-wide associations.
38. Perutz, M. F. Structure and function of haemoglobin. I. A tentative atomic model of horse oxyhaemoglobin. J. Mol. Biol. 13, 646–668 (1965).
39. Wang, Z. & Moult, J. Three-dimensional structural location and molecular functional effects of missense SNPs in the T cell receptor Vβ domain. Proteins 53, 748–757
40. Wang, Z. & Moult, J. SNPs, protein structure, and disease. Hum. Mutat. 17, 263–270 (2001).
41. Chasman, D. & Adams, R. M. Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J. Mol. Biol. 307, 683–706 (2001).
42. Ferrer-Costa, C., Orozco, M. & de la Cruz, X. Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties. J. Mol. Biol. 315, 771–786 (2002).
43. Saunders, C. T. & Baker, D. Evaluation of structural and evolutionary contributions to deleterious mutation prediction. J. Mol. Biol. 322, 891–901 (2002).
44. et al. Prediction of deleterious functional effects of amino acid mutations using a library of structure-based function descriptors. Proteins 53, 806–816 (2003).
45. Miller, M. P. & Kumar, S. Understanding human disease mutations through the use of interspecific genetic variation. Hum. Mol. Genet. 10, 2319–2328 (2001).
46. Koref, M. E. S., Gangeswaran, R., Koref, I. P. S., Shanahan, N. & Hancock, J. M. A phylogenetic approach to assessing the significance of missense mutations in disease genes. Hum. Mutat. 22, 51–58 (2003).
47. Krishnan, V. G. & Westhead, D. R. A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function. Bioinformatics 19, 2199–2209 (2003).
48. Ng, P. C. & Henikoff, S. Predicting deleterious amino acid substitutions. Genome Res. 11, 863–874 (2001).
An outline of the SIFT approach to assessing missense variant function using evolutionary similarity.
49. Ng, P. C. & Henikoff, S. Accounting for human polymorphisms predicted to affect protein function. Genome Res. 12, 436–446 (2002).
50. Ng, P. C. & Henikoff, S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 31,
51. Ramensky, V., Bork, P. & Sunyaev, S. Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 30, 3894–3900 (2002).
An outline of the PolyPhen methodology for using evolutionary and structure data to assess SNP function.
52. Fleming, M. A., Potter, J. D., Ramirez, C. J., Ostrander, G. K. & Ostrander, E. A. Understanding missense mutations in the BRCA1 gene: an evolutionary approach. Proc. Natl Acad. Sci. USA 100, 1151–1156 (2003).
53. National Institutes of Health. The ENCODE Project: ENCyclopedia Of DNA Elements [online], <http://www.genome.gov/10005107> (2003).
54. Rogan, P. K., Svojanovsky, S. & Leeder, J. S. Information theory-based analysis of CYP2C19, CYP2D6 and CYP3A5 splicing mutations. Pharmacogenetics 13, 207–218 (2003).
55. Pagani, F. & Baralle, F. E. Genomic variants in exons and introns: identifying the splicing spoilers. Nature Rev. Genet. 5, 389–396 (2004).
56. Sunyaev, S., Ramensky, V. & Bork, P. Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet. 16, 198–200 (2000).
57. et al. Prediction of deleterious human alleles. Hum. Mol. Genet. 10, 591–597 (2001).
58. Schuetz, E. G. Lessons from the CYP3A4 promoter. Mol. Pharmacol. 65, 279–281 (2004).
59. Zeigler-Johnson, C. M. et al. Ethnic differences in the frequency of prostate cancer susceptibility alleles at SRD5A2 and CYP3A4. Hum. Hered. 54, 13–21 (2002).
Some of the work discussed in this review was supported by grants from the Public Health Service and the University of
The authors declare that they have no competing financial interests.
Online links
DATABASES
The following terms in this article are linked online to:
Entrez Gene: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
FURTHER INFORMATION
CODDLE: http://www.proweb.org/coddle
PolyPhen: http://www.bork.embl-heidelberg.de/PolyPhen
SIFT: http://blocks.fhcrc.org/%7Epauline/SIFT