Table Of ContentWangetal.BMCGenomics2013,14(Suppl1):S11
http://www.biomedcentral.com/1471-2164/14/S1/S11
PROCEEDINGS Open Access
A probabilistic method for identifying rare
variants underlying complex traits
Jiayin Wang2*, Zhongmeng Zhao1*, Zhi Cao1, Aiyuan Yang1, Jin Zhang2
From The Eleventh Asia Pacific Bioinformatics Conference (APBC 2013)
Vancouver, Canada. 21-24 January 2013
Abstract
Background: Identifying the genetic variants that contribute to disease susceptibilities is important both for
developing methodologies and for studying complex diseases in molecular biology. It has been demonstrated that
the spectrum of minor allelic frequencies (MAFs) of risk genetic variants ranges from common to rare. Although
association studies are shifting to incorporate rare variants (RVs) affecting complex traits, existing approaches do
not show a high degree of success, and more efforts should be considered.
Results: In this article, we focus on detecting associations between multiple rare variants and traits. Similar to
RareCover, a widely used approach, we assume that variants located close to each other tend to have similar
impacts on traits. Therefore, we introduce elevated regions and background regions, where the elevated regions
are considered to have a higher chance of harboring causal variants. We propose a hidden Markov random field
(HMRF) model to select a set of rare variants that potentially underlie the phenotype, and then, a statistical test is
applied. Thus, the association analysis can be achieved without pre-selection by experts. In our model, each variant
has two hidden states that represent the causal/non-causal status and the region status. In addition, two Bayesian
processes are used to compare and estimate the genotype, phenotype and model parameters. We compare our
approach to the three current methods using different types of datasets, and though these are simulation
experiments, our approach has higher statistical power than the other methods. The software package, RareProb
and the simulation datasets are available at: http://www.engr.uconn.edu/~jiw09003.
Introduction have limited diagnostic value [2,3]. While the identifica-
In most existing genetic variant association studies, tion of common variants creates a dilemma, known as
“common trait, common variants”, which asserts that “common trait, rare variants”, an alternative hypothesis,
common genetic variants contribute to most of traits which asserts that multiple rare variants with moderate
(disease susceptibilities), serves as the central assump- to high penetrances may collectively influence disease
tion. Researchers have successfully identified some sig- susceptibilities, has been suggested in some literatures
nificant associations between common single nucleotide [3-5]. Rare variants are defined as those whose minor
polymorphisms (SNPs) and disease traits [1]. However, allele frequencies (MAF) are less than or equal to 0.01
despite the enormous efforts expended on association (≤ 10-2). Although some rare variants associated with
studies of complex traits, common genetic variants only Mendeliandiseaseshavebeenidentified,moreoften,the
show a moderate influence on different phenotypes in allelicpopulationattributablerisk(PAR),whichdescribes
many reported disease associations and consequently a small reduction in the incidence that would be
observed in unexposed samples compared to the actual
exposurepattern,islow.Theoddsratio(OR),ameasure
*Correspondence:[email protected];[email protected]
1DepartmentofComputerScienceandTechnology,Xi’anJiaotong of the strength of association or non-independence
University,Xi’an,P.R.China between two binary data values, is also low. Moreover,
2ComputerScienceandEngineeringDepartment,UniversityofConnecticut, basedonthe“commontrait,rarevariants”hypothesis,in
Storrs,Connecticut06269-2155,USA
Fulllistofauthorinformationisavailableattheendofthearticle
©2013Wangetal.;licenseeBioMedCentralLtd.ThisisanopenaccessarticledistributedunderthetermsoftheCreativeCommons
AttributionLicense(http://creativecommons.org/licenses/by/2.0),whichpermitsunrestricteduse,distribution,andreproductionin
anymedium,providedtheoriginalworkisproperlycited.
Wangetal.BMCGenomics2013,14(Suppl1):S11 Page2of9
http://www.biomedcentral.com/1471-2164/14/S1/S11
manycases,asetofrarevariants,insteadofjustonevar- alsocomputethepair-wiselikelihoodofcandidatevariants
iant, should be identified to fully explain the genetic beingcollapsedtogether.Notethatalthoughitisdifficult
influence. Boththe single-variant test[6] and the multi- toobserve,relativelylowinteractions(e.g.,linkagedisequi-
ple-varianttest[7]havebeenappliedtorarevariantasso- librium) are expected between rare variants [4,11,13,20].
ciation studies. However, due to the reasons outlined Furthermore, in regression-based association methods,
above, neither of them shows satisfactory power in genetic similarities are often used to reduce the dimen-
obtaining associations. Although more and more atten- sionsoftheregressionmodels.Therefore,we introduced
tion is being focused upon rare variants, there has only two kindsofgenetic regions, the elevated regionandthe
beenlimitedsuccessthusfar[8-10]. background region, in our model analysis; the elevated
Alternatively, the collapsing strategy, also called the regionhasahigherprobabilityofharboringacausalvar-
“burden-basedtest”,isanotherapproachforrarevariants iant. This assumption that the causal variants are often
associationstudies.Mostofthecollapse-basedapproaches located close to each other is often used, e.g. slide win-
build on the “recessive-set” genetic model, in which the dows in RareCover [19]. However, the regions are more
predisposing haplotype contains mutation(s) in at least flexiblethanapresetslidewindow,asinRareCover.
onevariant[11]. Multiplerarevariantsinthesamelocus We adopt the “dominant” and “recessive set” genetic
arecollapsed,basedondifferentstandards,thenstatistical model, which are also used in [9-11,14,15,19]. In the
tests are applied. The locus here is defined as a selected dominantandrecessive-setmodel,thepredisposinggen-
region thatconsistsofa groupofcandidate rare variants otype harbors the mutation(s) in at least one variant on
[9,12-14]. However, it is argued that existing collapse- any of the two haplotypes. Therefore, for one genotype,
basedapproachesassumeallrarevariantsimplicitlyinflu- there are two possible allelic values at each variant: one
encing thephenotypeinthesame directionandwith the denotes that both haplotypes carry a wide-type allele,
same magnitude [10,15]. Researchershave observed that whiletheotherdenotesthatatleastonehaplotypecarries
anygivenrarevariantcouldhavenoeffect,couldbecau- a mutant. In our method, each variant has two hidden
sal, or couldbe protective for the endpoints (traits) [15]. states,causal/non-causalstatusandelevated/background
For example, some low-frequency variants in African regionstatus.TheMRFincludesthehiddenstates,emis-
AmericansPCSK9 canhaveasubstantialeffectonserum sionprobabilitiesandtransitionprobabilities.The emis-
Low-Density Lipoprotein Cholesterol (LDL-C)by increas- sion probabilities bridge the hidden states and the
ingtheriskoforprotectingagainstmyocardialinfarction genotypes,whilethe transitionprobabilitieslink thetwo
[16-18]. hiddenstates.Followingthepseudo-likelihoodestimation
Collapse-based approacheshave low statistical powers method [21], we infer the model parameters and all of
when “causal”, “neutral” and “protective” variants are thehiddenstates.Thesimulationexperimentsshowthat
combined [13,15,19]. To overcome this weakness, some our approach outperforms RareCover, RWAS [14] and
approaches [9,14] assume that the rare variants are well LRT [9] on different parametric settings. In particular,
selected by experts, while weighting of each variant is RareProbobtainsbetterresultsonlarge-scaledata.
another widelyusedstrategy[9,11,14].Inarecentstudy,
Bhatia and others [19] suggest the development of a Methods
“model-free” approach, RareCover, that only collapses a Notions and model overview
subsetofpotentiallycausalvariantsfromallofthegiven Suppose we are given M rare variants (allelic sites) on a
variants.Here,the “model” refersto the geneticassocia- set of N genotypes. Let s denote the allelic value of the
i
tion model that consists of the pre-selection candidate site s on the genotype i (1 ≤ i ≤ N, 1 ≤ s ≤ M), where s
i
variants. = 0 means both haplotypes of i have the wild type allele,
MotivatedbyRareCover,inthisarticle,wefocusonrare while s = 1 means at least one haplotype has a mutant
i
variant association analysis without any pre-selection of allele. Each genotype carries a dichotomous phenotype.
candidate variants. We propose a probabilistic approach, Let vector P denotes the phenotypes, where P = 1
i
RareProb,tomakeselectionsusingaMarkovrandomfield represents that i is affected by the phenotype trait
(MRF) model and identify multiple causal rare variants (being a case), while P = 0 represents that i is a control.
i
that influence a dichotomousphenotype using statistical The core of our approach is a Markov random field
tests.Ourapproachconsidersboththecausalandthepro- (MRF) model. We first introduce four key components
tective variants, whichdistinguishesit fromthe previous of modeling this MRF:
study RareCover,anditisthereforea robust predictorof
the direction and the magnitude of the genetic effects. (cid:129) The observed data of this MRF consist of all of the
Moreover, inspired by the weight-sum approaches genotypes and phenotypes.
[9,11,14], we also weight each variant; however, we not (cid:129) There are two unknown states for each site: one is
only considerthe likelihoodof avariantbeingcausal but the causal or non-causal status and the other is the
Wangetal.BMCGenomics2013,14(Suppl1):S11 Page3of9
http://www.biomedcentral.com/1471-2164/14/S1/S11
region location status. Here, we define them as the and the second-order clique potentials because there is
hidden states of this Markov random field. Let a scantyevidencesupportingthebiologicalormedicalsce-
latent vector R represent the region status, where nario of high-order potentials. For each variable, the
R = 1 denotes that the site s is located in an ele- MAFs and model parameters can be estimated by maxi-
s
vated region, while R = 0 denotes the s is located in mizing the likelihoodsof the genotypes. Then, the prob-
s
a background region. Additionally, let a latent vector ability of the variable and the variable itself can be
X represent the causal/non-causal status, where X = updated by MAFs and model parameters. Two or three
s
1 if the site s is causal (contributes to the pheno- iterationscanbeappliedifneededfortheconvergenceof
type); otherwise, X = 0. Probabilistic functions are theMRF.Thus,ourapproachselectsasubsetofcandidate
s
designed to present the probabilities of each hidden causal variants by updating the variables and avoids the
state. The RareProb framework is able to incorporate weakness of the same magnitude effect assumption
prior information, obtained by different software becausetheneighborhoodsystemisabletodescribeboth
tools, e.g. Align-GVGD [22] and SIFT [23], etc, by the“causal”and“protective”variants.
updating initial X vector and R vector.
(cid:129) A neighborhood system is required in the MRF Estimation of the hidden states in HMRF
model to describe the interactions among hidden Neighborhood system
states. Details of the hidden states and neighborhood Assume there are N/2 casesand N/2controlsamong all
system are shown in the section “Estimation of the of the genotypes (if the number of cases is not equal to
transition probabilities in HMRF”. thenumberofcontrols,thenalloftheresultsstillcanbe
(cid:129) There are two kinds of probabilities in the MRF usedbyapplyingnoncentralityparameters).Atacertain
model:emissionprobabilitiesandtransitionprobabil- variants,letθ denotetheMAFforthecases,andletthe
s
ities. Emission probabilities bridge the relationships number of genotypes in cases that carry at least one
among genotypes, phenotypes and hidden states. mutant allele be c+. Let r denote the MAF for the con-
s s
Moreover,hiddenstatesXandRarenotindependent trols, and let the number of genotypes in controls that
ofeachother,astherelationshipsbetweenthehidden −
carryatleastonemutantallelebe c .Then,wecandraw
s
states are described by the transition probabilities.
twobinomialdistributionsforthecasesandthecontrols
The conditional probabilityP(X =1|R = 1) denotes (cid:2) (cid:3) (cid:2) (cid:3)
s s [9,14]: c+ ∼Bin Nθ and c− ∼Bin N,ρ , where
theprobabilitythatthesitesisacausalsitewhenitis s 2 s s 2 s
located in an elevated region, while P(Xs = 0|Rs = 1) f(c+s|θs)=CcN+s θsc+s(1−θs)N2−c+s and f(c−s|ρs)=CcN−s ρsc−s(1−ρS)N2−c−s.
denotes the probability that the site s is non-causal 2 2
when it is located in an elevated region. Similarly, Thus,forasites,thestatisticofthedifferencebetweenθ
another two conditional probabilities, P(X = 1|R = andris
s s
(cid:4) (cid:5)
0) and P(X = 0|R = 0), present the probabilities of
being causasl or nosn-causal if the site is located in a 2 θˆs−ρˆs
backgroundregion.Detailsoftheemissionprobabil- zs = (cid:6) (cid:7)(cid:4) (cid:5)(cid:4) (cid:5)
itiesareshowninthesection“Estimationoftheemis- N2 θˆs+ρˆs 2−θˆs−ρˆs
tpshiroeonbtraapbnrilsoiitbtiieaosbnialpriretoiesbhsaobiwinlintHieinsMitnRheFH”sM,ecaRtniFod”n. t“hEesttimraantsioitnioonf where θˆs = Nc/+s2 is the estimation of θs and ρˆs = Nc−s/2
is the estimation of r. Similar to the linear kernel func-
s
tion, which calculates genetic similarities [10], we mea-
The central thesis of our approach is that causal rare
sure the likelihood between pairwise rare variants, which
variants,whichshouldbecollapsedtogether,aretreatedas
denotes how likely two variants would be collapsed
one random vector variable with certain dimensions.
Then,theprobabilityofthisbunchofcausalrarevariants together. For two variants s and s’, we define ωs,s’ as the
likelihood of collapsing as follows:
becomes the probability of one variable being associated
wleintchet[h2e1p],htehneoptyrpoeb.aBbialsiteydoofnthtihseraMnadrokmovv-Gariibabbsleecqaunivbae- ωs,s(cid:3) = z22z+szzs(cid:3)2
s s(cid:3)
decomposed into the sum of clique potentials. The first-
ordercliquepotentialsdescribetheprobabilityofonevar- The ω function has the following properties: (1) When
iantbeingcausal,whilethesecond-ordercliquepotentials both s and s’ are causal variants, due to the PAR, ωs,s’
measurethepair-wisegeneticsimilarities,whichsharethe locates in the interval (0, 1]. (2) If one variant is “causal”
idea of the kernel machine in regression frameworks but the other is “protective”, the likelihood takes on a
[10,24,25]. The neighborhood system in the MRF model negative value. (3) The likelihood encourages the col-
consists of clique potentials. In our approach, we select lapse of the variants with similar PAR. Those rare var-
thattheneighborhoodsystemonlycontainsthefirst-order iants whose MAFs increase rapidly in some cases, as we
Wangetal.BMCGenomics2013,14(Suppl1):S11 Page4of9
http://www.biomedcentral.com/1471-2164/14/S1/S11
morepnatiior-nweidsebteefostrse,,wchouiclhdabreeiodfetnentifineodtbcyonssinidgelere-sditientceostls- π(ρs)= ρsαρs−B1((α1ρ−,βρρs))βρs−1 and π(ρs)= ρsαρs−B1((α1ρ−,βρρs))βρs−1,
lapsing models [8]. Let ω.,. be the weight of two neigh- where a(·) ands b(s·) are hyper-parameters in tshesprior
bors. The closer the statistics zs and zs’ are, the larger distributions [26-28]. Then, the marginal distribution
the likelihood will be. And thus, the neighborhood sys- of c+ is
s
tem is built up.
(cid:2) (cid:3)
HRiniadradeebnvaacsrktiaagtnreotsusnids reeitghioern.loTchautesd, wine adnefienleevtahteedprroegbiaobniliotyr m(c+s)=CcN2+s (cid:13)(cid:13)(α(αθsθ)s(cid:13),β(βθsθ)s)(cid:13)(αθs(cid:13)+(cid:2)cα+sθ)s(cid:13)+βN2θs−+cN2+s(cid:3)+βθs
(Bayesian classifier) of s as The marginal distribution of c− is similar. The prob-
s
⎛ ⎞
ability of the observed genotypes on s is equal to the
(cid:10) (cid:2) (cid:3)
p(Xs|Xn(s))∝exp⎝γp(Xs|Rs)Xs+η ωs,s(cid:3)p(Xs(cid:3)|Rs(cid:3))Xs(cid:3)⎠ m c·s
s(cid:3)∈n(s) sum of Cc·s Thus we have:
N
2
wheren(s)denotestheneighborsofs.gandharetwo
MXRaFffepcatrsamtheeteprrso.bgarbeiplirteyseonftsXh,owwhsilterohngrleyptrheesesnttastuhsoowf P(Ys|Xs=1)=(cid:13)(cid:13)(α(αθsθ)s(cid:13),β(βθsθ)s)(cid:13)(αθs(cid:13)+(cα+sθ)s(cid:13)+(βN2θs−+cN2+s)+βθs)×(cid:13)(cid:13)(α(αρsρ)s(cid:13),β(βρsρ)s)(cid:13)(αρs(cid:13)+(cα−sρ)s(cid:13)+(βN2ρs−+cN2−s)+βρs)
s s
strongly the neighbors of s affect the probability of Xs. On the other hand, if Xs = 0, then there is no PAR
Here, we limit h >0, which encourages the pair-wise between θ and r that infers θ = r . Here, we simply
s s s s
weightsandprevents themfromcounteractingthenega- use r to denote the MAF of s for both the cases and
s
tiveweights.Thus,thejointprobabilityofthelatentvector controls. Thus, we have
(cid:4) (cid:13) (cid:13) (cid:5)
Xis p(X;(cid:7))∝exp γ Ms p(Xs|Rs)+η s(cid:3)∈n(s)ωs,s(cid:3)p(Xs(cid:3)|Rs(cid:3)) ,where (cid:13)(α,β) (cid:13)(α +c)(cid:13)(N−c +β)
F = (g, h). As the variants in different subsets (different P(Ys|Xs =0)= (cid:13)(α)s(cid:13)(βs ) s(cid:13)(αs +β +Ns) s
collapsinggroups)areconditionalindependent,thisjoint s s s s
probability covers all of the probabilities of the random where c =c++c−. We have now obtained all the
s s s
variables(collapsinggroups).Similarly,theprobabilityofs three emission probabilities of this HMRF: p (Y|X), P
locatedinanelevatedregioncanberepresentedby (Y|X = 0) and P (Y|X = 1).
s s s s
⎛ ⎞
(cid:2) (cid:3) (cid:10)
p Rs|Rn(s) ∝exp⎝τR+υ ωs,s(cid:3)Rs(cid:3)⎠ ETshtiemtartainosnitoiofnthperotrbaanbsiitliitoinesplrionbkatbhielithieisddinenHMstRatFesXand
s(cid:3)∈n(s) R. Let c+ be the counts of the causal variants on all of
X
and the joint probability of(cid:4)latent vector R can b(cid:5)e theelevatedregions,andletcEbethenumberofvariants
(cid:13) (cid:13) −
represented by p(R;(cid:7)R)∝exp τ Ms R+υ s,s(cid:3)ωs,s(cid:3)Rs(cid:3) , in those regions. Let cX be the counts ofthe causal var-
iants on all of the background regions, and c be the
where F = (τ, υ)·τ and υ are two MRF parameters. We B
R numberof variantsin those regions.Then, we draw two
also limit υ >0, which encourages the pair-wise weights binomial distributions: c+ ∼Bin(c , ξ);c− ∼Bin(c , ζ)
and prevents them from counteracting the negative X E X B
where ξ = P (X = 1|R = 1) andζ = P (X = 1|R = 0). We
weights.
alsoplacethepriordistributionsonξandζ,asfollows:
Estimation of the emission probabilities in HMRF
WenowestimatetheemissionprobabilitiestorelateXand c+X ∼Bin(cE, ξ);f(c+X|ξ)=Ccc+XEξc+X(1−ξ)cE−c+X
Rwiththeobserveddata.Aslinkagedisequilibriumisrarely
and
observedbetweenrarevariants[8],thevectorconsistsof
tpheenadlelenlitcfvraolmuesthfreomothoneresv,awrihanenttahaptaisrtciocnudlaitrioXnailslygiinvdene-. c−X ∼Bin(cB, ζ);f(c−X|ζ)=Ccc−XBζc−X(1−ζ)cB−c−X
Thus,thejointconditionalprobabilityofallofthegeno- where ξ = P (X = 1|R = 1) and ζ = P (X = 1|R = 0).
typesis We also place the prior distributions on ξ and ζ, as
(cid:14) (cid:15)
follows:
(cid:10)M
p(Y|X)=exp p(Y|X)
s s ξαξ−1(1−ξ)βξ−1 ζαζ−1(1−ζ)βζ−1
s=1 π(ξ)= ;π(ζ)=
B(αξ,βξ) B(αζ,βζ)
If X = 1, due to the PAR, θ ≠ r . We place a prior
s s s
distribution on θ and a prior on r [26]: where a(·) and b(·) are also hyper-parameters.
s s
Wangetal.BMCGenomics2013,14(Suppl1):S11 Page5of9
http://www.biomedcentral.com/1471-2164/14/S1/S11
Thus,wehavetheconditionalprobabilityofXgivenR: the updated Xˆ . If the distance is less than a preset
threshold, our approach will stop the iteration. After the
P(X|R)=(cid:13)(cid:13)(α(αξ)ξ(cid:13),β(βξ)ξ)(cid:13)(αξ(cid:13)+(cα+Xξ)(cid:13)+(βcEξ−+ccE+X)+βξ)×(cid:13)(cid:13)(α(αζ)ζ(cid:13),β(βζ)ζ)(cid:13)(αζ+(cid:13)(cα−Xζ)(cid:13)+(βcBζ−+ccB−X)+βζ) convergence of HMRF, we obtain the estimations of X
and R, as well as the MAFs for every variant. The col-
and the posterior distribution of ξ given c+X is lapsed rare variants can be tested based on the existing
statistics, e.g. in [9,10,14].
π(ξ|c+)= ξαξ+c+X−1(1−ξ)βξ+M−c+X−1
X B(αξ +c+X,βξ +M−c+X) Experiments and results
Similarly, the posterior distribution of ζ given c− is In this section, we apply our approach on a real dataset
X from [30] and also compare it with three other
π(ζ|c−)= ζαζ+c−X−1(1−ζ)βζ+M−c−X−1 aTphperotharceheescoumsipnagridsoifnfearepnptrotaycpheessoafresimRaurleaCteodvedra,twasheictsh.
X B(αζ +c−X,βζ +M−c−X) is based on [19], RWAS from [14] and LRT from [9].
Additionally, it seems that RareCover is not released
Thus far, we have obtained all of the three transition
probabilities of this HMRF: p (X|R), π(ξ|c+) and online, so as in many previous works, we re-implement
X
π(ζ|c−). this algorithm and the related statistics by ourselves.
X
Simulation frameworks
Estimation the model parameters
As the simulation settings in different papers [9,14,19]
Based on the Gibbs-Markov Equivalence [21], a pseudo-
are quite different, we adopt all of them and generate
likelihood estimation cycle can be applied to this hidden
three types of simulated datasets. In the first one, each
MRF to estimate the model parameters and update the
dataset has a fixed number of causal variants, while in
hidden states. We use the pseudo-likelihood estimation
because p (X; F) andp (R; F ) are difficult to compute the second dataset, the number of causal variants is
R
determinedbyallelicpopulationattributablerisk(PAR).
directly. The algorithm involves the following four steps:
The last simulation method first generates elevated
(cid:129) Step 1: Estimate aθ and br with θˆ and ρˆ by maxi- regions and background regions and then plants causal
mizing the likelihood L(Y/Xˆ). Update θˆ by maxi- variantsineachregion.Wedescribethethreesimulation
s methodsinthefollowingsections.
mizing the posterior distribution:
Fix number of causal variants
π(θs|c+s)= θBsαθs(cid:2)+αc+sθ−s +1(c1+s,−βθθss)+βNθs+N−−cc+s+s−(cid:3)1 FsEaiarlcsvth,awrviaaerngitaesnn,teforialsltoegwethnineegrdapattreaedsveiitonsudwseiaptphepnfridxoeeandcthnlyeusmb[e1bc4ear]usasonefdtch[a9eu]y-.
Similarly, Update ρˆ . believe thatrare variantsdo not show significantlinkage
s
(cid:129) Step 2: Estimate aξ, bξ and aζ, bζ with ξˆ and ζˆ by disequilibrium[9,14].Foreachvariant,theprobabilitydis-
tributionoftheMAFofsitesoncontrols,r,satisfiesthe
(cid:16) s
maximizing the transition probability L(X/R). Wright’sdistributionunderpurifyingselection[4],
Update ξˆ and ζˆ by maximizing the transition prob-
abilities π(ξ|c+) and π(ξ|c+), respectively. f(ρs)∝(ρs)βs−1(1−ρs)βN−1eσ−ρsσ
X X
(cid:129) Step 3: Estimate F and FR with (cid:7)ˆ and (cid:7)ˆR by wheresistheselectioncoefficient,bSistheprobability
maximizing the pseudo-likelihood functions: that the normal allelic site mutates to the causal variant,
(cid:14) (cid:15)
(cid:4) (cid:5) (cid:10)M (cid:4) (cid:5) andbNistheprobabilitythatacausalvariantrepairstoa
L Xˆ;(cid:7) =exp ps Xˆs|Xˆn(s) ;(cid:7) normal variant. We take s = 12.0, bS = 0.001 and bN =
0.00033, which are the same settings used by [9,11,14].
S
and L(cid:4)Rˆ;(cid:7)R(cid:5). Then,therelativeriskofsis: RR= (1−δδ)ρs +1,whereδis
the marginal PAR. The marginal PAR is equal to the
(cid:129) Step 4: Update Xˆ and Rˆ by groupPAR(Δ)dividedbythenumberofcausal variants,
(cid:4) (cid:5) (cid:4) (cid:5) (cid:4) (cid:5)
P X|Y,Xˆ ∝f Y|X;θˆ,ρˆ p X|Xˆ ;(cid:7)ˆ whiletherelativeriskofMvariantsis1[14].Afterwards,
s S/s s s s s n(s) the MAF of s for the cases is calculated according to
(cid:4) (cid:5)
and P Rs|X,RˆS/s . θs = (RRR−R1×)ρρss+1.Ineachdataset,wesimulateN=2000gen-
otypeswithhalfcasesandhalfcontrols.Themutationson
There are several ways to exit from this iteration. We the cases and the controls are sampled independently
measure the Euclidean distance between the current and accordingtoθ andr,respectively.
s s
Wangetal.BMCGenomics2013,14(Suppl1):S11 Page6of9
http://www.biomedcentral.com/1471-2164/14/S1/S11
Causal variants depends on PAR an iteration to C until it reaches 50,000 iterations. The
The second way generates a set, C, that contains all of transitionprobabilityfromCtoAisequaltor×Pr.After
the causal variants. Instead of a fixed number, the total we have enough genotypes, we sample 1000 cases and
number of causal variants depends on PAR [19], which 1000controlsfromthem.
is limited by Δ (the group PAR):
(cid:18) (cid:19) Comparisons on power
(cid:18)≥1−(cid:17) 1− θsPr Similar to the measurements in [9,14], the power of an
P approach is measured by the number of significant data-
s∈C D
sets, among many datasets, using a significance thresh-
wherePrrepresentsthepenetranceofthegroupofcau- old of 2.5 × 10-6 based on the Bonferroni correction
salvariantsandP isthediseaseprevalenceinthepopula- assuming 20000 genes, genome-wide. We test at most
D
tion.Differentsettingsareappliedintheexperiments. 1000 datasets for each comparison experiment.
We use the algorithm proposed in [19] to obtain the Power versus different proportions of causal variants
MAF of each causal variant. The algorithm samples the We compare the powers under different sizes of total
MAFofacausalvariants,θ,fromtheWright’sdistribu- variants. In the first group of experiments, we include
s
tion with s = 30.0, b = 0.2 and b = 0.002 [4,19], and 50 causal variants and vary the total number of variants
S N
thenappendsstoC.Next,thealgorithmcheckswhether from 100 to 5000. Thus, the proportions of causal var-
(cid:17) (cid:4) (cid:5)
1− θsPr >1−(cid:18) is true. If the inequality does iants decrease from 50% to 1%. In the second group of
s∈C PD experiments, we hold the group PAR as 5% and vary the
nothold,thealgorithmterminatesandoutputsC.Thus, total number of variants as before. The results are com-
weobtainallofthecausalvariantsandtheirMAFs.Ifthe pared in Table 1. From the results, our approach clearly
inequality holds, then the algorithm continuously sam- shows more powerful and more robust at dealing with
ples the MAF of the next causal variant. The mutations large-scale data. We also test our approach on different
ongenotypesaresampledaccordingtoθs. settings of the group PARs. Those results can be found
For those non-causal variants, we use Fu’s model [29] in Table S1 in the Additional file 1.
of allelic distributions on a coalescent, which is the The Type I error rate is another important measure-
same used in [19]. We adopt ρs = 5N.0. The mutations on ment for estimating an approach. To compute the Type
genotypes are sampled according to rs. The phenotype I error rate, we apply the same technique as [19]. Type I
of each individual (genotype) is computed by the pene-
trance of the subset, Pr. Thereafter, we sample 1000 of
Table 1The powercomparisons atdifferent proportions
the cases and 1000 of the controls.
ofcausal variants
Causal variants depends on regions
Total Causal RareProb RareCover RWAS LRT
Thereare many waystogenerate a datasetwithregions.
100 50 100% 100% 100% 100%
The simplest way is to preset the elevated regions and
200 50 100% 100% 99.6% 99.9%
the background regions and to plant causal variants
400 50 100% 100% 85.3% 88.6%
based on certain probabilities. An alternate way creates
600 50 100% 94.6% 54.1% 58.8%
the regions by a Markov chain. For each site, there are
800 50 100% 0.0% 33.0% 36.5%
twogroupsofstates.TheEstatedenotesthatthevariant
1000 50 100% 0.0% 20.7% 22.0%
islocatedinanelevatedregion,whiletheBstatedenotes
2000 50 100% 0.0% 2.0% 2.0%
that the variant is located in a background region. Both
3000 50 100% 0.0% 0.8% 0.0%
states E and B can transfer to a causal state C or a non-
causal state C¯ . If the Markov chain travels to the C¯ 4000 50 100% 0.0% 0.4% 0.0%
5000 50 100% 0.0% 0.3% 0.0%
state,itplantsamutantonthegenotypewithprobability
r. If the variant is considered to be causal, it may con- 200 1* 51.0% 0.0% 0.0% 0.0%
400 3* 77.0% 0.0% 0.0% 0.0%
tinuously transfer to the state A, which means that the
600 2* 63.6% 0.0% 0.0% 0.0%
genotypecarriesamutantthat may affect thephenotype
with penetrance Pr. Otherwise it arrives in the state Ā, 800 3* 57.1% 0.0% 0.0% 0.0%
1000 3* 59.0% 0.0% 0.0% 0.0%
and the Markov chain plants a mutant or a wild-type
2000 1* 34.0% 0.0% 0.0% 0.0%
alleleonthegenotypeafterwards.
3000 2* 41.2% 0.0% 0.0% 0.0%
Togenerateenoughgenotypes,we performthefollow-
ingstepsforeachvariant:iftheprocessdropsinto C¯ ,we 4000 3* 40.0% 0.0% 0.0% 0.0%
take 50,000 iterations to yield a mutant, where r is 5000 2* 29.8% 0.0% 0.0% 0.0%
sampled from the Wright’s distribution with s = 30.0, Theuppersectionofthistableshowstheresultswithafixednumberof
bS=0.2andbN=0.002.IfitdropsintoAorĀ,wedesign “c*a”uisnadlivcaartieasnttsh.aTthtehecovlaulmuenis“Caanusaavle”rsahgoewvsatluhee.”numberofcausalvariants,and
Wangetal.BMCGenomics2013,14(Suppl1):S11 Page7of9
http://www.biomedcentral.com/1471-2164/14/S1/S11
error rate is defined as the probability of a non-causal Table 2The powercomparisons fordifferent
variant being selected in the potential causal set. We configurations ofregions
compare our approach only with RareCover because Total Causal Regions Length RareProb CorrectR
RWAS or LRT does not select any potential causal var- 1000 36* 1 50 100% 96%
iants. The results on different configurations can be 37* 2 50 100% 98%
found in supplementary documents. Based on the 36* 3 50 100% 97%
results, our approach always holds reasonable Type I 35* 4 50 100% 98%
error rates. Although on some configurations RareProb
2000 73* 1 100 100% 97%
has a little higher Type I error rates, e.g. 1%-10% higher
73* 2 100 100% 97%
when gourp PAR is 5%, than RareCover, the absolute
70* 3 100 100% 98%
values are still satisfied. Moreover, when the group PAR
71* 4 100 100% 96%
decreases, RareProb always performs lower Type I error
Total Causal Regions Length RareCover CorrectR
rates than RareCover. These results can be found in
1000 36* 1 50 0.0% 1.9%
TableS2intheAdditionalfile2.Consideringbothstatisti-
37* 2 50 0.0% 1.4%
calpowerandTypeIerrorrate,theadvantageofRareProb
36* 3 50 0.0% 1.7%
cannotbeneglected:itisabletoidentifymostofthecau-
35* 4 50 0.0% 1.6%
sal variants with an acceptable Type I error rate. In the
2000 73* 1 100 0.0% 0.7%
other words, if an approach rarely identifies correct var-
73* 2 100 0.0% 0.8%
iants,alowTypeIerrorratebecomesmeaningless.
70* 3 100 0.0% 1.3%
Power versus different configurations of regions
71* 4 100 0.0% 0.8%
We compare the powers on different configurations of
elevated regions and background regions and test the Thecolumn“Causal”representsthetotalnumberofcausalvariants,“Region”
denotesthetotalnumberofelevatedregions,“Length”indicatesthetotal
performance of our approach in identifying the regions. numberofvariantslocatinginelevatedregions.Thecolumn“CorrectR”
At each total variant number, we preset the number of showsthepercentageofcorrectidentificationofregions.
regions between 2 and 8, with half elevated regions and
halfbackgroundregions.Inthesedatasets,theprobability set of 2506 cases and 2235 controls, which is called
of a rare variant being causal is 0.1 if the variant is “bonafidecase-controlstudies”[9,14].
locatedinanelevatedregion;otherwise,theprobabilityis We apply RareProb to this dataset without any prior
0.001if variantislocatedina background region.In the information.RareProbidentifiesvariant#c.4424A>Gasa
last group of experiments, the regions are generated by causalvariantandreportsasignificantassociationwitha
the Markov chain, where the transition probability of p-valueof8.8817×10-16.Asacomparison,authorsin[30]
remaining in the same regions (keeps in elevated region reportsthattheydididentifyasignificantassociationwith
or background region) is 0.8, while the transition prob- the help of the prior information, but that they did not
ability of transitioning between different regions (jumps findasignificantassociationonlyaccordingtotheresults
from an elevated region to a background region, or ofCMC.Sulandothers[14]appliedRWASandreportsa
jumpsfromabackgroundregiontoanelevatedregion)is non-significantassociationwithp-valueof0.3946without
0.2. The emission probabilities are the same as before. priorinformationandanon-significantassociationwithp-
Wetestthepowersandrecordthepercentagesofcorrect valueof0.0078and0.0881whenpriorinformationofvar-
identifications on the regions. The results are listed in iants is obtained by Align-GVGD [22] and SIFT [23],
Table2.Theresultsshowthatourapproachsuccessfully respectively. Sul and others [9] also applied LRT and
estimates the regions, while RareCover suffers difficulty reportsthata non-significantassociation with p-value of
on identifying neither candidate causal variants nor 0.3934wasfoundwithoutpriorinformation,butasignifi-
region information. We also test our approach on total cantassociationwithp-valueof0.0058and0.08384were
variantsbeing3000,4000and5000.Theseresultscanbe found introducing Align-GVGD scores and SIFT scores,
foundinTableS3intheAdditionalfile3. respectively.Ourapproachsuccessfullyidentifiesanasso-
ciationandclearlypointsoutthecandidatecausalvariant,
RareProb on real mutation screening data withoutpriorinformation,whileeitherRWASorLRTcan-
Finally,weapplyourapproachtoarealmutationscreen- notachievethis.
ing dataset. This dataset has been previously published
by [30]. Authors screen for a susceptibility gene, ATM, Conclusion
whichis thought to associate with ataxiatelangiectasia. In this article, we propose a probabilistic method, Rare-
ATM is also an intermediate-risk susceptibility gene for Prob, to identify multiplerare variantsthatcontribute to
breast cancer [9,14]. The dataset (ATM_CCMSdata_- dichotomous disease susceptibility. Our approach is
Dec2011_v1) we have consists of 121 rare variants in a inspiredbyRareCover.Bothapproachesselectasubsetof
Wangetal.BMCGenomics2013,14(Suppl1):S11 Page8of9
http://www.biomedcentral.com/1471-2164/14/S1/S11
potentiallycausal variantsfromthegiven variants,which SeanTavtigianandProfessorGeorgiaChenevix-TrenchforsharingtheATM
meansourapproachdoesnotrelyonthepre-selectionof datasetswithus.AuthorsthankProfessorChun-XiaYanM.D.andM.D.Feng
Zhufordiscussingtheelevatedregionandthebackgroundregionfroma
candidaterarevariants.Furthermore,asopposedtosimply medicalandclinicalview.
merging the variants in RareCover, our approach gains
Authordetails
power by considering the directions and the magnitudes
1DepartmentofComputerScienceandTechnology,Xi’anJiaotong
of the genetic effects. Both the causal and the protective University,Xi’an,P.R.China.2ComputerScienceandEngineeringDepartment,
variants can be described by pair-wise measurements, UniversityofConnecticut,Storrs,Connecticut06269-2155,USA.
respectively.Thismethodgetsridoftheweaknessoflos-
Authors’contributions
ingstatisticalpowerwhen“causal”,“neutral”and“protec- JWandZMconductedthisresearch.JWdesignedalgorithmsand
tive” variants are combined. Note that the pair-wise experiments.ZC,AYandJZdevelopedthesoftwarepackagesand
weightisnotthelinkagedisequilibrium(LD).LDisquite participatedintheperformanceanalysisandtheexperimentsonthereal
dataset.JW,ZMandJZwrotethispaper.Allauthorshavereadand
difficult to observe, although it is expected among rare approvedthefinalmanuscript.
variants. The pair-wisemeasurementsindicatethe likeli-
Declarations
hood of two variants being collapsed, whichis similar to
ThepublicationcostsforthisarticlewerefundedbyXi’anJiaotong
thekernelfunctionsinregression-basedframeworks.This University.
weightisthenusedtobuilduptheneighborhoodsystem ThisarticlehasbeenpublishedaspartofBMCGenomicsVolume14
ofthehiddenMarkovrandomfieldmodel. Supplement1,2013:SelectedarticlesfromtheEleventhAsiaPacific
BioinformaticsConference(APBC2013):Genomics.Thefullcontentsofthe
TheMarkovrandomfieldmodeltreatsallofthevariants
supplementareavailableonlineathttp://www.biomedcentral.com/
asonevectorandestimatestheircausal/non-causalstatus bmcgenomics/supplements/14/S1.
bygloballymaximizingthelikelihoodofgenotypesinstead
Competinginterests
ofbylocaloptimization.Ourapproachgainsmorepower
Theauthorsdeclarethattheyhavenocompetinginterests.
thanexistinggroup-wisecollapsingapproaches;RareProb
filters out those variants with non-causal status. At the Published:21January2013
sametime,unlikethepreviousselection-basedapproaches,
References
RareProb controls the false positive rate by partitioning
1. HirschhornNJoel,DalyJMark:Genome-wideassociationstudiesfor
elevated regions and background regions, instead of by commondiseasesandcomplextraits.NatureReviewsGenetics2005,
presetting any sliding windows. Regions are much more 6:95-108.
2. RopersHans-Hilger:Newperspectivesfortheelucidationofgenetic
flexible than preset sliding windows. While existing
disorders.AmJHumGenet2007,81:199-207.
approachescanonlyhandlehundredsofvariants,thereis 3. ManolioATeri,CollinsSFrancis,CoxJNancy,etal:Findingthemissing
no doubt that the total number of variants will increase heritabilityofcomplexdiseases.Nature2009,461:747-753.
4. PritchardKJonathan:Arerarevariantsresponsibleforsusceptibilityto
rapidly with the development of new technologies, e.g.
complexdiseases?AmJHumGenet2001,69:124-137.
applications of next generation sequencing. The simula- 5. ReichEDavid,LanderSEric:Ontheallelicspectrumofhumandisease.
tionexperimentsshowthatourapproachobtainssignifi- TrendsinGenetics2001,17:502-510.
6. ChapmanHNicola,WijsmanMEllen:Genomescreensusinglinkage
cantly more power, especially when the total number of
disequilibriumtests:optimalmarkercharacteristicsandfeasibility.AmJ
givenrarevariantsislarge.Wealsoapplyourapproachto HumGenet1998,63:1872-1885.
arealmutationscreeningdatasetandasignificantassocia- 7. XiongMomiao,ZhaoJinying,BoerwinkleEric:GeneralizedT2Testfor
genomeassociationstudies.AmJHumGenet2002,70:1257-1268.
tionisfound.Ourapproachisabletohandlethousandsof
8. BodmerWalter,BonillaCarolina:Commonandrarevariantsinmultifactorial
variants. Moreover, our approachis easy toextend toan susceptibilitytocommondiseases.NatureGenetics2008,40:695-701.
“additive” genetic model and multiple phenotypes by 9. SulHJae,HanBuhm,EskinEleazar:Increasingpowerofgroupwise
associationtestwithlikelihoodratiotest.ProceedingsofRECOMB2011,
updatingtheDirichletpriordistribution.
2011:28-31.
10. MichaelCWu,SeunggeunLee,TianxiCai,etal:Rare-variantassociation
testingforsequencingdatawiththesequencekernelassociationtest.
Additional material
AmJHumGenet2011,89:82-93.
11. MadsenEBo,BrowningRSharon:Agroupwiseassociationtestforrare
Additionalfile1:TableS1.Thepowercomparisonsatdifferentlevels mutationsusingaweightedsumstatistic.PLoSGenetics2009,5:
ofPARanddifferentnumbersofcausalvariants. e1000384.
12. MorgenthalerStephan,ThillyGWilliam:Astrategytodiscovergenesthat
Additionalfile2:TableS2.Thepowercomparisonsfordifferent
carrymulti-allelicormono-allelicriskforcommondiseases:acohort
configurationsofcausalvariantsdependedonPARs.
allelicsumstest(CAST).MutationResearch2007,615:28-56.
Additionalfile3:TableS3.Thepowercomparisonsfordifferent 13. LiBingshan,LealMSuzanne:Methodsfordetectingassociationswithrare
configurationsofregions. variantsforcommondiseases:applicationtoanalysisofsequencedata.
AmJHumGenet2008,83:311-321.
14. SulHJae,HanBuhm,HeDan,EskinEleazar:Anoptimalweighted
aggregatedassociationtestforidentificationofrarevariantsinvolvedin
Acknowledgements commondiseases.Genetics2011,188:181-188.
ThisworkwassupportedbyNationalScienceFoundation[IIS-0803440], 15. NealeMBenjamin,RivasAManuel,VoightFBenjamin,etal:Testingforan
[CCF-1116175]and[IIS-0953563]andthePh.D.ProgramsFoundationof unusualdistributionofrarevariants.PLoSGenetics2011,7:e1001322.
MinistryofEducationofChina[20100201110063].AuthorsthankProfessor
Wangetal.BMCGenomics2013,14(Suppl1):S11 Page9of9
http://www.biomedcentral.com/1471-2164/14/S1/S11
16. CohenJonathan,PertsemlidisAlexander,KotowskiKIngrid,etal:LowLDL
cholesterolinindividualsofAfricandescentresultingfromfrequent
nonsensemutationsinPCSK9.NatureGenetics2005,37:161-165.
17. CohenCJonathan,BoerwinkleEric,MosleyHThomas,HobbsHHelen:
SequencevariationsinPCSK9,lowLDL,andprotectionagainstcoronary
heartdisease.TheNewEnglandJournalofMedicine2006,354:1264-1272.
18. KathiresanSekar,MelanderOlle,AnevskiDragi,etal:Polymorphisms
associatedwithcholesterolandriskofcardiovascularevents.TheNew
EnglandJournalofMedicine2008,358:1240-1249.
19. BhatiaGaurav,BansalVikas,HarismendyOlivier,etal:Acoveringmethod
fordetectinggeneticassociationsbetweenrarevariantsandcommon
phenotypes.PLoSComputationalBiology2010,6:e1000954.
20. PritchardKJonathan,CoxNJ:Theallelicarchitectureofhumandisease
genes:commondisease-commonvariant-ornot?HumanMolecular
Genetics2002,11:2417-2423.
21. BesagJulian:Onthestatisticalanalysisofdirtypictures.Journalofthe
RoyalStatisticalSocietySeriesB1986,48:259-302.
22. TavtigianVSean,DeffenbaughAM,YinL,etal:Comprehensivestatistical
studyof452BRCA1missensesubstitutionswithclassificationofeight
recurrentsubstitutionsasneutral.JournalofMedicalGenetics2006,
43:295-305.
23. PaulineCNg,StevenHenikoff:SIFT:predictingaminoacidchangesthat
affectproteinfunction.NucleicAcidsResearch2003,31:3812-3814.
24. KweeCoulterLydia,DaweiLiu,XihongLin,etal:Apowerfulandflexible
multilocusassociationtestforquantitativetraits.AmJHumGenet2008,
82:386-397.
25. MichaelCWu,KraftP,MichaelPEpstein,etal:PowerfulSNP-setanalysisfor
case-controlgenome-wideassociationstudies.AmJHumGenet2010,
86:929-942.
26. QuintanaAMelanie,BersteinLJonine,ThomasCDuncan,ContiVDavid:
Incorporatingmodeluncertaintyindetectingrarevariants:theBayesian
riskindex.GeneticEpidemiology2011,35:638-649.
27. ContiVDavid,JamesGaudermanW:SNPs,haplotypes,andmodel
selectioninacandidategeneregion:theSIMPleanalysisformultilocus
data.GeneticEpidemiology2004,27:429-441.
28. MelanieAWilson,EdwinSIversen,MerliseAClyde,etal:Bayesianmodel
searchandmultilevelinferenceforSNPassociationstudies.Annalsof
AppliedStatistics2010,4:1342-1364.
29. FuYun-Xin:Statisticalpropertiesofsegregatingsites.Theoretical
PopulationBiology1995,48:172-197.
30. SeanVTavtigian,PeterJOefner,DavitBabikyan,etal:Rare,evolutionarily
unlikelymissensesubstitutionsinATMconferincreasedriskofbreast
cancer.AmJHumGenet2009,85:427-446.
doi:10.1186/1471-2164-14-S1-S11
Citethisarticleas:Wangetal.:Aprobabilisticmethodforidentifying
rarevariantsunderlyingcomplextraits.BMCGenomics201314(Suppl1):
S11.
Submit your next manuscript to BioMed Central
and take full advantage of:
• Convenient online submission
• Thorough peer review
• No space constraints or color figure charges
• Immediate publication on acceptance
• Inclusion in PubMed, CAS, Scopus and Google Scholar
• Research which is freely available for redistribution
Submit your manuscript at
www.biomedcentral.com/submit