Table Of Content

Wangetal.BMCGenomics2013,14(Suppl1):S11 http://www.biomedcentral.com/1471-2164/14/S1/S11 PROCEEDINGS Open Access A probabilistic method for identifying rare variants underlying complex traits Jiayin Wang2*, Zhongmeng Zhao1*, Zhi Cao1, Aiyuan Yang1, Jin Zhang2 From The Eleventh Asia Pacific Bioinformatics Conference (APBC 2013) Vancouver, Canada. 21-24 January 2013 Abstract Background: Identifying the genetic variants that contribute to disease susceptibilities is important both for developing methodologies and for studying complex diseases in molecular biology. It has been demonstrated that the spectrum of minor allelic frequencies (MAFs) of risk genetic variants ranges from common to rare. Although association studies are shifting to incorporate rare variants (RVs) affecting complex traits, existing approaches do not show a high degree of success, and more efforts should be considered. Results: In this article, we focus on detecting associations between multiple rare variants and traits. Similar to RareCover, a widely used approach, we assume that variants located close to each other tend to have similar impacts on traits. Therefore, we introduce elevated regions and background regions, where the elevated regions are considered to have a higher chance of harboring causal variants. We propose a hidden Markov random field (HMRF) model to select a set of rare variants that potentially underlie the phenotype, and then, a statistical test is applied. Thus, the association analysis can be achieved without pre-selection by experts. In our model, each variant has two hidden states that represent the causal/non-causal status and the region status. In addition, two Bayesian processes are used to compare and estimate the genotype, phenotype and model parameters. We compare our approach to the three current methods using different types of datasets, and though these are simulation experiments, our approach has higher statistical power than the other methods. The software package, RareProb and the simulation datasets are available at: http://www.engr.uconn.edu/~jiw09003. Introduction have limited diagnostic value [2,3]. While the identifica- In most existing genetic variant association studies, tion of common variants creates a dilemma, known as “common trait, common variants”, which asserts that “common trait, rare variants”, an alternative hypothesis, common genetic variants contribute to most of traits which asserts that multiple rare variants with moderate (disease susceptibilities), serves as the central assump- to high penetrances may collectively influence disease tion. Researchers have successfully identified some sig- susceptibilities, has been suggested in some literatures nificant associations between common single nucleotide [3-5]. Rare variants are defined as those whose minor polymorphisms (SNPs) and disease traits [1]. However, allele frequencies (MAF) are less than or equal to 0.01 despite the enormous efforts expended on association (≤ 10-2). Although some rare variants associated with studies of complex traits, common genetic variants only Mendeliandiseaseshavebeenidentified,moreoften,the show a moderate influence on different phenotypes in allelicpopulationattributablerisk(PAR),whichdescribes many reported disease associations and consequently a small reduction in the incidence that would be observed in unexposed samples compared to the actual exposurepattern,islow.Theoddsratio(OR),ameasure *Correspondence:[email protected];[email protected] 1DepartmentofComputerScienceandTechnology,Xi’anJiaotong of the strength of association or non-independence University,Xi’an,P.R.China between two binary data values, is also low. Moreover, 2ComputerScienceandEngineeringDepartment,UniversityofConnecticut, basedonthe“commontrait,rarevariants”hypothesis,in Storrs,Connecticut06269-2155,USA Fulllistofauthorinformationisavailableattheendofthearticle ©2013Wangetal.;licenseeBioMedCentralLtd.ThisisanopenaccessarticledistributedunderthetermsoftheCreativeCommons AttributionLicense(http://creativecommons.org/licenses/by/2.0),whichpermitsunrestricteduse,distribution,andreproductionin anymedium,providedtheoriginalworkisproperlycited. Wangetal.BMCGenomics2013,14(Suppl1):S11 Page2of9 http://www.biomedcentral.com/1471-2164/14/S1/S11 manycases,asetofrarevariants,insteadofjustonevar- alsocomputethepair-wiselikelihoodofcandidatevariants iant, should be identified to fully explain the genetic beingcollapsedtogether.Notethatalthoughitisdifficult influence. Boththe single-variant test[6] and the multi- toobserve,relativelylowinteractions(e.g.,linkagedisequi- ple-varianttest[7]havebeenappliedtorarevariantasso- librium) are expected between rare variants [4,11,13,20]. ciation studies. However, due to the reasons outlined Furthermore, in regression-based association methods, above, neither of them shows satisfactory power in genetic similarities are often used to reduce the dimen- obtaining associations. Although more and more atten- sionsoftheregressionmodels.Therefore,we introduced tion is being focused upon rare variants, there has only two kindsofgenetic regions, the elevated regionandthe beenlimitedsuccessthusfar[8-10]. background region, in our model analysis; the elevated Alternatively, the collapsing strategy, also called the regionhasahigherprobabilityofharboringacausalvar- “burden-basedtest”,isanotherapproachforrarevariants iant. This assumption that the causal variants are often associationstudies.Mostofthecollapse-basedapproaches located close to each other is often used, e.g. slide win- build on the “recessive-set” genetic model, in which the dows in RareCover [19]. However, the regions are more predisposing haplotype contains mutation(s) in at least flexiblethanapresetslidewindow,asinRareCover. onevariant[11]. Multiplerarevariantsinthesamelocus We adopt the “dominant” and “recessive set” genetic arecollapsed,basedondifferentstandards,thenstatistical model, which are also used in [9-11,14,15,19]. In the tests are applied. The locus here is defined as a selected dominantandrecessive-setmodel,thepredisposinggen- region thatconsistsofa groupofcandidate rare variants otype harbors the mutation(s) in at least one variant on [9,12-14]. However, it is argued that existing collapse- any of the two haplotypes. Therefore, for one genotype, basedapproachesassumeallrarevariantsimplicitlyinflu- there are two possible allelic values at each variant: one encing thephenotypeinthesame directionandwith the denotes that both haplotypes carry a wide-type allele, same magnitude [10,15]. Researchershave observed that whiletheotherdenotesthatatleastonehaplotypecarries anygivenrarevariantcouldhavenoeffect,couldbecau- a mutant. In our method, each variant has two hidden sal, or couldbe protective for the endpoints (traits) [15]. states,causal/non-causalstatusandelevated/background For example, some low-frequency variants in African regionstatus.TheMRFincludesthehiddenstates,emis- AmericansPCSK9 canhaveasubstantialeffectonserum sionprobabilitiesandtransitionprobabilities.The emis- Low-Density Lipoprotein Cholesterol (LDL-C)by increas- sion probabilities bridge the hidden states and the ingtheriskoforprotectingagainstmyocardialinfarction genotypes,whilethe transitionprobabilitieslink thetwo [16-18]. hiddenstates.Followingthepseudo-likelihoodestimation Collapse-based approacheshave low statistical powers method [21], we infer the model parameters and all of when “causal”, “neutral” and “protective” variants are thehiddenstates.Thesimulationexperimentsshowthat combined [13,15,19]. To overcome this weakness, some our approach outperforms RareCover, RWAS [14] and approaches [9,14] assume that the rare variants are well LRT [9] on different parametric settings. In particular, selected by experts, while weighting of each variant is RareProbobtainsbetterresultsonlarge-scaledata. another widelyusedstrategy[9,11,14].Inarecentstudy, Bhatia and others [19] suggest the development of a Methods “model-free” approach, RareCover, that only collapses a Notions and model overview subsetofpotentiallycausalvariantsfromallofthegiven Suppose we are given M rare variants (allelic sites) on a variants.Here,the “model” refersto the geneticassocia- set of N genotypes. Let s denote the allelic value of the i tion model that consists of the pre-selection candidate site s on the genotype i (1 ≤ i ≤ N, 1 ≤ s ≤ M), where s i variants. = 0 means both haplotypes of i have the wild type allele, MotivatedbyRareCover,inthisarticle,wefocusonrare while s = 1 means at least one haplotype has a mutant i variant association analysis without any pre-selection of allele. Each genotype carries a dichotomous phenotype. candidate variants. We propose a probabilistic approach, Let vector P denotes the phenotypes, where P = 1 i RareProb,tomakeselectionsusingaMarkovrandomfield represents that i is affected by the phenotype trait (MRF) model and identify multiple causal rare variants (being a case), while P = 0 represents that i is a control. i that influence a dichotomousphenotype using statistical The core of our approach is a Markov random field tests.Ourapproachconsidersboththecausalandthepro- (MRF) model. We first introduce four key components tective variants, whichdistinguishesit fromthe previous of modeling this MRF: study RareCover,anditisthereforea robust predictorof the direction and the magnitude of the genetic effects. (cid:129) The observed data of this MRF consist of all of the Moreover, inspired by the weight-sum approaches genotypes and phenotypes. [9,11,14], we also weight each variant; however, we not (cid:129) There are two unknown states for each site: one is only considerthe likelihoodof avariantbeingcausal but the causal or non-causal status and the other is the Wangetal.BMCGenomics2013,14(Suppl1):S11 Page3of9 http://www.biomedcentral.com/1471-2164/14/S1/S11 region location status. Here, we define them as the and the second-order clique potentials because there is hidden states of this Markov random field. Let a scantyevidencesupportingthebiologicalormedicalsce- latent vector R represent the region status, where nario of high-order potentials. For each variable, the R = 1 denotes that the site s is located in an ele- MAFs and model parameters can be estimated by maxi- s vated region, while R = 0 denotes the s is located in mizing the likelihoodsof the genotypes. Then, the prob- s a background region. Additionally, let a latent vector ability of the variable and the variable itself can be X represent the causal/non-causal status, where X = updated by MAFs and model parameters. Two or three s 1 if the site s is causal (contributes to the pheno- iterationscanbeappliedifneededfortheconvergenceof type); otherwise, X = 0. Probabilistic functions are theMRF.Thus,ourapproachselectsasubsetofcandidate s designed to present the probabilities of each hidden causal variants by updating the variables and avoids the state. The RareProb framework is able to incorporate weakness of the same magnitude effect assumption prior information, obtained by different software becausetheneighborhoodsystemisabletodescribeboth tools, e.g. Align-GVGD [22] and SIFT [23], etc, by the“causal”and“protective”variants. updating initial X vector and R vector. (cid:129) A neighborhood system is required in the MRF Estimation of the hidden states in HMRF model to describe the interactions among hidden Neighborhood system states. Details of the hidden states and neighborhood Assume there are N/2 casesand N/2controlsamong all system are shown in the section “Estimation of the of the genotypes (if the number of cases is not equal to transition probabilities in HMRF”. thenumberofcontrols,thenalloftheresultsstillcanbe (cid:129) There are two kinds of probabilities in the MRF usedbyapplyingnoncentralityparameters).Atacertain model:emissionprobabilitiesandtransitionprobabil- variants,letθ denotetheMAFforthecases,andletthe s ities. Emission probabilities bridge the relationships number of genotypes in cases that carry at least one among genotypes, phenotypes and hidden states. mutant allele be c+. Let r denote the MAF for the con- s s Moreover,hiddenstatesXandRarenotindependent trols, and let the number of genotypes in controls that ofeachother,astherelationshipsbetweenthehidden − carryatleastonemutantallelebe c .Then,wecandraw s states are described by the transition probabilities. twobinomialdistributionsforthecasesandthecontrols The conditional probabilityP(X =1|R = 1) denotes (cid:2) (cid:3) (cid:2) (cid:3) s s [9,14]: c+ ∼Bin Nθ and c− ∼Bin N,ρ , where theprobabilitythatthesitesisacausalsitewhenitis s 2 s s 2 s located in an elevated region, while P(Xs = 0|Rs = 1) f(c+s|θs)=CcN+s θsc+s(1−θs)N2−c+s and f(c−s|ρs)=CcN−s ρsc−s(1−ρS)N2−c−s. denotes the probability that the site s is non-causal 2 2 when it is located in an elevated region. Similarly, Thus,forasites,thestatisticofthedifferencebetweenθ another two conditional probabilities, P(X = 1|R = andris s s (cid:4) (cid:5) 0) and P(X = 0|R = 0), present the probabilities of being causasl or nosn-causal if the site is located in a 2 θˆs−ρˆs backgroundregion.Detailsoftheemissionprobabil- zs = (cid:6) (cid:7)(cid:4) (cid:5)(cid:4) (cid:5) itiesareshowninthesection“Estimationoftheemis- N2 θˆs+ρˆs 2−θˆs−ρˆs tpshiroeonbtraapbnrilsoiitbtiieaosbnialpriretoiesbhsaobiwinlintHieinsMitnRheFH”sM,ecaRtniFod”n. t“hEesttimraantsioitnioonf where θˆs = Nc/+s2 is the estimation of θs and ρˆs = Nc−s/2 is the estimation of r. Similar to the linear kernel func- s tion, which calculates genetic similarities [10], we mea- The central thesis of our approach is that causal rare sure the likelihood between pairwise rare variants, which variants,whichshouldbecollapsedtogether,aretreatedas denotes how likely two variants would be collapsed one random vector variable with certain dimensions. Then,theprobabilityofthisbunchofcausalrarevariants together. For two variants s and s’, we define ωs,s’ as the likelihood of collapsing as follows: becomes the probability of one variable being associated wleintchet[h2e1p],htehneoptyrpoeb.aBbialsiteydoofnthtihseraMnadrokmovv-Gariibabbsleecqaunivbae- ωs,s(cid:3) = z22z+szzs(cid:3)2 s s(cid:3) decomposed into the sum of clique potentials. The first- ordercliquepotentialsdescribetheprobabilityofonevar- The ω function has the following properties: (1) When iantbeingcausal,whilethesecond-ordercliquepotentials both s and s’ are causal variants, due to the PAR, ωs,s’ measurethepair-wisegeneticsimilarities,whichsharethe locates in the interval (0, 1]. (2) If one variant is “causal” idea of the kernel machine in regression frameworks but the other is “protective”, the likelihood takes on a [10,24,25]. The neighborhood system in the MRF model negative value. (3) The likelihood encourages the col- consists of clique potentials. In our approach, we select lapse of the variants with similar PAR. Those rare var- thattheneighborhoodsystemonlycontainsthefirst-order iants whose MAFs increase rapidly in some cases, as we Wangetal.BMCGenomics2013,14(Suppl1):S11 Page4of9 http://www.biomedcentral.com/1471-2164/14/S1/S11 morepnatiior-nweidsebteefostrse,,wchouiclhdabreeiodfetnentifineodtbcyonssinidgelere-sditientceostls- π(ρs)= ρsαρs−B1((α1ρ−,βρρs))βρs−1 and π(ρs)= ρsαρs−B1((α1ρ−,βρρs))βρs−1, lapsing models [8]. Let ω.,. be the weight of two neigh- where a(·) ands b(s·) are hyper-parameters in tshesprior bors. The closer the statistics zs and zs’ are, the larger distributions [26-28]. Then, the marginal distribution the likelihood will be. And thus, the neighborhood sys- of c+ is s tem is built up. (cid:2) (cid:3) HRiniadradeebnvaacsrktiaagtnreotsusnids reeitghioern.loTchautesd, wine adnefienleevtahteedprroegbiaobniliotyr m(c+s)=CcN2+s (cid:13)(cid:13)(α(αθsθ)s(cid:13),β(βθsθ)s)(cid:13)(αθs(cid:13)+(cid:2)cα+sθ)s(cid:13)+βN2θs−+cN2+s(cid:3)+βθs (Bayesian classifier) of s as The marginal distribution of c− is similar. The prob- s ⎛ ⎞ ability of the observed genotypes on s is equal to the (cid:10) (cid:2) (cid:3) p(Xs|Xn(s))∝exp⎝γp(Xs|Rs)Xs+η ωs,s(cid:3)p(Xs(cid:3)|Rs(cid:3))Xs(cid:3)⎠ m c·s s(cid:3)∈n(s) sum of Cc·s Thus we have: N 2 wheren(s)denotestheneighborsofs.gandharetwo MXRaFffepcatrsamtheeteprrso.bgarbeiplirteyseonftsXh,owwhsilterohngrleyptrheesesnttastuhsoowf P(Ys|Xs=1)=(cid:13)(cid:13)(α(αθsθ)s(cid:13),β(βθsθ)s)(cid:13)(αθs(cid:13)+(cα+sθ)s(cid:13)+(βN2θs−+cN2+s)+βθs)×(cid:13)(cid:13)(α(αρsρ)s(cid:13),β(βρsρ)s)(cid:13)(αρs(cid:13)+(cα−sρ)s(cid:13)+(βN2ρs−+cN2−s)+βρs) s s strongly the neighbors of s affect the probability of Xs. On the other hand, if Xs = 0, then there is no PAR Here, we limit h >0, which encourages the pair-wise between θ and r that infers θ = r . Here, we simply s s s s weightsandprevents themfromcounteractingthenega- use r to denote the MAF of s for both the cases and s tiveweights.Thus,thejointprobabilityofthelatentvector controls. Thus, we have (cid:4) (cid:13) (cid:13) (cid:5) Xis p(X;(cid:7))∝exp γ Ms p(Xs|Rs)+η s(cid:3)∈n(s)ωs,s(cid:3)p(Xs(cid:3)|Rs(cid:3)) ,where (cid:13)(α,β) (cid:13)(α +c)(cid:13)(N−c +β) F = (g, h). As the variants in different subsets (different P(Ys|Xs =0)= (cid:13)(α)s(cid:13)(βs ) s(cid:13)(αs +β +Ns) s collapsinggroups)areconditionalindependent,thisjoint s s s s probability covers all of the probabilities of the random where c =c++c−. We have now obtained all the s s s variables(collapsinggroups).Similarly,theprobabilityofs three emission probabilities of this HMRF: p (Y|X), P locatedinanelevatedregioncanberepresentedby (Y|X = 0) and P (Y|X = 1). s s s s ⎛ ⎞ (cid:2) (cid:3) (cid:10) p Rs|Rn(s) ∝exp⎝τR+υ ωs,s(cid:3)Rs(cid:3)⎠ ETshtiemtartainosnitoiofnthperotrbaanbsiitliitoinesplrionbkatbhielithieisddinenHMstRatFesXand s(cid:3)∈n(s) R. Let c+ be the counts of the causal variants on all of X and the joint probability of(cid:4)latent vector R can b(cid:5)e theelevatedregions,andletcEbethenumberofvariants (cid:13) (cid:13) − represented by p(R;(cid:7)R)∝exp τ Ms R+υ s,s(cid:3)ωs,s(cid:3)Rs(cid:3) , in those regions. Let cX be the counts ofthe causal variants on all of the background regions, and c be the where F = (τ, υ)·τ and υ are two MRF parameters. We B R numberof variantsin those regions.Then, we draw two also limit υ >0, which encourages the pair-wise weights binomial distributions: c+ ∼Bin(c , ξ);c− ∼Bin(c , ζ) and prevents them from counteracting the negative X E X B where ξ = P (X = 1|R = 1) andζ = P (X = 1|R = 0). We weights. alsoplacethepriordistributionsonξandζ,asfollows: Estimation of the emission probabilities in HMRF WenowestimatetheemissionprobabilitiestorelateXand c+X ∼Bin(cE, ξ);f(c+X|ξ)=Ccc+XEξc+X(1−ξ)cE−c+X Rwiththeobserveddata.Aslinkagedisequilibriumisrarely and observedbetweenrarevariants[8],thevectorconsistsof tpheenadlelenlitcfvraolmuesthfreomothoneresv,awrihanenttahaptaisrtciocnudlaitrioXnailslygiinvdene-. c−X ∼Bin(cB, ζ);f(c−X|ζ)=Ccc−XBζc−X(1−ζ)cB−c−X Thus,thejointconditionalprobabilityofallofthegeno- where ξ = P (X = 1|R = 1) and ζ = P (X = 1|R = 0). typesis We also place the prior distributions on ξ and ζ, as (cid:14) (cid:15) follows: (cid:10)M p(Y|X)=exp p(Y|X) s s ξαξ−1(1−ξ)βξ−1 ζαζ−1(1−ζ)βζ−1 s=1 π(ξ)= ;π(ζ)= B(αξ,βξ) B(αζ,βζ) If X = 1, due to the PAR, θ ≠ r . We place a prior s s s distribution on θ and a prior on r [26]: where a(·) and b(·) are also hyper-parameters. s s Wangetal.BMCGenomics2013,14(Suppl1):S11 Page5of9 http://www.biomedcentral.com/1471-2164/14/S1/S11 Thus,wehavetheconditionalprobabilityofXgivenR: the updated Xˆ . If the distance is less than a preset threshold, our approach will stop the iteration. After the P(X|R)=(cid:13)(cid:13)(α(αξ)ξ(cid:13),β(βξ)ξ)(cid:13)(αξ(cid:13)+(cα+Xξ)(cid:13)+(βcEξ−+ccE+X)+βξ)×(cid:13)(cid:13)(α(αζ)ζ(cid:13),β(βζ)ζ)(cid:13)(αζ+(cid:13)(cα−Xζ)(cid:13)+(βcBζ−+ccB−X)+βζ) convergence of HMRF, we obtain the estimations of X and R, as well as the MAFs for every variant. The col- and the posterior distribution of ξ given c+X is lapsed rare variants can be tested based on the existing statistics, e.g. in [9,10,14]. π(ξ|c+)= ξαξ+c+X−1(1−ξ)βξ+M−c+X−1 X B(αξ +c+X,βξ +M−c+X) Experiments and results Similarly, the posterior distribution of ζ given c− is In this section, we apply our approach on a real dataset X from [30] and also compare it with three other π(ζ|c−)= ζαζ+c−X−1(1−ζ)βζ+M−c−X−1 aTphperotharceheescoumsipnagridsoifnfearepnptrotaycpheessoafresimRaurleaCteodvedra,twasheictsh. X B(αζ +c−X,βζ +M−c−X) is based on [19], RWAS from [14] and LRT from [9]. Additionally, it seems that RareCover is not released Thus far, we have obtained all of the three transition probabilities of this HMRF: p (X|R), π(ξ|c+) and online, so as in many previous works, we re-implement X π(ζ|c−). this algorithm and the related statistics by ourselves. X Simulation frameworks Estimation the model parameters As the simulation settings in different papers [9,14,19] Based on the Gibbs-Markov Equivalence [21], a pseudo- are quite different, we adopt all of them and generate likelihood estimation cycle can be applied to this hidden three types of simulated datasets. In the first one, each MRF to estimate the model parameters and update the dataset has a fixed number of causal variants, while in hidden states. We use the pseudo-likelihood estimation because p (X; F) andp (R; F ) are difficult to compute the second dataset, the number of causal variants is R determinedbyallelicpopulationattributablerisk(PAR). directly. The algorithm involves the following four steps: The last simulation method first generates elevated (cid:129) Step 1: Estimate aθ and br with θˆ and ρˆ by maxi- regions and background regions and then plants causal mizing the likelihood L(Y/Xˆ). Update θˆ by maxi- variantsineachregion.Wedescribethethreesimulation s methodsinthefollowingsections. mizing the posterior distribution: Fix number of causal variants π(θs|c+s)= θBsαθs(cid:2)+αc+sθ−s +1(c1+s,−βθθss)+βNθs+N−−cc+s+s−(cid:3)1 FsEaiarlcsvth,awrviaaerngitaesnn,teforialsltoegwethnineegrdapattreaedsveiitonsudwseiaptphepnfridxoeeandcthnlyeusmb[e1bc4ear]usasonefdtch[a9eu]y-. Similarly, Update ρˆ . believe thatrare variantsdo not show significantlinkage s (cid:129) Step 2: Estimate aξ, bξ and aζ, bζ with ξˆ and ζˆ by disequilibrium[9,14].Foreachvariant,theprobabilitydis- tributionoftheMAFofsitesoncontrols,r,satisfiesthe (cid:16) s maximizing the transition probability L(X/R). Wright’sdistributionunderpurifyingselection[4], Update ξˆ and ζˆ by maximizing the transition probabilities π(ξ|c+) and π(ξ|c+), respectively. f(ρs)∝(ρs)βs−1(1−ρs)βN−1eσ−ρsσ X X (cid:129) Step 3: Estimate F and FR with (cid:7)ˆ and (cid:7)ˆR by wheresistheselectioncoefficient,bSistheprobability maximizing the pseudo-likelihood functions: that the normal allelic site mutates to the causal variant, (cid:14) (cid:15) (cid:4) (cid:5) (cid:10)M (cid:4) (cid:5) andbNistheprobabilitythatacausalvariantrepairstoa L Xˆ;(cid:7) =exp ps Xˆs|Xˆn(s) ;(cid:7) normal variant. We take s = 12.0, bS = 0.001 and bN = 0.00033, which are the same settings used by [9,11,14]. S and L(cid:4)Rˆ;(cid:7)R(cid:5). Then,therelativeriskofsis: RR= (1−δδ)ρs +1,whereδis the marginal PAR. The marginal PAR is equal to the (cid:129) Step 4: Update Xˆ and Rˆ by groupPAR(Δ)dividedbythenumberofcausal variants, (cid:4) (cid:5) (cid:4) (cid:5) (cid:4) (cid:5) P X|Y,Xˆ ∝f Y|X;θˆ,ρˆ p X|Xˆ ;(cid:7)ˆ whiletherelativeriskofMvariantsis1[14].Afterwards, s S/s s s s s n(s) the MAF of s for the cases is calculated according to (cid:4) (cid:5) and P Rs|X,RˆS/s . θs = (RRR−R1×)ρρss+1.Ineachdataset,wesimulateN=2000gen- otypeswithhalfcasesandhalfcontrols.Themutationson There are several ways to exit from this iteration. We the cases and the controls are sampled independently measure the Euclidean distance between the current and accordingtoθ andr,respectively. s s Wangetal.BMCGenomics2013,14(Suppl1):S11 Page6of9 http://www.biomedcentral.com/1471-2164/14/S1/S11 Causal variants depends on PAR an iteration to C until it reaches 50,000 iterations. The The second way generates a set, C, that contains all of transitionprobabilityfromCtoAisequaltor×Pr.After the causal variants. Instead of a fixed number, the total we have enough genotypes, we sample 1000 cases and number of causal variants depends on PAR [19], which 1000controlsfromthem. is limited by Δ (the group PAR): (cid:18) (cid:19) Comparisons on power (cid:18)≥1−(cid:17) 1− θsPr Similar to the measurements in [9,14], the power of an P approach is measured by the number of significant data- s∈C D sets, among many datasets, using a significance thresh- wherePrrepresentsthepenetranceofthegroupofcau- old of 2.5 × 10-6 based on the Bonferroni correction salvariantsandP isthediseaseprevalenceinthepopula- assuming 20000 genes, genome-wide. We test at most D tion.Differentsettingsareappliedintheexperiments. 1000 datasets for each comparison experiment. We use the algorithm proposed in [19] to obtain the Power versus different proportions of causal variants MAF of each causal variant. The algorithm samples the We compare the powers under different sizes of total MAFofacausalvariants,θ,fromtheWright’sdistribu- variants. In the first group of experiments, we include s tion with s = 30.0, b = 0.2 and b = 0.002 [4,19], and 50 causal variants and vary the total number of variants S N thenappendsstoC.Next,thealgorithmcheckswhether from 100 to 5000. Thus, the proportions of causal var- (cid:17) (cid:4) (cid:5) 1− θsPr >1−(cid:18) is true. If the inequality does iants decrease from 50% to 1%. In the second group of s∈C PD experiments, we hold the group PAR as 5% and vary the nothold,thealgorithmterminatesandoutputsC.Thus, total number of variants as before. The results are com- weobtainallofthecausalvariantsandtheirMAFs.Ifthe pared in Table 1. From the results, our approach clearly inequality holds, then the algorithm continuously sam- shows more powerful and more robust at dealing with ples the MAF of the next causal variant. The mutations large-scale data. We also test our approach on different ongenotypesaresampledaccordingtoθs. settings of the group PARs. Those results can be found For those non-causal variants, we use Fu’s model [29] in Table S1 in the Additional file 1. of allelic distributions on a coalescent, which is the The Type I error rate is another important measure- same used in [19]. We adopt ρs = 5N.0. The mutations on ment for estimating an approach. To compute the Type genotypes are sampled according to rs. The phenotype I error rate, we apply the same technique as [19]. Type I of each individual (genotype) is computed by the penetrance of the subset, Pr. Thereafter, we sample 1000 of Table 1The powercomparisons atdifferent proportions the cases and 1000 of the controls. ofcausal variants Causal variants depends on regions Total Causal RareProb RareCover RWAS LRT Thereare many waystogenerate a datasetwithregions. 100 50 100% 100% 100% 100% The simplest way is to preset the elevated regions and 200 50 100% 100% 99.6% 99.9% the background regions and to plant causal variants 400 50 100% 100% 85.3% 88.6% based on certain probabilities. An alternate way creates 600 50 100% 94.6% 54.1% 58.8% the regions by a Markov chain. For each site, there are 800 50 100% 0.0% 33.0% 36.5% twogroupsofstates.TheEstatedenotesthatthevariant 1000 50 100% 0.0% 20.7% 22.0% islocatedinanelevatedregion,whiletheBstatedenotes 2000 50 100% 0.0% 2.0% 2.0% that the variant is located in a background region. Both 3000 50 100% 0.0% 0.8% 0.0% states E and B can transfer to a causal state C or a non- causal state C¯ . If the Markov chain travels to the C¯ 4000 50 100% 0.0% 0.4% 0.0% 5000 50 100% 0.0% 0.3% 0.0% state,itplantsamutantonthegenotypewithprobability r. If the variant is considered to be causal, it may con- 200 1* 51.0% 0.0% 0.0% 0.0% 400 3* 77.0% 0.0% 0.0% 0.0% tinuously transfer to the state A, which means that the 600 2* 63.6% 0.0% 0.0% 0.0% genotypecarriesamutantthat may affect thephenotype with penetrance Pr. Otherwise it arrives in the state Ā, 800 3* 57.1% 0.0% 0.0% 0.0% 1000 3* 59.0% 0.0% 0.0% 0.0% and the Markov chain plants a mutant or a wild-type 2000 1* 34.0% 0.0% 0.0% 0.0% alleleonthegenotypeafterwards. 3000 2* 41.2% 0.0% 0.0% 0.0% Togenerateenoughgenotypes,we performthefollow- ingstepsforeachvariant:iftheprocessdropsinto C¯ ,we 4000 3* 40.0% 0.0% 0.0% 0.0% take 50,000 iterations to yield a mutant, where r is 5000 2* 29.8% 0.0% 0.0% 0.0% sampled from the Wright’s distribution with s = 30.0, Theuppersectionofthistableshowstheresultswithafixednumberof bS=0.2andbN=0.002.IfitdropsintoAorĀ,wedesign “c*a”uisnadlivcaartieasnttsh.aTthtehecovlaulmuenis“Caanusaavle”rsahgoewvsatluhee.”numberofcausalvariants,and Wangetal.BMCGenomics2013,14(Suppl1):S11 Page7of9 http://www.biomedcentral.com/1471-2164/14/S1/S11 error rate is defined as the probability of a non-causal Table 2The powercomparisons fordifferent variant being selected in the potential causal set. We configurations ofregions compare our approach only with RareCover because Total Causal Regions Length RareProb CorrectR RWAS or LRT does not select any potential causal var- 1000 36* 1 50 100% 96% iants. The results on different configurations can be 37* 2 50 100% 98% found in supplementary documents. Based on the 36* 3 50 100% 97% results, our approach always holds reasonable Type I 35* 4 50 100% 98% error rates. Although on some configurations RareProb 2000 73* 1 100 100% 97% has a little higher Type I error rates, e.g. 1%-10% higher 73* 2 100 100% 97% when gourp PAR is 5%, than RareCover, the absolute 70* 3 100 100% 98% values are still satisfied. Moreover, when the group PAR 71* 4 100 100% 96% decreases, RareProb always performs lower Type I error Total Causal Regions Length RareCover CorrectR rates than RareCover. These results can be found in 1000 36* 1 50 0.0% 1.9% TableS2intheAdditionalfile2.Consideringbothstatisti- 37* 2 50 0.0% 1.4% calpowerandTypeIerrorrate,theadvantageofRareProb 36* 3 50 0.0% 1.7% cannotbeneglected:itisabletoidentifymostofthecau- 35* 4 50 0.0% 1.6% sal variants with an acceptable Type I error rate. In the 2000 73* 1 100 0.0% 0.7% other words, if an approach rarely identifies correct var- 73* 2 100 0.0% 0.8% iants,alowTypeIerrorratebecomesmeaningless. 70* 3 100 0.0% 1.3% Power versus different configurations of regions 71* 4 100 0.0% 0.8% We compare the powers on different configurations of elevated regions and background regions and test the Thecolumn“Causal”representsthetotalnumberofcausalvariants,“Region” denotesthetotalnumberofelevatedregions,“Length”indicatesthetotal performance of our approach in identifying the regions. numberofvariantslocatinginelevatedregions.Thecolumn“CorrectR” At each total variant number, we preset the number of showsthepercentageofcorrectidentificationofregions. regions between 2 and 8, with half elevated regions and halfbackgroundregions.Inthesedatasets,theprobability set of 2506 cases and 2235 controls, which is called of a rare variant being causal is 0.1 if the variant is “bonafidecase-controlstudies”[9,14]. locatedinanelevatedregion;otherwise,theprobabilityis We apply RareProb to this dataset without any prior 0.001if variantislocatedina background region.In the information.RareProbidentifiesvariant#c.4424A>Gasa last group of experiments, the regions are generated by causalvariantandreportsasignificantassociationwitha the Markov chain, where the transition probability of p-valueof8.8817×10-16.Asacomparison,authorsin[30] remaining in the same regions (keeps in elevated region reportsthattheydididentifyasignificantassociationwith or background region) is 0.8, while the transition prob- the help of the prior information, but that they did not ability of transitioning between different regions (jumps findasignificantassociationonlyaccordingtotheresults from an elevated region to a background region, or ofCMC.Sulandothers[14]appliedRWASandreportsa jumpsfromabackgroundregiontoanelevatedregion)is non-significantassociationwithp-valueof0.3946without 0.2. The emission probabilities are the same as before. priorinformationandanon-significantassociationwithp- Wetestthepowersandrecordthepercentagesofcorrect valueof0.0078and0.0881whenpriorinformationofvar- identifications on the regions. The results are listed in iants is obtained by Align-GVGD [22] and SIFT [23], Table2.Theresultsshowthatourapproachsuccessfully respectively. Sul and others [9] also applied LRT and estimates the regions, while RareCover suffers difficulty reportsthata non-significantassociation with p-value of on identifying neither candidate causal variants nor 0.3934wasfoundwithoutpriorinformation,butasignifi- region information. We also test our approach on total cantassociationwithp-valueof0.0058and0.08384were variantsbeing3000,4000and5000.Theseresultscanbe found introducing Align-GVGD scores and SIFT scores, foundinTableS3intheAdditionalfile3. respectively.Ourapproachsuccessfullyidentifiesanasso- ciationandclearlypointsoutthecandidatecausalvariant, RareProb on real mutation screening data withoutpriorinformation,whileeitherRWASorLRTcan- Finally,weapplyourapproachtoarealmutationscreen- notachievethis. ing dataset. This dataset has been previously published by [30]. Authors screen for a susceptibility gene, ATM, Conclusion whichis thought to associate with ataxiatelangiectasia. In this article, we propose a probabilistic method, Rare- ATM is also an intermediate-risk susceptibility gene for Prob, to identify multiplerare variantsthatcontribute to breast cancer [9,14]. The dataset (ATM_CCMSdata_- dichotomous disease susceptibility. Our approach is Dec2011_v1) we have consists of 121 rare variants in a inspiredbyRareCover.Bothapproachesselectasubsetof Wangetal.BMCGenomics2013,14(Suppl1):S11 Page8of9 http://www.biomedcentral.com/1471-2164/14/S1/S11 potentiallycausal variantsfromthegiven variants,which SeanTavtigianandProfessorGeorgiaChenevix-TrenchforsharingtheATM meansourapproachdoesnotrelyonthepre-selectionof datasetswithus.AuthorsthankProfessorChun-XiaYanM.D.andM.D.Feng Zhufordiscussingtheelevatedregionandthebackgroundregionfroma candidaterarevariants.Furthermore,asopposedtosimply medicalandclinicalview. merging the variants in RareCover, our approach gains Authordetails power by considering the directions and the magnitudes 1DepartmentofComputerScienceandTechnology,Xi’anJiaotong of the genetic effects. Both the causal and the protective University,Xi’an,P.R.China.2ComputerScienceandEngineeringDepartment, variants can be described by pair-wise measurements, UniversityofConnecticut,Storrs,Connecticut06269-2155,USA. respectively.Thismethodgetsridoftheweaknessoflos- Authors’contributions ingstatisticalpowerwhen“causal”,“neutral”and“protec- JWandZMconductedthisresearch.JWdesignedalgorithmsand tive” variants are combined. Note that the pair-wise experiments.ZC,AYandJZdevelopedthesoftwarepackagesand weightisnotthelinkagedisequilibrium(LD).LDisquite participatedintheperformanceanalysisandtheexperimentsonthereal dataset.JW,ZMandJZwrotethispaper.Allauthorshavereadand difficult to observe, although it is expected among rare approvedthefinalmanuscript. variants. The pair-wisemeasurementsindicatethe likeli- Declarations hood of two variants being collapsed, whichis similar to ThepublicationcostsforthisarticlewerefundedbyXi’anJiaotong thekernelfunctionsinregression-basedframeworks.This University. weightisthenusedtobuilduptheneighborhoodsystem ThisarticlehasbeenpublishedaspartofBMCGenomicsVolume14 ofthehiddenMarkovrandomfieldmodel. Supplement1,2013:SelectedarticlesfromtheEleventhAsiaPacific BioinformaticsConference(APBC2013):Genomics.Thefullcontentsofthe TheMarkovrandomfieldmodeltreatsallofthevariants supplementareavailableonlineathttp://www.biomedcentral.com/ asonevectorandestimatestheircausal/non-causalstatus bmcgenomics/supplements/14/S1. bygloballymaximizingthelikelihoodofgenotypesinstead Competinginterests ofbylocaloptimization.Ourapproachgainsmorepower Theauthorsdeclarethattheyhavenocompetinginterests. thanexistinggroup-wisecollapsingapproaches;RareProb filters out those variants with non-causal status. At the Published:21January2013 sametime,unlikethepreviousselection-basedapproaches, References RareProb controls the false positive rate by partitioning 1. HirschhornNJoel,DalyJMark:Genome-wideassociationstudiesfor elevated regions and background regions, instead of by commondiseasesandcomplextraits.NatureReviewsGenetics2005, presetting any sliding windows. Regions are much more 6:95-108. 2. RopersHans-Hilger:Newperspectivesfortheelucidationofgenetic flexible than preset sliding windows. While existing disorders.AmJHumGenet2007,81:199-207. approachescanonlyhandlehundredsofvariants,thereis 3. ManolioATeri,CollinsSFrancis,CoxJNancy,etal:Findingthemissing no doubt that the total number of variants will increase heritabilityofcomplexdiseases.Nature2009,461:747-753. 4. PritchardKJonathan:Arerarevariantsresponsibleforsusceptibilityto rapidly with the development of new technologies, e.g. complexdiseases?AmJHumGenet2001,69:124-137. applications of next generation sequencing. The simula- 5. ReichEDavid,LanderSEric:Ontheallelicspectrumofhumandisease. tionexperimentsshowthatourapproachobtainssignifi- TrendsinGenetics2001,17:502-510. 6. ChapmanHNicola,WijsmanMEllen:Genomescreensusinglinkage cantly more power, especially when the total number of disequilibriumtests:optimalmarkercharacteristicsandfeasibility.AmJ givenrarevariantsislarge.Wealsoapplyourapproachto HumGenet1998,63:1872-1885. arealmutationscreeningdatasetandasignificantassocia- 7. XiongMomiao,ZhaoJinying,BoerwinkleEric:GeneralizedT2Testfor genomeassociationstudies.AmJHumGenet2002,70:1257-1268. tionisfound.Ourapproachisabletohandlethousandsof 8. BodmerWalter,BonillaCarolina:Commonandrarevariantsinmultifactorial variants. Moreover, our approachis easy toextend toan susceptibilitytocommondiseases.NatureGenetics2008,40:695-701. “additive” genetic model and multiple phenotypes by 9. SulHJae,HanBuhm,EskinEleazar:Increasingpowerofgroupwise associationtestwithlikelihoodratiotest.ProceedingsofRECOMB2011, updatingtheDirichletpriordistribution. 2011:28-31. 10. MichaelCWu,SeunggeunLee,TianxiCai,etal:Rare-variantassociation testingforsequencingdatawiththesequencekernelassociationtest. Additional material AmJHumGenet2011,89:82-93. 11. MadsenEBo,BrowningRSharon:Agroupwiseassociationtestforrare Additionalfile1:TableS1.Thepowercomparisonsatdifferentlevels mutationsusingaweightedsumstatistic.PLoSGenetics2009,5: ofPARanddifferentnumbersofcausalvariants. e1000384. 12. MorgenthalerStephan,ThillyGWilliam:Astrategytodiscovergenesthat Additionalfile2:TableS2.Thepowercomparisonsfordifferent carrymulti-allelicormono-allelicriskforcommondiseases:acohort configurationsofcausalvariantsdependedonPARs. allelicsumstest(CAST).MutationResearch2007,615:28-56. Additionalfile3:TableS3.Thepowercomparisonsfordifferent 13. LiBingshan,LealMSuzanne:Methodsfordetectingassociationswithrare configurationsofregions. variantsforcommondiseases:applicationtoanalysisofsequencedata. AmJHumGenet2008,83:311-321. 14. SulHJae,HanBuhm,HeDan,EskinEleazar:Anoptimalweighted aggregatedassociationtestforidentificationofrarevariantsinvolvedin Acknowledgements commondiseases.Genetics2011,188:181-188. ThisworkwassupportedbyNationalScienceFoundation[IIS-0803440], 15. NealeMBenjamin,RivasAManuel,VoightFBenjamin,etal:Testingforan [CCF-1116175]and[IIS-0953563]andthePh.D.ProgramsFoundationof unusualdistributionofrarevariants.PLoSGenetics2011,7:e1001322. MinistryofEducationofChina[20100201110063].AuthorsthankProfessor Wangetal.BMCGenomics2013,14(Suppl1):S11 Page9of9 http://www.biomedcentral.com/1471-2164/14/S1/S11 16. CohenJonathan,PertsemlidisAlexander,KotowskiKIngrid,etal:LowLDL cholesterolinindividualsofAfricandescentresultingfromfrequent nonsensemutationsinPCSK9.NatureGenetics2005,37:161-165. 17. CohenCJonathan,BoerwinkleEric,MosleyHThomas,HobbsHHelen: SequencevariationsinPCSK9,lowLDL,andprotectionagainstcoronary heartdisease.TheNewEnglandJournalofMedicine2006,354:1264-1272. 18. KathiresanSekar,MelanderOlle,AnevskiDragi,etal:Polymorphisms associatedwithcholesterolandriskofcardiovascularevents.TheNew EnglandJournalofMedicine2008,358:1240-1249. 19. BhatiaGaurav,BansalVikas,HarismendyOlivier,etal:Acoveringmethod fordetectinggeneticassociationsbetweenrarevariantsandcommon phenotypes.PLoSComputationalBiology2010,6:e1000954. 20. PritchardKJonathan,CoxNJ:Theallelicarchitectureofhumandisease genes:commondisease-commonvariant-ornot?HumanMolecular Genetics2002,11:2417-2423. 21. BesagJulian:Onthestatisticalanalysisofdirtypictures.Journalofthe RoyalStatisticalSocietySeriesB1986,48:259-302. 22. TavtigianVSean,DeffenbaughAM,YinL,etal:Comprehensivestatistical studyof452BRCA1missensesubstitutionswithclassificationofeight recurrentsubstitutionsasneutral.JournalofMedicalGenetics2006, 43:295-305. 23. PaulineCNg,StevenHenikoff:SIFT:predictingaminoacidchangesthat affectproteinfunction.NucleicAcidsResearch2003,31:3812-3814. 24. KweeCoulterLydia,DaweiLiu,XihongLin,etal:Apowerfulandflexible multilocusassociationtestforquantitativetraits.AmJHumGenet2008, 82:386-397. 25. MichaelCWu,KraftP,MichaelPEpstein,etal:PowerfulSNP-setanalysisfor case-controlgenome-wideassociationstudies.AmJHumGenet2010, 86:929-942. 26. QuintanaAMelanie,BersteinLJonine,ThomasCDuncan,ContiVDavid: Incorporatingmodeluncertaintyindetectingrarevariants:theBayesian riskindex.GeneticEpidemiology2011,35:638-649. 27. ContiVDavid,JamesGaudermanW:SNPs,haplotypes,andmodel selectioninacandidategeneregion:theSIMPleanalysisformultilocus data.GeneticEpidemiology2004,27:429-441. 28. MelanieAWilson,EdwinSIversen,MerliseAClyde,etal:Bayesianmodel searchandmultilevelinferenceforSNPassociationstudies.Annalsof AppliedStatistics2010,4:1342-1364. 29. FuYun-Xin:Statisticalpropertiesofsegregatingsites.Theoretical PopulationBiology1995,48:172-197. 30. SeanVTavtigian,PeterJOefner,DavitBabikyan,etal:Rare,evolutionarily unlikelymissensesubstitutionsinATMconferincreasedriskofbreast cancer.AmJHumGenet2009,85:427-446. doi:10.1186/1471-2164-14-S1-S11 Citethisarticleas:Wangetal.:Aprobabilisticmethodforidentifying rarevariantsunderlyingcomplextraits.BMCGenomics201314(Suppl1): S11. Submit your next manuscript to BioMed Central and take full advantage of: • Convenient online submission • Thorough peer review • No space constraints or color ﬁgure charges • Immediate publication on acceptance • Inclusion in PubMed, CAS, Scopus and Google Scholar • Research which is freely available for redistribution Submit your manuscript at www.biomedcentral.com/submit

A probabilistic method for identifying rare variants underlying complex traits. PDF

0.33 MB·English

by Wang, Jiayin

#journals #pubmed

Checking for file health...

Save to my drive

Quick download

Download

Download A probabilistic method for identifying rare variants underlying complex traits. PDF Free - Full Version

by Wang, Jiayin| 0.33| English

Download A probabilistic method for identifying rare variants underlying complex traits. by Wang, Jiayin in PDF format completely FREE. No registration required, no payment needed. Get instant access to this valuable resource on PDFdrive.to!

Free Download PDF

About A probabilistic method for identifying rare variants underlying complex traits.

No description available for this book.

Detailed Information

Author:	Wang, Jiayin
Language:	English
File Size:	0.33
Format:	PDF
Price:	FREE

Download Free PDF

Safe & Secure Download - No registration required

Why Choose PDFdrive for Your Free A probabilistic method for identifying rare variants underlying complex traits. Download?

100% Free: No hidden fees or subscriptions required for one book every day.
No Registration: Immediate access is available without creating accounts for one book every day.
Safe and Secure: Clean downloads without malware or viruses
Multiple Formats: PDF, MOBI, Mpub,... optimized for all devices
Educational Resource: Supporting knowledge sharing and learning

Frequently Asked Questions

Is it really free to download A probabilistic method for identifying rare variants underlying complex traits. PDF?

Yes, on https://PDFdrive.to you can download A probabilistic method for identifying rare variants underlying complex traits. by Wang, Jiayin completely free. We don't require any payment, subscription, or registration to access this PDF file. For 3 books every day.

How can I read A probabilistic method for identifying rare variants underlying complex traits. on my mobile device?

After downloading A probabilistic method for identifying rare variants underlying complex traits. PDF, you can open it with any PDF reader app on your phone or tablet. We recommend using Adobe Acrobat Reader, Apple Books, or Google Play Books for the best reading experience.

Is this the full version of A probabilistic method for identifying rare variants underlying complex traits.?

Yes, this is the complete PDF version of A probabilistic method for identifying rare variants underlying complex traits. by Wang, Jiayin. You will be able to read the entire content as in the printed version without missing any pages.

Is it legal to download A probabilistic method for identifying rare variants underlying complex traits. PDF for free?

https://PDFdrive.to provides links to free educational resources available online. We do not store any files on our servers. Please be aware of copyright laws in your country before downloading.

The materials shared are intended for research, educational, and personal use in accordance with fair use principles.