Table Of Content

ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition Koslicki, D., Chatterjee, S., Shahrivar, D., Walker, A. W., Francis, S. C., Fraser, L. J., ... & Corander, J. (2015). ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition. PLoS ONE, 10(10), e0140644. doi:10.1371/journal.pone.0140644 10.1371/journal.pone.0140644 Public Library of Science Version of Record http://cdss.library.oregonstate.edu/sa-termsofuse RESEARCHARTICLE ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition DavidKoslicki1,SaikatChatterjee2*,DamonShahrivar2,AlanW.Walker3,Suzanna C.Francis4,LouiseJ.Fraser5,MikkoVehkaperä6,YuehengLan7,JukkaCorander8 1DeptofMathematics,OregonStateUniversity,Corvallis,UnitedStatesofAmerica,2Deptof CommunicationTheory,KTHRoyalInstituteofTechnology,Stockholm,Sweden,3MicrobiologyGroup, RowettInstituteofNutritionandHealth,UniversityofAberdeen,Aberdeen,UnitedKingdom,4MRCTropical a11111 EpidemiologyGroup,LondonSchoolofHygieneandTropicalMedicine,London,UnitedKingdom,5Illumina CambridgeLtd.,ChesterfordResearchPark,Essex,UnitedKingdom,6DeptofElectronicandElectrical Engineering,UniversityofSheffield,Sheffield,UnitedKingdom,7DeptofPhysics,TsinghuaUniversity, Beijing,China,8DeptofMathematicsandStatistics,UniversityofHelsinki,Helsinki,Finland *[email protected] OPENACCESS Abstract Citation:KoslickiD,ChatterjeeS,ShahrivarD, WalkerAW,FrancisSC,FraserLJ,etal.(2015)ARK: AggregationofReadsbyK-MeansforEstimationof BacterialCommunityComposition.PLoSONE10 Motivation (10):e0140644.doi:10.1371/journal.pone.0140644 Estimationofbacterialcommunitycompositionfromhigh-throughputsequenced16SrRNA Editor:JonathanH.Badger,NationalCancer geneampliconsisakeytaskinmicrobialecology.Sincethesequencedatafromeachsam- Institute,UNITEDSTATES pletypicallyconsistofalargenumberofreadsandareadverselyimpactedbydifferentlev- Received:April20,2015 elsofbiologicalandtechnicalnoise,accurateanalysisofsuchlargedatasetsis Accepted:September28,2015 challenging. Published:October23,2015 Results Copyright:Thisisanopenaccessarticle,freeofall copyright,andmaybefreelyreproduced,distributed, Therehasbeenarecentsurgeofinterestinusingcompressedsensinginspiredandcon- transmitted,modified,builtupon,orotherwiseused vex-optimizationbasedmethodstosolvetheestimationproblemforbacterialcommunity byanyoneforanylawfulpurpose.Theworkismade composition.Thesemethodstypicallyrelyonsummarizingthesequencedatabyfrequen- availableundertheCreativeCommonsCC0public domaindedication. ciesoflow-orderk-mersandmatchingthisinformationstatisticallywithataxonomically structureddatabase.Hereweshowthattheaccuracyoftheresultingcommunitycomposi- DataAvailabilityStatement:Thedataand programsareavailablefromGitHub(https://github. tionestimatescanbesubstantiallyimprovedbyaggregatingthereadsfromasamplewith com/dkoslicki/ARK)orfromhere(http://www.ee.kth. anunsupervisedmachinelearningapproachpriortotheestimationphase.Theaggregation se/ctsoftware).Furtherrealbiologicaldatathatwe ofreadsisapre-processingapproachwhereweuseastandardK-meansclusteringalgo- usedforexperimenthavebeensubmittedtothe EuropeanNucleotideArchiveusingtheaccession rithmthatpartitionsalargesetofreadsintosubsetswithreasonablecomputationalcostto numberPRJEB9828. provideseveralvectorsoffirstorderstatisticsinsteadofonlysinglestatisticalsummariza- Funding:ThisworkwassupportedbytheSwedish tionintermsofk-merfrequencies.Theoutputoftheclusteringisthenprocessedfurtherto ResearchCouncilLinnaeusCentreACCESS(S.C.), obtainthefinalestimateforeachsample.TheresultingmethodiscalledAggregationof ERCgrant239784(J.C.),theAcademyofFinland ReadsbyK-means(ARK),anditisbasedonastatisticalargumentviamixturedensityfor- CenterofExcellenceCOIN(J.C.),theAcademyof Finland(M.V.),theScottishGovernment’sRuraland mulation.ARKisfoundtoimprovethefidelityandrobustnessofseveralrecentlyintroduced EnvironmentScienceandAnalyticalServices methods,withonlyamodestincreaseincomputationalcomplexity. PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 1/16 ARK Division(RESAS)(A.W.W),andtheUKMRC/DFID Availability grantG1002369(S.C.F).L.J.F.receivedfundingin Anopensource,platform-independentimplementationofthemethodintheJuliaprogram- theformofsalaryfromIlluminaCambridgeLtd.The fundershadnoroleinstudydesign,datacollection minglanguageisfreelyavailableathttps://github.com/dkoslicki/ARK.AMatlabimplementa- andanalysis,decisiontopublish,orpreparationof tionisavailableathttp://www.ee.kth.se/ctsoftware. themanuscript. CompetingInterests:L.J.F.receivedfundinginthe formofsalaryfromIlluminaCambridgeLtd.Thisdoes notaltertheauthors’adherencetoallthePLOSONE policiesonsharingdataandmaterials. Introduction Theadventofhigh-throughputsequencingtechnologieshasenableddetectionofbacterial communitycompositionatanunprecedentedlevelofdetail.Atechnologicalapproachisto produceforeachsamplealargenumberofreadsfromampliconsofthe16SrRNAgene,which enablesanidentificationandcomparisonoftherelativefrequenciesofdifferenttaxonomic unitspresentacrosssamples.Therapidlyincreasingnumberofreadsproducedpersample resultsintheneedforfasttaxonomicclassificationofsamples.Thisproblemhasattractedcon- siderablerecentattention[1–5]. Manyexistingapproachestothebacterialcommunitycompositionestimationproblemuse 16SrRNAgeneampliconsequencingwherealargeamountofmoderatelengthreads(around 250–500bp)areproducedfromeachsampleandthengenerallyeitherclusteredorclassifiedto obtainacompositionestimateoftaxonomicunits.Intheclusteringapproach,readsare groupedintotaxonomicunitsbyeitherdistance-basedorprobabilisticmethods[6–8],such thattheactualtaxonomiclabelsareassignedtotheclustersafterwardsbymatchingtheircon- sensussequencestoareferencedatabase.Incontrasttotheclusteringmethods,theclassifica- tionapproachisbasedonusingareferencedatabasedirectlytoassignreadstomeaningful biologicalunits.Methodsfortheclassificationofreadshavebeenbasedeitheronhomology usingsequencesimilarity,orongenomicsignaturesintermsofk-mercomposition.Examples ofhomology-basedmethodsincludeMEGAN[9,10]andphylogeneticanalysis[11].Another popularapproachistouseaBayesianclassifier[1,12,13].Onesuchmethod,theRibosomal DatabaseProject’s(RDP)naïveBayesianclassifier(NBC)[1],assignsalabelexplicitlytoeach readproducedforaparticularsample.DespitethemethodologicalsimplicityofNBC,theRDP classifiermaystillrequireseveraldaystoprocessalargedatasetinadesktopenvironmentdue totheread-by-readclassificationapproach.Giventhischallenge,considerablyfasterestimation methodsbasedonmixturesofk-mercountshavebeendeveloped,forexample,Taxy[2], Quikr[3]andtherecentlyproposedSEK[14].Taxyisaconvex-optimizationbasedmethod. SEKandQuikraresparsesignalprocessingbasedmethods(inspiredbycompressedsensing andconvex-optimization),andSEKwasshowntoperformbetterthanQuikrandTaxyin[14]. Taxy,QuikrandSEKalluseastheirmaininputa(statistical)meanvectorofsamplek-mer countscomputedfromthereadsobtainedforasample.Thek-mercounts(alsocalledk-mers) arefeaturevectorsextractedfromrawsequencedata.Thenecessarymodelingassumptionis thatthesamplemeanvectorofk-mercounts(thatmeansfirstorderstatistics)issufficiently informativeaboutthesamplecomposition.Thesethreemethodsdonotusethereadsinany additionalwayoncethemeanvectorofk-mersiscomputed.Weproposehereanalternative basisofinformationaggregationthatremainscomputationallytractabletoallowprocessingof largesetsofreads.Borrowingideasfromsourcecodinginsignalprocessing[15,16],clustering inmachinelearningandsourcecoding[17],fusioninsignalestimation[18]anddivide-and- conquerbasedshotgunsequenceassembly[19],ournovelapproachfirstsegregatesthefullset ofreadsintosubsets(inthek-mersfeaturespace),computesthemeanvectorforeachsubset, employsastandardmethod(suchasTaxy,QuikrorSEK)toestimatecompositionforeach PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 2/16 ARK subset,andfinallyfusestheseestimatesintoacompositionestimatejointlyforallthereads.To segregatethereadsintosubsets,wechoosetoemploytheK-meansclusteringalgorithm[20]. SincetheK-meansclusteringalgorithmissimpleandcomputationallyinexpensiveforarea- sonablenumberQofclusters(subsets),itcanbeusedtopartitionevenfairlylargesetsofreads intomore(intra)homogeneoussubsets.Byitsveryalgorithmicnature,K-meansclustering partitionsthefeaturespaceintoQnon-overlappingregionsandprovidesasetofcorrespond- ingmeanvectors.Thisiscalledcodebookgenerationinvectorquantization[15],originally fromsignalprocessing,codingandclustering.OurnewmethodistermedasAggregationof ReadsbyK-means(ARK).Fromthestatisticalperspective,theoreticaljustificationofARK stemsfromamodelingframeworkwithamixtureofdensities. Methods Summarizingreadsequencedatabysinglemeank-mercounts Inthemethoddescription,wedenotethenon-negativereallinebyR andstatisticalexpecta- + tionoperatorbyE[.].First,wedescribethepreviouslypublishedapproachofusingsinglek- mersummariesforeachsample.Letx2R4k andC denoterandomk-merfeaturevectorsand þ m mthtaxonomicunit,respectively.Givenatestsetofk-mers(computedfromreads),thedistri- butionofthetestsetismodeledas X M pðxÞ¼ pðC Þ pðxjC Þ; ð1Þ m m m¼1 wherewedenoteprobabilityfortaxonomicunitm(orclassweight)byp(C ),satisfying P m M pðC Þ¼1.NotethatfpðC ÞgM isthecompositionoftaxonomicunitsinthegiventest m¼1 m m m¼1 set(reads).Theinferencetaskistoestimatep(C )asaccuratelyaspossiblewithareasonable m computationalresource.Letusderivethemeanvector Z Z Z X X M M E½x(cid:2)¼ xpðxÞdx¼ x pðC Þ pðxjC Þdx¼ pðC Þ x pðxjC Þdx: ð2Þ m m m m m¼1 m¼1 ThemeanE[x]containsinformationaboutp(C )inthisprobabilisticformulation.Inpractice, m theinformationsummaryisobtainedbycomputingthesamplemeanfromthecompletesetof readsavailableforasample.Letusdenotethesamplemeanofk-mersfeaturevectorsofreads byμ2R4k withtheassumptionthatμ(cid:3)E[x].Severalmethods,suchasTaxy[2],Quikr[3], þ andSEK[14]usethesamplemeanμdirectlyasthemaininputtocomputethecompositionp (C ). m AggregationofreadsbyK-means(ARK) Fortheabove-describedprincipleofinformationaggregationfromthereadsbythemeanvec- torofk-mercounts,computationofthesamplemeanvectorisstraightforward.Thisconse- quentlyenableshandlingofaverylargeamountofreadswithlowcomputationalcost. However,wehypothesizethatthesamplemeanvectorcomputedfromthefullsetofreadsis notsufficientintermsofinformationcontenttofacilitateaccurateestimationofp(C ).Indeed, m sincetypicallythenumberoftrainingtaxonomicunitsMismuchlargerthanthenumberofk- mers(forexamplek=6),thesetofk-mervectorsforfC gM isnotlinearlyindependent,and m m¼1 soweriskreconstructingamixtureoftaxonomicunitsasasingletaxonomicunit.Hence,we segregatethereadsintoseveralsubsetsandcomputeasamplemeanvectorseparatelyforeach subset,assumingthatasetofsamplemeanvectorsismoreinformativethanasinglemean PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 3/16 ARK vector.Notethatinthecasewheretheresultingreadsubsetswerenotinpracticedistinctfrom eachotherintermsoftheirk-mercounts,thesubsequentcompositionestimatewouldeffec- tivelybeidenticaltotheestimateobtainedwithasingledatasummarydescribedinEqs(1)and (2). Letuspartitionthek-mersfeaturespaceR4k intoQnon-overlappingregionsR suchthat þ q [Q R ¼R4k and8q,r,q6¼r,R \R =;.SuchpartitionscanbeformedbyastandardK- q¼1 q þ q r meansalgorithmthattypicallyusesanearestneighborclassificationrulebasedonsquare Euclideandistancemeasure.Thenon-overlappingregionsR arecalledVoronoiregions.We P q defineP ≜Pr(x2R )satisfying Q P ¼1.Inpractice,P iscomputedas q q q¼1 q q number of feature vectors in R P ¼ q: ð3Þ q total number of feature vectors Itisremindedthatthefeaturevectorsarek-mers.Thedistributionofthefulltestsetandsub- setscanbewrittenas P pðxÞ¼ Q P pðxjx2R Þ; q¼1 q q P ð4Þ pðxjx2R Þ¼ M pðC jx2R Þ pðxjC ;x2R Þ; q m¼1 m q m q wherethefirstequationfollowsastandardmixturedensityframework.Now,ifwecanestimate p(C jx2R ),thenthefinalquantityofinterestp(C )canbeestimatedas m q m X Q pðC Þ¼ P pðC jx2R Þ: ð5Þ m q m q q¼1 Theestimationofp(C )inEq(5)isajudiciousfusionofp(C jx2R )throughalinearcombi- m m q nation.LetusnowderivethemeanvectorforR ,whichisaconditionalmeanvector q E½xjx2R (cid:2) R q ¼ x pðxjx2R Þdx ð6Þ P q R ¼ M pðC jx2R Þ x pðxjC ;x2R Þ dx: m¼1 m q m q ThemeanE[xjx2R ]containsinformationaboutp(C jx2R ).Inpracticeweusethesam- q m q plemeandenotedbyμ withtheassumptionthatμ (cid:3)E[xjx2R ].ComparingEqs(2)and q q q (6),fortheqthVoronoiregionR wecanestimatecompositionp(C jx2R )byusingan q m q appropriatecompositionestimationmethod,suchasTaxy,QuikrorSEK. Algorithms TheARKalgorithmcanbeimplementedbyfollowingsteps. 1. Dividethefulltestdatasetofk-mersintoQsubsets.TheregionR correspondstotheqth q subset. 2. Fortheqthsubset,computeP andthesamplemeanμ . q q 3. Fortheqthsubset,applyacompositionestimationmethodthatusestheinputμ ;estimatep q (C jx2R ). m q P 4. Estimatep(Cm)bypðCmÞ¼ Qq¼1Pq pðCm jx2RqÞ. TheARKmethodisdescribedusingaflow-chartinFig1.Theflow-chartshowsthemain componentsoftheoverallsystemandtheassociatedoff-lineandon-linecomputations.The PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 4/16 ARK Fig1.Aflow-chartoftheARKmethod. doi:10.1371/journal.pone.0140644.g001 crucialcomputational/statisticalchallengesrelatedtotheARKalgorithmoutlinedaboveareas follows: 1. WhatisanappropriatenumberofsubsetsQ? 2. HowshouldoneformthesubsetsR ? q Theabovepointsareinherenttoanysubsetformingalgorithm,andmoregenerallytoany clusteringalgorithm.Furthermore,findingoptimalregions(orclusters)requiresalternative optimizationtechniques.Givenapre-definedQ,typicallyaK-meansalgorithmperformstwo PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 5/16 ARK alternatingoptimizationsteps.Theseare:(1)givenasetofrepresentationvectorsfμ gQ (also q q¼1 calledcodevectors)formnewclustersfR gQ byanearestneighborrule(orformnewsubsets q q¼1 fromthefulldataset),(2)findthesetofclusterrepresentationvectorsgiventheassignmentof dataintoclusters.TheoptimalrepresentationvectoristhemeanvectorifsquaredEuclidean distanceisusedforthenearestneighborrule.TheK-meansalgorithminitializeswithasetof representativevectorsandrunsalternatingoptimizationuntilconvergenceinthesensethat theaveragesquaredEuclideandistanceisnolongerreduced.Inthepresentpaperweperform theclusteringusingapopularvectorquantizationmethodcalledtheLinde-Buzo-Gray(LBG) algorithm[15](orsourcecodingliterature).ThereareseveralvariantsoftheLBGavailable.In onevariant,thealgorithmstartswithQ=1andthenslowlysplitsthedenseandhighprobabil- ityclusterstoendupwithahighQ,suchthatitdoesnotdeviatesignificantlyfromanexponen- tiallydecayingbitrateversuscodingdistortion(rate-distortion)curve. InARK,weusethefollowingtwostrategiestosolvethetwochallengeslistedabove. 1. Optimal/deterministicstrategy:StartwithQ=1,whichcorrespondstotheprevious approachwithasinglemeanvectorasthedatasummary.ThensetQ=2forLBGalgorithm thatusessquareEuclideandistanceasthedistortionmeasure;theLBGalgorithmminimizes meanofsquareEuclideandistance(alsocalledmeansquareerror).Initializationisdoneby astandardsplitapproachwherethemeanvectorisperturbed.UsingQ=2,fRqg2q¼1is formedandweestimatep(C ).Subsequently,Qisincreasedbyoneuntilaconvergencecri- m terionismet.ForQ(cid:4)3,wealwayssplitthehighestrankingclusterintotwosubclustersand usetheLBGalgorithmtofindtheoptimalclusters.ThenumberofclustersQisnolonger increasediftheestimatedvaluesofp(C )differnegligiblyforQand(Q−1).Inpractice,the m stoppingconditionweuseisthatthevariationaldistancebetweenp(Cm)jQandp(Cm)j(Q−1) islessthanapredeterminedthreshold.Thisconditioncanbewrittenas P (cid:2) (cid:3) Mm¼1abs pðCmÞjQ(cid:5)pðCmÞjðQ(cid:5)1Þ <Z,withauserdefinedchoiceofthethresholdη.Note thatη2(0,1]providesanallowablelimitasascaledvariationaldistance(VD)betweentwo probabilitymassfunctions;atypicalchoiceofηcanbe0.01.Thisstrategyistypicallyfound toprovideconsistentperformanceimprovementinthesenseofestimatingp(C )withthe m increaseinQbythestepofone,butwithoutabsoluteguaranteeasthetargetoptimization strategyminimizesmeansquareerror.Furthermore,weallowanincrementinthenumber ofclustersuptoapre-definedmaximumlimitQ .TypicallyQ ispreferablychosenas max max anintegerpoweroftwo.AtypicalchoiceofQ canbebetween16to256. max 2. Non-optimal/randomstrategy:Forverylargetestsets,weuseapre-determinedQanda randomchoiceoftheQrepresentationvectors.ThenthefulltestsetisdividedintoQsub- setsbyanearestneighborruleandwecomputethesetofQmeanvectors{μ },andcluster q probabilities{P }.Eventhoughthisnon-optimalstrategydoesnotuseanalternatingopti- q mization(suchasLBGalgorithm)toformoptimalclusters,itdividesthefulltestsetinto sub-sets,resultinginasetofQlocalizedmeanvectorsacrossthefulltestset. FinallywementionthattheuseofK-meansisfullymotivatedbyitssimplicityandcompu- tationalease.UseofstatisticalK-meansintheformofexpectation-maximizationbasedmix- turemodeling(forexample,Gaussianmixturemodel)couldhavebeeninvestigated,but requiresmorecomputationtohandlealargedatasetofreads. Syntheticdatagenerationformethodevaluation ToevaluatetheperformanceoftheARKmethod,weconductedexperimentsforsimulated dataasdescribedbelow.Forthese,andallcomputationsreportedintheremainderofthe PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 6/16 ARK paper,weusedMatlabversionR2013b(withsomeinstancesofCcode),onadesktopworksta- tionwithanIntelCorei74930Kprocessorand64GbofRAM. Testdatasets(Reads). Wesimulated18016SrRNAgene454-likedatasetsusingtheRDP trainingset7andtheGrinderreadsimulator[21]targetingtheV1–V2andV3–V5variable regionswithreadlengthsfixedat250bpornormallydistributedwithameanof450bpand variance50bp.Readdepthswerechosentobeeither10K,100Kor250K,whilethreedifferent readdistributionswereused:powerlaw,uniform,andlinear.Diversitywassetateither50, 100,or500taxaandchimerapercentagesweresetto5%or35%.TheBalzermodel[22]was chosenforhomopolymererrors,andcopybiaswasincludedwhilelengthbiaswasexcluded. Trainingdataset(Reference). InourARKexperimentsweusedQuikr[3]andSEK[14] toestimatep(C jx2R ).TheRDPtrainingset7wasusedasthebasereferencedatabasefor m q bothQuikrandSEK.NotethatthisisthesameasdatabaseD utilizedin[3].Whileinthe small mainmanuscriptweusethesamedataforbothtrainingandtestingthebasemethods(Quikr andSEK),inS1Fileweincluderesultsobtainedwhenthetestdatasetshavetaxaabsentfrom thetrainingdatabase(thatis,sistertaxahavebeenexcludedfromthetrainingdatabase).As expected,allmethodsexperiencealossinreconstructionaccuracywhensistertaxaareabsent, butARKQuikrandARKSEKarestillmoreaccuratethanRDP’sNBC. Realbiologicaldata TofurtherevaluateARK,wealsoutilized28IlluminaMiSeq16SrRNAgenehumanbody-site associatedsamples,plusonenegativecontrolsample.Therealdataconsistofatotalofover5.7 Mreadsdistributedoverthreevariableregions(V1–V2,V3–V4,andV3–V5)aswellastwo bodysites(vaginaandfeces). ForeachofthesesamplesDNAwasextractedusingtheFastDNASPINKitforSoilwitha FastPrepmachine(MPBiomedicals)followingthemanufacturer’sprotocol.16SrRNAgene ampliconsweregeneratedfromtheDNAextractionsusingtheprimercombinationslistedin Section5ofS1File.TheQ5High-fidelitypolymerasekit(NewEnglandBiolabs)wasusedto amplifythe16SrRNAgenes,andPCRconditionswereasfollows:98°Cfor2minutes,followed by20cyclesof98°Cfor30seconds,50°Cfor30secondsand72°Cfor1minute30seconds,fol- lowedbyafinalextensionstepat72°Cfor5minutes.FollowingPCR,theampliconswerethen purifiedusingtheWizardSVGelandPCRClean-Upkit(Promega,UK).Sequencingof16S rRNAgeneampliconswascarriedoutbyIlluminaInc.(LittleChesterford,UK)usingaMiSeq instrumentrunfor2x250(V1–V2),300+200(V3–V4)and400+200(V3–V5)cycles.These datahavebeensubmittedtotheEuropeanNucleotideArchiveusingtheaccessionnumber PRJEB9828. Aftertrimming20bpofprimeroffeachread,thesequencesweretrimmedfromtheright untilallbaseshadaqualityscoregreaterthan27.Thisreducedthetotalnumberofreadsto approximately4M,andreducedthemeanreadlengthfrom315bpto257bp.Wethenutilized allresultingunpairedreads(bothforwardandreverse)includinganyduplicatesequences.We includeinS1Fileresultsforanalternativeerror-correctionprotocol,aswellasresultsfor assemblingpaired-endreads(FigsEandFinS1File). EthicsStatement Forhumanbody-siteassociatedsamples,thefaecalsamplesusedwerenotpartofaclinical studysothereisnocorrespondingethicalapprovalorwrittenconsent.Therewerenoclinical records.Thesamplesareanonymisedandde-identified.Further,vaginalsampleswerecol- lectedaspartofanobservationalmicrobicidefeasibilitystudy.Thestudywasapprovedbythe EthicsCommitteesoftheNationalInstituteforMedicalResearchinTanzaniaandLondon PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 7/16 ARK SchoolofHygieneandTropicalMedicine,andallparticipantsgavewritteninformedconsent. Allrecordswereanonymizedandde-identifiedpriortothisretrospectiveanalysis. Results Performancemeasureandrelevantmethods Asaquantitativeperformancemeasure,weusevariationaldistance(VD)tocomparebetween knownproportionsoftaxonomicunitsp=[p(C ),p(C ),...,p(C )]tandtheestimatedpropor- 1 2 M tionsp^ ¼½p^ðC Þ; p^ðC Þ;...;p^ðC Þ(cid:2)t.TheVDisdefinedas 1 2 M VD¼0:5(cid:6)kp(cid:5)p^k 2½0;1(cid:2): 1 AlowVDindicatesmoresatisfactoryperformance. ForARK,weusedbothSEKandQuikrastheunderlyingestimationmethodsappliedto eachcluster.Theserecentmethodswerechosenasappropriaterepresentativesoffastandaccu- ratesparsesignalprocessingapproaches.Ak-mersizeofk=6wasusedforbothQuikrand SEK. AspartoftheSEKpipeline,sequencesinagivendatabasearesplitintosubsequences.We selectedfromthe10,046sequencesintheRDPtrainingset7allsequenceslongerthan700bp inlength,andthensplitthesequencesintosubsequencesoflength400bpwith100bpofover- lap.ThiscorrespondstosettingL =400andL =100asspecifiedin[14].WeusedtheSEK w p algorithmOMPþ;1withparametersasin[14]. sek ResultsforSimulatedData Effectofincreasingnumberofclusters. Wefirstinvestigatehowanincreaseinthenum- berofclustersQaffectsthecompositionreconstructionfidelityandalgorithmexecutiontime forthesimulateddata.Onlythenon-optimal/randomstrategyofK-meansclusteringwasuti- lizedaswefoundthattheperformanceimprovementforoptimal/deterministicstrategywas insignificantgiventheresultingincreaseinexecutiontime(resultsnotshown).Averagingthe VDerroratthegenusleveloverall180simulatedexperiments,itwasfoundthatcombining ARKwithbothSEKandQuikrresultedinapowerlawkindofdecayofVDerrorasafunction ofthenumberofclusters(Fig2).ARKcausesasubstantialincreaseinreconstructionfidelity whichcanbeseensinceusingARKSEKorARKQuikrwithoneclusterisequivalenttorun- ningSEKorQuikrwithnomodification. Sincetheunderlyingalgorithm(SEKorQuikr)mustbeexecutedoneachclusterformedby theK-meansclustering,weexpectthetotalalgorithmexecutiontimetoincreasebyafactor equaltothenumberofchosenclusters.AsseeninFig3,bothalgorithmsexperiencean increaseinexecutiontimeroughlyproportionaltothenumberofclusters. Fixednumberofclusters. Asseenabove,giventhedecreaseinVDasafunctionofthe numberofclusters,wealsofixedthenumberofclustersQto75tocomparetheperformance oftheunderlyingalgorithmswithandwithoutARK.Therewasasignificantdecreaseinthe VDerror(asseeninFig4)atthecostofanincreaseinexecutiontime(asseeninFig5).How- ever,giventhespeedofbothQuikrandSEK,weexpecttheadditionofARKwillnotresultin prohibitivelylongexecutiontimes.Indeed,asseenabove,onrealbiologicaldatabothARK QuikrandARKSEKarestillseveralhoursfasterthantheRibosomalDatabaseProject’sNaïve BayesianClassifier(RDP’sNBC)[1],evenwhenusing75clusters. PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 8/16 ARK Fig2.ResultsfortherandomK-meansclusteringonthesimulateddata.MeanVDerroratthegenus levelasafunctionofthenumberofclusters.NotetheimprovementthatARKcontributestoeachmethod. doi:10.1371/journal.pone.0140644.g002 Fig3.ResultsfortherandomK-meansclusteringonthesimulateddata.Meanexecutiontimeincrease (factorgivenincomparisontorunningSEKorQuikrintheabsenceofARK)asafunctionofnumberof clusters.Thedashedlinerepresentsalinewithslope1. doi:10.1371/journal.pone.0140644.g003 PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 9/16

Description:

els of biological and technical noise, accurate analysis of such large using sequence similarity, or on genomic signatures in terms of k-mer SEK and Quikr are sparse signal processing based methods (inspired by .. Ethics Committees of the National Institute for Medical Research in Tanzania and

ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition PDF

17 Pages·2015·1.57 MB·English

by Koslicki, David

Checking for file health...

Save to my drive

Quick download

Download

Download ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition PDF Free - Full Version

by Koslicki, David| 2015| 17 pages| 1.57| English

Download ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition by Koslicki, David in PDF format completely FREE. No registration required, no payment needed. Get instant access to this valuable resource on PDFdrive.to!

Free Download PDF

About ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition

Detailed Information

Author:	Koslicki, David
Publication Year:	2015
Pages:	17
Language:	English
File Size:	1.57
Format:	PDF
Price:	FREE

Download Free PDF

Safe & Secure Download - No registration required

Why Choose PDFdrive for Your Free ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition Download?

100% Free: No hidden fees or subscriptions required for one book every day.
No Registration: Immediate access is available without creating accounts for one book every day.
Safe and Secure: Clean downloads without malware or viruses
Multiple Formats: PDF, MOBI, Mpub,... optimized for all devices
Educational Resource: Supporting knowledge sharing and learning

Frequently Asked Questions

Is it really free to download ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition PDF?

Yes, on https://PDFdrive.to you can download ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition by Koslicki, David completely free. We don't require any payment, subscription, or registration to access this PDF file. For 3 books every day.

How can I read ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition on my mobile device?

After downloading ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition PDF, you can open it with any PDF reader app on your phone or tablet. We recommend using Adobe Acrobat Reader, Apple Books, or Google Play Books for the best reading experience.

Is this the full version of ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition?

Yes, this is the complete PDF version of ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition by Koslicki, David. You will be able to read the entire content as in the printed version without missing any pages.

Is it legal to download ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition PDF for free?

https://PDFdrive.to provides links to free educational resources available online. We do not store any files on our servers. Please be aware of copyright laws in your country before downloading.

The materials shared are intended for research, educational, and personal use in accordance with fair use principles.