Table Of ContentARK: Aggregation of Reads by K-Means for Estimation of Bacterial
Community Composition
Koslicki, D., Chatterjee, S., Shahrivar, D., Walker, A. W., Francis, S. C., Fraser, L.
J., ... & Corander, J. (2015). ARK: Aggregation of Reads by K-Means for
Estimation of Bacterial Community Composition. PLoS ONE, 10(10), e0140644.
doi:10.1371/journal.pone.0140644
10.1371/journal.pone.0140644
Public Library of Science
Version of Record
http://cdss.library.oregonstate.edu/sa-termsofuse
RESEARCHARTICLE
ARK: Aggregation of Reads by K-Means for
Estimation of Bacterial Community
Composition
DavidKoslicki1,SaikatChatterjee2*,DamonShahrivar2,AlanW.Walker3,Suzanna
C.Francis4,LouiseJ.Fraser5,MikkoVehkaperä6,YuehengLan7,JukkaCorander8
1DeptofMathematics,OregonStateUniversity,Corvallis,UnitedStatesofAmerica,2Deptof
CommunicationTheory,KTHRoyalInstituteofTechnology,Stockholm,Sweden,3MicrobiologyGroup,
RowettInstituteofNutritionandHealth,UniversityofAberdeen,Aberdeen,UnitedKingdom,4MRCTropical
a11111
EpidemiologyGroup,LondonSchoolofHygieneandTropicalMedicine,London,UnitedKingdom,5Illumina
CambridgeLtd.,ChesterfordResearchPark,Essex,UnitedKingdom,6DeptofElectronicandElectrical
Engineering,UniversityofSheffield,Sheffield,UnitedKingdom,7DeptofPhysics,TsinghuaUniversity,
Beijing,China,8DeptofMathematicsandStatistics,UniversityofHelsinki,Helsinki,Finland
*[email protected]
OPENACCESS
Abstract
Citation:KoslickiD,ChatterjeeS,ShahrivarD,
WalkerAW,FrancisSC,FraserLJ,etal.(2015)ARK:
AggregationofReadsbyK-MeansforEstimationof
BacterialCommunityComposition.PLoSONE10 Motivation
(10):e0140644.doi:10.1371/journal.pone.0140644
Estimationofbacterialcommunitycompositionfromhigh-throughputsequenced16SrRNA
Editor:JonathanH.Badger,NationalCancer geneampliconsisakeytaskinmicrobialecology.Sincethesequencedatafromeachsam-
Institute,UNITEDSTATES
pletypicallyconsistofalargenumberofreadsandareadverselyimpactedbydifferentlev-
Received:April20,2015 elsofbiologicalandtechnicalnoise,accurateanalysisofsuchlargedatasetsis
Accepted:September28,2015 challenging.
Published:October23,2015
Results
Copyright:Thisisanopenaccessarticle,freeofall
copyright,andmaybefreelyreproduced,distributed, Therehasbeenarecentsurgeofinterestinusingcompressedsensinginspiredandcon-
transmitted,modified,builtupon,orotherwiseused
vex-optimizationbasedmethodstosolvetheestimationproblemforbacterialcommunity
byanyoneforanylawfulpurpose.Theworkismade
composition.Thesemethodstypicallyrelyonsummarizingthesequencedatabyfrequen-
availableundertheCreativeCommonsCC0public
domaindedication. ciesoflow-orderk-mersandmatchingthisinformationstatisticallywithataxonomically
structureddatabase.Hereweshowthattheaccuracyoftheresultingcommunitycomposi-
DataAvailabilityStatement:Thedataand
programsareavailablefromGitHub(https://github. tionestimatescanbesubstantiallyimprovedbyaggregatingthereadsfromasamplewith
com/dkoslicki/ARK)orfromhere(http://www.ee.kth. anunsupervisedmachinelearningapproachpriortotheestimationphase.Theaggregation
se/ctsoftware).Furtherrealbiologicaldatathatwe
ofreadsisapre-processingapproachwhereweuseastandardK-meansclusteringalgo-
usedforexperimenthavebeensubmittedtothe
EuropeanNucleotideArchiveusingtheaccession rithmthatpartitionsalargesetofreadsintosubsetswithreasonablecomputationalcostto
numberPRJEB9828. provideseveralvectorsoffirstorderstatisticsinsteadofonlysinglestatisticalsummariza-
Funding:ThisworkwassupportedbytheSwedish tionintermsofk-merfrequencies.Theoutputoftheclusteringisthenprocessedfurtherto
ResearchCouncilLinnaeusCentreACCESS(S.C.), obtainthefinalestimateforeachsample.TheresultingmethodiscalledAggregationof
ERCgrant239784(J.C.),theAcademyofFinland
ReadsbyK-means(ARK),anditisbasedonastatisticalargumentviamixturedensityfor-
CenterofExcellenceCOIN(J.C.),theAcademyof
Finland(M.V.),theScottishGovernment’sRuraland mulation.ARKisfoundtoimprovethefidelityandrobustnessofseveralrecentlyintroduced
EnvironmentScienceandAnalyticalServices methods,withonlyamodestincreaseincomputationalcomplexity.
PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 1/16
ARK
Division(RESAS)(A.W.W),andtheUKMRC/DFID Availability
grantG1002369(S.C.F).L.J.F.receivedfundingin
Anopensource,platform-independentimplementationofthemethodintheJuliaprogram-
theformofsalaryfromIlluminaCambridgeLtd.The
fundershadnoroleinstudydesign,datacollection minglanguageisfreelyavailableathttps://github.com/dkoslicki/ARK.AMatlabimplementa-
andanalysis,decisiontopublish,orpreparationof tionisavailableathttp://www.ee.kth.se/ctsoftware.
themanuscript.
CompetingInterests:L.J.F.receivedfundinginthe
formofsalaryfromIlluminaCambridgeLtd.Thisdoes
notaltertheauthors’adherencetoallthePLOSONE
policiesonsharingdataandmaterials.
Introduction
Theadventofhigh-throughputsequencingtechnologieshasenableddetectionofbacterial
communitycompositionatanunprecedentedlevelofdetail.Atechnologicalapproachisto
produceforeachsamplealargenumberofreadsfromampliconsofthe16SrRNAgene,which
enablesanidentificationandcomparisonoftherelativefrequenciesofdifferenttaxonomic
unitspresentacrosssamples.Therapidlyincreasingnumberofreadsproducedpersample
resultsintheneedforfasttaxonomicclassificationofsamples.Thisproblemhasattractedcon-
siderablerecentattention[1–5].
Manyexistingapproachestothebacterialcommunitycompositionestimationproblemuse
16SrRNAgeneampliconsequencingwherealargeamountofmoderatelengthreads(around
250–500bp)areproducedfromeachsampleandthengenerallyeitherclusteredorclassifiedto
obtainacompositionestimateoftaxonomicunits.Intheclusteringapproach,readsare
groupedintotaxonomicunitsbyeitherdistance-basedorprobabilisticmethods[6–8],such
thattheactualtaxonomiclabelsareassignedtotheclustersafterwardsbymatchingtheircon-
sensussequencestoareferencedatabase.Incontrasttotheclusteringmethods,theclassifica-
tionapproachisbasedonusingareferencedatabasedirectlytoassignreadstomeaningful
biologicalunits.Methodsfortheclassificationofreadshavebeenbasedeitheronhomology
usingsequencesimilarity,orongenomicsignaturesintermsofk-mercomposition.Examples
ofhomology-basedmethodsincludeMEGAN[9,10]andphylogeneticanalysis[11].Another
popularapproachistouseaBayesianclassifier[1,12,13].Onesuchmethod,theRibosomal
DatabaseProject’s(RDP)naïveBayesianclassifier(NBC)[1],assignsalabelexplicitlytoeach
readproducedforaparticularsample.DespitethemethodologicalsimplicityofNBC,theRDP
classifiermaystillrequireseveraldaystoprocessalargedatasetinadesktopenvironmentdue
totheread-by-readclassificationapproach.Giventhischallenge,considerablyfasterestimation
methodsbasedonmixturesofk-mercountshavebeendeveloped,forexample,Taxy[2],
Quikr[3]andtherecentlyproposedSEK[14].Taxyisaconvex-optimizationbasedmethod.
SEKandQuikraresparsesignalprocessingbasedmethods(inspiredbycompressedsensing
andconvex-optimization),andSEKwasshowntoperformbetterthanQuikrandTaxyin[14].
Taxy,QuikrandSEKalluseastheirmaininputa(statistical)meanvectorofsamplek-mer
countscomputedfromthereadsobtainedforasample.Thek-mercounts(alsocalledk-mers)
arefeaturevectorsextractedfromrawsequencedata.Thenecessarymodelingassumptionis
thatthesamplemeanvectorofk-mercounts(thatmeansfirstorderstatistics)issufficiently
informativeaboutthesamplecomposition.Thesethreemethodsdonotusethereadsinany
additionalwayoncethemeanvectorofk-mersiscomputed.Weproposehereanalternative
basisofinformationaggregationthatremainscomputationallytractabletoallowprocessingof
largesetsofreads.Borrowingideasfromsourcecodinginsignalprocessing[15,16],clustering
inmachinelearningandsourcecoding[17],fusioninsignalestimation[18]anddivide-and-
conquerbasedshotgunsequenceassembly[19],ournovelapproachfirstsegregatesthefullset
ofreadsintosubsets(inthek-mersfeaturespace),computesthemeanvectorforeachsubset,
employsastandardmethod(suchasTaxy,QuikrorSEK)toestimatecompositionforeach
PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 2/16
ARK
subset,andfinallyfusestheseestimatesintoacompositionestimatejointlyforallthereads.To
segregatethereadsintosubsets,wechoosetoemploytheK-meansclusteringalgorithm[20].
SincetheK-meansclusteringalgorithmissimpleandcomputationallyinexpensiveforarea-
sonablenumberQofclusters(subsets),itcanbeusedtopartitionevenfairlylargesetsofreads
intomore(intra)homogeneoussubsets.Byitsveryalgorithmicnature,K-meansclustering
partitionsthefeaturespaceintoQnon-overlappingregionsandprovidesasetofcorrespond-
ingmeanvectors.Thisiscalledcodebookgenerationinvectorquantization[15],originally
fromsignalprocessing,codingandclustering.OurnewmethodistermedasAggregationof
ReadsbyK-means(ARK).Fromthestatisticalperspective,theoreticaljustificationofARK
stemsfromamodelingframeworkwithamixtureofdensities.
Methods
Summarizingreadsequencedatabysinglemeank-mercounts
Inthemethoddescription,wedenotethenon-negativereallinebyR andstatisticalexpecta-
+
tionoperatorbyE[.].First,wedescribethepreviouslypublishedapproachofusingsinglek-
mersummariesforeachsample.Letx2R4k andC denoterandomk-merfeaturevectorsand
þ m
mthtaxonomicunit,respectively.Givenatestsetofk-mers(computedfromreads),thedistri-
butionofthetestsetismodeledas
X
M
pðxÞ¼ pðC Þ pðxjC Þ; ð1Þ
m m
m¼1
wherewedenoteprobabilityfortaxonomicunitm(orclassweight)byp(C ),satisfying
P m
M pðC Þ¼1.NotethatfpðC ÞgM isthecompositionoftaxonomicunitsinthegiventest
m¼1 m m m¼1
set(reads).Theinferencetaskistoestimatep(C )asaccuratelyaspossiblewithareasonable
m
computationalresource.Letusderivethemeanvector
Z Z Z
X X
M M
E½x(cid:2)¼ xpðxÞdx¼ x pðC Þ pðxjC Þdx¼ pðC Þ x pðxjC Þdx: ð2Þ
m m m m
m¼1 m¼1
ThemeanE[x]containsinformationaboutp(C )inthisprobabilisticformulation.Inpractice,
m
theinformationsummaryisobtainedbycomputingthesamplemeanfromthecompletesetof
readsavailableforasample.Letusdenotethesamplemeanofk-mersfeaturevectorsofreads
byμ2R4k withtheassumptionthatμ(cid:3)E[x].Severalmethods,suchasTaxy[2],Quikr[3],
þ
andSEK[14]usethesamplemeanμdirectlyasthemaininputtocomputethecompositionp
(C ).
m
AggregationofreadsbyK-means(ARK)
Fortheabove-describedprincipleofinformationaggregationfromthereadsbythemeanvec-
torofk-mercounts,computationofthesamplemeanvectorisstraightforward.Thisconse-
quentlyenableshandlingofaverylargeamountofreadswithlowcomputationalcost.
However,wehypothesizethatthesamplemeanvectorcomputedfromthefullsetofreadsis
notsufficientintermsofinformationcontenttofacilitateaccurateestimationofp(C ).Indeed,
m
sincetypicallythenumberoftrainingtaxonomicunitsMismuchlargerthanthenumberofk-
mers(forexamplek=6),thesetofk-mervectorsforfC gM isnotlinearlyindependent,and
m m¼1
soweriskreconstructingamixtureoftaxonomicunitsasasingletaxonomicunit.Hence,we
segregatethereadsintoseveralsubsetsandcomputeasamplemeanvectorseparatelyforeach
subset,assumingthatasetofsamplemeanvectorsismoreinformativethanasinglemean
PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 3/16
ARK
vector.Notethatinthecasewheretheresultingreadsubsetswerenotinpracticedistinctfrom
eachotherintermsoftheirk-mercounts,thesubsequentcompositionestimatewouldeffec-
tivelybeidenticaltotheestimateobtainedwithasingledatasummarydescribedinEqs(1)and
(2).
Letuspartitionthek-mersfeaturespaceR4k intoQnon-overlappingregionsR suchthat
þ q
[Q R ¼R4k and8q,r,q6¼r,R \R =;.SuchpartitionscanbeformedbyastandardK-
q¼1 q þ q r
meansalgorithmthattypicallyusesanearestneighborclassificationrulebasedonsquare
Euclideandistancemeasure.Thenon-overlappingregionsR arecalledVoronoiregions.We
P q
defineP ≜Pr(x2R )satisfying Q P ¼1.Inpractice,P iscomputedas
q q q¼1 q q
number of feature vectors in R
P ¼ q: ð3Þ
q total number of feature vectors
Itisremindedthatthefeaturevectorsarek-mers.Thedistributionofthefulltestsetandsub-
setscanbewrittenas
P
pðxÞ¼ Q P pðxjx2R Þ;
q¼1 q q
P ð4Þ
pðxjx2R Þ¼ M pðC jx2R Þ pðxjC ;x2R Þ;
q m¼1 m q m q
wherethefirstequationfollowsastandardmixturedensityframework.Now,ifwecanestimate
p(C jx2R ),thenthefinalquantityofinterestp(C )canbeestimatedas
m q m
X
Q
pðC Þ¼ P pðC jx2R Þ: ð5Þ
m q m q
q¼1
Theestimationofp(C )inEq(5)isajudiciousfusionofp(C jx2R )throughalinearcombi-
m m q
nation.LetusnowderivethemeanvectorforR ,whichisaconditionalmeanvector
q
E½xjx2R (cid:2)
R q
¼ x pðxjx2R Þdx ð6Þ
P q R
¼ M pðC jx2R Þ x pðxjC ;x2R Þ dx:
m¼1 m q m q
ThemeanE[xjx2R ]containsinformationaboutp(C jx2R ).Inpracticeweusethesam-
q m q
plemeandenotedbyμ withtheassumptionthatμ (cid:3)E[xjx2R ].ComparingEqs(2)and
q q q
(6),fortheqthVoronoiregionR wecanestimatecompositionp(C jx2R )byusingan
q m q
appropriatecompositionestimationmethod,suchasTaxy,QuikrorSEK.
Algorithms
TheARKalgorithmcanbeimplementedbyfollowingsteps.
1. Dividethefulltestdatasetofk-mersintoQsubsets.TheregionR correspondstotheqth
q
subset.
2. Fortheqthsubset,computeP andthesamplemeanμ .
q q
3. Fortheqthsubset,applyacompositionestimationmethodthatusestheinputμ ;estimatep
q
(C jx2R ).
m q
P
4. Estimatep(Cm)bypðCmÞ¼ Qq¼1Pq pðCm jx2RqÞ.
TheARKmethodisdescribedusingaflow-chartinFig1.Theflow-chartshowsthemain
componentsoftheoverallsystemandtheassociatedoff-lineandon-linecomputations.The
PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 4/16
ARK
Fig1.Aflow-chartoftheARKmethod.
doi:10.1371/journal.pone.0140644.g001
crucialcomputational/statisticalchallengesrelatedtotheARKalgorithmoutlinedaboveareas
follows:
1. WhatisanappropriatenumberofsubsetsQ?
2. HowshouldoneformthesubsetsR ?
q
Theabovepointsareinherenttoanysubsetformingalgorithm,andmoregenerallytoany
clusteringalgorithm.Furthermore,findingoptimalregions(orclusters)requiresalternative
optimizationtechniques.Givenapre-definedQ,typicallyaK-meansalgorithmperformstwo
PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 5/16
ARK
alternatingoptimizationsteps.Theseare:(1)givenasetofrepresentationvectorsfμ gQ (also
q q¼1
calledcodevectors)formnewclustersfR gQ byanearestneighborrule(orformnewsubsets
q q¼1
fromthefulldataset),(2)findthesetofclusterrepresentationvectorsgiventheassignmentof
dataintoclusters.TheoptimalrepresentationvectoristhemeanvectorifsquaredEuclidean
distanceisusedforthenearestneighborrule.TheK-meansalgorithminitializeswithasetof
representativevectorsandrunsalternatingoptimizationuntilconvergenceinthesensethat
theaveragesquaredEuclideandistanceisnolongerreduced.Inthepresentpaperweperform
theclusteringusingapopularvectorquantizationmethodcalledtheLinde-Buzo-Gray(LBG)
algorithm[15](orsourcecodingliterature).ThereareseveralvariantsoftheLBGavailable.In
onevariant,thealgorithmstartswithQ=1andthenslowlysplitsthedenseandhighprobabil-
ityclusterstoendupwithahighQ,suchthatitdoesnotdeviatesignificantlyfromanexponen-
tiallydecayingbitrateversuscodingdistortion(rate-distortion)curve.
InARK,weusethefollowingtwostrategiestosolvethetwochallengeslistedabove.
1. Optimal/deterministicstrategy:StartwithQ=1,whichcorrespondstotheprevious
approachwithasinglemeanvectorasthedatasummary.ThensetQ=2forLBGalgorithm
thatusessquareEuclideandistanceasthedistortionmeasure;theLBGalgorithmminimizes
meanofsquareEuclideandistance(alsocalledmeansquareerror).Initializationisdoneby
astandardsplitapproachwherethemeanvectorisperturbed.UsingQ=2,fRqg2q¼1is
formedandweestimatep(C ).Subsequently,Qisincreasedbyoneuntilaconvergencecri-
m
terionismet.ForQ(cid:4)3,wealwayssplitthehighestrankingclusterintotwosubclustersand
usetheLBGalgorithmtofindtheoptimalclusters.ThenumberofclustersQisnolonger
increasediftheestimatedvaluesofp(C )differnegligiblyforQand(Q−1).Inpractice,the
m
stoppingconditionweuseisthatthevariationaldistancebetweenp(Cm)jQandp(Cm)j(Q−1)
islessthanapredeterminedthreshold.Thisconditioncanbewrittenas
P (cid:2) (cid:3)
Mm¼1abs pðCmÞjQ(cid:5)pðCmÞjðQ(cid:5)1Þ <Z,withauserdefinedchoiceofthethresholdη.Note
thatη2(0,1]providesanallowablelimitasascaledvariationaldistance(VD)betweentwo
probabilitymassfunctions;atypicalchoiceofηcanbe0.01.Thisstrategyistypicallyfound
toprovideconsistentperformanceimprovementinthesenseofestimatingp(C )withthe
m
increaseinQbythestepofone,butwithoutabsoluteguaranteeasthetargetoptimization
strategyminimizesmeansquareerror.Furthermore,weallowanincrementinthenumber
ofclustersuptoapre-definedmaximumlimitQ .TypicallyQ ispreferablychosenas
max max
anintegerpoweroftwo.AtypicalchoiceofQ canbebetween16to256.
max
2. Non-optimal/randomstrategy:Forverylargetestsets,weuseapre-determinedQanda
randomchoiceoftheQrepresentationvectors.ThenthefulltestsetisdividedintoQsub-
setsbyanearestneighborruleandwecomputethesetofQmeanvectors{μ },andcluster
q
probabilities{P }.Eventhoughthisnon-optimalstrategydoesnotuseanalternatingopti-
q
mization(suchasLBGalgorithm)toformoptimalclusters,itdividesthefulltestsetinto
sub-sets,resultinginasetofQlocalizedmeanvectorsacrossthefulltestset.
FinallywementionthattheuseofK-meansisfullymotivatedbyitssimplicityandcompu-
tationalease.UseofstatisticalK-meansintheformofexpectation-maximizationbasedmix-
turemodeling(forexample,Gaussianmixturemodel)couldhavebeeninvestigated,but
requiresmorecomputationtohandlealargedatasetofreads.
Syntheticdatagenerationformethodevaluation
ToevaluatetheperformanceoftheARKmethod,weconductedexperimentsforsimulated
dataasdescribedbelow.Forthese,andallcomputationsreportedintheremainderofthe
PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 6/16
ARK
paper,weusedMatlabversionR2013b(withsomeinstancesofCcode),onadesktopworksta-
tionwithanIntelCorei74930Kprocessorand64GbofRAM.
Testdatasets(Reads). Wesimulated18016SrRNAgene454-likedatasetsusingtheRDP
trainingset7andtheGrinderreadsimulator[21]targetingtheV1–V2andV3–V5variable
regionswithreadlengthsfixedat250bpornormallydistributedwithameanof450bpand
variance50bp.Readdepthswerechosentobeeither10K,100Kor250K,whilethreedifferent
readdistributionswereused:powerlaw,uniform,andlinear.Diversitywassetateither50,
100,or500taxaandchimerapercentagesweresetto5%or35%.TheBalzermodel[22]was
chosenforhomopolymererrors,andcopybiaswasincludedwhilelengthbiaswasexcluded.
Trainingdataset(Reference). InourARKexperimentsweusedQuikr[3]andSEK[14]
toestimatep(C jx2R ).TheRDPtrainingset7wasusedasthebasereferencedatabasefor
m q
bothQuikrandSEK.NotethatthisisthesameasdatabaseD utilizedin[3].Whileinthe
small
mainmanuscriptweusethesamedataforbothtrainingandtestingthebasemethods(Quikr
andSEK),inS1Fileweincluderesultsobtainedwhenthetestdatasetshavetaxaabsentfrom
thetrainingdatabase(thatis,sistertaxahavebeenexcludedfromthetrainingdatabase).As
expected,allmethodsexperiencealossinreconstructionaccuracywhensistertaxaareabsent,
butARKQuikrandARKSEKarestillmoreaccuratethanRDP’sNBC.
Realbiologicaldata
TofurtherevaluateARK,wealsoutilized28IlluminaMiSeq16SrRNAgenehumanbody-site
associatedsamples,plusonenegativecontrolsample.Therealdataconsistofatotalofover5.7
Mreadsdistributedoverthreevariableregions(V1–V2,V3–V4,andV3–V5)aswellastwo
bodysites(vaginaandfeces).
ForeachofthesesamplesDNAwasextractedusingtheFastDNASPINKitforSoilwitha
FastPrepmachine(MPBiomedicals)followingthemanufacturer’sprotocol.16SrRNAgene
ampliconsweregeneratedfromtheDNAextractionsusingtheprimercombinationslistedin
Section5ofS1File.TheQ5High-fidelitypolymerasekit(NewEnglandBiolabs)wasusedto
amplifythe16SrRNAgenes,andPCRconditionswereasfollows:98°Cfor2minutes,followed
by20cyclesof98°Cfor30seconds,50°Cfor30secondsand72°Cfor1minute30seconds,fol-
lowedbyafinalextensionstepat72°Cfor5minutes.FollowingPCR,theampliconswerethen
purifiedusingtheWizardSVGelandPCRClean-Upkit(Promega,UK).Sequencingof16S
rRNAgeneampliconswascarriedoutbyIlluminaInc.(LittleChesterford,UK)usingaMiSeq
instrumentrunfor2x250(V1–V2),300+200(V3–V4)and400+200(V3–V5)cycles.These
datahavebeensubmittedtotheEuropeanNucleotideArchiveusingtheaccessionnumber
PRJEB9828.
Aftertrimming20bpofprimeroffeachread,thesequencesweretrimmedfromtheright
untilallbaseshadaqualityscoregreaterthan27.Thisreducedthetotalnumberofreadsto
approximately4M,andreducedthemeanreadlengthfrom315bpto257bp.Wethenutilized
allresultingunpairedreads(bothforwardandreverse)includinganyduplicatesequences.We
includeinS1Fileresultsforanalternativeerror-correctionprotocol,aswellasresultsfor
assemblingpaired-endreads(FigsEandFinS1File).
EthicsStatement
Forhumanbody-siteassociatedsamples,thefaecalsamplesusedwerenotpartofaclinical
studysothereisnocorrespondingethicalapprovalorwrittenconsent.Therewerenoclinical
records.Thesamplesareanonymisedandde-identified.Further,vaginalsampleswerecol-
lectedaspartofanobservationalmicrobicidefeasibilitystudy.Thestudywasapprovedbythe
EthicsCommitteesoftheNationalInstituteforMedicalResearchinTanzaniaandLondon
PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 7/16
ARK
SchoolofHygieneandTropicalMedicine,andallparticipantsgavewritteninformedconsent.
Allrecordswereanonymizedandde-identifiedpriortothisretrospectiveanalysis.
Results
Performancemeasureandrelevantmethods
Asaquantitativeperformancemeasure,weusevariationaldistance(VD)tocomparebetween
knownproportionsoftaxonomicunitsp=[p(C ),p(C ),...,p(C )]tandtheestimatedpropor-
1 2 M
tionsp^ ¼½p^ðC Þ; p^ðC Þ;...;p^ðC Þ(cid:2)t.TheVDisdefinedas
1 2 M
VD¼0:5(cid:6)kp(cid:5)p^k 2½0;1(cid:2):
1
AlowVDindicatesmoresatisfactoryperformance.
ForARK,weusedbothSEKandQuikrastheunderlyingestimationmethodsappliedto
eachcluster.Theserecentmethodswerechosenasappropriaterepresentativesoffastandaccu-
ratesparsesignalprocessingapproaches.Ak-mersizeofk=6wasusedforbothQuikrand
SEK.
AspartoftheSEKpipeline,sequencesinagivendatabasearesplitintosubsequences.We
selectedfromthe10,046sequencesintheRDPtrainingset7allsequenceslongerthan700bp
inlength,andthensplitthesequencesintosubsequencesoflength400bpwith100bpofover-
lap.ThiscorrespondstosettingL =400andL =100asspecifiedin[14].WeusedtheSEK
w p
algorithmOMPþ;1withparametersasin[14].
sek
ResultsforSimulatedData
Effectofincreasingnumberofclusters. Wefirstinvestigatehowanincreaseinthenum-
berofclustersQaffectsthecompositionreconstructionfidelityandalgorithmexecutiontime
forthesimulateddata.Onlythenon-optimal/randomstrategyofK-meansclusteringwasuti-
lizedaswefoundthattheperformanceimprovementforoptimal/deterministicstrategywas
insignificantgiventheresultingincreaseinexecutiontime(resultsnotshown).Averagingthe
VDerroratthegenusleveloverall180simulatedexperiments,itwasfoundthatcombining
ARKwithbothSEKandQuikrresultedinapowerlawkindofdecayofVDerrorasafunction
ofthenumberofclusters(Fig2).ARKcausesasubstantialincreaseinreconstructionfidelity
whichcanbeseensinceusingARKSEKorARKQuikrwithoneclusterisequivalenttorun-
ningSEKorQuikrwithnomodification.
Sincetheunderlyingalgorithm(SEKorQuikr)mustbeexecutedoneachclusterformedby
theK-meansclustering,weexpectthetotalalgorithmexecutiontimetoincreasebyafactor
equaltothenumberofchosenclusters.AsseeninFig3,bothalgorithmsexperiencean
increaseinexecutiontimeroughlyproportionaltothenumberofclusters.
Fixednumberofclusters. Asseenabove,giventhedecreaseinVDasafunctionofthe
numberofclusters,wealsofixedthenumberofclustersQto75tocomparetheperformance
oftheunderlyingalgorithmswithandwithoutARK.Therewasasignificantdecreaseinthe
VDerror(asseeninFig4)atthecostofanincreaseinexecutiontime(asseeninFig5).How-
ever,giventhespeedofbothQuikrandSEK,weexpecttheadditionofARKwillnotresultin
prohibitivelylongexecutiontimes.Indeed,asseenabove,onrealbiologicaldatabothARK
QuikrandARKSEKarestillseveralhoursfasterthantheRibosomalDatabaseProject’sNaïve
BayesianClassifier(RDP’sNBC)[1],evenwhenusing75clusters.
PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 8/16
ARK
Fig2.ResultsfortherandomK-meansclusteringonthesimulateddata.MeanVDerroratthegenus
levelasafunctionofthenumberofclusters.NotetheimprovementthatARKcontributestoeachmethod.
doi:10.1371/journal.pone.0140644.g002
Fig3.ResultsfortherandomK-meansclusteringonthesimulateddata.Meanexecutiontimeincrease
(factorgivenincomparisontorunningSEKorQuikrintheabsenceofARK)asafunctionofnumberof
clusters.Thedashedlinerepresentsalinewithslope1.
doi:10.1371/journal.pone.0140644.g003
PLOSONE|DOI:10.1371/journal.pone.0140644 October23,2015 9/16
Description:els of biological and technical noise, accurate analysis of such large using sequence similarity, or on genomic signatures in terms of k-mer SEK and Quikr are sparse signal processing based methods (inspired by .. Ethics Committees of the National Institute for Medical Research in Tanzania and