Table of Contents
MULTIPLE BIOLOGICAL
SEQUENCE ALIGNMENT
Wiley Series on
Bioinformatics: Computational Techniques and Engineering
A complete list of the titles in this series appears at the end of this volume.
MULTIPLE BIOLOGICAL
SEQUENCE ALIGNMENT
Scoring Functions, Algorithms
and Applications
KEN NGUYEN
XUAN GUO
YI PAN
Copyright © 2016 by John Wiley & Sons, Inc. All rights reserved
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.
Library of Congress Cataloging-in-Publication Data:
Names: Nguyen, Ken, 1975- author. | Guo, Xuan, 1987- author. | Pan, Yi, 1960- author.
Title: Multiple biological sequence alignment: scoring functions, algorithms and applications / Ken Nguyen, Xuan Guo, Yi Pan.
Description: Hoboken, New Jersey: John Wiley & Sons, 2016. | Includes bibliographical references and index.
Identifiers: LCCN 2016004186 | ISBN 9781118229040 (cloth) | ISBN 9781119273752 (epub)
Subjects: LCSH: Sequence alignment (Bioinformatics)
Classification: LCC QH441 .N48 2016 | DDC 572.8–dc23 LC record available at http://lccn.loc.gov/2016004186
Cover image courtesy of Getty Images/OktalStudio
Typeset in 10/12pt TimesLTStd by SPi Global, Chennai, India
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
CONTENTS
Preface
1 Introduction
1.1 Motivation
1.2 The Organization of this Book
1.3 Sequence Fundamentals
1.3.1 Protein
1.3.2 DNA/RNA
1.3.3 Sequence Formats
1.3.4 Motifs
1.3.5 Sequence Databases
2 Protein/DNA/RNA Pairwise Sequence Alignment
2.1 Sequence Alignment Fundamentals
2.2 Dot-Plot Matrix
2.3 Dynamic Programming
2.3.1 Needleman–Wunsch’s Algorithm
2.3.2 Example
2.3.3 Smith–Waterman’s Algorithm
2.3.4 Affine Gap Penalty
2.4 Word Method
2.4.1 Example
2.5 Searching Sequence Databases
2.5.1 FASTA
2.5.2 BLAST
3 Quantifying Sequence Alignments
3.1 Evolution and Measuring Evolution
3.1.1 Jukes and Cantor’s Model
3.1.2 Measuring Relatedness
3.2 Substitution Matrices and Scoring Matrices
3.2.1 Identity Scores
3.2.2 Substitution/Mutation Scores
3.3 GAPS
3.3.1 Sequence Distances
3.3.2 Example
3.4 Scoring Multiple Sequence Alignments
3.4.1 Sum-of-Pair Score
3.5 Circular Sum Score
3.6 Conservation Score Schemes
3.6.1 Wu and Kabat’s Method
3.6.2 Jores’s Method
3.6.3 Lockless and Ranganathan’s Method
3.7 Diversity Scoring Schemes
3.7.1 Background
3.7.2 Methods
3.8 Stereochemical Property Methods
3.8.1 Valdar’s Method
3.9 Hierarchical Expected Matching Probability Scoring Metric (HEP)
3.9.1 Building an AACCH Scoring Tree
3.9.2 The Scoring Metric
3.9.3 Proof of Scoring Metric Correctness
3.9.4 Examples
3.9.5 Scoring Metric and Sequence Weighting Factor
3.9.6 Evaluation Data Sets
3.9.7 Evaluation Results
4 Sequence Clustering
4.1 Unweighted Pair Group Method with Arithmetic Mean – UPGMA
4.2 Neighborhood-Joining Method – NJ
4.3 Overlapping Sequence Clustering
5 Multiple Sequences Alignment Algorithms
5.1 Dynamic Programming
5.1.1 DCA
5.2 Progressive Alignment
5.2.1 Clustal Family
5.2.2 PIMA: Pattern-Induced Multisequence Alignment
5.2.3 PRIME: Profile-Based Randomized Iteration Method
5.2.4 DIAlign
5.3 Consistency and Probabilistic MSA
5.3.1 POA: Partial Order Graph Alignment
5.3.2 PSAlign
5.3.3 ProbCons: Probabilistic Consistency-Based Multiple Sequence Alignment
5.3.4 T-Coffee: Tree-Based Consistency Objective Function for Alignment Evaluation
5.3.5 MAFFT: MSA Based on Fast Fourier Transform
5.3.6 AVID
5.3.7 Eulerian Path MSA
5.4 Genetic Algorithms
5.4.1 SAGA: Sequence Alignment by Genetic Algorithm
5.4.2 GA and Self-Organizing Neural Networks
5.4.3 FAlign
5.5 New Development in Multiple Sequence Alignment Algorithms
5.5.1 KB-MSA: Knowledge-Based Multiple Sequence Alignment
5.5.2 PADT: Progressive Multiple Sequence Alignment Based on Dynamic Weighted Tree
5.6 Test Data and Alignment Methods
5.7 Results
5.7.1 Measuring Alignment Quality
5.7.2 RT-OSM Results
6 Phylogeny in Multiple Sequence Alignments
6.1 The Tree of Life
6.2 Phylogeny Construction
6.2.1 Distance Methods
6.2.2 Character-Based Methods
6.2.3 Maximum Likelihood Methods
6.2.4 Bootstrapping
6.2.5 Subtree Pruning and Re-grafting
6.3 Inferring Phylogeny from Multiple Sequence Alignments
7 Multiple Sequence Alignment on High-Performance Computing Models
7.1 Parallel Systems
7.1.1 Multiprocessor
7.1.2 Vector
7.1.3 GPU
7.1.4 FPGA
7.1.5 Reconfigurable Mesh
7.2 Existing Parallel Multiple Sequence Alignment
7.3 Reconfigurable-Mesh Computing Models – (R-Mesh)
7.4 Pairwise Dynamic Programming Algorithms
7.4.1 R-Mesh Max Switches
7.4.2 R-Mesh Adder/Subtractor
7.4.3 Constant-Time Dynamic Programming on R-Mesh
7.4.4 Affine Gap Cost
7.4.5 R-Mesh On/Off Switches
7.4.6 Dynamic Programming Backtracking on R-Mesh
7.5 Progressive Multiple Sequence Alignment on R-Mesh
7.5.1 Hierarchical Clustering on R-Mesh
7.5.2 Constant Run-Time Sum-of-Pair Scoring Method
7.5.3 Parallel Progressive MSA Algorithm and Its Complexity Analysis
8 Sequence Analysis Services
8.1 EMBL-EBI: European Bioinformatics Institute
8.2 NCBI: National Center for Biotechnology Information
8.3 GenomeNet and Data Bank of Japan
8.4 Other Sequence Analysis and Alignment Web Servers
8.5 SeqAna: Multiple Sequence Alignment with Quality Ranking
8.6 Pairwise Sequence Alignment and Other Analysis Tools
8.7 Tool Evaluation
9 Multiple Sequence for Next-Generation Sequences
9.1 Introduction
9.2 Overview of Next Generation Sequence Alignment Algorithms
9.2.1 Alignment Algorithms Based on Seeding and Hash Tables
9.2.2 Alignment Algorithms Based on Suffix Tries
9.3 Next-Generation Sequencing Tools
10 Multiple Sequence Alignment for Variations Detection
10.1 Introduction
10.2 Genetic Variants
10.3 Variation Detection Methods Based on MSA
10.4 Evaluation Methodology
10.4.1 Performance Metrics
10.4.2 Simulated Sequence Data
10.4.3 Real Sequence Data
10.5 Conclusion and Future Work
Description: Covers the fundamentals and techniques of multiple biological sequence alignment and analysis, and shows readers how to choose the appropriate sequence analysis tools for their tasks. This book describes the traditional and modern approaches in biological sequence alignment and homology search. Thi