Lecture Notes
Institute of Bioinformatics, Johannes Kepler University Linz
Bioinformatics III
Structural Bioinformatics and Genome Analysis
Summer Semester 2007
by Sepp Hochreiter
(Chapters 2 and 3 by Noura Chelbat)
Institute of Bioinformatics, Johannes Kepler University Linz, A-4040 Linz, Austria
Tel. +43 732 2468 8880, Fax +43 732 2468 9308
http://www.bioinf.jku.at
© 2007 Sepp Hochreiter & Noura Chelbat
This material, no matter whether in printed or electronic form, may be used for personal and
educational use only. Any reproduction of this manuscript, no matter whether as a whole or in
parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the
author.
Preface
This course is part of the curriculum of the Master of Science in Bioinformatics at the Johannes
Kepler University Linz. The course focuses on structural bioinformatics (part 1) and genome
analysis (part 2). These two topics are merged in this course because of the master’s program
schedule.
Space limitations allowed neither introducing all methods nor explaining the
introduced ones in more detail.
The students should gain insights into the topics and methods of structural bioinformatics and
genome analysis. The students should learn how to choose appropriate methods from a given
pool of approaches to structural bioinformatics (e.g. structural alignment or 3D prediction) and
to genome analysis (e.g. microarray technique). The students should learn to understand and to
evaluate the different approaches, know their advantages and disadvantages as well as where to
obtain and how to use them. As a further step, the students should be able to adapt standard
algorithms for their own purposes or to modify those algorithms for specific applications with
certain prior knowledge or special constraints.
Structural Bioinformatics
A main topic in structural bioinformatics is the development of computational approaches to predict and
analyze the spatial structure of macromolecules like proteins, DNA, and RNA. Their 3D structure
is predicted based on the 1D structure, the nucleotide or amino acid sequence, which is obtained
from genome sequencing. Knowing and understanding their 3D structure is crucial for inferring
and modifying their function. Direct applications could be in medical and pharmacological fields,
especially in drug design, where it is important to determine which groups of ligands bind and
regulate a protein, which proteins are potential targets for drugs, etc.
To determine the 3D structure, the methods from Bioinformatics I allow for homology and
comparative modeling by sequence-sequence comparisons, where it is assumed that similar
sequences have the same 3D structure. Another approach, which includes structural information, is
sequence-structure comparison, which computes the sequence-to-structure fitness through
“threading”, that is, it determines how well a sequence fits a given 3D structure.
Modeling the physical laws yields details about protein function and ligand docking behavior.
Modeling is often based on molecular dynamics using force fields which approximate
the physical laws.
Genome Analysis
The main focus of the genome analysis part will be the microarray technique and the preprocessing and
analysis methods associated with it.
The microarray technique generates a gene expression profile which gives the expression states
of genes in a cell by reporting the mRNA concentration. The mRNA concentration in turn reports
the cell status determined by what and how many proteins are currently produced. The DNA
microarray technologies such as cDNA and oligonucleotide arrays provide means of measuring
tens of thousands of genes simultaneously (a snapshot of the cell). Microarrays are a
large-scale, high-throughput method for molecular biological experimentation.
Genes that share expression patterns might be regulated together and are therefore assumed
to be in the same genetic pathway. Hence the microarray
technique helps to understand the dynamics and regulation behavior in a cell.
One of the goals of microarray technology is the detection of genes that are differentially
expressed between tissue samples, such as healthy and cancerous tissue, to see which genes are relevant
for cancer. It has important applications in pharmaceutical and clinical research and helps in
understanding gene regulation and interactions.
Genome analysis also includes genome anatomy and genome individuality (e.g. repetitions or
single nucleotide polymorphisms).
We will also address current genomic research questions about alternative splicing and
nucleosome positions.
Literature
David W. Mount. Bioinformatics: Sequence and Genome Analysis. Cold Spring Harbor
Laboratory Press, Cold Spring Harbor, New York, USA, 2004.
Philip E. Bourne and Helge Weissig. Structural Bioinformatics. Wiley-Liss, Hoboken, New
Jersey, USA, 2003.
Michael J. E. Sternberg. Protein Structure Prediction. Oxford University Press, 1996.
Steen Knudsen. Guide to Analysis of DNA Microarray Data. John Wiley & Sons,
Hoboken, New Jersey, USA, 2004.
Ernst Wit and John McClure. Statistics for Microarrays. John Wiley & Sons Ltd., England,
2004.
Pierre Baldi and G. Wesley Hatfield. DNA Microarrays and Gene Expression: From
Experiments to Data Analysis and Modeling. Cambridge University Press, United Kingdom,
2002.
Geoffrey J. McLachlan, Kim-Anh Do, and Christophe Ambroise. Analyzing Microarray
Gene Expression Data. John Wiley & Sons Inc., Hoboken, New Jersey, USA, 2004.
Jerome K. Percus. Mathematics of Genome Analysis. Cambridge University Press, United
Kingdom, 2002.
Contents
I Structural Bioinformatics
1 Introduction
2 Chemical and Physical Background
2.1 Atomic Bonds: A Basic Introduction
2.1.1 Non-Covalent Interactions
2.1.1.1 Charge-Charge Interactions or Ionic Bonds
2.1.1.2 Dipole Interactions
2.1.1.3 Van der Waals Forces or Dispersion
2.1.1.4 Hydrogen Bond
2.1.1.5 Hydrophobic-Hydrophilic Interactions
2.1.2 Conclusion
2.1.3 Glossary
2.2 From chain polypeptide 1D configuration to folded 2D
2.2.1 Amino acids: classification and chemical-physical properties
2.2.1.1 Peptide bond
2.2.1.2 Torsion angles Phi (Φ) and Psi (Ψ)
2.2.1.3 Ramachandran plot
2.2.2 Interactions and folding
2.2.2.1 Bonds
2.2.2.2 Thermodynamics
2.2.3 Secondary Structure Elements
2.2.3.1 Types
2.2.3.1.1 α-helix
2.2.3.1.2 β-sheet
2.2.3.1.3 Turn and Loops
2.2.3.1.4 Coiled coil
2.2.3.1.5 TIM barrels
2.2.3.2 Motifs and Domains
2.2.3.2.1 Homeodomains
2.2.3.2.2 Leucine zipper
2.2.3.2.3 Zinc finger
2.2.3.2.4 Transmembrane elements
2.2.4 3D Structure
2.2.5 Major methods of structure determination
2.2.5.1 X-ray Crystallography
2.2.5.2 NMR Spectroscopy
2.2.6 Viewers
2.2.6.1 Rasmol
2.2.6.2 Chime
2.2.6.3 Pymol
2.2.7 First approximation
2.2.7.1 PDB-function
2.2.7.2 SCOP-Classes
2.2.8 Concepts
2.2.9 Annexes
3 Structural Comparison and Alignment
3.1 Introduction
3.2 Methods for Structure Comparison and Alignment
3.2.1 Basic reminder
3.2.1.1 Dynamic programming
3.2.1.2 Distance Matrix
3.2.2 SARF2, VAST, COMPARER
3.2.3 SARF2: Spatial Arrangement of Backbone Fragments
3.2.3.1 VAST: Vector Alignment Search Tool
3.2.3.2 COMPARER
3.2.4 CE, DALI, SSAP
3.2.4.1 CE: Combinatorial Extension of the Optimum Path
3.2.4.2 DALI: Distance Matrix Alignment
3.2.5 SSAP: Secondary Structure Alignment Program
3.3 Conclusion
3.4 Exercises
3.5 Concepts
4 Protein Secondary Structure Prediction
4.1 Introduction
4.2 Assigning Secondary Structure to Measured Structures
4.2.1 DSSP
4.2.2 STRIDE
4.2.3 DEFINE and P-Curve
4.3 Prediction of Secondary Structure
4.3.1 Chou-Fasman Method
4.3.2 GOR Methods
4.3.3 Lim’s Method
4.3.4 Neural Networks
4.3.5 PHD, PSIPRED, PREDATOR, JNet, JPred2, NSSP, SSPro
4.4 Evaluating Secondary Structure Prediction
4.4.1 Non-Homologous Test Sequences
4.4.2 Secondary Structure Classes
4.4.3 Quality Measures
4.4.4 Problems in Quality Comparisons
5 Homology 3D Structure Prediction
5.1 Introduction
5.2 Comparative Modeling: Sequence-Sequence Comparison
5.3 Threading: Sequence-Structure Alignment
6 Ab Initio Prediction and Molecular Dynamics
6.1 Introduction
6.2 Ab Initio Methods
6.3 Molecular Dynamics
II Genome Analysis
7 Introduction
8 DNA Microarrays
8.1 Motivation
8.2 DNA Microarray History and Current Status
8.3 DNA Microarray Techniques
8.3.1 Oligonucleotide Arrays
8.3.2 cDNA / Spotted Arrays
8.3.3 Other Techniques
8.3.3.1 SAGE
8.3.3.2 Digital Micromirror Arrays
8.3.3.3 Inkjet Arrays
8.3.3.4 Bead Arrays
8.3.3.5 Nanomechanical Cantilevers
8.4 Microarray Noise
8.5 Image Analysis
8.6 Background Correction
8.7 Normalization
8.8 PM Correction
8.9 Summarization
8.10 Different Combinations of the Processing Steps
8.11 Microarray Gene Selection Protocol
8.11.1 Description of the Protocol
8.11.2 Comments on the Protocol and on Gene Selection
8.11.3 Classification of Samples
9 DNA Analysis
9.1 Genome Anatomy
9.2 Gene Finding
9.2.1 Hidden Markov models
9.2.2 Neural networks
9.2.3 Homology Search
9.2.4 Promoter Prediction
9.2.4.1 Prokaryotes: E. coli
9.2.4.2 Eukaryotes
9.2.5 EST Clusters
9.2.6 Performance of Gene Prediction Methods
9.3 Alternative Splicing and Nucleosomes
9.3.1 Nucleosomes
9.4 Comparative Genomics
9.5 Genomic Individuality
9.5.1 Sequence Repeats
9.5.2 SNPs
10 DNA Sequence Statistics
10.1 Local Characteristics
10.2 Long-Range Characteristics
10.2.1 Matching Probability of Subsequences
10.2.2 Spectral Analysis
10.2.3 Entropy Analysis
A Probability Generating Function
A.1 Definition
A.2 Properties
A.2.1 Power series
A.2.2 Probabilities and expectations
A.2.3 Functions of independent random variables
A.3 Examples
A.4 Example calculation: use of bivariate generating functions
B Contact Potential for Threading
C 3D Prediction Challenge Results (CASP7)