Version: April 26, 2019
Contents
I Preface 32
1 Welcome to the Biostar Handbook 34
1.1 How to download the book? . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.2 Online courses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
1.3 Access your account . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
1.4 How was the book developed? . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.5 How is this book different? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
1.6 Who is a Biostar? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2 About the author 38
3 Why bioinformatics? 41
3.1 What is this book about? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.2 What is covered in the book? . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4 What is bioinformatics? 43
4.1 How has bioinformatics changed? . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 What subfields of bioinformatics exist? . . . . . . . . . . . . . . . . . . . . . 43
4.3 Is there a list of functional assays used in bioinformatics? . . . . . . . . . . . 45
4.4 But what is bioinformatics, really? . . . . . . . . . . . . . . . . . . . . . . . 46
4.5 Is creativity required to succeed? . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6 Are analyses all alike? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.7 Should life scientists know bioinformatics? . . . . . . . . . . . . . . . . . . . 48
4.8 What type of computer is required? . . . . . . . . . . . . . . . . . . . . . . . 48
4.9 Is there data with the book? . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.10 Who is the book for? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.11 Is bioinformatics hard to learn? . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.12 Can I learn bioinformatics from this book? . . . . . . . . . . . . . . . . . . . 49
4.13 How long will it take me to learn bioinformatics? . . . . . . . . . . . . . . . 50
5 Biology for bioinformaticians 51
5.1 What is DNA? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2 Is there a directionality of DNA? . . . . . . . . . . . . . . . . . . . . . . . . 54
5.3 What is sense/antisense? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4 What is DNA sequencing? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.5 What gets sequenced? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.6 What is a genome? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.7 What is a genome’s purpose? . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.8 How big is a genome? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.9 What is RNA? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.10 How does a genome function? . . . . . . . . . . . . . . . . . . . . . . . . . . 56
5.11 What is a protein? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.12 How are proteins made? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.13 What is an ORF? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.14 What is a gene? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.15 Do genomes have other features? . . . . . . . . . . . . . . . . . . . . . . . . 59
5.16 What is homology? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6 How is bioinformatics practiced? 62
6.1 What is the recommended computer for bioinformatics? . . . . . . . . . . . 62
6.2 How much computing power do we need? . . . . . . . . . . . . . . . . . . . . 63
6.3 Does learning bioinformatics need massive computing power? . . . . . . . . . 63
6.4 What about the cloud? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
6.5 Do I need to know Unix to do bioinformatics? . . . . . . . . . . . . . . . . . 64
6.6 Do I need to learn a programming language? . . . . . . . . . . . . . . . . . . 64
6.7 Are there alternatives to using Unix? . . . . . . . . . . . . . . . . . . . . . . 64
6.8 What is Bioconductor? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.9 What is Galaxy? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.10 What is BaseSpace? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.11 Are commercial bioinformatics software packages expensive? . . . . . . . . . 67
6.12 Should I freelance as a bioinformatician? . . . . . . . . . . . . . . . . . . . . 68
6.13 What do bioinformaticians look like? . . . . . . . . . . . . . . . . . . . . . . 69
7 How to solve it 70
7.1 What is holistic data analysis? . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.2 How do I perform a holistic analysis? . . . . . . . . . . . . . . . . . . . . . . 70
7.3 What are the rules of a bioinformatics analysis? . . . . . . . . . . . . . . . . 71
7.4 What does “make it work” mean? . . . . . . . . . . . . . . . . . . . . . . . . 72
7.5 Why is fast so important? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.6 What does “simple” mean? . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
7.7 How to deal with anxiety and stress? . . . . . . . . . . . . . . . . . . . . . . 73
II Installation 75
8 How to set up your computer 77
8.1 How do I set up my computer? . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.2 Is this going to be difficult? . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.3 How do I prepare my computer? . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.4 How do I initialize my terminal? . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.5 What are environments? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
8.6 How do I install conda? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.7 What is bioconda? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.8 Create a bioinformatics environment . . . . . . . . . . . . . . . . . . . . . . 80
8.9 Activate and install bioinformatics tools . . . . . . . . . . . . . . . . . . . . 80
8.10 How do I check that Entrez Direct works? . . . . . . . . . . . . . . . . . . . 80
8.11 How do I verify that all other programs work? . . . . . . . . . . . . . . . . . 81
8.12 How do I fix installation problems? . . . . . . . . . . . . . . . . . . . . . . . 82
8.13 How do I report installation problems? . . . . . . . . . . . . . . . . . . . . . 82
8.14 How do I use conda in general? . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.15 How do I install a new tool? . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
8.16 How do I upgrade a tool? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
8.17 Do the automatically installed tools always work? . . . . . . . . . . . . . . . 84
8.18 How do I update conda? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
8.19 How should I set up my file structure? . . . . . . . . . . . . . . . . . . . . . 84
8.20 What to do if I get stuck? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9 Choose and install a text editor 86
9.1 What features should my text editor have? . . . . . . . . . . . . . . . . . . . 86
9.2 Viewing whitespace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
9.3 Super annoying behaviors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
9.4 Which text editor to choose? . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
9.5 Watch the line endings on Windows! . . . . . . . . . . . . . . . . . . . . . . 89
III UNIX COMMAND LINE 90
10 Introduction to Unix 92
10.1 What is the command line? . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
10.2 What does the command line look like? . . . . . . . . . . . . . . . . . . . . . 92
10.3 What are the advantages of the command line? . . . . . . . . . . . . . . . . 93
10.4 What are the disadvantages of the command line? . . . . . . . . . . . . . . . 93
10.5 Is knowing the command line necessary? . . . . . . . . . . . . . . . . . . . . 93
10.6 Is the command line hard to learn? . . . . . . . . . . . . . . . . . . . . . . . 93
10.7 Do other people also make many mistakes? . . . . . . . . . . . . . . . . . . . 94
10.8 How much Unix do I need to know to be able to progress? . . . . . . . . . . 94
10.9 How do I access the command line? . . . . . . . . . . . . . . . . . . . . . . . 94
10.10 What is a shell? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
10.11 What is the best way to learn Unix? . . . . . . . . . . . . . . . . . . . . . . 96
10.12 How do I troubleshoot errors? . . . . . . . . . . . . . . . . . . . . . . . . . . 97
10.13 Where can I learn more about the shell? . . . . . . . . . . . . . . . . . . . . 97
11 The Unix bootcamp 98
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
11.2 Why Unix? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
11.3 Typeset Conventions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
11.4 1. The Terminal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
11.5 2. Your first Unix command . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
11.6 3: The Unix tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
11.7 4: Finding out where you are . . . . . . . . . . . . . . . . . . . . . . . . . . 102
11.8 5: Making new directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
11.9 6: Getting from ‘A’ to ‘B’ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
11.10 7: The root directory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
11.11 8: Navigating upwards in the Unix filesystem . . . . . . . . . . . . . . . . . 104
11.12 9: Absolute and relative paths . . . . . . . . . . . . . . . . . . . . . . . . . . 105
11.13 10: Finding your way back home . . . . . . . . . . . . . . . . . . . . . . . . 105
11.14 11: Making the ls command more useful . . . . . . . . . . . . . . . . . . . . 106
11.15 12: Man pages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
11.16 13: Removing directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
11.17 14: Using tab completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
11.18 15: Creating empty files with the touch command . . . . . . . . . . . . . . . 108
11.19 16: Moving files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
11.20 17: Renaming files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
11.21 18: Moving directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
11.22 19: Removing files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
11.23 20: Copying files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
11.24 21: Copying directories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
11.25 22: Viewing files with less (or more) . . . . . . . . . . . . . . . . . . . . . . 112
11.26 23: Viewing files with cat . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
11.27 24: Counting characters in a file . . . . . . . . . . . . . . . . . . . . . . . . . 113
11.28 25: Editing small text files with nano . . . . . . . . . . . . . . . . . . . . . . 113
11.29 26: The $PATH environment variable . . . . . . . . . . . . . . . . . . . . . . 114
11.30 27: Matching lines in files with grep . . . . . . . . . . . . . . . . . . . . . . . 115
11.31 28: Combining Unix commands with pipes . . . . . . . . . . . . . . . . . . . 116
11.32 Miscellaneous Unix power commands . . . . . . . . . . . . . . . . . . . . . . 116
12 Data analysis with Unix 118
12.1 What directory should I use? . . . . . . . . . . . . . . . . . . . . . . . . . . 118
12.2 Where are we getting the data from? . . . . . . . . . . . . . . . . . . . . . . 119
12.3 How do I obtain a data file that is online? . . . . . . . . . . . . . . . . . . . 120
12.4 How many feature types are in this data? . . . . . . . . . . . . . . . . . . . 125
12.5 The single most useful Unix pattern . . . . . . . . . . . . . . . . . . . . . . . 126
12.6 One-liners . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
13 Data compression 128
13.1 What is a compressed format? . . . . . . . . . . . . . . . . . . . . . . . . . . 128
13.2 What objects may be compressed? . . . . . . . . . . . . . . . . . . . . . . . 128
13.3 What are some common compression formats? . . . . . . . . . . . . . . . . . 129
13.4 Is there a bioinformatics-specific compression format? . . . . . . . . . . . . . 129
13.5 How do I compress or uncompress a file? . . . . . . . . . . . . . . . . . . . . 129
13.6 How do I compress or uncompress multiple files? . . . . . . . . . . . . . . . . 130
13.7 What is a tarbomb? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
13.8 How do we use tar again? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
IV DATA SOURCES 133
14 What is data? 135
14.1 So what is data? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
14.2 Essential properties of data . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
14.3 How life scientists think about bioinformatics . . . . . . . . . . . . . . . . . 136
14.4 What bioinformatics is in reality . . . . . . . . . . . . . . . . . . . . . . . . 136
14.5 What is the state of data in bioinformatics? . . . . . . . . . . . . . . . . . . 136
14.6 What kind of problems does bioinformatics data have? . . . . . . . . . . . . 137
14.7 How complete is the data that I will obtain? . . . . . . . . . . . . . . . . . . 139
14.8 Final thoughts on data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
15 Biological data sources 142
15.1 Where is biomedical data stored? . . . . . . . . . . . . . . . . . . . . . . . . 142
15.2 What are the major DNA data repositories? . . . . . . . . . . . . . . . . . . 142
15.3 What kind of other data sources are there? . . . . . . . . . . . . . . . . . . . 144
15.4 Is there a list of “all” resources? . . . . . . . . . . . . . . . . . . . . . . . . . 145
15.5 What’s in a name? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
15.6 Project systematic names . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
16 Common data types 147
16.1 A quick look at the GENBANK format . . . . . . . . . . . . . . . . . . . . . . 147
16.2 A quick look at the FASTQ format . . . . . . . . . . . . . . . . . . . . . . . . 148
16.3 A quick look at the GFF/GTF/BED formats . . . . . . . . . . . . . . . . . . . 149
16.4 A quick look at the SAM/BAM formats . . . . . . . . . . . . . . . . . . . . . . 150
16.5 Can I convert between formats? . . . . . . . . . . . . . . . . . . . . . . . . . 150
16.6 What is reference data? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
16.7 What are genomic builds? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
16.8 Is there a list of “all” resources? . . . . . . . . . . . . . . . . . . . . . . . . . 152
16.9 Should I download data first or get data on-demand? . . . . . . . . . . . . . 152
17 Human and mouse genomes 153
17.1 How many genomic builds does the human genome have? . . . . . . . . . . . 153
17.2 Why is the genomic build hg19 still in use? . . . . . . . . . . . . . . . . . . 154
17.3 Should we use the old version of a genome? . . . . . . . . . . . . . . . . . . 154
17.4 How do we transfer genomic coordinates between builds? . . . . . . . . . . . 154
17.5 Human gene naming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
17.6 Which human data source should I trust? . . . . . . . . . . . . . . . . . . . 156
17.7 Which human/mouse genome data should I be using? . . . . . . . . . . . . . 156
17.8 Is there a better resource for human annotations? . . . . . . . . . . . . . . . 156
17.9 How do I access NCBI RefSeq and GenBank? . . . . . . . . . . . . . . . . . 157
17.10 What can I get from ENSEMBL? . . . . . . . . . . . . . . . . . . . . . . . . 157
17.11 What is a working strategy for finding reference information? . . . . . . . . 158
18 Automating access to NCBI 159
18.1 Note . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
18.2 What is Entrez? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
18.3 How is Entrez pronounced? . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
18.4 How do we automate access to Entrez? . . . . . . . . . . . . . . . . . . . . . 160
18.5 How is data organized in NCBI? . . . . . . . . . . . . . . . . . . . . . . . . . 160
18.6 How do I use Entrez E-utils web API? . . . . . . . . . . . . . . . . . . . . . 161
18.7 How do I use Entrez Direct? . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
18.8 How do we search with Entrez Direct? . . . . . . . . . . . . . . . . . . . . . 162
18.9 How to do more work with Entrez Direct? . . . . . . . . . . . . . . . . . . . 163
19 Entrez Direct by example 164
19.1 How do I use efetch? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
19.2 How do I use esearch to obtain project related data? . . . . . . . . . . . . . 165
19.3 How do I get run information on a project? . . . . . . . . . . . . . . . . . . 165
19.4 How do I get even more information on a project? . . . . . . . . . . . . . . . 166
19.5 How do I extract taxonomy information? . . . . . . . . . . . . . . . . . . . . 166
V DATA FORMATS 168
20 Introduction to data formats 170
20.1 What is a data format? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
20.2 Should I re-format (transform) my data? . . . . . . . . . . . . . . . . . . . . 170
20.3 When to re-format (transform) data? . . . . . . . . . . . . . . . . . . . . . . 172
21 The GenBank format 173
21.1 What is the GenBank format? . . . . . . . . . . . . . . . . . . . . . . . . . . 173
21.2 When do we use the GenBank format? . . . . . . . . . . . . . . . . . . . . . 174
21.3 What is RefSeq? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
21.4 How are RefSeq sequences named? . . . . . . . . . . . . . . . . . . . . . . . 174
22 The FASTA format 178
22.1 What is the FASTA format? . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
22.2 Are there problems with this format? . . . . . . . . . . . . . . . . . . . . . . 179
22.3 Is there more information in FASTA headers? . . . . . . . . . . . . . . . . . 180
22.4 Is there more information in the FASTA sequences? . . . . . . . . . . . . . . 180
22.5 Where do I get a fasta file? . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
23 The FASTQ format 182
23.1 What is the FASTQ format? . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
23.2 How to recognize FASTQ qualities by eye . . . . . . . . . . . . . . . . . . . 183
23.3 Are there different versions of the FASTQ encoding? . . . . . . . . . . . . . 184
23.4 Is there more information in FASTQ headers? . . . . . . . . . . . . . . . . . 184
23.5 What is a Phred score? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
23.6 How do I convert FASTQ quality codes at the command line? . . . . . . . . 185
23.7 Closing thoughts on FASTQ . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
24 Advanced FASTQ processing 187
24.1 Example data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
24.2 How to produce an overview of FASTQ files? . . . . . . . . . . . . . . . . . . 188
24.3 How do I get the GC content? . . . . . . . . . . . . . . . . . . . . . . . . . . 188
24.4 How do I get the percentage for custom bases? . . . . . . . . . . . . . . . . . 189
24.5 How to extract a subset of sequences with name/ID list file? . . . . . . . . . 189
24.6 How do I find FASTA/Q sequences containing degenerate bases and locate them? . . . . . . . . 189
24.7 How do I remove FASTA/Q records with duplicated sequences? . . . . . . . 190
24.8 How do I locate motif/subsequence/enzyme digest sites in FASTA/Q sequence? . . 190
24.9 How do I sort a huge number of FASTA sequences by length? . . . . . . . . 191
24.10 How do I split FASTA sequences according to information in the header? . . 191
24.11 How do I search and replace within a FASTA header using character strings from a text file? . . . . . . . . 192
24.12 How do I extract paired reads from two paired-end reads files? . . . . . . . . 193
24.13 How to concatenate two FASTA sequences into one? . . . . . . . . . . . . . 194
VI VISUALIZING DATA 196
25 Visualizing biological data 198
25.1 What are the challenges of visualization? . . . . . . . . . . . . . . . . . . . . 198
25.2 What is a genome browser? . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
25.3 Why are default browser screens so complicated? . . . . . . . . . . . . . . . 199
25.4 What types of data may be visualized in a genome browser? . . . . . . . . . 200
25.5 How do I interpret glyphs? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
25.6 Which standalone genome browsers should I use? . . . . . . . . . . . . . . . 201
25.7 What about online genome browsers? . . . . . . . . . . . . . . . . . . . . . . 201
26 Using the Integrative Genomics Viewer 202
26.1 What is IGV? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
26.2 How to run IGV on Mac and Linux? . . . . . . . . . . . . . . . . . . . . . . 203
26.3 How to run IGV on Windows Bash? . . . . . . . . . . . . . . . . . . . . . . 203
26.4 What does the IGV interface look like? . . . . . . . . . . . . . . . . . . . . . 203
26.5 What data does IGV come with? . . . . . . . . . . . . . . . . . . . . . . . . 203
26.6 How do I create a custom genome in IGV? . . . . . . . . . . . . . . . . . . . 204
VII SEQUENCE ONTOLOGY 206
27 What do the words mean? 208
27.1 Why is the ontology necessary? . . . . . . . . . . . . . . . . . . . . . . . . . 208
27.2 Are there other ontologies? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
27.3 Who names the genes? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
27.4 What will our data tell us? . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
28 Sequence ontology 211
28.1 What is the Sequence Ontology (SO)? . . . . . . . . . . . . . . . . . . . . . 211
28.2 Where do I see Sequence Ontology (SO) terms used? . . . . . . . . . . . . . 211
28.3 How do I access the Sequence Ontology browser? . . . . . . . . . . . . . . . 212
28.4 Does all sequencing data obey the rules of SO? . . . . . . . . . . . . . . . . 214
28.5 How are the SO relationships defined? . . . . . . . . . . . . . . . . . . . . . 214
28.6 Will I need to access the SO data directly? . . . . . . . . . . . . . . . . . . . 215
28.7 How can I investigate the SO data? . . . . . . . . . . . . . . . . . . . . . . . 215
28.8 How many Sequence Ontology terms are there? . . . . . . . . . . . . . . . . 215
28.9 How can I quickly search the Sequence Ontology? . . . . . . . . . . . . . . . 215
28.10 How to search for other information? . . . . . . . . . . . . . . . . . . . . . . 216
VIII GENE ONTOLOGY 218
29 Gene ontology 220
29.1 What is the Gene Ontology (GO)? . . . . . . . . . . . . . . . . . . . . . . . 220
29.2 How is the GO designed? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 220
29.3 What kind of properties do annotated gene products have? . . . . . . . . . . 221
29.4 Where can I access the GO online? . . . . . . . . . . . . . . . . . . . . . . . 222
29.5 How are GO terms organized? . . . . . . . . . . . . . . . . . . . . . . . . . . 222
29.6 Where can I visualize GO terms online? . . . . . . . . . . . . . . . . . . . . 224
30 Understanding the GO data 227
30.1 What format is the GO data in? . . . . . . . . . . . . . . . . . . . . . . . . 227
30.2 What does a GO term file contain? . . . . . . . . . . . . . . . . . . . . . . . 227
30.3 What is a GO association file? . . . . . . . . . . . . . . . . . . . . . . . . . . 228
30.4 Where can I find the association files for different organisms? . . . . . . . . . 228
30.5 How to get the human gene association file? . . . . . . . . . . . . . . . . . . 229
30.6 What format does the GO association file have? . . . . . . . . . . . . . . . . 229
30.7 Do the association files represent all of the accumulated biological knowledge? 230
30.8 What kind of properties does the GO data have? . . . . . . . . . . . . . . . 230
30.9 What are the most annotated human genes and proteins? . . . . . . . . . . . 231
30.10 What are the ten most highly annotated genes in the GO dataset? . . . . . . 232
30.11 Do the GO annotations change? . . . . . . . . . . . . . . . . . . . . . . . . . 233
30.12 How complete is the GO? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233
31 Functional analysis 235
31.1 The sorry state of data categorization . . . . . . . . . . . . . . . . . . . . . . 235
31.2 What is a functional analysis? . . . . . . . . . . . . . . . . . . . . . . . . . . 236
31.3 What is an Over-Representation Analysis (ORA)? . . . . . . . . . . . . . . . 236
31.4 Are there different ways to compute ORA analyses? . . . . . . . . . . . . . . 237
31.5 What are problems with the ORA analysis? . . . . . . . . . . . . . . . . . . 239
31.6 Why do we still use the ORA analysis? . . . . . . . . . . . . . . . . . . . . . 239
31.7 What is a Functional Class Scoring (FCS)? . . . . . . . . . . . . . . . . . . . 239
31.8 Should I trust the results of functional analyses? . . . . . . . . . . . . . . . . 240
32 Gene set enrichment 241
32.1 What is a gene set enrichment analysis? . . . . . . . . . . . . . . . . . . . . 241
32.2 What tools are used to perform enrichment analysis? . . . . . . . . . . . . . 241
32.3 Will different tools produce different results? . . . . . . . . . . . . . . . . . . 242
32.4 How do I perform a gene set enrichment analysis? . . . . . . . . . . . . . . . 243
33 Using the AGRIGO server 245
33.1 Author's note . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
33.2 How to use AgriGO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 245
33.3 What is the Singular Enrichment Analysis (SEA) in AgriGO? . . . . . . . . . 246
33.4 How to use SEA? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 246
33.5 What are the outputs of AgriGO? . . . . . . . . . . . . . . . . . . . . . . . . 247
33.6 How do I prepare a custom annotation for AgriGO? . . . . . . . . . . . . . . 247
34 Using the g:Profiler server 251
34.1 What is the g:Profiler? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
34.2 What is a standout feature of the g:Profiler? . . . . . . . . . . . . . . . . . . 251
34.3 What functionality does the g:Profiler have? . . . . . . . . . . . . . . . . . 252
34.4 How to use g:Profiler at the command line . . . . . . . . . . . . . . . . . . . 254
35 Using the DAVID server 256
35.1 Author's note . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
35.2 What is DAVID? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
35.3 What does DAVID do? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
35.4 What are the different steps in a DAVID analysis? . . . . . . . . . . . . . . . 257
35.5 How do I start an analysis with DAVID? . . . . . . . . . . . . . . . . . . . . 258
35.6 What is the Functional Annotation Tool? . . . . . . . . . . . . . . . . . . . 259
35.7 What is the Functional Annotation Summary? . . . . . . . . . . . . . . . . . 259
35.8 How do I explore the Functional Annotation Summary? . . . . . . . . . . . . 260
35.9 What is a Functional Annotation Chart? . . . . . . . . . . . . . . . . . . . . 260
35.10 What is Functional Annotation Clustering? . . . . . . . . . . . . . . . . . . 261
35.11 What is a Functional Annotation Table? . . . . . . . . . . . . . . . . . . . . 262
35.12 What is the Gene Functional Classification Tool? . . . . . . . . . . . . . . . 262
35.13 What is an EASE Score? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
35.14 What is the Kappa statistic? . . . . . . . . . . . . . . . . . . . . . . . . . . 265
35.15 What does the Gene ID Conversion Tool do? . . . . . . . . . . . . . . . . . . 265