Practical Protein Bioinformatics

Florencio Pazos • Mónica Chagoyen

Florencio Pazos, National Centre for Biotechnology (CNB-CSIC), Madrid, Spain
Mónica Chagoyen, National Centre for Biotechnology (CNB-CSIC), Madrid, Spain

ISBN 978-3-319-12726-2
ISBN 978-3-319-12727-9 (eBook)
DOI 10.1007/978-3-319-12727-9
Library of Congress Control Number: 2014954634
Springer Cham Heidelberg New York Dordrecht London
© Springer International Publishing Switzerland 2015

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Introduction

Bioinformatics methods are becoming part of the standard toolbox of Life Science laboratories. Wisely applied, these approaches can greatly reduce the experimental work required and complement the results obtained by “wet” methods.
While many bioinformatics methods and protocols are not mature enough to be used by non-bioinformaticians (in terms of implementation or ease of interpreting the results), many others are already at a stage at which they can be used by non-experts. They can be accessed through standard web interfaces from any computer, irrespective of its hardware or operating system; they do not require the installation or maintenance of specific software; and their results are easy to interpret and generally presented in a graphical, interactive way. In spite of this, our experience shows that many wet labs are unaware of many of the freely available tools.

This book covers those tools that work with information related to proteins. Only tools fulfilling the requirements mentioned above are included. For example, tools that require installing software, or pre-/post-processing or parsing of the input or the results, are excluded. While the book tries to cover all aspects related to proteins (from genomic sequences to protein networks, going through protein sequences and three-dimensional structures), the number of tools discussed in each category is necessarily restricted, and many tools with similar goals are not discussed. This selection, personal and biased by definition, is based on our own experience working in a Protein Analysis Facility of a Molecular Biology research centre.

The point of view of the book is entirely practical. It is not intended to be an introductory textbook on Bioinformatics. The theoretical bases of the tools are mentioned only tangentially, insofar as such knowledge is required to better interpret the results. We try to explain the usage of the tools in the way protocols are described in Molecular Biology books. Practical examples with real proteins are included in the different sections.
Since the field is moving very fast, and tools are improved or surpassed, or their web addresses change or simply disappear, the book is associated with a dynamic web site that will try to reflect these changes. This site will maintain an updated list of the tools and include information on their eventual upgrades and changes. This site can be accessed at http://csbg.cnb.csic.es/PB/.

For each tool discussed, a table with the following information is included:

Tool name and description: ReadSeq – Conversion between various sequence file formats
Original URL: http://www.ebi.ac.uk/Tools/sfc/readseq/
Permanent URL: http://csbg.cnb.csic.es/PB/E1010
QR code: [QR code image]

The table contains the name and a short description of the tool, as well as its current web address (URL). A “permanent” URL is also included. Right now that is just an automatic redirection to the original URL of the tool. But if that original URL changes or disappears, the permanent URL will reflect that, providing information on the change, proposing alternative tools, etc. Finally, the table contains a “quick response” (QR) code that can be scanned to point the device's browser automatically to the permanent URL.

The bibliographic references of the tools are also included so that interested users can obtain more information. An index of the tools, as well as a “How to…?” index, have also been included to help locate the procedure or tool of interest.

Finally, we would like to acknowledge the developers of these tools, who are investing their time and resources in creating and maintaining a large ecosystem of interconnected web applications that facilitates the daily work of molecular biologists.

Contents

1 Sequences
  1.1 Introduction
  1.2 Representing Protein Sequences in the Computer
    1.2.1 Sequence File Formats
    1.2.2 Sequence Format Conversion Tools
  1.3 Main Protein Sequence Databases
    1.3.1 Sequences and Database Entries
  1.4 Basic Sequence-Based Characteristics
  1.5 Compare Two Protein Sequences
    1.5.1 Types of Pair-Wise Sequence Alignments
  1.6 Finding Similar Sequences in a Database (Basic)
    1.6.1 Which Sequence Database to Search?
    1.6.2 BLAST
  1.7 Compare More than Two Sequences
    1.7.1 Multiple Sequence Alignments: Formats and Conversion
    1.7.2 Alignment Editing and Representation
    1.7.3 Summarizing MSAs
  1.8 Finding Similar Sequences in a Database (Advanced)
    1.8.1 Sequence Profiles
    1.8.2 Iterative Profile Construction
    1.8.3 HMM Profile Search Against a Sequence Database
    1.8.4 HMM Profile Search Against a Profile Database
  1.9 Protein Motifs, Domains and Families
  1.10 Basic Phylogeny
2 Structures
  2.1 Introduction
    2.1.1 Storing Protein Structures—The PDB File Format
  2.2 Main Protein Structure Databases
    2.2.1 Classifications of Structural Domains
  2.3 Structure Manipulation, Visualization and Comparison
    2.3.1 Structure Manipulation and Visualization
    2.3.2 Structure Comparison
  2.4 Prediction of 1D Structural Features
    2.4.1 Secondary Structure and Solvent Accessibility
    2.4.2 Transmembrane Segments
    2.4.3 Coiled-Coils
    2.4.4 Disordered Regions
    2.4.5 Protein Sorting Signals
  2.5 Predicting Protein 3D Structure
    2.5.1 Template-Based (Homology-Based Approaches)
    2.5.2 Template-Based (Fragment-Based Approaches)
    2.5.3 Model Quality Checks
  2.6 Analysis of Protein Structure
    2.6.1 Mapping Conservation
    2.6.2 Protein and Ligand Contacts
    2.6.3 Surface Clefts, Binding Pockets, Tunnels and Internal Cavities
3 Systems
  3.1 Introduction
    3.1.1 Protein/Gene Functional Annotations
    3.1.2 ID Conversions
  3.2 Annotation Enrichment Analysis of Large Proteins Sets
  3.3 Protein Interaction Networks
  3.4 Metabolic Networks
    3.4.1 Retrieve the Metabolic-Related Information Associated to a Protein of Interest
    3.4.2 Map a Large Set of Proteins in the Metabolome
  3.5 Other Biological Networks
Bibliography
Index

Chapter 1
Sequences

1.1 Introduction

An amino acid sequence represents the protein’s biochemical composition as a linear polymer built from the covalent attachment of a series of amino acids by means of peptide bonds. It is also referred to as the primary structure of a protein. Amino acid sequences reflect the exact and unique composition of nascent proteins as they are translated from their mRNA templates. As such, they can be thought of as the perfect fingerprint to identify a protein, and a valuable source of information that can be used to further infer structural and functional information.
They are also the natural link between a protein and its genetic information.

In this chapter we present those bioinformatics analyses that deal with protein sequences, from the representation of protein sequences in the computer, to the browsing of the main sequence collections, and the comparison of protein sequences in their different forms: pair-wise and multiple sequence alignments, database searches, and basic phylogenetic analysis.

1.2 Representing Protein Sequences in the Computer

Protein sequences are represented in the computer as a string of characters, using the one-letter notation for amino acid residues established by the IUPAC-IUB (IUPAC-IUB Commission on Biochemical Nomenclature. A one-letter notation for amino acid sequences. Tentative rules. 1969). This notation assigns a letter to each of the 20 natural amino acids, as well as additional characters for representing other features of the sequence (e.g. “X” for a residue of unknown nature). The polypeptide chain is always represented from the N-terminus (left, or first character in the string) to the C-terminus (right, or last character in the string).

Example: the pancreatic secretory trypsin inhibitor encoded by the human SPINK1 gene:

MKVTGIFLLSALALLSLSGNTGADSLGREAKCYNELNGCTKIYDPVCGTDGNTYPNECVLCFENRKRQTSILIQKSGPC
|                                                                             |
N-term                                                                   C-term

© Springer International Publishing Switzerland 2015. F. Pazos, M. Chagoyen, Practical Protein Bioinformatics, DOI 10.1007/978-3-319-12727-9_1

1.2.1 Sequence File Formats

The file formats most commonly used for storing protein sequences are plain text (ASCII) files. You can open these files in a simple text editor (like Notepad in Windows, or TextEdit in Mac) to read, edit, copy/paste to web forms, etc. Although it is tempting to store and share sequences with colleagues in Word files (.doc or .docx) or PDF files (.pdf), we recommend you not to do so.
The reason is that bioinformatics software, both stand-alone and web-based, cannot read these formats (so you won’t be able to open or upload these files as input to these programs). It is therefore more convenient to store and share sequences in plain text files and, always in addition (not as a replacement), use other types of programs for creating “visual” add-ons or adding manual annotations (like coloring, etc.).

FASTA Format

FASTA is a very simple format that can contain one or multiple sequences. Each sequence is represented by a header line (starting with “>” followed by a string of characters, commonly used to include an identifier and a short description), and the characters representing the sequence in the following lines (up to a new header line or the end of the file). There is no standard file extension, although “.fasta”, “.fas” and “.fa” are widely used. Note that FASTA files can contain unaligned or aligned sequences (see Sect. 1.7.1). Take a look at http://en.wikipedia.org/wiki/FASTA_format for an exhaustive description of the format.

Example of a single sequence FASTA file:

>sp|P00995|ISK1_HUMAN Pancreatic secretory trypsin inhibitor
MKVTGIFLLSALALLSLSGNTGADSLGREAKCYNELNGCTKIYDPVCGTDGNTYPNECVL
CFENRKRQTSILIQKSGPC
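The FASTA rules described above (a header line starting with “>”, followed by sequence lines up to the next header or the end of the file) can be illustrated with a short Python sketch. This is only a minimal illustration, not one of the tools covered in the book: the function names parse_fasta and unexpected_characters are hypothetical, and the check assumes unaligned sequences written with the 20 standard one-letter codes plus “X”.

```python
from typing import Dict, Iterable, Set

# One-letter codes for the 20 standard amino acids, plus "X" (unknown residue).
# Assumption: any other character (e.g. the "-" gaps of aligned FASTA files)
# is simply reported, not interpreted.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY") | {"X"}


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for every record in FASTA-formatted lines."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]  # header text without the leading ">"
            records[header] = ""
        elif header is not None:
            records[header] += line  # sequence may span several lines
    return records


def unexpected_characters(sequence: str) -> Set[str]:
    """Return the characters that are not in VALID_RESIDUES."""
    return set(sequence.upper()) - VALID_RESIDUES


example = """>sp|P00995|ISK1_HUMAN Pancreatic secretory trypsin inhibitor
MKVTGIFLLSALALLSLSGNTGADSLGREAKCYNELNGCTKIYDPVCGTDGNTYPNECVL
CFENRKRQTSILIQKSGPC
"""

records = parse_fasta(example.splitlines())
for header, sequence in records.items():
    # Identifier, length, N-terminal residue, C-terminal residue
    print(header.split()[0], len(sequence), sequence[0], sequence[-1])
# prints: sp|P00995|ISK1_HUMAN 79 M C
```

The same function accepts an open text file in place of example.splitlines(), which is one reason plain text files are easier to work with than Word or PDF documents.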
