Table Of Content

Yangetal.BMCBioinformatics2013,14:33 http://www.biomedcentral.com/1471-2105/14/33 SOFTWARE Open Access HTQC: a fast quality control toolkit for Illumina sequencing data Xi Yang1, Di Liu1, Fei Liu1, Jun Wu1, Jing Zou1, Xue Xiao1, Fangqing Zhao2 and Baoli Zhu1* Abstract Background: Illumina sequencing platform is widely used in genome research.Sequence reads quality assessment and control are neededfor downstream analysis. However, software that provides efficient quality assessment and versatile filtration methods is still lacking. Results: We have developed a toolkit named HTQC – abbreviation ofHigh-Throughput Quality Control –for sequence readsqualitycontrol, which consistsof sixprograms for readsqualityassessment,reads filtration and generation of graphic reports. Conclusions: The HTQC toolkit can generate reads quality assessment faster than existing tools, providing guidance for reads filtration utilities that allow users to choose different strategies to removelow quality reads. Background tools lack of function versatility or run-time efficiency. Next generation sequencing technologies are generating For example, PIQA and FastQC only create reports on massive sequence data [1], and different platforms can reads quality, but provide no tool for reads filtration introduce varied level of sequence reads error. Among [3,4]. SolexaQA and BIGpre provide only read trimming them, the Illumina platform is the most widely used for utility [5,6]. Furthermore, many of these software tools genome sequencing with the least error rate per base. are implemented using Perl language [3,5,6], which as a However, due to the nature of the method, it still pre- dynamic-typedscriptlanguage,provideseaseondeveloping sents a considerable amount of errors that has its spe- mini programs, but causes severe loss on run-time per- cific errors pattern [2]. The device performs sequencing formance [7]. Therefore, a high performance sequence by DNA synthesis on clusters of identical DNA mole- readsQCtoolkitisneededforafasterassessmentandver- cules simultaneously. When elongation of some DNA satilefunctionality. molecules is stopped accidentally, it creates disturbance of the cluster’s fluorescent signal, resulting sequencing Implementation errors. Such errors accumulate during the process of se- TheHTQCtoolkitconsistsofsixprogramsthatcanper- quencing, and cause reads quality decreasing while the form reads quality assessment and filtration. To improve length grows. Besides, deficiency on sequencing chips run-time performance, the time-consuming programs and the existence of air bubbles on chip surface can are implemented using C++. The FASTQ format is used cause failure on reads from a whole tile. To get reliable for sequence data input and output, and the QC report result in downstream analysis, it is necessary to remove isgenerated astab-separatedplain-textfile[8].Tocreate low quality reads avoiding mismatches in read mapping, graphical charts of QC report, a Perl script is included. andfalsepathsduringgenomeassembly. The GNU Glib is used for base utilities such as Currently there are several reads quality control (QC) command-line parser, portable support for threads and software tools (Table 1). However, all these software asynchronousqueues. AllprogramsofHTQC toolkit are capable of single-end or paired-end sequencing experi- *Correspondence:[email protected] ments, 33-based or 64-based quality score encodings 1CASKeyLaboratoryofPathogenicMicrobiologyandImmunology,Institute and FASTQ sequence identifier format from different ofMicrobiology,ChineseAcademyofSciences,NO.1WestBeichenRoad, version of CASAVA tools (the traditional format like ChaoyangDistrict,Beijing,China Fulllistofauthorinformationisavailableattheendofthearticle “@HWUSI-EAS100R:6:73:941:1973#0/1”, and the new ©2013Yangetal.;licenseeBioMedCentralLtd.ThisisanOpenAccessarticledistributedunderthetermsoftheCreative CommonsAttributionLicense(http://creativecommons.org/licenses/by/2.0),whichpermitsunrestricteduse,distribution,and reproductioninanymedium,providedtheoriginalworkisproperlycited. Yangetal.BMCBioinformatics2013,14:33 Page2of4 http://www.biomedcentral.com/1471-2105/14/33 Table1Comparisonofsequencingqualitycontrol each position of the sequence reads, a heat map and softwaretools a box chart are presented with quality distribution Name Programming Sequencing Readsfiltrationmethods (Figure 1AB). The stacked bar chart that represents the language platforms base composition on each position is shown at the same HTQC C++ Illumina Tailtrimming,filterby time (Figure 1C). The cycle-specific errors and the quality/length/tile rapid-falling quality on reads tail can be viewed on these FastQC Java Notlimited,any No three charts. To find tile-specific problems like high FASTQfile error rate or low data production, a stacked bar chart NGSQC Perl Illumina,454 Trimmingonly shows the number of reads in different quality ranges SolexaQA Perl Illumina Trimmingonly using different color, each tile in one bar (Figure 1F). BIGpre Perl Illumina,454 Duplicateremoval Theotherchartsthat showthedistributionofreadswith varied length and reads with varied quality can provide PIQA R,C++ Illumina No an overview of sequencing quality (Figure 1DE). For paired-endreadsqualityassessment,theht_statprogram format used by CASAVAversion 1.8 like “@EAS139:136: isusedtocreateseparate chartsforeachend,andtocal- FC706VJ:2:2104:15343:1973931:Y:18:ATCACG”)[9]. culate the correlation between reads quality of two ends The quality control pipeline begins with a quality as- (Figure 1G). All these results generated by ht_stat pro- sessment on raw reads. Program ht_stat generates QC gram are written to a series of tab-delimited plain-text report from raw sequence reads in different aspects. On files, which can be visualized using ht_stat_draw.pl, A B C D E F G Figure1Chartsforreadqualityassessment.A:heatmapforreadqualityalongread.B:boxchartforreadqualityalongread.C:basecomposition alongread.D:readlengthdistribution.E:readqualitydistribution.F:readqualityondifferenttiles.G:correlationbetweentworeadends. Yangetal.BMCBioinformatics2013,14:33 Page3of4 http://www.biomedcentral.com/1471-2105/14/33 or any spreadsheet software like Microsoft Excel or applied to cut the low quality reads using the program LibreOffice Calc. ht_trim. In Figure 1C, there was at least 10% of reads After the assessment of sequence reads quality is that have an invalid nucleotide sequence represented by obtained,lowqualityreadsshouldberemoved.TheHTQC contiguous Ns. The bad reads that contained these Ns tool kit provides four different programs that include can be filtered with the program ht_length_filter or ht_tile_filter, ht_trim, ht_qual_filter and ht_length_filter, ht_qual_filter. When the quality assessmentwas done by to perform reads filtration. The ht_tile_filter is designed to tiles, we observed tiles 5, 31, 110, 113, 117, 118 pro- remove reads from problematic tiles that may not be duced reads with very low quality (Figure 1F) that can reliable due to sequencing chip quality; the ht_trim cuts be removed using ht_tile_filter. For the quality assess- low quality bases at the beginning or the end of the reads ment of any paired-end reads, if the quality of read 2 until the quality score reaches a given threshold; the was worse than read 1, such quality imbalance can be ht_qual_filter remove reads with low quality and the picked upbyht_stat(Figure 1G). ht_length_filter remove short reads. When only one end of the paired-end reads is of acceptable quality, it is stored in Programrun-timeefficiency a separate file. The cutoff valueof these programs, such as To compare the time cost of HTQC with existing tools, thethresholdsontheminimumreadsqualityorminimum quality assessment on the above dataset was performed readlengthareuserdefined. using ht_stat, SolexaQA version 1.12, FastQC version 0.10.1 and BIGpre version 2.0.2. The benchmark was Results and discussion performed using a Dell PowerEdge server with two AMD Workflowdemonstration Opteron 2427 CPUs (12 cores in total). The server’s To demonstrate the function of HTQC, a paired-end se- operationsystemwasFedoraLinux14.ForSolexaQA,itby quence data of human gut metagenome was used as an default only uses a part of the input sequences from each example.To reduce thetime cost,one tenth from atotal tile, thus we input the size of the test dataset (3,559,988 of 35,625,015 paired-end reads were randomly picked. reads)asthetilesamplesizeto ensureall readswereused. Thereads length was 120bp. The quality assessment was ForBIGpre, dueto itsinabilityto parsethenewestversion performed using ht_stat, which shows the reads quality of headerformatproducedbyCASAVA1.8,ortoworkon in a series of charts that were described above in Imple- quality records with “@” character, we modified its source mentation. When quality assessment was done by base code to allow it work properly with the test dataset. The position, there was a gradual decrease of reads quality SolexaQA and the modified BIGpre programs were towards the 3’-end (tail) that can be observed in executed using Perl version 5.12.4 contained within the Figure 1A and 1B. The tail trimming would be routinely Fedora Linux system. The C++ source code of ht_stat was A B C Figure2Timecostofht_statandotherQCsoftware.A:realtime;B:usertime;C:systemtime.Themeaningof“real”,“user”and“system” timeisdescribedinthechapterofprogramrun-timeefficiency. Yangetal.BMCBioinformatics2013,14:33 Page4of4 http://www.biomedcentral.com/1471-2105/14/33 compiled using GNU GCCversion4.5.1 with optimization Abbreviation flag “-O2”. For the performance test of multiple threads, QC:Qualitycontrol. benchmarks for ht_stat and FastQC were carried out in Competinginterests two groups: one benchmark using 1 worker thread, the Theauthorsdeclarethattheyhavenocompetinginterests. other benchmark using 3 worker threads. Therefore, we Authors’contributions have six benchmarks in total: one each for SolexaQA and XYimplementedtheprogram;FL,XX,JZandJWprovidedtestdata;BLZ,DL, BIGpre, two each for ht_stat and FastQC. All benchmarks FQZandXYwrotethemanuscript.Allauthorsreadandapprovedthefinal wererun20timesrepeatedlytoovercomeanyrandomdis- manuscript. turbance of the computer system, and the amount of real Authordetails time(elapsedtimeinrealworld),systemtime(CPUcostin 1CASKeyLaboratoryofPathogenicMicrobiologyandImmunology,Institute systemmode)andusertime(CPUcostinusermode)were ofMicrobiology,ChineseAcademyofSciences,NO.1WestBeichenRoad, ChaoyangDistrict,Beijing,China.2BeijingInstitutesofLifeSciences,Chinese recorded using the time command that is a part of the AcademyofSciences,NO.1WestBeichenRoad,ChaoyangDistrict,Beijing, Linux base system. When compared with the other two China. Perlprograms,bothbenchmarksof ht_statusedoneorder Received:7September2012Accepted:27January2013 ofmagnitudefeweramountofusertime,whichshowedthe Published:31January2013 advantage of C++ language. The parallel comparison with FastQC,aJavaprogram,ht_statwasaboutthreetimesfas- References 1. MetzkerML:Sequencingtechnologies-thenextgeneration.NatRev terinbothsinglethreadandthreethreadstest(Figure2A). Genet2010,11(1):31–46. However,thehigheramountofsystemtimeinourprogram 2. DohmJC,LottazC,BorodinaT,HimmelbauerH:Substantialbiasesinultra- shows the cost of the use of threads, which indicates our shortreaddatasetsfromhigh-throughputDNAsequencing. NucleicAcidsRes2008,36(16):e105. parallel processing model needs further improvements 3. Martinez-AlcantaraA,BallesterosE,FengC,RojasM,KoshinskyH,FofanovVY, (Figure 2C). When comparing the two benchmarks of HavlakP,FofanovY:PIQA:pipelineforIlluminaG1genomeanalyzerdata ht_stat, the time cost of the three-threaded benchmark is qualityassessment.Bioinformatics2009,25(18):2438–2439. 4. FastQC:http://www.bioinformatics.babraham.ac.uk/projects/fastqc/. onlyonethirdoftheone-threadedbenchmark,whichindi- 5. ZhangT,LuoY,LiuK,PanL,ZhangB,YuJ,HuS:BIGpre:aquality catestheefficiencyofmultiplethreads(Figure2A).Further- assessmentpackagefornext-generationsequencingdata.Genomics more, there is no significant difference in the amount of ProteomicsBioinformatics2011,9(6):238–244. 6. CoxMP,PetersonDA,BiggsPJ:SolexaQA:At-a-glancequalityassessment user time and system time between 1-thread and 3-thread ofIlluminasecond-generationsequencingdata.BMCBioinformatics2010, benchmarksof ht_stat(Figure2B),whichindicatestheuse 11:485. ofmorethreadsdoesnotproduceadditionalsystemcost. 7. perlguts-IntroductiontothePerlAPI:http://perldoc.perl.org/perlguts.html. 8. CockPJ,FieldsCJ,GotoN,HeuerML,RicePM:TheSangerFASTQfile formatforsequenceswithqualityscores,andtheSolexa/IlluminaFASTQ Conclusions variants.NucleicAcidsRes2010,38(6):1767–1771. The HTQC tool kit provides convenient utilities for 9. CASAVAv1.8Changes:Illumina,Inc;2011.http://support.illumina.com/ downloads/casava_18_changes.ilmn. Illumina sequencing QC. It can process sequencing data faster than the existing tools, and generates quality assess- doi:10.1186/1471-2105-14-33 mentreportbothinplain-textandgraphicalrepresentation, Citethisarticleas:Yangetal.:HTQC:afastqualitycontroltoolkitfor Illuminasequencingdata.BMCBioinformatics201314:33. which can help in making decisions about further reads filtration. TheHTQC packagealso provides four programs that can perform reads filtration using different methods. Unlikeprevioustoolsinwhichonlysinglefiltrationmethod is allowed, user can choose the method they prefer to removethelowqualityreads,andcombineseveralfiltration methodsinanyorder. Availabilityandrequirements Submit your next manuscript to BioMed Central Project name:HTQC and take full advantage of: Projecthomepage:https://sourceforge.net/projects/htqc Operation system: Linux, potentially any POSIX com- • Convenient online submission pliant system. • Thorough peer review Otherrequirements:GNUGlibhttp://ftp.gnome.org/pub/ • No space constraints or color ﬁgure charges GNOME/sources/glib, pkg-config http://www.freedesktop. • Immediate publication on acceptance org/wiki/Software/pkg-config, CMake http://www.cmake. • Inclusion in PubMed, CAS, Scopus and Google Scholar org,Perlhttp://www.perl.org,Gnuplothttp://gnuplot.info • Research which is freely available for redistribution Programminglanguages: C++,Perl License:GNUGPLversion 3orlater Submit your manuscript at www.biomedcentral.com/submit

HTQC: a fast quality control toolkit for Illumina sequencing data. PDF

0.58 MB·English

by Yang, Xi

#journals #pubmed

Checking for file health...

Save to my drive

Quick download

Download

Download HTQC: a fast quality control toolkit for Illumina sequencing data. PDF Free - Full Version

by Yang, Xi| 0.58| English

Download HTQC: a fast quality control toolkit for Illumina sequencing data. by Yang, Xi in PDF format completely FREE. No registration required, no payment needed. Get instant access to this valuable resource on PDFdrive.to!

Free Download PDF

About HTQC: a fast quality control toolkit for Illumina sequencing data.

No description available for this book.

Detailed Information

Author:	Yang, Xi
Language:	English
File Size:	0.58
Format:	PDF
Price:	FREE

Download Free PDF

Safe & Secure Download - No registration required

Why Choose PDFdrive for Your Free HTQC: a fast quality control toolkit for Illumina sequencing data. Download?

100% Free: No hidden fees or subscriptions required for one book every day.
No Registration: Immediate access is available without creating accounts for one book every day.
Safe and Secure: Clean downloads without malware or viruses
Multiple Formats: PDF, MOBI, Mpub,... optimized for all devices
Educational Resource: Supporting knowledge sharing and learning

Frequently Asked Questions

Is it really free to download HTQC: a fast quality control toolkit for Illumina sequencing data. PDF?

Yes, on https://PDFdrive.to you can download HTQC: a fast quality control toolkit for Illumina sequencing data. by Yang, Xi completely free. We don't require any payment, subscription, or registration to access this PDF file. For 3 books every day.

How can I read HTQC: a fast quality control toolkit for Illumina sequencing data. on my mobile device?

After downloading HTQC: a fast quality control toolkit for Illumina sequencing data. PDF, you can open it with any PDF reader app on your phone or tablet. We recommend using Adobe Acrobat Reader, Apple Books, or Google Play Books for the best reading experience.

Is this the full version of HTQC: a fast quality control toolkit for Illumina sequencing data.?

Yes, this is the complete PDF version of HTQC: a fast quality control toolkit for Illumina sequencing data. by Yang, Xi. You will be able to read the entire content as in the printed version without missing any pages.

Is it legal to download HTQC: a fast quality control toolkit for Illumina sequencing data. PDF for free?

https://PDFdrive.to provides links to free educational resources available online. We do not store any files on our servers. Please be aware of copyright laws in your country before downloading.

The materials shared are intended for research, educational, and personal use in accordance with fair use principles.