Table Of ContentYangetal.BMCBioinformatics2013,14:33
http://www.biomedcentral.com/1471-2105/14/33
SOFTWARE Open Access
HTQC: a fast quality control toolkit for Illumina
sequencing data
Xi Yang1, Di Liu1, Fei Liu1, Jun Wu1, Jing Zou1, Xue Xiao1, Fangqing Zhao2 and Baoli Zhu1*
Abstract
Background: Illumina sequencing platform is widely used in genome research.Sequence reads quality assessment
and control are neededfor downstream analysis. However, software that provides efficient quality assessment and
versatile filtration methods is still lacking.
Results: We have developed a toolkit named HTQC – abbreviation ofHigh-Throughput Quality Control –for
sequence readsqualitycontrol, which consistsof sixprograms for readsqualityassessment,reads filtration and
generation of graphic reports.
Conclusions: The HTQC toolkit can generate reads quality assessment faster than existing tools, providing guidance
for reads filtration utilities that allow users to choose different strategies to removelow quality reads.
Background tools lack of function versatility or run-time efficiency.
Next generation sequencing technologies are generating For example, PIQA and FastQC only create reports on
massive sequence data [1], and different platforms can reads quality, but provide no tool for reads filtration
introduce varied level of sequence reads error. Among [3,4]. SolexaQA and BIGpre provide only read trimming
them, the Illumina platform is the most widely used for utility [5,6]. Furthermore, many of these software tools
genome sequencing with the least error rate per base. are implemented using Perl language [3,5,6], which as a
However, due to the nature of the method, it still pre- dynamic-typedscriptlanguage,provideseaseondeveloping
sents a considerable amount of errors that has its spe- mini programs, but causes severe loss on run-time per-
cific errors pattern [2]. The device performs sequencing formance [7]. Therefore, a high performance sequence
by DNA synthesis on clusters of identical DNA mole- readsQCtoolkitisneededforafasterassessmentandver-
cules simultaneously. When elongation of some DNA satilefunctionality.
molecules is stopped accidentally, it creates disturbance
of the cluster’s fluorescent signal, resulting sequencing
Implementation
errors. Such errors accumulate during the process of se-
TheHTQCtoolkitconsistsofsixprogramsthatcanper-
quencing, and cause reads quality decreasing while the
form reads quality assessment and filtration. To improve
length grows. Besides, deficiency on sequencing chips
run-time performance, the time-consuming programs
and the existence of air bubbles on chip surface can
are implemented using C++. The FASTQ format is used
cause failure on reads from a whole tile. To get reliable
for sequence data input and output, and the QC report
result in downstream analysis, it is necessary to remove
isgenerated astab-separatedplain-textfile[8].Tocreate
low quality reads avoiding mismatches in read mapping,
graphical charts of QC report, a Perl script is included.
andfalsepathsduringgenomeassembly.
The GNU Glib is used for base utilities such as
Currently there are several reads quality control (QC)
command-line parser, portable support for threads and
software tools (Table 1). However, all these software
asynchronousqueues. AllprogramsofHTQC toolkit are
capable of single-end or paired-end sequencing experi-
*Correspondence:[email protected] ments, 33-based or 64-based quality score encodings
1CASKeyLaboratoryofPathogenicMicrobiologyandImmunology,Institute and FASTQ sequence identifier format from different
ofMicrobiology,ChineseAcademyofSciences,NO.1WestBeichenRoad,
version of CASAVA tools (the traditional format like
ChaoyangDistrict,Beijing,China
Fulllistofauthorinformationisavailableattheendofthearticle “@HWUSI-EAS100R:6:73:941:1973#0/1”, and the new
©2013Yangetal.;licenseeBioMedCentralLtd.ThisisanOpenAccessarticledistributedunderthetermsoftheCreative
CommonsAttributionLicense(http://creativecommons.org/licenses/by/2.0),whichpermitsunrestricteduse,distribution,and
reproductioninanymedium,providedtheoriginalworkisproperlycited.
Yangetal.BMCBioinformatics2013,14:33 Page2of4
http://www.biomedcentral.com/1471-2105/14/33
Table1Comparisonofsequencingqualitycontrol each position of the sequence reads, a heat map and
softwaretools a box chart are presented with quality distribution
Name Programming Sequencing Readsfiltrationmethods (Figure 1AB). The stacked bar chart that represents the
language platforms base composition on each position is shown at the same
HTQC C++ Illumina Tailtrimming,filterby time (Figure 1C). The cycle-specific errors and the
quality/length/tile
rapid-falling quality on reads tail can be viewed on these
FastQC Java Notlimited,any No three charts. To find tile-specific problems like high
FASTQfile
error rate or low data production, a stacked bar chart
NGSQC Perl Illumina,454 Trimmingonly shows the number of reads in different quality ranges
SolexaQA Perl Illumina Trimmingonly using different color, each tile in one bar (Figure 1F).
BIGpre Perl Illumina,454 Duplicateremoval Theotherchartsthat showthedistributionofreadswith
varied length and reads with varied quality can provide
PIQA R,C++ Illumina No
an overview of sequencing quality (Figure 1DE). For
paired-endreadsqualityassessment,theht_statprogram
format used by CASAVAversion 1.8 like “@EAS139:136: isusedtocreateseparate chartsforeachend,andtocal-
FC706VJ:2:2104:15343:1973931:Y:18:ATCACG”)[9]. culate the correlation between reads quality of two ends
The quality control pipeline begins with a quality as- (Figure 1G). All these results generated by ht_stat pro-
sessment on raw reads. Program ht_stat generates QC gram are written to a series of tab-delimited plain-text
report from raw sequence reads in different aspects. On files, which can be visualized using ht_stat_draw.pl,
A B
C D E
F G
Figure1Chartsforreadqualityassessment.A:heatmapforreadqualityalongread.B:boxchartforreadqualityalongread.C:basecomposition
alongread.D:readlengthdistribution.E:readqualitydistribution.F:readqualityondifferenttiles.G:correlationbetweentworeadends.
Yangetal.BMCBioinformatics2013,14:33 Page3of4
http://www.biomedcentral.com/1471-2105/14/33
or any spreadsheet software like Microsoft Excel or applied to cut the low quality reads using the program
LibreOffice Calc. ht_trim. In Figure 1C, there was at least 10% of reads
After the assessment of sequence reads quality is that have an invalid nucleotide sequence represented by
obtained,lowqualityreadsshouldberemoved.TheHTQC contiguous Ns. The bad reads that contained these Ns
tool kit provides four different programs that include can be filtered with the program ht_length_filter or
ht_tile_filter, ht_trim, ht_qual_filter and ht_length_filter, ht_qual_filter. When the quality assessmentwas done by
to perform reads filtration. The ht_tile_filter is designed to tiles, we observed tiles 5, 31, 110, 113, 117, 118 pro-
remove reads from problematic tiles that may not be duced reads with very low quality (Figure 1F) that can
reliable due to sequencing chip quality; the ht_trim cuts be removed using ht_tile_filter. For the quality assess-
low quality bases at the beginning or the end of the reads ment of any paired-end reads, if the quality of read 2
until the quality score reaches a given threshold; the was worse than read 1, such quality imbalance can be
ht_qual_filter remove reads with low quality and the picked upbyht_stat(Figure 1G).
ht_length_filter remove short reads. When only one end of
the paired-end reads is of acceptable quality, it is stored in Programrun-timeefficiency
a separate file. The cutoff valueof these programs, such as To compare the time cost of HTQC with existing tools,
thethresholdsontheminimumreadsqualityorminimum quality assessment on the above dataset was performed
readlengthareuserdefined. using ht_stat, SolexaQA version 1.12, FastQC version
0.10.1 and BIGpre version 2.0.2. The benchmark was
Results and discussion performed using a Dell PowerEdge server with two AMD
Workflowdemonstration Opteron 2427 CPUs (12 cores in total). The server’s
To demonstrate the function of HTQC, a paired-end se- operationsystemwasFedoraLinux14.ForSolexaQA,itby
quence data of human gut metagenome was used as an default only uses a part of the input sequences from each
example.To reduce thetime cost,one tenth from atotal tile, thus we input the size of the test dataset (3,559,988
of 35,625,015 paired-end reads were randomly picked. reads)asthetilesamplesizeto ensureall readswereused.
Thereads length was 120bp. The quality assessment was ForBIGpre, dueto itsinabilityto parsethenewestversion
performed using ht_stat, which shows the reads quality of headerformatproducedbyCASAVA1.8,ortoworkon
in a series of charts that were described above in Imple- quality records with “@” character, we modified its source
mentation. When quality assessment was done by base code to allow it work properly with the test dataset. The
position, there was a gradual decrease of reads quality SolexaQA and the modified BIGpre programs were
towards the 3’-end (tail) that can be observed in executed using Perl version 5.12.4 contained within the
Figure 1A and 1B. The tail trimming would be routinely Fedora Linux system. The C++ source code of ht_stat was
A B C
Figure2Timecostofht_statandotherQCsoftware.A:realtime;B:usertime;C:systemtime.Themeaningof“real”,“user”and“system”
timeisdescribedinthechapterofprogramrun-timeefficiency.
Yangetal.BMCBioinformatics2013,14:33 Page4of4
http://www.biomedcentral.com/1471-2105/14/33
compiled using GNU GCCversion4.5.1 with optimization Abbreviation
flag “-O2”. For the performance test of multiple threads, QC:Qualitycontrol.
benchmarks for ht_stat and FastQC were carried out in
Competinginterests
two groups: one benchmark using 1 worker thread, the Theauthorsdeclarethattheyhavenocompetinginterests.
other benchmark using 3 worker threads. Therefore, we
Authors’contributions
have six benchmarks in total: one each for SolexaQA and
XYimplementedtheprogram;FL,XX,JZandJWprovidedtestdata;BLZ,DL,
BIGpre, two each for ht_stat and FastQC. All benchmarks FQZandXYwrotethemanuscript.Allauthorsreadandapprovedthefinal
wererun20timesrepeatedlytoovercomeanyrandomdis- manuscript.
turbance of the computer system, and the amount of real
Authordetails
time(elapsedtimeinrealworld),systemtime(CPUcostin 1CASKeyLaboratoryofPathogenicMicrobiologyandImmunology,Institute
systemmode)andusertime(CPUcostinusermode)were ofMicrobiology,ChineseAcademyofSciences,NO.1WestBeichenRoad,
ChaoyangDistrict,Beijing,China.2BeijingInstitutesofLifeSciences,Chinese
recorded using the time command that is a part of the
AcademyofSciences,NO.1WestBeichenRoad,ChaoyangDistrict,Beijing,
Linux base system. When compared with the other two China.
Perlprograms,bothbenchmarksof ht_statusedoneorder
Received:7September2012Accepted:27January2013
ofmagnitudefeweramountofusertime,whichshowedthe
Published:31January2013
advantage of C++ language. The parallel comparison with
FastQC,aJavaprogram,ht_statwasaboutthreetimesfas- References
1. MetzkerML:Sequencingtechnologies-thenextgeneration.NatRev
terinbothsinglethreadandthreethreadstest(Figure2A). Genet2010,11(1):31–46.
However,thehigheramountofsystemtimeinourprogram 2. DohmJC,LottazC,BorodinaT,HimmelbauerH:Substantialbiasesinultra-
shows the cost of the use of threads, which indicates our shortreaddatasetsfromhigh-throughputDNAsequencing.
NucleicAcidsRes2008,36(16):e105.
parallel processing model needs further improvements
3. Martinez-AlcantaraA,BallesterosE,FengC,RojasM,KoshinskyH,FofanovVY,
(Figure 2C). When comparing the two benchmarks of HavlakP,FofanovY:PIQA:pipelineforIlluminaG1genomeanalyzerdata
ht_stat, the time cost of the three-threaded benchmark is qualityassessment.Bioinformatics2009,25(18):2438–2439.
4. FastQC:http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
onlyonethirdoftheone-threadedbenchmark,whichindi-
5. ZhangT,LuoY,LiuK,PanL,ZhangB,YuJ,HuS:BIGpre:aquality
catestheefficiencyofmultiplethreads(Figure2A).Further- assessmentpackagefornext-generationsequencingdata.Genomics
more, there is no significant difference in the amount of ProteomicsBioinformatics2011,9(6):238–244.
6. CoxMP,PetersonDA,BiggsPJ:SolexaQA:At-a-glancequalityassessment
user time and system time between 1-thread and 3-thread
ofIlluminasecond-generationsequencingdata.BMCBioinformatics2010,
benchmarksof ht_stat(Figure2B),whichindicatestheuse 11:485.
ofmorethreadsdoesnotproduceadditionalsystemcost. 7. perlguts-IntroductiontothePerlAPI:http://perldoc.perl.org/perlguts.html.
8. CockPJ,FieldsCJ,GotoN,HeuerML,RicePM:TheSangerFASTQfile
formatforsequenceswithqualityscores,andtheSolexa/IlluminaFASTQ
Conclusions variants.NucleicAcidsRes2010,38(6):1767–1771.
The HTQC tool kit provides convenient utilities for 9. CASAVAv1.8Changes:Illumina,Inc;2011.http://support.illumina.com/
downloads/casava_18_changes.ilmn.
Illumina sequencing QC. It can process sequencing data
faster than the existing tools, and generates quality assess- doi:10.1186/1471-2105-14-33
mentreportbothinplain-textandgraphicalrepresentation, Citethisarticleas:Yangetal.:HTQC:afastqualitycontroltoolkitfor
Illuminasequencingdata.BMCBioinformatics201314:33.
which can help in making decisions about further reads
filtration. TheHTQC packagealso provides four programs
that can perform reads filtration using different methods.
Unlikeprevioustoolsinwhichonlysinglefiltrationmethod
is allowed, user can choose the method they prefer to
removethelowqualityreads,andcombineseveralfiltration
methodsinanyorder.
Availabilityandrequirements Submit your next manuscript to BioMed Central
Project name:HTQC and take full advantage of:
Projecthomepage:https://sourceforge.net/projects/htqc
Operation system: Linux, potentially any POSIX com- • Convenient online submission
pliant system. • Thorough peer review
Otherrequirements:GNUGlibhttp://ftp.gnome.org/pub/ • No space constraints or color figure charges
GNOME/sources/glib, pkg-config http://www.freedesktop. • Immediate publication on acceptance
org/wiki/Software/pkg-config, CMake http://www.cmake. • Inclusion in PubMed, CAS, Scopus and Google Scholar
org,Perlhttp://www.perl.org,Gnuplothttp://gnuplot.info • Research which is freely available for redistribution
Programminglanguages: C++,Perl
License:GNUGPLversion 3orlater Submit your manuscript at
www.biomedcentral.com/submit