Shatabda et al. BMC Bioinformatics 2013, 14(Suppl 2):S19
http://www.biomedcentral.com/1471-2105/14/S2/S19
PROCEEDINGS Open Access
The road not taken: retreat and diverge in local
search for simplified protein structure prediction
Swakkhar Shatabda1,2*, MA Hakim Newton1,2, Mahmood A Rashid1,2, Duc Nghia Pham1,2, Abdul Sattar1,2
From The Eleventh Asia Pacific Bioinformatics Conference (APBC 2013)
Vancouver, Canada. 21-24 January 2013
Abstract
Background: Given a protein’s amino acid sequence, the protein structure prediction problem is to find a three
dimensional structure that has the native energy level. For many decades, it has been one of the most challenging
problems in computational biology. A simplified version of the problem is to find an on-lattice self-avoiding walk
that minimizes the interaction energy among the amino acids. Local search methods have been preferably used in
solving the protein structure prediction problem for their efficiency in finding very good solutions quickly.
However, they suffer mainly from two problems: re-visitation and stagnancy.
Results: In this paper, we present an efficient local search algorithm that deals with these two problems. During
search, we select the best candidate at each iteration, but store the unexplored second best candidates in a set of
elite conformations, and explore them whenever the search faces stagnation. Moreover, we propose a new non-
isomorphic encoding for the protein conformations to store the conformations and to check similarity when
applied with a memory based search. This new encoding helps eliminate conformations that are equivalent under
rotation and translation, and thus results in better prevention of re-visitation.
Conclusion: On standard benchmark proteins, our algorithm significantly outperforms the state-of-the-art
approaches for the Hydrophobic-Polar energy model and the Face Centered Cubic lattice.
*Correspondence: [email protected]
1 Institute of Intelligent and Integrated Systems, Griffith University, Queensland, Australia
Full list of author information is available at the end of the article
© 2013 Shatabda et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the
Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use,
distribution, and reproduction in any medium, provided the original work is properly cited.
Background
Proteins are among the most important molecules present in the living cell. Given a protein's amino acid sequence,
the protein structure prediction (PSP) problem is to find a three dimensional native structure that has the lowest
free energy. In order to function properly, the protein has to fold into its native structure. Mis-folded proteins
cause many critical diseases such as Alzheimer's disease, Cystic fibrosis, and Mad Cow disease. Knowledge about
this native structure is of paramount importance and can have an enormous impact on the field of drug discovery.
Not much is known about the folding process and the nature of the energy function is also very complex. For
many decades, it has been considered one of the hardest problems in biology. In vitro laboratory methods like
X-ray crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy are very slow and expensive.
For these reasons, many researchers from other fields are attracted to solve the problem using their own
techniques [1,2].
Computational methods applied to PSP fall into three broad categories: ab initio, homology modeling and protein
threading. The latter two methods depend on the templates (or structures) of known proteins and are useful only
when matching templates are found. Research in ab initio PSP has been instigated by the famous Anfinsen's dogma.
In 1973 Nobel Prize Laureate Christian B. Anfinsen suggested that the native structure of a globular protein is
determined only by its primary amino acid sequence [3].
The ab initio PSP can be viewed as a search problem, where one has to find a stable, unique, and kinetically
accessible native structure from the space of all possible structures (also called conformations). The search space
for this problem, even in the simplified models, contains
an astronomically large number of conformations. Therefore, systematic search techniques are almost impractical
since they perform exhaustive search and require a huge amount of computational resources. In contrast, local
search methods are normally very quick in finding good solutions, although they suffer from re-visitation and
stagnation, and require good heuristics.
Performance of the computational methods also degrades when applied to the high resolution models that
deal with real structures of proteins. This is due to three reasons: i) the unknown contributing factors of different
forces to the energy functions, ii) protein models with atomic level details require huge computational effort, and
iii) the space of possible conformations is very large and complex. For these reasons, the general paradigm of de
novo PSP is to begin with the sampling of a large set of candidate (decoy) structures guided by a scoring function.
In the final stage, refinements are done to achieve the real structure. The simplified models, though they lack many
details, provide a realistic backbone for the proteins and can be refined to get real structures [4].
Local search algorithms, when applied to large proteins (sequence length around 200 monomers), suffer from a
huge number of re-visitations and from stagnation. To handle these issues, a number of techniques have been
applied in the PSP literature [5-7], including tabu lists, adaptive measures, and various restart mechanisms. Similar
approaches have also been used in other domains such as propositional satisfiability [8] and the quadratic assignment
problem [9]. Many of the algorithms apply random restarts or restart from the best local minimum [6,7],
which does not solve the problem in general.

Our contribution
In this paper, we present a new algorithm for the simplified protein structure prediction problem. During the
search, our method selects the best candidate in each iteration, but memorizes the second best conformations
that are generated but not selected or explored (called elite conformations) at each iteration. Whenever the
search faces stagnation, we select the best conformation from this elite set and continue the search from there. This
retreat helps the search diverge. Similar techniques have been used in systematic search techniques like A*
search, but they require a huge amount of memory to store the unexplored frontier. We maintain only a small
set of previously generated conformations by discarding conformations with similar fitness. This reduces the
memory requirement and provides a mechanism to go back to earlier conformations with lower fitness value but with
potential to lead towards better search regions. We also propose a new non-isomorphic encoding that reduces the
non-unique or isomorphic conformations from the search space and makes the similarity matching of the
conformations efficient. These isomorphic conformations are essentially the same and show differences only because of
translational and rotational symmetry. We applied this encoding in our algorithm along with the long term
memory of local minima proposed in [10]. Experimental results show that our algorithm significantly outperforms
the state-of-the-art algorithms on standard benchmark proteins using the Hydrophobic-Polar (HP) energy model and
the Face Centered Cubic (FCC) lattice.

Related work
Lau and Dill [1] proposed a simplified HP energy model for the protein structure prediction problem. It is proved to
be a hard combinatorial problem [11]. Due to the complexity, several techniques and their hybridizations have
been applied to solve the problem. The similarity with the thermodynamic nature of protein folding allured
researchers to apply simulated annealing [12,13]. Genetic algorithms were first applied to solve this problem
by Unger and Moult [14]. The basic genetic algorithm was subsequently improved by many researchers [15-17].
Yue and Dill [18] applied constraint based approaches for the first time and developed the Constraint Based
Hydrophobic Core Construction (CHCC) algorithm. Their method had several pitfalls: CHCC could only support the
HP model and failed to report degeneracy or non-unique structures for several protein sequences. The research
group of Rolf Backofen developed a Constraint-based Protein Structure Prediction (CPSP) tool [19], which
provided solutions to these problems. However, the CPSP tool depends on pre-calculated cores and does not converge
for larger protein sequences. Palu et al. [20] developed COLA solver using highly optimized constraints and
propagators to obtain satisfactory results on small and medium-sized instances (length < 80). Lesh et al. [5] provided
a novel set of transformations called pull moves extendible to any lattice. Both Lesh et al. [5] and Blazewicz et al. [21]
implemented tabu search meta-heuristics independent of each other.
Hybrid techniques that combine the power of different strategies provided better results. Using the pull moves,
Klau et al. [22] proposed an interactive optimization framework called Human Guided Simple Search (HuGS).
Using the same pull move set, Ullah et al. [23] proposed a two-stage optimization approach. Furthermore, Ullah et al.
[24] combined local search and constraint programming approaches. They introduced a protein folding simulation
procedure on FCC lattice and employed the COLA solver [20] to generate neighborhood states for a simulated
annealing based local search. They used MJ matrices with 20×20 amino acid pairwise interactions. They tested their
approaches on some real proteins (length < 80) from the Protein Data Bank (PDB). Jiang et al. [25] combined tabu
search strategy (GTS) with genetic algorithms in the two-dimensional HP Model.
Cebrian et al. [26] used tabu search to find 3D structures of Harvard instances [27] on FCC lattices for the
first time. In their subsequent work, Dotu et al. [6,7] applied Large Neighborhood Search (LNS) to further
optimize the results found in [26]. They also improved the tabu search by adopting a new neighborhood selection
technique [7]. Both of their methods are implemented in COMET. Shatabda et al. [10] proposed a memory
based approach on top of the algorithm proposed by Dotu et al. [7] and improved the results on the FCC lattice
and HP energy model. Other methods (such as Simulated Annealing [12], Ant Colony Optimization
(ACO) [28], and Extremal Optimization [29]) are also found in the literature.

Materials and methods
Proteins are polymers of amino acid monomers. In a simplified model, all monomers have an equal size and all
bonds are of an equal length. Each amino acid monomer is represented by a single point and its position is
restricted to a three dimensional lattice. A simplified energy function is used in calculating the energy of a
conformation. The given amino acid sequence fits into a fixed lattice, where every two consecutive monomers in the
sequence are also neighbors on the lattice (called the chain constraint) and two monomers can not occupy the same
lattice point (called the self avoiding constraint).

FCC lattice
The Face Centered Cubic (FCC) lattice is preferred over other lattices since it has the highest packing density
[30] for spheres of equal size, and provides the highest degree of freedom for placing an amino acid monomer.
Thus, it provides a realistic discrete mapping for proteins. The FCC lattice is generated by the following
basis vectors:
$\vec{v}_1 = (1, 1, 0)$, $\vec{v}_2 = (-1, -1, 0)$, $\vec{v}_3 = (-1, 1, 0)$, $\vec{v}_4 = (1, -1, 0)$,
$\vec{v}_5 = (0, 1, 1)$, $\vec{v}_6 = (0, 1, -1)$, $\vec{v}_7 = (0, -1, -1)$, $\vec{v}_8 = (0, -1, 1)$,
$\vec{v}_9 = (1, 0, 1)$, $\vec{v}_{10} = (-1, 0, 1)$, $\vec{v}_{11} = (-1, 0, -1)$, $\vec{v}_{12} = (1, 0, -1)$.
Two lattice points $p, q \in L$ are said to be in contact or neighbors of each other, if $q = p + \vec{v}_i$
for some vector $\vec{v}_i$ in the basis of lattice $L$.

HP energy model
The Hydrophobic-Polar (HP) energy model was proposed by Lau and Dill [1]. In this model, all the amino
acids are divided into two groups: hydrophobic H (Gly, Ala, Pro, Val, Leu, Ile, Met, Phe, Tyr, Trp); and
hydrophilic or polar P (Ser, Thr, Cys, Asn, Gln, Lys, His, Arg, Asp, Glu). The given amino acid sequence of a protein
is represented as a string s of the alphabet {H, P}. The free energy calculation for the HP model, shown in (1),
counts only the energy interactions between two non-consecutive amino acid monomers.
$$E = \sum_{i,j:\, i+1<j} c_{ij} \cdot e_{ij} \qquad (1)$$
where $c_{ij} = 1$ only if two monomers i and j are neighbors (or in contact) on the lattice and 0 otherwise. The other
term, $e_{ij}$, is calculated depending on the type of amino acids: $e_{ij} = -1$ if $s_i = s_j = H$ and 0 otherwise.
Minimizing the summation in (1) is equivalent to maximizing the number of non-consecutive H-H contacts. Several
other variants of the HP model [31] exist in the literature.
Using the HP energy model together with the FCC lattice, the simplified PSP problem is defined as: given a
sequence s of length n, find a self avoiding walk $p_1 \ldots p_n$ on the lattice such that the energy defined by (1) is
minimized.

Local search framework
The local search framework was originally proposed in [7]. The algorithm is similar to that of the procedure
localSearch() presented in Table 1 except in Lines 6, 9-10 and 14. It depends on a structured randomized
initialization method and maintains a simple tabu list to prevent recently used moves. In the framework, only moves
involving a single monomer are allowed. For any given conformation c and a sequence position i, a move (i, p, c)
that moves amino acid i to a new position p is allowed if (i) p is free and is in contact with both amino
acids at positions i - 1 and i + 1, and (ii) i is not in the tabu list. The length of the tabu list takes a random value
from [4, n/4], where n is the length of the sequence. The move can be applied to either H or P type of amino acid
at each iteration. The fitness function minimizes the summation of HH-distances for all non-consecutive pairs
of H-monomers. The fitness function can be formally defined as the following:
$$f(c) = \sum_{i,j:\, i+1<j}^{n} (dv(i,j))^2 \times (s_i = H,\ s_j = H) \qquad (2)$$
where $dv(i,j) = d(i,j)^2 - 2$ and $d(i,j) = (x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2$, i.e. the square of the
Euclidean distance between the ith and jth amino acids in the current conformation c of a sequence s of length n.
The energy level of the structure is still determined by the HP energy value. The fitness function is used to drive
the search only. The search algorithm periodically switches the type of the acid and selects the best move on an
amino acid which is not in the
Table 1 Local Search Framework.
Procedure localSearch()
1   initialize()
2   initializeTabu()
3   while iteration ≤ maxIteration do
4     selectMonomerType()
5     generateMoves()
6     selectMove()
7     performMove()
8     updateCosts()
9     if local minima is detected then
10      storeLocalMinima()
11    end
12    if nonImprovingSteps ≥ maxStable then
13      initializeTabu()
14      selectFromEliteSet()
15    end
16  end
Procedure selectMove()
1   while moveList.notEmpty() do
2     m ← getNextCandidate()
3     c ← getConformation(m)
4     e ← getNonIsoEncoding(c)
5     b ← getPacked(e)
6     if match(b, proximity) then
7       discard m
8     else
9       updateEliteSet()
10      return m
11    end
12  end
13  if moveList.empty() then
14    no moves possible
15    nonImprovingSteps ← maxStable + 1
16  end
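The control flow of Table 1 can be illustrated with a small Python sketch. It is a toy under stated assumptions and not the authors' C++ implementation: states are arbitrary hashable values rather than lattice conformations, a visited set stands in for both the tabu list and the local-minima memory, and the names local_search, neighbours, fitness and LANDSCAPE are hypothetical.

```python
import heapq
from itertools import count


def local_search(start, neighbours, fitness, max_iter=200, max_stable=5):
    """Greedy search that retreats to a stored second-best candidate on stagnation."""
    elite = []            # min-heap of (fitness, tie-breaker, state)
    tie = count()         # tie-breaker so states themselves are never compared
    visited = {start}     # assumption: replaces the tabu list and local-minima memory
    current = best = start
    stable = 0
    for _ in range(max_iter):
        cands = sorted((s for s in neighbours(current) if s not in visited), key=fitness)
        if cands:
            current = cands[0]                      # select the best candidate
            visited.add(current)
            if len(cands) > 1:                      # remember the unexplored second best
                heapq.heappush(elite, (fitness(cands[1]), next(tie), cands[1]))
            if fitness(current) < fitness(best):
                best, stable = current, 0
            else:
                stable += 1
        else:
            stable = max_stable                     # no move possible: force a retreat
        if stable >= max_stable:
            while elite and elite[0][2] in visited:
                heapq.heappop(elite)                # drop entries already explored
            if not elite:
                break
            current = heapq.heappop(elite)[2]       # retreat and diverge
            visited.add(current)
            stable = 0
    return best


# Toy 1-D landscape: local minimum at index 2, global minimum at index 9.
LANDSCAPE = [3, 2, 1, 2, 3, 4, 3, 2, 1, 0, 1]


def line_neighbours(x):
    return [y for y in (x - 1, x + 1) if 0 <= y < len(LANDSCAPE)]


print(local_search(4, line_neighbours, LANDSCAPE.__getitem__))
```

Starting at index 4, the greedy descent first falls into the basin around index 2; once no unvisited move remains, the search resumes from the stored second-best state (index 5) and reaches the other basin.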
tabu list. In case of P moves, it selects a random move since a move of a P type amino acid does not affect the
fitness function. The search restarts from the previously found best solution whenever the fitness function is not
improving for maxStable steps. The memory-based search in [10] extends this local search framework. It stores a
proportion of the local minima encountered and whenever a move is selected, it generates the conformation and
checks similarity with the stored local minima. If the generated conformation is within a given proximity of a
stored local minimum, the conformation is discarded. Hamming distance is used as the similarity measure and
relative encoding is used to represent the conformations.
Our algorithm is developed on top of the memory-based search. The pseudo-code for our algorithm is
depicted in Table 1. Our algorithm differs from the memory-based approach in Line 14 of Procedure
localSearch(), where we select a conformation from the elite set at stagnation, and in Line 9 of Procedure selectMove(),
where we store the prominent but not selected candidate conformations into the elite set. It also differs in the
encoding of the representation of the conformations. We do that at Line 4 of Procedure selectMove() before
matching it with stored local minima and at Line 10 of Procedure localSearch() while storing the local minimum.
The rest of this section describes the details of the procedures of our algorithm.

Elite conformations
In each iteration of a local search, a number of conformations are generated. However, only a few of them are
explored in the next iterations. In the case of a single candidate search, only a single conformation, which is
typically the best conformation according to the heuristic, is selected for the next iteration. In successive iterations,
the search goes on by generating the neighbors of the selected conformations. The other potential conformations with
good fitness values are never used as the search is greedy in nature. We call them elite conformations. These
conformations, if explored ever, may lead to better search regions. Note that, in the systematic search techniques,
these conformations are stored and explored. However, they require a huge amount of memory. Moreover, the
selection in a systematic search like A* search depends on a heuristic function that requires the goal to be known
beforehand. In our case, the optimal structure is totally unknown and we can not afford to store a huge number
of conformations. In our algorithm, we store the second best conformations and explore them whenever the search
faces stagnation.

Store
We store the second best conformations in each iteration in a set called the elite set. At each iteration, when a
move is selected, we update this elite set of conformations. The pseudo-code for the updateEliteSet() procedure is
given in Table 2. We use a priority queue sorted in the order of fitness value and iteration number
to store the elite conformations. Before inserting a conformation into the priority queue, we check for similarity
in the stored local minima list and store it only if no match is found.

Explore
We select the top element from the priority queue whenever the search stagnates. The search then
continues from the selected elite conformation. The search algorithm, guided by the fitness function defined in (2),
quickly forms a compact hydrophobic core at the center
Table 2 Pseudo-code for Elite Set Methods.
Procedure updateEliteSet()
1   sb ← set of second best candidates
2   while sb.notEmpty() do
3     m ← sb.getNextCandidate()
4     c ← getConformation(m)
5     e ← getNonIsoEncoding(c)
6     b ← getPacked(e)
7     if match(b, proximity) == false then
8       eliteSet.push(c)
9     end
10  end
Procedure selectFromEliteSet()
1   while eliteSet.notEmpty() do
2     c ← eliteSet.getTopElement()
3     e ← getNonIsoEncoding(c)
4     b ← getPacked(e)
5     if match(b, proximity) == false then
6       eliteSet.release()
7       return c
8     end
9     eliteSet.popElement()
10  end
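The elite set of Table 2 can be sketched as a small priority queue in Python. This is a hedged illustration only: the class name EliteSet, the release_fraction value, and the choice to prefer more recent iterations among equal fitness values are assumptions (the paper states only that the queue is ordered by fitness value and iteration number and that a fixed proportion of top elements is released), and the similarity check against stored local minima is omitted.

```python
import heapq


class EliteSet:
    """Priority queue of second-best conformations, ordered by fitness then iteration."""

    def __init__(self, release_fraction=0.25):
        # release_fraction is a hypothetical value; the paper does not state it.
        self.release_fraction = release_fraction
        self._heap = []  # entries are (fitness, -iteration, conformation)

    def push(self, fitness, iteration, conformation):
        # Lower fitness is better; among ties the later iteration comes first (assumption).
        heapq.heappush(self._heap, (fitness, -iteration, conformation))

    def select(self):
        """Return the top conformation, then release a fixed proportion of the next-best ones."""
        if not self._heap:
            return None
        chosen = heapq.heappop(self._heap)[2]
        for _ in range(int(len(self._heap) * self.release_fraction)):
            heapq.heappop(self._heap)  # discard entries close in fitness to the chosen one
        return chosen

    def __len__(self):
        return len(self._heap)


elite = EliteSet()
for it, fit in enumerate([5, 3, 8, 1, 7, 2, 6, 4]):
    elite.push(fit, it, "conf%d" % it)
print(elite.select(), len(elite))
```

Releasing the entries nearest the selected one is what keeps the queue small and pushes the next retreat towards a different region.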
of the conformation and the greedy search oscillates within the same region of the search space before it can
improve the fitness function to break the core or to form some alternate core. The detailed nature of the search is
discussed in [10]. The oscillating nature indicates that if we select a conformation from a region in the search
space, then we can ignore the other conformations with the same or near fitness value and within the temporal
locality. Every time an elite conformation is selected from the list, we do that by discarding a fixed proportion of
the top elements from the list. This results in eliminating the conformations that are similar in fitness value and
structure, and are also temporally proximate. This retreat effectively helps the search diverge. It also reduces the
memory requirement for the priority queue used. The detailed pseudo-code of the method is given in Table 2.
The method eliteSet.release() at Line 6 releases the top elements from the elite set.

Non-isomorphic encoding
Many techniques have been employed in the literature to represent the protein conformations. These
representations allow the search to keep the candidate conformations updated and perform operations like similarity
checking (memory-based algorithms) and crossover (genetic algorithms). The most obvious way to represent the
conformations is to use Cartesian co-ordinates of the amino-acid monomers. However, such a representation contains
translational symmetry, which can be solved if absolute encoding is used. Absolute encoding is found from the
absolute direction vectors between the consecutive points in the amino-acid chain. The alphabet size of the absolute
encoding depends on the lattice used. For the FCC lattice, the alphabet size is 12 since the number of basis vectors
is 12. However, absolute encoding is not suitable when we check similarity between two conformations since it
contains the problem of rotational symmetry. Two identical conformations with rotational symmetry are
represented by different absolute encodings (see the example in Figure 1). This type of encoding is called isomorphic
encoding. Non-isomorphic encodings provide a solution to this issue. Shatabda et al. [10] used the relative encoding
proposed by Backofen et al. [32] in their algorithm. Their encoding scheme starts from a fixed direction and
continues to update a base matrix throughout the chain. The efficiency of the algorithm thus depends on the dimension
of the lattice. Moreover, a decoding algorithm is needed to get back the absolute encodings or the co-ordinate points.
The computational complexity of their algorithm is O(nl^3), where n is the number of absolute directions and l is the
dimension of the lattice. The complexity of the decoding algorithm is also O(nl^3). A non-isomorphic encoding was
also proposed in [33] for cubic lattices that calculates the angles between two consecutive absolute direction vectors
and encodes the move sequence. This encoding also costs more as it requires computation of angles between the
direction vectors.
In this paper, we propose a new non-isomorphic encoding, which is generic for any lattice and requires no
separate decoding algorithm; the encoding itself maps to the absolute directions. Instead of relative angles, our
algorithm depends on the relative occurrence of the absolute directions within the chain. It requires only O(n)
time to encode. The pseudo-code of our algorithm is given in Table 3. This algorithm calculates the encoding
on the fly. It starts with an empty Map and every time a new absolute direction is encountered in the sequence, it
assigns the next available code to it. Once the mapping for all possible directions is found, the algorithm is
just a simple lookup from the mapping array. In the results section, we show the effectiveness of our encoding
scheme when applied to the memory-based search [10].

Results and discussion
We implemented our algorithm in C++ and ran experiments on the NICTA (http://www.nicta.com.au) cluster
machine. The cluster has a number of machines each equipped with two 6-core CPUs (AMD Opteron @2.8
GHz, 3MB L2/6M L3 Cache) and 64 GB Memory, running Rocks OS (a Linux variant for cluster). We compared
the performance of our algorithm to that of the tabu
Figure 1 Isomorphic Encoding. Two identical structures in the cubic lattice having different absolute encodings; the
structure on the left has the encoding "DSES", and the structure on the right has the encoding "UNEN", where D = Down,
U = Up, N = North, S = South, E = East and W = West.
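The first-occurrence relabelling that the text describes for the proposed encoding (Table 3) can be tried on the two strings of Figure 1. The Python sketch below is illustrative only: the function name non_iso_encoding is hypothetical, absolute directions are represented as single characters, and codes start at 0, which the paper does not specify.

```python
def non_iso_encoding(abs_dirs):
    """Relabel absolute directions by order of first occurrence along the chain."""
    mapping = {}          # absolute direction -> code, filled on the fly
    encoded = []
    for d in abs_dirs:
        if d not in mapping:
            mapping[d] = len(mapping)   # next available code
        encoded.append(mapping[d])
    return encoded


left = non_iso_encoding("DSES")    # structure on the left of Figure 1
right = non_iso_encoding("UNEN")   # structure on the right of Figure 1
print(left, right, left == right)
```

Both absolute encodings relabel to the same code sequence, so a position-by-position comparison such as Hamming distance treats the two structures as identical. The single pass over the chain with a dictionary lookup per step is consistent with the O(n) encoding time stated in the text.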
search by Dotu et al. [7] and the memory based approach proposed in [10]. Algorithms were run 50 times for each
of the protein sequences. Each run was given 5 hours to finish. We could not compare our results with the Large
Neighborhood Search (LNS) [7] since the COMET program exited with 'too much memory needed' error for the
large-sized benchmark proteins that we have selected. We do not show results for small-sized Harvard instances
(length = 48) or other smaller protein sequences since both algorithms reach near optimal conformations and the
differences between the energy levels achieved for these proteins are relatively small.

Table 3 Pseudo-code for Non-Isomorphic Encoding.
Procedure getNonIsoEncoding(c)
1   initMap()
2   for i ← 1 to N do
3     absdir = c.getAbsDir(i)
4     if absdir is a new direction then
5       Map[absdir] ← dirCount
6       dirCount++
7     end
8     encoded[i] = Map[absdir];
9   end
10  return encoded

Results
We show results for two sets of benchmarks in Table 4. The first six proteins are also used by Dotu et al. [7]. The
R instances (length = 200) are originally taken from [34] and the f180 instances (length = 180) are provided by
Sebastian Will [7]. LS-New denotes our algorithm, LS-Mem denotes the memory-based approach in [10] and
LS-Tabu denotes the tabu search by Dotu et al. [7]. The best and average energy levels achieved are reported in
Table 4. We set the proximity measure to 3 and only 5% of the local minima were stored while maxStable was set to
100 for our algorithm. For other algorithms, we set the parameters as recommended by the authors. The best
energy levels reported by Dotu et al. [7] are also shown under the column LNS. These results were produced by
large neighborhood search. Optimal lower bounds for the minimum energy values for the proteins are also reported
under the column 'E_l', generated by the CPSP tools [19]. Note that these values are obtained by using exhaustive
search methods and are used only to evaluate how far our results are from them. The missing values indicate where
no such bound was found and the values marked with * are the values for which the algorithm did not converge
even after 24 hours of run.
We also used a second set of benchmark proteins derived from the famous Critical Assessment of Techniques
for Protein Structure Prediction (CASP) competition (http://predictioncenter.org/casp9/targetlist.cgi).
These proteins are of length 230 ± 50. Six protein sequences were randomly chosen from the target list.
These sequences are then converted into HP sequences. Results for these six proteins are also given in Table 4
(lower part). The PDB ids for each of these proteins are also given. The parameter settings for these six proteins
were also kept the same. The LNS column contains no data for these six proteins since they were not used in [7].

Analysis
From the average energy levels shown in bold-face in Table 4, it is clearly evident that, for all the twelve
proteins, our algorithm significantly outperforms both of
Table 4 Experimental Results.
Protein                   LS-New         LS-Mem                   LS-Tabu
Seq.     Length  E_l      best   avg     best   avg    R.I.       best   avg    R.I.       LNS
R1       200     -384     -359   -339    -353   -326   22.41%     -332   -318   31.81%     -330
R2       200     -383     -361   -343    -351   -330   24.52%     -337   -324   32.20%     -333
R3       200     -385     -354   -340    -352   -330   18.18%     -339   -323   27.41%     -334
f180_1   180     -378*    -361   -341    -360   -334   15.90%     -338   -327   27.45%     -293
f180_2   180     -381*    -368   -350    -362   -340   24.39%     -345   -334   34.02%     -312
f180_3   180     -378     -365   -355    -357   -343   34.28%     -352   -339   41.02%     -313
3no6     229     -455     -419   -397    -400   -375   27.50%     -390   -373   29.26%     -
3mr7     189     -355     -320   -304    -311   -292   19.04%     -301   -287   25%        -
3mse     179     -323     -288   -271    -278   -254   22.63%     -266   -249   29.72%     -
3mqz     215     -474     -430   -404    -415   -386   20.45%     -401   -383   23.07%     -
3on7     279     ?        -514   -476    -499   -463   -          -491   -461   -          -
3no3     258     -494     -406   -376    -397   -361   11.27%     -388   -359   12.59%     -
The best and average energy levels achieved and relative improvements of our algorithm over other algorithms for the
R, f180 and instances taken from CASP.
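The R.I. column of Table 4 follows the relative improvement measure that the paper defines in Equation (3): the gain of our average energy over the other algorithm's average, divided by the gap between the lower bound and the other algorithm's average. A short Python check on the R1 and R3 rows is given below; the function and parameter names are illustrative, not from the paper.

```python
def relative_improvement(e_ours, e_other, e_lower):
    """Relative improvement in percent: (E_o - E_r) / (E_l - E_r) * 100."""
    return (e_ours - e_other) / (e_lower - e_other) * 100.0


# R1 row of Table 4: E_l = -384, LS-New avg = -339, LS-Mem avg = -326.
ri_r1_mem = relative_improvement(-339, -326, -384)
# R3 row of Table 4: E_l = -385, LS-New avg = -340, LS-Mem avg = -330.
ri_r3_mem = relative_improvement(-340, -330, -385)
print("%.4f %.4f" % (ri_r1_mem, ri_r3_mem))
```

The two values agree with the 22.41% and 18.18% entries of the table to two decimal places. Rows without a lower bound (3on7) have no R.I. value because the denominator is undefined.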
the algorithms. We performed a statistical t-test for independent samples with 95% level of significance to verify
the significant difference in performances. We report the new lowest energy levels (w.r.t. incomplete search
methods) for all twelve proteins. These energy levels are shown in italic-faced font in Table 4.

Relative improvement
In Table 4, we report the relative improvement in column 'R.I.'. Relative improvement of our approach is measured in
terms of the difference with the optimal bound of the energy level. This value is significant because it gets harder
to find better conformations as the energy level of a protein sequence approaches the optimal. We define:
$$\text{Relative Improvement} = \frac{E_o - E_r}{E_l - E_r} \times 100\% \qquad (3)$$
where $E_o$ is the average energy level achieved by our approach, $E_r$ is the average energy level achieved by the
other approach, and $E_l$ is the optimal lower bound of the energy level. The missing values indicate the absence of
any lower bound for the corresponding protein sequence. Similar measurements were also used in [10]. From the
values reported in Table 4, we clearly see that our algorithm produces conformations that are significantly better
in terms of the average energy level achieved.

Search progress
In Figure 2, we show the search progress of three algorithms for the protein sequence R1. The average energy levels
achieved by each of the algorithms over 50 runs are shown. All three algorithms achieve almost the same level of
energy initially, but as the search makes progress, the tabu search and the memory-based search fail to overcome
stagnation. It is clearly evident from the graph that our algorithm continues to improve in the stagnant situations
and thus produces better results.

Effect of the non-isomorphic encoding
The effects of the new non-isomorphic encoding of the protein conformations have been two-fold. Firstly, it
resulted in the reduction of degeneracy, which is evident in the number of discarded conformations during the
search. Secondly, the efficient computation improved the runtime. In the memory-based approach proposed in
[10], the authors used the relative encoding proposed in [32]. When applied with the memory-based algorithm
proposed in [10], our new encoding resulted in more discards and less computation time, as shown in Table 5.
The discarded conformations are the approximate measure of similar conformations encountered during the
search. The experimental results for six proteins are shown in Table 5 for the first one million iterations.

Conclusions
In this paper, we presented a local search algorithm for solving the protein structure prediction problem on the FCC
lattice using the low resolution HP energy model. Experimental results show that our algorithm outperforms the
state-of-the-art algorithms. We used a novel encoding scheme to represent the conformations along with a set of elite
conformations to handle the stagnation of the local search. We believe that use of domain specific heuristics while
selecting the conformations from the elite set can further improve the performance of the algorithm. In future, we
wish to explore that and apply our techniques to higher resolutions and other energy models to see the effect. We
wish to apply our techniques to other domains such as propositional satisfiability and vehicle routing. We believe the
proposed encoding scheme will add efficiency to search techniques such as genetic algorithms.
Figure2SearchProgress.SearchprogressofthreealgorithmsforProteinR1over300minutes.
Declarations
Table 5Effect ofNon-Isomorphic Encoding. Thepublicationcostsforthisarticlewerefundedbythecorresponding
author’sinstitution.
Protein OurEncoding RelativeEncoding
ThisarticlehasbeenpublishedaspartofBMCBioinformaticsVolume14
Seq. runtime #discards runtime #ofdiscards Supplement2,2013:SelectedarticlesfromtheEleventhAsiaPacific
BioinformaticsConference(APBC2013):Bioinformatics.Thefullcontentsof
R1 28.33 354712 39.6 91664
thesupplementareavailableonlineathttp://www.biomedcentral.com/
R2 30.4 406572 42.6 91219 bmcbioinformatics/supplements/14/S2.
R3 25.74 475357 42.6 92765
Authordetails
f180_1 22.90 402738 35.4 103059
1InstituteofIntelligentandIntegratedSystems,GriffithUniversity,
f180_2 27.34 317317 39.0 93814 Queensland,Australia.2QueenslandResearchLaboratory,NationalICTof
f180_3 24.54 358326 37.8 89810 Australia.
Comparisoninruntime(inminutes)andthenumbersofdiscardswhileusing Published:21January2013
ournon-isomorphicencodingandtherelativeencoding[32]forfirst1million
iterationsofthememory-basedalgorithmbyShatabdaetal.[10].
Authors’ contributions
SS conceived the original idea of elite conformations and non-isomorphic encoding. All authors contributed significantly in the implementation, experimentation and writing of the manuscript and approved the final version.

Competing interests
The authors declare that they have no competing interests.

Acknowledgements
We gratefully acknowledge the support of the Griffith University eResearch Services Team and the use of the High Performance Computing Cluster “Gowonda” to complete this research. We also thank NICTA, which is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

References
1. Lau KF, Dill KA: A lattice statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules 1989, 22(10):3986-3997.
2. Klau GW, Lesh N, Marks J, Mitzenmacher M: Human-guided tabu search. Proceedings of the 18th National Conference on Artificial Intelligence 2002, 41-47.
3. Anfinsen CB: Principles that govern the folding of protein chains. Science 1973, 181(4096):223-230.
4. Rotkiewicz P, Skolnick J: Fast procedure for reconstruction of full-atom protein models from reduced representations. Journal of Computational Chemistry 2008, 29(9):1460-1465.
5. Lesh N, Mitzenmacher M, Whitesides S: A complete and effective move set for simplified protein folding. Proceedings of the Seventh Annual International Conference on Research in Computational Molecular Biology 2003, 188-195, RECOMB ‘03.
6. Dotu I, Cebrián M, Van Hentenryck P, Clote P: Protein structure prediction with large neighborhood constraint programming search. Principles and Practice of Constraint Programming Springer; 2008, 82-96.
7. Dotu I, Cebrian M, Van Hentenryck P, Clote P: On lattice protein structure prediction revisited. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011, 8(6):1620-1632.
8. Mazure B, Sais L, Grégoire É: Tabu search for SAT. Proceedings of the National Conference on Artificial Intelligence 1997, 281-285.
9. Battiti R, Tecchiolli G, et al: The reactive tabu search. ORSA Journal on Computing 1994, 6:126-126.
10. Shatabda S, Newton M, Pham DN, Sattar A: Memory-based local search for simplified protein structure prediction. Proceedings of the 3rd ACM Conference on Bioinformatics, Computational Biology and Biomedicine 2012, 345-352, BCB ‘12, ACM.
11. Berger B, Leighton T: Protein folding in the hydrophobic-hydrophilic (HP) is NP-complete. Proceedings of the Second Annual International Conference on Computational Molecular Biology 1998, 30-39, RECOMB ‘98.
12. Kawai H, Kikuchi T, Okamoto Y: A prediction of tertiary structures of peptide by the Monte Carlo simulated annealing method. Protein Engineering 1989, 3(2):85-94.
13. Kapsokalivas L, Gan X, Albrecht AA, Steinhöfel K: Population-based local search for protein folding simulation in the MJ energy model and cubic lattices. Computational Biology and Chemistry 2009, 33(4):283-294.
14. Unger R, Moult J: A genetic algorithm for three dimensional protein folding simulations. Proceedings of the 5th International Conference on Genetic Algorithms 1993, 581-588.
15. Konig R, Dandekar T: Improving genetic algorithms for protein folding simulations by systematic crossover. Biosystems 1999, 50:17-25.
16. Krasnogor N, Hart W, Pelta D: Protein structure prediction with evolutionary algorithms. Proceedings of the Genetic and Evolutionary Computation conference 1999, 1596-1601.
17. Hoque T, Chetty M, Sattar A: Protein folding prediction in 3D FCC HP lattice model using genetic algorithm. IEEE Congress on Evolutionary Computation 2007, 4138-4145.
18. Yue K, Dill K: Forces of tertiary structural organization in globular proteins. Proc Natl Acad Sci USA 1995, 92:146-150.
19. Mann M, Backofen R: CPSP-tools - Exact and complete algorithms for high-throughput 3D lattice protein studies. BMC Bioinformatics 2008, 9:230.
20. Alessandro DP, Dovier A, Pontelli E: A constraint solver for discrete lattices, its parallelization, and application to protein structure prediction. Software-Practice and Experience 2007, 37:1405-1449.
21. Blazewicz J, Dill K, Lukasiak P, Milostan M: A tabu search strategy for finding low energy structures of proteins in HP-model. Computational Methods in Science and Technology 2004, 10:7-19.
22. Klau GW, Lesh N, Marks J, Mitzenmacher M: Human-guided tabu search. Proceedings of the 18th National Conference on Artificial Intelligence 2002, 41-47.
23. Ullah AD, Kapsokalivas L, Mann M, Steinhöfel K: Protein folding simulation by two-stage optimization. In Computational Intelligence and Intelligent Systems Cai Z, Li Z, Kang Z, Liu Y 2009, 138.
24. Ullah AZMD, Steinhöfel K: A hybrid approach to protein folding problem integrating constraint programming with local search. BMC Bioinformatics 2010, 11(S-1):39.
25. Jiang T, Cui Q, Shi G, Ma S: Protein folding simulations of the hydrophobic-hydrophilic model by combining tabu search with genetic algorithms. Journal of Chemical Physics 2003, 119(8):4592-4596.
26. Cebrián M, Dotú I, Van Hentenryck P, Clote P: Protein structure prediction on the face centered cubic lattice by local search. In Proceedings of the 23rd National Conference on Artificial Intelligence. Volume 1. AAAI’08, AAAI Press; 2008:241-246.
27. Yue K, Fiebig K, Thomas P, Chan H, Shakhnovich E, Dill K: A test of lattice protein folding algorithms. Proc Natl Acad Sci USA 1995, 92:325.
28. Shmygelska A, Hoos H: An ant colony optimisation algorithm for the 2D and 3D hydrophobic polar protein folding problem. BMC Bioinformatics 2005, 6:30.
29. Lu H, Yang G: Extremal optimization for protein folding simulations on the lattice. Computers & Mathematics with Applications 2009, 57:1855-1861.
30. Cipra B: Packing challenge mastered at last. Science 1998, 281(5381):1267.
31. Bornberg-Bauer E: Chain growth algorithms for HP-type lattice proteins. Proceedings of the First Annual International Conference on Computational Molecular Biology RECOMB ‘97, New York, NY, USA: ACM; 1997, 47-55.
32. Backofen R, Will S, Clote P: Algorithmic approach to quantifying the hydrophobic force contribution in protein folding. Proceedings of the Pacific Symposium on Biocomputing 2000, 92-103.
33. Hoque T, Chetty M, Dooley LS: Non-isomorphic coding in lattice model and its impact for protein folding prediction using genetic algorithm. Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB) 2006, 1-8, IEEE.
34. Backofen R, Will S: A constraint-based approach to structure prediction for simplified protein models that outperforms other existing methods. Logic Programming 2003, 49-71.

doi:10.1186/1471-2105-14-S2-S19
Cite this article as: Shatabda et al.: The road not taken: retreat and diverge in local search for simplified protein structure prediction. BMC Bioinformatics 2013 14(Suppl 2):S19.