Table Of ContentThe limited role of non-native contacts in folding pathways of a lattice protein
Brian C. Gin,1,2,3 Juan P. Garrahan,4 and Phillip L. Geissler∗1,2
1Department of Chemistry, University of California at Berkeley, Berkeley, California 94720
2Chemical Sciences and Physical Biosciences Divisions,
Lawrence Berkeley National Lab, Berkeley, California 94720
3School of Medicine, University of California at San Francisco, San Francisco, California 94143
4School of Physics and Astronomy, University of Nottingham, Nottingham, NG7 2RD, U.K.
Models of protein energetics which neglect interactions between amino acids that are not
adjacent in the native state, such as the Go¯ model, encode or underlie many influential ideas on
protein folding. Implicit in this simplification is a crucial assumption that has never been critically
evaluated in a broad context: Detailed mechanisms of protein folding are not biased by non-native
contacts,typicallyimaginedasaconsequenceofsequencedesignand/ortopology. Herewepresent,
9 using computer simulations of a well-studied lattice heteropolymer model, the first systematic test
0 of this oft-assumed correspondence over the statistically significant range of hundreds of thousands
0 of amino acid sequences, and a concomitantly diverse set of folding pathways. Enabled by a
2 novel means of fingerprinting folding trajectories, our study reveals a profound insensitivity of the
order in which native contacts accumulate to the omission of non-native interactions. Contrary
n
to conventional thinking, this robustness does not arise from topological restrictions and does not
a
J dependonfoldingrate. Wefindinsteadthatthecrucialfactorindiscriminatingamongtopological
pathways is the heterogeneity of native contact energies. Our results challenge conventional
1
thinkingontherelationshipbetweensequencedesignandfreeenergylandscapesforproteinfolding,
2
and help justify the widespread use of Go¯-like models to scrutinize detailed folding mechanisms of
real proteins.
]
M
Keywords: Go¯ Model, Non-Native Contacts, Lattice Model, Protein Folding, Principle of
B
Minimum Frustration, Energy Landscape
.
o
i
b I. INTRODUCTION The G¯o model was originally proposed as a schematic
- but microscopic perspective on the stability and ki-
q
netic accessibility of proteins’ native states. It accord-
[
Current understanding of protein folding has been inglyprovidedgenericinsightintoissuesofcooperativity,
2 strongly shaped by theoretical and computational stud- nucleation, and the relationship between sequence and
v ies of simplified models1. Such models are typically con- structure1. Recent studies have ascribed a much more
1 structed by discarding fine details of molecular structure
literal significance to the detailed dynamical pathways
3
or by making simplifying assumptions about the ener- definedbyG¯o-likemodels5. Inparticular,directcompar-
2
gies of interaction among amino acid residues. A spe-
2 isons have been drawn between folding mechanisms pre-
cialclass ofmodels, basedonG¯o’s insights2, assertsthat
. dicted by G¯o-like models for specific proteins and those
8 onlyasubsetofinteractions,thosebetweensegmentsofa suggested by experimental results16,17,18. However, it is
0
protein that contact one another in the native state, are
8 not clear to what extent such a detailed correspondence
crucially important for folding. The G¯o model further
0 with G¯o-like models should be expected. General the-
: assumes a unique energy scale for these native contacts. ories offer only rough guidance, and few computational
v Here, we will focus on elaborated “G¯o-like” models that
studies have compared folding pathways of G¯o-like mod-
i
X allow for a diversity of native contact energies. els and their “full” counterparts (in which non-native
r Neglect of non-native contacts offers substantial com- contact energies are included) in a broad context19.
a
putational relief to numerical simulations, allowing thor- Veryfavorableinteractionsbetweensegmentsofapro-
ough kinetic and thermodynamic studies to be per- teinthatarenotadjacentinthefoldedstategenerallyim-
formed even for detailed molecular representations3,4,5,6. pedefolding. Theymightdosobyintroducingdetoursor
It further establishes a basis for theories that fo- traps on the route to the native state, or simply by sta-
cus on gaps in the spectrum of conformational bilizing the ensemble of unfolded conformations20,21,22.
energies7,8 and the funnel-like nature of potential energy It is often imagined that the former possibility plagues a
landscapes9,10,11,12,13. Corroboratedbyexperiment,con- vastmajorityofnon-naturalaminoacidsequences,which
cepts intrinsic to and inspired by G¯o-like models now fold sluggishly if at all23,24. According to this picture,
form a canon of widely accepted ideas about how pro- non-native contacts should feature prominently in the
teins fold1,14,15. convoluted folding pathways of an undesigned sequence.
Suchkineticfrustrationcouldposeseveralbiologicalrisks
in vivo,whereaggregationandslowresponsecanbeseri-
ous liabilities. Indeed, typical proteins taken from living
∗Correspondingauthor. E-mail: [email protected] organisms fold reliably and with relative efficiency25.
2
These notions and observations motivate a “principle comparison of fast-folding pathways in the two models
of minimum frustration” asserting that natural amino would not be especially informative. In that case the
acidsequenceshavebeen“designed”byevolutiontomin- sequence of events that advance a molecule toward the
imize the disruptive influence of non-native contacts on native state (which we designate as its folding mecha-
the dynamics of folding9. One might thus apply G¯o-like nism) would be exclusively a question of geometry and
modelstothesedesignedsequenceswithconfidence,since local mobility. We have found, to the contrary, that a
the omitted interactions are precisely the ones whose ef- wealth offolding mechanisms arepossible even fora sin-
fects have been mitigated by natural selection. By con- gle native conformation.
trast, one might expect G¯o-like models to poorly rep- Spanning a range of hundreds of thousands of se-
resent folding mechanisms of slowly folding molecules, quences, with widely varying rates and mechanisms, the
whose non-native interactions are presumably responsi- work reported in this paper constitutes a thorough test
ble for hampering pathways to the native state21,22. ofcertainaspectsoftheprincipalofminimumfrustration
Testingtheseideasofsequencedesignandkineticfrus- andaddressesatanewlevelofkineticdetailthedynam-
tration is made difficult by several factors. Experimen- ical realism that can be expected from G¯o-like models.
tally,microscopicdetailsoffoldingkineticscannotbere- Ourresultsforthelatticeheteropolymermodelevidence
solved but only inferred from indirect observables or the aremarkablystrongmechanisticcorrespondencebetween
effectsofmutations. Furthermore,themostconcretehy- full and G¯o-like models. Unexpectedly, this dynamical
potheses stemming from the principle of minimum frus- conformity holds not only for fast-folding sequences but
tration involve G¯o-like models, which cannot be real- also for the slowest sequences whose folding can be fol-
ized in the laboratory. Computer simulations of detailed lowedinpractice. Closecorrespondenceinfoldingmech-
molecular representations can generate, at great cost, anismsholdsaslongastheG¯o-likeapproximationretains
dynamical information sufficient to determine a folding the heterogeneity in native contact energies of the full
mechanism for only the smallest of natural proteins26. potential. These findings suggest a profound frustration
Although the statistical dynamics of coarse-grained or invariance in the ensemble of trajectories that proceed
schematic representations can be readily explored, biol- from deep within the unfolded state all the way to the
ogy does not provide collections of fast-folding and slow- native structure.
folding sequences to compare in these artificial contexts.
Finally, even when appropriate ensembles of sequences
and ensembles of folding trajectories are available, use- II. METHODS
fulcomparisonofG¯o-likemodelsanditsfullcounterpart
requires a compact way of characterizing the course of We focus on lattice heteropolymers, whose folding
highly chaotic dynamics27. A general method for this properties have been studied extensively for specific ex-
purpose is not available, though studies of nucleation amplesequences,structures,andchainlengths1,30. Here,
as a rate-limiting fluctuation provide a useful starting a protein’s conformation is described by a self-avoiding
point28,29. walk on a three dimensional lattice with spacing a (see
This paper presents the first systematic, large-scale for example Fig 1a). Each vertex of this walk represents
comparison of folding pathways within G¯o-like and full an amino acid monomer, which possesses no internal
models. We focus on a schematic lattice representation structureandinteractsonlywith“contacting”monomers
of proteins, well-suited for this task in several ways: (a) that occupy adjacent vertices. For a chain comprising N
geometrically, because contacting segments of the chain monomers the energy of a particular configuration can
can be unambiguously identified, (b) statistically, be- thus be written
causerepresentativeensemblesoffoldingtrajectoriescan
N−1 N N−3 N
be generated for large numbers of amino acid sequences, (cid:88) (cid:88) (cid:88) (cid:88)
E = u (r )+ B ∆(r −a), (1)
core ij ij ij
and (c) conceptually, because the essential competition
i=1 j>i i=1 j=i+3
betweencontactenergeticsandchainconnectivitycanbe
isolatedfromcomplicatingeffectsofsecondarystructure, where r = |r −r |. The hard-core potential u (r),
ij i j core
side-chainpacking,etc. Whiletheselattereffectsunques- which takes on values of ∞ for r = 0 and 0 for r >
tionably bear in important ways on the folding of real 0, imposes the constraint of self-avoidance. Interaction
proteins, it is nevertheless imperative to understand the energies B are determined by the sequence-dependent
ij
fundamental physical scenarios they enrich and modify. identities of monomers i and j according to the model of
Indeed, much of biologists’ working intuition for protein MiyazawaandJernigan31 (MJ),andactonlyataspatial
folding and design was developed in the context of sim- separation of one lattice spacing [∆(x) = 1 if x = 0 and
ilarly schematic models. Our results challenge some of vanishes otherwise].
those notions. Thestandarddynamicalrulesforevolvingsuchachain
It has been conjectured that well-designed lattice het- molecule proceed from a Metropolis Monte Carlo algo-
eropolymers fold through mechanisms that are deter- rithm. Trial moves, in which one or two randomly se-
mined solely by their native structures25. Were this lected monomers move in an “edge-flip” or “crankshaft”
hypothesis correct, for both full and G¯o-like models, a fashion, are accepted with probabilities that generate a
3
FIG. 1: (a) 48-mer native structure of the lattice heteropolymer studied in this work. (b) Example of histograms of the order
ofpermanentformationofnativecontacts(contactappearanceorder,orCAO)foreachoftheninenativecontactsofa12-mer
latticestructure. Histogramsarecollectedfromthesetoffoldingtrajectoriesofagivenaminoacidsequence. (c)SameasFig.
1b but shown as a density map. (Right, Top) CAO maps of three fast folding sequences of the 48-mer (Fig. 1a), for both the
full potential energy and the Go¯-like approximation (which disregards non-native contact energies, but maintains the original
heterogeneityinnativecontactenergies). Theoverlapparameterq quantifiesthesimilarityofCAOmaps,andthustopological
folding pathways. The overlap of CAO maps between full and Go¯-like potentials for each sequence is close to one, q ≈ 0.9,
indicating the similarity of their folding mechanisms. In contrast, the overlap between CAO maps of different sequences is
<
muchsmaller,q∼0.2. (Right,Bottom)Sameasbeforebutnowforthreeslowfoldingsequences. Again,theCAOdistributions
of full and Go¯-like potentials are very similiar, while those between different sequences are not.
Boltzmann distribution at temperature T = 0.16(cid:15) /k , chains of modest length (say, N = 27), the number of
0 B
where (cid:15) sets the energy scale of the MJ model. For ex- possible conformations is sufficiently immense to moti-
0
ample, the strongest attractive interaction (between two vate Levinthal’s paradox, i.e., it is not obvious that they
cysteines) has an energy (cid:15) =−1.06(cid:15) ; for lysine-lysine should be able fold at all. Folding occurs in a coopera-
CC 0
(cid:15) = 0.25(cid:15) . Folding trajectories are initiated from tive fashion, and occurs efficiently only for well-designed
KK 0
swollen configurations drawn from a high-temperature sequences. For a given sequence certain residues figure
(k T/(cid:15) = 100) equilibrium distribution in which con- much more prominently in folding kinetics than others;
B 0
tact energies are negligible compared to typical thermal correspondingly, certain residues are more highly con-
excitations. served than others in computer simulations of evolution-
This caricature clearly lacks many of the chemical de- ary dynamics.
tails underlying the function and secondary structure of
real proteins. By capturing an essential interplay be-
tween diverse local interactions and constraints of poly-
mer connectivity, it nonetheless recapitulates many non- The G¯o-like approximation of the model of Eq. (1) is
trivial features of protein statistical mechanics: Even for constructedsimplybyignoringtheenergiesofnon-native
4
contacts, orderofacontact’sappearancecorrelatesstronglywitha
statistical measure of commitment to folding at the time
E˜ =N(cid:88)−1(cid:88)N u (r )+N(cid:88)−3 (cid:88)N N B ∆(r −a), (2) when that contact forms permanently. We use the pa-
core ij ij ij ij rameter p , the probability that a trajectory initiated
fold
i=1 j>i i=1 j=i+3 fromagivenconfigurationwillreachthefoldedstatebe-
fore first relaxing to a state with few native contacts33,
where N = 1 if the monomers i and j are adjacent in
ij to demonstrate this fact. Fig. 3c shows that the average
the native configuration, and N = 0 otherwise. While
ij value of p rises steadily with CAO, from a value well
disregarding the energy contribution of non-native con- fold
below p =1/2 up to p =1.
tacts, the energy function E˜ of Eq. (2) retains the full fold fold
Thepointatwhichp crosses1/2isoftenconsidered
heterogeneity in native contacts energies of the original fold
the transition state for folding. The set of contacts con-
potential,Eq.(1). Wewillshowbelowthatitisacrucial
sistenlypresentinsuchconfigurationsiscorrespondingly
aspect of the G¯o-like models we study here.
designated as the folding nucleus. We have confirmed
Many studies previously suggested that lattice het-
thatthenucleusidentifiedinthiswaycorrespondsclosely
eropolymers of modest length fold via a nucleation
mechanism28,29. Formation of a handful of key contacts with the set of contacts that have formed permanently
whenp =1/2. Additionally,wehaveverifiedthatthe
poises the system at a transition state, from which the fold
CAO-identified nucleus of several sequences from Mirny
chain can rapidly access the folded state or, with equal
et al.25 are consistent with the nucleus identified in that
probability,returntotheunfoldedstate. Thissetofcru-
study. While this consistency check reflects favorably on
cial contacts comprises a “folding nucleus” and serves as
the soundness of exploring folding mechanisms by scru-
a bare synopsis of dynamical pathways that lead to the
tinizing CAOs, it does not imply that CAO analysis is
native state.
predicated on putative nucleation mechanisms for fold-
A cogent comparison of folding mechanisms requires a
ing. Regardless of whether the rate-determining steps
meansofcharacterizingdynamicalpathwaysthatisboth
in folding are uphill, downhill, or neutral in free energy;
thorough and computationally inexpensive. Identifying
regardless of whether folding is kinetically a two-state
the folding nucleus satisfies neither or these necessities
phenomenon; regardless of whether the progress of fold-
well. In particular, locating configurations from which
ing is plagued by long-lived kinetic traps, CAOs trace
the folded and unfolded states are equally accessible in-
a history of conformational change that emphasizes any
volves propagation of many trajectories and, by itself,
event with enduring topological consequences.
doesnotdelineateroutestowardandawayfromthetran-
sition state32. We have devised an alternative measure What CAOs do not resolve is the unproductive devel-
that is not only succinct and computationally tractable, opment of native structure. Attention is focused solely
but also characterizes the entire route from the unfolded onsegmentsoftimeevolutionthatbridgefoldedandun-
to the folded state. Specifically, we record the order in foldedbasinsofattraction. Occasionalexcursionswithin
which native contacts form permanently during a pro- the unfolded state amass an atypically large number of
tein’s folding mechanism. Our parameters thus chroni- native contacts, but due either to topology or to the
clelastingchangesinthechain’s“topology”,understood presence of interfering non-native contacts do not in fact
in terms of linkages through the polymer backbone and makeprogresstowardfolding. CAOscontainnoinforma-
through non-bonded contacts. tion about these excursions. In comparing full and G¯o-
This contact appearance order (CAO) is a highly non- like models, we therefore make no statements about the
trivial measure of the progress toward folding and pro- character of such non-folding dynamics. By exclusively
vides a detailed characterization of mechanism in the examining accumulation of native contacts, we also lose
sense we have defined. It is simple to calculate from direct information regarding the evolution of non-native
the time-dependence of a trajectory spanning unfolded contacts. If the rupture of a particular non-native con-
and folded states. Like persistence times34 in the con- tact were a crucial step in folding of a certain sequence,
text of non-equilibdium systems, such as glasses, it is in- our methods would not detect its occurrence explicitly.
trinsically a multi-time quantity; it can neither be com- Westress,however,thatsubstantialnon-nativestructure
puted for a single configuration, nor can it be used to is present when the first permanent native contacts are
build constrained ensembles whose statistics shed light formed. We could therefore indirectly detect the signifi-
on the nature of reaction coordinates. But, also like per- cance of non-native contact dynamics through influences
sistence times34, it focuses attention on key dynamical on the pattern of early topological changes.
events with unmatched precision. For our purpose of di- Compiling the order of permanent contact formation
agnosing the occurrence of lasting topological changes, over many folding trajectories of a given sequence, we
CAOs serve almost ideally. For some other approaches, constructforeachnativecontactastatisticaldistribution
e.g., surveying the free energy landscapes on which fold- ofCAO.Fig. 1b,cillustratehowthesetofresultingCAO
ing takes place, CAOs would serve poorly. histograms form a visual fingerprint of a sequence’s fold-
We have verified that the mechanistic meaning we as- ing mechanism. Because the dynamical events it chroni-
cribetoCAOsisconsistentwithmoreconventionalchar- clesspanawiderangeofp , aCAOhistogramcharac-
fold
acterizations of reaction progress. Most importantly, the terizes not only the transition state for folding, but also
5
FIG. 2: (a) Distribution of CAO overlaps, P(q), between different sequences, and between full and Go¯-like potential, for 1000
sequenceschosenrandomlyoutof105 sequencesthatfoldtothe48-merstructureofFig. 1a. Thesequencesinthisdistribution
were generated by a single high T evolutionary trajectory (see Appendix). The inset shows that the similarity between full
ev
and Go¯-like pathways for each sequence is independent of folding rate. Data for this inset was generated from 2000 sequences
chosen randomly from 5 independent evolutionary runs (5×105 total sequences), all folding to the native 48-mer structure of
√
Fig. 1a. (b)Distributionoftheroot-mean-squaredfluctuationsofcontactorder, δC,overthesetofG¯o-likesequences. CAOs
in heterogeneous Go¯-like potentials vary less from one folding trajectory to another than in the homogeneous Go¯ model. It is
the heterogeneity in native contact energies that selects specific folding pathways; this selectivity is absent in a homogeneous
Go¯ potential. The inset shows the CAO map of the homogeneous Go¯ potential, cf. Fig. 1. (c) Average p as a function of
fold
number of permanent native contacts formed, for the full and Go¯-like potentials, for a fast and a slow folding sequence. In all
casesp isclosetozerountilthefirstpermanentcontactsaremade,confirmingthatourCAOanalysiscapturestherelevant
fold
dynamical folding regime. p is the probability for a given conformation to reach the folded state before unfolding. For a
fold
givenfoldingtrajectory,wecalculatep accordingtothemethodofFaiscaetal.33,byrunningindependenttrajectoriesfrom
fold
configurations chosen at evenly-spaced time intervals. We regard a molecule as unfolded when the instantaneous number of
native contacts drops to a value consistent with the average number of native contacts in the unfolded state. Additionally, we
require that this threshold lie below any value found in equilibrium fluctuations of the native state.
the dynamics of ascent to and descent from the transi- others have proposed that such variations are weak, i.e.,
tion state. The correspondence between an amino acid that topology of the folded structure prescribes a nearly
sequence and its CAO histogram is as subtle as (if not unique topological route for folding. Using methods de-
more so) the connection between sequence and native scribed in the Appendix, we have generated an unprece-
conformation that defines some of the most challenging dentedly diverse set of sequences that fold to the same
aspects of the protein folding problem. Most of the re- target structure within the full model. As shown in Fig.
sults we will present concern a single native structure 1 variations in CAO statistics within this set are much
(that shown in Fig. 1a for N = 48), removing a po- more substantial than previously thought. Any success
tentiallytrivialagreementbetweenfullandG¯o-likemod- ofG¯o-likemodelsinreproducingfoldingpathwaysofthe
els. Even for this unique structure, sequences of the full full model cannot be attributed simply to their sharing
model differing by only a few point mutations can ex- a common native structure.
hibit qualitatively different CAO histograms, reflecting We quantify similarity of CAO statistics (for two se-
substantial changes in folding pathway. The distribution quences within the same model, or for full and G¯o-like
of contact energies can thus play a critical and complex models with the same sequence) using an “overlap” pa-
role in determining folding mechanism, over and above rameter q35. Inspired by the theory of spin glasses, we
dictating its endpoint. Given this nontrivial relationship define q such that 0 ≤ q ≤ 1, with larger q representing
itwouldbesurprisingifnon-nativecontactsdidnotgen- greater similarity. The analogy with spin glasses would
erally act to shape or bias CAO statistics. assign an overlap q(α,β) between the CAO distributions
for two sequences α and β proportional to
TheprimarygoalofthispaperistocomparetheCAO
statistics of sequences propagated using full and G¯o-like
1 n(cid:88)maxnm(cid:88)ax−1
models. In judging their similarities and differences, it is P(α)(C)P(β)(C), (3)
essentialtoestablishforreferencehowsignificantlyCAO nmax n n
n=1 C=1
histograms can vary, within either model, for sequences
that fold to a common structure. As mentioned above, whereP(α)(C)istheprobabilitythatnativecontactnis
n
6
made permanently at order C in a folding trajectory of structures, not between different sequences that adopt
sequence α, and n is the total number of native con- them. In loose terms folding dynamics of the homoge-
max
tacts. An accurate numerical estimate of the quantity in neousG¯omodelresembleasuperpositionofthosewede-
Eq. (3), however, is problematic to obtain, requiring the terminedfordiversesequencesofthefullmodel. Whereas
generation of an inordinate number of folding trajecto- in the full model a typical set of contact energies selects
ries. Asanalternative,wedefineq usingacloselyrelated a well-defined folding pathway, an egalitarian set of sta-
quantity, bilizing energies permits broad sampling of routes to the
native state.
q(α,β) = 1 n(cid:88)max(cid:118)(cid:117)(cid:117)(cid:116)2(cid:32) σn(α)σn(β) (cid:33) enGer¯og-ielisk,ehmowodeveelsr,thcaaptteumrebrtahceetvoaproielotygicinalnpaattivhewcaoynstfaoclt-
nmax n=1 (σn(α))2+(σn(β))2 lowed by their full model counterparts with striking ac-
(cid:16) (cid:17)2 curacy. CAO histograms obtained from full and G¯o-like
(cid:104)C(cid:105)(α)−(cid:104)C(cid:105)(β)
n n dynamics for any particular sequence can hardly be dis-
×exp− , (4)
(cid:16) (cid:17)2 (cid:16) (cid:17)2 tinguished, see Fig. 1. Not only are the average CAOs
σ(α) + σ(β)
n n of each contact nearly equivalent, but also fine details of
CAO statistics are unaffected by neglect of non-native
where (cid:104)C(cid:105)(nα) = (cid:80)nnm=a1xPn(α)(C)C is the average CAO contact energies. While previous work hypothesized a
of contact #n for sequence α and (σ(α))2 = dynamical correspondence for fast folders, the topologi-
n
(cid:80)nmaxP(α)(C)(C −(cid:104)C(cid:105)(α))2 is its variance. Equations calconformityoffullandG¯o-likemechanismsweobserve
n=1 n n for slow folders is highly unexpected.
(3)and(4)arecompletelyequivalentinthecaseofGaus-
sian distributed CAOs. Even for non-Gaussian statis- For sequences with folding rates <∼10−9, we are un-
tics, q(α,β) remains a useful, computationally tractable, able to harvest folding trajectories in sufficient numbers
and similarly bounded measure of how similarly two se- to construct CAO histograms. According to microscopic
quences fold. reversibility, however, topological routes for folding are
identical to time-reversed routes of unfolding. We have
thereforeextendedouranalysisofcontactappearanceor-
III. RESULTS AND DISCUSSION derforefficientlyfoldingsequencestooneofcontactdis-
appearance order (CDO) for very sluggishly folding se-
In the ensemble of sequences we generated, the fastest quences. The agreement between CDO histograms of
folding sequences access the native state more than 1000 full and G¯o-like models is no less striking than that of
times more rapidly than the slowest. CAO histograms the CAO histograms plotted in Fig. 1, even in cases
weregeneratedforallsequences,eachoneevincingawell- where the “native” state is grossly unstable. These cal-
defined topological pathway. Typically, the appearance culationsaresomewhatlessstraigthforward: theorderof
order C of a given native contact varies from one trajec- first disappearance (CDO) is equivalent to the order of
torytoanotherbyonlyafewpositions(seebelow). This permanent appearance (CAO), but only for trajectories
regularity belies substantial conformational fluctuations reaching the unfolded state without revisiting the native
attending each folding event, which exert little influence state. As such, they require specifying when a molecule
on the formation of permanent contacts. Sharply peaked has unfolded. For this purpose, we regard a molecule as
CAOhistogramsdonotindicatealackofcomplexity,but unfolded when the instantaneous number of native con-
instead a successful characterization of forward progress tactsdropstoavalueconsistentwiththeaveragenumber
along the reaction coordinate for folding. ofnativecontactsintheunfoldedstate. Additionally,we
Figure 1 shows CAO histograms for several sequences require that this threshold lie below any value found in
folding to this specific 48-mer structure (depicted in Fig. equilibrium fluctuations of the native state. We have
1a). Results are presented for dynamics propagated ac- verified that CAO and CDO histograms indeed match
cording to both full and Go¯-like models. Comparing for sequences folding at moderate rates.
these topological fingerprints across different sequences Quantitativemeasuresofmechanisticdiversityarepre-
hints at the broad variety of possible folding pathways. sentedinFig. 2a. Foreachpairofsequencesgeneratedby
Contacts essential to early stages of folding for one se- our evolutionary simulation we computed the similarity
quencecanbeirrelevantinthepathwaytakenbyanother. parameterq betweenCAOhistogramsforthefullmodel.
This finding contrasts strongly with the “one-structure The resulting distribution of q values is broadly peaked
one-nucleus”hypothesis, bolsteringrecentreportsofdis- at q ≈0.4, signifying that there is a significant diversity
similar folding nuclei29. ofCAOpathwaysrepresentedbythesequencesintheen-
Strong variations in the topological folding pathways semble. For each individual sequence we also quantified
chosen from one sequence to another immediately in- the relationship between CAO histograms generated by
dicate that the original homogeneous G¯o model27 can- full and G¯o-like models. These q values are distributed
not capture the folding behavior of a typical sequence. much more narrowly about a considerably higher aver-
With a homogeneous set of native contact energies, that age, q ≈ 0.9. Using sequence-to-sequence variation in
model can only discriminate between different native CAO pathways as a yardstick, the irrelevance of non-
7
FIG. 3: (a) Number of native contact as a function of time in a folding trajectory, illustrating the “prefolding” (blue) and
“folding”(red)phasesofthedynamics. Theprefoldingphaseextendsfromthefoldingtrajectory’sstarttimeuntilthetimethe
first permanent native contact is formed. The folding phase extends from this time to the time when the native conformation
isreached. Thefull(green)curveshowsthep ,whichonlydepartsfromzeroafterthefoldingphasehasstarted(cf. Fig. 2).
fold
(b, right panels) Distribution of the duration of the prefolding and folding phases, in the full potential and its corresponding
Go¯-likeapproximation. Forfast-foldingsequences(toppanel)thedistributionsforbothfoldingandprefoldingdurationsofthe
Go¯-like model are close to those of the full potential. For slow-folding sequences (middle panel) the Go¯-like model reproduces
thedistributionoffoldingduration,butunderestimatestheprefoldingtimes. IftheGo¯-likepotentialofslow-foldingsequences
is supplemented by random non-native contact energies (bottom panel) the prefolding distributions can be made to mach,
without disrupting the correspondence in the folding phase distributions. (c) Ratio between full and Go¯-like models’ folding
(red) and prefolding (blue) phase durations, for all sequences ordered according to folding rate; the full lines are the average
ratios for each scatter plot. For fast folders, the average times as calculated from the full and Go¯-like models are comparable,
bothforthefoldingandprefoldingphases. Forslowfolders,theprefoldingtimeinGo¯-likemodelismuchsmallerthanthatin
the full potential, and this difference increases with decreasing folding rate.
nativecontactsforthetopologicalfoldingpathwayisbe- The relevance of CAOs for the folding dynamics is il-
yond doubt. The inset to Fig. 2a emphasizes that this lustrated in Fig. 2c. For two sequences and their G¯o-like
result has little to do with folding efficiency. Typical q approximations, it plots p 32,36 as a function of the
fold
valuesforthefull/G¯o-likecomparisonarejustashighfor totalnumberofpermanentnativecontactsformed,aver-
the slowest folders examined as for the fastest. aged over 200 folding trajectories. p gives the prob-
fold
Figure 2b quantifies the variation of CAO between ability for trajectories initiated from a particular config-
folding trajectories. For each sequence we quantify the uration to fold completely before visiting the unfolded
root mean-squared fluctuation in the contact order: state, and provides a standard basis for defining transi-
tion states in complex systems32,36. Fig. 2c shows that
δC = 1 n(cid:88)max(cid:112)(cid:104)C2(cid:105)−(cid:104)C (cid:105)2. (5) pfold (cid:28) 1 when the first permanent contact is formed.
nmax n n Since pfold = 1 by definition when the last permanent
n=1 contact is formed, CAO histograms chronicle nearly the
entire course of folding dynamics, all the way from the
Fig.2bshowsthedistributionofδC amongtheensemble
unfoldedbasinofattraction(p =0)tothenativestate
ofG¯o-likesequences. ItispeakedatavalueofδC ≈7.5. fold
(p =1).
In contrast, for the homogeneous G¯o model δC ≈ 12.5, fold
indicating that CAO values are much more broadly dis- Insensitivity of topological folding pathways to non-
tributed between trajectories (see inset to Fig. 2b). The native contact energies by no means implies a complete
homogenous G¯o model indeed lacks the pathway speci- dynamical equivalence of full and G¯o-like models. For
ficity exhibited when contact energies are diverse, as in example, a sequence’s mean first passage time for fold-
heterogeneous G¯o-like models. ing can differ by as many as three orders of magnitude
8
FIG.4: TheCAOcorrespondencebetweenthefullpotentialandtheGo¯-likeapproximationisrobusttochangesinchainlength
or target native structure. (Left) CAO maps is a 12-mer folding to the structure shown in the figure. (Center) A sequence
of the 48-mer of Fig. 1 which has a secondary stable configuration. Each target structure defines a G¯o-like approximation
from the set of their native contacts. Each G¯o-like model predicts accurately the CAO map for folding to the corresponding
structure. (Right) Correspondence of Go¯-like/full CAO maps in a 64-mer.
for full and G¯o-like models. This discrepancy is larger permanentcontactismade,bylessthananorderofmag-
for sequences with slower folding rates. Such discrep- nitude. By contrast, pre-folding dynamics of poorly de-
ancies may be due to the presence of off-pathway traps signedsequencesarequitesensitivetonon-nativecontact
in the unfolded state, and possibly non-native stabilized energies. For the example shown in the middle panel of
intermediates along the folding pathway. However, our Fig. 3b, the waiting period prior to formation of a single
calculations suggest that such marked distinctions are permanent contact is roughly three orders of magnitude
largely limited to dynamics occurring before the value of longerinthefullmodelasintheG¯o-likemodel. Nosuch
the committor function p increases significantly from dilationisobservedforsequencesthatfoldquicklyinthe
fold
zero,i.e. beforesignificantprogresshasbeenmadealong full model.
the folding reaction coordinate.
Because contact appearance order is a sensitive mea-
AsillustratedinFig.3a,wecandivideeachfoldingtra- sure of approach to the dynamical bottleneck for fold-
jectory into a period before any permanent contacts are ing, our division of pre-folding and folding phases is a
made(the“pre-foldingphase”)andtheremainingperiod kinetically meaningful one. Most importantly, p (cid:28) 1
fold
in which lasting native structure develops (the “folding throughout pre-folding dynamics as seen in Fig. 3a, in-
phase”). Note that this division takes place well before dicating that the system remains well within the un-
a molecule commits to the folded state (p > 1/2); folded basin of attraction. Only when permanent con-
fold
indeed, the number of non-native contacts at the begin- tacts are made does p rise significantly, so that the
fold
ning of the folding phase is typically comparable to that folding phase encompasses entirely departure from the
of the unfolded state. Fig. 3b shows the distributions unfolded state and transit to the native structure. It is
of pre-folding and folding phases’ durations for two se- remarkablethatnon-nativecontacts, whichcansubstan-
quences representative of fast and slow folders. In both tially prolong dwell times in the unfolded state, exert
cases the influence of non-native contacts on the folding no discernible influence on the topological folding order,
phase dynamics is weak. Non-native contacts mildly ex- and only a small effect on the duration of folding phase
tend the time required to complete folding after the first dynamics.
9
Our simulations suggest that progress toward the na- folded states is essentially determined by native energies
tive state is essentially orthogonal to the formation and alone, despite the fact that substantial non-native struc-
rupture of non-native contacts. A number of such con- ture must be disrupted en route.
tactsarecertainlypresentovermuchofthecourseoffold- Lattice heteropolymers are perhaps the crudest rep-
ing,buttheydolittletodecidewhatconformationalrear- resentation of protein mechanics to which our analysis
rangementsbringachainclosertoitstransitionstatefor could be meaningfully applied. The correspondence be-
folding. To further test this idea, we studied folding dy- tween full and G¯o-like folding mechanisms we have re-
namicsgovernedbypotentialenergyfunctionsthatcom- vealed might break down in more detailed models. For
bine aspects of full and G¯o-like models. Specifically, we example, it has been reported that lattice heteropoly-
selected a set of non-native contact energies at random mersdonotexhibitglassyfoldingdynamicsevenatvery
from a Gaussian distribution, see Fig. 3b. The “frus- low temperatures, while non-Arrhenius temperature de-
trating” influence of these random energies match pre- pendence naturally arises in slightly elaborated models
cisely the behavior we have reported for the full model: that describe side chain packing in addition to back-
CAO histograms are completely insensitive to the aver- boneconformation24. G¯o-likeenergeticscouldalterfold-
agestrengthandvarianceofnon-nativeattractions,while ing pathways by abating the frustration underlying such
overall folding rates decrease with increasing non-native glassy relaxation. This possibility, which merits further
attraction strength. investigation, does not however negate the significance
The observation of correspondence between dynam- of our findings. Our primary purpose is not to justify
ics of the full lattice model and that of a heterogeneous the use of G¯o-like models for detailed study of real pro-
G¯o-like approximation does not noticeably depend upon teins’ folding mechanisms. It is instead to establish the
chain length or on details of native structure. We have influenceofnon-nativeinteractionsondynamicsintrinsic
generatedsequenceswitharangeoffoldingratesforsev- to the fundamental interplay between chain connectivity
eralnativeconformationsofchainswithlengths8,12,48, and heterogeneous contact interactions. That interplay,
and 64. For the two shortest chains, we used each max- whose understanding is central to any instructive physi-
imally compact lattice structure as a folded state. For calpictureofproteinfolding,isnotjustpresentinsimple
the two longest chains, we studied several native struc- lattice models – it is the exclusive source of their com-
turesvaryingsignificantlyincompactnessandincontact plexity. Theresultswehavepresentedthereforeestablish
order37. Typical results shown in Fig. 4 highlight that an important point: Mechanistic aspects of protein fold-
the fidelity of G¯o-like folding mechanisms is a very gen- ing that arise from the basic physics of heteropolymer
eral feature of these lattice heteropolymers. freezing are remarkably insensitive to non-native struc-
ture.
IV. CONCLUSIONS
Acknowledgments
Several arguments have been presented in the liter-
ature to justify the use of G¯o models in studying the
WewishtothankD.Chandler,J.Chodera,K.DuBay,
folding mechanisms of real proteins. Most commonly as-
R. Jack, and S. Whitelam for useful discussions, and W.
serted (based on the principle of minimum frustration)
Eaton,E.Shakhnovich,andA.Szaboforcriticalreadings
is that evolutionary optimization of real sequences re-
of the manuscript.
moves kinetic barriers and renders the energy landscape
This research used resources of the National Energy
smoothlyfunneledandthereforeG¯o-like11,15. Biasesdue
Research Scientific Computing Center, which is sup-
totopologicalfeaturesofthenativestate,unchangedina
portedbytheOfficeofScienceoftheU.S.Departmentof
protein’s G¯o-like represention, have also been invoked to
Energy under Contract No. DE-AC02-05CH11231. This
justifymechanisticfidelity38,39. Ourresultsdemonstrate,
work was supported by the Director, Office of Science,
however, that neither of these assumptions need hold for
Office of Basic Energy Sciences, Chemical Sciences and
aG¯o-likemodeltoreproduceinfinedetailthetopological
Physical Biosciences Divisions, of the U.S. Department
ordering of folding events of a lattice heteropolymer.
of Energy under Contract No. DE-AC02-05CH11231. In
Robustness of the detailed mechanism for folding to
carrying out this work JPG was supported by EPSRC
omission of non-native contacts is not a consequence of
grant GR/S54074/01.
sequence design within the schematic lattice models we
have studied. It is a fundamental emergent feature of
their statistical dynamics, independent of folding effi-
ciency over the entire range accessible to our numerical V. APPENDIX
simulations. Rather than introducing kinetic roadblocks
thatreshapetransitionstatesforfolding,energeticdiver- Ourmethodofsequencegeneration,whicheffectsabi-
sionsduetonon-nativecontactsappeartostronglyaffect ased random walk in the space of all possible sequences,
only physical properties of the unfolded state. Even the is an extension of the method of Mirny et al.25. To gen-
durationoftrajectorysegmentsthatspanfoldedandun- erate ensembles of sequences folding to a specific native
10
structure, we introduce random point mutations and ac- rapidly than mean first passage time calculations em-
cept them with a Metropolis probability ployed in Mirny et al.25.
(cid:20) (cid:18) ∆F‡(β)−∆F‡(α)(cid:19)(cid:21) Our evolutionary simulations, conducted at moder-
P =min 1,exp − (6) ate “temperature” T = 0.05/k , demonstrate that
acc T ev B
ev in fact many folding pathways can provide efficient ac-
cess to a single native state. It is therefore not at all
that generates a Boltzmann-like distribution. Here,
self-evident that a particular, well-designed amino acid
∆F‡(α) is an estimated activation free energy for fold-
sequence should arrive at its native structure via simi-
ing of sequence α, k(α) = k exp(−∆F‡(α)/k T). We
0 B lar routes in full and G¯o-like versions of the lattice het-
estimate the folding rate constant k(α) for sequence α,
eropolymer model.
relative to the rate of basic microscopic motions k , by
0
computing the fraction of trajectories (cid:104)h (cid:105) ≈ 1 − Using this method, we have generated hundreds of
fold τ
exp(−k(α)τ) that fold within a fixed amount of time τ thousands of sequences which fold to given structures
(with k(α) (cid:28) τ−1 (cid:28) k ). This strategy offers two dis- (forexamplethatofFig. 1a)throughavarietyoffolding
0
tinct advantages: (1) the evolutionary temperature T , mechanisms. Thisistheensembleofsequencesweusein
ev
whichgovernsthestringencyofselectionforefficientfold- this paper. Further details of the evolutionary dynamics
ing, can be controlled systematically; and (2) estimates used to generate these large ensembles of sequences will
of folding efficiency via (cid:104)h (cid:105) can converge much more be given in a forthcoming publication40.
fold τ
1 Shakhnovich, E., 2006. Protein folding thermodynamics 1998. Pathwaysforproteinfolding: isanewviewneeded?
anddynamics: wherephysics,chemistry,andbiologymeet. Curr Opin Struct Biol 8:68–79.
Chem Rev 106:1559–1588. 15 Onuchic,J.N.,andP.G.Wolynes,2004.Theoryofprotein
2 Go,N.,1983. Theoreticalstudiesofproteinfolding. Annu folding. Curr Opin Struct Biol 14:70–75.
Rev Biophys Bioeng 12:183–210. 16 Karanicolas, J., and C. L. Brooks, 2003. Improved Go-
3 Shimada,J.,A.V.Ishchenko,andE.I.Shakhnovich,2000. like models demonstrate the robustness of protein folding
Analysis of knowledge-based protein-ligand potentials us- mechanisms towards non-native interactions. J Mol Biol
ing a self-consistent method. Protein Sci 9:765–775. 334:309–325.
4 Shimada, J., E. L. Kussell, and E. I. Shakhnovich, 2001. 17 Levy,Y.,andJ.N.Onuchic,2006. Mechanismsofprotein
Thefoldingthermodynamicsandkineticsofcrambinusing assembly: lessons from minimalist models. Acc Chem Res
anall-atomMonteCarlosimulation.JMolBiol 308:79–95. 39:135–142.
5 Takada,S.,1999. Go-ingforthepredictionofproteinfold- 18 Simler,B.R.,Y.Levy,J.N.Onuchic,andC.R.Matthews,
ing mechanisms. Proc Natl Acad Sci U S A 96:11698– 2006. Thefoldingenergylandscapeofthedimerizationdo-
11700. mainofEscherichiacoliTrprepressor: ajointexperimental
6 Shoemaker, B. A., and P. G. Wolynes, 1999. Exploring and theoretical investigation. J Mol Biol 363:262–278.
structuresinproteinfoldingfunnelswithfreeenergyfunc- 19 Clementi, C., and S. S. Plotkin, 2004. The effects of non-
tionals: thedenaturedensemble. J Mol Biol 287:657–674. native interactions on protein folding rates: theory and
7 Shakhnovich,E.I.,andA.M.Gutin,1990. Implicationsof simulation. Protein Sci 13:1750–1766.
thermodynamicsofproteinfoldingforevolutionofprimary 20 Sali,A.,E.Shakhnovich,andM.Karplus,1994. Howdoes
sequences. Nature 346:773–775. a protein fold? Nature 369:248–251.
8 Sali,A.,E.Shakhnovich,andM.Karplus,1994.Kineticsof 21 Paci, E., M. Vendruscolo, and M. Karplus, 2002. Validity
proteinfolding.Alatticemodelstudyoftherequirements of Go¯ models: comparison with a solvent-shielded empiri-
for foldingto the native state. J Mol Biol 235:1614–1636. cal energy decomposition. Biophys J 83:3032–3038.
9 Bryngelson, J. D., and P. G. Wolynes, 1987. Spin glasses 22 Paci, E., M. Vendruscolo, and M. Karplus, 2002. Native
andthestatisticalmechanicsofproteinfolding. Proc Natl and non-native interactions along protein folding and un-
Acad Sci U S A 84:7524–7528. folding pathways. Proteins 47:379–392.
10 Onuchic, J. N., P. G. Wolynes, Z. Luthey-Schulten, and 23 Shakhnovich,E.I.,andA.M.Gutin,1993. Engineeringof
N.D.Socci,1995. Towardanoutlineofthetopographyof stable and fast-folding sequences of model proteins. Proc
a realistic protein-folding funnel. Proc Natl Acad Sci U S Natl Acad Sci U S A 90:7195–7199.
A 92:3626–3630. 24 Gutin, A., A. Sali, V. Abkevich, M. Karplus, and E. I.
11 Onuchic,J.N.,N.D.Socci,Z.Luthey-Schulten,andP.G. Shakhnovich, 1998. Temperature dependence of the fold-
Wolynes, 1996. Protein folding funnels: the nature of the ing rate in a simple protein model: Search for a “glass”
transition state ensemble. Fold Des 1:441–450. transition. Journal of Chemical Physics 108:6466–6483.
12 Onuchic, J. N., Z. Luthey-Schulten, and P. G. Wolynes, 25 Mirny,L.A.,V.I.Abkevich,andE.I.Shakhnovich,1998.
1997. Theory of protein folding: the energy landscape Howevolutionmakesproteinsfoldquickly. ProcNatlAcad
perspective. Annu Rev Phys Chem 48:545–600. Sci U S A 95:4976–4981.
13 Socci,N.D.,J.N.Onuchic,andP.G.Wolynes,1998. Pro- 26 Schaeffer,R.D.,A.Fersht,andV.Daggett,2008.Combin-
tein folding mechanisms and the multidimensional folding ing experiment and simulation in protein folding: closing
funnel. Proteins 32:136–158. the gap for small model systems. Curr Opin Struct Biol
14 Pande,V.S.,A.Grosberg,T.Tanaka,andD.S.Rokhsar, 18:4–9.