Self-assembly, modularity and physical complexity
S. E. Ahnert,1 I. G. Johnston,2 T. M. A. Fink,3,4,5 J. P. K. Doye,6 and A. A. Louis2
1Theory of Condensed Matter, Cavendish Laboratory,
University of Cambridge, JJ Thomson Avenue, Cambridge CB3 0HE, UK
2Rudolf Peierls Centre for Theoretical Physics, University of Oxford, 1 Keble Road, Oxford OX1 3NP, UK
3CNRS UMR144, INSERM U900, Institut Curie, 26 rue d’Ulm, Paris, F-75248 France
4Mines ParisTech, Fontainebleau, F-77300 France
5London Institute for Mathematical Sciences, 22 South Audley St, London W1K 2NY, UK
6Physical & Theoretical Chemistry Laboratory, Department of Chemistry,
University of Oxford, South Parks Road, Oxford OX1 3QZ, UK
arXiv:0912.3464v2 [cond-mat.stat-mech] 12 Jan 2010
We present a quantitative measure of physical complexity, based on the amount of information required to build a given physical structure through self-assembly. Our procedure can be adapted to any given geometry, and thus to any given type of physical system. We illustrate our approach using self-assembling polyominoes, and demonstrate the breadth of its potential applications by quantifying the physical complexity of molecules and protein complexes. This measure is particularly well suited for the detection of symmetry and modularity in the underlying structure, and allows for a quantitative definition of structural modularity. Furthermore we use our approach to show that symmetric and modular structures are favoured in biological self-assembly, for example of protein complexes. Lastly, we also introduce the notions of joint, mutual and conditional complexity, which provide a useful distance measure between physical structures.
ALGORITHMIC COMPLEXITY

More than forty years ago, Kolmogorov [1] and Chaitin [2] laid the foundations of algorithmic information theory by introducing the concept of algorithmic information content, or Kolmogorov complexity, for a given string of information [3]. This measure of complexity is defined as the length of the shortest possible program on a universal computer that will output the string in question. Here we propose a conceptually analogous measure of the complexity of any connected physical structure. Instead of a universal computer which translates a program into a string of information, we consider a general framework of self-assembly rules, which act together to create a physical object. The 'program' now is our set of self-assembly building blocks and rules, the 'computer' is given by the physical interactions of the self-assembling building blocks, and the 'output' is the final structure.

Using this approach we investigate the physical complexity of shapes in two and three dimensions, including polyominoes, molecules and protein complexes. Our work generalizes ideas first explored in [4, 5], and opens them up to a wide range of applications. Furthermore, in the context of protein complexes it offers the kind of biological application of information-theoretic concepts demanded in [6].

SELF-ASSEMBLY KIT

There are many examples of self-assembling structures in physics, chemistry and biology [7]. Examples include thin films [8], micelles [9], viruses [10, 11] and DNA [12–17]. Our aim is to introduce a general framework for the theoretical study of self-assembling structures. This framework can be used to study the properties of real self-assembling systems, but, more generally, it can also be used to measure the physical complexity of any construct, self-assembling or not. The exact nature of the self-assembly framework depends on the underlying physical system, but it always contains two basic ingredients: a set of building blocks and a set of rules. We shall call this combination an assembly kit $S$. Each building block $i$ has $f_i$ interfaces, which typically are subject to geometric constraints (depending on the physical system). Attached to each interface $j$ of a given building block $i$ is an integer $\chi_{ij} \in [1, \ldots, c]$. The $c$ possible values of these integers are the colours of these interfaces. The number of distinct colourings of the building blocks depends entirely on the geometry of the problem. The second ingredient of the assembly kit is the set of rules, which takes the form of an interaction matrix between colours. In the simplest case this matrix is binary, where 1 signifies attraction and 0 signifies no interaction at all. Many more sophisticated interaction matrices involving repulsion and a continuous spectrum of energies are easily imaginable.

For any system of self-assembling particles we also need to specify a model for the actual assembly process. A convenient choice is a model assuming a single nucleus in solution [4], in which each disjoint object has one fixed nucleus building block, surrounded by a solution containing a freely moving population with many copies of each type of building block. At each time step (i) a fixed building block, (ii) a site adjacent to it, (iii) a random rotational orientation, and (iv) a building block from the solution are chosen at random, and the new, randomly rotated building block becomes fixed to its position if the rules allow it. Note that some assembly kits always assemble into the
same shape - these we call 'deterministic' - while ones which contain ambiguous rules are 'non-deterministic'. See Figure 1 for an example of a deterministic and a non-deterministic self-assembly kit.

FIG. 1: An example of deterministic and non-deterministic self-assembly kits, using simple 2D lattice structures (polyominoes). In both cases, colours A and B attract each other, but C attracts neither A nor B. No colour attracts itself. The kit on the left will always assemble into the cross shape while that on the right will assemble into an irregular cluster, as there are several ways in which the two blocks can attach.

As a simple example of a self-assembling system, we will consider self-assembling polyominoes. A polyomino (also known as a lattice animal) is a set of connected sites on a (typically square) lattice [18]. These connected sites are our self-assembly building blocks. Every building block has four sides (so that $f_i = 4$ for all $i$), each of which is painted with one of $c$ colours. These colours can attract each other or not, as encoded in a $c \times c$ binary interaction matrix. Each distinct way of colouring a building block corresponds to a different building block type. We do not regard rotated colourings as distinct. The geometry of the 2D lattice gives rise to a particular set of building block colourings in the context of self-assembly. If we have $c$ colours, the total number of such colourings is [19]:

$$N_c = (c^4 + c^2 + 2c)/4$$

These particular colourings are also known as necklaces, which can be defined as equivalence classes of strings under rotation [19]. The definition of necklaces used here assumes that the building blocks have a fixed chirality - in other words that the necklaces which the colours form on the building blocks are fixed. [43]

THE MINIMUM KIT

Every deterministic assembly kit $S_A$, which always assembles into a structure $A$, requires a certain amount of information $I(S_A)$ to describe it in some given language $L$. Our aim is to minimize this quantity, as we define the length of the description of the minimum assembly kit $\tilde{S}_A$ as the complexity $K(A)$ of structure $A$:

$$K(A) = I(\tilde{S}_A) = \min_{S_A} I(S_A)$$

in analogy to the concept of Kolmogorov complexity. Any symmetry or modularity which the structure $A$ contains decreases the amount of information required to describe the structure and will therefore be reflected in its minimum assembly kit $\tilde{S}_A$, and by extension in the value of $K(A)$.

If a minimum assembly kit is deterministic, an interaction matrix $A$ (with elements $a_{ij}$) between a total of $c$ colours, of which $c_s$ self-interact, can be rewritten as:

$$a_{ij} = [1 - (i \bmod 2)]\,\delta_{i(j+1)} + (i \bmod 2)\,\delta_{i(j-1)}$$

for $i \le c - c_s$, and $a_{ij} = \delta_{ij}$ otherwise, so that one colour always only interacts with one other colour. With this constraint, the amount of information, in bits, required to describe a self-assembly kit $S_A$, with $b$ building block types, is:

$$I(S_A) = \log_2(c_s + 1) + \sum_{i=1}^{b} \left[ c_i \log_2 c + \log_2 F_i \right] \qquad (1)$$

The first term relates to the number of self-interacting colours, the second measures the information required to describe which $c_i$ colours out of the total of $c$ colours appear on building block $i$, and the third term $\log_2 F_i$ measures the information describing the distinct arrangement of the $c_i$ colours on the $f_i$ faces of building block $i$. For a general building block with $f_i$ labelled faces, $F_i$ takes the form of:

$$F(c_i, f_i) = \sum_{k_1=1}^{f_i - c_i + 1} \; \sum_{k_2=1}^{f_i - c_i + 2 - k_1} \cdots \sum_{k_{c_i-1}=1}^{f_i - c_i + (c_i - 1) - \Sigma'} \frac{f_i!}{\prod_{m=1}^{c_i} k_m!}$$

where $\Sigma' = \sum_{j=1}^{c_i-2} k_j^{(i)}$, and the $k_j^{(i)}$ (written $k_j$ in the sums above) signify the number of times colour $j$ occurs on block $i$.

For polyominoes $F_i = F(c_i) = N'_{c_i}$, where $N'_{c_i}$ is the number of necklaces with exactly $c_i$ colours, given by

$$N'_{c_i} = N_{c_i} - \sum_{k=1}^{c_i - 1} \binom{c_i}{k} N'_k$$

with $N'_1 = 1$. It follows that $N'_2 = 4$, $N'_3 = 9$, and $N'_4 = 6$. As before, the complexity $K(A)$ of polyomino $A$ is the minimum of $I(S_A)$ over all possible assembly kits $S_A$. Note that Wang tiles [20] are a special case of self-assembling polyominoes. The tile system described in [5] is also similar to our framework for the case of polyominoes, but (like Wang tiles) only considers self-interacting colours, and treats rotated tiles as distinct. As a result
our encoding, based on necklaces, makes symmetry and modularity in the structure more directly measurable.

If the faces are geometrically unconstrained - as one would imagine for a node with a set of freely moving links - and hence unlabelled, we would only need to specify how much there is of each colour. This can be written using $F_i = \prod_{j}^{c_i} k_j^{(i)}$, so that $\log_2 F_i = \sum_{j}^{c_i} \log_2 k_j^{(i)}$. However, this only works under the condition that multiple connections between the same pair of building blocks are prohibited.

The general algorithm we use to find the minimum assembly kit $\tilde{S}$, and thus the complexity $K$, for polyominoes and other structures is described in the following section.

A GENERAL MEASURE OF STRUCTURAL COMPLEXITY

Below we describe a general algorithm for minimizing the assembly kit size for a connected physical structure without relying on steric effects. Taking these into account can minimize the assembly kit even further, but their computation is highly dependent on the geometry of the system and in most cases non-trivial (see Discussion). Note also that in some structures, such as polyominoes, some edges of the contact graph can be redundant in the context of the assembly process. Whether contact graph edges in general can be redundant or not depends on the nature of the structure and the assumptions connected to the self-assembly of that structure (see Discussion). Similarly, when interfaces are defined by geometry, as for the four sides of a polyomino building block, it makes sense to introduce a neutral colour ($\nu = 1$ below). In systems with a varying number of interfaces on the building blocks, neutral colours are usually not required ($\nu = 0$). To minimize the assembly kit we take the following steps:

1. Divide the structure into building blocks (usually a natural division). The number of building blocks is the size of the structure, denoted $z$.

2. Determine the equivalence of these units in terms of any additional criteria (e.g. types of atoms, proteins). This categorization is the species of building block.

3. Establish a contact graph $a_{ij}$ for the units (in some cases, such as molecules, this may require setting a distance cutoff).

4. If edges can be redundant: Consider the space of all spanning subgraphs of this graph.

5. For the contact graph (in the case of no redundant edges) or each subgraph (if redundant edges exist):

   (a) Classify the (sub)graph according to the number of connections and (depending on the geometry) the arrangement of connections.

   (b) Label all nodes which are not yet labelled and which have exactly one unlabelled node among their neighbours. The new labels distinguish nodes according to their species as well as the topologically distinct label distributions among their neighbours.

   (c) Repeat step 5b until all nodes are labelled or no more nodes can be labelled.

   (d) All labelled nodes we define as category 1 nodes and any remaining unlabelled nodes (i.e. nodes with at least two unlabelled neighbours) are defined as category 2 nodes.

   (e) Label all category 2 nodes simultaneously according to their neighbourhoods.

   (f) Repeat step 5e, using the previous labellings to distinguish neighbourhoods, until labellings are stable.

   (g) These final labels, for nodes in both categories, denote the building block types. The number of final labels, or types, is $b$. These can be subdivided into $b_1$ category 1 building block types and $b_2$ category 2 building block types. The category 2 type of block $i$ is denoted $t_i$.

   (h) The degree of each building block type $i$ in the contact graph (or subgraph) is the number of its interfaces $f_i$.

   (i) The total number of colours, including $\nu \in \{0, 1\}$ neutral colours, is $c = 2(b_1 - 1) + \nu + \sum_{i,j=1}^{b_2} \left( 1 - \prod_{k,l=1}^{z} \left( 1 - a_{kl}\,\delta_{i t_k}\,\delta_{j t_l} \right) \right)$. The sum expression gives the number of different types of interfaces which occur between category 2 building block types [44]. The number of colours $c_i$ on building block $i$ is equal to the number of building block types in its contact graph neighbour set.

   (j) Using $b$, $c$, $\{f_i\}$ and $\{c_i\}$ in equation (1), calculate the information $I$ required to specify this assembly kit, and thus the complexity $K$ of the structure.

6. If edges can be redundant: Minimize this quantity over all spanning subgraphs.

Figure 2 illustrates the crucial steps 5b to 5j for a polyomino. Figure 3 illustrates how the complexity value $K$ reflects symmetry and modularity present in the structure.
FIG. 2: An illustration of the crucial steps 5b to 5j of the algorithm for minimizing the assembly kit size, in this case for a polyomino. In every iteration of category 1 labellings (LEFT), all unlabelled nodes with exactly one unlabelled neighbour are given labels which distinguish them according to their topologically distinct neighbourhoods of unlabelled and labelled tiles. This procedure is repeated until no more blocks can be labelled in this way. The remaining blocks are given category 2 labellings (RIGHT) which are applied simultaneously, with each label distinguishing the topological neighbourhoods of the tiles in the previous iteration. Note that in the last iteration the labellings have stabilized, and only the interfaces of the building block types are updated. For structures in which edges can be redundant, this operation can be performed for all spanning subgraphs of the structure's connectivity graph, which further reduces the complexity. (In polyominoes, edges can be redundant, but there are no spanning subgraphs in the above example.)

[Figure 3 panels: A) size 17, low symmetry/modularity, complexity 228.7 bits (large size and low symmetry results in high complexity); B) size 17, high symmetry/modularity, complexity 74.9 bits (large size but high modularity results in medium complexity); C) size 9, low symmetry/modularity, complexity 79.5 bits (small size but low symmetry results in medium complexity); D) size 9, high symmetry/modularity, complexity 19.1 bits (small size and high symmetry results in low complexity).]

FIG. 3: The complexity values of these four polyomino shapes illustrate why the self-assembly approach is an effective way of measuring symmetry and modularity without requiring prior assumptions. If two shapes are of equal size, the one with more symmetry and modularity has a lower complexity value - compare A with B, and C with D. If on the other hand, two shapes are of similar complexity, but of different size, the larger one will be more symmetric or modular (compare B and C).

APPLICATIONS

The self-assembly approach can be used to calculate complexity values for any physical structure. In order to demonstrate the broad range of potential applications we determine the complexity of (a) molecules and (b) protein complexes.

The problem of molecular complexity has been studied extensively over the past seventy years, starting with work by Pólya [21] and Rashevsky among others [22, 23], and culminating in a seminal paper by Bertz [24]. These approaches are based on Shannon entropy rather than algorithmic information theory and focus on symmetries rather than the more general concept of modularity. In molecules, we take atoms to be the building blocks and chemical bonds to be their interfaces. Simple molecules, such as those in Figure 4, for which we are only interested in the bond connectivity, are an example of a structure in which none of the edges can be regarded as redundant. This is because, unlike for polyominoes, we are not assuming any inherent geometry for the building blocks. If two atoms play the same self-assembly role but represent atoms of different atomic species, they must be differentiated. This also goes for atoms connected by different bond types. For example, in glutamine (see Figure 4), the oxygen atom connected with a double bond is a leaf of the self-assembly tree just like any of the (implicit) hydrogen atoms, but it requires a separate building block. The two molecules in our example of Figure 4 are the amino acid glutamine and the explosive nitroglycerine, which both consist of 20 atoms. Nitroglycerine however exhibits a much higher degree of modularity, with its three NO3 groups, and therefore has a much lower complexity of K = 55.3 bits than glutamine, for which the value is K = 94.7 bits. Note that nitroglycerine does not exhibit simple three-fold symmetry, but a more subtle, hierarchical modularity. Such structural features would be harder to discover using traditional approaches to the measurement of molecular complexity [22–24], which do not take a self-assembly perspective and rely on Shannon entropy rather than Kolmogorov complexity as a measure of complexity.

Many important biochemical structures are protein
complexes, consisting of several individually formed and folded protein subunits bound together to produce functional cellular machinery. These subunits may include different types of protein and several copies of the same protein. The physical structure of protein complexes, as with proteins themselves, is important in determining the functionality of the complex. The manner in which the subunits bond to form the final complex is known as the quaternary structure of the complex. The 3DComplex database [25] contains a description of the quaternary structures of thousands of protein complexes, in terms of subunit type and inter-subunit bonding. If we have two proteins which play the same role in the self-assembling structure but are different proteins, we can choose to count them as two different building blocks (analogous to the aforementioned distinction between atomic species in molecules). In the following analyses we are only interested in the connectivity of proteins (equivalent to the QS Topology level in the 3DComplex database), and therefore do not distinguish between different proteins. The two protein complexes in our example of Figure 5 are a chaperonin complex (E. coli chaperonin GroEL; PDB identifier: 1oel) and an allergen complex (P. pratense allergen PHL P 6; PDB identifier: 1nlx). Both consist of 14 proteins, but the former displays a much higher degree of symmetry and a much lower complexity value of K = 31.5 bits, versus K = 50.2 bits for the allergen (which is still somewhat modular).

FIG. 4: Measuring the complexity of molecules – The explosive nitroglycerine (top) and the amino acid glutamine (bottom) both consist of 20 atoms, but differ greatly in complexity. The highly modular structure of nitroglycerine with its three NO3 groups means that its complexity value K, at 52.2 bits, is little more than half that of glutamine (K = 91.0 bits). Note that nitroglycerine does not have simple three-fold symmetry, but a more subtle modular structure, which the self-assembly approach fully reveals. Note that we do not consider neutral colours in this structure ($\nu = 0$).

FIG. 5: We measure the complexity of two protein complexes, with PDB identifiers 1oel (a chaperonin, top) and 1nlx (an allergen, bottom), which have 14 proteins each. The symmetry of the chaperonin complex means that it has a much lower complexity value of K = 31.5 bits, compared to K = 50.2 bits for the allergen complex. Note that we are assuming non-redundant edges in this calculation, so that all building blocks of the chaperonin complex are category 2 and all building blocks of the allergen complex are category 1. Furthermore we do not consider neutral colours ($\nu = 0$), and in the case of the chaperonin complex we have three self-interacting colours ($c_s = 3$). Note also that both complexes are homomers, i.e. they only have one type of subunit.

More complex protein structures require more unique inter-subunit bond types, compared to less complex structures which can re-use bonds and be constructed through simple repetition of subunits. As an increase in bond types corresponds biologically to the presence of more unique bonding sites on subunit proteins, more complex protein structures can be thought of as requiring more evolutionary innovation to produce and would therefore be expected to occur less frequently in biological organisms [26, 27]. This hypothesis is confirmed by Figure 6, which shows a histogram of complexity values – normalized by the size of the protein complex, to avoid size effects – for the 15733 protein complexes in the 3DComplex database [25]. The distribution closely ($R^2 = 0.93$) follows a power-law decay.

In both of these cases - molecules and protein complexes - we assume geometrically unconstrained faces for the building blocks; in other words, we use $F_i = \prod_{j}^{c_i} k_j$. While the chemical bonds of atoms and the interfaces of proteins are in fact usually constrained, this information is not part of the structural formula of the molecule or the contact graph of the protein complex. If this additional level of resolution is required, a more realistic self-assembly model can be constructed, based on the exact three-dimensional characteristics of the atoms or
proteins, and using the $F(c_i, f_i)$ term specified above.

[Figure 6: log-log histogram; horizontal axis: Complexity / Size; vertical axis: Frequency.]

FIG. 6: (Colour online) Histogram of protein quaternary structure assembly complexity with frequency of occurrence in the 3DComplex database. Insets illustrate two pairs of equally sized structures with high and low complexity values. 1geh, 1i3q, 1q2v, and 1ohh are the PDB identifiers of the complexes. The plot has an $R^2 = 0.93$ correlation with a power law decay. Note that in this case we do not distinguish between different types of subunit.

[Figure 7: log-log plot; horizontal axis: Size; vertical axis: Number of Required Blocks; guide lines at b/z = 1, b/z = 0.5 and b/z = 0.1; inset 1b5s: 60 subunits, 1 block required.]

FIG. 7: (Colour online) The position of the 15733 protein complexes from [25] in the space of $b$ (number of building block types) and $z$ (size of the complex). Many protein complexes are highly modular, and this is true across a wide range of sizes. In this plot complexes of equal modularity $m = z/b$ lie on a diagonal line with positive gradient. The lines are shown for m = 1, 2, and 10 (b/z = 1, 0.5, and 0.1). The sizes of the circles show how many complexes lie at a given position (z, b). The insets show two examples (with PDB identifiers 1kyo and 1b5s), with high and low modularities.

MODULARITY

The self-assembly perspective provides an intuitive definition of the modularity of a structure: If part of the structure appears several times, it still only needs to be encoded once. This is why modularity and symmetry (being a special case of modularity) lead to more efficient self-assembly kits and a lower value of the complexity measure $K$. Formally we can define the modularity $m$ of a structure of size $z$ as the average number of times one of the $b$ different building block types in the minimum assembly kit is used in the structure, which is simply:

$$m = \frac{z}{b}$$

We can furthermore define a module formally as a connected set of building blocks which appears more than once in a given structure. Note that modules can overlap: A subset of a module could form another module, appearing a different number of times than the whole module. The molecule in Figure 4a illustrates such a case.

The majority of protein complexes in the 3DComplex database show high modularity values (Figure 7) with a common trend observable along the b/z = 0.5 line, indicating that many protein complexes consist of structures involving two copies of all constituent subunits.

To further illustrate how the complexity $K$ and the modularity $m$ measure the physical complexity of protein complexes, we consider two of the outliers in the complexity and modularity histograms, the high-complexity 1ohh (Figure 6) and high-modularity 1b5s (Figure 7). 1ohh consists of two copies of bovine F1-ATPase (itself a protein complex) in complex with its regulatory protein IF1 [39]. The regulatory protein binds simultaneously to both copies of the main complex, but slightly asymmetrically, leading to asymmetric interactions being recorded in the 3DComplex database. This asymmetry results in extra information being required to describe the combined quaternary structure, and the observed high complexity value. 1b5s is a multienzyme complex consisting of multiple copies of dihydrolipoyl acetyltransferase (E2p) [40]. The E2p protein has the potential to occupy quasi-equivalent positions, as seen in virus structures [41], and is also observed to form cubic complexes. The highly-modular, dodecahedral structure exhibited in 1b5s is an efficient way of grouping many copies of an active protein in a geometry that facilitates enzymatic activity: the large windows in the structure allow passage of the substrate and product between the inner cavity and the substrate. The structure of the protein subunits allows this structure to be realised with just one building block type, resulting in high modularity.
JOINT, CONDITIONAL, AND MUTUAL COMPLEXITY

If we have two structures $A$ and $B$ with minimum assembly kits $\tilde{S}_A$ and $\tilde{S}_B$, then the joint minimum assembly kit $\tilde{S}_{A,B}$ is the minimum kit which can assemble both structures if an appropriate subset of building blocks is chosen. The amount of information required to describe this kit is the joint complexity $K(A,B)$ of $A$ and $B$. This definition can easily be generalized to more than two structures.

Let us define $\tilde{S}'_A$ as the subset of $\tilde{S}_{A,B}$ which forms structure $A$, and $\tilde{S}'_B$ as the subset of $\tilde{S}_{A,B}$ which forms structure $B$ (note that e.g. $\tilde{S}_A$ is not necessarily equal to $\tilde{S}'_A$ due to the colour minimization), so that $\tilde{S}_{A,B} = \tilde{S}'_A \cup \tilde{S}'_B$. Furthermore, let us define the conditional minimum assembly kit $\tilde{S}_{A|B}$ as the set of building blocks we need in addition to $\tilde{S}'_B$ in order to form structure $A$. Then we can write:

$$\tilde{S}_{A|B} = \tilde{S}_{A,B} \setminus \tilde{S}'_B$$

where $\setminus$ denotes the set theoretic difference operation. The definition of $\tilde{S}_{B|A}$ follows accordingly. Hence we can also define a conditional complexity $K(A|B)$, which is the amount of information needed to describe the building blocks in $\tilde{S}_{A|B}$. Because the way we describe the assembly kit is additive in the number of building blocks, we can write

$$K(A|B) = K(A,B) - K'(B)$$

since $K'(B)$ is the information required to describe the building blocks in $\tilde{S}'_B$. The relationship between $K(B)$ and $K'(B)$ is given by

$$K'(B) = K(B) + \sum_i c_i \log_2 \frac{c_{A,B}}{c_B}$$

where $c_{A,B}$ is the total number of colours in $\tilde{S}_{A,B}$ and $c_B$ is the total number of colours in $\tilde{S}_B$. Because of the minimization of colours, $c_{A,B} = \max(c_A, c_B)$. Hence, if $c_B \ge c_A$, then $K'(B) = K(B)$.

Similarly, we can define a mutual minimum assembly kit $\tilde{S}_{A:B}$, which corresponds to the intersection

$$\tilde{S}_{A:B} = \tilde{S}'_A \cap \tilde{S}'_B = \tilde{S}'_A \setminus \tilde{S}_{A|B} = \tilde{S}'_B \setminus \tilde{S}_{B|A}$$

From this follows the mutual complexity

$$K(A:B) = K'(A) - K(A|B) = K'(B) - K(B|A) = K'(A) + K'(B) - K(A,B)$$

In order to account for the relative sizes of the structures we compare using these measures, we can define relative versions of the above quantities. These are relative conditional complexity:

$$K_{\mathrm{rel}}(A|B) = \frac{K(A|B)}{K'(B)}$$

and the relative mutual complexity

$$K_{\mathrm{rel}}(A:B) = \frac{K(A:B)}{K(A,B)}$$

Note that the latter measure resembles the Jaccard index [42]. For an illustration of joint, mutual and conditional complexity, see Figure 8.

FIG. 8: POLYOMINOES (left): The two polyominoes share many building block types, with the only two unique ones being blocks 5 and 6 (marked in grey). Hence, the joint set is $\tilde{S}_{A,B} = \{1,2,3,4,5,6\}$, the mutual set is $\tilde{S}_{A:B} = \{1,2,3,4\}$ and the conditional sets are: $\tilde{S}_{A|B} = \{5\}$ and $\tilde{S}_{B|A} = \{6\}$. Building block 5 contributes $K(A|B) = 2\log_2 9 + 2 = 8.4$ bits to the complexity $K'(A)$ of the A shape, while block 6 contributes $K(B|A) = 4\log_2 9 = 12.7$ bits to $K'(B)$. It follows therefore that the joint complexity is $K(A,B) = 67.4$ bits and the mutual complexity is $K(A:B) = 46.4$ bits, compared to the standalone values of $K(A) = K'(A) = 54.7$ bits and $K(B) = K'(B) = 59.1$ bits (see Figure 3). AMINO ACIDS (right): The two amino acid molecules asparagine (top, C) and glutamine (bottom, D) share the amino (NH2) and carboxyl (CO2H) groups common to all amino acids, as well as the carboxamide group (CONH2). In a self-assembly framework these two structures have complexities of K(Asn) = 74.3 bits and K(Gln) = 91 bits. While K'(Gln) = K(Gln), we have K'(Asn) = 78.0 bits. Because the two molecules share three groups, their joint complexity is not much larger than their individual complexities, at K(Asn,Gln) = 104.0 bits, and their mutual complexity is not much smaller, at K(Asn:Gln) = 65 bits, than the complexities of the individual molecules. Their conditional complexities are correspondingly low, at K(Asn|Gln) = 13 bits and K(Gln|Asn) = 26 bits. The conditional complexities give the amount of information required to describe the building blocks (atoms) which are unique (in their self-assembly role) to the given amino acid. These atoms are marked with grey circles.
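The relations between joint, conditional and mutual complexity can be checked numerically with a short sketch. The Python below takes $K'(A)$, $K'(B)$ and $K(A,B)$ as given inputs and applies the identities stated in this section; the function and parameter names are illustrative assumptions, and the numbers in the usage note are the values quoted in the Figure 8 caption.

```python
def conditional(k_joint: float, k_prime_other: float) -> float:
    """K(A|B) = K(A,B) - K'(B): information needed for A beyond the B part of the joint kit."""
    return k_joint - k_prime_other


def mutual(k_prime_a: float, k_prime_b: float, k_joint: float) -> float:
    """K(A:B) = K'(A) + K'(B) - K(A,B)."""
    return k_prime_a + k_prime_b - k_joint


def relative_conditional(k_joint: float, k_prime_other: float) -> float:
    """K_rel(A|B) = K(A|B) / K'(B), following the definition given in the text."""
    return conditional(k_joint, k_prime_other) / k_prime_other


def relative_mutual(k_prime_a: float, k_prime_b: float, k_joint: float) -> float:
    """K_rel(A:B) = K(A:B) / K(A,B), the Jaccard-like measure."""
    return mutual(k_prime_a, k_prime_b, k_joint) / k_joint
```

For the two polyominoes, with $K'(A) = 54.7$, $K'(B) = 59.1$ and $K(A,B) = 67.4$ bits, `conditional(67.4, 59.1)` and `conditional(67.4, 54.7)` give roughly 8.3 and 12.7 bits and `mutual(54.7, 59.1, 67.4)` gives roughly 46.4 bits, in line with the caption up to rounding. For asparagine and glutamine (78.0, 91.0 and 104.0 bits) the same functions give 13, 26 and 65 bits.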
DISCUSSION

Steric effects – For structures which contain loop structures formed by repeating units, it is possible to exploit steric effects in order to reduce the size of the assembly kit below the minimum size found by our algorithm (which explicitly excluded such effects in its definition). An example of a steric effect would be a polyomino which is self-limiting in a deterministic way, purely because of the geometric constraints of the building blocks. As long as each distinct type of loop structure is formed by building blocks of a distinct species (or set of species), the amount of information required to describe this structure can be taken to be the same as that required to describe an infinite chain consisting of the same elements. A simple example is given in Figure 9. The crucial assumption which has to hold for this simplification to work is that the geometry of the loop is specified by the species (and, by extension, the geometry) of the building block. For proteins as building blocks of protein complexes, this is a very reasonable assumption. In the case of molecules it would furthermore be possible to simplify the self-assembly kit by introducing building blocks representing common small loop structures, such as carbon rings.

FIG. 9: A simple example of a steric effect. The two blocks 1 and 2 have colours A and B on their interfaces. These colours attract each other. All other faces are neutral. Certain arrangements of colours will lead to self-delimiting structures purely because of the geometry of the building blocks. The complexity of such structures can be taken to be the same as that of an infinite chain consisting of the same sequence of blocks, but only if each loop structure inside a bigger structure has a distinct (set of) species of building blocks.

Multiple nuclei – In principle one could consider beginning the self-assembly with multiple nuclei in place. Multiple nuclei may, through steric hindrance or modular repetition, be used to achieve certain structures in a more efficient way, using fewer building blocks than a single nucleus would require. This reduction in complexity may however be countered in practical applications by the difficulty of achieving the required precise relative displacements of nucleus particles. It is for these reasons that we have concentrated on a single nucleus model, as the positioning of multiple nuclei makes it much more difficult to construct a general measure of complexity.

FIG. 10: Illustration of nuclei placement. (Top:) If we specify either of the two starred blocks as nuclei, deterministic bonding will result. However, if any other block is used as the nucleus, bonding will be non-deterministic, as both the {1,0,0,4} and {1,0,5,0} blocks can join the open 2 edges that will form. This self-assembly kit has a complexity of K = 42.4 bits. (Bottom:) A general nucleus system to produce the same structure, illustrating the required increase in complexity (K = 98.1 bits).

Within the single nucleus category, we further distinguish between structures with a specified nucleus block and those with general nucleus blocks. The former case encompasses those assembly kits which are guaranteed to produce a given output structure if and only if a specified block is used as the nucleus (in other words, this block is placed on the substrate before other blocks are introduced to the system). General-nucleus assembly kits by contrast will form the same output structure regardless of
which block is placed first. See Figure 10 for an illustra-
tion how specifying a nucleus can reduce the complexity trolled environment where a nucleus can be placed to
of a assembly kit. initiate assembly, the single-nucleus model is applicable.
Which of these classes to employ in a study depends The two cases correspond to different ‘languages’ being
on the motivating context of the self-assembling system used to measure complexity, and so care must be taken
underconsideration. Ifmodellingassemblyinadiffusion- in comparative studies to only compare numerical com-
dominated environment,for example,the orderin which plexity values from within one class.
interacting particles meet cannot be specified, so the Kolmogorov complexity – Our approach to measuring
general-nucleus model is more appropriate. In a con- physical complexity is motivated by the concept of Kol-
9
mogorovcomplexity. Itishoweverimportanttonotethat [13] C. Mao, T. H. LaBean, J. H. Reif, and N. C. Seeman,
while Kolmogorovcomplexityitselfis uncomputable due Nature 407, 493 (2000).
to the Halting problem [3], our minimum is not. This is [14] A. Chworos, I. Severcan, A. Y. Koyfman, P. Weinkam,
E.Oroudyev,H.G.Hansma,andL.Jaeger,Science306,
because the runtime of a finite computer program with
2068 (2004).
finite output can be infinite, while the assembly time of
[15] R. P. Goodman et al., Science 310, 1661 (2005).
a finite shape is always finite [4]. It is possible to de-
[16] P. W. K. Rothemund,Nature440, 297 (2006).
fine the actual Kolmogorov complexity of a shape [5],
[17] K. Fujibayashi, R. Hariadi, S. H. Park, E. Winfree, and
but this is uncomputable. Our computable complexity S. Murata, NanoLett. 8, 1791 (2008).
measureK(A)formsaboundonthisunattainablequan- [18] E. W. Weisstein, “Polyomino.” from
tity, and is dependent on the way in which we encode MathWorld - A Wolfram Web Resource.
the description of the assembly kit. It therefore is useful http://mathworld.wolfram.com/Polyomino.html
[19] E. W. Weisstein, “Necklace.” from Math-
fortheanalysis,classificationandcomparisonofphysical
World - A Wolfram Web Resource.
structures, as long as we use a consistent encoding.
http://mathworld.wolfram.com/Necklace.html
[20] H. Wang, Bell SystemsTech. J. 40, 1 (1961).
[21] G. Po´lya, Acta Math-Djursholm 68, 145 (1937).
CONCLUSION [22] N. Rashevsky,Bull. Math. Biophys. 17, 229 (1955).
[23] E. Trucco, Bull. Math. Biophys. 18, 129 (1956).
Wepresentageneralapproachformeasuringthephys- [24] S. H. Bertz, J. Am. Chem. Soc. 103, 3599 (1981).
[25] E. D. Levy, J. B. Pereira-Leal, C. Chotia, and S. A. Te-
icalcomplexityofanyconnectedstructure,usingthelan-
ichmann, PLoS Comp. Biol. 2, e155 (2006).
guage of self-assembly. This approach is capable of de-
[26] Gabriel Villar, et al., Phys. Rev. Lett. 102, 118106
tecting symmetry and modularity in a given structure,
(2009).
because these features significantly decrease the size of [27] E.D.Levy,E.B.Erba,C.V.Robinson,S.A.Teichmann,
the required self-assembly instruction set. It therefore Nature 453, 1262 (2008).
providesapowerfultoolforautomatedclassificationand [28] B.Goodwin,Howtheleopardchangeditsspots: Theevo-
categorization of physical structures. In addition, the lution of complexity. (Princeton University Press, 2001)
[29] J. Bronowski, Synthese21, 228 (1970).
connection between self-assembly and complexity is an
[30] D. W. McShea, Biology and Philosophy 6, 303 (1991).
argumentfortheubiquityofmodularandsymmetricfea-
[31] C. Bennett, Complexity, entropy, and the physics of in-
turesinbiologicalsystems: Sincemanysuchsystemsself-
formation,pp.137-148(WestviewPress,1990,ed.W.H.
assemble, evolving sets of self-assembly instructions are Zurek).
likely to yield symmetric and modular structures, as the [32] J. S.Wicken, J Theor. Biol. 77, 349 (1979).
instructions for these are more efficient to evolve. [33] O. Toussaint and E. D. Schneider, Comparative Bio-
chemistry and Physiology, Part A 120, 3 (1998).
[34] C. Adami, C. Ofria and T. C. Collier, Proc. Natl. Acad.
Sci. U.S.A.97, 4463 (2000).
[35] M.Mitchell,Anintroductiontogeneticalgorithms(Brad-
ford Books, 1996)
[1] A. N. Kolmogorov, Prob. Inform. Transmission 1, 4
[36] M. Mitchell and S.Forrest, Artificial Life 1, 267 (1994).
(1965).
[37] J. H.Holland, ScientificAmerican 267, 66 (1992).
[2] G. J. Chaitin, J. Assoc. Comput. Mach. 13, 547 (1966).
[38] S. Forrest, Science 261 872 (1993).
[3] T.M.CoverandJ.A.Thomas,ElementsofInformation
[39] E.Cabez´on,M.G.Montgomery,A.G.W.Leslie, andJ.
Theory (Wiley-Interscience,1991).
E. Walker, Nature Structural & Molecular Biology, 10,
[4] P. W. K. Rothemund and E. Winfree, STOC ’00: Pro-
744 (2003).
ceedingsofthethirty-secondannualACMsymposiumon
[40] T. Izard, A. Ævarsson, M. D. Allen, A. H. Westphal, R.
Theory of computing, pp.459-468 (2000).
N. Perham, A. de Kok, and W. G. J. Hol, Proc. Natl.
[5] D.SoloveichikandE.Winfree,SIAMJ.Comp.36,1544
Acad. Sci. U.S.A. 96, 1240 (1999).
(2006).
[41] D. Caspar and A. Klug, Cold Spring Harbor Symp.
[6] C. Adami, Phys. Life Rev.1, 3 (2004).
Quant. Biol, 27(1) (1962).
[7] G. M. Whitesides and M. Boncheva, Proc. Natl. Acad.
[42] P. Jaccard, Bulletin de la Societe Vaudoise des Sciences
Sci. U.S.A.99, 4769 (2002).
Naturelles, 37 241, (1901).
[8] G.KrauschandR.Magerle,Adv.Mater.14,1579(2002).
[43] For free necklaces, which represent building blocks with
[9] J. Israelachvili, Langmuir 10, 3774 (1994).
[10] H.Fraenkel-ConratandR.C.Williams,Proc.Natl.Acad. no fixedchirality there areMc =(c4+2c3+3c2+2c)/8
necklaces [19]. In general we will assume fixed chirality.
Sci. U.S.A,41 690 (1955).
[44] Heterogeneous interfaces are double-counted as, unlike
[11] A.Zlotnick, J. Mol. Biol. 241, 59 (1994).
homogeneous interfaces, theyrequire two colours.
[12] E. Winfree, F. Liu, L. A. Wenzler, and N. C. Seeman,
Nature394, 539 (1998).