Table Of ContentSurprise maximization reveals the
community structure of complex networks
SUBJECTAREAS:
RodrigoAldecoa&IgnacioMar´ın
BIOINFORMATICS
DATAINTEGRATION
InstitutodeBiomedicinadeValencia.ConsejoSuperiordeInvestigacionesCient´ıficas(IBV-CSIC).CalleJaimeRoig11,46010.
APPLIEDMATHEMATICS
Valencia.Spain.
STATISTICS
Howtodeterminethecommunitystructureofcomplexnetworksisanopenquestion.Itiscriticalto
Received establishthebeststrategiesforcommunitydetectioninnetworksofunknownstructure.Here,using
25September2012 standardsyntheticbenchmarks,weshowthatnoneofthealgorithmshithertodevelopedforcommunity
structurecharacterizationperformoptimally.Significantly,evaluatingtheresultsaccordingtotheir
Accepted modularity,themostpopularmeasureofthequalityofapartition,systematicallyprovidesmistaken
28December2012 solutions.However,anovelqualityfunction,calledSurprise,canbeusedtoelucidatewhichistheoptimal
divisionintocommunities.Consequently,weshowthatthebeststrategytofindthecommunitystructureof
Published
allthenetworksexaminedinvolveschoosingamongthesolutionsprovidedbymultiplealgorithmstheone
14January2013 withthehighestSurprisevalue.WeconcludethatSurprisemaximizationpreciselyrevealsthecommunity
structureofcomplexnetworks.
Correspondenceand
requestsformaterials Theanalysis of networkshas profound implicationsinvery differentfields, from sociologytobiology1–5.
Oneofthemostinterestingfeaturesofanetworkisitscommunitystructure6,7.Communitiesaregroups
shouldbeaddressedto
ofnodesthataremorestronglyorfrequentlyconnectedamongthemselvesthanwiththeothernodesof
I.M.([email protected].
the network. The best way to establish the communities present in a network is an open problem. Two
es)
related questions are still unsolved. First, which is the best algorithm to characterize networks of known
community structure. Second, how to evaluate algorithm performance when the community structure is
unknown. The first question requirestesting the algorithms in benchmarkscomposed of complex networks
wherethecommunitystructureisestablishedapriori.Inthesebenchmarks,ithasbeenfoundthatalgorithm
performance depends on how different is the density of intracommunity links from the average density of
links in the network. In addition, it has been determined that most algorithms perform well when the
networks are small and the communities have similar sizes, but many perform quite poorly in benchmarks
composed oflargenetworkswithmany communities ofheterogeneous sizes8–18.Thus,benchmarkswiththe
latter features have become crucial to rank algorithm performances. Among them, the Lancichinetti-
Fortunato-Radicchi (LFR) benchmarks11–18 and the Relaxed Caveman (RC) benchmarks14,18–20 have shown
tobeparticularlyuseful.Bothbenchmarksposeasterntestforalgorithmsthatdealpoorlywiththepresence
ofmanycommunities,ofsmallcommunities orofamixtureofcommunitiesofdifferentsizes(see e.g.refs.
11, 13 and 14).
Thesecondquestion,howtodeterminethebestperformancewhenthecommunitystructureisunknown,
involves devising an independent measure ofthequality ofapartition into communities that canbe reliably
appliedtoanytypeofnetwork.Thefirstandstilltodaymostpopularsuchmeasurewascalledmodularity21(often
abbreviatedasQ).Modularitycomparesthenumberoflinkswithineachcommunitywiththeexpectednumber
oflinksinarandomgraphofthesamesizeandsamedistributionofnodedegreesandthenaddsthedifferences
betweenexpectedandobservedvaluesforallthecommunities.Itwasproposedthattheoptimalpartitionofa
networkcouldbefoundbymaximizingQ21.However,itwaslaterdeterminedthatmodularity-basedevaluations
areoftenincorrectwhensmallcommunitiesarepresentinthenetwork,i.e.Qhasaresolutionlimit22.Several
other works have found additional, subtle problems caused by using modularity maximization to determine
networkcommunitystructure17,23–26.AlltheseresultssuggestthatusingQprovidesincorrectanswersinmany
cases.
We recently suggested an alternative global measure of performance, which we called Surprise14. Surprise
assumes as a null model that links between nodes emerge randomly. It then evaluates the departure of the
observedpartitionfromtheexpecteddistributionofnodesandlinksintocommunitiesgiventhatnullmodel.
Todoso,itusesthefollowingcumulativehypergeometricdistribution:
SCIENTIFICREPORTS |3:1060|DOI:10.1038/srep01060 1
www.nature.com/scientificreports
(cid:2)M(cid:3) (cid:2)F{M(cid:3) justafewalgorithms.Giventhatotheralgorithmscouldprovideeven
MiXnðM,nÞ j n{j higherSvalues,itwasunclearhowoptimaltheseresultswere.
S~{log (cid:2)F(cid:3) ð1Þ Here,wetestthebeststrategiescurrentlyavailabletocharacterize
j~p thestructureofcomplexnetworksandwecomparethemwiththe
n results provided by Surprise maximization in both LFR and RC
benchmarks. We first show that none among a large number of
WhereFisthemaximumpossiblenumberoflinksinanetwork([k22
state-of-the-artalgorithmsworkconsistentlywellinallthesecom-
k]/2,beingkthenumberofunits),nistheobservednumberoflinks,
plex benchmarks. Particularly, all modularity-based heuristics
M is the maximum possible number of intracommunity links for a
behave poorly. Also, we demonstrate that evaluating the perform-
given partition, and p is the total number of intracommunity links
anceofanalgorithmusingmodularityisincorrect.Wethenshow
observedinthatpartition14.Usingacumulativehypergeometricdis-
that a simple meta-algorithm, which consists in choosing in each
tributionallowstocalculatetheexactprobabilityofthedistributionof
network the algorithm that maximizes Surprise, very efficiently
linksandnodesinthecommunitiesdefinedforthenetworkbyagiven
determines the community structure of all the networks tested.
partition.Thus,Smeasureshowunlikely(or‘‘surprising’’,hencethe
This method clearly performs better than any of the algorithms
name of the parameter) is that distribution. In previous studies, we
devisedsofar.WeconcludethatSurprisemaximizationisthestrat-
showed that Surprise improved on modularity in standard bench-
egyofchoiceforcommunitycharacterizationincomplexnetworks.
marksandthatchoosingalgorithmswithhighSvaluesleadstoaccur-
ate community structurecharacterization14,18. Although theseresults
were encouraging,whether S maximization could be used to obtain Results
optimalpartitionswasnotrigorouslytested.Thiswasduetothefact In order to determine the performance of different algorithms for
that Surprise values wereestimated from the partitions provided by community structure characterization, we explored two standard
Figure1|Globalperformanceofthealgorithms. (a)BehaviorofthealgorithmsintheLFRbenchmark.Toobtainthisfigure,thealgorithmswerefirst
orderedaccordingtotheVIresultsobtainedforeachmcondition.Then,weplottedtheresultsforthealgorithmwiththebestVIvalue(blackline,
indicatedwith‘‘1’’),theaverageofthetopfivealgorithms(redline),averageofthetoptenones(blueline)oraverageforallthe18algorithms(greenline).
Thegreyregioncorrespondstothevaluesofm(0.1–0.7)chosentoperformthemaincomparativeanalyses(seetext).Beyondthatregion,eventhebest
algorithmsobtainVIvaluesconsiderablyhigherthanzero,meaningthattheoriginalstructureofthenetworkhasbeensignificantlymodifiedbythe
increaseinintercommunitylinks.(b)AnexampleshowingthefivelargestcommunitiesinaLFRnetwork(5000units)whenm50.1.Nodesare
distributedintotwodimensionswithaspring-embeddedalgorithm46anddrawnusingCytoscape47.Communitiesarewell-isolatedgroups.(c)Thefive
largestcommunitieswhenm50.7.Theyarebarelydistinguishableinthisrepresentationbecausethemixingoflinkswasquiteextreme.However,several
algorithmswerestilloftenabletodetectthesefuzzycommunities.(d)–(f):ThesameresultsfortheRCbenchmark(512units).Paneledepictsthefive
largestcommunitieswhenR510%andPanelftothesamecommunitieswhenR550%.Again,noticeinpaneld)thesharpincreaseinVIvalueswhenR
.50%.AnextremedegreeofsuperimpositionamongcommunitiesisobservedalreadywhenR550%(f).IntheLFRbenchmark,therapidincreaseinVI
valueswhentheintercommunitylinksgoesfromm50.7to0.8(Panela)isexplainedbyallcommunitiesbeingofsimilarsizes.Therefore,theyare
destroyedataboutthesametime.Onthecontrary,themoreprogressiveincreaseinVIwhenRgrows,whichweobservedinPaneld,isduetothe
heterogeneoussizesofthecommunitiespresentinthatbenchmark,whichbreakdownatdifferenttimes.
SCIENTIFICREPORTS |3:1060|DOI:10.1038/srep01060 2
www.nature.com/scientificreports
benchmarks,anLFRbenchmarkwith5000unitsandanRCbench- networkswith10%#R#50%(again,100realizationsperRvalue,
markwith512units(seeMethods).VariationofInformation(VI) foratotalof500differentnetworks).Theseconditionsgeneratesome
wasusedtodeterminethedegreeofcongruencebetweentheparti- communitystructuresthatareverydifficulttodetect(Figure1).
tionsintocommunitiessuggestedby18differentalgorithmsandthe Figure2summarizestheindividualperformanceofthealgorithms
realcommunitystructurepresentinthenetworks.Aperfectcongru- accordingtothreeglobalmeasuresofpartitionquality.Thefirstone
encecorrespondstoavalueVI50.Figures1aand1ddisplaythe isVI,thegoldstandardforalgorithmperformanceinthesebench-
generalresultsobtainedinthetwobenchmarks.AsharpVIincrease marks.Theothertwo,alreadymentionedabove,areSurprise(S)and
wasfoundwhenthecommunitystructurewasweakenedbyhighly modularity(Q).Theperformancevaluesmeasuredaccordingtothe
increasingthenumberofintercommunitylinks,asoccurswhenthe VIscoresshowninFigure2indicatetwoveryimportantfacts.On
mixingparameter(m)oftheLFRbenchmarkhasvaluesabove0.7or onehand,noneofthealgorithmswasthebestinallLFRorinallRC
therewiringparameterRoftheRCbenchmarksishigherthan50% networks.Ontheotherhand,thebestalgorithmsinLFRnetworks
(see also Methods for the precise definitions of m and R). These oftenperformedpoorlyinRCnetworks,andviceversa(seee.g.the
resultsmeanthat,abovem50.7orR550%,thecommunitystruc- results of RB, LPA or RNSC in Figure 2). This can be rigorously
tureoriginallypresentinthenetworkswassubstantiallyaltered.In shownbyorderingwithineachbenchmarkthealgorithmsaccording
suchcases,wecouldnotdeterminewhetherthepartitionssuggested totheirperformance,assigningarank,frombesttoworst,andcom-
bythealgorithmswerecorrectornot:therewouldnotbeaknown paringtheranksinbothbenchmarks.WefoundthatKendall’snon-
structurewith which to compare. Thus, wedecided to restrict our parametriccorrelationcoefficientfortheserankswasveryweak,just
subsequentanalysestotheLFRnetworkswith0.1#m#0.7(100 t50.31(p50.04,one-tailedcomparison).Weconcludethatusing
realizationspermvalue,givingatotalof700networks)andtheRC single algorithms for community characterization is inadvisable,
Figure2|PerformanceofthealgorithmsaccordingtoVariationofInformation(VI),Surprise(S)andModularity(Q)inLFRandRCbenchmarks.
Averageperformanceandstandarderrorsofthemeanareshown.Performancevalueswereobtainedbythefollowingmethod:1)theVI,SorQvaluesof
thepartitionsprovidedbythe18algorithmsineachofthenetworks(i.e.700valuesforLFRbenchmarks,500valuesinRCbenchmarks)wereestablished;
2)Foreachnetwork,thealgorithmswereassignedarankaccordingtotheirperformance(15optimal,185worse);identicalranksweregiventotied
algorithms(i.e.theranksthatwouldcorrespondtoeachofthemweresummedupandthendividedbythenumberoftiedalgorithms);and,3)
Performancewascalculatedas18–averagerank,meaningthat17isthemaximumpossiblevaluethatwouldobtainanalgorithmthatoutperformstherest
inallnetworks,and0equalstobeingtheworstinallnetworks.
SCIENTIFICREPORTS |3:1060|DOI:10.1038/srep01060 3
www.nature.com/scientificreports
giventhattheirperformanceisstronglydependentontheparticular maximizeQ(Blondel;EO,MLGC,MSG1VMandCNM27–31)and
structureofthenetwork. thealgorithmsthatuseQtoevaluatethequalityoftheirpartitions
If we focus now on the Surprise (S) and modularity (Q) results (Walktrap,DM32,33)werepoorperformers(Figure2).
showninFigure2,anothertwostrikingfactsbecomeapparent.First, IfindeedmaximizationofSurpriseisanoptimalstrategyforcom-
therewasaverystrongcorrelationbetweentheperformanceofthe munitycharacterization,asitsstrongcorrelationwithVIsuggests,
algorithmsaccordingtoVIandaccordingtoS.Kendall’scorrelation then it should be possible to improve on the results of any single
coefficient for the ranks of the performances of the algorithms algorithmbysimplypickingupamongmanyalgorithmstheonethat
ordered according to VI and to S values is t 5 0.91 in the LFR generatesthehighestSvalue(S )ineachparticularnetwork.Also,
max
benchmarks (p 5 4.9 10211, one-tailed comparison) and t 5 0.83 thisS-maximizationstrategyshouldprovideVIvaluesverycloseto
intheRCbenchmark(p51.41028,one-tailedtest).Theseresults zero in our benchmarks. These two expectations are fulfilled, as
demonstratethatSisanexcellentmeasureoftheglobalqualityofa shown in Figure 3. The top panel (Figure 3a) demonstrates that
division into communities, confirming and extending the conclu- choosing in each particular case the algorithm with the highest S
sions of one of our previous works14. Second, the performance of valueisbetterthanselectinganyofthestate-of-the-artalgorithms
thealgorithmsevaluatedusingQonlyweaklycorrelatedwiththeir tested.ItisremarkablethattheS valuesinFigure3aderivedfrom
max
performanceaccordingtoVIintheLFRbenchmarks(Kendall’st thecombinedresultsofasmanyas7algorithms(CPM,Infomap,RB,
LFR
50.29,p50.048,one-tailedtest)andthesetwomeasuresdidnot RN,RNSC,SClusterandUVCluster16,20,34–38).Inaddition,Figure3b
significantlycorrelateintheRCbenchmarks(t 50.27,p50.066, indicatesthatthesumoftheaverageVIvaluesobtainedusingS in
RC max
againone-tailedtest).Theseresultsindicatethatevaluatingthequal- the1200networksanalyzed(withm50.1–0.7andR510–50%)
ityofapartitionaccordingtoitsmodularityisinappropriate.Itwas werejustslightlyabovezero,i.e.almostoptimal.Theaveragevalues
therefore logical to find out that both the algorithms devised to were0.00260.000intheLFRbenchmarksand0.10060.007inthe
Figure3|Asimplemeta-algorithmbasedonSurprisemaximizationimprovesoverallknowncommunitydetectionalgorithms. (a)Performances
(calculatedasinFigure2)forallthealgorithmsarecomparedinboththeLFRandtheRCbenchmarkswiththeperformanceofastrategythatconsistsin
pickingupthealgorithmthatprovidesthehighestSvalue(S ).(b)FortheS strategy,theaverageVIvaluesforthe1200networksanalyzedarevery
max max
closetozero,i.e.analmostoptimalperformance.
SCIENTIFICREPORTS |3:1060|DOI:10.1038/srep01060 4
www.nature.com/scientificreports
RC benchmarks. We may ask why these VI values are not exactly
zero,giventhatVI50wouldbeexpectedforaperfectglobalmea-
sure.Wedetectedtworeasonsforthisminordiscrepancy.Thefirst
reasonwasthat,insomecases(mainlyintheRCbenchmarkwithR
550%),theavailablealgorithmsfailedtoobtainthehighestpossible
S values. We found that the S values expected assuming that the
originalcommunitystructureofthenetworkwasintact(S )were
orig
often higher than S (Table 1). This obviously means that these
max
algorithmsdidnotfoundthecommunitystructurethatmaximizesS.
Thatstructurecouldstillbetheoriginalone--whichindeedhasthe
highestSvalueobservedsofarinouranalyses--orsomealternate
structure,butclearlynotanyofthosefoundbythealgorithms,which
hadlowerSvalues.Thesecondreasonobservedwasthepresenceof
minorchangesincommunitystructurethatoccurredinsomenet-
workswhenintercommunitylinksincreased.Thus,theexactoriginal
structureofthenetworkwasnotpresentanymore.Thiswasdeduced
fromthefactthatS valuesweresometimesslightlyhigherthan
max
S both in the LFR benchmarks with m 5 0.6 – 0.7 and the RC
orig
benchmarkswithR510–40%(Table1).Theseresultssuggested
thatthealgorithmsobtainedoptimalpartitions,buttheywereabit
differentfromtheoriginalones.Toestablishthatfact,weexamined
the23caseswhereS .S intheRCbenchmarkswithR510%.
max orig
WefoundthatthepartitionswithS valuesgenerallydifferedfrom
max
theoriginalstructuresinoneofthesmallestcommunitieshavinglost
single units (Supplementary Table 1; see example in Figure 4).
Significantly,inthose23networkswealwaysfoundjustonepartition
withS .S andseveralalgorithmsoftenrecoveredexactlythat
max orig
same partition (Supplementary Table 1). All these results indicate
thatreal, smallchanges in community structureoccurred in those
networks,suggestingthatthepartitionswithS .S valueswere Figure4|WhenVIandSurprisemaximumvaluesdonotcoincide,the
max orig
indeedoptimal.FromthedatainTable1,wealsoobtainanindirect differenceisoftenduetominimalchangesinthecommunitystructureof
validationofourdecisionofusingtheLFRbenchmarkswithm#0.7 thenetwork. ThisisanexamplefromtheRCbenchmarkwhereSmax.
andtheRCbenchmarkswithR#50%toevaluatealgorithmper- Sorig(seetext).a):originalstructure.b):afterR510%hasbeenapplied.
formance.AsshowninthatTable,uptothoselimits,theSmaxand Smaxisobtainedwhenasingleunit(square)isclassifiedasbeingisolated
S valuesarenotsignificantlydifferent,while,beyondthoselimits, fromitsoriginal4-nodescommunity(highlighted).Asshowninpanelb),
orig
verysignificantdifferencesarefound.Thismeansthattheoriginal thecriticalunithasbecomealmostfullyseparatedfromtherestofthe
structures, or structures almost identical to them, were indeed nodesinitsoriginalcommunity,onlyonelinkremains,whileithasbeen
connectedtomanynodesinothercommunities.
Table1|AverageS andS valuesintheLFRandRCbenchmarks.Statisticalsignificance(p)wasestimatedusingatwo-tailedStudentt
orig max
test. ns: non-significant differences. In italics, the benchmarks containing quasi-random networks, discarded for the main analyses
(summarizedinFigures2and3),butincludedintheanalysesshowninFigure5,below
LFRbenchmark
m S S p
orig max
0.1 99065.696111.50 99065.696111.50 ns
0.2 82631.18693.92 82631.18693.92 ns
0.3 67847.35690.78 67847.35690.78 ns
0.4 54354.47676.71 54354.47676.71 ns
0.5 41991.16648.70 41991.16648.70 ns
0.6 30807.18640.09 30807.38640.09 ns
0.7 20563.37626.92 20570.70626.78 ns
0.8 11598.83617.91 10168.11628.15 ,0.0001
0.9 4204.5067.62 8368.9464.21 ,0.0001
RCbenchmark
R S S p
orig max
10 19012.72667.33 19012.94667.32 ns
20 13505.84634.14 13506.72634.11 ns
30 9298.98611.88 9301.12611.88 ns
40 6013.6963.92 6017.5864.09 ns
50 3487.65611.54 3479.92612.99 ns
60 1647.42613.82 1540.76616.79 ,0.0001
70 475.35610.42 899.9667.98 ,0.0001
80 11.8461.52 963.7369.42 ,0.0001
90 0.0060.00 1003.2169.95 ,0.0001
SCIENTIFICREPORTS |3:1060|DOI:10.1038/srep01060 5
www.nature.com/scientificreports
Figure5|Theperformanceofthealgorithmsinthelimitcases(m50.7,R550%)andbeyondthoselimits(m50.8–0.9,R560–90%)arecorrelated.
Astatisticallysignificantcorrelationwasfound,despitethefactthatsomealgorithms,suchasInfomaporLPA,totallycollapsed.Thesealgorithms
establishedpartitionsconsistinginasinglecommunity,whichledtoVI50whencomparedwiththeoriginaldistribution.
presentinthenetworksexaminedtogeneratetheresultssummarized
inFigures2and3,whichpreciselywastheonlyconditionrequired
forareliablemeasureofalgorithmperformance.
TheimportantresultsdescribedinFigures2and3indicatethatS
maximizationshouldallowdeterminingwithaveryhighprecision
thecommunitystructureofanynetwork.Wehaveexploredwhether
this may be the case even when the community structure is very
poorly defined by analyzing the results of our 600 additional net-
works,correspondingtotheLFRbenchmarkwithmixingparameter
m50.8andm50.9(i.e.200networks)andtheRCbenchmarkwithR
5 60% to R 5 90% (400 networks). As indicated above, in these
networks,theVI-basedoptimalitycriterion(i.e.VI50meansfind-
ing the original community structure) cannot be confidently used
(Figure1;Table1).However,alternative,unknownstructuresmay
bepresentthatthealgorithmsshouldbeabletodetect.Ifthisisthe
case,areasonablepredictionisthatthealgorithmsthatareproviding
the maximum S values in the conditions that are closest to those
extremeones(i.e.whenm50.7intheLFRbenchmarks andR5
50%intheRCbenchmarks)shouldalsoprovidethebestSvaluesin
the most extreme networks. Figure 5 shows that there is indeed a
goodcorrelationbetweentheresultsobtainedinthelimitcasesand
in the most extreme cases. Kendall’s non-parametric correlation
coefficients for the ranks of the algorithms in the limit networks
and in the most extreme networks are significant in both theLFR
andRCbenchmarks(t 50.42;p50.007andt 50.49,p5
LFR RC
0.020, one-tailed tests). This occurs despite some algorithms, as
InfomaporLPA34,39,totallyfailinginthesequasi-randomnetworks
(Figure5).UVCluster,RB,CPMandSCluster16,20,35,38emergeasthe
bestalgorithmstocharacterizethestructureinnetworkswithpoorly
definedcommunities,ingoodagreementwithpreviousresults14,16.
Wedecidedtoperformsomefinalteststodeterminewhetherthe
limitationsthataffectQwhencommunitiesareverysmallmayalso
affectS.Forthispurpose,weusedtwoextremenetworksofknown
structure suggested before17,22,40. The first one includes just three Figure6|Twoextremenetworksdesignedtotestthebehaviorof
communities,oneofthemverylarge(400nodeswithaveragedegree Surprisewhensmallcommunitiesarepresent. (a)Anetworkwiththree
5100)andtheothertwomuchsmaller(cliquesof13nodes).These communities(sizes400,13,13).Thenodesofthelargestcommunityhave
three communities are connected by single links (Figure 6a). We anaveragedegreeof100,whilethenodesinthetwosmallestcommunities
foundinthisnetworkthatthemaximumvalueofQ(EOalgorithm, formcliques.Thethreecommunitiesareinterconnectedbysinglelinks,as
Q50.0836)didnotcorrespondtoapartitionintothethreenatural shown.(b)Cliques,eachonewithfivenodes,whichareconnectedalsoby
communities.Onthecontrary,andasalreadynotedbyotherauthors singlelinksinawaythatcanbedepictedasaring.Thefigureshowsan
insimilarcases17,22,Qindicatesamistakensolution,inthiscasewith examplewitheightcliques,butthatnumberwasprogressivelyincreasedto
fivecommunities.Ontheotherhand,thethreecommunitieswere determinewhetherthepartitionwithhighestSstillcorrespondedtothe
correctlyfoundbymultiplealgorithms(CPM,Infomap,LPA,RNSC, oneinwhicheachcliqueswasanindependentcommunity.
SCIENTIFICREPORTS |3:1060|DOI:10.1038/srep01060 6
www.nature.com/scientificreports
SClusterandUVCluster),andthispartitionindeedcorrespondedto maximization strategy was notdetermined. Here, weextend those
the maximum value of S (1230.73). The second extreme type of results,toshowthatusingSmaximizationleadstoanalmostperfect
network was precisely the ring of cliques in which the resolution performance. We were very close to solve the correct community
limitofQwasfirstdescribed22,whichisschematizedinFigure6b. structureofallthenetworksofthesetwobenchmarks,asisstrikingly
Here,avariablenumberofcliques,eachonecomposedoffiveunits, demonstratedbytheS resultsshowninFigure3b.Itissignificant
max
wereconnectedtoeachotherbysinglelinkstoformaring.Wewere thattheywereobtainedbycombiningresultsofthe7algorithmswith
interestedindeterminingwhether,evenifweincreasethenumberof thebestaverageperformances,asdetailedinFigure3a:RN,SCluster,
cliques,asolutioninwhichallcliquesareseparatelydetectedalways Infomap,CPM,RNSC,UVClusterandRB.Anotherimportantresult
has a better S value than one in which pairs of cliques are put issummarizedinFigure5,whichindicatesthatSurprisecanalsobe
together.Wetestednetworksofsizesupto1millionnodes,finding usedincasesinwhichthecommunitystructureissoblurredasto
thatthebestpartitionwasalwaystheoneinwhichthecliquesare become almost random. Given these results, we conclude that
consideredindependentcommunities.Onthecontrary,whenQis Surpriseistheparameter ofchoicetocharacterize thecommunity
used,thecliquesareconsideredindependentunitsonlyifthenet- structure of complex networks. Future works should use Surprise
worksizeissmallerthan150units. maximization,insteadofmodularitymaximizationorothermeth-
ods,toestablishthatstructure.
Discussion Itissignificantthatonlytwoalgorithms(SClusterandUVCluster)
Ourresultsleadtotwomainconclusions.Thefirstoneisthatnoneof usethemaximizationofSurprisetochooseamongpartitionsgener-
thealgorithmscurrentlyavailablegeneratesoptimalsolutionsinall atedbyconsensushierarchicalclustering20,38.Thismayexplaintheir
networks(Figures2,3).Infact,thereisjustaweakcorrelationofthe good average results (Figures 2, 3, 5). However, no available algo-
algorithmperformancesinthetwostandardbenchmarksusedinthis rithmperformssearchestodirectlydeterminethemaximalSurprise
study.Moreprecisely,wecansaythattherearesomealgorithmsthat values. That type of algorithms could overcome the limitations
clearlyfailinbothbenchmarksandtheresttendtoperformmuch detectedinallthecurrentlyavailableones,potentiallyallowingthe
betterinoneofthebenchmarksthanintheother(seeFigure2).Most characterizationofoptimalpartitionseveninthemostdifficultnet-
ofthebestoverallperformerswerealreadyfoundtobeoutstanding works.
inotherstudies12–14,16,18,36.TheexceptionisRNSC37,whichhadnot WemayaskwhySurpriseisabletoevaluatewithsuchefficiency
been tested in depth before. Among the ones that always perform the quality of a given partition, while modularity cannot. In our
poorlyareallthealgorithmsthatusemodularityaseitheraglobal opinion,thedifferencerestsonthefactthatmodularityisbasedon
parametertomaximizeorasawaytoevaluatepartitions.Thisfact, an inappropriate definition of community. Newman and Girvan21
togetherwiththedemonstrationthatQdoesnotcorrelatewithVIin verbally defined a community as a region of a network in which
networksofknownstructure(Figure2),andalsothegoodperform- the density of links is higher than expected by chance. However,
ance of S, including its ability to cope with extreme networks in theprecisemathematicalmodelusedtodeducethemodularityfor-
which Q traditionally fails due to its resolution limit (Figure 6), mula implies a definition of community that does not take into
shoulddefinitelydeterresearchersfromusingmodularity.Astrong accountthenumberofnodesrequiredtoachievesuchahighden-
corollary is that it is advisable a reevaluation of the hundreds of sity21.Bynotevaluatingthenumberofnodes,modularityfallspreyof
papers--infieldsasvariedassociology,ecology,molecularbiology aresolutionlimit:smallcommunitiescannotbedetected17,22.Onthe
ormedicine--whicharebasedonmodularityanalyses. otherhand,Surpriseanalysesoftenchooseasbestasolutionwhere
Thesecond,andmostimportant,conclusionisthatthecommun- somecommunitiesarejustisolatedunits(seeexamplesinFigure4
itystructureofanetworkcanbedeterminedbymaximizingS,for and Ref. 14). This happens because the Surprise formula precisely
examplebysimplytakingtheresultsofasmanyalgorithmsaspos- evaluatesnotonlythenumberoflinks,butalsothenumberofnodes
sibleandchoosingtheonethatprovidespartitionswiththehighest withineachcommunity.Forinstance,incorporatingasinglepoorly
Surprisevalue.Inapreviouspaper,weshowedthatSurprisecanbe connectedunitintoacommunityisoftenforbiddenbythefactthat
usedtoefficientlyevaluatethequalityofapartition,behavingmuch suchincorporationsharplyincreasesthenumberofpotentialintra-
better than modularity14, but the precise performance of the S- community links (all those that might connect the units already
Table2|Detailsofthealgorithmsusedinthisstudy.Asummaryofthestrategiesimplementedbythealgorithmsandthecorresponding
referencesareindicated
NameoftheAlgorithm Strategyusedbythealgorithm References
Blondel Multilevelmodularitymaximization [27]
CNM Greedymodularitymaximization [31]
CPM MultiresolutionPottsmodel [16]
DM Spectralanalysis1modularitymaximization [33]
EO Modularitymaximization [28]
HAC MaximumLikelihood [43]
Infomap Informationcompression [34]
LPA Labelpropagation [39]
MCL Simulatedflow [44]
MLGC Multilevelmodularitymaximization [29]
MSG1VM Greedymodularitymaximization1refinement [30]
RB MultiresolutionPottsmodel [35]
RN MultiresolutionPottsmodel [36]
RNSC Neighborhoodtabusearch [37]
SAVI Optimalpredictionforrandomwalks [45]
SCluster ConsensusHierarchicalClustering1Surprisemaximization [20]
UVCluster ConsensusHierarchicalClustering1Surprisemaximization [20,38]
Walktrap Randomwalks1modularitymaximization [32]
SCIENTIFICREPORTS |3:1060|DOI:10.1038/srep01060 7
www.nature.com/scientificreports
presentinthecommunitywiththenewunit)whilebarelyincreasing 7. Labatut,V.&Balasque,J.-M.Detectionandinterpretationofcommunitiesin
thenumberofrealintracommunitylinks.ThisleadstoanSvalue complexnetworks:practicalmethodsandapplication.In:Abraham,A.&
Hassanien,A.-E.(eds.)ComputationalSocialNetworks:Tools,Perspectivesand
muchsmallerthaniftheunitiskeptseparated.Itisalsosignificant
Applicationspp79–110.(Springer,2012).
thatageneralproblemofmodularitymaximizationandotherrelated 8. Girvan,M.&Newman,M.E.J.Communitystructureinsocialandbiological
algorithms--asthosebasedonPottsmodels withmultiresolution networks.Proc.Natl.Acad.Sci.USA99,7821–7826(2002).
parameters--isthattheycannotfindaperfectequilibriumbetween 9. Danon,L.,Duch,J.,Diaz-Guilera,A.&Arenas,A.Comparingcommunity
structureidentification.J.Stat.Mech.P09008(2005).
merging and splitting communities17,25,26. In these methods, each
10.Danon,L.,Diaz-Guilera,A.&Arenas,A.Theeffectofsizeheterogeneityon
community is evaluated independently, one at a time. The global communityidentificationincomplexnetworks.J.Stat.Mech.P11010(2006).
valuetobemaximizedisthesumofthequalitiesoftheindividual 11.Lancichinetti,A.,Fortunato,S.&Radicchi,F.Benchmarkgraphsfortesting
communities.However,incomplexnetworkswithcommunitiesof communitydetectionalgorithms.Phys.Rev.E78,046110(2008).
12.Lancichinetti,A.&Fortunato,S.Communitydetectionalgorithms:acomparative
verydifferentsizes,itmaybeoftenimpossibletofindasinglerule
analysis.Phys.Rev.E80,056117(2009).
(evenusingatunableparameter,asinthesemultiresolutionmeth- 13.Orman,G.K.,Labatut,V.&Cherifi,H.Qualitativecomparisonofcommunity
ods) to split some communities while keeping intact the rest17. detectionalgorithms.In:Cherifi,H.,Zain,J.M.&El-Qawasmeh,E.(eds.)
Surprise analyses are not affected by this problem, because com- DICTAP(2),vol.167ofCommunicationsinComputerandInformationScience
265–279(Springer,2011).
munitiesarenotdefinedindependently,onebyone,butemergeas
14.Aldecoa,R.&Mar´ın,I.DecipheringnetworkcommunitystructurebySurprise.
regionsofnodesstatisticallyenrichedinlinks,accordingtothegen- PLoSONE6,e24195(2011).
eralfeatures(i.e.thetotalnumberofnodesandlinks)ofthewhole 15.Ronhovde,P.&Nussinov,Z.Multiresolutioncommunitydetectionformegascale
network. networksbyinformation-basedreplicacorrelations.Phys.Rev.E80,016109
(2009).
16.Traag,V.A.,VanDooren,P.&Nesterov,Y.Narrowscopeforresolution-free
Methods communitydetection.Phys.Rev.E84,016114(2011).
17.Lancichinetti,A.&Fortunato,S.Limitsofmodularitymaximizationin
Wesearchedtheliteraturetoselectthebestcommunitydetectionalgorithmsavail-
communitydetection.Phys.Rev.E84,066122(2011).
abletoanalyzenetworkswithunweighted,undirectedlinks.Ourfinalresultsare
18.Aldecoa,R.&Mar´ın,I.Closedbenchmarksfornetworkcommunitystructure
basedon18ofthem(summarizedinTable2).Algorithmsknowntobehavepoorlyin
characterization.Phys.Rev.E85,026109(2012).
similarbenchmarksorspecificallydesignedtocharacterizecommunitieswithover-
19.Watts,D.J.SmallWorlds:TheDynamicsOfNetworksBetweenOrderAnd
lappingnodeswerediscarded.Someotheralgorithmsthatseemedinterestingbutwe
Randomness.(PrincetonUniversityPress,1999).
wereunabletotestfordiversereasons(e.g.theywerenotprovidedbytheauthors,did
20.Aldecoa,R.&Mar´ın,I.Jerarca:Efficientanalysisofcomplexnetworksusing
notcompletethebenchmarks,etc.)aredetailedinSupplementaryTable2.Weper-
hierarchicalclustering.PLoSONE5,e11585(2010).
formedextensivetestswiththeseselectedalgorithms,usingtheirdefaultparameters,
21.Newman,M.E.J.&Girvan,M.Findingandevaluatingcommunitystructurein
intwoverydifferentbenchmarks.Theywerechosenbothdifficultandverydissim-
networks.Phys.Rev.E69,026113(2004).
ilar,withtheideathattheresultscouldbegeneralenoughastobeextrapolatedto
22.Fortunato,S.&Barthe´lemy,M.Resolutionlimitincommunitydetection.Proc.
networksofunknownstructure.ThefirstwasastandardLFRbenchmarkalready
Natl.Acad.Sci.USA104,36–41(2007).
usedinotherstudiesthatcomparedalgorithms12–14,17.Itiscomposedofnetworkswith
23.Good,B.H.,deMontjoye,Y.-A.&Clauset,A.Performanceofmodularity
5000nodes,structuredinsmallcommunitieswith10–50nodes.Thedistributionof
maximizationinpracticalcontexts.Phys.Rev.E81,046106(2010).
nodedegreesandcommunitysizesweregeneratedaccordingtopowerlawswith
24.Bagrow,J.P.Communitiesandbottlenecks:Treesandtreelikenetworkshavehigh
exponents22and21,respectively.Thesizesofthecommunitiesinthenetworksof
modularity.Phys.Rev.E85,066118(2012).
thisbenchmarkhaveaveragePielou’sindexes41withavalueof0.98.Thisindexis
25.Xiang,J.&Hu,K.Limitationofmulti-resolutionmethodsincommunity
equalto1whenallcommunitiesareofthesamesize.Thechiefdifficultyofthis
detection.PhysicaA391,4995–5003(2012).
benchmarkthusliesonthepresenceofmanysmallcommunities.Thesecond
26.Xiang,J.etal.Multi-resolutionmodularitymethodsandtheirlimitationsin
benchmarkwasoneoftheRelaxedCaveman(RC)type,verysimilartotheonesused
communitydetection.Eur.Phys.J.B85,352(2012).
inourpreviousworks14,18.ThenetworksinthisRCbenchmarkhave512unitsand16
27.Blondel,V.D.,Guillaume,J.-L.,Lambiotte,R.&Lefebvre,E.Fastunfoldingof
communities,withsizesdefinedaccordingtoabroken-stickmodeltoobtainan
communitiesinlargenetworks.J.Stat.Mech.P10008(2008).
averagePielou’sindex50.75.Thismakesthisbenchmarkverydifficult,giventhatit
28.Duch,J.&Arenas,A.Communitydetectionincomplexnetworksusingextremal
consistsofnetworkswithcommunitiesofverydifferentsizes,someofthemvery
optimization.Phys.Rev.E72,027104(2005).
small(seee.g.Figure4).ItwasnotconvenienttoourpurposestouselargerRC
29.Noack,A.&Rotta,R.Multi-levelalgorithmsformodularityclustering.Lect.Notes
benchmarksgiventhatthetotalnumberoflinksinthesenetworksquicklygrows
Comp.Sci.5526,257–268(2009).
whenthenumberofnodesisincreasedandmanyalgorithmsbecometooslow.
30.Schuetz,P.&Caflisch,A.Efficientmodularityoptimizationbymultistepgreedy
Thesetwobenchmarksare‘‘open’’,meaningthattheyhaveatunableparameter
algorithmandvertexmoverrefinement.Phys.Rev.E77,046112(2008).
that,whenincreased,makesthenetworkcommunitystructuretobecomelessandless
31.Clauset,A.,Newman,M.E.J.&Moore,C.Findingcommunitystructureinvery
obviousuntilitshiftstowardsatotallyunknownstructure,potentiallyverydifferent
largenetworks.Phys.Rev.E70,066111(2004).
fromtheoriginaloneandclosetorandom11,13,14,18.Thisparameterincreasesintra-
32.Pons,P.&Latapy,M.Computingcommunitiesinlargenetworksusingrandom
communitylinksandlowersthenumberofintercommunitylinks.Inthecaseofthe
walks.J.GraphAlgorithmsAppl.10,191–218(2006).
LFRbenchmarks,the‘‘mixingparameter’’,m,indicatesthefractionoflinkscon-
33.Donetti,L.&Mun˜oz,M.A.DetectingNetworkCommunities:anewsystematic
nectingeachnodeofacommunitywithnodesoutsideofthecommunity11.FortheRC andefficientalgorithm.J.Stat.Mech.P10012(2004).
benchmarks,wedefinedRewiring(R)asthepercentageoflinksthatisrandomly 34.Rosvall,M.&Bergstrom,C.T.Mapsofrandomwalksoncomplexnetworksreveal
shuffledamongunits.Thus,R510%meansthat10percentofthelinkswerefirst communitystructure.Proc.Natl.Acad.Sci.USA105,1118–1123(2008).
randomlyremovedandthenaddedagain,tolinkrandomlychosennodes. 35.Reichardt,J.&Bornholdt,S.Statisticalmechanicsofcommunitydetection.Phys.
Variationofinformation(VI)42wasusedtomeasuretheagreementbetweenthe Rev.E74,016110(2006).
originalcommunitystructurepresentinthenetworkandthestructurededucedby 36.Ronhovde,P.&Nussinov,Z.Localresolution-limit-freePottsmodelfor
eachalgorithm.TheadvantagesofusingVIhavebeendiscussedinourprevious communitydetection.Phys.Rev.E81,046114(2010).
works14,18.AperfectagreementwithaknownstructurewillprovideavalueofVI50. 37.King,A.D.,Przulj,N.&Jurisica,I.Proteincomplexpredictionviacost-based
Inaddition,twoglobalqualityfunctions,NewmanandGirvan’smodularity(Q)8and clustering.Bioinformatics20,3013–3020(2004).
Surprise(S)14,18,(seeFormula[1]),werealsousedtoevaluatetheresults.Thevaluesof 38.Arnau,V.,Mars,S.&Mar´ın,I.Iterativeclusteranalysisofproteininteractiondata.
SandQforthepartitionsproposedbyeachalgorithmwerecalculatedandthenallthe Bioinformatics21,364–378(2005).
valueswereusedtodeterminethecorrelationsofSandQwithVIandtoestablish 39.Raghavan,U.N.,Albert,R.&Kumara,S.Nearlineartimealgorithmtodetect
thesemaximumvaluesofS. communitystructuresinlarge-scalenetworks.Phys.Rev.E76,036106(2007).
40.Granell,C.,Go´mez,S.&Arenas,A.Hierarchicalmultiresolutionmethodto
overcometheresolutionlimitincomplexnetworks.Int.J.Bifurcat.Chaos22,
1. Wasserman,S.&Faust,K.SocialNetworkAnalysis:MethodsandApplications. 1250171(2012).
(CambridgeUniversityPress,1994). 41.Pielou,E.C.Themeasurementofdiversityindifferenttypesofbiological
2. Strogatz,S.H.Exploringcomplexnetworks.Nature410,268–276(2001). collections.J.Theor.Biol.13,131–144(1966).
3. Baraba´si,A.-L.&Oltvai,Z.N.Networkbiology:understandingthecell’s 42.Meila,M.Comparingclusterings–aninformationbaseddistance.J.Multivar.
functionalorganization.Nat.Rev.Genet.5,101–113(2004). Anal.98,873–895(2007).
4. Costa,L.D.F.,Rodrigues,F.A.,Travieso,G.&Boas,P.R.V.Characterizationof 43.Park,Y.&Bader,J.S.Resolvingthestructureofinteractomeswithhierarchical
complexnetworks:Asurveyofmeasurements.Adv.Phys.56,167(2007). agglomerativeclustering.BMCBioinformatics12,S44(2011).
5. Newman,M.E.J.Networks:AnIntroduction(OxfordUniversityPress,2010). 44.Enright,A.J.,VanDongen,S.&Ouzounis,C.A.Anefficientalgorithmforlarge-
6. Fortunato,S.Communitydetectioningraphs.Phys.Rep.486,75–174(2010). scaledetectionofproteinfamilies.Nucl.AcidsRes.30,1575–1584(2002).
SCIENTIFICREPORTS |3:1060|DOI:10.1038/srep01060 8
www.nature.com/scientificreports
45.E,W.,Li,T.&Vanden-Eijnden,E.Optimalpartitionandeffectivedynamicsof Additionalinformation
complexnetworks.Proc.Natl.Acad.Sci.USA105,7907–7912(2008).
Supplementaryinformationaccompaniesthispaperathttp://www.nature.com/
46.Kamada,T.&Kawai,S.Analgorithmfordrawinggeneralundirectedgraphs.Inf.
scientificreports
Proc.Lett.31,7–15(1989).
47.Smoot,M.E.,Ono,K.,Ruscheinski,J.,Wang,P.-L.&Ideker,T.Cytoscape2.8:new Competingfinancialinterests:Theauthorsdeclarenocompetingfinancialinterests.
featuresfordataintegrationandnetworkvisualization.Bioinformatics27, License:ThisworkislicensedunderaCreativeCommons
431–432(2011). Attribution-NonCommercial-NoDerivs3.0UnportedLicense.Toviewacopyofthis
license,visithttp://creativecommons.org/licenses/by-nc-nd/3.0/
Howtocitethisarticle:Aldecoa,R.&Mar´ın,I.Surprisemaximizationrevealsthe
Acknowledgements communitystructureofcomplexnetworks.Sci.Rep.3,1060;DOI:10.1038/srep01060
(2013).
ThisstudywassupportedbygrantBFU2011-30063(Spanishgovernment).
Authorcontributions
RAandIMconceivedanddesignedtheexperiments.RAperformedtheexperimentsand
analyzedthedata.IMwrotethemainmanuscripttext.Allauthorsreviewedthemanuscript.
SCIENTIFICREPORTS |3:1060|DOI:10.1038/srep01060 9