Optimistic Parallelism Requires Abstractions∗
Milind Kulkarni†, Keshav Pingali
Department of Computer Science, University of Texas, Austin.
{milind,pingali}@cs.utexas.edu

Bruce Walter, Ganesh Ramanarayanan, Kavita Bala‡, L. Paul Chew
Department of Computer Science, Cornell University, Ithaca, New York.
[email protected], {graman,kb,chew}@cs.cornell.edu
Abstract

Irregular applications, which manipulate large, pointer-based data structures like graphs, are difficult to parallelize manually. Automatic tools and techniques such as restructuring compilers and run-time speculative execution have failed to uncover much parallelism in these applications, in spite of a lot of effort by the research community. These difficulties have even led some researchers to wonder if there is any coarse-grain parallelism worth exploiting in irregular applications.

In this paper, we describe two real-world irregular applications: a Delaunay mesh refinement application and a graphics application that performs agglomerative clustering. By studying the algorithms and data structures used in these applications, we show that there is substantial coarse-grain, data parallelism in these applications, but that this parallelism is very dependent on the input data and therefore cannot be uncovered by compiler analysis. In principle, optimistic techniques such as thread-level speculation can be used to uncover this parallelism, but we argue that current implementations cannot accomplish this because they do not use the proper abstractions for the data structures in these programs.

These insights have informed our design of the Galois system, an object-based optimistic parallelization system for irregular applications. There are three main aspects to Galois: (1) a small number of syntactic constructs for packaging optimistic parallelism as iteration over ordered and unordered sets, (2) assertions about methods in class libraries, and (3) a runtime scheme for detecting and recovering from potentially unsafe accesses to shared memory made by an optimistic computation.

We show that Delaunay mesh generation and agglomerative clustering can be parallelized in a straight-forward way using the Galois approach, and we present experimental measurements to show that this approach is practical. These results suggest that Galois is a practical approach to exploiting data parallelism in irregular programs.

Categories and Subject Descriptors D.1.3 [Programming Techniques]: Concurrent Programming - Parallel Programming

General Terms Languages

Keywords Optimistic Parallelism, Abstractions, Irregular Programs

∗ This work is supported in part by NSF grants 0615240, 0541193, 0509307, 0509324, 0426787 and 0406380, as well as grants from the IBM and Intel Corporations.
† Milind Kulkarni is supported by the DOE HPCS Fellowship.
‡ Kavita Bala is supported in part by NSF Career Grant 0644175.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PLDI'07 June 11–13, 2007, San Diego, California, USA.
Copyright © 2007 ACM 978-1-59593-633-2/07/0006...$5.00.

1. Introduction

A pessimist sees the difficulty in every opportunity; an optimist sees the opportunity in every difficulty.
- Sir Winston Churchill

The advent of multicore processors has shifted the burden of improving program execution speed from chip manufacturers to software developers. A particularly challenging problem in this context is the parallelization of irregular applications that deal with complex, pointer-based data structures such as trees, queues and graphs. In this paper, we describe two such applications: a Delaunay mesh refinement code [8] and a graphics application [39] that performs agglomerative clustering [26].

In principle, it is possible to use a thread library (e.g., pthreads) or a combination of compiler directives and libraries (e.g., OpenMP [25]) to write threaded code for multicore processors, but it is well known that writing threaded code can be very tricky because of the complexities of synchronization, data races, memory consistency, etc. Tim Sweeney, who designed the multi-threaded Unreal 3 game engine, estimates that writing multi-threading code tripled software costs at Epic Games (quoted in [9]).

Another possibility is to use compiler analyses such as points-to and shape analysis [5, 31] to parallelize sequential irregular programs. Unfortunately, static analyses fail to uncover the parallelism in such applications because the parallel schedule is very data-dependent and cannot be computed at compile-time, as we argue in Section 3.

Optimistic parallelization [17] is a promising idea, but current implementations of optimistic parallelization such as thread-level speculation (TLS) [36, 43] cannot exploit the parallelism in these applications, as we discuss in Section 3.

In this paper, we describe the Galois approach to parallelizing irregular applications. This approach is informed by the following beliefs.

• Optimistic parallelization is the only plausible approach to parallelizing many, if not most, irregular applications.
• For effective optimistic parallelization, it is crucial to exploit the abstractions provided by object-oriented languages (in particular, the distinction between an abstract data type and its implementation).
• Concurrency should be packaged, when possible, within syntactic constructs that make it easy for the programmer to express what might be done in parallel, and for the compiler and runtime system to determine what should be done in parallel.
  The syntactic constructs used in Galois are very natural and can be added easily to any object-oriented programming language like Java. They are related to set iterators in SETL [19].
• Concurrent access to mutable shared objects by multiple threads is fundamental, and cannot be added to the system as an afterthought as is done in current approaches to optimistic parallelization. However, discipline needs to be imposed on concurrent accesses to shared objects to ensure correct execution.

Figure 1. A Delaunay mesh. Note that the circumcircle for each of the triangles does not contain other points in the mesh.

Figure 2. Fixing a bad element.

Mesh mesh = /* read in initial mesh */
WorkList wl;
wl.add(mesh.badTriangles());
while (wl.size() != 0) {
  Element e = wl.get();               // get bad triangle
  if (e no longer in mesh) continue;  // removed by an earlier fix
  Cavity c = new Cavity(e);
  c.expand();                         // collect all affected triangles
  c.retriangulate();
  mesh.update(c);
  wl.add(c.badTriangles());           // newly created bad triangles
}
Figure 3. Pseudocode of the mesh refinement algorithm
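The structure of the loop in Figure 3 can be tried out on a much simpler problem. The sketch below is a one-dimensional analogue written in Python, not the Delaunay code described in this paper: the "mesh" is a set of segments on a line, a segment is "bad" if it is longer than max_len, and fixing it inserts the midpoint. The function name refine and the parameter max_len are assumptions made for this illustration. As in Figure 3, the loop takes a bad element from the worklist, skips it if it is no longer in the mesh, replaces it, and adds any new bad elements to the worklist.

```python
from collections import deque


def refine(points, max_len):
    """1-D analogue of Figure 3: split every segment longer than max_len."""
    pts = sorted(points)
    mesh = set(zip(pts, pts[1:]))            # elements are segments (a, b)
    worklist = deque(s for s in sorted(mesh) if s[1] - s[0] > max_len)
    while worklist:
        a, b = worklist.popleft()            # get a bad element
        if (a, b) not in mesh:               # no longer in the mesh: skip it
            continue
        mid = (a + b) / 2.0                  # new point added to the mesh
        mesh.remove((a, b))                  # the "cavity" is the segment itself
        new = [(a, mid), (mid, b)]           # replacement elements
        mesh.update(new)
        # Replacement elements may themselves be bad.
        worklist.extend(s for s in new if s[1] - s[0] > max_len)
    return sorted(mesh)
```

For example, refine([0.0, 1.0, 4.0], 1.0) leaves the segment (0.0, 1.0) alone and keeps splitting (1.0, 4.0) until every piece is at most 1.0 long.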
We have implemented the Galois approach in C++ on two shared-memory platforms, and we have used this implementation to write a number of complex applications including Delaunay mesh refinement, agglomerative clustering, an image segmentation code that uses graph cuts [39], and an approximate SAT solver called WalkSAT [33].

This paper describes the Galois approach and its implementation, and presents performance results for some of these applications. It is organized as follows. In Section 2, we present Delaunay mesh refinement and agglomerative clustering, and describe opportunities for exploiting parallelism in these codes. In Section 3, we give an overview of existing parallelization techniques and argue that they cannot exploit the parallelism in these applications. In Section 4, we discuss the Galois programming model and run-time system. In Section 5, we evaluate the performance of our system on the two applications. Finally, in Section 6, we discuss conclusions and ongoing work.

2. Two Irregular Applications

To understand the nature of the parallelism in irregular programs, it is useless to study the execution traces of irregular programs, as most studies in this area do; instead it is necessary to recall Niklaus Wirth's aphorism program = algorithm + data structure [41], and examine the relevant algorithms and data structures. In this section, we describe two irregular applications: Delaunay mesh refinement [8], and agglomerative clustering [26] as used within a graphics application [39]. These applications perform refinement and coarsening respectively, which are arguably the two most common operations for bulk-modification of irregular data structures. For each application, we describe the algorithm and key data structures, and describe opportunities for exploiting parallelism.

2.1 Delaunay Mesh Refinement

Mesh generation is an important problem with applications in many areas such as the numerical solution of partial differential equations and graphics. The goal of mesh generation is to represent a surface or a volume as a tessellation composed of simple shapes like triangles, tetrahedra, etc.

Although many types of meshes are used in practice, Delaunay meshes are particularly important since they have a number of desirable mathematical properties [8]. The Delaunay triangulation for a set of points in the plane is the triangulation such that no point is inside the circumcircle of any triangle (this property is called the empty circle property). An example of such a mesh is shown in Figure 1.

In practice, the Delaunay property alone is not sufficient, and it is necessary to impose quality constraints governing the shape and size of the triangles. For a given Delaunay mesh, this is accomplished by iterative mesh refinement, which successively fixes "bad" triangles (triangles that do not satisfy the quality constraints) by adding new points to the mesh and re-triangulating. Figure 2 illustrates this process; the shaded triangle in Figure 2(a) is assumed to be bad. To fix this bad triangle, a new point is added at the circumcenter of this triangle. Adding this point may invalidate the empty circle property for some neighboring triangles, so all affected triangles are determined (this region is called the cavity of the bad triangle), and the cavity is re-triangulated, as shown in Figure 2(c) (in this figure, all triangles lie in the cavity of the shaded bad triangle). Re-triangulating a cavity may generate new bad triangles but it can be shown that this iterative refinement process will ultimately terminate and produce a guaranteed-quality mesh. Different orders of processing bad elements lead to different meshes, although all such meshes satisfy the quality constraints [8].

Figure 3 shows the pseudocode for mesh refinement. The input to this program is a Delaunay mesh in which some triangles may be bad, and the output is a refined mesh in which all triangles satisfy the quality constraints. There are two key data structures used in this algorithm. One is a worklist containing the bad triangles in the mesh. The other is a graph representing the mesh structure; each triangle in the mesh is represented as one node, and edges in the graph represent triangle adjacencies in the mesh.

Opportunities for Exploiting Parallelism. The natural unit of work for parallel execution is the processing of a bad triangle. Our measurements show that on the average, each unit of work takes about a million instructions of which about 10,000 are floating-point operations. Because a cavity is typically a small neighborhood of a bad triangle, two bad triangles that are far apart on the mesh may have cavities that do not overlap. Furthermore, the entire refinement process (expansion, retriangulation and graph updating) for the two triangles is completely independent; thus, the two triangles can be processed in parallel. This approach obviously extends to more than two triangles. If however the cavities of two triangles overlap, the triangles can be processed in either order but only one of them can be processed at a time. Whether or not two bad triangles have overlapping cavities depends entirely on the structure of the mesh, which changes throughout the execution of the algorithm.

How much parallelism is there in Delaunay mesh generation? The answer obviously depends on the mesh and on the order in which bad triangles are processed, and may be different at different points during the execution of the algorithm. One study by Antonopoulos et al. [2] on a mesh of one million triangles found that there were more than 256 cavities that could be expanded in parallel until almost the end of execution.

2.2 Agglomerative Clustering

Figure 4. Agglomerative clustering. (a) Data points (b) Hierarchical clusters (c) Dendrogram [points labeled a, b, c, d, e]

1: kdTree := new KDTree(points)
2: pq := new PriorityQueue()
3: foreach p in points {pq.add(<p,kdTree.nearest(p)>)}
4: while(pq.size() != 0) do {
5:   Pair <p,n> := pq.get(); //return closest pair
6:   if (p.isAlreadyClustered()) continue;
7:   if (n.isAlreadyClustered()) {
8:     pq.add(<p, kdTree.nearest(p)>);
9:     continue;
10:  }
11:  Cluster c := new Cluster(p,n);
12:  dendrogram.add(c);
13:  kdTree.remove(p);
14:  kdTree.remove(n);
15:  kdTree.add(c);
16:  Point m := kdTree.nearest(c);
17:  if (m != ptAtInfinity) pq.add(<c,m>);
18: }
Figure 5. Pseudocode for agglomerative clustering
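Figure 5 can be turned into a small runnable program. The following Python sketch keeps the structure of the pseudocode (a priority queue of <point, nearest neighbor> pairs, with stale pairs skipped or recomputed) but makes several simplifying assumptions that are not from the paper: a brute-force scan stands in for the kd-tree, a new cluster is placed at the midpoint of the two points it replaces, points are identified by integer ids, and None plays the role of the point at infinity. The names nearest and cluster are likewise hypothetical.

```python
import heapq
import math


def nearest(i, pos):
    """Return (distance, j) for the active point j closest to i, or None."""
    best = None
    for j, q in pos.items():
        if j != i:
            d = math.dist(pos[i], q)
            if best is None or (d, j) < best:
                best = (d, j)
    return best


def cluster(points):
    """Return the dendrogram as a list of merges (new_id, left_id, right_id)."""
    pos = dict(enumerate(points))          # active points: id -> location
    merges = []
    pq = []
    for i in pos:                          # Figure 5, line 3
        nn = nearest(i, pos)
        if nn:
            heapq.heappush(pq, (nn[0], i, nn[1]))
    next_id = len(points)
    while pq:                              # line 4
        _, p, n = heapq.heappop(pq)        # line 5: closest recorded pair
        if p not in pos:                   # line 6: p already clustered
            continue
        if n not in pos:                   # lines 7-10: n is stale, recompute
            nn = nearest(p, pos)
            if nn:
                heapq.heappush(pq, (nn[0], p, nn[1]))
            continue
        c, next_id = next_id, next_id + 1  # lines 11-15: replace p, n by c
        (px, py), (nx, ny) = pos.pop(p), pos.pop(n)
        pos[c] = ((px + nx) / 2, (py + ny) / 2)   # midpoint: an assumption
        merges.append((c, p, n))
        nn = nearest(c, pos)               # lines 16-17
        if nn:
            heapq.heappush(pq, (nn[0], c, nn[1]))
    return merges
```

Called as cluster([(0, 0), (1, 0), (10, 0), (11, 0), (5, 20)]), the first two merges recorded are (5, 0, 1) and (6, 2, 3); neither merge involves a point used by the other.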
The second problem is agglomerative clustering, a well-known data-mining algorithm [26]. This algorithm is used in graphics applications for handling large numbers of light sources [39].

The input to the clustering algorithm is (1) a data-set, and (2) a measure of the "distance" between items in the data-set. Intuitively, this measure is an estimate of similarity: the larger the distance between two data items, the less similar they are believed to be. The goal of clustering is to construct a binary tree called a dendrogram whose hierarchical structure exposes the similarity between items in the data-set. Figure 4(a) shows a data-set containing points in the plane, for which the measure of distance between data points is the usual Euclidean distance. The dendrogram for this data set is shown in Figures 4(b,c).

Agglomerative clustering can be performed by an iterative algorithm: at each step, the two closest points in the data-set are clustered together and replaced in the data-set by a single new point that represents the new cluster. The location of this new point may be determined heuristically [26]. The algorithm terminates when there is only one point left in the data-set.

Pseudocode for the algorithm is shown in Figure 5. The central data structure is a priority queue whose entries are ordered pairs of points <x,y>, such that y is the nearest neighbor of x (we call this nearest(x)). In each iteration of the while loop, the algorithm dequeues the top element of the priority queue to find a pair of points <p,n> that are closer to each other than any other pair of points, and clusters them. These two points are then replaced by a new point that represents this cluster. The nearest neighbor of this new point is determined, and the pair is entered into the priority queue. If there is only one point left, its nearest neighbor is the point at infinity.

To find the nearest neighbor of a point, we can scan the entire data-set at each step, but this is too inefficient. A better approach is to sort the points by location, and search within this sorted set to find nearest neighbors. If the points were all in a line, we could use a binary search tree. Since the points are in higher dimensions, a multi-dimensional analog called a kd-tree is used [3]. The kd-tree is built at the start of the algorithm, and it is updated by removing the points that are clustered, and then adding the new point representing the cluster, as shown in Figure 5.

Opportunities for Exploiting Parallelism. Since each iteration clusters the two closest points in the current data-set, it may seem that the algorithm is inherently sequential. In particular, an item <x,nearest(x)> inserted into the priority queue by iteration i at line 17 may be the same item that is dequeued by iteration (i+1) in line 5; this will happen if the points in the new pair are closer together than any other pair of points in the current data-set. On the other hand, if we consider the data-set in Figure 4(a), we see that points a and b, and points c and d can be clustered concurrently since neither cluster affects the other. Intuitively, if the dendrogram is a long and skinny tree, there may be few independent iterations, whereas if the dendrogram is a bushy tree, there is parallelism that can be exploited since the tree can be constructed bottom-up in parallel. As in the case of Delaunay mesh refinement, the parallelism is very data-dependent. In experiments on graphics scenes with 20,000 lights, we have found that on average about 100 clusters can be constructed concurrently; thus, there is substantial parallelism that can be exploited. For this application, each iteration of the while-loop in Figure 5 performs about 100,000 instructions of which roughly 4000 are floating-point operations.

3. Limitations of Current Approaches

Current approaches for parallelizing irregular applications can be divided into static, semi-static, and dynamic approaches.

Static Approaches. One approach to parallelization is to use a compiler to analyze and transform sequential programs into parallel ones, using techniques like points-to analysis [5] and shape analysis [31]. The weakness of this approach is that the parallel schedule produced by the compiler must be valid for all inputs to the program. As we have seen, parallelism in irregular applications can be very data-dependent, so compile-time parallelization techniques will serialize the entire execution. This conclusion holds even if dependence analysis is replaced with more sophisticated analysis techniques like commutativity analysis [10].

A Semi-static Approach. In the inspector-executor approach of Saltz et al. [27], the computation is split into two phases, an inspector phase that determines dependencies between units of work, and an executor phase that uses the resulting schedule to perform the computation in parallel. This approach is not useful for our applications since the data-sets, and therefore the dependences, change as the codes execute.

Dynamic Approaches. In dynamic approaches, parallelization is performed at runtime, and is known as speculative or optimistic parallelization. The program is executed in parallel assuming that dependences are not violated, but the system software or hardware detects dependence violations and takes appropriate corrective action such as killing off the offending portions of the program and re-executing them sequentially. If no dependence violations are detected by the end of the speculative computation, the results of the speculative computation are committed and become available to other computations.

Fine-grain speculative parallelization for exploiting instruction-level parallelism was introduced around 1970; for example, Tomasulo's IBM 360/91 fetched instructions speculatively from both sides of a branch before the branch target was resolved [37]. Speculative execution of instructions past branches was studied in the
abstract by Foster and Riseman in 1972 [7], and was made practical by Josh Fisher when he introduced the idea of using branch probabilities to guide speculation [11]. Branch speculation can expose instruction-level (fine-grain) parallelism in programs but not the data-dependent coarse-grain parallelism in applications like Delaunay mesh refinement.

One of the earliest implementations of coarse-grain optimistic parallel execution was in Jefferson's 1985 Time Warp system for distributed discrete-event simulation [17]. In 1999, Rauchwerger and Padua described the LRPD test for supporting speculative execution of FORTRAN DO-loops in which array subscripts were too complex to be disambiguated by dependence analysis [30]. This approach can be extended to while-loops if an upper bound on the number of loop iterations can be determined before the loop begins execution [29]. More recent work has provided hardware support for this kind of coarse-grain loop-level speculation, now known as thread-level speculation (TLS) [36, 43].

However, there are fundamental reasons why current TLS implementations cannot exploit the parallelism in our applications. One problem is that many of these applications, such as Delaunay mesh refinement, have unbounded while-loops, which are not supported by most current TLS implementations since they target FORTRAN-style DO-loops with fixed loop bounds. A more fundamental problem arises from the fact that current TLS implementations track dependences by monitoring the reads and writes made by loop iterations to memory locations. For example, if iteration i+1 writes to a location before it is read by iteration i, a dependence violation is reported, and iteration i+1 must be rolled back.

For irregular applications that manipulate pointer-based data structures, this is too strict and the program will perform poorly because of frequent roll-backs. To understand this, consider the worklist in Delaunay mesh generation. Regardless of how the worklist is implemented, there must be a memory location (call this location head) that points to a cell containing the next bad triangle to be handed out. The first iteration of the while loop removes a bad triangle from the worklist, so it reads and writes to head, but the result of this write is not committed until that iteration terminates successfully. A thread that attempts to start the second iteration concurrently with the execution of the first iteration will also attempt to read and write head, and since this happens before the updates from the first iteration have been committed, a dependence conflict will be reported (the precise point at which a dependence conflict will be reported depends on the TLS implementation). While this particular problem might be circumvented by inventing some ad hoc mechanism, it is unlikely that there is any such work-around for the far more complex priority queue manipulations in agglomerative clustering. The manipulations of the graph and kd-tree in these applications may also create such conflicts.

This is a fundamental problem: for many irregular applications, tracking dependences by monitoring reads and writes to memory locations is correct but will result in poor performance.

Finally, Herlihy and Moss have proposed to simplify shared-memory programming by eliminating lock-based synchronization constructs in favor of transactions [15]. There is growing interest in supporting transactions efficiently with software and hardware implementations of transactional memory [1, 12, 13, 21, 34]. Most of this work is concerned with optimistic synchronization and not optimistic parallelization; that is, their starting point is a program that has already been parallelized (for example, the SPLASH benchmarks [12] or the Linux kernel [28]), and the goal is to find an efficient way to synchronize parallel threads. In contrast, our goal is to find the right abstractions for expressing coarse-grain parallelism in irregular applications, and to support these abstractions efficiently; synchronization is one part of a bigger problem we are addressing in this paper. Furthermore, most implementations of transactional memory track reads and writes to memory locations, so they suffer from the same problems as current TLS implementations. Open nested transactions [22] have been proposed recently as a solution to this problem, and they are discussed in more detail in Section 4.

4. The Galois Approach

Perhaps the most important lesson from the past twenty-five years of parallel programming is that the complexity of parallel programming should be hidden from programmers as far as possible. For example, it is likely that more SQL programs are executed in parallel than programs in any other language. However, most SQL programmers do not write explicitly parallel code; instead they obtain parallelism by invoking parallel library implementations of joins and other relational operations. A "layered" approach of this sort is also used in dense linear algebra, another domain that has successfully mastered parallelism.

In this spirit, programs in the Galois approach consist of (i) a set of library classes and (ii) the top-level client code that creates and manipulates objects of these classes. For example, in Delaunay mesh refinement, the relevant objects are the mesh and worklist, and the client code implements the Delaunay mesh refinement algorithm discussed in Section 2. This client code is executed concurrently by some number of threads, but as we will see, it is not explicitly parallel and makes no mention of threads. Figure 6 is a pictorial view of this execution model.

Figure 6. High-level view of Galois execution model [diagram labels: Client Code, Galois Objects]

There are three main aspects to the Galois approach: (1) two syntactic constructs called optimistic iterators for packaging optimistic parallelism as iteration over sets (Section 4.1), (2) assertions about methods in class libraries (Section 4.2), and (3) a runtime scheme for detecting and recovering from potentially unsafe accesses to shared objects made by an optimistic computation (Section 4.3).

4.1 Optimistic iterators

As mentioned above, the client code is not explicitly parallel; instead parallelism is packaged into two constructs that we call optimistic iterators. In the compiler literature, it is standard to distinguish between do-all loops and do-across loops [20]. The iterations of a do-all loop can be executed in any order because the compiler or the programmer asserts that there are no dependences between iterations. In contrast, a do-across loop is one in which there may be dependences between iterations, so proper sequencing of iterations is essential. We introduce two analogous constructs for packaging optimistic parallelism.

• Set iterator: for each e in Set S do B(e)
  The loop body B(e) is executed for each element e of set S. Since set elements are not ordered, this construct asserts that in a serial execution of the loop, the iterations can be executed in any order. There may be dependences between the iterations, as in the case of Delaunay mesh generation, but any serial order of executing iterations is permitted. When an iteration executes, it may add elements to S.
• Ordered-set iterator: for each e in Poset S do B(e)
  This construct is an iterator over a partially-ordered set (Poset) S. It asserts that in a serial execution of the loop, the iterations must be performed in the order dictated by the ordering of elements in the Poset S. There may be dependences between iterations, and as in the case of the set iterator, elements may be added to S during execution.

The set iterator is a special case of the ordered-set iterator but it can be implemented more efficiently, as we see later in this section.

Figure 7 shows the client code for Delaunay mesh generation. Instead of a work list, this code uses a set and a set iterator. The Galois version is not only simpler but also makes evident the fact that the bad triangles can be processed in any order; this fact is absent from the more conventional code of Figure 3 since it implements a particular processing order. For lack of space, we do not show the Galois version of agglomerative clustering, but it uses the ordered-set iterator in the obvious way.

Mesh mesh = /* read in initial mesh */
Set wl;
wl.add(mesh.badTriangles());
for each e in wl do {
  if (e no longer in mesh) continue;
  Cavity c = new Cavity(e);
  c.expand();
  c.retriangulate();
  mesh.update(c);
  wl.add(c.badTriangles());
}
Figure 7. Delaunay mesh refinement using set iterator

4.1.1 Concurrent Execution Model

Although the semantics of Galois iterators can be specified without appealing to a parallel execution model, these iterators provide hints from the programmer to the Galois runtime system that it may be profitable to execute the iterations in parallel; of course any parallel execution must be faithful to the sequential semantics.

The Galois concurrent execution model is the following. A master thread begins the execution of the program; it also executes the code outside iterators. When this master thread encounters an iterator, it enlists the assistance of some number of worker threads to execute iterations concurrently with itself. The assignment of iterations to threads is under the control of a scheduling policy implemented by the runtime system; for now, we assume that this assignment is done dynamically to ensure load-balancing. All threads are synchronized using barrier synchronization at the end of the iterator.

In our applications, we have not found it necessary to use nested iterators. There is no fundamental problem in supporting nested parallelism, but our current implementation does not support it; if a thread encounters an inner iterator, it executes the entire inner iterator sequentially.

Given this execution model, the main technical problem is to ensure that the parallel execution respects the sequential semantics of the iterators. This is a non-trivial problem because each iteration may read and write to the objects in shared memory, and we must ensure that these reads and writes are properly coordinated. Section 4.2 describes the information that must be specified by the Galois class writer to enable this. Section 4.3 describes how the Galois runtime system uses this information to ensure that the sequential semantics of iterators are respected.

4.2 Writing Galois Classes

To ensure that the sequential semantics of iterators are respected, there are two problems that must be solved, which we explain with reference to Figure 8. This figure shows set objects with methods add(x), remove(x), get() and contains?(x) that have the usual semantics.¹

¹ The method remove(x) removes a specific element from the set while get() returns an arbitrary element from the set, removing it from the set.

Figure 8. Interleaving method invocations from different iterations. (a) Set S: S.add(x), S.contains?(x), S.remove(x). (b) Workset ws: ws.get(), ws.get(), ws.add(x), ws.add(y).

The first problem is the usual one of concurrency control (also known in the database literature as ensuring consistency). If a method invocation from one iteration is performed concurrently with an invocation from another iteration, we must ensure that the two invocations do not step on each other. One solution is to use a lock on object S; if this inhibits concurrency, we can use fine-grain locks within object S. These locks are acquired before the method is invoked and released when the method completes.

However, this is not enough to ensure that the sequential semantics of the iterators are respected. Consider Figure 8(a). If S does not contain x before the iterations start, notice that in any sequential execution of the iterations, the method invocation contains?(x) will return false. However, for one possible interleaving of operations (add(x), contains?(x), remove(x)) the invocation contains?(x) will return true, which is incorrect. This is the problem of ensuring isolation of the iterations.

One solution for both problems is for an iteration to release its locks only at the end of the iteration: the well-known two-phase locking algorithm used in databases is an optimized version of this simple idea. Transactional memory implementations accomplish the same goal by tracking the read and write sets of each iteration instead of locking them.

While this solves the problem in Figure 8(a), it is not adequate for our applications. The program in Figure 8(b) is motivated by Delaunay mesh generation: each iteration gets a bad triangle at the beginning of the iteration, and may add some bad triangles to the work-set at the end. Regardless of how the set object is implemented, there must be a location (call it head) that points to a cell containing the next triangle to be handed out. The first iteration to get work will read and write location head, and it will lock it for the duration of the iteration, preventing any other iterations from getting work. Most current implementations of transactional memory will suffer from the same problem since the head location will be in the read and write sets of the first iteration for the duration of that iteration.

The crux of the problem is that the abstract set operations have useful semantics that are not available to an implementation that works directly on the representation of the set and tracks reads and writes to individual memory locations. The problem therefore is to understand the semantics of set operations that must be exploited to permit parallel execution in our irregular applications, and to specify these semantics in some concise way.

4.2.1 Semantic Commutativity

The solution we have adopted exploits the commutativity of method invocations. Intuitively, it is obvious that the method invocations to a given object from two iterations can be interleaved without losing isolation provided that these method invocations commute, since this ensures that the final result is consistent with
some serial order of iteration execution. In Figure 8(a), the invocation contains?(x) does not commute with the operations from the other thread, so the invocations from the two iterations must not be interleaved. In Figure 8(b), (1) get operations commute with each other, and (2) a get operation commutes with an add operation provided that the operand of add is not the element returned by get. This allows multiple threads to pull work from the work-set while ensuring that sequential semantics of iterators are respected.

It is important to note that what is relevant for our purpose is commutativity in the semantic sense. The internal state of the object may actually be different for different orders of method invocations even if these invocations commute in the semantic sense. For example, if the set is implemented using a linked list and two elements are added to this set, the concrete state of the linked list will depend in general on the order in which these elements were added to the list. However, what is relevant for parallelization is that the state of the set abstract data type, which is being implemented by the linked list, is the same for both orders. In other words, we are not concerned with concrete commutativity (that is, commutativity with respect to the implementation type of the class), but with semantic commutativity (that is, commutativity with respect to the abstract data type of the class). We also note that commutativity of method invocations may depend on the arguments of those invocations. For example, an add and a remove commute only if their arguments are different.

4.2.2 Inverse Methods

Because iterations are executed in parallel, it is possible for commutativity conflicts to prevent an iteration from completing. Once a conflict is detected, some recovery mechanism must be invoked to allow execution of the program to continue despite the conflict.

class Set {
  // interface methods
  void add(Element x);
    [calls] _add(x) : void
    [commutes]
      - add(y) {y != x}
      - remove(y) {y != x}
      - contains(y) {y != x}
      - get() : y {y != x} //get call that returns y
    [inverse] _remove(x)
  void remove(Element x);
    [calls] _remove(x) : void
    [commutes]
      - add(y) {y != x}
      - remove(y) {y != x}
      - contains(y) {y != x}
      - get() : y {y != x}
    [inverse] _add(x)
  bool contains(Element x);
    [calls] _contains(x) : bool b
    [commutes]
      - add(y) {y != x}
      - remove(y) {y != x}
      - get() : y {y != x}
      - contains(*) //any call to contains
  Element get();
    [calls] _get() : Element x
    [commutes]
      - add(y) {y != x}
      - remove(y) {y != x}
      - contains(y) {y != x}
      - get() : y {y != x}
    [inverse] _add(x)
  //internal methods
  void _add(Element x);
  void _remove(Element x);
  bool _contains(Element x);
  Element _get();
}
Becauseourexecutionmodelusestheparadigmofoptimisticpar-
allelism, our recovery mechanism rolls back the execution of the Figure9. ExampleGaloisclassforaSet
conflictingiteration.Toavoidlivelock,thelowerpriorityiteration
isrolledbackinthecaseoftheordered-setiterator. ditions. For example, remove(x) commutes with add(y),
Topermitthis,everymethodofasharedobjectthatmaymod- aslongastheyelementsaredifferent.
ifythestateofthatobjectmusthaveanassociatedinversemethod • inverse:Thissectionspecifiestheinverseofthecurrentmethod.
thatundoestheside-effectsofthatmethodinvocation.Forexam-
ple,foraset,theinverseofadd(x)isremove(x),andtheinverseof ThedescriptionoftheGaloissysteminthissectionimplicitly
remove(x)isadd(x).Asinthecaseofcommutativity,whatisrele- assumedthatallcallstoparallelobjectsaremadefromclientcode.
vantforourpurposeisaninverseinthesemanticsense;invokinga However,tofacilitatecomposition,wealsoallowparallelobjectsto
methodanditsinverseinsuccessionmaynotrestoretheconcrete invokemethodsonotherobjects.Thisishandledthroughasimple
datastructuretowhatitwas. flattening approach. The iteration object is passed to the “child”
Notethatwhenaniterationrollsback,allofthemethodswhich invocation and hence all operations done in the child invocation
itinvokesduringroll-backmustsucceed.Thus,wemustneveren- areappendedtotheundologoftheiteration.Similarly,thechild
counterconflictswheninvokinginversemethods.WhentheGalois invocationfunctionsasanextensionoftheoriginalmethodwhen
system checks commutativity, it also checks commutativity with detectingcommutativityconflicts.Nochangesneedtobemadeto
theassociatedinversemethod. theGaloisrun-timetosupportthisformofcomposition.
The class implementor must also ensure that each internal
4.2.3 PuttingitAllTogether method invocation is atomic to ensure consistency. This can be
doneusinganytechniquedesired,includinglocksortransactional
Sinceweareinterestedinsemanticcommutativityandundo,itis
memory. Recall that whatever locks are acquired during method
necessary for the class designer to specify this information. Fig-
invocation (or memory locations placed in read/write sets during
ure 9 illustrates how this information is specified in Galois for a
transactional execution) are released as soon as the method com-
classthatimplementssets.Theinterfacespecifiestwoversionsof
pletes, rather than being held throughout the execution of the it-
eachmethod:theinternalmethodsontheobject,andtheinterface
eration,sincewerelyoncommutativityinformationtoguarantee
methods,calledfromwithiniterators,thatperformthecommuta-
isolation.Inourcurrentimplementation,theinternalmethodsare
tivitychecks,maintaintheundoinformationandtriggerrollbacks
madeatomicthroughtheuseoflocks.
whencommutativityconflictsaredetected.
Thespecificationforaninterfacemethodconsistsofthreemain
4.2.4 Asmallexample
sections(withpseudo-coderepresentingtheseinthefigure):
Consideraprogramwrittenusingasinglesharedobject,aninteger
• calls: This section ties the interface method to the internal accumulator.Theobjectsupportstwooperations:accumulateand
method(s)itinvokes. read,withtheobvioussemantics.Itisclearthataccumulatescom-
• commutes: This section specifies which other interface meth- mutewithotheraccumulates,andreadscommutewithotherreads,
odsthecurrentmethodcommuteswith,andunderwhichcon- butthataccumulatedoesnotcommutewithread.Themethodsare
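As a minimal sketch of these conditions (written in Python, which is not the language of the Galois implementation), the commutativity relation of the accumulator can be expressed as a predicate over pairs of invocations. The names Invocation, commutes, inverse and Accumulator.apply are hypothetical, and treating accumulate(-v) as the inverse of accumulate(v) is an assumption; the text does not spell out the accumulator's inverse method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Invocation:
    """One method call on the shared accumulator (hypothetical record type)."""
    method: str                 # "accumulate" or "read"
    arg: Optional[int] = None   # operand of accumulate; unused for read


def commutes(a: Invocation, b: Invocation) -> bool:
    # accumulate/accumulate and read/read commute; accumulate/read does not.
    return a.method == b.method


def inverse(inv: Invocation) -> Optional[Invocation]:
    # Assumption: accumulate(v) is undone by accumulate(-v); read needs no undo.
    if inv.method == "accumulate":
        return Invocation("accumulate", -inv.arg)
    return None


class Accumulator:
    def __init__(self) -> None:
        self.total = 0

    def apply(self, inv: Invocation) -> Optional[int]:
        if inv.method == "accumulate":
            self.total += inv.arg
            return None
        return self.total       # read


acc5 = Invocation("accumulate", 5)
acc7 = Invocation("accumulate", 7)
rd = Invocation("read")
print(commutes(acc5, acc7), commutes(rd, rd), commutes(acc5, rd))
# prints: True True False
```

The predicate depends only on the abstract operations, not on how the integer is stored, which is the sense of semantic commutativity used in Section 4.2.1.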
The methods are made atomic with a single lock which is acquired at the beginning of the method and released at the end.

    Iteration A            Iteration B            Iteration C
    {                      {                      {
      ...                    ...                    ...
      a.accumulate(5)        a.accumulate(7)        a.read()
      ...                    ...                    ...
    }                      }                      }

Figure 10. Example accumulator code

There are three iterations executing concurrently, as seen in Figure 10. The progress of the execution is as follows:

• Iteration A calls accumulate, acquiring the lock, updating the accumulator and then releasing the lock and continuing.
• Iteration B calls accumulate. Because accumulates commute, B can successfully make the call, acquiring the lock, updating the accumulator and releasing it. Note that A has already released the lock on the accumulator, thus allowing B to make forward progress without blocking on the accumulator's lock.
• When iteration C attempts to execute read, it sees that it cannot, as read does not commute with the already executed accumulates. Thus, C must roll back and try again. Note that this is not enforced by the lock on the accumulator, but instead by the commutativity conditions on the accumulator.
• When iterations A and B commit, C can then successfully call read and continue execution.

In [38], von Praun et al discuss the use of ordered transactions in parallelizing FORTRAN-style DO-loops, and they give special treatment to reductions in such loops to avoid spurious conflicts. Reductions do not require any special treatment in the Galois approach since the programmer could just use an object like the accumulator to implement reduction.

4.3 Runtime System

The Galois runtime system has two components: (1) a global structure called the commit pool that is responsible for creating, aborting, and committing iterations, and (2) per-object structures called conflict logs which detect when commutativity conditions are violated.

At a high level, the runtime system works as follows. The commit pool maintains an iteration record, shown in Figure 11, for each ongoing iteration in the system. The status of an iteration can be RUNNING, RTC (ready-to-commit) or ABORTED. Threads go to the commit pool to obtain an iteration. The commit pool creates a new iteration record, obtains the next element from the iterator, assigns a priority to the iteration record based on the priority of the element (for a set iterator, all elements have the same priority), and sets the status field of the iteration record to RUNNING. When an iteration invokes a method of a shared object, (i) the conflict log of that object and the local log of the iteration record are updated, as described in more detail below, and (ii) a callback to the associated inverse method is pushed onto the undo log of the iteration record. If a commutativity conflict is detected, the commit pool arbitrates between the conflicting iterations, and aborts iterations to permit the highest priority iteration to continue execution. Callbacks in the undo logs of aborted iterations are executed to undo their effects on shared objects. Once a thread has completed an iteration, the status field of that iteration is changed to RTC, and the thread is allowed to begin a new iteration. When the completed iteration has the highest priority in the system, it is allowed to commit. It can be seen that the role of the commit pool is similar to that of a reorder buffer in out-of-order processors [14].

    IterationRecord {
      Status status;
      Priority p;
      UndoLog ul;
      list<LocalConflictLog> local_log;
      Lock l;
    }

Figure 11. Iteration record maintained by runtime system

4.3.1 Conflict Logs

The conflict log is the mechanism for detecting commutativity conflicts. There is one conflict log associated with each shared object. A simple implementation for the conflict log of an object is a list containing the method signatures (including the values of the input and output parameters) of all invocations on that object made by currently executing iterations (called “outstanding invocations”). When iteration i attempts to call a method m1 on an object, the method signature is compared against all the outstanding invocations in the conflict log. If one of the entries in the log does not commute with m1, then a commutativity conflict is detected, and an arbitration process is begun to determine which iterations should be aborted, as described below. If m1 commutes with all the entries in the log, the signature of m1 is appended to the log. When i either aborts or commits, all the entries in the conflict log inserted by i are removed from the conflict log.

This model for conflict logs, while simple, is not efficient since it requires a full scan of the conflict log whenever an iteration calls a method on the associated object. In our actual implementation, conflict logs consist of separate conflict sets for each method in the class. Now when i calls m1, only the conflict sets for methods which m1 may conflict with are checked; the rest are ignored.

There are two optimizations that we have implemented for conflict logs.

First, each iteration caches its own portion of the conflict logs in a private log called its local log. This local log stores a record of all the methods the iteration has successfully invoked on the object. When an iteration makes a call, it first checks its local log. If this local log indicates that the invocation will succeed (either because that same method has been called before or other methods, whose commutativity implies that the current method also commutes, have been called before²), the iteration does not need to check the object's conflict log.

A second optimization is that not all objects have conflict logs associated with them. For example, the triangles contained in the mesh do not; their information is managed by the conflict log in the mesh. If this optimization is used, care must be taken that modifications to the triangle are only made through the mesh interface. In general, program analysis is required to ensure that this optimization is safe.

4.3.2 Commit Pool

When an iteration attempts to commit, the commit pool checks two things: (i) that the iteration is at the head of the commit queue, and (ii) that the priority of the iteration is higher than all the elements left in the set/poSet being iterated over³. If both conditions are met, the iteration can successfully commit. If the conditions are not met, the iteration must wait until it has the highest priority in the system; its status is set to RTC, and the thread is allowed to begin another iteration.

² For example, if an iteration has already successfully invoked add(x), then contains(x) will clearly commute with method invocations made by other ongoing iterations.
³ This is to guard against a situation where an earlier committed iteration adds a new element with high priority to the collection which has not yet been consumed by the iterator.
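The simple list-based conflict log of Section 4.3.1 can be sketched as follows, replaying the accumulator example of Section 4.2.4. This is an illustration in Python under stated assumptions, not the actual C++ runtime: ConflictLog, try_invoke and release are hypothetical names, the commutativity predicate is supplied by the caller, and entries logged by the calling iteration itself are assumed never to conflict with its own later calls.

```python
class ConflictLog:
    """List-based conflict log for one shared object (simple variant)."""

    def __init__(self, commutes):
        self.commutes = commutes      # predicate over two method signatures
        self.outstanding = []         # (iteration id, signature) pairs

    def try_invoke(self, iteration, signature):
        """Log the call and return True if it commutes with every outstanding
        invocation of the other iterations; return False on a conflict."""
        for owner, logged in self.outstanding:
            # Assumption: an iteration never conflicts with its own entries.
            if owner != iteration and not self.commutes(signature, logged):
                return False          # the commit pool would now arbitrate
        self.outstanding.append((iteration, signature))
        return True

    def release(self, iteration):
        """Drop all entries of an iteration when it commits or aborts."""
        self.outstanding = [e for e in self.outstanding if e[0] != iteration]


def accumulator_commutes(m1, m2):
    # Signatures are (method name, argument) tuples; equal method names commute.
    return m1[0] == m2[0]


log = ConflictLog(accumulator_commutes)
print(log.try_invoke("A", ("accumulate", 5)))   # True
print(log.try_invoke("B", ("accumulate", 7)))   # True: accumulates commute
print(log.try_invoke("C", ("read", None)))      # False: C must roll back
log.release("A")
log.release("B")                                # A and B commit
print(log.try_invoke("C", ("read", None)))      # True: C retries successfully
```

Every call scans the whole list, which is the inefficiency that the per-method conflict sets and the local log are meant to avoid.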
When an iteration successfully commits, the thread that was running that iteration also checks the commit queue to see if more iterations in the RTC state can be committed. If so, it commits those iterations before beginning the execution of a new iteration. When an iteration has to be aborted, the status of its record is changed to ABORTED, but the commit pool takes no further action. Such iteration objects are lazily removed from the commit queue when they reach the head.

Conflict arbitration. The other responsibility of the commit pool is to arbitrate conflicts between iterations. When iterating over an unordered set, the choice of which iteration to roll back in the event of a conflict is irrelevant. For simplicity, we always choose the iteration which detected the conflict. However, when iterating over an ordered set, the lower priority iteration must be rolled back while the higher priority iteration must continue. Without doing so, there exists the possibility of deadlock.

Thus, when iteration i1 calls a method on a shared object and a conflict is detected with iteration i2, the commit pool arbitrates based on the priorities of the two iterations. If i1 has lower priority, it simply performs the standard rollback operations. The thread which was executing i1 then begins a new iteration.

This situation is complicated when i2 is the iteration that must be rolled back. Because the Galois run-time system functions purely at the user level, there is no simple way to abort an iteration running on another thread. To address this problem, each iteration record has an iteration lock as shown in Figure 11. When invoking methods on shared objects, each iteration must own the iteration lock in its record. Thus, the thread running i1 does the following:

1. It attempts to obtain i2's iteration lock. By doing so, it ensures that i2 is not modifying any shared state.
2. It aborts i2 by executing i2's undo log and clearing the various conflict logs of i2's invocations. Note that the control flow of the thread executing i2 does not change; that thread continues as if no rollback is occurring.
3. It sets the status of i2 to ABORTED.
4. It then resumes its execution of i1, which can now proceed as the conflict has been resolved.

On the other side of this arbitration process, the thread executing i2 will realize that i2 has been aborted when it attempts to invoke another method on a shared object (or attempts to commit). At this point, the thread will see that i2's status is ABORTED and will cease execution of i2 and begin a new iteration.

When an iteration has to be aborted, the callbacks in its undo log are executed in LIFO order. Because the undo log must persist until an iteration commits, we must ensure that all the arguments used by the callbacks remain valid until the iteration commits. If the arguments are pass-by-value, there is no problem; they are copied when the callback is created. A more complex situation is when arguments are pass-by-reference or pointers. The first problem is that the underlying data which the reference or pointer points to may be changed during the course of execution. Thus, the callback may be called with inappropriate arguments. However, as long as all changes to the underlying data also occur through Galois interfaces, the LIFO nature of the undo log ensures that they will be rolled back as necessary before the callback uses them. The second problem occurs when an iteration attempts to free a pointer, as there is no simple way to undo a call to free. The Galois run-time avoids this problem by delaying all calls to free until an iteration commits. This does not affect the semantics of the iteration, and avoids the problem of rolling back memory deallocation.

4.4 Discussion

Set iterators: Although the Galois set iterators introduced in Section 4.1 were motivated in this paper by the two applications discussed in Section 2, they are very general, and we have found them to be useful for writing other irregular applications such as advancing front mesh generators [23], and WalkSAT solvers [33]. Many of these applications use “work-list”-style algorithms, for which Galois iterators are natural, and the Galois approach allows us to exploit data-parallelism in these irregular applications.

SETL was probably the first language to introduce an unordered set iterator [19], but this construct differs from its Galois counterpart in important ways. In SETL, the set being iterated over can be modified during the execution of the iterator, but these modifications do not take effect until the execution of the entire iterator is complete. In our experience, this is too limiting because work-list algorithms usually involve data-structure traversals of some kind in which new work is discovered during the traversal. The tuple iterator in SETL is similar to the Galois ordered-set iterator, but the tuple cannot be modified during the execution of the iterator, which limits its usefulness in irregular applications. Finally, SETL was a sequential programming language. DO-loops in FORTRAN are a special case of the Galois ordered-set iterator in which iteration is performed over integers in some interval.

A more complete design than ours would include iterators over multisets and maps, which are easy to add to Galois. MATLAB or FORTRAN-90-style notation like [low:step:high] for specifying ordered and unordered integers within intervals would be useful. We believe it is also advisable to distinguish syntactically between DO-ALL loops and unordered-set iterators over integer ranges, since in the former case, the programmer can assert that run-time dependence checks are unnecessary, enabling more efficient execution. For example, in the standard i-j-k loop nest for matrix-multiplication, the i and j loops are not only Galois-style unordered-set iterators over integer intervals but they are even DO-ALL loops; the k loop is an ordered-set iterator if the accumulations to elements of the C matrix must be done in order.

Semantic commutativity: Without commutativity information, an object can be accessed by at most one iteration at a time, and that iteration shuts out other iterations until it commits. In this case, inverse methods can be implemented automatically by data copying as is done in software transactional memories.

In the applications we have looked at, most shared objects are instances of collections, which are variations of sets, so specifying commutativity information and writing inverse methods has been straightforward. For example, the kd-tree is just a set with an additional method for finding the nearest neighbor of an element in the set. Note that the design of Galois makes it easy to replace simple data structures with clever, hand-tuned concurrent data structures [32] if necessary, without changing the rest of the program.

The use of commutativity in parallel program execution was explored by Bernstein as far back as 1966 [4]. In the context of concurrent database systems, Weihl described a theoretical framework for using commutativity conditions for concurrency control [40]. Herlihy and Weihl extended this work by leveraging ordering constraints to increase concurrency but at the cost of more complex rollback schemes [16].

In the context of parallel programming, Steele described a system for exploiting commuting operations on memory locations in optimistic parallel execution [18]. However, commutativity is still tied to concrete memory locations and does not exploit properties of abstract data types like Galois does. Diniz and Rinard performed static analysis to determine concrete commutativity of methods for use in compile-time parallelization [10]. Semantic commutativity, as used in Galois, is more general but it must be specified by the class designer. Wu and Padua have proposed to use high level semantics of container classes [42]. They propose making a compiler aware of properties of abstract data types such as stacks and sets to permit more accurate dependence analysis.

Recently, Ni et al [24] have proposed to extend conventional transactional memory with a notion of “abstract locking” to introduce the notion of semantic conflicts. Carlstrom et al have taken a similar approach to the Java collections classes [6]. Semantic commutativity provides another way of specifying open nesting. More experience is needed before the relative advantages of the two approaches become clear, but we believe that semantic commutativity is an easier notion for programmers to understand.
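To make concrete why specifying commutativity and inverse methods for set-like collections is straightforward, the following Python sketch encodes the add/remove/contains conditions of Figure 9 together with an undo log that is replayed in LIFO order. It is an illustration only: the names commutes, Iteration and UndoableSet are hypothetical, get() is omitted, conflict detection and locking are left out, and logging an inverse only when a call actually changes the set is an assumption that Figure 9 does not state.

```python
def commutes(m1, x, m2, y):
    """Figure 9 conditions for add, remove and contains (get() is omitted)."""
    if m1 == "contains" and m2 == "contains":
        return True              # contains commutes with any call to contains
    return x != y                # every other pair needs distinct elements


class Iteration:
    """Holds the undo log of one iteration."""

    def __init__(self):
        self.undo_log = []       # callbacks to inverse methods

    def rollback(self):
        while self.undo_log:
            self.undo_log.pop()()    # LIFO: the latest inverse runs first

    def commit(self):
        self.undo_log.clear()


class UndoableSet:
    def __init__(self, items=()):
        self._items = set(items)

    def add(self, x, it):
        if x not in self._items:             # assumption: log only real changes
            self._items.add(x)
            it.undo_log.append(lambda: self._items.discard(x))   # inverse: remove(x)

    def remove(self, x, it):
        if x in self._items:
            self._items.remove(x)
            it.undo_log.append(lambda: self._items.add(x))       # inverse: add(x)

    def contains(self, x):
        return x in self._items

    def snapshot(self):
        return frozenset(self._items)


s = UndoableSet({1, 2})
it = Iteration()
s.remove(1, it)
s.add(3, it)
s.add(1, it)
it.rollback()
print(sorted(s.snapshot()))      # prints: [1, 2]
```

Replaying the same three callbacks in FIFO order would leave the set as {2}, which is why the order of the undo log matters even for a structure this simple.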
5. Evaluation

We have implemented the Galois system in C++ on two Linux platforms: (i) a 4 processor, 1.5 GHz Itanium 2, with 16KB of L1, 256KB of L2 and 3MB of L3 cache per processor, and (ii) a dual processor dual-core 3.0 GHz Xeon system, with 32KB of L1 per core and 4MB of L2 cache per processor. The threading library on both platforms was pthreads.

5.1 Delaunay Mesh Refinement

We first wrote a sequential Delaunay mesh refinement program without locks, threads etc. to serve as a reference implementation. We then implemented a Galois version (which we call meshgen), and a fine-grain locking version (FGL) that uses locks on individual triangles. The Galois version uses the set iterator, and the runtime system described in Section 4.3. In all three implementations, the mesh was represented by a graph that was implemented as a set of triangles, where each triangle maintained a set of its neighbors. This is essentially the same as the standard adjacency list representation of graphs. For meshgen, code for commutativity checks was added by hand to this graph class; ultimately, we would like to generate this code automatically from high level commutativity specifications like those in Figure 9. We used an STL queue to implement the workset. We refer to these default implementations of meshgen and FGL as meshgen(d) and FGL(d).

To understand the effect of scheduling policy on performance, we implemented two more versions, FGL(r) and meshgen(r), in which the work-set was implemented by a data structure that returned a random element of the current set.

The input data was generated automatically using Jonathan Shewchuk's Triangle program [35]. It had 10,156 triangles and boundary segments, of which 4,837 triangles were bad.

(a) Execution times [plot: execution time (s) against # of processors (1 to 4) for reference, FGL (d), FGL (r), meshgen (d) and meshgen (r)]

(b) Self-relative Speed-ups [plot: speedup against # of processors (1 to 4) for the same five implementations]

                          Committed                  Aborted
    # of proc.          Max     Min     Avg      Max     Min     Avg
    1                 21918   21918   21918      n/a     n/a     n/a
    4 (meshgen(d))    22128   21458   21736    28929   27711   28290
    4 (meshgen(r))    22101   21738   21909      265     151     188

(c) Committed and aborted iterations for meshgen

    Instruction Type   reference   meshgen(r)
    Branch                 38047        70741
    FP                      9946        10865
    LD/ST                  90064       165746
    Int                   304449       532884
    Total                 442506       780236

(d) Instructions per iteration on a single processor

(e) Breakdown of instructions and cycles in meshgen [bar charts for 1 proc, 4 proc (d) and 4 proc (r); bar labels: instructions (billions) 16.8889, 25.6734, 17.4675; cycles (billions) 13.8951, 25.7625, 18.8501]
(f) Breakdown of Galois overhead [pie chart: Commutativity 77%, Commit 10%, Abort 10%, Scheduler 3%]

    # of procs   Client   Object   Runtime    Total
    1             1.177   0.6208    0.6884    2.487
    4             2.769   3.600     4.282    10.651

(g) L3 misses (in millions) for meshgen(r)

Figure 12. Mesh refinement results: 4-processor Itanium

Execution times and speed-ups. Execution times and self-relative speed-ups for the five implementations on the Itanium machine are shown in Figure 12(a,b). The reference version is the fastest on a single processor. On 4 processors, FGL(d) and FGL(r) differ only slightly in performance. Meshgen(r) performed almost as well as FGL, although surprisingly, meshgen(d) was twice as slow as FGL.

Statistics on committed and aborted iterations. To understand these issues better, we determined the total number of committed and aborted iterations for different versions of meshgen, as shown in Figure 12(c). On 1 processor, meshgen executed and committed 21,918 iterations. Because of the inherent non-determinism of the set iterator, the number of iterations executed by meshgen in parallel varies from run to run (the same effect will be seen on one processor if the scheduling policy is varied). Therefore, we ran the codes a large number of times, and determined a distribution for the numbers of committed and aborted iterations. Figure 12(c) shows that on 4 processors, meshgen(d) committed roughly the same number of iterations as it did on 1 processor, but also aborted almost as many iterations due to cavity conflicts. The abort ratio for meshgen(r) is much lower because the scheduling policy reduces the likelihood of conflicts between processors. This accounts for the performance difference between meshgen(d) and meshgen(r). Because the FGL code is carefully tuned by hand, the cost of an aborted iteration is substantially less than the corresponding cost in meshgen, so FGL(r) performs only a little better than FGL(d).
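As a quick check on these statistics, the share of executed iterations that end up aborted can be computed from the averages in Figure 12(c). This sketch assumes that the average committed and aborted counts can be combined as if they came from the same runs, which the text does not state; the function name abort_fraction is hypothetical.

```python
# Average iteration counts on 4 processors, taken from Figure 12(c).
runs = {
    "meshgen(d)": {"committed": 21736, "aborted": 28290},
    "meshgen(r)": {"committed": 21909, "aborted": 188},
}


def abort_fraction(committed, aborted):
    """Fraction of all executed iterations that were rolled back."""
    return aborted / (committed + aborted)


fractions = {name: abort_fraction(**counts) for name, counts in runs.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of executed iterations aborted")
```

Under that assumption, somewhat more than half of the iterations started by meshgen(d) are rolled back, against fewer than one in a hundred for meshgen(r).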
(a) Execution times [plot: execution time (s) against # of processors (1 to 4) for reference and treebuild]

(b) Self-relative speed-ups [plot: speedup against # of processors (1 to 4) for reference and treebuild]

It seems counterintuitive that a randomized scheduling policy could be beneficial, but a detailed investigation into the source of cavity conflicts showed that the problem could be attributed to our use of an STL queue to implement the workset. When a bad triangle is refined by the algorithm, a cluster of smaller bad triangles may be created within the cavity. In the queue data structure, these new bad triangles are adjacent to each other, so it is likely that they will be scheduled together for refinement on different processors, leading to cavity conflicts.

One conclusion from these experiments is that domain knowledge is invaluable for implementing a good scheduling policy.
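A work-set whose get() returns a random element, in the spirit of the structure used for meshgen(r), might look like the sketch below. The paper does not describe the actual data structure, so the class name RandomWorkSet, the swap-with-last removal and the seeded generator are all assumptions made for illustration, and the element names are placeholders.

```python
import random


class RandomWorkSet:
    """Work-set whose get() removes and returns a uniformly random element."""

    def __init__(self, items=(), seed=None):
        self._items = list(items)
        self._rng = random.Random(seed)   # a seed only makes runs repeatable

    def add(self, x):
        self._items.append(x)

    def get(self):
        # Swap a random slot with the last one, then pop: constant-time removal.
        # Raises ValueError if the work-set is empty.
        i = self._rng.randrange(len(self._items))
        self._items[i], self._items[-1] = self._items[-1], self._items[i]
        return self._items.pop()

    def __len__(self):
        return len(self._items)


ws = RandomWorkSet(range(8), seed=0)
ws.add("new-bad-triangle-1")      # work discovered during refinement
ws.add("new-bad-triangle-2")
order = [ws.get() for _ in range(len(ws))]
```

Unlike a FIFO queue, elements that were added back to back are not necessarily handed out back to back, which is the property that lowers the chance of neighbouring bad triangles being refined at the same time.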
(c) Commit pool occupancy by RTC iterations [histogram: frequency (0 to 30000) against RTC occupancy in commit pool (0 to 20)]

                      Committed                  Aborted
    # of proc.      Max     Min     Avg      Max     Min     Avg
    1             57846   57846   57846      n/a     n/a     n/a
    4             57870   57849   57861     3128    1887    2528

(d) Committed and aborted iterations in treebuild

    Instruction Type   reference   treebuild
    Branch                  7162       18187
    FP                      3601        3640
    LD/ST                  22519       48025
    Int                    70829      146716
    Total                 104111      216568

(e) Instructions per iteration on a single processor.

Instructions and cycles breakdown. Figure 12(d) shows the breakdown of different types of instructions executed by the reference and meshgen versions of Delaunay mesh refinement when they are run on one processor. The numbers shown are per iteration; in sequential execution, there are no aborts, so these numbers give a profile of a “typical” iteration in the two codes. Each iteration of meshgen performs roughly 10,000 floating-point operations and executes almost a million instructions. These are relatively long-running computations.

Meshgen executes almost 80% more instructions than the reference version. To understand where these extra cycles were being spent, we instrumented the code using PAPI. Figure 12(e) shows a breakdown of the total number of instructions and cycles between the client code (the code in Figure 7), the shared objects (graph and workset), and the Galois runtime system. The 4 processor numbers are sums across all four processors. The reference version performs almost 9.8 billion instructions, and this is roughly the same as the number of instructions executed in the client code and shared objects in the 1 processor version of meshgen and the 4 processor version of meshgen(r). Because meshgen(d) has a lot of aborts, it spends substantially more time in the client code doing work that gets aborted and in the runtime layer to recover from aborts.
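These figures can be cross-checked against the per-iteration counts in Figure 12(d). The sketch below assumes that the reference program executes the same 21,918 iterations that meshgen commits on one processor (Figure 12(c)); the text does not state the iteration count of the reference program.

```python
# Instructions per iteration on one processor, from Figure 12(d).
reference_per_iter = 442_506
meshgen_per_iter = 780_236

# Assumption: the reference run performs as many iterations as 1-processor meshgen.
iterations = 21_918

extra = meshgen_per_iter / reference_per_iter - 1.0
reference_total = reference_per_iter * iterations

print(f"meshgen executes {extra:.0%} more instructions per iteration")
print(f"reference total: {reference_total / 1e9:.2f} billion instructions")
```

The ratio comes out at about 76%, and the estimated reference total at about 9.7 billion instructions, both in line with the rounded figures quoted in the text.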
(f) Breakdown of instructions and cycles [bar charts for treebuild on 1 proc and 4 proc; bar labels: instructions (billions) 12.5272, 13.2660; cycles (billions) 14.1666, 18.8916]

We further broke down the Galois overhead into four categories: commit and abort overheads, which are the time spent committing iterations and aborting them, respectively; scheduler overhead, which includes time spent arbitrating conflicts; and commutativity overhead, which is the time spent performing conflict checks. The results, as seen in Figure 12(f), show that roughly three fourths of the Galois overhead goes in performing commutativity checks. It is clear that reducing this overhead is key to reducing the overall overhead of the Galois run-time.
(g) Breakdown of Galois overhead [pie chart: Commutativity 52%, Scheduler 39%, Commit 8%, Abort 1%]

    # of procs     User    Object   Runtime    Total
    1            0.5583     3.102     0.883    4.544
    4             2.563   12.8052     5.177   20.545

(h) Number of L3 misses (in millions) on different numbers of processors.

Figure 13. Agglomerative clustering results: 4-processor Itanium

The 1 processor version of meshgen executes roughly the same number of instructions as the 4 processor version. We do not get perfect self-relative speedup because some of these instructions take longer to execute in the 4 processor version than in the 1 processor version. There are two reasons for this: contention for locks in shared objects and the runtime system, and cache misses due to invalidations. Contention is difficult to measure directly, so we looked at cache misses instead. On the 4 processor Itanium, there is no shared cache, so we measured L3 cache misses. Figure 12(g) shows L3 misses; the 4 processor numbers are sums across all processors for meshgen(r). Most of the increase in cache misses arises from code in the shared object classes and in the Galois runtime. An L3 miss costs roughly 300 cycles on the Itanium, so it can be seen that over half of the extra cycles executed by the 4 processor version, when compared to the 1 processor version, are lost in L3 misses. The rest of the extra cycles are lost in contention.
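The L3 figures in Figure 12(g) can be turned into a rough cycle estimate as follows. The sketch takes the 300-cycle miss cost at face value, although the text gives it only as a rough figure, and the variable names are hypothetical.

```python
# L3 misses in millions for meshgen(r), from Figure 12(g).
misses = {
    1: {"client": 1.177, "object": 0.6208, "runtime": 0.6884},
    4: {"client": 2.769, "object": 3.600, "runtime": 4.282},
}
CYCLES_PER_MISS = 300   # "roughly 300 cycles" per L3 miss on the Itanium

extra = {k: misses[4][k] - misses[1][k] for k in misses[1]}
extra_total = sum(extra.values())                       # millions of misses
non_client_share = (extra["object"] + extra["runtime"]) / extra_total
lost_cycles = extra_total * 1e6 * CYCLES_PER_MISS       # cycles

print(f"extra L3 misses: {extra_total:.3f} million")
print(f"share from shared objects and runtime: {non_client_share:.1%}")
print(f"estimated cycles lost to extra misses: {lost_cycles / 1e9:.2f} billion")
```

With these inputs roughly four fifths of the additional misses come from the shared objects and the runtime, and the additional misses correspond to about 2.4 billion cycles.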