
Optimistic Parallelism Requires Abstractions∗

Milind Kulkarni†, Keshav Pingali
Department of Computer Science, University of Texas, Austin.
{milind,pingali}@cs.utexas.edu

Bruce Walter, Ganesh Ramanarayanan, Kavita Bala‡, L. Paul Chew
Department of Computer Science, Cornell University, Ithaca, New York.
[email protected], {graman,kb,chew}@cs.cornell.edu

∗ This work is supported in part by NSF grants 0615240, 0541193, 0509307, 0509324, 0426787 and 0406380, as well as grants from the IBM and Intel Corporations.
† Milind Kulkarni is supported by the DOE HPCS Fellowship.
‡ Kavita Bala is supported in part by NSF Career Grant 0644175.

Abstract

Irregular applications, which manipulate large, pointer-based data structures like graphs, are difficult to parallelize manually. Automatic tools and techniques such as restructuring compilers and run-time speculative execution have failed to uncover much parallelism in these applications, in spite of a lot of effort by the research community. These difficulties have even led some researchers to wonder if there is any coarse-grain parallelism worth exploiting in irregular applications.

In this paper, we describe two real-world irregular applications: a Delaunay mesh refinement application and a graphics application that performs agglomerative clustering. By studying the algorithms and data structures used in these applications, we show that there is substantial coarse-grain, data parallelism in these applications, but that this parallelism is very dependent on the input data and therefore cannot be uncovered by compiler analysis. In principle, optimistic techniques such as thread-level speculation can be used to uncover this parallelism, but we argue that current implementations cannot accomplish this because they do not use the proper abstractions for the data structures in these programs.

These insights have informed our design of the Galois system, an object-based optimistic parallelization system for irregular applications. There are three main aspects to Galois: (1) a small number of syntactic constructs for packaging optimistic parallelism as iteration over ordered and unordered sets, (2) assertions about methods in class libraries, and (3) a runtime scheme for detecting and recovering from potentially unsafe accesses to shared memory made by an optimistic computation.

We show that Delaunay mesh generation and agglomerative clustering can be parallelized in a straight-forward way using the Galois approach, and we present experimental measurements to show that this approach is practical. These results suggest that Galois is a practical approach to exploiting data parallelism in irregular programs.

Categories and Subject Descriptors D.1.3 [Programming Techniques]: Concurrent Programming—Parallel Programming

General Terms Languages

Keywords Optimistic Parallelism, Abstractions, Irregular Programs

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
PLDI'07 June 11–13, 2007, San Diego, California, USA.
Copyright © 2007 ACM 978-1-59593-633-2/07/0006...$5.00.

1. Introduction

    A pessimist sees the difficulty in every opportunity;
    an optimist sees the opportunity in every difficulty.
        (Sir Winston Churchill)

The advent of multicore processors has shifted the burden of improving program execution speed from chip manufacturers to software developers. A particularly challenging problem in this context is the parallelization of irregular applications that deal with complex, pointer-based data structures such as trees, queues and graphs. In this paper, we describe two such applications: a Delaunay mesh refinement code [8] and a graphics application [39] that performs agglomerative clustering [26].

Figure 1. A Delaunay mesh. Note that the circumcircle for each of the triangles does not contain other points in the mesh.

Figure 2. Fixing a bad element.

In principle, it is possible to use a thread library (e.g., pthreads) or a combination of compiler directives and libraries (e.g., OpenMP [25]) to write threaded code for multicore processors, but it is well known that writing threaded code can be very tricky because of the complexities of synchronization, data races, memory consistency, etc. Tim Sweeney, who designed the multi-threaded Unreal 3 game engine, estimates that writing multi-threading code tripled software costs at Epic Games (quoted in [9]).

Another possibility is to use compiler analyses such as points-to and shape analysis [5, 31] to parallelize sequential irregular programs. Unfortunately, static analyses fail to uncover the parallelism in such applications because the parallel schedule is very data-dependent and cannot be computed at compile-time, as we argue in Section 3.

Optimistic parallelization [17] is a promising idea, but current implementations of optimistic parallelization such as thread-level speculation (TLS) [36, 43] cannot exploit the parallelism in these applications, as we discuss in Section 3.

In this paper, we describe the Galois approach to parallelizing irregular applications. This approach is informed by the following beliefs.

• Optimistic parallelization is the only plausible approach to parallelizing many, if not most, irregular applications.
• For effective optimistic parallelization, it is crucial to exploit the abstractions provided by object-oriented languages (in particular, the distinction between an abstract data type and its implementation).
• Concurrency should be packaged, when possible, within syntactic constructs that make it easy for the programmer to express what might be done in parallel, and for the compiler and runtime system to determine what should be done in parallel.
  The syntactic constructs used in Galois are very natural and can be added easily to any object-oriented programming language like Java. They are related to set iterators in SETL [19].
• Concurrent access to mutable shared objects by multiple threads is fundamental, and cannot be added to the system as an afterthought as is done in current approaches to optimistic parallelization. However, discipline needs to be imposed on concurrent accesses to shared objects to ensure correct execution.

We have implemented the Galois approach in C++ on two shared-memory platforms, and we have used this implementation to write a number of complex applications including Delaunay mesh refinement, agglomerative clustering, an image segmentation code that uses graph cuts [39], and an approximate SAT solver called WalkSAT [33].

This paper describes the Galois approach and its implementation, and presents performance results for some of these applications. It is organized as follows. In Section 2, we present Delaunay mesh refinement and agglomerative clustering, and describe opportunities for exploiting parallelism in these codes. In Section 3, we give an overview of existing parallelization techniques and argue that they cannot exploit the parallelism in these applications. In Section 4, we discuss the Galois programming model and run-time system. In Section 5, we evaluate the performance of our system on the two applications. Finally, in Section 6, we discuss conclusions and ongoing work.

2. Two Irregular Applications

To understand the nature of the parallelism in irregular programs, it is useless to study the execution traces of irregular programs, as most studies in this area do; instead it is necessary to recall Niklaus Wirth's aphorism program = algorithm + data structure [41], and examine the relevant algorithms and data structures. In this section, we describe two irregular applications: Delaunay mesh refinement [8], and agglomerative clustering [26] as used within a graphics application [39]. These applications perform refinement and coarsening respectively, which are arguably the two most common operations for bulk-modification of irregular data structures. For each application, we describe the algorithm and key data structures, and describe opportunities for exploiting parallelism.

2.1 Delaunay Mesh Refinement

Mesh generation is an important problem with applications in many areas such as the numerical solution of partial differential equations and graphics. The goal of mesh generation is to represent a surface or a volume as a tessellation composed of simple shapes like triangles, tetrahedra, etc.

Although many types of meshes are used in practice, Delaunay meshes are particularly important since they have a number of desirable mathematical properties [8]. The Delaunay triangulation for a set of points in the plane is the triangulation such that no point is inside the circumcircle of any triangle (this property is called the empty circle property). An example of such a mesh is shown in Figure 1.

In practice, the Delaunay property alone is not sufficient, and it is necessary to impose quality constraints governing the shape and size of the triangles. For a given Delaunay mesh, this is accomplished by iterative mesh refinement, which successively fixes "bad" triangles (triangles that do not satisfy the quality constraints) by adding new points to the mesh and re-triangulating. Figure 2 illustrates this process; the shaded triangle in Figure 2(a) is assumed to be bad. To fix this bad triangle, a new point is added at the circumcenter of this triangle. Adding this point may invalidate the empty circle property for some neighboring triangles, so all affected triangles are determined (this region is called the cavity of the bad triangle), and the cavity is re-triangulated, as shown in Figure 2(c) (in this figure, all triangles lie in the cavity of the shaded bad triangle). Re-triangulating a cavity may generate new bad triangles, but it can be shown that this iterative refinement process will ultimately terminate and produce a guaranteed-quality mesh. Different orders of processing bad elements lead to different meshes, although all such meshes satisfy the quality constraints [8].

    Mesh m = /* read in initial mesh */
    WorkList wl;
    wl.add(m.badTriangles());
    while (wl.size() != 0) {
        Element e = wl.get(); // get bad triangle
        if (e no longer in mesh) continue;
        Cavity c = new Cavity(e);
        c.expand();
        c.retriangulate();
        m.update(c);
        wl.add(c.badTriangles());
    }

Figure 3. Pseudocode of the mesh refinement algorithm

Figure 3 shows the pseudocode for mesh refinement. The input to this program is a Delaunay mesh in which some triangles may be bad, and the output is a refined mesh in which all triangles satisfy the quality constraints. There are two key data structures used in this algorithm. One is a worklist containing the bad triangles in the mesh. The other is a graph representing the mesh structure; each triangle in the mesh is represented as one node, and edges in the graph represent triangle adjacencies in the mesh.

Opportunities for Exploiting Parallelism. The natural unit of work for parallel execution is the processing of a bad triangle. Our measurements show that on the average, each unit of work takes about a million instructions of which about 10,000 are floating-point operations. Because a cavity is typically a small neighborhood of a bad triangle, two bad triangles that are far apart on the mesh may have cavities that do not overlap. In that case, the entire refinement process (expansion, retriangulation and graph updating) for the two triangles is completely independent; thus, the two triangles can be processed in parallel. This approach obviously extends to more than two triangles. If however the cavities of two triangles overlap, the triangles can be processed in either order but only one of them can be processed at a time. Whether or not two bad triangles have overlapping cavities depends entirely on the structure of the mesh, which changes throughout the execution of the algorithm.

How much parallelism is there in Delaunay mesh generation? The answer obviously depends on the mesh and on the order in which bad triangles are processed, and may be different at different points during the execution of the algorithm. One study by Antonopoulos et al. [2] on a mesh of one million triangles found that there were more than 256 cavities that could be expanded in parallel until almost the end of execution.

2.2 Agglomerative Clustering

The second problem is agglomerative clustering, a well-known data-mining algorithm [26]. This algorithm is used in graphics applications for handling large numbers of light sources [39].

The input to the clustering algorithm is (1) a data-set, and (2) a measure of the "distance" between items in the data-set. Intuitively, this measure is an estimate of similarity: the larger the distance between two data items, the less similar they are believed to be. The goal of clustering is to construct a binary tree called a dendrogram whose hierarchical structure exposes the similarity between items in the data-set. Figure 4(a) shows a data-set containing points in the plane, for which the measure of distance between data points is the usual Euclidean distance. The dendrogram for this data set is shown in Figures 4(b,c).

Figure 4. Agglomerative clustering: (a) Data points, (b) Hierarchical clusters, (c) Dendrogram (over the points a, b, c, d, e).

Agglomerative clustering can be performed by an iterative algorithm: at each step, the two closest points in the data-set are clustered together and replaced in the data-set by a single new point that represents the new cluster. The location of this new point may be determined heuristically [26]. The algorithm terminates when there is only one point left in the data-set.

Pseudocode for the algorithm is shown in Figure 5. The central data structure is a priority queue whose entries are ordered pairs of points <x,y>, such that y is the nearest neighbor of x (we call this nearest(x)). In each iteration of the while loop, the algorithm dequeues the top element of the priority queue to find a pair of points <p,n> that are closer to each other than any other pair of points, and clusters them. These two points are then replaced by a new point that represents this cluster. The nearest neighbor of this new point is determined, and the pair is entered into the priority queue. If there is only one point left, its nearest neighbor is the point at infinity.

     1: kdTree := new KDTree(points)
     2: pq := new PriorityQueue()
     3: foreach p in points {pq.add(<p,kdTree.nearest(p)>)}
     4: while(pq.size() != 0) do {
     5:   Pair <p,n> := pq.get(); //return closest pair
     6:   if (p.isAlreadyClustered()) continue;
     7:   if (n.isAlreadyClustered()) {
     8:     pq.add(<p, kdTree.nearest(p)>);
     9:     continue;
    10:   }
    11:   Cluster c := new Cluster(p,n);
    12:   dendrogram.add(c);
    13:   kdTree.remove(p);
    14:   kdTree.remove(n);
    15:   kdTree.add(c);
    16:   Point m := kdTree.nearest(c);
    17:   if (m != ptAtInfinity) pq.add(<c,m>);
    18: }

Figure 5. Pseudocode for agglomerative clustering

To find the nearest neighbor of a point, we can scan the entire data-set at each step, but this is too inefficient. A better approach is to sort the points by location, and search within this sorted set to find nearest neighbors. If the points were all in a line, we could use a binary search tree. Since the points are in higher dimensions, a multi-dimensional analog called a kd-tree is used [3]. The kd-tree is built at the start of the algorithm, and it is updated by removing the points that are clustered, and then adding the new point representing the cluster, as shown in Figure 5.

Opportunities for Exploiting Parallelism. Since each iteration clusters the two closest points in the current data-set, it may seem that the algorithm is inherently sequential. In particular, an item <x,nearest(x)> inserted into the priority queue by iteration i at line 17 may be the same item that is dequeued by iteration (i+1) in line 5; this will happen if the points in the new pair are closer together than any other pair of points in the current data-set. On the other hand, if we consider the data-set in Figure 4(a), we see that points a and b, and points c and d can be clustered concurrently since neither cluster affects the other. Intuitively, if the dendrogram is a long and skinny tree, there may be few independent iterations, whereas if the dendrogram is a bushy tree, there is parallelism that can be exploited since the tree can be constructed bottom-up in parallel. As in the case of Delaunay mesh refinement, the parallelism is very data-dependent. In experiments on graphics scenes with 20,000 lights, we have found that on average about 100 clusters can be constructed concurrently; thus, there is substantial parallelism that can be exploited. For this application, each iteration of the while-loop in Figure 5 performs about 100,000 instructions of which roughly 4000 are floating-point operations.

3. Limitations of Current Approaches

Current approaches for parallelizing irregular applications can be divided into static, semi-static, and dynamic approaches.

Static Approaches. One approach to parallelization is to use a compiler to analyze and transform sequential programs into parallel ones, using techniques like points-to analysis [5] and shape analysis [31]. The weakness of this approach is that the parallel schedule produced by the compiler must be valid for all inputs to the program. As we have seen, parallelism in irregular applications can be very data-dependent, so compile-time parallelization techniques will serialize the entire execution. This conclusion holds even if dependence analysis is replaced with more sophisticated analysis techniques like commutativity analysis [10].

A Semi-static Approach. In the inspector-executor approach of Saltz et al. [27], the computation is split into two phases, an inspector phase that determines dependencies between units of work, and an executor phase that uses the schedule to perform the computation in parallel. This approach is not useful for our applications since the data-sets, and therefore the dependences, change as the codes execute.

Dynamic Approaches. In dynamic approaches, parallelization is performed at runtime, and is known as speculative or optimistic parallelization. The program is executed in parallel assuming that dependences are not violated, but the system software or hardware detects dependence violations and takes appropriate corrective action such as killing off the offending portions of the program and re-executing them sequentially. If no dependence violations are detected by the end of the speculative computation, the results of the speculative computation are committed and become available to other computations.

Fine-grain speculative parallelization for exploiting instruction-level parallelism was introduced around 1970; for example, Tomasulo's IBM 360/91 fetched instructions speculatively from both sides of a branch before the branch target was resolved [37]. Speculative execution of instructions past branches was studied in the abstract by Foster and Riseman in 1972 [7], and was made practical by Josh Fisher when he introduced the idea of using branch probabilities to guide speculation [11]. Branch speculation can expose instruction-level (fine-grain) parallelism in programs but not the data-dependent coarse-grain parallelism in applications like Delaunay mesh refinement.
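The dynamic approach described in this section (execute speculatively, detect dependence violations, re-execute offenders, commit the rest) can be illustrated with a toy model. The sketch below is not taken from any actual speculation system: the names `TrackedView` and `run_speculatively` are hypothetical, shared memory is modeled as a Python dict, and violations are checked only at commit time, whereas real systems may detect them earlier and in hardware.

```python
class TrackedView:
    """Buffers one iteration's writes and logs the locations it reads."""

    def __init__(self, memory):
        self.memory = memory
        self.reads = set()
        self.writes = {}

    def read(self, loc):
        if loc in self.writes:          # an iteration sees its own writes
            return self.writes[loc]
        self.reads.add(loc)
        return self.memory[loc]

    def write(self, loc, value):
        self.writes[loc] = value        # not visible to others until commit


def run_speculatively(memory, iterations):
    """Execute loop bodies optimistically; return indices that were re-executed."""
    # Phase 1: every iteration runs against the same uncommitted state.
    # (They run one after another here; conceptually they are concurrent.)
    views = []
    for body in iterations:
        view = TrackedView(memory)
        body(view)
        views.append(view)

    # Phase 2: commit in iteration order, re-executing violators sequentially.
    redone, committed_writes = [], set()
    for index, (body, view) in enumerate(zip(iterations, views)):
        if view.reads & committed_writes:   # read a value an earlier iteration changed
            view = TrackedView(memory)
            body(view)
            redone.append(index)
        memory.update(view.writes)
        committed_writes.update(view.writes)
    return redone


def bump_a(v): v.write("a", v.read("a") + 1)
def bump_b(v): v.write("b", v.read("b") + 1)
def sum_ab(v): v.write("c", v.read("a") + v.read("b"))

memory = {"a": 1, "b": 10, "c": 0}
redone = run_speculatively(memory, [bump_a, bump_b, sum_ab])
```

In this run the first two iterations touch disjoint locations and commit as executed, while the third read stale values of `a` and `b` and is the only one re-executed. Which iterations conflict depends entirely on the locations they happen to touch at run time.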
One of the earliest implementations of coarse-grain optimistic parallel execution was in Jefferson's 1985 Time Warp system for distributed discrete-event simulation [17]. In 1999, Rauchwerger and Padua described the LRPD test for supporting speculative execution of FORTRAN DO-loops in which array subscripts were too complex to be disambiguated by dependence analysis [30]. This approach can be extended to while-loops if an upper bound on the number of loop iterations can be determined before the loop begins execution [29]. More recent work has provided hardware support for this kind of coarse-grain loop-level speculation, now known as thread-level speculation (TLS) [36, 43].

However, there are fundamental reasons why current TLS implementations cannot exploit the parallelism in our applications. One problem is that many of these applications, such as Delaunay mesh refinement, have unbounded while-loops, which are not supported by most current TLS implementations since they target FORTRAN-style DO-loops with fixed loop bounds. A more fundamental problem arises from the fact that current TLS implementations track dependences by monitoring the reads and writes made by loop iterations to memory locations. For example, if iteration i+1 writes to a location before it is read by iteration i, a dependence violation is reported, and iteration i+1 must be rolled back.

For irregular applications that manipulate pointer-based data structures, this is too strict and the program will perform poorly because of frequent roll-backs. To understand this, consider the worklist in Delaunay mesh generation. Regardless of how the worklist is implemented, there must be a memory location (call this location head) that points to a cell containing the next bad triangle to be handed out. The first iteration of the while loop removes a bad triangle from the worklist, so it reads and writes to head, but the result of this write is not committed until that iteration terminates successfully. A thread that attempts to start the second iteration concurrently with the execution of the first iteration will also attempt to read and write head, and since this happens before the updates from the first iteration have been committed, a dependence conflict will be reported (the precise point at which a dependence conflict will be reported depends on the TLS implementation). While this particular problem might be circumvented by inventing some ad hoc mechanism, it is unlikely that there is any such work-around for the far more complex priority queue manipulations in agglomerative clustering. The manipulations of the graph and kd-tree in these applications may also create such conflicts.

This is a fundamental problem: for many irregular applications, tracking dependences by monitoring reads and writes to memory locations is correct but will result in poor performance.

Finally, Herlihy and Moss have proposed to simplify shared-memory programming by eliminating lock-based synchronization constructs in favor of transactions [15]. There is growing interest in supporting transactions efficiently with software and hardware implementations of transactional memory [1, 12, 13, 21, 34]. Most of this work is concerned with optimistic synchronization and not optimistic parallelization; that is, their starting point is a program that has already been parallelized (for example, the SPLASH benchmarks [12] or the Linux kernel [28]), and the goal is to find an efficient way to synchronize parallel threads. In contrast, our goal is to find the right abstractions for expressing coarse-grain parallelism in irregular applications, and to support these abstractions efficiently; synchronization is one part of a bigger problem we are addressing in this paper. Furthermore, most implementations of transactional memory track reads and writes to memory locations, so they suffer from the same problems as current TLS implementations. Open nested transactions [22] have been proposed recently as a solution to this problem, and they are discussed in more detail in Section 4.

4. The Galois Approach

Perhaps the most important lesson from the past twenty-five years of parallel programming is that the complexity of parallel programming should be hidden from programmers as far as possible. For example, it is likely that more SQL programs are executed in parallel than programs in any other language. However, most SQL programmers do not write explicitly parallel code; instead they obtain parallelism by invoking parallel library implementations of joins and other relational operations. A "layered" approach of this sort is also used in dense linear algebra, another domain that has successfully mastered parallelism.

In this spirit, programs in the Galois approach consist of (i) a set of library classes and (ii) the top-level client code that creates and manipulates objects of these classes. For example, in Delaunay mesh refinement, the relevant objects are the mesh and worklist, and the client code implements the Delaunay mesh refinement algorithm discussed in Section 2. This client code is executed concurrently by some number of threads, but as we will see, it is not explicitly parallel and makes no mention of threads. Figure 6 is a pictorial view of this execution model.

Figure 6. High-level view of Galois execution model (client code operating on Galois objects).

There are three main aspects to the Galois approach: (1) two syntactic constructs called optimistic iterators for packaging optimistic parallelism as iteration over sets (Section 4.1), (2) assertions about methods in class libraries (Section 4.2), and (3) a runtime scheme for detecting and recovering from potentially unsafe accesses to shared objects made by an optimistic computation (Section 4.3).

4.1 Optimistic iterators

As mentioned above, the client code is not explicitly parallel; instead parallelism is packaged into two constructs that we call optimistic iterators. In the compiler literature, it is standard to distinguish between do-all loops and do-across loops [20]. The iterations of a do-all loop can be executed in any order because the compiler or the programmer asserts that there are no dependences between iterations. In contrast, a do-across loop is one in which there may be dependences between iterations, so proper sequencing of iterations is essential. We introduce two analogous constructs for packaging optimistic parallelism.

• Set iterator: for each e in Set S do B(e)
  The loop body B(e) is executed for each element e of set S. Since set elements are not ordered, this construct asserts that in a serial execution of the loop, the iterations can be executed in any order. There may be dependences between the iterations, as in the case of Delaunay mesh generation, but any serial order of executing iterations is permitted. When an iteration executes, it may add elements to S.
• Ordered-set iterator: for each e in Poset S do B(e)
  This construct is an iterator over a partially-ordered set (Poset) S. It asserts that in a serial execution of the loop, the iterations must be performed in the order dictated by the ordering of elements in the Poset S. There may be dependences between iterations, and as in the case of the set iterator, elements may be added to S during execution.

The set iterator is a special case of the ordered-set iterator but it can be implemented more efficiently, as we see later in this section.

    Mesh m = /* read in initial mesh */
    Set wl;
    wl.add(m.badTriangles());
    for each e in wl do {
        if (e no longer in mesh) continue;
        Cavity c = new Cavity(e);
        c.expand();
        c.retriangulate();
        m.update(c);
        wl.add(c.badTriangles());
    }

Figure 7. Delaunay mesh refinement using set iterator

Figure 7 shows the client code for Delaunay mesh generation. Instead of a work list, this code uses a set and a set iterator. The Galois version is not only simpler but also makes evident the fact that the bad triangles can be processed in any order; this fact is absent from the more conventional code of Figure 3 since it implements a particular processing order. For lack of space, we do not show the Galois version of agglomerative clustering, but it uses the ordered-set iterator in the obvious way.

4.1.1 Concurrent Execution Model

Although the semantics of Galois iterators can be specified without appealing to a parallel execution model, these iterators provide hints from the programmer to the Galois runtime system that it may be profitable to execute the iterations in parallel; of course any parallel execution must be faithful to the sequential semantics.

The Galois concurrent execution model is the following. A master thread begins the execution of the program; it also executes the code outside iterators. When this master thread encounters an iterator, it enlists the assistance of some number of worker threads to execute iterations concurrently with itself. The assignment of iterations to threads is under the control of a scheduling policy implemented by the runtime system; for now, we assume that this assignment is done dynamically to ensure load-balancing. All threads are synchronized using barrier synchronization at the end of the iterator.

In our applications, we have not found it necessary to use nested iterators. There is no fundamental problem in supporting nested parallelism, but our current implementation does not support it; if a thread encounters an inner iterator, it executes the entire inner iterator sequentially.

Given this execution model, the main technical problem is to ensure that the parallel execution respects the sequential semantics of the iterators. This is a non-trivial problem because each iteration may read and write to the objects in shared memory, and we must ensure that these reads and writes are properly coordinated. Section 4.2 describes the information that must be specified by the Galois class writer to enable this. Section 4.3 describes how the Galois runtime system uses this information to ensure that the sequential semantics of iterators are respected.

4.2 Writing Galois Classes

To ensure that the sequential semantics of iterators are respected, there are two problems that must be solved, which we explain with reference to Figure 8. This figure shows set objects with methods add(x), remove(x), get() and contains?(x) that have the usual semantics.¹

¹ The method remove(x) removes a specific element from the set while get() returns an arbitrary element from the set, removing it from the set.

Figure 8. Interleaving method invocations from different iterations: (a) Set S with invocations S.add(x), S.contains?(x), S.remove(x); (b) Workset ws with invocations ws.get(), ws.get(), ws.add(x), ws.add(y).

The first problem is the usual one of concurrency control (also known in the database literature as ensuring consistency). If a method invocation from one iteration is performed concurrently with an invocation from another iteration, we must ensure that the two invocations do not step on each other. One solution is to use a lock on object S; if this inhibits concurrency, we can use fine-grain locks within object S. These locks are acquired before the method is invoked and released when the method completes.

However, this is not enough to ensure that the sequential semantics of the iterators are respected. Consider Figure 8(a). If S does not contain x before the iterations start, notice that in any sequential execution of the iterations, the method invocation contains?(x) will return false. However, for one possible interleaving of operations (add(x), contains?(x), remove(x)), the invocation contains?(x) will return true, which is incorrect. This is the problem of ensuring isolation of the iterations.

One solution for both problems is for an iteration to release its locks only at the end of the iteration: the well-known two-phase locking algorithm used in databases is an optimized version of this simple idea. Transactional memory implementations accomplish the same goal by tracking the read and write sets of each iteration instead of locking them.

While this solves the problem in Figure 8(a), it is not adequate for our applications. The program in Figure 8(b) is motivated by Delaunay mesh generation: each iteration gets a bad triangle at the beginning of the iteration, and may add some bad triangles to the work-set at the end. Regardless of how the set object is implemented, there must be a location (call it head) that points to a cell containing the next triangle to be handed out. The first iteration to get work will read and write location head, and it will lock it for the duration of the iteration, preventing any other iterations from getting work. Most current implementations of transactional memory will suffer from the same problem since the head location will be in the read and write sets of the first iteration for the duration of that iteration.

The crux of the problem is that the abstract set operations have useful semantics that are not available to an implementation that works directly on the representation of the set and tracks reads and writes to individual memory locations. The problem therefore is to understand the semantics of set operations that must be exploited to permit parallel execution in our irregular applications, and to specify these semantics in some concise way.

4.2.1 Semantic Commutativity

The solution we have adopted exploits the commutativity of method invocations. Intuitively, it is obvious that the method invocations to a given object from two iterations can be interleaved without losing isolation provided that these method invocations commute, since this ensures that the final result is consistent with some serial order of iteration execution.
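The isolation problem of Figure 8(a) can be checked mechanically. The following is a minimal sketch, assuming a Python set as a stand-in for S; the helper names `run` and `interleavings` are hypothetical. It enumerates every interleaving of the two iterations and records what contains?(x) observes in each.

```python
def run(schedule):
    """Apply a schedule of (iteration, op) steps to an empty set.

    Returns the value observed by contains?(x)."""
    s, seen = set(), None
    for _, op in schedule:
        if op == "add":
            s.add("x")
        elif op == "remove":
            s.discard("x")
        else:                       # "contains"
            seen = "x" in s
    return seen


def interleavings(a, b):
    """Yield all merges of sequences a and b that keep each one's own order."""
    if not a or not b:
        yield list(a) + list(b)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


ITER_1 = [(1, "add"), (1, "remove")]    # one iteration of Figure 8(a)
ITER_2 = [(2, "contains")]              # the other iteration

serial_results = {run(ITER_1 + ITER_2), run(ITER_2 + ITER_1)}
all_results = {run(s) for s in interleavings(ITER_1, ITER_2)}
```

Both serial orders observe False, but the set of all interleavings also contains True, produced by the schedule add(x), contains?(x), remove(x). Each individual method call is atomic in this model, so the failure is one of isolation rather than consistency.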
In Figure 8(a), the invocation contains?(x) does not commute with the operations from the other thread, so the invocations from the two iterations must not be interleaved. In Figure 8(b), (1) get operations commute with each other, and (2) a get operation commutes with an add operation provided that the operand of add is not the element returned by get. This allows multiple threads to pull work from the work-set while ensuring that sequential semantics of iterators are respected.

It is important to note that what is relevant for our purpose is commutativity in the semantic sense. The internal state of the object may actually be different for different orders of method invocations even if these invocations commute in the semantic sense. For example, if the set is implemented using a linked list and two elements are added to this set, the concrete state of the linked list will depend in general on the order in which these elements were added to the list. However, what is relevant for parallelization is that the state of the set abstract data type, which is being implemented by the linked list, is the same for both orders. In other words, we are not concerned with concrete commutativity (that is, commutativity with respect to the implementation type of the class), but with semantic commutativity (that is, commutativity with respect to the abstract data type of the class). We also note that commutativity of method invocations may depend on the arguments of those invocations. For example, an add and a remove commute only if their arguments are different.

4.2.2 Inverse Methods

Because iterations are executed in parallel, it is possible for commutativity conflicts to prevent an iteration from completing. Once a conflict is detected, some recovery mechanism must be invoked to allow execution of the program to continue despite the conflict. Because our execution model uses the paradigm of optimistic parallelism, our recovery mechanism rolls back the execution of the conflicting iteration. To avoid livelock, the lower priority iteration is rolled back in the case of the ordered-set iterator.

To permit this, every method of a shared object that may modify the state of that object must have an associated inverse method that undoes the side-effects of that method invocation. For example, for a set, the inverse of add(x) is remove(x), and the inverse of remove(x) is add(x). As in the case of commutativity, what is relevant for our purpose is an inverse in the semantic sense; invoking a method and its inverse in succession may not restore the concrete data structure to what it was.

Note that when an iteration rolls back, all of the methods which it invokes during roll-back must succeed. Thus, we must never encounter conflicts when invoking inverse methods. When the Galois system checks commutativity, it also checks commutativity with the associated inverse method.

4.2.3 Putting it All Together

Since we are interested in semantic commutativity and undo, it is necessary for the class designer to specify this information. Figure 9 illustrates how this information is specified in Galois for a class that implements sets. The interface specifies two versions of each method: the internal methods on the object, and the interface methods, called from within iterators, that perform the commutativity checks, maintain the undo information and trigger rollbacks when commutativity conflicts are detected.

    class Set {
      // interface methods
      void add(Element x);
        [calls] _add(x) : void
        [commutes]
          - add(y) {y != x}
          - remove(y) {y != x}
          - contains(y) {y != x}
          - get() : y {y != x} //get call that returns y
        [inverse] _remove(x)
      void remove(Element x);
        [calls] _remove(x) : void
        [commutes]
          - add(y) {y != x}
          - remove(y) {y != x}
          - contains(y) {y != x}
          - get() : y {y != x}
        [inverse] _add(x)
      bool contains(Element x);
        [calls] _contains(x) : bool b
        [commutes]
          - add(y) {y != x}
          - remove(y) {y != x}
          - get() : y {y != x}
          - contains(*) //any call to contains
      Element get();
        [calls] _get() : Element x
        [commutes]
          - add(y) {y != x}
          - remove(y) {y != x}
          - contains(y) {y != x}
          - get() : y {y != x}
        [inverse] _add(x)
      //internal methods
      void _add(Element x);
      void _remove(Element x);
      bool _contains(Element x);
      Element _get();
    }

Figure 9. Example Galois class for a Set

The specification for an interface method consists of three main sections (with pseudo-code representing these in the figure):

• calls: This section ties the interface method to the internal method(s) it invokes.
• commutes: This section specifies which other interface methods the current method commutes with, and under which conditions. For example, remove(x) commutes with add(y), as long as the elements are different.
• inverse: This section specifies the inverse of the current method.

The description of the Galois system in this section implicitly assumed that all calls to parallel objects are made from client code. However, to facilitate composition, we also allow parallel objects to invoke methods on other objects. This is handled through a simple flattening approach. The iteration object is passed to the "child" invocation and hence all operations done in the child invocation are appended to the undo log of the iteration. Similarly, the child invocation functions as an extension of the original method when detecting commutativity conflicts. No changes need to be made to the Galois run-time to support this form of composition.

The class implementor must also ensure that each internal method invocation is atomic to ensure consistency. This can be done using any technique desired, including locks or transactional memory. Recall that whatever locks are acquired during method invocation (or memory locations placed in read/write sets during transactional execution) are released as soon as the method completes, rather than being held throughout the execution of the iteration, since we rely on commutativity information to guarantee isolation. In our current implementation, the internal methods are made atomic through the use of locks.

4.2.4 A small example

Consider a program written using a single shared object, an integer accumulator. The object supports two operations: accumulate and
read,withtheobvioussemantics.Itisclearthataccumulatescom- • commutes: This section specifies which other interface meth- mutewithotheraccumulates,andreadscommutewithotherreads, odsthecurrentmethodcommuteswith,andunderwhichcon- butthataccumulatedoesnotcommutewithread.Themethodsare IterationA IterationB IterationC IterationRecord { Status status; { { { Priority p; ... ... ... UndoLog ul; a.accumulate(5) a.accumulate(7) a.read() list<LocalConflictLog> local_log; ... ... ... Lock l; } } } } Figure11. Iterationrecordmaintainedbyruntimesystem Figure10. Exampleaccumulatorcode 4.3.1 ConflictLogs madeatomicwithasinglelockwhichisacquiredatthebeginning Theconflictlogisthemechanismfordetectingcommutativitycon- ofthemethodandreleasedattheend. flicts.Thereisoneconflictlogassociatedwitheachsharedobject. There are three iterations executing concurrently, as seen in Asimpleimplementationfortheconflictlogofanobjectisalist Figure10.Theprogressoftheexecutionisasfollows: containingthemethodsignatures(includingthevaluesoftheinput • Iteration A calls accumulate, acquiring the lock, updating the and output parameters) of all invocations on that object made by accumulatorandthenreleasingthelockandcontinuing. currently executing iterations (called “outstanding invocations”). When iteration i attempts to call a method m on an object, the • IterationBcallsaccumulate.Becauseaccumulatescommute,B 1 method signature is compared against all the outstanding invoca- cansuccessfullymakethecall,acquiringthelock,updatingthe tions in the conflict log. If one of the entries in the log does not accumulatorandreleasingit.NotethatAhasalreadyreleased commutewithm ,thenacommutativityconflictisdetected,and thelockontheaccumulator,thusallowingBtomakeforward 1 anarbitrationprocessisbeguntodeterminewhichiterationsshould progresswithoutblockingontheaccumulator’slock. 
beaborted,asdescribedbelow.Ifm commuteswithalltheentries • WheniterationCattemptstoexecuteread,itseesthatitcannot, 1 inthelog,thesignatureofm isappendedtothelog.Whenieither asreaddoesnotcommutewiththealreadyexecutedaccumu- 1 abortsorcommits,alltheentriesintheconflictloginsertedbyiare lates. Thus, C must roll back and try again. Note that this is removedfromtheconflictlog. notenforcedbythelockontheaccumulator,butinsteadbythe Thismodelforconflictlogs,whilesimple,isnotefficientsince commutativityconditionsontheaccumulator. itrequiresafullscanoftheconflictlogwheneveraniterationcalls • WheniterationsAandBcommit,Ccanthensuccessfullycall a method on the associated object. In our actual implementation, readandcontinueexecution. conflict logs consist of separate conflict sets for each method in theclass.Nowwhenicallsm ,onlytheconflictsetsformethods In[38],vonPraunetaldiscusstheuseoforderedtransactions 1 whichm mayconflictwitharechecked;therestareignored. inparallelizingFORTRAN-styleDO-loops,andtheygivespecial 1 Therearetwooptimizationsthatwehaveimplementedforcon- treatment to reductions in such loops to avoid spurious conflicts. flictlogs. ReductionsdonotrequireanyspecialtreatmentintheGaloisap- First,eachiterationcachesitsownportionoftheconflictlogs proachsincetheprogrammercouldjustuseanobjectliketheac- in a private log called its local log. This local log stores a cumulatortoimplementreduction. record of all the methods the iteration has successfully invoked on the object. 
When an iteration makes a call, it first checks its 4.3 RuntimeSystem locallog.Ifthislocallogindicatesthattheinvocationwillsucceed TheGaloisruntimesystemhastwocomponents:(1)aglobalstruc- (eitherbecausethatsamemethodhasbeencalledbeforeorother turecalledthecommitpoolthatisresponsibleforcreating,abort- methods,whosecommutativityimpliesthatthecurrentmethodalso ing,andcommittingiterations,and(2)per-objectstructurescalled commutes,havebeencalledbefore2),theiterationdoesnotneedto conflictlogswhichdetectwhencommutativityconditionsarevio- checktheobject’sconflictlog. lated. Asecondoptimizationisthatnotallobjectshaveconflictlogs Atahighlevel,theruntimesystemsworksasfollows.Thecom- associatedwiththem.Forexample,thetrianglescontainedinthe mitpoolmaintainsaniterationrecord,showninFigure11,foreach meshdonot;theirinformationismanagedbytheconflictloginthe ongoing iteration in the system. The status of an iteration can be mesh.Ifthisoptimizationisused,caremustbetakenthatmodifi- RUNNING, RTC (ready-to-commit) or ABORTED. Threads go to cationstothetriangleareonlymadethroughthemeshinterface.In thecommitpooltoobtainaniteration.Thecommitpoolcreatesa general,programanalysisisrequiredtoensurethatthisoptimiza- newiterationrecord,obtainsthenextelementfromtheiterator,as- tionissafe. 
signsaprioritytotheiterationrecordbasedonthepriorityofthe element(forasetiterator,allelementshavethesamepriority),and 4.3.2 CommitPool setsthestatusfieldoftheiterationrecordtoRUNNING.Whenan Whenaniterationattemptstocommit,thecommitpoolcheckstwo iteration invokes a method of a shared object, (i) the conflict log things:(i)thattheiterationisattheheadofthecommitqueue,and ofthatobjectandthelocal logoftheiterationrecordareup- (ii)thatthepriorityoftheiterationishigherthanalltheelements dated,asdescribedinmoredetailbelow,and(ii)acallbacktothe leftintheset/poSetbeingiteratedover3.Ifbothconditionsaremet, associated inverse method is pushed onto the undo log of the it- theiterationcansuccessfullycommit.Iftheconditionsarenotmet, erationrecord.Ifacommutativityconflictisdetected,thecommit theiterationmustwaituntilithasthehighestpriorityinthesystem; poolarbitratesbetweentheconflictingiterations,andabortsitera- itsstatusissettoRTC,andthethreadisallowedtobeginanother tionstopermitthehighestpriorityiterationtocontinueexecution. iteration. Callbacks in the undo logs of aborted iterations are executed to undotheireffectsonsharedobjects.Onceathreadhascompleted 2Forexample, ifaniteration hasalreadysuccessfullyinvoked add(x), aniteration,thestatusfieldofthatiterationischangedtoRTC,and thencontains(x)willclearlycommutewithmethodinvocationsmade thethreadisallowedtobeginanewiteration.Whenthecompleted byotherongoingiterations. iterationhasthehighestpriorityinthesystem,itisallowedtocom- 3Thisistoguardagainstasituationwhereanearliercommittediteration mit.Itcanbeseenthattheroleofthecommitpoolissimilartothat addsanewelementwithhighprioritytothecollectionwhichhasnotyet ofareorderbufferinout-of-orderprocessors[14]. 
been consumed by the iterator.

When an iteration successfully commits, the thread that was running that iteration also checks the commit queue to see if more iterations in the RTC state can be committed. If so, it commits those iterations before beginning the execution of a new iteration. When an iteration has to be aborted, the status of its record is changed to ABORTED, but the commit pool takes no further action. Such iteration objects are lazily removed from the commit queue when they reach the head.

Conflict arbitration. The other responsibility of the commit pool is to arbitrate conflicts between iterations. When iterating over an unordered set, the choice of which iteration to roll back in the event of a conflict is irrelevant. For simplicity, we always choose the iteration which detected the conflict. However, when iterating over an ordered set, the lower priority iteration must be rolled back while the higher priority iteration must continue. Without doing so, there exists the possibility of deadlock.

Thus, when iteration i1 calls a method on a shared object and a conflict is detected with iteration i2, the commit pool arbitrates based on the priorities of the two iterations. If i1 has lower priority, it simply performs the standard rollback operations. The thread which was executing i1 then begins a new iteration.

This situation is complicated when i2 is the iteration that must be rolled back. Because the Galois run-time system functions purely at the user level, there is no simple way to abort an iteration running on another thread. To address this problem, each iteration record has an iteration lock as shown in Figure 11. When invoking methods on shared objects, each iteration must own the iteration lock in its record. Thus, the thread running i1 does the following:

1. It attempts to obtain i2's iteration lock. By doing so, it ensures that i2 is not modifying any shared state.
2. It aborts i2 by executing i2's undo log and clearing the various conflict logs of i2's invocations. Note that the control flow of the thread executing i2 does not change; that thread continues as if no rollback is occurring.
3. It sets the status of i2 to ABORTED.
4. It then resumes its execution of i1, which can now proceed as the conflict has been resolved.

On the other side of this arbitration process, the thread executing i2 will realize that i2 has been aborted when it attempts to invoke another method on a shared object (or attempts to commit). At this point, the thread will see that i2's status is ABORTED and will cease execution of i2 and begin a new iteration.

When an iteration has to be aborted, the callbacks in its undo log are executed in LIFO order. Because the undo log must persist until an iteration commits, we must ensure that all the arguments used by the callbacks remain valid until the iteration commits. If the arguments are pass-by-value, there is no problem; they are copied when the callback is created. A more complex situation is when arguments are pass-by-reference or pointers. The first problem is that the underlying data which the reference or pointer points to may be changed during the course of execution. Thus, the callback may be called with inappropriate arguments. However, as long as all changes to the underlying data also occur through Galois interfaces, the LIFO nature of the undo log ensures that they will be rolled back as necessary before the callback uses them. The second problem occurs when an iteration attempts to free a pointer, as there is no simple way to undo a call to free. The Galois run-time avoids this problem by delaying all calls to free until an iteration commits. This does not affect the semantics of the iteration, and avoids the problem of rolling back memory deallocation.

4.4 Discussion

Set iterators: Although the Galois set iterators introduced in Section 4.1 were motivated in this paper by the two applications discussed in Section 2, they are very general, and we have found them to be useful for writing other irregular applications such as advancing front mesh generators [23], and WalkSAT solvers [33]. Many of these applications use "work-list"-style algorithms, for which Galois iterators are natural, and the Galois approach allows us to exploit data-parallelism in these irregular applications.

SETL was probably the first language to introduce an unordered set iterator [19], but this construct differs from its Galois counterpart in important ways. In SETL, the set being iterated over can be modified during the execution of the iterator, but these modifications do not take effect until the execution of the entire iterator is complete. In our experience, this is too limiting because work-list algorithms usually involve data-structure traversals of some kind in which new work is discovered during the traversal. The tuple iterator in SETL is similar to the Galois ordered-set iterator, but the tuple cannot be modified during the execution of the iterator, which limits its usefulness in irregular applications. Finally, SETL was a sequential programming language. DO-loops in FORTRAN are a special case of the Galois ordered-set iterator in which iteration is performed over integers in some interval.

A more complete design than ours would include iterators over multisets and maps, which are easy to add to Galois. MATLAB or FORTRAN-90-style notation like [low:step:high] for specifying ordered and unordered integers within intervals would be useful. We believe it is also advisable to distinguish syntactically between DO-ALL loops and unordered-set iterators over integer ranges, since in the former case, the programmer can assert that run-time dependence checks are unnecessary, enabling more efficient execution. For example, in the standard i-j-k loop nest for matrix-multiplication, the i and j loops are not only Galois-style unordered-set iterators over integer intervals but they are even DO-ALL loops; the k loop is an ordered-set iterator if the accumulations to elements of the C matrix must be done in order.

Semantic commutativity: Without commutativity information, an object can be accessed by at most one iteration at a time, and that iteration shuts out other iterations until it commits. In this case, inverse methods can be implemented automatically by data copying as is done in software transactional memories.

In the applications we have looked at, most shared objects are instances of collections, which are variations of sets, so specifying commutativity information and writing inverse methods has been straightforward. For example, the kd-tree is just a set with an additional method for finding the nearest neighbor of an element in the set. Note that the design of Galois makes it easy to replace simple data structures with clever, hand-tuned concurrent data structures [32] if necessary, without changing the rest of the program.

The use of commutativity in parallel program execution was explored by Bernstein as far back as 1966 [4]. In the context of concurrent database systems, Weihl described a theoretical framework for using commutativity conditions for concurrency control [40]. Herlihy and Weihl extended this work by leveraging ordering constraints to increase concurrency but at the cost of more complex rollback schemes [16].

In the context of parallel programming, Steele described a system for exploiting commuting operations on memory locations in optimistic parallel execution [18]. However, commutativity is still tied to concrete memory locations and does not exploit properties of abstract data types like Galois does. Diniz and Rinard performed static analysis to determine concrete commutativity of methods for use in compile-time parallelization [10]. Semantic commutativity, as used in Galois, is more general but it must be specified by the class designer. Wu and Padua have proposed to use high level semantics of container classes [42]. They propose making a compiler aware of properties of abstract data types such as stacks and sets to permit more accurate dependence analysis.
Recently, Ni et al. [24] have proposed to extend conventional transactional memory with a notion of "abstract locking" to introduce the notion of semantic conflicts. Carlstrom et al. have taken a similar approach to the Java collections classes [6]. Semantic commutativity provides another way of specifying open nesting. More experience is needed before the relative advantages of the two approaches become clear, but we believe that semantic commutativity is an easier notion for programmers to understand.

5. Evaluation

We have implemented the Galois system in C++ on two Linux platforms: (i) a 4 processor, 1.5 GHz Itanium 2, with 16 KB of L1, 256 KB of L2 and 3 MB of L3 cache per processor, and (ii) a dual processor dual-core 3.0 GHz Xeon system, with 32 KB of L1 per core and 4 MB of L2 cache per processor. The threading library on both platforms was pthreads.

5.1 Delaunay Mesh Refinement

We first wrote a sequential Delaunay mesh refinement program without locks, threads etc. to serve as a reference implementation. We then implemented a Galois version (which we call meshgen), and a fine-grain locking version (FGL) that uses locks on individual triangles. The Galois version uses the set iterator, and the runtime system described in Section 4.3. In all three implementations, the mesh was represented by a graph that was implemented as a set of triangles, where each triangle maintained a set of its neighbors. This is essentially the same as the standard adjacency list representation of graphs. For meshgen, code for commutativity checks was added by hand to this graph class; ultimately, we would like to generate this code automatically from high level commutativity specifications like those in Figure 9. We used an STL queue to implement the workset. We refer to these default implementations of meshgen and FGL as meshgen(d) and FGL(d).

To understand the effect of scheduling policy on performance, we implemented two more versions, FGL(r) and meshgen(r), in which the work-set was implemented by a data structure that returned a random element of the current set.

The input data set was generated automatically using Jonathan Shewchuk's Triangle program [35]. It had 10,156 triangles and boundary segments, of which 4,837 triangles were bad.

    Figure 12. Mesh refinement results: 4-processor Itanium

    (a) Execution times: plot of execution time (s) against # of processors (1 to 4) for reference, FGL (d), FGL (r), meshgen (d) and meshgen (r).

    (b) Self-relative speed-ups: plot of speedup against # of processors (1 to 4) for the same five implementations.

    (c) Committed and aborted iterations for meshgen

                            Committed                 Aborted
    # of proc.          Max     Min     Avg       Max     Min     Avg
    1                   21918   21918   21918     n/a     n/a     n/a
    4 (meshgen(d))      22128   21458   21736     28929   27711   28290
    4 (meshgen(r))      22101   21738   21909     265     151     188

    (d) Instructions per iteration on a single processor

    Instruction Type    reference   meshgen(r)
    Branch              38047       70741
    FP                  9946        10865
    LD/ST               90064       165746
    Int                 304449      532884
    Total               442506      780236

    (e) Breakdown of instructions and cycles in meshgen: bar charts of instructions (billions) and cycles (billions) for 1 proc, 4 proc (d) and 4 proc (r). Bar labels recovered from the charts: 16.8889, 25.6734 and 17.4675 (instructions); 13.8951, 25.7625 and 18.8501 (cycles).

    (f) Breakdown of Galois overhead: Commit 10%, Abort 10%, Scheduler 3%, Commutativity 77%.

    (g) L3 misses (in millions) for meshgen(r)

    # of procs    Client    Object    Runtime    Total
    1             1.177     0.6208    0.6884     2.487
    4             2.769     3.600     4.282      10.651

Execution times and speed-ups. Execution times and self-relative speed-ups for the five implementations on the Itanium machine are shown in Figure 12(a,b). The reference version is the fastest on a single processor. On 4 processors, FGL(d) and FGL(r) differ only slightly in performance. Meshgen(r) performed almost as well as FGL, although surprisingly, meshgen(d) was twice as slow as FGL.

Statistics on committed and aborted iterations. To understand these issues better, we determined the total number of committed and aborted iterations for different versions of meshgen, as shown in Figure 12(c). On 1 processor, meshgen executed and committed 21,918 iterations. Because of the inherent non-determinism of the set iterator, the number of iterations executed by meshgen in parallel varies from run to run (the same effect will be seen on one processor if the scheduling policy is varied). Therefore, we ran the codes a large number of times, and determined a distribution for the numbers of committed and aborted iterations. Figure 12(c) shows that on 4 processors, meshgen(d) committed roughly the same number of iterations as it did on 1 processor, but also aborted almost as many iterations due to cavity conflicts. The abort ratio for meshgen(r) is much lower because the scheduling policy reduces the likelihood of conflicts between processors. This accounts for the performance difference between meshgen(d) and meshgen(r). Because the FGL code is carefully tuned by hand, the cost of an aborted iteration is substantially less than the corresponding cost in meshgen, so FGL(r) performs only a little better than FGL(d).

It seems counterintuitive that a randomized scheduling policy could be beneficial, but a deeper investigation into the source of cavity conflicts showed that the problem could be attributed to our use of an STL queue to implement the workset. When a bad triangle is refined by the algorithm, a cluster of smaller bad triangles may be created within the cavity. In the queue data structure, these new bad triangles are adjacent to each other, so it is likely that they will be scheduled together for refinement on different processors, leading to cavity conflicts.

One conclusion from these experiments is that domain knowledge is invaluable for implementing a good scheduling policy.

Instructions and cycles breakdown. Figure 12(d) shows the breakdown of different types of instructions executed by the reference and meshgen versions of Delaunay mesh refinement when they are run on one processor. The numbers shown are per iteration; in sequential execution, there are no aborts, so these numbers give a profile of a "typical" iteration in the two codes. Each iteration of meshgen performs roughly 10,000 floating-point operations and executes almost a million instructions. These are relatively long-running computations.

Meshgen executes almost 80% more instructions than the reference version. To understand where these extra cycles were being spent, we instrumented the code using PAPI. Figure 12(e) shows a breakdown of the total number of instructions and cycles between the client code (the code in Figure 7), the shared objects (graph and workset), and the Galois runtime system. The 4 processor numbers are sums across all four processors. The reference version performs almost 9.8 billion instructions, and this is roughly the same as the number of instructions executed in the client code and shared objects in the 1 processor version of meshgen and the 4 processor version of meshgen(r). Because meshgen(d) has a lot of aborts, it spends substantially more time in the client code doing work that gets aborted and in the runtime layer to recover from aborts.

We further broke down the Galois overhead into four categories: commit and abort overheads, which are the time spent committing iterations and aborting them, respectively; scheduler overhead, which includes time spent arbitrating conflicts; and commutativity overhead, which is the time spent performing commutativity checks. The results, as seen in Figure 12(f), show that roughly three fourths of the Galois overhead goes in performing commutativity checks. It is clear that reducing this overhead is key to reducing the overall overhead of the Galois run-time.

The 1 processor version of meshgen executes roughly the same number of instructions as the 4 processor version. We do not get perfect self-relative speedup because some of these instructions take longer to execute in the 4 processor version than in the 1 processor version. There are two reasons for this: contention for locks in shared objects and the runtime system, and cache misses due to invalidations. Contention is difficult to measure directly, so we looked at cache misses instead. On the 4 processor Itanium, there is no shared cache, so we measured L3 cache misses. Figure 12(g) shows L3 misses; the 4 processor numbers are sums across all processors for meshgen(r). Most of the increase in cache misses arises from code in the shared object classes and in the Galois runtime. An L3 miss costs roughly 300 cycles on the Itanium, so it can be seen that over half of the extra cycles executed by the 4 processor version, when compared to the 1 processor version, are lost in L3 misses. The rest of the extra cycles are lost in contention.

    Figure 13. Agglomerative clustering results: 4-processor Itanium

    (a) Execution times: plot of execution time (s) against # of processors (1 to 4) for reference and treebuild.

    (b) Self-relative speed-ups: plot of speedup against # of processors (1 to 4) for reference and treebuild.

    (c) Commit pool occupancy by RTC iterations: histogram of frequency against RTC occupancy in commit pool.

    (d) Committed and aborted iterations in treebuild

                      Committed                 Aborted
    # of proc.    Max     Min     Avg       Max     Min     Avg
    1             57846   57846   57846     n/a     n/a     n/a
    4             57870   57849   57861     3128    1887    2528

    (e) Instructions per iteration on a single processor

    Instruction Type    reference   treebuild
    Branch              7162        18187
    FP                  3601        3640
    LD/ST               22519       48025
    Int                 70829       146716
    Total               104111      216568

    (f) Breakdown of instructions and cycles: bar charts of instructions (billions) and cycles (billions) for 1 proc and 4 proc. Bar labels recovered from the charts: 12.5272 and 13.2660 (instructions); 14.1666 and 18.8916 (cycles).

    (g) Breakdown of Galois overhead: Commit 8%, Abort 1%, Commutativity 52%, Scheduler 39%.

    (h) Number of L3 misses (in millions) on different numbers of processors

    # of procs    User      Object     Runtime    Total
    1             0.5583    3.102      0.883      4.544
    4             2.563     12.8052    5.177      20.545
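As a rough check of the arithmetic behind two statements in Section 5.1, the sketch below recomputes them from the tabulated values only: the per-iteration instruction overhead of meshgen over the reference (Figure 12(d)) and the cycles attributable to the additional L3 misses on 4 processors (Figure 12(g)), using the quoted cost of roughly 300 cycles per miss. The function and constant names are illustrative assumptions, not part of the measurement setup.

```cpp
// Illustrative helpers; the names are hypothetical. Inputs come from
// Figure 12(d) and Figure 12(g) and the "roughly 300 cycles" figure in the text.
constexpr double kL3MissCycles = 300.0;

// Relative instruction overhead of the Galois version per iteration.
constexpr double instruction_overhead(double reference_instr, double galois_instr) {
    return galois_instr / reference_instr - 1.0;
}

// Cycles attributable to the extra L3 misses when moving from 1 to 4 processors.
// Miss counts are given in millions, as in Figure 12(g).
constexpr double extra_miss_cycles(double misses_1p_millions, double misses_4p_millions) {
    return (misses_4p_millions - misses_1p_millions) * 1e6 * kL3MissCycles;
}

// meshgen(r) versus reference, totals from Figure 12(d).
constexpr double kMeshgenOverhead = instruction_overhead(442506.0, 780236.0);

// meshgen(r), total L3 misses on 1 and 4 processors from Figure 12(g).
constexpr double kMeshgenMissCycles = extra_miss_cycles(2.487, 10.651);
```

With these inputs the overhead comes to about 0.76, which the text rounds to "almost 80%", and the extra misses (10.651 - 2.487 = 8.164 million) account for about 2.45 billion cycles.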

Optimistic Parallelism Requires Abstractions PDF

12 Pages·2007·0.46 MB·English
