Beyond the Socket: NUMA-Aware GPUs
Ugljesa Milic†,∓  Oreste Villa‡  Evgeny Bolotin‡  Akhil Arunkumar∗  Eiman Ebrahimi‡
Aamer Jaleel‡  Alex Ramirez⋆  David Nellans‡
†Barcelona Supercomputing Center (BSC), ∓Universitat Politècnica de Catalunya (UPC)
‡NVIDIA, ∗Arizona State University, ⋆Google
[email protected], {ovilla,ebolotin,eebrahimi,ajaleel,dnellans}@nvidia.com, [email protected], [email protected]
ABSTRACT
GPUs achieve high throughput and power efficiency by employing many small single instruction multiple thread (SIMT) cores. To minimize scheduling logic and performance variance they utilize a uniform memory system and leverage strong data parallelism exposed via the programming model. With Moore's law slowing, for GPUs to continue scaling performance (which largely depends on SIMT core count) they are likely to embrace multi-socket designs where transistors are more readily available. However, when moving to such designs, maintaining the illusion of a uniform memory system is increasingly difficult. In this work we investigate multi-socket non-uniform memory access (NUMA) GPU designs and show that significant changes are needed to both the GPU interconnect and cache architectures to achieve performance scalability. We show that application phase effects can be exploited, allowing GPU sockets to dynamically optimize their individual interconnect and cache policies, minimizing the impact of NUMA effects. Our NUMA-aware GPU outperforms a single GPU by 1.5×, 2.3×, and 3.2× while achieving 89%, 84%, and 76% of theoretical application scalability in 2, 4, and 8 socket designs respectively. Implementable today, NUMA-aware multi-socket GPUs may be a promising candidate for scaling GPU performance beyond a single socket.

CCS CONCEPTS
• Computing methodologies → Graphics processors; • Computer systems organization → Single instruction, multiple data;

KEYWORDS
Graphics Processing Units, Multi-socket GPUs, NUMA Systems

ACM Reference format:
Ugljesa Milic, Oreste Villa, Evgeny Bolotin, Akhil Arunkumar, Eiman Ebrahimi, Aamer Jaleel, Alex Ramirez, David Nellans. 2017. Beyond the Socket: NUMA-Aware GPUs. In Proceedings of MICRO-50, Cambridge, MA, USA, October 14–18, 2017, 13 pages. https://doi.org/10.1145/3123939.3124534

MICRO-50, October 14–18, 2017, Cambridge, MA, USA
© 2017 Association for Computing Machinery.
ACM ISBN 978-1-4503-4952-9/17/10...$15.00
https://doi.org/10.1145/3123939.3124534

1 INTRODUCTION
In the last decade GPU computing has transformed the high performance computing, machine learning, and data analytics fields that were previously dominated by CPU-based installations [27, 34, 53, 61]. Many systems now rely on a combination of GPUs and CPUs to leverage high throughput data parallel GPUs with latency critical execution occurring on the CPUs. In part, GPU-accelerated computing has been successful in these domains because of native support for data parallel programming languages [24, 40] that reduce programmer burden when trying to scale programs across ever growing datasets.

Nevertheless, with GPUs nearing the reticle limits for maximum die size and the transistor density growth rate slowing down [5], developers looking to scale the performance of their single GPU programs are in a precarious position. Multi-GPU programming models support explicit programming of two or more GPUs, but it is challenging to leverage mechanisms such as Peer-2-Peer access [36] or a combination of MPI and CUDA [42] to manage multiple GPUs. These programming extensions enable programmers to employ more than one GPU for high throughput computation, but require re-writing of traditional single GPU applications, slowing their adoption.

High port-count PCIe switches are now readily available and the PCI-SIG roadmap is projecting PCIe 5.0 bandwidth to reach 128 GB/s in 2019 [6]. At the same time, GPUs are starting to expand beyond the traditional PCIe peripheral interface to enable more efficient interconnection protocols between both GPUs and CPUs, such as AMD's Infinity Fabric or NVIDIA's Scalable Link Interface and NVLink [2, 29, 35, 39, 44]. Future high bandwidth GPU-to-GPU interconnects, possibly using improved communication protocols, may lead to system designs with closely coupled groups of GPUs that can efficiently share memory at fine granularity.

The onset of such multi-socket GPUs would provide a pivot point for GPU and system vendors.
Figure 1: The evolution of GPUs from traditional discrete PCIe devices to single logical, multi-socketed accelerators utilizing a switched interconnect.

Figure 2: Percentage of workloads that are able to fill future larger GPUs (average number of concurrent thread blocks exceeds number of SMs in the system). (In the data: 41/41 workloads fill the current 56-SM GPU, 37/41 a 2× GPU with 112 SMs, 35/41 a 4× GPU with 224 SMs, 33/41 an 8× GPU with 448 SMs, and 31/41 a 16× GPU with 896 SMs.)

On one hand, vendors can continue to expose these GPUs as individual GPUs and force developers to use multiple programming paradigms to leverage these multiple GPUs. On the other, vendors could expose multi-socket designs as a single non-uniform memory access (NUMA) GPU resource as shown in Figure 1. By extending the single GPU programming model to multi-socket GPUs, applications can scale beyond the bounds of Moore's law, while simultaneously retaining the programming interface to which GPU developers have become accustomed.

Several groups have previously examined aggregating multiple GPUs together under a single programming model [7, 30]; however this work was done in an era where GPUs had limited memory addressability and relied on high latency, low bandwidth CPU-based PCIe interconnects. As a result, prior work focused primarily on improving the multi-GPU programming experience rather than achieving scalable performance. Building upon this work, we propose a multi-socket NUMA-aware GPU architecture and runtime that aggregates multiple GPUs into a single programmer transparent logical GPU. We show that in the era of unified virtual addressing [37], cache line addressable high bandwidth interconnects [39], and dedicated GPU and CPU socket PCB designs [29], scalable multi-GPU performance may be achievable using existing single GPU programming models. This work makes the following contributions:

• We show that traditional NUMA memory placement and scheduling policies are not sufficient for multi-socket GPUs to achieve performance scalability. We then demonstrate that inter-socket bandwidth will be the primary performance limiter in future NUMA GPUs.
• By exploiting program phase behavior we show that inter-socket links (and thus bandwidth) should be dynamically and adaptively reconfigured at runtime to maximize link utilization. Moreover, we show that link policy must be determined on a per-GPU basis, as global policies fail to capture per-GPU phase behavior.
• We show that both the GPU L1 and L2 caches should be made NUMA-aware and dynamically adapt their caching policy to minimize NUMA effects. We demonstrate that in NUMA GPUs, extending existing GPU cache coherence protocols across multiple sockets is a good design choice, despite the overheads.
• We show that multi-socket NUMA-aware GPUs can allow traditional GPU programs to scale efficiently to as many as 8 GPU sockets, providing significant headroom before developers must re-architect applications to obtain additional performance.

2 MOTIVATION AND BACKGROUND
Over the last decade single GPU performance has scaled thanks to a significant growth in per-GPU transistor count and DRAM bandwidth. For example, in 2010 NVIDIA's Fermi GPU integrated 1.95B transistors on a 529 mm² die, with 180 GB/s of DRAM bandwidth [13]. In 2016 NVIDIA's Pascal GPU contained 12B transistors within a 610 mm² die, while relying on 720 GB/s of memory bandwidth [41]. Unfortunately, transistor density is slowing significantly and integrated circuit manufacturers are not providing roadmaps beyond 7 nm. Moreover, GPU die sizes, which have also been slowly but steadily growing generationally, are expected to slow down due to limitations in lithography and manufacturing cost.

Without either larger or denser dies, GPU architects must turn to alternative solutions to significantly increase GPU performance. Recently 3D die-stacking has seen significant interest due to its successes with high bandwidth DRAM [23]. Unfortunately, 3D die-stacking still has significant engineering challenges related to power delivery, energy density, and cooling [60] when employed in power hungry, maximal die-sized chips such as GPUs. Thus we propose that GPU manufacturers are likely to re-examine a tried and true solution from the CPU world, multi-socket GPUs, to scale GPU performance while maintaining the current ratio of floating point operations per second (FLOPS) to DRAM bandwidth.

Multi-socket GPUs are enabled by the evolution of GPUs from external peripherals to central computing components, considered at system design time. GPU optimized systems now employ custom PCB designs that accommodate high pin count socketed GPU modules [35] with inter-GPU interconnects resembling QPI or HyperTransport [17, 21, 39]. Despite the rapid improvement in hardware capabilities, these systems have continued to expose the GPUs provided as individually addressable units. These multi-GPU systems can provide high aggregate throughput when running multiple concurrent GPU kernels, but to accelerate a single GPU workload
they require layering additional software runtimes on top of native GPU programming interfaces such as CUDA or OpenCL [24, 38]. Unfortunately, by requiring application re-design, many workloads are never ported to take advantage of multiple GPUs.

Extending single GPU workload performance significantly is a laudable goal, but we must first understand if these applications will be able to leverage larger GPUs. Assuming that the biggest GPU in the market today amasses ≈50 SMs (i.e., NVIDIA's Pascal GPU contains 56), Figure 2 shows that across a benchmark set of 41 applications, described later in Section 3.2, most single GPU optimized workloads already contain sufficient data parallelism to fill GPUs that are 2–8× larger than today's biggest GPUs. For those that do not, we find that the absolute number of thread blocks (CTAs) is intentionally limited by the programmer, or that the problem dataset cannot be easily scaled up due to memory limitations. While most of these applications that do scale are unlikely to scale to thousands of GPUs across an entire datacenter without additional developer effort, moderate programmer transparent performance scalability will be attractive for applications that already contain sufficient algorithmic and data parallelism.

In this work, we examine the performance of a future 4-module NUMA GPU to understand the effects that NUMA will have when executing applications designed for UMA GPUs. We will show (not surprisingly) that when executing UMA-optimized GPU programs on a multi-socket NUMA GPU, performance does not scale. We draw on prior work and show that before optimizing the GPU microarchitecture for NUMA-awareness, several software optimizations must be in place to preserve data locality. Alone, these SW improvements are not sufficient to achieve scalable performance, however, and interconnect bandwidth remains a significant hindrance on performance. To overcome this bottleneck we propose two classes of improvements to reduce the observed NUMA penalty.

First, in Section 4 we examine the ability of switch connected GPUs to dynamically repartition their ingress and egress links to provide asymmetric bandwidth provisioning when required. By using existing interconnects more efficiently, the effective NUMA bandwidth ratio of remote memory to local memory decreases, improving performance. Second, to minimize traffic on oversubscribed interconnect links, we propose in Section 5 that GPU caches need to become NUMA-aware. Traditional on-chip caches are optimized to maximize overall hit rate, thus minimizing off-chip bandwidth. However, in NUMA systems, not all cache misses have the same relative cost and thus should not be treated equally. Due to the NUMA penalty of accessing remote memory, we show that performance can be maximized by preferencing cache capacity (and thus improving hit rate) towards data that resides in slower remote NUMA zones, at the expense of data that resides in faster local NUMA zones. To this end, we propose a new NUMA-aware cache architecture that dynamically balances cache capacity based on memory system utilization.

Before diving into microarchitectural details and results, we first describe the locality-optimized GPU software runtime that enables our proposed NUMA-aware architecture.

3 A NUMA-AWARE GPU RUNTIME
Current GPU software and hardware is co-designed to optimize the throughput of processors based on the assumption of uniform memory properties within the GPU. Fine grained interleaving of memory addresses across memory channels on the GPU provides implicit load balancing across memory but destroys memory locality. As a result, thread block scheduling policies need not be sophisticated to capture locality, which has been destroyed by the memory system layout. For future NUMA GPUs to work well, both system software and hardware must be changed to achieve both functionality and performance. Before focusing on architectural changes to build a NUMA-aware GPU, we describe the GPU runtime system we employ to enable multi-socket GPU execution.

Prior work has demonstrated the feasibility of a runtime system that transparently decomposes GPU kernels into sub-kernels and executes them on multiple PCIe attached GPUs in parallel [7, 25, 30]. For example, on NVIDIA GPUs this can be implemented by intercepting and remapping each kernel call, GPU memory allocation, memory copy, and GPU-wide synchronization issued by the CUDA driver. Per-GPU memory fences must be promoted to system level and seen by all GPUs, and sub-kernel CTA identifiers must be properly managed to reflect those of the original kernel. Cabezas et al. solve these two problems by introducing code annotations and an additional source-to-source compiler which was also responsible for statically partitioning data placement and computation [7].

In our work, we follow a similar strategy but without using a source-to-source translation. Unlike prior work, we are able to rely on NVIDIA's Unified Virtual Addressing [37] to allow dynamic placement of pages into memory at runtime. Similarly, technologies with cache line granularity interconnects like NVIDIA's NVLink [39] allow transparent access to remote memory without the need to modify application source code to access local or remote memory addresses. Due to these advancements, we assume that through dynamic compilation of PTX to SASS at execution time, the GPU runtime will be able to statically identify and promote system wide memory fences as well as manage sub-kernel CTA identifiers.

Current GPUs perform fine-grained memory interleaving at a sub-page granularity across memory channels. In a NUMA GPU this policy would destroy locality and result in 75% of all accesses being to remote memory in a 4 GPU system, an undesirable effect in NUMA systems. Similarly, a round-robin page level interleaving could be utilized, similar to the Linux interleave page allocation strategy, but despite the inherent memory load balancing, this still results in 75% of memory accesses occurring over low bandwidth NUMA links. Instead we leverage UVM page migration functionality to migrate pages on-demand from system memory to local GPU memory as soon as the first access (also called first-touch allocation) is performed, as described by Arunkumar et al. [3].

On a single GPU, fine-grained dynamic assignment of CTAs to SMs is performed to achieve good load balancing. Extending this policy to a multi-socket GPU system is not possible due to the relatively high latency of passing sub-kernel launches from software to hardware. To overcome this penalty, the GPU runtime must launch a block of CTAs to each GPU-socket at a coarse granularity. To encourage load balancing, each sub-kernel could be comprised of an interleaving of CTAs using modulo arithmetic. Alternatively, a single kernel can be decomposed into N sub-kernels, where N is the total number of GPU sockets in the system, assigning an equal amount of contiguous CTAs to each GPU. This design choice potentially exposes workload imbalance across sub-kernels, but it also preserves data locality present in applications where contiguous CTAs also access contiguous memory regions [3, 7], as illustrated by the sketch below.
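To make the contiguous-block decomposition concrete, the following CUDA C++ sketch (ours, not the paper's runtime; the kernel, the helper function, and the use of one visible CUDA device per GPU socket are illustrative assumptions) splits one logical launch into N sub-kernels of contiguous CTAs and remaps each block's identifier to match the original grid:

```cuda
#include <cuda_runtime.h>
#include <algorithm>

// Illustrative kernel: each CTA receives the offset of its first block in the
// original (undivided) grid, so indexing matches a single-GPU launch.
__global__ void saxpy_sub(float a, const float* x, float* y, int n, int ctaOffset) {
    int cta = blockIdx.x + ctaOffset;              // remapped CTA identifier
    int i   = cta * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Decompose one logical launch into numSockets sub-kernels of contiguous CTAs.
// Assumes x and y were allocated with cudaMallocManaged, so pages migrate to
// the socket that touches them first (first-touch), as described in Section 3.
void launch_numa(float a, const float* x, float* y, int n, int threads, int numSockets) {
    int totalCtas     = (n + threads - 1) / threads;
    int ctasPerSocket = (totalCtas + numSockets - 1) / numSockets;
    for (int s = 0; s < numSockets; ++s) {
        int first = s * ctasPerSocket;
        int count = std::min(ctasPerSocket, totalCtas - first);
        if (count <= 0) break;
        cudaSetDevice(s);                          // one device per GPU socket in this sketch
        saxpy_sub<<<count, threads>>>(a, x, y, n, first);
    }
}
```

A modulo interleaving of CTAs, the alternative mentioned above, would instead assign original block b to socket b mod N, trading locality for load balance.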
Figure 3: Performance of a 4-socket NUMA GPU relative to a single GPU and a hypothetical 4× larger (all resources scaled) single GPU. Applications shown in grey achieve greater than 99% of performance scaling with SW-only locality optimization. (Bars: 4-socket NUMA GPU with traditional GPU scheduling and memory interleaving, and with locality-optimized GPU scheduling and memory placement (baseline); red dash: hypothetical 4× larger single GPU.)

3.1 Performance Through Locality
Figure 3 shows the relative performance of a 4-socket NUMA GPU compared to a single GPU using the two possible CTA scheduling and memory placement strategies explained above. The green bars show the relative performance of traditional single-GPU scheduling and memory interleaving policies when adapted to a NUMA GPU. The light blue bars show the relative performance of using locality-optimized GPU scheduling and memory placement, consisting of contiguous block CTA scheduling and first touch page migration, after which pages are not dynamically moved between GPUs. The Locality-Optimized solution almost always outperforms the traditional GPU scheduling and memory interleaving. Without these runtime locality optimizations, a 4-socket NUMA GPU is not able to match the performance of a single GPU despite the large increase in hardware resources. Thus, using variants of prior proposals [3, 7], we now only consider this locality-optimized GPU runtime for the remainder of the paper.

Despite the performance improvements that can come via locality-optimized software runtimes, many applications do not scale well on our proposed NUMA GPU system. To illustrate this effect, Figure 3 shows the speedup achievable by a hypothetical (unbuildable) 4× larger GPU with a red dash. This red dash represents an approximation of the maximum theoretical performance we expect from a perfectly architected (both HW and SW) NUMA GPU system. Figure 3 sorts the applications by the gap between the relative performance of the Locality-Optimized NUMA GPU and the hypothetical 4× larger GPU. We observe that on the right side of the graph some workloads (shown in the grey box) can achieve or surpass the maximum theoretical performance. In particular, for the two far-most benchmarks on the right, the locality optimized solutions can outperform the hypothetical 4× larger GPU due to higher cache hit rates, because contiguous block scheduling is more cache friendly than traditional GPU scheduling.

However, the applications on the left side show a large gap between the Locality-Optimized NUMA design and theoretical performance. These are workloads in which either locality does not exist or the Locality-Optimized GPU runtime is not effective, resulting in a large number of remote data accesses. Because our goal is to provide scalable performance for single-GPU optimized applications, the rest of the paper describes how to close this performance gap through microarchitectural innovation. To simplify later discussion, we choose to exclude benchmarks that achieve ≥99% of the theoretical performance with software-only locality optimizations. Still, we include all benchmarks in our final results to show the overall scalability achievable with NUMA-aware multi-socket GPUs.

3.2 Simulation Methodology
To evaluate the performance of future NUMA-aware multi-socket GPUs we use a proprietary, cycle-level, trace-driven simulator for single and multi-GPU systems. Our baseline GPU, in both single GPU and multi-socket GPU configurations, approximates the latest NVIDIA Pascal architecture [43]. Each streaming multiprocessor (SM) is modeled as an in-order processor with multiple levels of cache hierarchy containing private, per-SM, L1 caches and a multi-banked, shared, L2 cache. Each GPU is backed by local on-package high bandwidth memory [23]. Our multi-socket GPU systems contain two to eight of these GPUs interconnected through a high bandwidth switch as shown in Figure 1. Table 1 provides a more detailed overview of the simulation parameters.
Table 1: Simulation parameters for evaluation of single and multi-socket GPU systems.
  Num of GPU sockets: 4
  Total number of SMs: 64 per GPU socket
  GPU Frequency: 1 GHz
  Max number of Warps: 64 per SM
  Warp Scheduler: Greedy then Round Robin
  L1 Cache: Private, 128 KB per SM, 128B lines, 4-way, Write-Through, GPU-side SW-based coherent
  L2 Cache: Shared, Banked, 4 MB per socket, 128B lines, 16-way, Write-Back, Mem-side non-coherent
  GPU–GPU Interconnect: 128 GB/s per socket (64 GB/s each direction), 8 lanes 8B wide each per direction, 128-cycle latency
  DRAM Bandwidth: 768 GB/s per GPU socket
  DRAM Latency: 100 ns

Table 2: Time-weighted average number of thread blocks and application footprint.
  (Benchmark: time-weighted average CTAs, memory footprint in MB)
  ML-GoogLeNet-cudnn-Lev2: 6272, 1205
  ML-AlexNet-cudnn-Lev2: 1250, 832
  ML-OverFeat-cudnn-Lev3: 1800, 388
  ML-AlexNet-cudnn-Lev4: 1014, 32
  ML-AlexNet-ConvNet2: 6075, 97
  Rodinia-Backprop: 4096, 160
  Rodinia-Euler3D: 1008, 25
  Rodinia-BFS: 1954, 38
  Rodinia-Gaussian: 2599, 78
  Rodinia-Hotspot: 7396, 64
  Rodinia-Kmeans: 3249, 221
  Rodinia-Pathfinder: 4630, 1570
  Rodinia-Srad: 16384, 98
  HPC-SNAP: 200, 744
  HPC-Nekbone-Large: 5583, 294
  HPC-MiniAMR: 76033, 2752
  HPC-MiniContact-Mesh1: 250, 21
  HPC-MiniContact-Mesh2: 15423, 257
  HPC-Lulesh-Unstruct-Mesh1: 435, 19
  HPC-Lulesh-Unstruct-Mesh2: 4940, 208
  HPC-AMG: 241549, 3744
  HPC-RSBench: 7813, 19
  HPC-MCB: 5001, 162
  HPC-NAMD2.9: 3888, 88
  HPC-RabbitCT: 131072, 524
  HPC-Lulesh: 12202, 578
  HPC-CoMD: 3588, 319
  HPC-CoMD-Wa: 13691, 393
  HPC-CoMD-Ta: 5724, 394
  HPC-HPGMG-UVM: 10436, 1975
  HPC-HPGMG: 10506, 1571
  Lonestar-SP: 75, 8
  Lonestar-MST-Graph: 770, 86
  Lonestar-MST-Mesh: 895, 75
  Lonestar-SSSP-Wln: 60, 21
  Lonestar-DMR: 82, 248
  Lonestar-SSSP-Wlc: 163, 21
  Lonestar-SSSP: 1046, 38
  Other-Stream-Triad: 699051, 3146
  Other-Optix-Raytracing: 3072, 87
  Other-Bitcoin-Crypto: 60, 5898

GPU coherence protocols are not one-size-fits-all [49, 54, 63]. This work examines clusters of large discrete GPUs, but smaller, more tightly integrated GPU–CPU designs exist today as systems on chip (SoC) [12, 33]. In these designs GPUs and CPUs can share a single memory space and last-level cache, necessitating a compatible GPU–CPU coherence protocol. However, closely coupled CPU–GPU solutions are not likely to be ideal candidates for GPU-centric HPC workloads. Discrete GPUs each dedicate tens of billions of transistors to throughput computing, while integrated solutions dedicate only a fraction of the chip area. While discrete GPUs are also starting to integrate more closely with some CPU coherence protocols [1, 63], PCIe attached discrete GPUs (where integrated coherence is not possible) are likely to continue dominating the market, thanks to broad compatibility between CPU and GPU vendors.

This work examines the scalability of one such cache coherence protocol used by PCIe attached discrete GPUs. The protocol is optimized for simplicity and needs no hardware coherence support at any level of the cache hierarchy. SM-side L1 private caches achieve coherence through compiler inserted cache control (flush) operations, while the memory-side L2 caches do not require coherence support. While software-based coherence may seem heavy handed compared to fine grained MOESI-style hardware coherence, many GPU programming models (in addition to C++ 2011) are moving towards scoped synchronization where explicit software acquire and release operations must be used to enforce coherence. Without the use of these operations, coherence is not globally guaranteed, and thus maintaining fine grain CPU-style MOESI coherence (via either directories or broadcast) may be an unnecessary burden.
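For illustration, the scoped acquire/release style referenced above can be expressed with libcu++'s cuda::atomic (our choice of example vehicle, not something the paper's protocol mandates); coherence actions then only need to be tied to the explicit release and acquire points rather than to every load and store:

```cuda
#include <cuda/atomic>

// One producer CTA publishes a result; a consumer CTA (potentially on another
// GPU socket) waits for it. We assume data and flag live in memory visible to
// both sockets, e.g. allocated with cudaMallocManaged.
__global__ void producer(int* data,
                         cuda::atomic<int, cuda::thread_scope_system>* flag) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        data[0] = 42;                                          // plain store
        flag->store(1, cuda::std::memory_order_release);       // publish: flush/write-back here
    }
}

__global__ void consumer(const int* data, int* out,
                         cuda::atomic<int, cuda::thread_scope_system>* flag) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        while (flag->load(cuda::std::memory_order_acquire) == 0) { /* spin */ }
        out[0] = data[0];                                      // guaranteed to observe 42
    }
}
```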
We study the scalability of multi-socket NUMA GPUs using 41 workloads taken from a range of production codes based on the HPC CORAL benchmarks [28], graph applications from Lonestar [45], HPC applications from Rodinia [9], and several other in-house CUDA benchmarks. This set of workloads covers a wide spectrum of GPU applications used in machine learning, fluid dynamics, image manipulation, graph traversal, and scientific computing. Each of our benchmarks is hand selected to identify the region of interest deemed representative for each workload, which may be as small as a single kernel containing a tight inner loop or several thousand kernel invocations. We run each benchmark to completion for the determined region of interest. Table 2 provides both the memory footprint per application as well as the average number of active CTAs in the workload (weighted by the time spent on each kernel) to provide a representation of how many parallel thread blocks (CTAs) are generally available during workload execution.

4 ASYMMETRIC INTERCONNECTS
Figure 4(a) shows a switch connected GPU with symmetric and static link bandwidth assignment. Each link consists of equal numbers of uni-directional high-speed lanes in both directions, collectively comprising a symmetric bi-directional link.
Figure 4: Example of dynamic link assignment to improve interconnect efficiency: (a) symmetric bandwidth assignment (75% bandwidth utilization in the example), (b) asymmetric bandwidth assignment (100% bandwidth utilization).

Figure 5: Normalized link bandwidth profile for HPC-HPGMG-UVM showing asymmetric link utilization between GPUs and within a GPU. Vertical black dotted lines indicate kernel launch events.

Traditional static, design-time link capacity assignment is very common and has several advantages. For example, only one type of I/O circuitry (egress drivers or ingress receivers) along with only one type of control logic is implemented at each on-chip link interface. Moreover, the multi-socket switches result in simpler designs that can easily support statically provisioned bandwidth requirements. On the other hand, multi-socket link bandwidth utilization can have a large influence on overall system performance. Static partitioning of bandwidth, when application needs are dynamic, can leave performance on the table. Because I/O bandwidth is a limited and expensive system resource, NUMA-aware interconnect designs must look for innovations that can keep wire and I/O utilization high.

In multi-socket NUMA GPU systems, we observe that many applications have different utilization of egress and ingress channels both on a per GPU-socket basis and during different phases of execution. For example, Figure 5 shows a link utilization snapshot over time for the HPC-HPGMG-UVM benchmark running on a SW locality-optimized 4-socket NUMA GPU. Vertical dotted black lines represent kernel invocations that are split across the 4 GPU sockets. Several small kernels have negligible interconnect utilization. However, for the later larger kernels, GPU0 and GPU2 fully saturate their ingress links, while GPU1 and GPU3 fully saturate their egress links. At the same time GPU0 and GPU2, and GPU1 and GPU3, are underutilizing their egress and ingress links respectively.

In many workloads a common scenario has CTAs writing to the same memory range at the end of a kernel (i.e., parallel reductions, data gathering). For CTAs running on one of the sockets, GPU0 for example, these memory references are local and do not produce any traffic on the inter-socket interconnections. However, CTAs dispatched to other GPUs must issue remote memory writes, saturating their egress links while their ingress links remain underutilized, but causing ingress traffic on GPU0. Such communication patterns typically utilize only 50% of the available interconnect bandwidth. In these cases, dynamically increasing the number of ingress lanes for GPU0 (by reversing the direction of egress lanes) and switching the direction of ingress lanes for GPUs 1–3 can substantially improve the achievable interconnect bandwidth. Motivated by these findings, we propose to dynamically control multi-socket link bandwidth assignments on a per-GPU basis, resulting in dynamic asymmetric link capacity assignments as shown in Figure 4(b).

To evaluate this proposal, we model point-to-point links containing multiple lanes, similar to PCIe [47] or NVLink [43]. In these links, 8 lanes with 8 GB/s capacity per lane yield an aggregate bandwidth of 64 GB/s in each direction. We propose replacing uni-directional lanes with bi-directional lanes to which we apply an adaptive link bandwidth allocation mechanism that works as follows. For each link in the system, at kernel launch the links are always reconfigured to contain symmetric link bandwidth with 8 lanes per direction. During kernel execution the link load balancer periodically samples the saturation status of each link. If the lanes in one direction are not saturated, while the lanes in the opposite direction are 99% saturated, the link load balancer reconfigures and reverses the direction of one of the unsaturated lanes after quiescing all packets on that lane.

This sample-and-reconfigure process stops only when directional utilization is not oversubscribed or all but one lane is configured in a single direction. If both ingress and egress links are saturated while in an asymmetric configuration, links are then reconfigured back toward a symmetric configuration to encourage global bandwidth equalization. While this process may sound complex, the circuitry for dynamically turning high speed single ended links around in just tens of cycles or less is already in use by modern high bandwidth memory interfaces, such as GDDR, where the same set of wires is used for both memory reads and writes [16]. In high speed signaling implementations, the necessary phase–delay lock loop resynchronization can occur while data is in flight, eliminating the need to idle the link during this long latency (microseconds) operation if upcoming link turn operations can be sufficiently projected ahead of time, such as on a fixed interval.
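The control loop below is a minimal CUDA C++ sketch of the per-link lane load balancer described above (the 8-lane-per-direction budget and the 99% saturation threshold come from the text; the sampling interface and everything else are our assumptions):

```cuda
// Per-link lane allocation for one GPU socket: 16 bi-directional lanes total
// (8 GB/s each); the baseline split is 8 egress + 8 ingress = 64 GB/s per direction.
struct LinkState {
    int egressLanes  = 8;   // reset to 8/8 at every kernel launch
    int ingressLanes = 8;
};

// Utilization in [0,1] gathered over the last Sample Time window
// (e.g. 5K cycles in the Section 4.1 evaluation).
struct LinkSample {
    double egressUtil;
    double ingressUtil;
};

constexpr double kSaturated = 0.99;

// Called once per sample period per link; reverses at most one lane per call.
void rebalance(LinkState& s, const LinkSample& u) {
    bool egressSat  = u.egressUtil  >= kSaturated;
    bool ingressSat = u.ingressUtil >= kSaturated;

    if (egressSat && !ingressSat && s.ingressLanes > 1) {
        --s.ingressLanes; ++s.egressLanes;          // turn one idle ingress lane around
    } else if (ingressSat && !egressSat && s.egressLanes > 1) {
        --s.egressLanes; ++s.ingressLanes;          // turn one idle egress lane around
    } else if (egressSat && ingressSat && s.egressLanes != s.ingressLanes) {
        // Both directions saturated: drift back toward the symmetric 8/8 split.
        if (s.egressLanes > s.ingressLanes) { --s.egressLanes; ++s.ingressLanes; }
        else                                { --s.ingressLanes; ++s.egressLanes; }
    }
    // Each reconfiguration quiesces in-flight packets on the reversed lane and
    // pays the Switch Time (~100 cycles assumed in Section 4.1).
}
```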
Figure 6: Relative speedup of the dynamic link adaptivity compared to the baseline architecture by varying sample time and assuming a switch time of 100 cycles. In red, we also provide the speedup achievable by doubling link bandwidth. (Configurations: static 128 GB/s inter-GPU interconnect; dynamic 128 GB/s with 1K, 5K, 10K, and 50K cycle sample times; static 256 GB/s.)

4.1 Results and Discussion
Two important parameters affect the performance of our proposed mechanism: (i) Sample Time, the frequency at which the scheme samples for a possible reconfiguration, and (ii) Switch Time, the cost of turning the direction of an individual lane. Figure 6 shows the performance improvement, compared to our locality-optimized GPU, when exploring different values of the Sample Time, indicated by green bars and assuming a Switch Time of 100 cycles. The red bars in Figure 6 provide an upper bound on the performance speedups when doubling the available interconnect bandwidth to 256 GB/s. For workloads on the right of the figure, doubling the link bandwidth has little effect, indicating that a dynamic link policy will also show little improvement due to low GPU–GPU interconnect bandwidth demand. On the left side of the figure, for some applications where improved interconnect bandwidth has a large effect, dynamic lane switching can improve application performance by as much as 80%. For some benchmarks like Rodinia-Euler3D, HPC-AMG, and HPC-Lulesh, doubling the link bandwidth provides a 2× speedup, while our proposed dynamic link assignment mechanism is not able to significantly improve performance. These workloads saturate both link directions, so there is no opportunity to provide additional bandwidth by turning links around.

Using a moderate 5K cycle sample time, the dynamic link policy can improve performance by 14% on average over static bandwidth partitioning. If the link load balancer samples too infrequently, application dynamics can be missed and the performance improvement is reduced. However, if the link is reconfigured too frequently, bandwidth is lost due to the overhead of turning the link. While we have assumed a pessimistic link turn time of 100 cycles, we performed sensitivity studies that show that even if the link turn time were increased to 500 cycles, our dynamic policy loses less than 2% in performance. At the same time, using a faster lane switch (10 cycles) does not significantly improve the performance over a 100 cycle link turn time. The link turnaround times of modern high-speed on-board links such as GDDR5 [16] are about 8 ns including both link and internal DRAM turn-around latency, which is less than 10 cycles at 1 GHz.

Our results demonstrate that asymmetric link bandwidth allocation can be very attractive when inter-socket interconnect bandwidth is constrained by the number of on-PCB wires (and thus total link bandwidth). The primary drawback of this solution is that both types of interface circuitry (TX and RX) and logic must be implemented for each lane in both the GPU and switch interfaces. We conducted an analysis of the potential cost of doubling the amount of I/O circuitry and logic based on a proprietary state of the art GPU I/O implementation. Our results show that doubling this interface area increases total GPU area by less than 1% while yielding a 12% improvement in average interconnect bandwidth and a 14% application performance improvement. One additional caveat worth noting is that the proposed asymmetric link mechanism optimizes link bandwidth in a given direction for each individual link, while the total switch bandwidth remains constant.

5 NUMA-AWARE CACHE MANAGEMENT
Section 4 showed that inter-socket bandwidth is an important factor in achieving scalable NUMA GPU performance. Unfortunately, because either the outgoing or incoming links must be underutilized for us to reallocate that bandwidth to the saturated link, dynamic link rebalancing yields minimal gains if both incoming and outgoing links are saturated. To improve performance in situations where dynamic link balancing is ineffective, system designers can either increase link bandwidth, which is very expensive, or decrease the amount of traffic that crosses the low bandwidth communication channels.
Figure 7: Potential L2 cache organizations to balance capacity between remote and local NUMA memory systems: (a) memory-side, local-only L2; (b) static remote cache (R$) partitioning of L1+L2; (c) shared coherent L1+L2; (d) NUMA-aware L1+L2, which applies the dynamic cache partitioning algorithm (Steps 0–5) described in the text below.

To decrease off-chip memory traffic, architects typically turn to caches to capture locality.

GPU cache hierarchies differ from traditional CPU hierarchies in that they typically do not implement strong hardware coherence protocols [55]. They also differ in that caches may be either processor side (where some form of coherence is typically necessary) or memory side (where coherence is not necessary). As described in Table 1 and Figure 7(a), a GPU today is typically composed of relatively large SW managed coherent L1 caches located close to the SMs, while a relatively small, distributed, non-coherent memory side L2 cache resides close to the memory controllers. This organization works well for GPUs because their SIMT processor designs often allow for significant coalescing of requests to the same cache line, so having large L1 caches reduces the need for global crossbar bandwidth. The memory-side L2 caches do not need to participate in the coherence protocol, which reduces system complexity.

5.1 Design Considerations
In NUMA designs, remote memory references occurring across low bandwidth NUMA interconnections result in poor performance, as shown in Figure 3. Similarly, in NUMA GPUs, utilizing traditional memory-side L2 caches (that depend on fine grained memory interleaving for load balancing) is a poor design choice. Because memory-side caches only cache accesses that originate in their local memory, they cannot cache memory from other NUMA zones and thus cannot reduce NUMA interconnect traffic. Previous work has proposed that GPU L2 cache capacity should be split between memory-side caches and a new processor-side L1.5 cache that is an extension of the GPU L1 caches [3] to enable caching of remote data, shown in Figure 7(b). By balancing L2 capacity between memory side and remote caches (R$), this design limits the need for extending expensive coherence operations (invalidations) into the entire L2 cache while still minimizing crossbar or interconnect bandwidth.

Flexibility: Designs that statically allocate cache capacity to local memory and remote memory, in any balance, may achieve reasonable performance in specific instances, but they lack flexibility. Much like application phasing was shown to affect NUMA bandwidth consumption, the ability to dynamically share cache capacity between local and remote memory has the potential to improve performance in several situations. First, when application phasing results in some GPU-sockets primarily accessing data locally while others are accessing data remotely, a fixed partitioning of cache capacity is guaranteed to be sub-optimal. Second, while we show that most applications will be able to completely fill large NUMA GPUs, this may not always be the case. GPUs within the datacenter are being virtualized and there is continuous work to improve the concurrent execution of multiple kernels and processes within a single GPU [15, 32, 46, 50]. If a large NUMA GPU is sub-partitioned, it is intuitive that system software attempt to partition it along the NUMA boundaries (even within a single GPU-socket) to improve the locality of small GPU kernels. To effectively capture locality in these situations, NUMA-aware GPUs need to be able to dynamically re-purpose cache capacity at runtime, rather than be statically partitioned at design time.

Coherence: To date, discrete GPUs have not moved their memory-side caches to the processor side because the overhead of cache invalidation (due to coherence) is an unnecessary performance penalty. For a single socket GPU with a uniform memory system, there is little performance advantage to implementing L2 caches as processor side caches. Still, in a multi-socket NUMA design, the performance tax of extending coherence into the L2 caches is offset by the fact that remote memory accesses can now be cached locally, and may be justified. Figure 7(c) shows a configuration with a coherent L2 cache where remote and local data contend for L2 capacity as extensions of the L1 caches, implementing an identical coherence policy.

Dynamic Partitioning: Building upon coherent GPU L2 caches, we posit that, while conceptually simple, allowing both remote and local memory accesses to contend for cache capacity (in both the L1 and L2 caches) in a NUMA system is flawed. In UMA systems it is well known that performance is maximized by optimizing for cache hit rate, thus minimizing off-chip memory system bandwidth. However, in NUMA systems, not all cache misses have the same relative performance cost. A cache miss to a local memory address has a smaller cost (in terms of both latency and bandwidth) than a cache miss to a remote memory address. Thus, it should be beneficial to dynamically skew cache allocation to preference caching remote memory over local data when it is determined that the system is bottlenecked on NUMA bandwidth.
Figure 8: Performance of 4-socket NUMA-aware cache partitioning, compared to memory-side L2 and static partitioning. (Configurations: memory-side local-only L2; static partitioning; shared coherent L1+L2; NUMA-aware L1+L2.)

Figure 9: Performance overhead of extending current GPU software based coherence into the GPU L2 caches. (Configurations: single GPU; 4-socket NUMA-aware coherent L1+L2 with and without L2 invalidations.)

To minimize inter-GPU bandwidth in multi-socket GPU systems we propose a NUMA-aware cache partitioning algorithm, with the cache organization and a brief summary shown in Figure 7(d). Similar to our interconnect balancing algorithm, at initial kernel launch (after GPU caches have been flushed for coherence purposes) we allocate one half of the cache ways for local memory and the remaining ways for remote data (Step 0). After executing for a 5K cycle period, we sample the average bandwidth utilization of local memory and estimate the GPU-socket's incoming read request rate by looking at the outgoing request rate multiplied by the response packet size. By using the outgoing request rate to estimate the incoming bandwidth, we avoid situations where incoming writes may saturate our link bandwidth, falsely indicating that we should preference remote data caching. Projected link utilization above 99% is considered to be bandwidth saturated (Step 1). In cases where the interconnect bandwidth is saturated but local memory bandwidth is not, the partitioning algorithm attempts to reduce remote memory traffic by re-assigning one way from the group of local ways to the remote ways grouping (Step 2). Similarly, if the local memory bandwidth is saturated and inter-GPU bandwidth is not, the policy re-allocates one way from the remote group and allocates it to the group of local ways (Step 3). To minimize the impact on cache design, all ways are consulted on lookup, allowing lazy eviction of data when the way partitioning changes. In cases where both the interconnect and the local memory bandwidth are saturated, the policy attempts to equalize the number of local and remote ways (Step 4). Finally, the policy re-samples the bandwidth counters and repeats this process after the next fixed sample period (Step 5).
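The way-partitioning controller can be summarized with a short sketch (again ours; the 16-way budget, the half-and-half starting split, and the saturation test follow the text, while the counter interface is assumed):

```cuda
// Per-GPU-socket partitioning of cache ways between local and remote data,
// re-evaluated once per sample period (e.g. every 5K cycles).
struct WaySplit {
    int localWays  = 8;   // Step 0: half of a 16-way cache for local data
    int remoteWays = 8;   // and half for remote data
};

struct BwSample {
    double linkUtil;   // projected inter-GPU link utilization, 0..1
    double dramUtil;   // local DRAM bandwidth utilization, 0..1
};

constexpr double kSaturatedBw = 0.99;

void repartition(WaySplit& w, const BwSample& s) {
    bool linkSat = s.linkUtil >= kSaturatedBw;   // Step 1
    bool dramSat = s.dramUtil >= kSaturatedBw;

    if (linkSat && !dramSat && w.localWays > 1) {          // Step 2
        --w.localWays; ++w.remoteWays;                     // favor remote data
    } else if (dramSat && !linkSat && w.remoteWays > 1) {  // Step 3
        --w.remoteWays; ++w.localWays;                     // favor local data
    } else if (linkSat && dramSat && w.localWays != w.remoteWays) {   // Step 4
        if (w.localWays > w.remoteWays) { --w.localWays; ++w.remoteWays; }
        else                            { --w.remoteWays; ++w.localWays; }
    }
    // Step 5: the caller re-invokes after the next sample period. Lookups still
    // probe all ways, so re-partitioning only changes victim selection
    // (lazy eviction) rather than invalidating resident lines.
}
```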
5.2 Results
Figure 8 compares the performance of four different cache configurations in our 4-socket NUMA GPU. Our baseline is a traditional GPU with memory-side, local-only, write-back L2 caches. To compare against prior work [3] we provide a 50–50 static partitioning where the L2 cache budget is split between the GPU-side coherent remote cache, which contains only remote data, and the memory-side L2, which contains only local data. In our 4-socket NUMA GPU, static partitioning improves performance by 54% on average, although for some benchmarks it hurts performance by as much as 10%, for workloads that have negligible inter-socket memory traffic. We also show the results for GPU-side coherent L1 and L2 caches where both local and remote data contend for capacity.
Figure 10: Final NUMA-aware GPU performance compared to a single GPU and a 4× larger single GPU with scaled resources. (Configurations: single GPU; hypothetical 4× larger single GPU; 4-socket NUMA GPU baseline; with asymmetric inter-GPU interconnect; with asymmetric interconnect and NUMA-aware cache partitioning.)
On average, this solution outperforms static cache partitioning significantly, despite incurring additional flushing overhead due to cache coherence.

Finally, our proposed NUMA-aware cache partitioning policy is shown in dark grey. Due to its ability to dynamically adapt the capacity of both the L2 and L1 to optimize performance when backed by NUMA memory, it is the highest performing cache configuration. By examining simulation results we find that for workloads on the left side of Figure 8, which fully saturate the inter-GPU bandwidth, the NUMA-aware dynamic policy configures the L1 and L2 caches to be used primarily as remote caches. However, workloads on the right side of the figure tend to have good GPU-socket memory locality, and thus prefer that the L1 and L2 caches store primarily local data. NUMA-aware cache partitioning is able to flexibly adapt to varying memory access profiles and can improve average NUMA GPU performance by 76% compared to traditional memory side L2 caches, and by 22% compared to previously proposed static cache partitioning, despite incurring additional coherence overhead.

When extending SW-controlled coherence into the GPU L2 caches, L1 coherence operations must also be extended into the GPU L2 caches. Using bulk software invalidation to maintain coherence is simple to implement, but it is a performance penalty when it falsely evicts data that is still required. The overhead of this invalidation depends on both the frequency of the invalidations as well as the aggregate cache capacity invalidated. Extending the L1 invalidation protocol into the shared L2, and then across multiple GPUs, increases both the capacity affected and the frequency of the invalidation events.

To understand the impact of these invalidations, we evaluate hypothetical L2 caches which can ignore the cache invalidation events, thus representing the upper limit on performance (no coherence evictions ever occur) that could be achieved by using a finer granularity HW-coherence protocol. Figure 9 shows the impact these invalidation operations have on application performance. While significant for some applications, on average the SW-based cache invalidation overhead is only 10%, even when extended across all GPU-socket L2 caches. So while fine grain HW coherence protocols may improve performance, the magnitude of their improvement must be weighed against their hardware implementation complexity. While in the studies above we assumed a write-back policy in the L2 caches, as a sensitivity study we also evaluated the effect of using a write-through cache policy to mirror the write-through L1 cache policy. Our findings indicate that write-back L2 outperforms write-through L2 by 9% on average in our NUMA-GPU design, due to the decrease in total inter-GPU write bandwidth.

6 DISCUSSION
Combined Improvement: Sections 4 and 5 provide two techniques aimed at more efficiently utilizing scarce NUMA bandwidth within future NUMA GPU systems. The proposed methods for dynamic interconnect balancing and NUMA-aware caching are orthogonal and can be applied in isolation or in combination. Dynamic interconnect balancing has an implementation simplicity advantage in that the system level changes to enable this feature are isolated from the larger GPU design. Conversely, enabling NUMA-aware GPU caching based on interconnect utilization requires changes to both the physical cache architecture and the GPU coherence protocol.

Because these two features target the same problem, when employed together their effects are not strictly additive. Figure 10 shows the overall improvement NUMA-aware GPUs can achieve when applying both techniques in parallel. For benchmarks such as CoMD, these features contribute nearly equally to the overall improvement, but for others such as ML-AlexNet-cudnn-Lev2 or HPC-MST-Mesh1, interconnect improvements or caching are the primary contributor, respectively. On average, we observe that when combined we see a 2.1× improvement over a single GPU and 80% over the baseline