Heterogeneous Computing
with OpenCL 2.0
Third Edition
David Kaeli
Perhaad Mistry
Dana Schaa
Dong Ping Zhang
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann is an imprint of Elsevier
Acquiring Editor: Todd Green
Editorial Project Manager: Charlie Kent
Project Manager: Priya Kumaraguruparan
Cover Designer: Matthew Limbert
Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA
Copyright © 2015, 2013, 2012 Advanced Micro Devices, Inc. Published by Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
ISBN: 978-0-12-801414-1
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
For information on all MK publications visit our website at www.mkp.com
List of Figures
Fig. 1.1 (a) Simple sorting: a divide-and-conquer implementation, breaking the list into shorter lists, sorting them, and then merging the shorter sorted lists. (b) Vector-scalar multiply: scattering the multiplies and then gathering the results to be summed up in a series of steps.
Fig. 1.2 Multiplying elements in arrays A and B, and storing the result in an array C.
Fig. 1.3 Task parallelism present in a fast Fourier transform (FFT) application. Different input images are processed independently in the three independent tasks.
Fig. 1.4 Task-level parallelism, where multiple words can be compared concurrently. Also shown is finer-grained character-by-character parallelism present when characters within the words are compared with the search string.
Fig. 1.5 After all string comparisons in Figure 1.4 have been completed, we can sum up the number of matches in a combining network.
Fig. 1.6 The relationship between parallel and concurrent programs. Parallel and concurrent programs are subsets of all programs.
Fig. 2.1 Out-of-order execution of an instruction stream of simple assembly-like instructions. Note that in this syntax, the destination register is listed first. For example, add a, b, c is a = b + c.
Fig. 2.2 VLIW execution based on the out-of-order diagram in Figure 2.1.
Fig. 2.3 SIMD execution where a single instruction is scheduled in order, but executes over multiple ALUs at the same time.
Fig. 2.4 The out-of-order schedule seen in Figure 2.1 combined with a second thread and executed simultaneously.
Fig. 2.5 Two threads scheduled in a time-slice fashion.
Fig. 2.6 Taking temporal multithreading to an extreme as is done in throughput computing: a large number of threads interleave execution to keep the device busy, whereas each individual thread takes longer to execute than the theoretical minimum.
Fig. 2.7 The AMD Puma (left) and Steamroller (right) high-level designs (not shown to any shared scale). Puma is a low-power design that follows a traditional approach to mapping functional units to cores. Steamroller combines two cores within a module, sharing its floating-point (FP) units.
Fig. 2.8 The AMD Radeon HD 6970 GPU architecture. The device is divided into two halves, where instruction control (scheduling and dispatch) is performed by the wave scheduler for each half. The 24 16-lane SIMD cores execute four-way VLIW instructions on each SIMD lane and contain private level 1 (L1) caches and local data shares (scratchpad memory).
Fig. 2.9 The Niagara 2 CPU from Sun/Oracle. The design intends to make a high level of threading efficient. Note its relative similarity to the GPU design seen in Figure 2.8. Given enough threads, we can cover all memory access time with useful compute, without extracting instruction-level parallelism (ILP) through complicated hardware techniques.
Fig. 2.10 The AMD Radeon R9 290X architecture. The device has 44 cores in 11 clusters. Each core consists of a scalar execution unit that handles branches and basic integer operations, and four 16-lane SIMD ALUs. The clusters share instruction and scalar caches.
Fig. 2.11 The NVIDIA GeForce GTX 780 architecture. The device has 12 large cores that NVIDIA refers to as "streaming multiprocessors" (SMX). Each SMX has 12 SIMD units (with specialized double-precision and special function units), a single L1 cache, and a read-only data cache.
Fig. 2.12 The A10-7850K APU consists of two Steamroller-based CPU cores and eight Radeon R9 GPU cores (32 16-lane SIMD units in total). The APU includes a fast bus from the GPU to DDR3 memory, and a shared path that is optionally coherent with CPU caches.
Fig. 2.13 An Intel i7 processor with HD Graphics 4000 graphics. Although not termed "APU" by Intel, the concept is the same as for the devices in that category from AMD. Intel combines four Haswell x86 cores with its graphics processors, connected to a shared last-level cache (LLC) via a ring bus.
Fig. 3.1 An OpenCL platform with multiple compute devices. Each compute device contains one or more compute units. A compute unit is composed of one or more processing elements (PEs). A system could have multiple platforms present at the same time. For example, a system could have an AMD platform and an Intel platform present at the same time.
Fig. 3.2 Some of the output from the CLInfo program showing the characteristics of an OpenCL platform and devices. We see that the AMD platform has two devices (a CPU and a GPU). The output shown here can be queried using functions from the platform API.
Fig. 3.3 Vector addition algorithm showing how each element can be added independently.
Fig. 3.4 The hierarchical model used for creating an NDRange of work-items, grouped into work-groups.
Fig. 3.5 The OpenCL runtime shown denotes an OpenCL context with two compute devices (a CPU device and a GPU device). Each compute device has its own command-queues. Host-side and device-side command-queues are shown. The device-side queues are visible only from kernels executing on the compute device. The memory objects have been defined within the memory model.
Fig. 3.6 Memory regions and their scope in the OpenCL memory model.
Fig. 3.7 Mapping the OpenCL memory model to an AMD Radeon HD 7970 GPU.
Fig. 4.1 A histogram generated from an 8-bit image. Each bin corresponds to the frequency of the corresponding pixel value.
Fig. 4.2 An image rotated by 45°. Pixels that correspond to an out-of-bounds location in the input image are returned as black.
Fig. 4.3 Applying a convolution filter to a source image.
Fig. 4.4 The effect of different convolution filters applied to the same source image: (a) the original image; (b) blurring filter; and (c) embossing filter.
Fig. 4.5 The producer kernel will generate filtered pixels and send them via a pipe to the consumer kernel, which will then generate the histogram: (a) original image; (b) filtered image; and (c) histogram of filtered image.
Fig. 5.1 Multiple command-queues created for different devices declared within the same context. Two devices are shown, where one command-queue has been created for each device.
Fig. 5.2 Multiple devices working in a pipelined manner on the same data. The CPU queue will wait until the GPU kernel has finished.
Fig. 5.3 Multiple devices working in a parallel manner. In this scenario, both GPUs do not use the same buffers and will execute independently. The CPU queue will wait until both GPU devices have finished.
Fig. 5.4 Executing the simple kernel shown in Listing 5.5. The different work-items in the NDRange are shown.
Fig. 5.5 Within a single kernel dispatch, synchronization regarding execution order is supported only within work-groups using barriers. Global synchronization is maintained by completion of the kernel, and the guarantee that on a completion event all work is complete and memory content is as expected.
Fig. 5.6 Example showing OpenCL memory objects mapping to arguments for clEnqueueNativeKernel() in Listing 5.8.
Fig. 5.7 A single-level fork-join execution paradigm compared with nested parallelism thread execution.
Fig. 6.1 An example showing a scenario where a buffer is created and initialized on the host, used for computation on the device, and transferred back to the host. Note that the runtime could have also created and initialized the buffer directly on the device. (a) Creation and initialization of a buffer in host memory. (b) Implicit data transfer from the host to the device prior to kernel execution. (c) Explicit copying of data back from the device to the host pointer.
Fig. 6.2 Data movement using explicit read-write commands. (a) Creation of an uninitialized buffer in device memory. (b) Explicit data transfer from the host to the device prior to execution. (c) Explicit data transfer from the device to the host following execution.
Fig. 6.3 Data movement using map/unmap. (a) Creation of an uninitialized buffer in device memory. (b) The buffer is mapped into the host's address space. (c) The buffer is unmapped from the host's address space.
Fig. 7.1 The memory spaces available to an OpenCL device.
Fig. 7.2 Data race when incrementing a shared variable. The value stored depends on the ordering of operations between the threads.
Fig. 7.3 Applying Z-order mapping to a two-dimensional memory space.
Fig. 7.4 The pattern of data flow for the example shown in the localAccess kernel.
Fig. 8.1 High-level design of AMD's Piledriver-based FX-8350 CPU.
Fig. 8.2 OpenCL mapped onto an FX-8350 CPU. The FX-8350 CPU is both the OpenCL host and the device in this scenario.
Fig. 8.3 Implementation of work-group execution on an x86 architecture.
Fig. 8.4 Mapping the memory spaces for a work-group (work-group 0) onto a Piledriver CPU cache.
Fig. 8.5 High-level Radeon R9 290X diagram labeled with OpenCL execution and memory model terms.
Fig. 8.6 Memory bandwidths in the discrete system.
Fig. 8.7 Radeon R9 290X compute unit microarchitecture.
Fig. 8.8 Mapping OpenCL's memory model onto a Radeon R9 290X GPU.
Fig. 8.9 Using vector reads provides a better opportunity to return data efficiently through the memory system. When work-items access consecutive elements, GPU hardware can achieve the same result through coalescing.
Fig. 8.10 Accesses to nonconsecutive elements return smaller pieces of data less efficiently.
Fig. 8.11 Mapping the Radeon R9 290X address space onto memory channels and DRAM banks.
Fig. 8.12 Radeon R9 290X memory subsystem.
Fig. 8.13 The accumulation pass of the prefix sum shown in Listing 8.2 over a 16-element array in local memory using 8 work-items.
Fig. 8.14 Step 1 in Figure 8.13 showing the behavior of an LDS with eight banks.
Fig. 8.15 Step 1 in Figure 8.14 with padding added to the original data set to remove bank conflicts in the LDS.
Fig. 9.1 An image classification pipeline. An algorithm such as SURF is used to generate features. A clustering algorithm such as k-means then generates a set of centroid features that can serve as a set of visual words for the image. The generated features are assigned to each centroid by the histogram builder.
Fig. 9.2 Feature generation using the SURF algorithm. The SURF algorithm accepts an image as an input and generates an array of features. Each feature includes position information and a set of 64 values known as a descriptor.
Fig. 9.3 The data transformation kernel used to enable memory coalescing is the same as a matrix transpose kernel.
Fig. 9.4 A transpose illustrated on a one-dimensional array.
Fig. 10.1 The session explorer for CodeXL in profile mode. Two application timeline sessions and one GPU performance counter session are shown.
Fig. 10.2 The Timeline View of CodeXL in profile mode for the Nbody application. We see the time spent in data transfer and kernel execution.
Fig. 10.3 The API Trace View of CodeXL in profile mode for the Nbody application.
Fig. 10.4 CodeXL Profiler showing the different GPU kernel performance counters for the Nbody kernel.
Fig. 10.5 AMD CodeXL explorer in analysis mode. The NBody OpenCL kernel has been compiled and analyzed for a number of different graphics architectures.
Fig. 10.6 The ISA view of KernelAnalyzer. The NBody OpenCL kernel has been compiled for multiple graphics architectures. For each architecture, the AMD IL and the GPU ISA can be evaluated.
Fig. 10.7 The Statistics view for the Nbody kernel shown by KernelAnalyzer. We see that the number of concurrent wavefronts that can be scheduled is limited by the number of vector registers.
Fig. 10.8 The Analysis view of the Nbody kernel is shown. The execution duration calculated by emulation is shown for different graphics architectures.
Fig. 10.9 A high-level overview of how CodeXL interacts with an OpenCL application.
Fig. 10.10 CodeXL API trace showing the history of the OpenCL functions called.
Fig. 10.11 A kernel breakpoint set on the Nbody kernel.
Fig. 10.12 The Multi-Watch window showing the values of a global memory buffer in the Nbody example. The values can also be visualized as an image.
Fig. 11.1 C++ AMP code example: vector addition.
Fig. 11.2 Vector addition, conceptual view.
Fig. 11.3 Functor version for C++ AMP vector addition (conceptual code).
Fig. 11.4 Further expanded version for C++ AMP vector addition (conceptual code).
Fig. 11.5 Host code implementation of parallel_for_each (conceptual code).
Fig. 11.6 C++ AMP lambda: vector addition.
Fig. 11.7 Compiled OpenCL SPIR code: vector addition kernel.
Fig. 12.1 WebCL objects.
Fig. 12.2 Using multiple command-queues for overlapped data transfer.
Fig. 12.3 Typical runtime involving WebCL and WebGL.
Fig. 12.4 Two triangles in WebGL to draw a WebCL-generated image.
List of Tables
Table 4.1 The OpenCL Features Covered by Each Example
Table 6.1 Summary of Options for SVM
Table 9.1 The Time Taken for the Transpose Kernel
Table 9.2 Kernel Running Time (ms) for Different GPU Implementations
Table 10.1 The Command States that can be Used to Obtain Timestamps from OpenCL Events
Table 11.1 Mapping Key C++ AMP Constructs to OpenCL
Table 11.2 Conceptual Mapping of Data Members on the Host Side and on the Device Side
Table 11.3 Data Sharing Behavior and Implications of OpenCL 2.0 SVM Support
Table 12.1 Relationships Between C Types Used in Kernels and setArg()'s webcl.type
Foreword
In the last few years computing has entered the heterogeneous computing era, which aims to bring together in a single device the best of both central processing units (CPUs) and graphics processing units (GPUs). Designers are creating an increasingly wide range of heterogeneous machines, and hardware vendors are making them broadly available. This change in hardware offers great platforms for exciting new applications. But, because the designs are different, classical programming models do not work very well, and it is important to learn about new models such as those in OpenCL.
When the design of OpenCL started, the designers noticed that for a class of algorithms that were latency focused (e.g. spreadsheets), developers wrote code in C or C++ and ran it on a CPU, but for a second class of algorithms that were throughput focused (e.g. matrix multiply), developers often wrote in CUDA and used a GPU: two related approaches, but each worked on only one kind of processor. C++ did not run on a GPU, and CUDA did not run on a CPU. Developers had to specialize in one and ignore the other. But the real power of a heterogeneous device is that it can efficiently run applications that mix both classes of algorithms. The question was: how do you program such machines?
One solution is to add new features to the existing platforms; both C++ and CUDA are actively evolving to meet the challenge of new hardware. Another solution was to create a new set of programming abstractions specifically targeted at heterogeneous computing. Apple came up with an initial proposal for such a new paradigm. This proposal was refined by technical teams from many companies, and became OpenCL. When the design started, I was privileged to be part of one of those teams. We had a lot of goals for the kernel language: (1) let developers write kernels in a single source language; (2) allow those kernels to be functionally portable over CPUs, GPUs, field-programmable gate arrays, and other sorts of devices; (3) be low level so that developers could tease out all the performance of each device; (4) keep the model abstract enough that the same code would work correctly on machines being built by lots of companies. And, of course, as with any computer project, we wanted to do this fast. To speed up implementations, we chose to base the language on C99. In less than 6 months we produced the specification for OpenCL 1.0, and within 1 year the first implementations appeared. And then, time passed and OpenCL met real developers...
So what happened? First, C developers pointed out all the great C++ features (a real memory model, atomics, etc.) that made them more productive, and CUDA developers pointed out all the new features that NVIDIA added to CUDA (e.g. nested parallelism) that make programs both simpler and faster. Second, as hardware architects explored heterogeneous computing, they figured out how to remove the early restrictions requiring CPUs and GPUs to have separate memories. One great hardware change was the development of integrated devices, which provide both a