Heterogeneous Computing
with OpenCL 2.0
Third Edition
David Kaeli
Perhaad Mistry
Dana Schaa
Dong Ping Zhang
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Morgan Kaufmann is an imprint of Elsevier
Acquiring Editor: Todd Green
Editorial Project Manager: Charlie Kent
Project Manager: Priya Kumaraguruparan
Cover Designer: Matthew Limbert
Morgan Kaufmann is an imprint of Elsevier
225 Wyman Street, Waltham, MA 02451, USA
Copyright © 2015, 2013, 2012 Advanced Micro Devices, Inc. Published by Elsevier Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
ISBN: 978-0-12-801414-1
British Library Cataloguing in Publication Data
A catalogue record for this book is available from the British Library
Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress
For information on all MK publications visit our website at www.mkp.com
List of Figures
Fig. 1.1 (a) Simple sorting: a divide-and-conquer implementation, breaking the list into shorter lists, sorting them, and then merging the shorter sorted lists. (b) Vector-scalar multiply: scattering the multiplies and then gathering the results to be summed up in a series of steps.
Fig. 1.2 Multiplying elements in arrays A and B, and storing the result in an array C.
Fig. 1.3 Task parallelism present in a fast Fourier transform (FFT) application. Different input images are processed independently in the three independent tasks.
Fig. 1.4 Task-level parallelism, where multiple words can be compared concurrently. Also shown is finer-grained character-by-character parallelism present when characters within the words are compared with the search string.
Fig. 1.5 After all string comparisons in Figure 1.4 have been completed, we can sum up the number of matches in a combining network.
Fig. 1.6 The relationship between parallel and concurrent programs. Parallel and concurrent programs are subsets of all programs.
Fig. 2.1 Out-of-order execution of an instruction stream of simple assembly-like instructions. Note that in this syntax, the destination register is listed first. For example, add a, b, c is a = b + c.
Fig. 2.2 VLIW execution based on the out-of-order diagram in Figure 2.1.
Fig. 2.3 SIMD execution where a single instruction is scheduled in order, but executes over multiple ALUs at the same time.
Fig. 2.4 The out-of-order schedule seen in Figure 2.1 combined with a second thread and executed simultaneously.
Fig. 2.5 Two threads scheduled in a time-slice fashion.
Fig. 2.6 Taking temporal multithreading to an extreme as is done in throughput computing: a large number of threads interleave execution to keep the device busy, whereas each individual thread takes longer to execute than the theoretical minimum.
Fig. 2.7 The AMD Puma (left) and Steamroller (right) high-level designs (not shown to any shared scale). Puma is a low-power design that follows a traditional approach to mapping functional units to cores. Steamroller combines two cores within a module, sharing its floating-point (FP) units.
Fig. 2.8 The AMD Radeon HD 6970 GPU architecture. The device is divided into two halves, where instruction control (scheduling and dispatch) is performed by the wave scheduler for each half. The 24 16-lane SIMD cores execute four-way VLIW instructions on each SIMD lane and contain private level 1 (L1) caches and local data shares (scratchpad memory).
Fig. 2.9 The Niagara 2 CPU from Sun/Oracle. The design intends to make a high level of threading efficient. Note its relative similarity to the GPU design seen in Figure 2.8. Given enough threads, we can cover all memory access time with useful compute, without extracting instruction-level parallelism (ILP) through complicated hardware techniques.
Fig. 2.10 The AMD Radeon R9 290X architecture. The device has 44 cores in 11 clusters. Each core consists of a scalar execution unit that handles branches and basic integer operations, and four 16-lane SIMD ALUs. The clusters share instruction and scalar caches.
Fig. 2.11 The NVIDIA GeForce GTX 780 architecture. The device has 12 large cores that NVIDIA refers to as "streaming multiprocessors" (SMX). Each SMX has 12 SIMD units (with specialized double-precision and special function units), a single L1 cache, and a read-only data cache.
Fig. 2.12 The A10-7850K APU consists of two Steamroller-based CPU cores and eight Radeon R9 GPU cores (32 16-lane SIMD units in total). The APU includes a fast bus from the GPU to DDR3 memory, and a shared path that is optionally coherent with CPU caches.
Fig. 2.13 An Intel i7 processor with HD Graphics 4000 graphics. Although not termed "APU" by Intel, the concept is the same as for the devices in that category from AMD. Intel combines four Haswell x86 cores with its graphics processors, connected to a shared last-level cache (LLC) via a ring bus.
Fig. 3.1 An OpenCL platform with multiple compute devices. Each compute device contains one or more compute units. A compute unit is composed of one or more processing elements (PEs). A system could have multiple platforms present at the same time. For example, a system could have an AMD platform and an Intel platform present at the same time.
Fig. 3.2 Some of the output from the CLInfo program showing the characteristics of an OpenCL platform and devices. We see that the AMD platform has two devices (a CPU and a GPU). The output shown here can be queried using functions from the platform API.
Fig. 3.3 Vector addition algorithm showing how each element can be added independently.
Fig. 3.4 The hierarchical model used for creating an NDRange of work-items, grouped into work-groups.
Fig. 3.5 The OpenCL runtime shown denotes an OpenCL context with two compute devices (a CPU device and a GPU device). Each compute device has its own command-queues. Host-side and device-side command-queues are shown. The device-side queues are visible only from kernels executing on the compute device. The memory objects have been defined within the memory model.
Fig. 3.6 Memory regions and their scope in the OpenCL memory model.
Fig. 3.7 Mapping the OpenCL memory model to an AMD Radeon HD 7970 GPU.
Fig. 4.1 A histogram generated from an 8-bit image. Each bin corresponds to the frequency of the corresponding pixel value.
Fig. 4.2 An image rotated by 45°. Pixels that correspond to an out-of-bounds location in the input image are returned as black.
Fig. 4.3 Applying a convolution filter to a source image.
Fig. 4.4 The effect of different convolution filters applied to the same source image: (a) the original image; (b) blurring filter; and (c) embossing filter.
Fig. 4.5 The producer kernel will generate filtered pixels and send them via a pipe to the consumer kernel, which will then generate the histogram: (a) original image; (b) filtered image; and (c) histogram of filtered image.
Fig. 5.1 Multiple command-queues created for different devices declared within the same context. Two devices are shown, where one command-queue has been created for each device.
Fig. 5.2 Multiple devices working in a pipelined manner on the same data. The CPU queue will wait until the GPU kernel has finished.
Fig. 5.3 Multiple devices working in a parallel manner. In this scenario, both GPUs do not use the same buffers and will execute independently. The CPU queue will wait until both GPU devices have finished.
Fig. 5.4 Executing the simple kernel shown in Listing 5.5. The different work-items in the NDRange are shown.
Fig. 5.5 Within a single kernel dispatch, synchronization regarding execution order is supported only within work-groups using barriers. Global synchronization is maintained by completion of the kernel, and the guarantee that on a completion event all work is complete and memory content is as expected.
Fig. 5.6 Example showing OpenCL memory objects mapping to arguments for clEnqueueNativeKernel() in Listing 5.8.
Fig. 5.7 A single-level fork-join execution paradigm compared with nested parallelism thread execution.
Fig. 6.1 An example showing a scenario where a buffer is created and initialized on the host, used for computation on the device, and transferred back to the host. Note that the runtime could have also created and initialized the buffer directly on the device. (a) Creation and initialization of a buffer in host memory. (b) Implicit data transfer from the host to the device prior to kernel execution. (c) Explicit copying of data back from the device to the host pointer.
Fig. 6.2 Data movement using explicit read-write commands. (a) Creation of an uninitialized buffer in device memory. (b) Explicit data transfer from the host to the device prior to execution. (c) Explicit data transfer from the device to the host following execution.
Fig. 6.3 Data movement using map/unmap. (a) Creation of an uninitialized buffer in device memory. (b) The buffer is mapped into the host's address space. (c) The buffer is unmapped from the host's address space.
Fig. 7.1 The memory spaces available to an OpenCL device.
Fig. 7.2 Data race when incrementing a shared variable. The value stored depends on the ordering of operations between the threads.
Fig. 7.3 Applying Z-order mapping to a two-dimensional memory space.
Fig. 7.4 The pattern of data flow for the example shown in the localAccess kernel.
Fig. 8.1 High-level design of AMD's Piledriver-based FX-8350 CPU.
Fig. 8.2 OpenCL mapped onto an FX-8350 CPU. The FX-8350 CPU is both the OpenCL host and the device in this scenario.
Fig. 8.3 Implementation of work-group execution on an x86 architecture.
Fig. 8.4 Mapping the memory spaces for a work-group (work-group 0) onto a Piledriver CPU cache.
Fig. 8.5 High-level Radeon R9 290X diagram labeled with OpenCL execution and memory model terms.
Fig. 8.6 Memory bandwidths in the discrete system.
Fig. 8.7 Radeon R9 290X compute unit microarchitecture.
Fig. 8.8 Mapping OpenCL's memory model onto a Radeon R9 290X GPU.
Fig. 8.9 Using vector reads provides a better opportunity to return data efficiently through the memory system. When work-items access consecutive elements, GPU hardware can achieve the same result through coalescing.
Fig. 8.10 Accesses to nonconsecutive elements return smaller pieces of data less efficiently.
Fig. 8.11 Mapping the Radeon R9 290X address space onto memory channels and DRAM banks.
Fig. 8.12 Radeon R9 290X memory subsystem.
Fig. 8.13 The accumulation pass of the prefix sum shown in Listing 8.2 over a 16-element array in local memory using 8 work-items.
Fig. 8.14 Step 1 in Figure 8.13 showing the behavior of an LDS with eight banks.
Fig. 8.15 Step 1 in Figure 8.14 with padding added to the original data set to remove bank conflicts in the LDS.
Fig. 9.1 An image classification pipeline. An algorithm such as SURF is used to generate features. A clustering algorithm such as k-means then generates a set of centroid features that can serve as a set of visual words for the image. The generated features are assigned to each centroid by the histogram builder.
Fig. 9.2 Feature generation using the SURF algorithm. The SURF algorithm accepts an image as an input and generates an array of features. Each feature includes position information and a set of 64 values known as a descriptor.
Fig. 9.3 The data transformation kernel used to enable memory coalescing is the same as a matrix transpose kernel.
Fig. 9.4 A transpose illustrated on a one-dimensional array.
Fig. 10.1 The session explorer for CodeXL in profile mode. Two application timeline sessions and one GPU performance counter session are shown.
Fig. 10.2 The Timeline View of CodeXL in profile mode for the Nbody application. We see the time spent in data transfer and kernel execution.
Fig. 10.3 The API Trace View of CodeXL in profile mode for the Nbody application.
Fig. 10.4 CodeXL Profiler showing the different GPU kernel performance counters for the Nbody kernel.
Fig. 10.5 AMD CodeXL explorer in analysis mode. The NBody OpenCL kernel has been compiled and analyzed for a number of different graphics architectures.
Fig. 10.6 The ISA view of KernelAnalyzer. The NBody OpenCL kernel has been compiled for multiple graphics architectures. For each architecture, the AMD IL and the GPU ISA can be evaluated.
Fig. 10.7 The Statistics view for the Nbody kernel shown by KernelAnalyzer. We see that the number of concurrent wavefronts that can be scheduled is limited by the number of vector registers.
Fig. 10.8 The Analysis view of the Nbody kernel is shown. The execution duration calculated by emulation is shown for different graphics architectures.
Fig. 10.9 A high-level overview of how CodeXL interacts with an OpenCL application.
Fig. 10.10 CodeXL API trace showing the history of the OpenCL functions called.
Fig. 10.11 A kernel breakpoint set on the Nbody kernel.
Fig. 10.12 The Multi-Watch window showing the values of a global memory buffer in the Nbody example. The values can also be visualized as an image.
Fig. 11.1 C++ AMP code example: vector addition.
Fig. 11.2 Vector addition, conceptual view.
Fig. 11.3 Functor version for C++ AMP vector addition (conceptual code).
Fig. 11.4 Further expanded version for C++ AMP vector addition (conceptual code).
Fig. 11.5 Host code implementation of parallel_for_each (conceptual code).
Fig. 11.6 C++ AMP lambda: vector addition.
Fig. 11.7 Compiled OpenCL SPIR code: vector addition kernel.
Fig. 12.1 WebCL objects.
Fig. 12.2 Using multiple command-queues for overlapped data transfer.
Fig. 12.3 Typical runtime involving WebCL and WebGL.
Fig. 12.4 Two triangles in WebGL to draw a WebCL-generated image.
List of Tables
Table 4.1 The OpenCL Features Covered by Each Example
Table 6.1 Summary of Options for SVM
Table 9.1 The Time Taken for the Transpose Kernel
Table 9.2 Kernel Running Time (ms) for Different GPU Implementations
Table 10.1 The Command States that can be Used to Obtain Timestamps from OpenCL Events
Table 11.1 Mapping Key C++ AMP Constructs to OpenCL
Table 11.2 Conceptual Mapping of Data Members on the Host Side and on the Device Side
Table 11.3 Data Sharing Behavior and Implications of OpenCL 2.0 SVM Support
Table 12.1 Relationships Between C Types Used in Kernels and setArg()'s webcl.type
Foreword
In the last few years computing has entered the heterogeneous computing era, which aims to bring together in a single device the best of both central processing units (CPUs) and graphics processing units (GPUs). Designers are creating an increasingly wide range of heterogeneous machines, and hardware vendors are making them broadly available. This change in hardware offers great platforms for exciting new applications. But, because the designs are different, classical programming models do not work very well, and it is important to learn about new models such as those in OpenCL.
When the design of OpenCL started, the designers noticed that for a class of algorithms that were latency focused (e.g. spreadsheets), developers wrote code in C or C++ and ran it on a CPU, but for a second class of algorithms that were throughput focused (e.g. matrix multiply), developers often wrote in CUDA and used a GPU: two related approaches, but each worked on only one kind of processor. C++ did not run on a GPU, and CUDA did not run on a CPU. Developers had to specialize in one and ignore the other. But the real power of a heterogeneous device is that it can efficiently run applications that mix both classes of algorithms. The question was: how do you program such machines?
One solution is to add new features to the existing platforms; both C++ and CUDA are actively evolving to meet the challenge of new hardware. Another solution was to create a new set of programming abstractions specifically targeted at heterogeneous computing. Apple came up with an initial proposal for such a new paradigm. This proposal was refined by technical teams from many companies, and became OpenCL. When the design started, I was privileged to be part of one of those teams. We had a lot of goals for the kernel language: (1) let developers write kernels in a single source language; (2) allow those kernels to be functionally portable over CPUs, GPUs, field-programmable gate arrays, and other sorts of devices; (3) be low level so that developers could tease out all the performance of each device; (4) keep the model abstract enough that the same code would work correctly on machines being built by lots of companies. And, of course, as with any computer project, we wanted to do this fast. To speed up implementations, we chose to base the language on C99. In less than 6 months we produced the specification for OpenCL 1.0, and within 1 year the first implementations appeared. And then, time passed and OpenCL met real developers...
So what happened? First, C developers pointed out all the great C++ features (a real memory model, atomics, etc.) that made them more productive, and CUDA developers pointed out all the new features that NVIDIA added to CUDA (e.g. nested parallelism) that make programs both simpler and faster. Second, as hardware architects explored heterogeneous computing, they figured out how to remove the early restrictions requiring CPUs and GPUs to have separate memories. One great hardware change was the development of integrated devices, which provide both a