TOWARDS EFFICIENT HARDWARE ACCELERATION OF
DEEP NEURAL NETWORKS ON FPGA
by
Sicheng Li
M.S. in Electrical Engineering,
New York University, 2013
B.S. in Electrical Engineering,
Beijing University of Posts and Telecommunications, 2011
Submitted to the Graduate Faculty of
the Swanson School of Engineering in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
University of Pittsburgh
2017
UNIVERSITY OF PITTSBURGH
SWANSON SCHOOL OF ENGINEERING
This dissertation was presented
by
Sicheng Li
It was defended on
September 29, 2017
and approved by
Hai (Helen) Li, Ph.D., Adjunct Associate Professor, Department of Electrical and Computer
Engineering
Yiran Chen, Ph.D., Adjunct Associate Professor, Department of Electrical and Computer
Engineering
Zhi-Hong Mao, Ph.D., Associate Professor, Department of Electrical and Computer Engineering
Ervin Sejdic, Ph.D., Associate Professor, Department of Electrical and Computer Engineering
Yu Wang, Ph.D., Associate Professor, Department of Electrical Engineering, Tsinghua University
Dissertation Director: Hai (Helen) Li, Ph.D., Adjunct Associate Professor, Department of
Electrical and Computer Engineering
Copyright © by Sicheng Li
2017
TOWARDS EFFICIENT HARDWARE ACCELERATION OF
DEEP NEURAL NETWORKS ON FPGA
Sicheng Li, PhD
University of Pittsburgh, 2017
Deep neural networks (DNNs) have achieved remarkable success in many applications because of their powerful data processing capability. Their performance in computer vision has matched, and in some areas even surpassed, human capabilities. Deep neural networks can capture complex non-linear features; however, this ability comes at the cost of high computational and memory requirements. State-of-the-art networks require billions of arithmetic operations and millions of parameters. The brute-force computing model of DNNs often requires extremely large hardware resources, raising severe concerns about its scalability on traditional von Neumann architectures. The well-known memory wall, together with the latency introduced by the long-range connectivity and communication of DNNs, severely constrains their computation efficiency. Existing DNN acceleration techniques, whether in software or hardware, often suffer from the poor hardware execution efficiency of simplified models (software) or from inevitable accuracy degradation and a limited set of supportable algorithms (hardware). To preserve inference accuracy while making the hardware implementation more efficient, a close investigation of hardware/software co-design methodologies for DNNs is needed.
The proposed work first presents an FPGA-based implementation framework for Recurrent Neural Network (RNN) acceleration. At the architectural level, we improve the parallelism of the RNN training scheme and reduce the computing resource requirement to enhance computation efficiency. The hardware implementation primarily targets reducing the data communication load. Secondly, we propose a data locality-aware sparse matrix-vector multiplication (SpMV) kernel. At the software level, we reorganize a large sparse matrix into many modest-sized blocks by adopting hypergraph-based partitioning and clustering. Available hardware constraints are taken into consideration for memory allocation and data access regularization. Thirdly, we present a holistic acceleration of sparse convolutional neural networks (CNNs). During network training, the data locality is regularized to ease the hardware mapping. The distributed architecture enables high computation parallelism and data reuse. The proposed research results in a hardware/software co-design methodology for fast and accurate DNN acceleration, through innovations in algorithm optimization, hardware implementation, and the interactive design process across these two domains.
TABLE OF CONTENTS
1.0 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Deep Neural Networks on FPGAs . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Dissertation Contribution and Outline . . . . . . . . . . . . . . . . . . . . . . 2
2.0 FPGA ACCELERATION TO RNN-BASED LANGUAGE MODEL . . . . . . . . . 6
2.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.1 Language Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 RNN & RNN-based Language Model . . . . . . . . . . . . . . . . . . . 7
2.1.3 The RNN Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Analysis for Design Optimization . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Architecture Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Increase Parallelism between Hidden and Output Layers . . . . . . . . . 11
2.3.2 Computation Efficiency Enhancement . . . . . . . . . . . . . . . . . . 13
2.4 Hardware Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.2 Data Allocation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.3 Thread Management in Computation Engine . . . . . . . . . . . . . . . 17
2.4.4 Processing Element Design . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4.5 Data Reuse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.2 Training Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.3 System Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5.4 Computation Engine Efficiency . . . . . . . . . . . . . . . . . . . . . . 25
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.0 THE RECONFIGURABLE SPMV KERNEL . . . . . . . . . . . . . . . . . . . . . 27
3.1 Preliminary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.1 Sparse Matrix Preprocessing . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.2 The Existing SpMV Architectures . . . . . . . . . . . . . . . . . . . . 28
3.2 Design Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Sparse Matrix Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.1 Workload Balance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.3.2 Hardware-aware Clustering . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.3 Strong Scaling vs. Weak Scaling . . . . . . . . . . . . . . . . . . . . . 35
3.3.4 Hardware Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4 Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.1 Global Control Unit (GCU) . . . . . . . . . . . . . . . . . . . . . . . . 37
3.4.2 Processing Element (PE) . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.4.3 Thread Management Unit (TMU) . . . . . . . . . . . . . . . . . . . . . 39
3.4.4 Hardware Configuration Optimization . . . . . . . . . . . . . . . . . . 41
3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.2 System Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.5.3 The Impact of Data Preprocessing . . . . . . . . . . . . . . . . . . . . 46
3.5.4 Comparison to Previous Designs . . . . . . . . . . . . . . . . . . . . . 49
3.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.0 SPARSE CONVOLUTIONAL NEURAL NETWORKS ON FPGA . . . . . . . . . 51
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2 CNN Acceleration and Difficulty . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2.1 Dense CNN Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.2.2 Inefficient Acceleration of Sparse CNN . . . . . . . . . . . . . . . . . 54
4.3 The Proposed Design Framework . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 CNN Model Sparsification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.1 Locality-aware Regularization . . . . . . . . . . . . . . . . . . . . . . 58
4.4.2 Sparse Network Representation . . . . . . . . . . . . . . . . . . . . . . 60
4.4.3 Kernel Compression and Distribution . . . . . . . . . . . . . . . . . . . 62
4.5 Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5.1 The System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5.2 The PE Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.5.3 Zero Skipping for Computation Efficiency . . . . . . . . . . . . . . . . 68
4.5.4 Data Reuse to Improve Effective Bandwidth . . . . . . . . . . . . . . . 69
4.6 Hardware-Specific Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 69
4.6.1 Design Trade-offs on Cross-Layer Sparsity . . . . . . . . . . . . . . . . 70
4.6.2 Hardware Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.7.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.7.2 Layer-by-Layer Performance . . . . . . . . . . . . . . . . . . . . . . . 73
4.7.3 End-to-End System Integration . . . . . . . . . . . . . . . . . . . . . . 75
4.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.0 RELATED WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6.0 CONCLUSION AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . 83
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
LIST OF TABLES
1 RNN Computation Runtime Breakdown in GPU . . . . . . . . . . . . . . . . . . 10
2 Memory Access Requirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3 Resource Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4 Accuracy Evaluation on MSRC . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5 Configuration of Different Platforms . . . . . . . . . . . . . . . . . . . . . . . . 22
6 Runtime (in Seconds) of RNNLMs with Different Network Size . . . . . . . . . . 23
7 Power Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
8 Computation Engine Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
9 Resource Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
10 Characteristics of Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
11 Configuration and System Performance of Different Platforms . . . . . . . . . . . 44
12 System Properties for Previous Implementations and This Work . . . . . . . . . . 48
13 The average weight sparsity and accuracy of three selected CNN models after regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
14 The average sparsity and replication rate of the input feature maps of Conv layers in AlexNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
15 Configuration of different platforms . . . . . . . . . . . . . . . . . . . . . . . . . 72
16 Resource utilization on FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
17 Performance evaluation on sparse Conv and FC layers of AlexNet on ImageNet . . 74
18 Comparison to previous FPGA works . . . . . . . . . . . . . . . . . . . . . . . . 77
LIST OF FIGURES
1 (a) Feedforward neural network; (b) Recurrent neural network. . . . . . . . . . . . 7
2 Unfold RNN for training through BPTT algorithm. . . . . . . . . . . . . . . . . . 9
3 The system speedup saturates as the number of CPU cores increases because memory accesses become the new bottleneck. . . . . . . . . . . . . . . . . . . . . . . 11
4 The data flow of a two-stage pipelined RNN structure [1]. The computation complexity of the white boxes in the output layer is much higher than that of the gray boxes in the hidden layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
5 Our proposed parallel architecture for RNNLM. . . . . . . . . . . . . . . . . . . . 13
6 The impact of the reduced data precision of W_ho in RNNLM on the reconstruction quality of word sequences. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
7 An overview of the RNNLM hardware implementation. . . . . . . . . . . . . . . . 15
8 The thread management unit in a computation engine. PEs are connected through a crossbar to the shared memory structure. . . . . . . . . . . . . . . . . . . . . . . 17
9 The multi-thread based processing element. . . . . . . . . . . . . . . . . . . . . . 18
10 An example of compressed sparse row (CSR) storage. . . . . . . . . . . . . . . . 28
11 The sparsity affects SpMV performance. . . . . . . . . . . . . . . . . . . . . . . . 29
12 Our proposed data locality-aware design framework for SpMV acceleration. . . . . 31
13 Applying the partitioning and clustering on a sparse matrix (lns_3937). . . . . . . . 33
14 The performance penalty brought by inter-PE communications. . . . . . . . . . . . 36
15 The architecture for SpMV acceleration. . . . . . . . . . . . . . . . . . . . . . . . 38
16 Configuration header file is used to schedule inter-PE communication. . . . . . . . 38
17 The internal structure of processing element design. . . . . . . . . . . . . . . . . . 40