TECHNIQUES IN SCALABLE AND EFFECTIVE
PARALLEL PERFORMANCE ANALYSIS

BY

CHEE WAI LEE

DISSERTATION

Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2009

Urbana, Illinois

Doctoral Committee:

Professor Laxmikant Kalé, Chair
Professor Marc Snir
Professor Michael Heath
Doctor Luiz DeRose, Cray Inc.
Abstract
Performance analysis tools are essential to maintaining the efficient parallel execution of scientific applications. As scientific applications are executed on ever larger parallel supercomputers, performance tools must employ more advanced techniques to keep up with the increasing volume and complexity of the performance information these applications generate as a result of scaling.
In this thesis, we investigate useful techniques in four main thrusts to address various aspects of this problem. First, we study how some traditional performance analysis idioms can break down in the face of data from large processor counts, and we demonstrate techniques and tools that restore scalability. Second, we investigate how the volume of performance data generated can be reduced while keeping the captured information relevant for analysis and performance problem detection. Third, we investigate the powerful new performance analysis idioms enabled by live access to performance information streams from a running parallel application. Fourth, we demonstrate how repeated performance hypothesis testing can be conducted, via simulation techniques, scalably and with significantly reduced resource consumption. In addition, we explore the benefits of performance tool integration for the propagation and synergy of scalable performance analysis techniques across different tools.
To my family and towards an age of reason.
Acknowledgements
First and foremost, I would like to express my deepest gratitude to Professor Kalé for his great patience and guidance throughout the long and oftentimes difficult process of completing the thesis. His talent at managing so many students with so many diverse interests in our field of High Performance Computing is a wonder to behold and a fantastic source of inspiration.

I would also like to thank the rest of my thesis committee: Professor Marc Snir, Professor Michael Heath and Dr. Luiz DeRose, for their kind guidance and constructive feedback.
My very special thanks to Celso Mendes, Isaac Dooley, Eric Bohm, Abhinav Bhatele, Gengbin Zheng and JoAnne Geigner for the immense amount of help and support in completing the thesis. Also, a big thank-you to the many, many hang-out buddies at the Parallel Programming Laboratory over the years for making the lab simply a riot of fun: Eric Bohm, Ramprasad, Esteban Meneses, Filippo, Abhinav, Pritish, Jessie, Rahul Jain, Lukasz, Kumaresh, Phil, Dave, Isaac, Aaron, Esteban Pauli, Chao Mei, Eric Shook, Viraj, Ekaterina, Terry, Sayantan, Rahul Joshi, Chao Huang, Yan, Greg, Orion, Sameer, Hari, Amit, Nilesh, Tarun, Yogesh, Vikas, Sharon, Apurva, Ramkumar, Rashmi, Sindhura, Guna, Mani, Theckla, Jonathan, Josh, Arun, Neelam and Puneet. Special shout-outs go to (you know who you are): my chess buddies, my drinking buddies, my gym buddies and my punching bags (sorry!).
Many thanks for the close friendships with so many people I have met at the University of Illinois at Urbana-Champaign. Your support and company have always been appreciated, even when they contributed to the dreaded “grad student procrastination”. To the “net party” gang, we sure had many crazy times together: Fredrik Vraalsen, Jim Oly, Chris and Tanya Lattner, Mike Hunter, Rushabh Doshi, Bradley Jones, Bill Wendling, Howard and Melissa Sun, Nicholas Riley, Andy Reitz, Vinh Lam, Adam Slagell and Joel Stanley. To the “CS 105” gang, thanks for all the fun times working together to teach non-CS majors the joys of Computer Science: Dominique Kilman, Yang Yaling, Marsha and Roger Woodbury, Francis David, Tony Hursh, Reza Lesmana and Shamsi Iqbal. To the other “crazy” grad students of the Computer Science Department, thanks for all the fish: Lee Baugh and Dmitry Yershov.
To the various faculty and staff of the Computer Science Department who have helped me with so many administrative burdens, special thanks go out to Professor Geneva Belford, Barb Cicone, Mary Beth Kelley, Sheila Clark, Molly Flesner, Shirley Finke and Elaine Wilson.
To Hamid at the Jerusalem restaurant along Wright Street, thank you for the fine food and company. I have enjoyed every moment working on this thesis at your establishment and partaking in the refreshing Turkish coffee you brew.
Last but most certainly not least, to my parents and my grandfather, thank you all for bearing my long absence with such grace, love and unwavering support. To my now-deceased grandmother, thank you for holding out as long as you did in the hope of seeing me graduate. To my sister, thanks for keeping me company over the last few years on Facebook. To my fiancée, thank you for your continued patience and love, and for sticking it out with me long-distance over the thirteen or so years we have been together.
Grants and other Acknowledgements:

NAMD was developed at the Theoretical Biophysics Group (Beckman Institute, University of Illinois) and funded by the National Institutes of Health (NIH PHS 5 P41-RR05969-04). Projections and Charm++ are supported by several projects funded by the Department of Energy (subcontract B523819) and the National Science Foundation (NSF DMR 0121695).

This work is supported by grant(s) from the National Institutes of Health P41-RR05969. The author gladly acknowledges supercomputer time provided by the Pittsburgh Supercomputing Center and the National Center for Supercomputing Applications via Large Resources Allocation Committee grant MCA93S028.
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1  Introduction And Motivation . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Parallel Performance Tuning Tools . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Challenges To Performance Tool Effectiveness Due to Application Scaling . . . 4
1.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Thesis Objectives And Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Chapter 2  Software Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Charm++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Adaptive MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 Adaptive Overlap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.3 Automatic Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 The Projections Performance Analysis And Visualization Framework . . . . . 17
2.2.1 Charm++ Performance Events . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Projections Event Log Formats . . . . . . . . . . . . . . . . . . . . . . 19
2.2.3 The Projections Performance Visualization Tool . . . . . . . . . . . . . 20
2.2.4 The Performance Analysis Process Using Projections . . . . . . . . . . 21
Chapter 3  Scalable Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Scalably Finding Object Decomposition With Poor Grainsize Distributions . . 23
3.2 Scalable Idioms Based On Time Profiles . . . . . . . . . . . . . . . . . . . . 25
3.2.1 Time Profile Heuristic: Probable Load Imbalance . . . . . . . . . . . . 27
3.2.2 Time Profile Heuristic: Comparative Substructure Analysis . . . . . . . 29
3.2.3 High Resolution Time Profiles . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Finding Interesting Processors . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.1 Case Study: Detecting Bad Multicast Implementation . . . . . . . . . . 34
3.3.2 Case Study: Finding Causes For Load Imbalance . . . . . . . . . . . . 35
3.3.3 Scalable Tool Support For Finding Interesting Processors . . . . . . . . 37
3.3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 4  Data Volume Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1 Basic Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Quantifying The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 Applying k-Means Clustering To Performance Data . . . . . . . . . . . . . . 52
4.4.1 The k-Means Clustering Algorithm . . . . . . . . . . . . . . . . . . . . 52
4.4.2 Important Choices For k-Means Clustering . . . . . . . . . . . . . . . 55
4.4.3 Choosing Representative Processors . . . . . . . . . . . . . . . . . . . 57
4.5 Post-mortem Versus Online Data Reduction . . . . . . . . . . . . . . . . . . 59
4.6 Case Study: NAMD Grainsize . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.6.1 NAMD - NAnoscale Molecular Dynamics . . . . . . . . . . . . . . . . 65
4.6.2 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . 67
4.6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.7 Future Work: Extensions To The Basic Approach . . . . . . . . . . . . . . . 74
4.7.1 Choosing Data Subsets By Phase . . . . . . . . . . . . . . . . . . . . . 74
4.7.2 Considerations For Critical Paths . . . . . . . . . . . . . . . . . . . . . 76
Chapter 5  Online Live Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 The Converse Client-Server Interface . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Continuous Streaming Of Online Performance Data . . . . . . . . . . . . . . 80
5.4 Live Parallel Data Collection Of Performance Profiles . . . . . . . . . . . . . 82
5.5 Case Study: Long-Running NAMD STMV Simulation . . . . . . . . . . . . . 88
5.6 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Chapter 6  What-If Analysis Through Simulation . . . . . . . . . . . . . . . . . . 96
6.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2 The BigSim Simulation Framework . . . . . . . . . . . . . . . . . . . . . . . 99
6.2.1 BigSim Emulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.2 BigSim Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3 Performance Analysis Using Fewer Processors . . . . . . . . . . . . . . . . . 103
6.3.1 Reducing Number Of Processors Used At Emulation Time . . . . . . . 104
6.3.2 Number Of Processors Used At Simulation Time . . . . . . . . . . . . 105
6.4 General Methodology For Hypothesis Testing . . . . . . . . . . . . . . . . . 106
6.5 Analyzing Changes To Network Latency . . . . . . . . . . . . . . . . . . . . 106
6.5.1 Case Study: Seven-point Stencil Computation . . . . . . . . . . . . . . 107
6.5.2 Validation With Seven-point Stencil Computation . . . . . . . . . . . . 109
6.6 Analyzing Hypothetical Load Balancing Changes . . . . . . . . . . . . . . . 120
6.6.1 Case Study: Simple Load Imbalance Benchmark . . . . . . . . . . . . 122
6.6.2 Validation With Simple Load Imbalance Benchmark . . . . . . . . . . 124
6.7 Future Work: Variations In Object-to-Processor Mapping . . . . . . . . . . . 128
Chapter 7  Extending Scalability Techniques To Third Party Tools . . . . . . . . . 130
7.1 Performance Call-back (Event) Interface . . . . . . . . . . . . . . . . . . . . 130
7.2 Integrating An External Performance Module . . . . . . . . . . . . . . . . . 133
7.2.1 TAU Profiling Integration . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.2.2 Instrumentation Overhead Assessment . . . . . . . . . . . . . . . . . . 134
7.2.3 Experimental Results On Current Machines . . . . . . . . . . . . . . . 137
7.3 Scalability Benefits Of Integration . . . . . . . . . . . . . . . . . . . . . . . 137
Chapter 8  Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
List of Tables
4.1 Total volume of trace data summed across all files. ApoA1 is a NAMD simulation with 92k atoms, F1-ATPase simulates 327k atoms, while stmv simulates 1M atoms. This data was generated on the XT3 at PSC. . . . . . . . . . . . . . 51
4.2 Total volume of trace data summed across all processors in stmv Projections logs with different Particle Mesh Ewald (PME) long-range electrostatics configurations. The data was generated on an XT5 at NICS. . . . . . . . . . . . . 52
4.3 Number of non-empty clusters found by the clustering algorithm when varying the number of initial seeds uniformly distributed in the sample space. The * indicates that processor 0 was alone in its own cluster. . . . . . . . . . . . . . 71
4.4 Reduction in total volume of trace data for stmv. The number of processors selected in the subsets are 51, 102, 204 and 409 for 512, 1024, 2048 and 4096 original processors respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5 Reduced dataset quality by proportionality based on total height of histogram bars. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.6 Measure of data reduction quality through the number of least idle processors retained in reduced datasets for 2,048 processors. . . . . . . . . . . . . . . . . 73
4.7 Measure of data reduction quality through the number of least idle processors retained in reduced datasets for 4,096 processors. . . . . . . . . . . . . . . . . 74
5.1 Overhead of collecting utilization profile instrumentation. . . . . . . . . . . . 92
6.1 Experimental setup for baseline latency tolerance experiments executed on 48 processors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 Comparison of the SimpleImbalance benchmark average iteration times with a dummy load balancing strategy on 100 processors. . . . . . . . . . . . . . . . 127
6.3 Comparison of the SimpleImbalance benchmark average iteration times with a greedy load balancing strategy on 100 processors. . . . . . . . . . . . . . . . . 127
7.1 Overhead of performance modules (microseconds per event). . . . . . . . . . 136
List of Figures
2.1 User view of the CHARM++ adaptive run time system with migratable objects on the left versus a possible system mapping of the same migratable objects on the right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Partial interface code showing how Chares encapsulate entry methods. . . . . . 12
2.3 A basic illustration of the runtime scheduler implementation on each processor. 12
2.4 7 AMPI “processes” implemented as user-level threads bound to CHARM++ chare objects which are then bound to 2 actual processors for execution. . . . . 14
2.5 Adaptive Overlapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6 Overview: A summary of overall parallel structure. . . . . . . . . . . . . . . . 20
2.7 Usage Profile: Processor utilization of various activities. . . . . . . . . . . . . 20
2.8 Timeline visualization of processor activity. . . . . . . . . . . . . . . . . . . . 21
3.1 Histogram of activity grainsizes in a parallel discrete event simulation. . . . . . 25
3.2 Low resolution time profile display of the structure of parallel event overlap in an 8,192 processor run of OpenAtom at 100 ms intervals. . . . . . . . . . . . . 26
3.3 Higher resolution time profile display of the same 8,192 processor run of OpenAtom at 10 ms intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Time profile of a ChaNGa time step presenting its event-overlap substructure and exhibiting visual cues that point to possible load imbalance. . . . . . . . . 28
3.5 Time profile of a ChaNGa time step showing poorer performance after the application of a greedy load balancing strategy. The substructure hints at poorer communication performance as a result of the attempt to balance the load. . . . 29
3.6 Time profile of a ChaNGa time step showing the result of applying a load balancing strategy that also took communication into account. . . . . . . . . . . . 30
3.7 Very high resolution time profile display of 8 NAMD time steps with high-detail substructure at 100 microsecond intervals. . . . . . . . . . . . . . . . . . 31
3.8 Correlation of OpenAtom time-profile and communication-over-time graphs. . 33
3.9 Zoomed-in view of communication-over-time graph showing the number of messages received for OpenAtom at 10 ms intervals. . . . . . . . . . . . . . . 34
3.10 A small sample of processor timelines out of 1,024 showing (red) events with inefficient multicasts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.11 A small sample of processor timelines out of 1,024 showing much shorter (red) events with improved multicasts and a corresponding improvement in overall performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36