TECHNIQUES IN SCALABLE AND EFFECTIVE
PARALLEL PERFORMANCE ANALYSIS

BY

CHEE WAI LEE

DISSERTATION

Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2009

Urbana, Illinois

Doctoral Committee:

Professor Laxmikant Kalé, Chair
Professor Marc Snir
Professor Michael Heath
Doctor Luiz DeRose, Cray Inc.
Abstract
Performance analysis tools are essential to maintaining the efficient parallel execution of scientific applications. As scientific applications are executed on ever larger parallel supercomputers, performance tools must employ more advanced techniques to keep up with the increasing volume and complexity of the performance information these applications generate as a result of scaling.
In this thesis, we investigate useful techniques in four main thrusts to address various aspects of this problem. First, we study how some traditional performance analysis idioms can break down in the face of data from large processor counts, and we demonstrate techniques and tools that restore scalability. Second, we investigate how the volume of performance data generated can be reduced while keeping the captured information relevant for analysis and performance problem detection. Third, we investigate the powerful new performance analysis idioms enabled by live access to performance information streams from a running parallel application. Fourth, we demonstrate how repeated performance hypothesis testing can be conducted, via simulation techniques, scalably and with significantly reduced resource consumption. In addition, we explore the benefits of performance tool integration for the propagation and synergy of scalable performance analysis techniques across different tools.
To my family and towards an age of reason.
Acknowledgements
First and foremost, I would like to express my deepest gratitude to Professor Kalé for his great patience and guidance throughout the long and oftentimes difficult process of completing the thesis. His talent at managing so many students with so many diverse interests in our field of High Performance Computing is a wonder to behold and a fantastic source of inspiration.

I would also like to thank the rest of my thesis committee: Professor Marc Snir, Professor Michael Heath and Dr. Luiz DeRose, for their kind guidance and constructive feedback.
My very special thanks to Celso Mendes, Isaac Dooley, Eric Bohm, Abhinav Bhatele, Gengbin Zheng and JoAnne Geigner for the immense amount of help and support in completing the thesis. Also, a big thank-you to the many, many hang-out buddies at the Parallel Programming Laboratory over the years for making the lab simply a riot of fun: Eric Bohm, Ramprasad, Esteban Meneses, Filippo, Abhinav, Pritish, Jessie, Rahul Jain, Lukasz, Kumaresh, Phil, Dave, Isaac, Aaron, Esteban Pauli, Chao Mei, Eric Shook, Viraj, Ekaterina, Terry, Sayantan, Rahul Joshi, Chao Huang, Yan, Greg, Orion, Sameer, Hari, Amit, Nilesh, Tarun, Yogesh, Vikas, Sharon, Apurva, Ramkumar, Rashmi, Sindhura, Guna, Mani, Theckla, Jonathan, Josh, Arun, Neelam and Puneet. Special shout-outs go to (you know who you are): my chess buddies, my drinking buddies, my gym buddies and my punching bags (sorry!).
Many thanks for the close friendships with so many people I have met at the University of Illinois at Urbana-Champaign. Your support and company have always been appreciated, even when they contributed to the dreaded “grad student procrastination”. To the “net party” gang, we sure had many crazy times together: Fredrik Vraalsen, Jim Oly, Chris and Tanya Lattner, Mike Hunter, Rushabh Doshi, Bradley Jones, Bill Wendling, Howard and Melissa Sun, Nicholas Riley, Andy Reitz, Vinh Lam, Adam Slagell and Joel Stanley. To the “CS 105” gang, thanks for all the fun times working together to teach non-CS majors the joys of Computer Science: Dominique Kilman, Yang Yaling, Marsha and Roger Woodbury, Francis David, Tony Hursh, Reza Lesmana and Shamsi Iqbal. To the other “crazy” grad students of the Computer Science Department, thanks for all the fish: Lee Baugh and Dmitry Yershov.
To the various faculty and staff of the Computer Science Department who have helped me with so many administrative burdens, special thanks go out to Professor Geneva Belford, Barb Cicone, Mary Beth Kelley, Sheila Clark, Molly Flesner, Shirley Finke and Elaine Wilson.
To Hamid at the Jerusalem restaurant along Wright Street, thank you for the fine food and company. I have enjoyed every moment working on this thesis at your establishment and partaking in the refreshing Turkish coffee you brew.
Last but most certainly not least, to my parents and my grandfather, thank you all for bearing my long absence with such grace, love and unwavering support. To my now-deceased grandmother, thank you for holding out as long as you did in the hope of seeing me graduate. To my sister, thanks for keeping me company over the last few years on Facebook. To my fiancée, thank you for your continued patience and love, and for sticking it out with me long-distance over the thirteen or so years we have been together.
Grants and other Acknowledgements:

NAMD was developed at the Theoretical Biophysics Group (Beckman Institute, University of Illinois) and funded by the National Institutes of Health (NIH PHS 5 P41-RR05969-04). Projections and Charm++ are supported by several projects funded by the Department of Energy (subcontract B523819) and the National Science Foundation (NSF DMR 0121695).

This work is supported by grant(s) from the National Institutes of Health P41-RR05969. The author gladly acknowledges supercomputer time provided by the Pittsburgh Supercomputing Center and the National Center for Supercomputing Applications via Large Resources Allocation Committee grant MCA93S028.
Table of Contents
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Chapter 1  Introduction And Motivation . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Parallel Performance Tuning Tools . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Challenges To Performance Tool Effectiveness Due to Application Scaling . . . 4
1.3 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Thesis Objectives And Scope . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Chapter 2  Software Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Charm++ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Adaptive MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.2 Adaptive Overlap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.3 Automatic Load Balancing . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 The Projections Performance Analysis And Visualization Framework . . . . . 17
2.2.1 Charm++ Performance Events . . . . . . . . . . . . . . . . . . . . . . 18
2.2.2 Projections Event Log Formats . . . . . . . . . . . . . . . . . . . . . . 19
2.2.3 The Projections Performance Visualization Tool . . . . . . . . . . . . . 20
2.2.4 The Performance Analysis Process Using Projections . . . . . . . . . . 21
Chapter 3  Scalable Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1 Scalably Finding Object Decomposition With Poor Grainsize Distributions . . 23
3.2 Scalable Idioms Based On Time Profiles . . . . . . . . . . . . . . . . . . . . 25
3.2.1 Time Profile Heuristic: Probable Load Imbalance . . . . . . . . . . . . 27
3.2.2 Time Profile Heuristic: Comparative Substructure Analysis . . . . . . . 29
3.2.3 High Resolution Time Profiles . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Finding Interesting Processors . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.1 Case Study: Detecting Bad Multicast Implementation . . . . . . . . . . 34
3.3.2 Case Study: Finding Causes For Load Imbalance . . . . . . . . . . . . 35
3.3.3 Scalable Tool Support For Finding Interesting Processors . . . . . . . . 37
3.3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Chapter 4  Data Volume Reduction . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1 Basic Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.3 Quantifying The Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.4 Applying k-Means Clustering To Performance Data . . . . . . . . . . . . . . 52
4.4.1 The k-Means Clustering Algorithm . . . . . . . . . . . . . . . . . . . . 52
4.4.2 Important Choices For k-Means Clustering . . . . . . . . . . . . . . . 55
4.4.3 Choosing Representative Processors . . . . . . . . . . . . . . . . . . . 57
4.5 Post-mortem Versus Online Data Reduction . . . . . . . . . . . . . . . . . . 59
4.6 Case Study: NAMD Grainsize . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.6.1 NAMD - NAnoscale Molecular Dynamics . . . . . . . . . . . . . . . . 65
4.6.2 Experimental Methodology . . . . . . . . . . . . . . . . . . . . . . . . 67
4.6.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.7 Future Work: Extensions To The Basic Approach . . . . . . . . . . . . . . . 74
4.7.1 Choosing Data Subsets By Phase . . . . . . . . . . . . . . . . . . . . . 74
4.7.2 Considerations For Critical Paths . . . . . . . . . . . . . . . . . . . . . 76
Chapter 5  Online Live Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.2 The Converse Client-Server Interface . . . . . . . . . . . . . . . . . . . . . . 80
5.3 Continuous Streaming Of Online Performance Data . . . . . . . . . . . . . . 80
5.4 Live Parallel Data Collection Of Performance Profiles . . . . . . . . . . . . . 82
5.5 Case Study: Long-Running NAMD STMV Simulation . . . . . . . . . . . . . 88
5.6 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
Chapter 6  What-If Analysis Through Simulation . . . . . . . . . . . . . . . . . . 96
6.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.2 The BigSim Simulation Framework . . . . . . . . . . . . . . . . . . . . . . . 99
6.2.1 BigSim Emulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.2 BigSim Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3 Performance Analysis Using Fewer Processors . . . . . . . . . . . . . . . . . 103
6.3.1 Reducing Number Of Processors Used At Emulation Time . . . . . . . 104
6.3.2 Number Of Processors Used At Simulation Time . . . . . . . . . . . . 105
6.4 General Methodology For Hypothesis Testing . . . . . . . . . . . . . . . . . 106
6.5 Analyzing Changes To Network Latency . . . . . . . . . . . . . . . . . . . . 106
6.5.1 Case Study: Seven-point Stencil Computation . . . . . . . . . . . . . . 107
6.5.2 Validation With Seven-point Stencil Computation . . . . . . . . . . . . 109
6.6 Analyzing Hypothetical Load Balancing Changes . . . . . . . . . . . . . . . 120
6.6.1 Case Study: Simple Load Imbalance Benchmark . . . . . . . . . . . . 122
6.6.2 Validation With Simple Load Imbalance Benchmark . . . . . . . . . . 124
6.7 Future Work: Variations In Object-to-Processor Mapping . . . . . . . . . . . 128
Chapter 7  Extending Scalability Techniques To Third Party Tools . . . . . . . . . 130
7.1 Performance Call-back (Event) Interface . . . . . . . . . . . . . . . . . . . . 130
7.2 Integrating An External Performance Module . . . . . . . . . . . . . . . . . 133
7.2.1 TAU Profiling Integration . . . . . . . . . . . . . . . . . . . . . . . . . 133
7.2.2 Instrumentation Overhead Assessment . . . . . . . . . . . . . . . . . . 134
7.2.3 Experimental Results On Current Machines . . . . . . . . . . . . . . . 137
7.3 Scalability Benefits Of Integration . . . . . . . . . . . . . . . . . . . . . . . 137
Chapter 8  Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
List of Tables
4.1 Total volume of trace data summed across all files. ApoA1 is a NAMD simulation with 92k atoms, F1-ATPase simulates 327k atoms, while stmv simulates 1M atoms. This data was generated on the XT3 at PSC. . . . . . . . . . . . . . 51
4.2 Total volume of trace data summed across all processors in stmv Projections logs with different Particle Mesh Ewald (PME) long-range electrostatics configurations. The data was generated on an XT5 at NICS. . . . . . . . . . . . . 52
4.3 Number of non-empty clusters found by the clustering algorithm when varying the number of initial seeds uniformly distributed in the sample space. The * indicates that processor 0 was alone in its own cluster. . . . . . . . . . . . . . 71
4.4 Reduction in total volume of trace data for stmv. The number of processors selected in the subsets are 51, 102, 204 and 409 for 512, 1024, 2048 and 4096 original processors respectively. . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5 Reduced dataset quality by proportionality based on total height of histogram bars. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.6 Measure of data reduction quality through the number of least idle processors retained in reduced datasets for 2,048 processors. . . . . . . . . . . . . . . . . 73
4.7 Measure of data reduction quality through the number of least idle processors retained in reduced datasets for 4,096 processors. . . . . . . . . . . . . . . . . 74
5.1 Overhead of collecting utilization profile instrumentation. . . . . . . . . . . . 92
6.1 Experimental setup for baseline latency tolerance experiments executed on 48 processors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.2 Comparison of the SimpleImbalance benchmark average iteration times with a dummy load balancing strategy on 100 processors. . . . . . . . . . . . . . . . 127
6.3 Comparison of the SimpleImbalance benchmark average iteration times with a greedy load balancing strategy on 100 processors. . . . . . . . . . . . . . . . . 127
7.1 Overhead of performance modules (microseconds per event). . . . . . . . . . 136
List of Figures
2.1 User view of the CHARM++ adaptive run time system with migratable objects on the left versus a possible system mapping of the same migratable objects on the right. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Partial interface code showing how Chares encapsulate entry methods. . . . . . 12
2.3 A basic illustration of the runtime scheduler implementation on each processor. 12
2.4 7 AMPI “processes” implemented as user-level threads bound to CHARM++ chare objects which are then bound to 2 actual processors for execution. . . . . 14
2.5 Adaptive Overlapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.6 Overview: A summary of overall parallel structure. . . . . . . . . . . . . . . . 20
2.7 Usage Profile: Processor utilization of various activities. . . . . . . . . . . . . 20
2.8 Timeline visualization of processor activity. . . . . . . . . . . . . . . . . . . . 21
3.1 Histogram of activity grainsizes in a parallel discrete event simulation. . . . . . 25
3.2 Low resolution time profile display of the structure of parallel event overlap in an 8,192 processor run of OpenAtom at 100 ms intervals. . . . . . . . . . . . . 26
3.3 Higher resolution time profile display of the same 8,192 processor run of OpenAtom at 10 ms intervals. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Time profile of a ChaNGa time step presenting its event-overlap substructure and exhibiting visual cues that point to possible load imbalance. . . . . . . . . 28
3.5 Time profile of a ChaNGa time step showing poorer performance after the application of a greedy load balancing strategy. The substructure hints at poorer communication performance as a result of the attempt to balance the load. . . . 29
3.6 Time profile of a ChaNGa time step showing the result of applying a load balancing strategy that also took communication into account. . . . . . . . . . . . 30
3.7 Very high resolution time profile display of 8 NAMD time steps with high-detail substructure at 100 microsecond intervals. . . . . . . . . . . . . . . . . . 31
3.8 Correlation of OpenAtom time-profile and communication-over-time graphs. . 33
3.9 Zoomed-in view of communication-over-time graph showing the number of messages received for OpenAtom at 10 ms intervals. . . . . . . . . . . . . . . 34
3.10 A small sample of processor timelines out of 1,024 showing (red) events with inefficient multicasts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.11 A small sample of processor timelines out of 1,024 showing much shorter (red) events with improved multicasts and a corresponding improvement in overall performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36