IBM LoadLeveler
Version 5 Release 1
Using and Administering
SC23-6792-04
Note
Before using this information and the product it supports, read the information in “Notices” on page 423.
This edition applies to version 5, release 1, modification 0 of IBM LoadLeveler (product numbers 5725-G01, 5641-LL1, 5641-LL3, 5765-L50, and 5765-LLP) and to all subsequent releases and modifications until otherwise indicated in new editions.
This edition replaces SC23-6792-03.
© Copyright 1986, 1987, 1988, 1989, 1990, 1991 by the Condor Design Team.
© Copyright IBM Corporation 1986, 2012.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
Contents
Figures . . . vii

Tables . . . ix

About this information . . . xi
Who should use this information . . . xi
Conventions and terminology used in this information . . . xi
Prerequisite and related information . . . xii
How to send your comments . . . xiii

Summary of changes . . . xv

Part 1. Overview of LoadLeveler concepts and operation . . . 1

Chapter 1. What is LoadLeveler? . . . 3
LoadLeveler basics . . . 4
LoadLeveler: A network job management and scheduling system . . . 4
Job definition . . . 5
Machine definition . . . 5
How LoadLeveler schedules jobs . . . 7
How LoadLeveler daemons process jobs . . . 8
The master daemon . . . 9
The Schedd daemon . . . 10
The startd daemon . . . 12
The region manager daemon . . . 14
The resource manager daemon . . . 15
The kbdd daemon . . . 15
The negotiator daemon . . . 15
The LoadLeveler job cycle . . . 16
LoadLeveler job states . . . 19
Consumable resources . . . 22
Consumable resources and Workload Manager . . . 23
Overview of reservations . . . 24
Fair share scheduling overview . . . 27

Chapter 2. Getting a quick start using the default configuration . . . 29
What you need to know before you begin . . . 29
Using the default configuration files . . . 29
LoadLeveler for Linux quick start . . . 30
Quick installation . . . 30
Quick configuration . . . 31
Quick verification . . . 31
Post-installation considerations . . . 32
Starting LoadLeveler . . . 32
Directory considerations . . . 33

Chapter 3. What operating systems are supported by LoadLeveler? . . . 35
LoadLeveler for AIX and LoadLeveler for Linux compatibility . . . 35
Restrictions for LoadLeveler for Linux . . . 36
Features not supported in LoadLeveler for Linux . . . 36
Restrictions for LoadLeveler for AIX and LoadLeveler for Linux mixed clusters . . . 36

Part 2. Configuring and managing the LoadLeveler environment . . . 37

Chapter 4. Configuring the LoadLeveler environment . . . 39
The master configuration file . . . 40
Setting the LoadLeveler user . . . 40
Setting the configuration source . . . 41
Overriding the shared memory key . . . 41
File-based configuration . . . 42
Database configuration option . . . 43
Understanding remotely configured nodes . . . 43
Using the configuration editor . . . 44
Modifying configuration data . . . 45
Defining LoadLeveler administrators . . . 45
Defining a LoadLeveler cluster . . . 45
Defining LoadLeveler machine characteristics . . . 59
Defining security mechanisms . . . 60
Defining usage policies for consumable resources . . . 65
Gathering job accounting data . . . 65
Managing job status through control expressions . . . 72
Tracking job processes . . . 73
Querying multiple LoadLeveler clusters . . . 74
Handling switch-table errors . . . 75
Providing additional job-processing controls through installation exits . . . 75

Chapter 5. Defining LoadLeveler resources to administer . . . 89
Defining machines . . . 89
Planning considerations for defining machines . . . 90
Machine_group stanza format and keyword summary . . . 90
Machine substanza format and keyword summary . . . 91
Machine stanza format and keyword summary . . . 91
Default values for machine_group and machine stanzas . . . 92
Examples of machine_group and machine stanzas . . . 92
Dynamic adapter discovery . . . 93
LoadLeveler adapter and node status monitoring . . . 94
Defining classes . . . 94
Using limit keywords . . . 94
Allowing users to use a class . . . 97
Class stanza format and keyword summary . . . 97
Examples: Class stanzas . . . 98
Defining user substanzas in class stanzas . . . 99
Examples: Substanzas . . . 99
Defining users . . . 102
User stanza format and keyword summary . . . 102
Examples: User stanzas . . . 102
Defining groups . . . 103
Group stanza format and keyword summary . . . 104
Examples: Group stanzas . . . 104
Defining clusters . . . 104
Cluster stanza format and keyword summary . . . 104
Examples: Cluster stanzas . . . 105
Defining regions . . . 106
Region stanza format and keyword summary . . . 106
Examples: Region stanzas . . . 106

Chapter 6. Performing additional administrator tasks . . . 109
Setting up the environment for parallel jobs . . . 110
Scheduling considerations for parallel jobs . . . 110
Steps for reducing job launch overhead for parallel jobs . . . 111
Steps for allowing users to submit interactive POE jobs . . . 112
Setting up a class for parallel jobs . . . 112
Striping when some networks fail . . . 113
Setting up a parallel master node . . . 113
Using the BACKFILL scheduler . . . 114
Tips for using the BACKFILL scheduler . . . 116
Example: BACKFILL scheduling . . . 117
Data staging . . . 117
Configuring LoadLeveler to support data staging . . . 118
Using an external scheduler . . . 119
Replacing the default LoadLeveler scheduling algorithm with an external scheduler . . . 120
Customizing the configuration file to define an external scheduler . . . 121
Example: Retrieving specific information . . . 122
Example: Changing scheduler types . . . 122
Preempting and resuming jobs . . . 122
Overview of preemption . . . 123
Planning to preempt jobs . . . 124
Steps for configuring a scheduler to preempt jobs . . . 126
Configuring LoadLeveler to support reservations . . . 127
Steps for configuring reservations in a LoadLeveler cluster . . . 128
Steps for integrating LoadLeveler with the Workload Manager . . . 133
LoadLeveler support for checkpointing jobs . . . 135
Checkpoint keyword summary . . . 136
Planning considerations for checkpointing jobs . . . 136
Additional planning considerations for checkpointing MetaCluster HPC jobs on AIX . . . 138
Checkpoint and restart limitations . . . 138
Submitting a MetaCluster HPC checkpoint job to LoadLeveler . . . 138
job_1.cmd - A checkpointable job command file . . . 138
Using the llckpt command to checkpoint a job step . . . 139
Restarting a job step from a checkpoint . . . 140
Making periodic checkpoints . . . 142
Using the ckpt_dir and ckpt_subdir keywords . . . 143
Removing old checkpoint files . . . 144
Using the ckpt_execute_dir keyword . . . 144
Initiating a checkpoint using the ll_ckpt() API . . . 146
LoadLeveler scheduling affinity support . . . 147
Configuring LoadLeveler to use scheduling affinity . . . 148
LoadLeveler multicluster support . . . 149
Configuring a LoadLeveler multicluster . . . 150
LoadLeveler Blue Gene support . . . 153
Configuring LoadLeveler Blue Gene support . . . 155
Blue Gene reservation support . . . 157
Blue Gene fair share scheduling support . . . 157
Blue Gene heterogeneous memory support . . . 157
Blue Gene preemption support . . . 157
Using fair share scheduling . . . 158
Fair share scheduling keywords . . . 158
Reconfiguring fair share scheduling keywords . . . 161
Example: three groups share a LoadLeveler cluster . . . 161
Example: two thousand students share a LoadLeveler cluster . . . 162
Querying information about fair share scheduling . . . 163
Resetting fair share scheduling . . . 163
Saving historic data . . . 163
Restoring saved historic data . . . 164
Procedure for recovering a job spool . . . 164
Configuring and using island scheduling . . . 165
Energy aware job support . . . 166
S3 state support . . . 166

Part 3. Submitting and managing LoadLeveler jobs . . . 169

Chapter 7. Building and submitting jobs . . . 171
Building a job command file . . . 171
Using multiple steps in a job command file . . . 172
Examples: Job command files . . . 173
Editing job command files . . . 176
Defining resources for a job step . . . 177
Submitting jobs requesting data staging . . . 177
Working with coscheduled job steps . . . 178
Submitting coscheduled job steps . . . 178
Determining priority for coscheduled job steps . . . 178
Supporting preemption of coscheduled job steps . . . 179
Coscheduled job steps and commands and APIs . . . 179
Termination of coscheduled steps . . . 179
Using bulk data transfer . . . 180
Preparing a job for checkpoint/restart . . . 180
Preparing a job for preemption . . . 183
Submitting a job command file . . . 183
Job state monitoring . . . 184
Submitting a job using a submit-only machine . . . 184
Working with parallel jobs . . . 184
Step for controlling whether LoadLeveler copies environment variables to all executing nodes . . . 185
Ensuring that parallel jobs in a cluster run on the correct levels of PE and LoadLeveler software . . . 185
Task-assignment considerations . . . 186
Submitting jobs that use striping . . . 188
Running interactive POE jobs . . . 193
Debugging interfaces between POE and LoadLeveler . . . 194
Running MPICH2 . . . 194
Running OpenMPI . . . 195
Running Intel MPI jobs . . . 196
Running embarrassingly parallel jobs . . . 196
Examples: Building parallel job command files . . . 197
Obtaining status of parallel jobs . . . 201
Obtaining allocated host names . . . 201
Building and submitting MPICH2 and serial interactive jobs . . . 202
Working with reservations . . . 203
Types of reservations . . . 203
Understanding the flexible job step . . . 203
Understanding the reservation life cycle . . . 205
Creating new reservations . . . 207
Submitting jobs to run under a reservation . . . 210
Removing bound jobs from the reservation . . . 212
Querying existing reservations . . . 213
Modifying existing reservations . . . 213
Canceling existing reservations . . . 215
Reservations with floating resources . . . 215
Submitting jobs requesting scheduling affinity . . . 217
Submitting and monitoring jobs in a LoadLeveler multicluster . . . 218
Steps for submitting jobs in a LoadLeveler multicluster environment . . . 219
Working with energy aware jobs . . . 220
Submitting and monitoring Blue Gene jobs . . . 221

Chapter 8. Managing submitted jobs . . . 223
Querying the status of a job . . . 223
Working with machines . . . 223
Displaying currently available resources . . . 224
Setting and changing the priority of a job . . . 224
Example: How does a job's priority affect dispatching order? . . . 225
Placing and releasing a hold on a job . . . 225
Canceling a job . . . 226
Checkpointing a job . . . 226

Chapter 9. Example: Using commands to build, submit, and manage jobs . . . 227

Part 4. LoadLeveler interfaces reference . . . 229

Chapter 10. Configuration keyword reference . . . 231
Configuration keyword syntax . . . 231
Numerical and alphabetical constants . . . 232
Mathematical operators . . . 232
64-bit support for configuration file keywords and expressions . . . 232
Configuration keyword descriptions . . . 233
User-defined keywords . . . 284
LoadLeveler variables . . . 286
Variables to use for setting dates . . . 291
Variables to use for setting times . . . 291

Chapter 11. Administration keyword reference . . . 293
Administration file structure and syntax . . . 293
Stanza characteristics . . . 295
Syntax for limit keywords . . . 295
64-bit support for administration file keywords . . . 297
Administration keyword descriptions . . . 298

Chapter 12. Job command file reference . . . 333
Job command file syntax . . . 333
Serial job command file . . . 333
Parallel job command file . . . 334
Syntax for limit keywords . . . 334
64-bit support for job command file keywords . . . 334
Job command file keyword descriptions . . . 335
Job command file variables . . . 383
Run-time environment variables . . . 384
Job command file examples . . . 386

Part 5. Appendixes . . . 389

Appendix A. Troubleshooting LoadLeveler . . . 391
Frequently asked questions . . . 391
Why won't LoadLeveler start? . . . 392
Why won't my job run? . . . 392
Why won't my parallel job run? . . . 395
Why won't my checkpointed job restart? . . . 396
Why won't my submit-only job run? . . . 397
Why does a job stay in the Pending (or Starting) state? . . . 397
What happens to running jobs when a machine goes down? . . . 397
Why does llstatus indicate that a machine is down when llq indicates a job is running on the machine? . . . 398
Why won't my job run on a cluster with both AIX and Linux machines? . . . 399
Why won't my jobs run that were directed to an idle pool? . . . 399
What happens if the central manager isn't operating? . . . 399
How do I recover resources allocated by a Schedd machine? . . . 401
Why can't I find a core file on Linux? . . . 401
Why am I seeing inconsistencies in my llfs output? . . . 402
Why don't I see my job when I issue the llq command? . . . 402
What happens if errors are found in my configuration or administration file? . . . 402
Why is my flexible reservation not activated? . . . 403
Why was my energy aware job rejected? . . . 403
Other questions . . . 403
Troubleshooting in a multicluster environment . . . 405
How do I determine if I am in a multicluster environment? . . . 405
How do I determine how my multicluster environment is defined and what are the inbound and outbound hosts defined for each cluster? . . . 405
Why is my multicluster environment not enabled? . . . 406
How do I find log messages from my multicluster-defined installation exits? . . . 406
Why won't my remote job be submitted or moved? . . . 407
Why did the CLUSTER_REMOTE_JOB_FILTER not update the job with all of the statements I defined? . . . 408
How do I find my remote job? . . . 408
Why won't my remote job run? . . . 408
Why does llq -X all show no jobs running when there are jobs running? . . . 409
Troubleshooting adapter availability . . . 409
Troubleshooting in a Blue Gene environment . . . 409
Why do all of my Blue Gene jobs fail even though llstatus shows that Blue Gene is present? . . . 409
Why does llstatus show that Blue Gene is absent? . . . 409
Why did my Blue Gene job fail when the job was submitted to a remote cluster? . . . 410
Why does llmkres or llchres return "Insufficient resources to meet the request" for a Blue Gene reservation when resources appear to be available? . . . 410
Helpful hints . . . 411
Scaling considerations . . . 411
Hints for running jobs . . . 412
Hints for using machines . . . 414
History files and Schedd . . . 415
Getting help from IBM . . . 416

Appendix B. LoadLeveler port usage . . . 417

Accessibility features for LoadLeveler . . . 421
Accessibility features . . . 421
Keyboard navigation . . . 421
IBM and accessibility . . . 421

Notices . . . 423
Trademarks . . . 425

Glossary . . . 427

Index . . . 431
Figures
1. Example of a LoadLeveler cluster . . . 3
2. LoadLeveler job steps . . . 5
3. Multiple roles of machines . . . 7
4. High-level job flow . . . 16
5. Job is submitted to LoadLeveler . . . 17
6. LoadLeveler authorizes the job . . . 17
7. LoadLeveler prepares to run the job . . . 18
8. LoadLeveler starts the job . . . 18
9. LoadLeveler completes the job . . . 19
10. How control expressions affect jobs . . . 73
11. Multicluster Example . . . 105
12. Job command file with multiple steps . . . 172
13. Job command file with multiple steps and one executable . . . 173
14. Job command file with varying input statements . . . 173
15. Using LoadLeveler variables in a job command file . . . 175
16. Job command file used as the executable . . . 176
17. Striping over multiple networks . . . 190
18. Striping over a single network . . . 192
19. When the primary central manager is unavailable . . . 400
20. Multiple central managers . . . 400