Routine Microsecond Molecular Dynamics Simulations with AMBER
on GPUs. 1. Generalized Born
Andreas W. Götz,† Mark J. Williamson,†,∥ Dong Xu,†,⊥ Duncan Poole,‡ Scott Le Grand,‡
and Ross C. Walker*,†,§
†San Diego Supercomputer Center, University of California San Diego, 9500 Gilman Drive MC0505, La Jolla, California 92093, United States
‡NVIDIA Corporation, 2701 San Tomas Expressway, Santa Clara, California 95050, United States
§Department of Chemistry and Biochemistry, University of California San Diego, 9500 Gilman Drive MC0505, La Jolla, California 92093, United States
Supporting Information available.
Received: December 20, 2011. Published: March 26, 2012.
ABSTRACT: We present an implementation of generalized Born implicit solvent all-atom classical molecular dynamics (MD) within the AMBER program package that runs entirely on CUDA-enabled NVIDIA graphics processing units (GPUs). We discuss the algorithms that are used to exploit the processing power of the GPUs and show the performance that can be achieved in comparison to simulations on conventional CPU clusters. The implementation supports three different precision models in which the contributions to the forces are calculated in single precision floating point arithmetic but accumulated in double precision (SPDP), or everything is computed in single precision (SPSP) or double precision (DPDP). In addition to performance, we have focused on understanding the implications of the different precision models on the outcome of implicit solvent MD simulations. We show results for a range of tests including the accuracy of single point force evaluations and energy conservation as well as structural properties pertaining to protein dynamics. The numerical noise due to rounding errors within the SPSP precision model is sufficiently large to lead to an accumulation of errors which can result in unphysical trajectories for long time scale simulations. We recommend the use of the mixed-precision SPDP model since the numerical results obtained are comparable with those of the full double precision DPDP model and the reference double precision CPU implementation but at significantly reduced computational cost. Our implementation provides performance for GB simulations on a single desktop that is on par with, and in some cases exceeds, that of traditional supercomputers.
1. INTRODUCTION
Since the first simulation of an enzyme using molecular dynamics (MD) was reported by McCammon et al.1 in 1977, MD simulations have evolved to become important tools in rationalizing the behavior of biomolecules. The field has grown from that first 10-ps-long simulation of a mere 500 atoms to the point where small enzymes can be simulated on the microsecond time scale2−4 and simulations containing millions of atoms can be considered routine.5,6 However, such simulations are numerically very intensive, and using traditional CPU-centric hardware requires access to large-scale supercomputers or well-designed clusters with expensive interconnects that are beyond the reach of many research groups.
Numerous attempts have been made over the years to accelerate classical MD simulations by exploiting alternative hardware technologies. Some notable examples include ATOMS by AT&T Bell Laboratories,7 FASTRUN by Columbia University and Brookhaven National Laboratory,8 MDGRAPE by RIKEN,9 and most recently Anton by D. E. Shaw Research LLC.10 All of these approaches have, however, failed to make an impact on mainstream research because of their excessive cost. Additionally, these technologies have been based on custom hardware and do not form part of what would be considered a standard workstation specification. This has made it difficult to experiment with such technologies, leading to a lack of sustained development or innovation and ultimately their failure to mature into ubiquitous community-maintained research tools.
Graphics processing units (GPUs), on the other hand, have been an integral part of personal computers for decades, and a strong demand from the consumer electronics industry has resulted in significant sustained industrial investment in the stable, long-term development of GPU technology. In addition to low prices for GPUs, this has led to a continuous increase in the computational power and memory bandwidth of GPUs, significantly outstripping the improvements in CPUs. As a consequence, high-end GPUs can be considered standard equipment in scientific workstations, which means that they either already exist in many research laboratories or can be purchased easily with new equipment. This makes them readily available to researchers and thus attractive targets for acceleration of many scientific applications including MD simulations.
The nature of GPU hardware, however, has until recently made their use in general purpose computing challenging to all but those with extensive three-dimensional (3D) graphics programming experience. However, the development of application programming interfaces (APIs) targeted at general purpose scientific computing has reduced this complexity substantially such that
Figure 1. Peak floating-point operations per second (Flop/s; left) and memory bandwidth (right) for Intel CPUs26 and NVIDIA GPUs.27
GPUs are now accepted as serious tools for the economically efficient acceleration of an extensive range of scientific problems.11,12
The computational complexity and fine-grained parallelism of MD simulations of macromolecules makes them an ideal candidate for implementation on GPUs. Indeed, as we illustrate here for implicit solvent and in a subsequent paper13 for explicit solvent, the careful implementation of modern MD algorithms on GPUs can provide capability, in terms of performance, that exceeds that achievable with any current CPU-based supercomputer. Several previous studies have investigated the use of GPUs to accelerate MD simulations.14−20 For a detailed review of the use of GPUs for acceleration of condensed phase biomolecular MD simulations, we refer the reader to our recent review.12
In this manuscript, we present our high-performance GPU implementation of implicit solvent generalized Born (GB) MD for the AMBER21 and CHARMM22 pairwise additive force fields on CUDA-enabled NVIDIA GPUs. We have implemented this within the AMBER23,24 PMEMD dynamics engine in a manner that is designed to be as transparent to the user as possible, and we give an overview of what the code currently supports, as well as our plans for future developments. We discuss the specifics by which we exploit the processing power of GPUs, both in serial and using multiple GPUs, and show the performance that can be achieved in comparison to conventional CPU clusters. We also discuss our implementation and validation of three specific precision models that we developed and their impact on the numerical results of implicit solvent MD simulations.

2. GPU PROGRAMMING COMPLEXITIES
As illustrated by Figure 1, GPUs offer a tremendous amount of computing power in a compact package. This, however, comes at the cost of reduced flexibility and increased programming complexity as compared to CPUs. In order to develop software that runs efficiently on GPUs, it is necessary to have a thorough understanding of the characteristics of the GPU hardware architecture. A number of manuscripts have already discussed this in detail in the context of MD.11,12,15,17,25 For this reason, we provide simply a brief overview of the complexities involved in programming GPUs as they relate to our implementation, focusing on NVIDIA hardware. For a more detailed description, the reader is referred to the publications cited above.
2.1. Vectorization. A GPU is an example of a massively parallel stream-processing architecture which uses the single-instruction multiple data (SIMD) vector processing model. Unlike a regular CPU, which typically operates on one to four threads in parallel, GPUs typically process threads in blocks (termed warps within the CUDA programming language28) containing between 16 and 64 threads. These thread blocks logically map to the underlying hardware, which consists of streaming multiprocessors. At the time of writing, high-end GPUs typically have between 16 and 32 multiprocessors. For example, an NVIDIA M2090 GPU consists of 16 multiprocessors, each containing 32 cores for a total of 512 cores. All threads in a single block must execute the same instruction on the same clock cycle. This necessarily implies that, for optimum performance, codes must be vectorized to match the size of a thread block. Branching must therefore be used with extreme care since if any two threads in the same warp have to follow different code paths of the branch, then threads in the warp will stall while each side of the branch is executed sequentially.
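To make the cost of divergence concrete, the following fragment (illustrative only, not taken from the AMBER source) contrasts a kernel in which lanes of one warp take different sides of a data-dependent branch with an equivalent formulation that keeps every lane on the same instruction stream.

// Illustrative only. A data-dependent branch inside a warp serializes
// execution of both sides; a branch-free formulation avoids this.
__global__ void scale_divergent(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Lanes of one warp that disagree on this condition execute both
    // sides of the branch one after the other (warp divergence).
    if (in[i] > 0.0f)
        out[i] = 2.0f * in[i];
    else
        out[i] = 0.5f * in[i];
}

__global__ void scale_uniform(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Same result without a per-lane branch: every lane of the warp
    // executes the identical instruction stream.
    float x = in[i];
    float factor = (x > 0.0f) ? 2.0f : 0.5f;  // typically compiles to a predicated select
    out[i] = factor * x;
}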
2.2. Memory Model. The memory hierarchy of GPUs has its origins in their graphics lineage, and the high density of arithmetic units comes at the expense of cache memory and control units. All of the cores making up a multiprocessor have a small number of registers that they can access, a few kilobytes (64 kB on an M2090) of shared memory [this can be split into directly accessible memory and L1 cache; in the case of an M2090, it can be split 48/16 kB or 16/48 kB; in the case of AMBER, the configuration is switched at runtime for optimal performance of a given kernel] which is private to each multiprocessor, and a small amount (typically 48 kB) of high-speed but read-only texture memory. The majority of the memory (6 GB on an M2090), termed global device memory, is available to all multiprocessors. While being fast compared to the main memory accessible by CPUs, access to the device memory by GPUs is still relatively slow compared to the local cache memory. The nature by which the multiprocessors are connected to this memory also means that there is a significant performance penalty for non-stride-1 access. Finally, it should be noted that currently the CPU and GPU memories are in different address spaces and this requires careful consideration. The unique nature of this memory model leads to several considerations for optimizing GPU performance, including optimizing device memory access for contiguous data, utilizing the multiprocessor shared memory to store intermediate results or to reorganize data that would otherwise require non-stride-1 memory accesses, and using the texture memory to store read-only information, such as various force field parameters, in a fashion that allows very rapid access.
2.3. GPU to CPU Communication. As mentioned above, the CPU and GPU memories are, at the time of writing, in different address spaces. This means it is up to the programmer
to ensure that the memories are synchronized as necessary to avoid race conditions. However, there is a big performance penalty for such synchronizations, which have to occur via the Peripheral Component Interconnect Express (PCIe) bus, and thus they should be avoided unless absolutely necessary.
2.4. GPU to GPU Communication. The traditional method for programming scientific algorithms in parallel uses the message passing interface (MPI)29 in which each thread runs in a separate address space. When running GPUs in parallel under an MPI paradigm, additional complexity is introduced since sending data between two GPUs involves copying the data from the memory of the sending GPU to the CPU memory of the corresponding MPI thread over the PCIe bus, an MPI send by this CPU thread, and a corresponding MPI receive by the receiving CPU thread, which copies the data between the memories of the CPUs, and finally copying the data to the memory of the receiving GPU. Clearly, this introduces additional considerations for maximizing parallel performance as compared to traditional CPU programming.
At the time of writing, there are efforts to streamline GPU to GPU communication, particularly within a single node but also for Infiniband connections between nodes. One such approach under development by NVIDIA and Mellanox is termed GPUDirect,30 which ultimately seeks to unify address spaces between multiple CPUs and GPUs. Currently, the degree to which this can be utilized is heavily dependent on the underlying hardware design. Therefore, at present, the added complexity of using the advanced features of GPUDirect, beyond the pinned memory MPI optimizations offered by GPUDirect version 1, on the large number of possible different hardware combinations is not worth the effort for a widely used production code.
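The staging path described at the beginning of this subsection can be sketched as follows; the routine and buffer names are generic placeholders rather than AMBER's, and a double-precision array of length n is assumed.

#include <mpi.h>
#include <cuda_runtime.h>

// Sketch of the GPU -> CPU -> MPI -> CPU -> GPU staging path described
// in the text. Buffer names and sizes are illustrative, not AMBER's.
void exchange_forces(double *d_send, double *d_recv,  // device buffers
                     double *h_send, double *h_recv,  // pinned host buffers
                     int n, int peer, MPI_Comm comm)
{
    // 1. Copy the data from the sending GPU into host memory over PCIe.
    cudaMemcpy(h_send, d_send, n * sizeof(double), cudaMemcpyDeviceToHost);

    // 2. Exchange the host buffers between the two MPI ranks.
    MPI_Sendrecv(h_send, n, MPI_DOUBLE, peer, 0,
                 h_recv, n, MPI_DOUBLE, peer, 0,
                 comm, MPI_STATUS_IGNORE);

    // 3. Copy the received data from host memory onto the receiving GPU.
    cudaMemcpy(d_recv, h_recv, n * sizeof(double), cudaMemcpyHostToDevice);
}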
2.5. Mathematical Precision. Early versions of GPUs in NVIDIA's lineup (prior to the GT200 model) only supported single precision (SP) floating point arithmetic. This was due to the fact that graphics rendering did not require double precision (DP). Scientific algorithms, however, typically require DP arithmetic (for a discussion in the context of quantum chemistry, see for example the work by Knizia et al.31). The generation of GPUs at the time of our initial implementation (2008) supported DP in hardware, but only at 1/8 the performance of SP. In the latest generation of cards at the time of writing, termed the Fermi lineup by NVIDIA, the DP to SP performance ratio is 1/2 and thus equivalent to that in CPUs. This, however, only holds for the professional (termed Tesla) series of cards. The significantly cheaper gaming cards (termed GeForce) still only support DP at a fraction of the speed of SP. It is therefore important to optimize the use of DP such that it is only used when necessary to maintain numerical accuracy.
2.6. Programming Model. Early use of GPUs for scientific computing was hampered by the lack of an application programming interface (API) for general purpose calculations. The problems to be solved had to be described in terms of a graphics pipeline employing either OpenGL or DirectX, which made the software development time-consuming and hardware-specific. The barrier to utilizing GPU hardware for general purpose computation has since been reduced by the introduction of GPU programming models such as the Brook stream programming language,32 OpenCL,33 and NVIDIA's Compute Unified Device Architecture (CUDA)28 and the availability of corresponding software development toolkits (SDKs). The AMBER implementation uses CUDA, which is a relatively simple extension of the standard C programming language that allows one to code in an inherently parallel fashion and perform all necessary operations to access and manipulate data on a GPU device. Realizing the full potential of GPUs, however, still requires considerable effort, as indicated above and outlined below, to take advantage of the particular GPU architecture, and not all algorithms are suitable to achieve good performance on these massively parallel processors.

3. OVERVIEW OF THE AMBER IMPLICIT SOLVENT GPU IMPLEMENTATION
The nature of MD simulations requires what in computer science is referred to as strong scaling, that is, reduction of the solution time with an increasing number of processors for a fixed total problem size. This enables access to simulations at longer time scales, which is required for a proper convergence of results. This becomes more important as one moves to larger system sizes since the number of degrees of freedom increases. Weak scaling, that is, the solution time with the number of processors for a fixed problem size per processor, is only of secondary importance, since this merely enables simulating larger molecules at currently attainable time scales. Our implementation therefore has focused on accelerating problem sizes that correspond to those typically studied by AMBER users. In the case of GB simulations, this is in the range of 300 to 30000 atoms.
The initial driving force in accelerating AMBER implicit solvent GB calculations with GPUs was to provide the scientific community with a computational tool that would allow an individual researcher to obtain performance on a simple desktop workstation equivalent to that of a small CPU cluster. Such a tool alleviates the costs, both capital and recurring, involved in purchasing, maintaining, and using individual research compute clusters. To this end, our goal was that a single state-of-the-art GPU should provide a performance equivalent to that of four to six high-end CPU cluster nodes. Such an approach also removes the need to purchase and maintain expensive interconnects that are required to achieve scaling even on a modest number of nodes.
Beyond this initial serial development, which was first released as an update to AMBER 10 (ref 34) in August 2009, we have also developed a parallel implementation based on the MPI-229 message passing protocol, released as an update to AMBER 11 (ref 23) in October 2010, that allows a single job to span multiple GPUs. These can be within a single node or across multiple nodes. As shown below, it is possible with this implementation to achieve a performance improvement that goes beyond simply making a desktop workstation faster, ultimately providing a performance capability that surpasses what is achievable on all current conventional supercomputers. Achieving this level of performance required implementing the entire implicit solvent MD algorithm, including energy and force evaluations, restraints, constraints, thermostats, and time step integration, on the GPU. As described in section 3.2, CPU to GPU communication only occurs during I/O or to some extent when data is sent between GPUs during parallel runs.
While we have designed our GPU implementation to achieve substantial acceleration of implicit solvent MD simulations over that achievable with AMBER's CPU implementation, our overriding goal has always been to maintain the precision of the calculations. To this end, we have focused on ensuring that GPU simulations will match CPU simulations. All approximations made in order to achieve performance on GPU hardware have been rigorously tested, as highlighted in the following sections.
An additional design goal has been to attempt to preserve forward compatibility of our implementation. Using the CUDA programming language provides this by abstracting the program from the underlying hardware. The GPU accelerated version of AMBER can be used on all NVIDIA cards that support double precision in hardware, that is, those with hardware revision 1.3 or 2.0 or higher. Our choice of CUDA and NVIDIA graphics cards was largely guided by the fact that, at the time we began this work, OpenCL was not mature enough to offer the same performance and stability benefits that CUDA did. A port to OpenCL is certainly possible, and this would support AMD hardware. However, with the public release of the CUDA API35 by NVIDIA and the release of CUDA compilers for x86 platforms by PGI,36 it is possible that an AMBER implementation will soon be available on a variety of accelerator hardware.
3.1. Features of the Implementation. We have attempted to make our GPU implementation all-inclusive of the features available in the PMEMD program. At the time of writing, the majority of features applicable to implicit solvent simulations are available as described below.
Supported Methods. Support is provided for all GB models currently implemented within AMBER37−41 as well as the analytical linearized Poisson−Boltzmann (ALPB)42 model. In addition to constant energy simulations, thermostats have been implemented to perform constant temperature simulations. This includes all three thermostats available in PMEMD, that is, the Berendsen weak coupling algorithm,43 the Andersen temperature coupling scheme,44 and the Langevin dynamics thermostat.45 Constraints for hydrogen bond distances use a GPU version of the standard SHAKE algorithm46,47 employed in PMEMD, and harmonic restraints to a reference structure are supported.
To the best of our knowledge, no GB formalism currently exists that corrects for the errors introduced by the use of cutoffs for long-range nonbonded interactions. The use of cutoffs in GB simulations as implemented in PMEMD does not conserve energy, and their use involves an approximation with an unknown effect on accuracy. For this reason, we chose not to implement van der Waals (vdW) and electrostatic cutoffs in the GPU version of this code. [Cutoffs for the nonbonded interactions are implemented for explicit solvent simulations with periodic boundary conditions using the particle mesh Ewald (PME) method, as described in a later paper.] However, cutoffs in calculating the effective Born radii are supported.
Reproducibility. A design feature of the GPU code that goes beyond the CPU implementation is the deterministic nature of the implementation on a given hardware configuration. Serial CPU calculations for a given set of input parameters on identical hardware are perfectly reproducible. This does not hold for the parallel CPU implementation since the need to load balance aggressively to achieve good parallel scaling means that the order of numerical operations is not defined, and therefore two simulations started from identical conditions will always diverge due to rounding differences. This poses a problem when transitioning to microsecond or greater simulation time scales since it can be of advantage to store trajectory information less frequently than what is optimal in order to conserve available storage space and produce data files of manageable size. It is thus not possible to go back to a given point of the simulation and analyze the trajectory in finer detail by restarting and sampling more frequently unless the implementation is deterministic. The deterministic nature of the GPU code coupled with machine precision binary restart files (currently under development) makes this mode of simulation possible. This also makes debugging and validation easier.
Transparency. Another key feature and a primary design goal of our GPU accelerated implementation is that its use is completely transparent to the user. As far as the user is concerned, our GPU implementation is indistinguishable from the CPU implementation, and using the GPU version of the code is simply a case of switching the executable name from pmemd to pmemd.cuda or from pmemd.MPI to pmemd.cuda.MPI for the MPI parallelized implementation. All other items such as input and output files and regression tests within the code remain identical. The only difference to be noticed by the user is an increase of performance. This guarantees effective uptake of our GPU implementation by the scientific community because there is no learning curve for the use of the code, and all tools and scripts that have been developed for the CPU version of PMEMD can be utilized without modifications.
System Size. The maximum system size that can be treated with the GPU implementation is a function of both the GPU hardware and the MD simulation parameters. In particular, Langevin temperature regulation and the use of larger cutoffs for the effective Born radii calculations increase the memory requirements. The physical GPU hardware also affects memory usage since the optimizations used are nonidentical for different GPU types. Table 1 gives an overview of the approximate maximum atom counts that can be treated with the present version of the code. The dominant sources of GPU memory usage are the output buffers used for the nonbonded interactions as described in section 3.2. The memory used by those buffers is proportional to the square of the number of atoms. Currently, the atom count limitations imposed by GPU memory usage are roughly identical in serial and parallel.

Table 1. Approximate Maximum Atom Counts That Can Be Treated with the GPU Implementation of GB Implicit Solvent Simulations in AMBER 11 Using the SPDP Precision Model^a

GPU card      GPU memory   simulation type   max atoms
GTX-295       895 MB       constant E        20500
GTX-295       895 MB       constant T        19200
Tesla C1060   4.0 GB       constant E        46350
Tesla C1060   4.0 GB       constant T        45200
Tesla C2050   3.0 GB       constant E        39250
Tesla C2050   3.0 GB       constant T        38100
Tesla C2070   6.0 GB       constant E        54000
Tesla C2070   6.0 GB       constant T        53050

^a Test systems are droplets of TIP3P water molecules. All simulations use SHAKE (AMBER input ntf=2, ntc=2); a time step of 2 fs; the Hawkins, Cramer, Truhlar GB model37 (AMBER input igb=1); the default cutoff value of 25 Å for GB radii (AMBER input rgbmax=25); and temperature control with the Langevin thermostat (AMBER input ntt=3), if applicable. Error-correction code (ECC) was switched off on the Tesla cards.

3.2. Technical Details of the Implementation. In classical MD, the majority of the computational effort is spent evaluating the potential energy and gradients, which has to be
repeated each time step. In the case of the AMBER pairwise additive force fields,21 the potential takes the form

V_{\mathrm{AMBER}} = \sum_{i}^{n_{\mathrm{bonds}}} b_i (r_i - r_{i,\mathrm{eq}})^2
  + \sum_{i}^{n_{\mathrm{angles}}} a_i (\theta_i - \theta_{i,\mathrm{eq}})^2
  + \sum_{i}^{n_{\mathrm{dihedrals}}} \sum_{n}^{n_{i,\mathrm{max}}} \frac{V_{i,n}}{2}\bigl[1 + \cos(n\phi_i - \gamma_{i,n})\bigr]
  + {\sum_{i<j}^{n_{\mathrm{atoms}}}}' \left(\frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}}\right)
  + {\sum_{i<j}^{n_{\mathrm{atoms}}}}' \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}}    (1)

where the bond and angle terms are represented by a simple harmonic expression with force constants b_i and a_i and equilibrium bond distances/angles r_i,eq and θ_i,eq, respectively. Torsional potentials for the dihedral angles are represented using a truncated Fourier expansion in which the individual terms have a potential V_i,n with periodicity n and phase shift γ_i,n. The last two terms describe the vdW interaction represented by a Lennard-Jones potential with parameters A_ij and B_ij and the electrostatic interactions between atomic partial point charges q_i and q_j separated by the distance r_ij. The prime on the summation of the nonbonded interactions indicates that vdW and electrostatic interactions are only calculated for atoms in different molecules or for atoms in the same molecule separated by at least three bonds. Those nonbonded interactions separated by exactly three bonds (1−4 interactions) are reduced by the application of independent vdW and electrostatic scale factors, termed SCNB and SCEE in AMBER, which are dependent on the specific version of the force field (2.0 and 1.2, respectively, for the ff99SB48 version of the AMBER force field).
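As a concrete illustration of the nonbonded part of eq 1, the following device function (a minimal sketch, not the production kernel) evaluates the Lennard-Jones and Coulomb contribution for one non-excluded atom pair; the exclusion and 1−4 scaling bookkeeping is omitted, and AMBER-style internal units are assumed in which the charges absorb the factor 1/(4πε_0).

// Minimal sketch of the last two terms of eq 1 for a single atom pair:
// Lennard-Jones (A_ij, B_ij) plus Coulomb, with charges assumed to be in
// internal units that absorb 1/(4*pi*eps0). Exclusions and the SCNB/SCEE
// scaling of 1-4 pairs are not handled here.
__device__ double pair_energy(double dx, double dy, double dz,
                              double Aij, double Bij,
                              double qi, double qj)
{
    double r2    = dx * dx + dy * dy + dz * dz;
    double rinv  = rsqrt(r2);                 // 1/r
    double r2inv = rinv * rinv;
    double r6inv = r2inv * r2inv * r2inv;

    double evdw  = Aij * r6inv * r6inv - Bij * r6inv;  // A/r^12 - B/r^6
    double eelec = qi * qj * rinv;                     // q_i q_j / r
    return evdw + eelec;
}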
The CHARMM force field22 takes a similar form but includes three additional bonded terms:

V_{\mathrm{CHARMM}} = \sum_{i}^{n_{\mathrm{bonds}}} b_i (r_i - r_{i,\mathrm{eq}})^2
  + \sum_{i}^{n_{\mathrm{angles}}} a_i (\theta_i - \theta_{i,\mathrm{eq}})^2
  + \sum_{i}^{n_{\mathrm{Urey-Bradley}}} u_i (\tilde{r}_i - \tilde{r}_{i,\mathrm{eq}})^2
  + \sum_{i}^{n_{\mathrm{dihedrals}}} \sum_{n}^{n_{i,\mathrm{max}}} \frac{V_{i,n}}{2}\bigl[1 + \cos(n\phi_i - \gamma_{i,n})\bigr]
  + \sum_{i}^{n_{\mathrm{impropers}}} k_i (\omega_i - \omega_{i,\mathrm{eq}})^2 + \sum_{\Phi,\Psi} V_{\mathrm{CMAP}}
  + {\sum_{i<j}^{n_{\mathrm{atoms}}}}'' \varepsilon_{ij}\left[\left(\frac{R_{ij}^{\min}}{r_{ij}}\right)^{12} - \left(\frac{R_{ij}^{\min}}{r_{ij}}\right)^{6}\right]
  + {\sum_{i<j}^{n_{\mathrm{atoms}}}}' \frac{q_i q_j}{4\pi\varepsilon_0 r_{ij}}    (2)

The two-body Urey−Bradley terms account for angle bending between atoms that are separated by two bonds using a harmonic potential with force constant u_i and equilibrium distance r̃_i,eq. The improper dihedrals with force constant k_i describe out-of-plane bending and are used to maintain planarity and prevent undesired chiral inversions. Within the AMBER force field, improper dihedrals are treated in the same way as proper dihedrals. The third additional term is a cross term between two sequential protein backbone dihedral angles Φ, Ψ termed CMAP.49 Additionally, the CHARMM and AMBER force fields handle 1−4 nonbonded interactions in a different manner. The single prime on the electrostatic summation has the same meaning as described above for the AMBER force field with the exception that 1−4 interactions are not scaled. The double prime on the vdW summation implies the same exclusions as the single prime but the use of different values R_ij^min and ε_ij for 1−4 interactions.
In the GB implicit solvent model, the effect of a surrounding solvent is described via a continuum electrostatics model that uses a pairwise descreening approximation and in general also includes a Debye−Hückel term to account for salt effects at low salt concentrations. The general form of the correction to the potential energy of the solute is given as

\Delta G_{\mathrm{GB}} = -\frac{1}{2} \sum_{i,j} \left(1 - \frac{e^{-\kappa f_{ij}^{\mathrm{GB}}}}{\varepsilon_r}\right) \frac{q_i q_j}{4\pi\varepsilon_0 f_{ij}^{\mathrm{GB}}}    (3)

where q_i and q_j are the atomic partial charges, ε_r is the relative permittivity of the solvent, and κ is the Debye−Hückel screening parameter.38 The function f_ij^GB interpolates between an effective Born radius R_i when the distance r_ij between atoms is short and r_ij at large distances according to

f_{ij}^{\mathrm{GB}} = \left[r_{ij}^2 + R_i R_j \exp\!\left(-\frac{r_{ij}^2}{4 R_i R_j}\right)\right]^{1/2}    (4)

The effective Born radius R_i reflects how deeply buried a charge is in the low-dielectric medium (solute), and it depends on the intrinsic radius ρ_i of an atom and the relative positions and intrinsic radii of all other atoms in the solute. The models implemented in PMEMD differ in the intrinsic radii and the function that is used to determine the effective Born radii.
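For a single pair, eqs 3 and 4 reduce to the short function sketched below (illustrative only; R_i and R_j are assumed to have been computed already, and the charges are again assumed to be in internal units that absorb 1/(4πε_0)).

// Sketch of eqs 3 and 4 for a single atom pair: the smoothing function
// f_ij^GB and the corresponding GB energy contribution, i.e. one (i,j)
// term of the double sum including the -1/2 prefactor.
__device__ double gb_pair_energy(double rij, double Ri, double Rj,
                                 double qi, double qj,
                                 double eps_r,   /* solvent dielectric */
                                 double kappa)   /* Debye-Hueckel screening */
{
    // Eq 4: f_GB = sqrt(r^2 + Ri*Rj*exp(-r^2/(4*Ri*Rj)))
    double r2  = rij * rij;
    double fgb = sqrt(r2 + Ri * Rj * exp(-r2 / (4.0 * Ri * Rj)));

    // Eq 3: -(1/2) * (1 - exp(-kappa*f_GB)/eps_r) * q_i q_j / f_GB
    double screen = 1.0 - exp(-kappa * fgb) / eps_r;
    return -0.5 * screen * qi * qj / fgb;
}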
The remainder of this section discusses key aspects of the GPU implementation of these equations. This includes design features to maximize performance while preserving accuracy as well as addressing issues of reproducibility which are critical when moving to longer time scale simulations.
Integration with the Existing Code Base. One of the initial ideas we considered when implementing GPU acceleration for AMBER was to write an entirely new code from scratch in a combination of C++ and CUDA. However, this was quickly dismissed for a number of reasons. In particular, we wanted to
1. keep the maintenance of the AMBER code base simple
2. minimize the amount of coding required
3. simplify the way in which features are ported to the GPU
4. maintain backwards compatibility with existing input files and regression tests.
The approach we ultimately chose was to utilize the existing Fortran code base in PMEMD and extend it with calls to specific CUDA kernels for the GPU acceleration, with the CUDA code protected with #IFDEF CUDA preprocessor directives. Building and testing is automated within the existing installation procedure.
A series of GPU synchronization routines provide an abstract way to copy relevant data to and from the GPU memory, for example, gpu_upload_vel() and gpu_download_vel() for atomic velocities. A complete list of synchronization routines is provided in the Supporting Information. For performance, we have implemented the entire MD algorithm on the GPU, which
means uploads (to GPU memory) are only needed at the beginning of a run and downloads (to CPU memory) are only needed when I/O is required, for example, downloading the coordinates to write to the trajectory file. It remains nevertheless simple to add new features to the code in a way that will work for both the CPU and GPU (see the Supporting Information for an example). While this limits the performance for the new feature due to repeated up- and downloads, it does provide a mechanism by which new features can be easily added and tested with the GPU code before writing an additional GPU kernel.
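The synchronization routines themselves are essentially thin wrappers around host-device copies. The following is a hypothetical C/CUDA analogue of a routine pair such as gpu_upload_vel()/gpu_download_vel(); the real PMEMD routines are Fortran-callable and carry additional bookkeeping not shown here.

#include <cuda_runtime.h>

// Hypothetical analogue of gpu_upload_vel()/gpu_download_vel(). The device
// buffer is assumed to have been allocated elsewhere with cudaMalloc.
static double *d_vel  = 0;   // device copy of the velocities
static int     n_atoms = 0;

void gpu_upload_vel_sketch(const double *h_vel)
{
    // Host -> device copy of the 3*N velocity components.
    cudaMemcpy(d_vel, h_vel, 3 * n_atoms * sizeof(double),
               cudaMemcpyHostToDevice);
}

void gpu_download_vel_sketch(double *h_vel)
{
    // Device -> host copy, needed only when the CPU code performs I/O.
    cudaMemcpy(h_vel, d_vel, 3 * n_atoms * sizeof(double),
               cudaMemcpyDeviceToHost);
}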
Reproducibility. In order to achieve performance in parallel on CPUs, it has always been necessary to include complex dynamic load balancing algorithms and extensive use of asynchronous communication within the implementation. This arises from the fact that the CPUs are expected to perform a number of tasks above and beyond running the underlying simulation. For example, the operating systems run multiple daemons controlling I/O and various other OS related tasks. In addition, the latency between any two CPUs in a large parallel run can span an order of magnitude or more. As a result, the various MPI threads become desynchronized. Load balancing and asynchronous communication are used to minimize the resulting effect. A downside of this is that the order of operations is not well-defined in the CPU implementation. This results in rounding differences between otherwise identical runs. Due to the chaotic nature of numerically integrating Newton's equations of motion, this means that two initially identical simulations will always decorrelate on time scales of a few nanoseconds or less. There is nothing inherently wrong with this; the two simulations are just exploring two different regions of phase space. As mentioned previously, this does, however, pose a problem when one starts to routinely run simulations on the microsecond time scale since it can be necessary to return to a given point in a simulation and repeat it exactly while saving the output more frequently. This is something that Shaw et al. attempted to address with a bit-wise reversible integrator50 based on scaled integers. However, this did not solve the issue when using the SHAKE algorithm. The architectural design of the GPU, with many floating point units controlled by a single instruction unit, and the fact that the GPU is not required to time slice with OS related tasks, means that work can be divided between GPU threads and indeed entire GPUs in a predetermined fashion without detrimentally impacting performance. This can be exploited by careful programming and bookkeeping to ensure that the order of floating point operations is always predefined at each given time step. The GPU implementation of PMEMD has been designed such that any two simulations with identical starting conditions run on the same GPU model will always be bit-wise identical. This is extremely useful for debugging purposes, for example, for detecting shared memory race conditions, and has also been used to determine that the use of ECC on GPUs has little to no impact on the reliability of AMBER simulations.51
For Anderson and Langevin thermostatted simulations, it is necessary for the random number generator (RNG) to also be perfectly deterministic in order for any two initially identical simulations to be reproducible. To this end, we use the parallelized RNG that is implemented in the CURAND library and available with the CUDA Toolkits.
Precision Model. In AMBER, all of the traditional CPU codes are written entirely using double precision (DP) floating point arithmetic. This is in contrast to developments by early adopters of GPU technology. For example, the OpenMM library of Pande et al.15 uses single precision (SP) floating point numbers throughout the calculations with the exception of a single double-precision accumulator in the reduction phase of the force accumulation. Use of SP in all places within the code, however, can cause substantial instabilities in the MD simulations. For example, energy conservation in the NVE ensemble can become problematic. While this error can be hidden using tightly coupled thermostats, the true effects of such approximations have not been well characterized.
We distinguish three different precision models in our GPU implementation in which the contributions to the nonbonded forces are calculated in single precision arithmetic, but bonded terms and force accumulation are in double precision (SPDP), or everything is computed and accumulated in single precision (SPSP) or double precision (DPDP). The exception to this is the SHAKE algorithm (see below), which is implemented in DP for all precision models since it involves calculating relative differences in distance on the order of 10^−6 Å. Attempting to use SP for the SHAKE algorithm as implemented leads to numerically unstable simulations. The aim in developing our SPDP precision model for the GPU implementation of PMEMD was to achieve numerical stability during MD simulations equivalent to that of traditional DP implementations but with performance as close to the SPSP model as possible. As highlighted in the subsequent sections, the SPDP model achieves these aims and for this reason is the default precision model used in the GPU implementation of PMEMD, although the other precision models can be chosen if desired.
chosen if desired.
few nanoseconds or less. There is nothing inherently wrong
Nonbonded Interactions. Ourapproachtothecalculationof
with this; the two simulations are just exploring two different
nonbonded interactions is similar to that described in Friedrichs
regions of phase space. As mentioned previously, this does,
etal.15havingbeendevelopedatthesametimeandwithoverlapp-
however, pose a problem when one starts to routinely run
ing authors but with several differences and additional optimiza-
simulations on the microsecond time scale since it can be
tions, including the way in which accumulations are handled.
necessarytoreturntoagivenpointinasimulationandrepeatit
Thealgorithmusedforthecalculationofthenonbondforces
exactly while saving the output more frequently. This is
somethingthatShawetal.attemptedtoaddresswithabit-wise is identical for all three precision models. The pairwise
reversible integrator50 based on scaled integers. However, this interactions between atoms i and j, which can schematically
did notsolve the issue when usingthe SHAKE algorithm. The berepresentedbyamatrixasinFigure2,aregroupedtogether
architectural designoftheGPUwithmanyfloatingpointunits
controlled by a single instruction unit, and the fact that the
GPUisnotrequiredtotimeslicewithOSrelatedtasks,means
that work can be divided between GPU threads and indeed
entire GPUs in a predetermined fashion without detrimentally
impacting performance. This can be exploited by careful pro-
grammingandbookkeepingtoensurethattheorderoffloating
point operations is always predefined at each given time step.
TheGPUimplementationofPMEMDhasbeendesignedsuch
that any two simulations with identical starting conditions run
onthesame GPUmodelwillalwaysbe bit-wise identical. This
is extremely useful for debugging purposes, for example, for
detecting shared memory race conditions, and has also been
usedtodeterminethattheuseofECConGPUshaslittletono
impact on the reliability of AMBER simulations.51
For Anderson and Langevin thermostatted simulations, it is
necessary for the random number generator (RNG) to also be
perfectly deterministic in order for any two initially identical
Figure2.Schematicrepresentationofthework-loaddistributionforthe
simulations to be reproducible. To this end, we use the
calculationofnonbondforceswithNatoms.Eachsquarerepresentsthe
parallelized RNG thatis implementedinthe CURAND library
interactions between two atoms i and j for which the resulting forces
and available with the CUDA Toolkits. needtobeevaluated.ThesearegroupedtogetherintilesofsizeW×W
Precision Model. In AMBER, all of the traditional CPU thatareeachassignedtoanindependentwarp.Duetosymmetry,only
codesarewrittenentirelyusingdoubleprecision(DP)floating the blue diagonal tiles and the green off-diagonal tiles need to be
point arithmetic. This is in contrast to developments by early consideredforthecalculation.Fordetails,seethetext.
into tiles of dimension W × W. The evaluation of the nonbond forces for each tile is dynamically assigned to independent warps across all of the Streaming Multiprocessors (SMs) of a given GPU, which are each running a sufficient number of threads to bury operational latency. The reason for doing this is that the GPU schedules work in terms of warps, which effectively perform the same mathematical operation on W values at once. In the case of NVIDIA, all current GPUs process warps as 32 threads wide; in this case, optimum performance is achieved by making W = 32. Since the interaction between two atoms i and j is symmetric, only the blue diagonal tiles and the green off-diagonal tiles need to be considered for the calculation, as described below. Unlike the CPU code, which can just simply loop over atoms i and j, it is crucial to divide up the calculation on the GPU into as many warp size blocks as possible. However, this presents a problem when it comes to accumulating the forces for each atom since the warps can be scheduled for computation in any order. A naïve accumulation of the resulting forces on each atom can thus lead to race conditions and incorrect results. The use of atomic operations within the nonbond routine to avoid this problem, however, represents a serial bottleneck that would be unacceptable in terms of performance. The approach we use is thus to allocate (N/W) + 1 output buffers per atom, where N is the total number of atoms and W is the warp size. In the case of the nonbond force calculation, each output buffer is three double-precision values wide for the SPDP and DPDP precision models and three single-precision values wide for the SPSP precision model, corresponding to the force in the x, y, and z directions of Cartesian coordinate space. Forces calculated within a tile can then be summed for each atom without fear of overwriting memory, effectively providing a separation in space between tiles. At the conclusion of the force calculation, we assign one thread per atom in linear order and cycle through the output buffers, summing them to obtain the total force on each atom.
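The buffer scheme can be sketched as follows; the layout and the reduction kernel shown are illustrative, not the PMEMD implementation.

#define WARP 32   // tile width W

// Sketch of the buffer layout described in the text: each atom owns its own
// set of output slots, one per tile row, so different warps never write to
// the same memory location. Layout and names are illustrative.
__device__ __forceinline__ int buf_index(int atom, int tile, int nbuf)
{
    return atom * nbuf + tile;               // slot "tile" of atom "atom"
}

// After all tiles have finished, one thread per atom sums its buffers.
__global__ void reduce_forces(const double3 *buffers, double3 *force,
                              int natoms, int nbuf)   // nbuf = N/WARP + 1
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= natoms) return;

    double3 f = make_double3(0.0, 0.0, 0.0);
    for (int t = 0; t < nbuf; ++t) {
        double3 b = buffers[buf_index(i, t, nbuf)];
        f.x += b.x;  f.y += b.y;  f.z += b.z;
    }
    force[i] = f;   // total force on atom i
}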
While this approach provides a separation in space between tiles, it is still necessary to deal with race conditions within a tile. We have to distinguish on-diagonal tiles (blue) and off-diagonal tiles (green), see Figure 2. The on-diagonal tiles include the interaction of atoms i to i + W − 1 with themselves and thus include the self-interaction from the GB term. In contrast, off-diagonal tiles include the interaction of atoms i to i + W − 1 with j to j + W − 1, where i ≠ j. To avoid race conditions, it is necessary to handle these two types of tile in different ways.
On-diagonal tiles store two copies of the coordinates and associated parameters for each atom, one set in shared memory (index atom i in Figure 2) and the second set in the registers (index atom j in Figure 2). Each thread in the warp then runs linearly through shared memory (atom j), accumulating only the force acting on the register atoms i in the corresponding force output buffer. As illustrated in Figure 2, on the first iteration, the force on atom i resulting from interacting with itself is obtained and stored in the corresponding register for i, while simultaneously the force on atom i + 1 resulting from interacting with atom i is obtained and stored in the register for atom i + 1. Similarly, the forces on atoms i + 2 to i + W − 1 resulting from interacting with atom i are obtained and stored in the corresponding output buffer. All threads then step to the next column to obtain the forces on atoms i to i + W − 1 due to interactions with atom i + 1, etc. This approach does some excessive computational work since, for any given atom pair ij, the force on atom j is just the negative of the force on atom i, and thus only the upper or lower triangle of the tile would have to be considered. However, this is far outweighed by the advantage of avoiding race conditions that would result from several threads updating their force contribution to the same atom.
In the case of off-diagonal tiles, a different approach is taken such that the symmetry in the interactions is exploited while avoiding race conditions. Again, the coordinates and associated parameters of atoms labeled j to j + W − 1 are placed in the registers, while the coordinates and associated parameters of atoms i to i + W − 1 are placed in shared memory. If the off-diagonal tiles were handled in the same fashion as the on-diagonal tiles, on the first iteration, the force on atom j due to interacting with atom i would be calculated and from this the corresponding force on atom i obtained by negation. At the same time, the force on atom j + 1 due to interacting with atom i would be calculated and from this the corresponding force on atom i obtained by negation. Thus, the forces on atom i due to all atoms j of the tile would be calculated at the same time, which would require atomic operations to accumulate correctly. The solution to this problem that is implemented in AMBER is to partition the off-diagonal tiles in time as illustrated in Figure 2. By starting each thread offset by its thread ID, we avoid race conditions in the accumulation and eliminate the need for performance destroying atomic operations.
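The staggered traversal of an off-diagonal tile can be sketched as follows (illustrative only; the pair force is reduced to a placeholder, and lockstep, warp-synchronous execution of the 32 lanes is assumed, as is the case on the hardware discussed here).

#define WARP 32

// Sketch of the "partition in time" of one off-diagonal W x W tile: lane k
// holds register atom j+k and starts at shared-memory column k, walking the
// columns cyclically. At every step the W lanes touch W distinct shared
// atoms i, so the Newton's-third-law update needs no atomic operations.
__device__ void off_diagonal_tile(int lane,            // threadIdx.x % WARP
                                  float3 xj,           // register atom j + lane
                                  const float3 *xi_sh, // atoms i..i+W-1 (shared)
                                  float3 *fj,          // force on register atom
                                  float3 *fi_sh)       // forces on shared atoms
{
    for (int step = 0; step < WARP; ++step) {
        int col = (lane + step) & (WARP - 1);   // staggered column index
        float3 xi = xi_sh[col];

        // Placeholder pair force (unit charges, Coulomb-like form).
        float dx = xj.x - xi.x, dy = xj.y - xi.y, dz = xj.z - xi.z;
        float s  = rsqrtf(dx * dx + dy * dy + dz * dz);
        s = s * s * s;
        float3 f = make_float3(s * dx, s * dy, s * dz);

        // j-side accumulation stays in this lane's registers ...
        fj->x += f.x;  fj->y += f.y;  fj->z += f.z;
        // ... and shared slot "col" is owned by exactly one lane per step.
        fi_sh[col].x -= f.x;  fi_sh[col].y -= f.y;  fi_sh[col].z -= f.z;
    }
}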
Generalized Born Terms. The GB terms are calculated in an identical fashion to the nonbond terms described above. The Born radii are first calculated within their own kernel and reduced to per atom radii in an analogous fashion to how the nonbond forces are handled. The remainder of the nonbond calculation is then split over two additional kernels.
Bonded and 1−4 Interactions. The bonded and 1−4 terms represent a very small fraction of the computational workload (typically <1% of an iteration), and thus optimization of their calculation is not critical for performance compared with, for example, the nonbond interactions. However, efficient calculation on a GPU still requires some consideration in how these terms are calculated in order to exploit the massive parallelism within the GPU while avoiding race conditions and memory overwrites. Our implementation simply creates a list of the interactions, sorted by type, and then divides up the bonded and 1−4 terms across SMs on a per interaction basis. Since the resulting forces need to be summed for each atom, there is the potential for a race condition to occur during this reduction if two or more GPU threads attempt to accumulate forces for the same atom at the same time in global memory. Our initial attempts to avoid this used atomic operations for reduction of the resulting forces to individual atoms but showed very poor performance. Our solution to this is to make use of the tile reduction buffers from the nonbond calculation as described above and then sum these forces in the reduction step used in the nonbond calculation.
Harmonic Restraints. At the time of writing, the AMBER GPU implementation supports harmonic restraints to a reference structure. These are handled in an identical fashion to the bonded interactions, being calculated as a bond between an atom and a fixed virtual particle representing the reference structure.
SHAKE Algorithm. The SHAKE implementation is currently restricted to hydrogen atoms since this represents >99% of the types of simulation AMBER users run. This restriction has the benefit that each heavy atom and its attached hydrogen atoms
can thus be handled independently. Our approach is to simply spawn a thread for each heavy atom during the SHAKE calculation. This provides sufficient parallelization to keep the GPU busy. The use of double precision entirely within the SHAKE routine provides sufficient precision to handle SHAKE tolerances on the order of 10^−7 Å and thus avoids complexities involved in attempting to implement SHAKE constraints in an internal coordinate representation as would be required if SHAKE was to be carried out in single precision.
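The parallelization pattern, one thread per heavy atom working entirely in double precision, is sketched below; the constraint iteration itself is schematic (the heavy atom is held fixed for brevity) and is not PMEMD's production SHAKE code.

// Sketch of the SHAKE parallelization only: one thread per heavy atom
// adjusts the bonds to its attached hydrogens, entirely in double precision.
__global__ void shake_hydrogens(double3 *x,           // constrained positions
                                const double3 *xold,  // positions before the step
                                const int *heavy,     // heavy-atom index per group
                                const int *hydrogen,  // first hydrogen per group
                                const int *nh,        // hydrogens in each group
                                const double *d2,     // target bond length^2
                                int ngroups, double tol, int maxit)
{
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= ngroups) return;
    int ia = heavy[g];

    for (int it = 0; it < maxit; ++it) {
        double worst = 0.0;
        for (int k = 0; k < nh[g]; ++k) {
            int ib = hydrogen[g] + k;
            double dx = x[ib].x - x[ia].x;
            double dy = x[ib].y - x[ia].y;
            double dz = x[ib].z - x[ia].z;
            double diff = d2[g] - (dx * dx + dy * dy + dz * dz);
            worst = fmax(worst, fabs(diff));
            // Schematic correction along the old bond direction; real SHAKE
            // moves both atoms with mass weighting.
            double rx = xold[ib].x - xold[ia].x;
            double ry = xold[ib].y - xold[ia].y;
            double rz = xold[ib].z - xold[ia].z;
            double gcoef = diff / (2.0 * (rx * dx + ry * dy + rz * dz));
            x[ib].x += gcoef * rx;  x[ib].y += gcoef * ry;  x[ib].z += gcoef * rz;
        }
        if (worst < tol) break;   // converged to the requested tolerance
    }
}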
Coordinate Update. Integration is carried out on the GPU in a manner analogous to that on the CPU since it is intrinsically parallel, given that each atom can be handled independently. The updated coordinates are stored in global memory on the GPU. The integration is done on the GPU rather than on the CPU to negate the need for expensive copy operations of the coordinates back and forth between CPU and GPU.
Thermostats. As mentioned above, three different thermostats are supported. The Berendsen and Langevin thermostats are applied in the same way during the coordinate update step, while the Anderson thermostat is implemented as a separate kernel that randomizes the velocities at regular intervals. The reason for including the Berendsen and Langevin thermostats within the integration kernel is that they operate on every time step, while the Anderson thermostat is typically called every few hundred time steps. In the case of the Langevin thermostat, a total of 3N random numbers are required on each step, which are extracted from a pool of GPU generated random numbers, stored in global memory, which is refilled as needed using the parallel random number generator implemented within the CURAND library from the CUDA toolkit. The same pool is used for the Anderson thermostat as needed.
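The pool-based use of CURAND can be sketched on the host side as follows; the names and pool size are illustrative, not PMEMD's.

#include <curand.h>
#include <cuda_runtime.h>

// Host-side sketch of the random-number pool: a device buffer of normally
// distributed values is refilled with CURAND whenever the Langevin (or
// Andersen) kernel has consumed it.
static curandGenerator_t gen;
static float  *d_pool   = 0;
static size_t  pool_len = 0;

void rng_pool_init(size_t len, unsigned long long seed)
{
    pool_len = len;
    cudaMalloc((void **)&d_pool, len * sizeof(float));
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, seed);
}

void rng_pool_refill(void)
{
    // Fill the pool with standard normal deviates (mean 0, stddev 1);
    // the thermostat kernel scales them as required.
    curandGenerateNormal(gen, d_pool, pool_len, 0.0f, 1.0f);
}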
Additional Optimizations. In the interest of performance and conserving limited GPU resources, the energy is only calculated alongside the forces when needed. Calculation of the energies causes a small degree of register spillage to global memory as well as additional code execution, and since one only periodically needs these values, typically only every 1 to 2 ps when writing to the output file, it is not necessary to calculate these on every iteration. Additionally, this is one aspect of the algorithm that is allowed to be nondeterministic at the level of double-precision round-off error. We allow this because the energy values have no effect on the trajectory of the simulation. However, it was disturbing to observe this in action in the SPSP mode, where nondeterminism was initially allowed at the level of single-precision round-off error. This was harmless as the forces remained consistent, but in the interest of avoiding confusion among end-users, with regression tests showing differences for repeat runs in SPSP mode, all energy summation was promoted to double precision, where differences in round-off of the energies are seen only once in approximately 10^6 printed interactions.
Parallelization. The parallel GPU implementation is currently written exclusively using MPI. The reasons for this were to maximize portability of the code and avoid hardware-induced complexities such as the fact that inter-GPU communication as implemented in CUDA 4.0 requires all GPUs to be on the same PCIe controller. Support for GPU to GPU copies in different nodes is also not yet sufficiently mature to be exploited in a widely used software package. For these reasons, the current parallel GPU scaling should be considered as a lower bound on performance that will likely improve considerably in subsequent versions of AMBER running on next generation hardware.
Unlike the CPU MPI implementation, the GPU MPI implementation is fully deterministic for a given number of nodes and GPUs. At present, our GPU implementation performs significant load-balancing between the SMs within each GPU rather than between GPUs, which makes it possible to produce a deterministic implementation. Achieving good parallel scaling on CPU clusters has always required the use of extensive load-balancing due to the noise introduced from the operating system sharing the CPU resources on a node. GPUs on the other hand provide significantly more stable performance, and thus load balancing between GPUs is not as critical.
In the current version of the software (AMBER v11), the GPU parallel support for GB calculations is implemented by dividing the force calculation evenly and linearly across all GPUs. For M GPUs running a simulation consisting of N atoms, GPU i calculates all forces for atoms (i − 1) × N/M + 1 to i × N/M. Since GB is a full O(N^2) calculation, this introduces a small degree of redundancy at the GPU level, with a worst-case outcome of doing twice the overall calculation, but usually much less than this as long as N ≫ M. As currently implemented, there are three stages to the calculation in parallel, each in need of synchronizing force data across all GPUs. The procedure used consists of three sequential MPI_allGather operations per iteration to merge Born radii, Born force, and general force data. As the hardware evolves and GPUs can reliably communicate directly with each other within a node, we intend to scrap the internode MPI communication and instead replace it with direct peer to peer copies between GPUs.
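One of these synchronizations can be sketched as follows, assuming for simplicity that N is divisible by M and that the data are staged through host memory as described in section 2.4; buffer names are illustrative, and the same pattern would apply to the Born radii and Born force arrays.

#include <mpi.h>
#include <cuda_runtime.h>

// Sketch of one per-iteration synchronization: each rank owns the
// contiguous slice of N/M atoms it computed, and an MPI_Allgather gives
// every rank the full force array.
void merge_forces(double *d_force,  // device array, 3*N doubles
                  double *h_force,  // host staging buffer, 3*N doubles
                  int natoms, int rank, int nranks, MPI_Comm comm)
{
    int chunk = 3 * (natoms / nranks);          // doubles per rank

    // Stage this rank's slice through host memory.
    cudaMemcpy(h_force + rank * chunk, d_force + rank * chunk,
               chunk * sizeof(double), cudaMemcpyDeviceToHost);

    // Gather every rank's slice into the full host array on all ranks.
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  h_force, chunk, MPI_DOUBLE, comm);

    // Push the merged array back to the GPU.
    cudaMemcpy(d_force, h_force, 3 * natoms * sizeof(double),
               cudaMemcpyHostToDevice);
}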
4. PERFORMANCE
To assess both the serial and parallel performance that can be achieved with GPU accelerated AMBER, we ran a series of MD simulations representative of typical research scenarios using the GPU and CPU implementations on a variety of hardware. The systems used consisted of partially folded TRPCage52 (304 atoms), ubiquitin53,54 (1231 atoms, PDB code 1UBQ), apo-myoglobin (2492 atoms), and nucleosome (25095 atoms, PDB code 1KX5).
In all simulations, the ff99SB48 version of the AMBER force field was used with a time step of 2 fs and bonds to hydrogen atoms constrained using the SHAKE algorithm. The Hawkins, Cramer, and Truhlar GB model37 (AMBER input igb=1) was used with no cutoff applied to the nonbonded interactions and a cutoff of 15 Å for the calculation of the effective GB radii. The output and trajectory files were written every 1000 steps (2 ps). Input files for these simulations are provided in the Supporting Information.
The software base used for all simulations was AMBER version 11, including patches 1−15 for AmberTools and patches 1−17 for AMBER, which were released on August 18, 2011.55 The executables were built under the RedHat Enterprise Linux 5 operating system with Intel compiler version 11.1.069, the Intel MKL library version 10.1.1.019, and the NVIDIA CUDA compiler version 4.0. MVAPICH2 version 1.5 was used for the parallel runs for both the CPU and GPU versions of the code. The default SPDP precision model was used in the GPU code. While the Fortran compiler and use of MKL have little impact on the GPU code performance, this compiler/library combination gives the best performance for
the CPU code and thus forms a fair basis when assessing the capabilities of the GPU implementation. Both the CPU and GPU simulations were run on machines equipped with a SuperMicro X8DTG-DF motherboard with dual hex-core Intel X5670 processors clocked at 2.93 GHz and Mellanox quadruple data rate (QDR) Infiniband interconnects. Each compute node was equipped with one GPU and one Infiniband card, which were each connected to a PCIe x16 slot. Error-correction code (ECC) was switched off on the Tesla cards, and the 64-bit NVIDIA Linux driver version 275.21 was used. Additionally, CPU simulations were run on the XSEDE Trestles supercomputer at the San Diego Supercomputer Center, whose nodes are equipped with four oct-core AMD Opteron 6136 processors clocked at 2.4 GHz and also interconnected with QDR Infiniband.
Serial GPU Performance. The results of the simulation timings for single GPU runs are summarized in Table 2.

Table 2. Single GPU Throughput Timings (ns/day) for AMBER GB Simulations with a Time Step of 2 fs Using the Parallel CPU Version on One Node (12 Intel X5670 Cores or 32 AMD Opteron 6136 Cores) and the Serial GPU Version with the SPDP Precision Model on One Node (One Intel X5670 Core and One GPU)^a

CPU/GPU                      TRPCage (304 atoms)   ubiquitin (1231 atoms)   apo-myoglobin (2492 atoms)   nucleosome (25095 atoms)
GPU version:
M2090 (6 GB)                 399.9                 184.2                    78.1                         1.42
C2070 (6 GB)                 364.1                 157.2                    64.3                         1.09
C1060 (4 GB)                 234.6                 78.3                     31.5                         0.40
GTX580 (1.5 GB, PNY XLR8)    471.1                 215.9                    88.7                         −
CPU version:
32 × Opteron 6136            225.0^b               29.9                     10.3                         0.08
12 × X5670                   247.1                 19.8                     6.6                          0.07

^a For details on the hardware and software stack, see text. A dash indicates insufficient GPU memory for the simulation. ^b The CPU code requires >10 atoms per core, and thus the TRPCage simulation was run on 24 CPU cores.

For small simulations such as TRPCage with 304 atoms, where the GPU's arithmetic hardware cannot be fully utilized, the serial GPU implementation is approximately twice as fast as a current state of the art CPU node. The real advantage of the AMBER GPU implementation, however, becomes apparent for implicit solvent GB simulations of medium to large systems with 2500 to 25000 atoms. For apo-myoglobin (2492 atoms), the serial GPU version of the code is significantly faster on a single GPU than a state of the art CPU node. For nucleosome, this performance gap increases dramatically such that a simulation that takes two weeks on a single desktop machine with one GPU would take nine months using the same desktop machine without GPU acceleration.
Parallel GPU Performance. The parallel performance of the GPU and CPU implementations is compared in Table 3.

Table 3. Multi-GPU Throughput Timings (ns/day) for AMBER GB Simulations with a Time Step of 2 fs Using the Parallel CPU Version (12 Intel X5670 Cores or 32 AMD Opteron 6136 Cores on Each Node) and the Parallel GPU Version with the SPDP Precision Model (One Intel X5670 Core and One GPU Per Node)^a

CPU/GPU                apo-myoglobin (2492 atoms)   nucleosome (25095 atoms)
GPU version:
8 × M2090              135.1                        3.95
4 × M2090              115.0                        2.71
2 × M2090              93.1                         1.80
1 × M2090              78.1                         1.42
CPU version:
2048 × Opteron 6136    −                            0.53
1024 × Opteron 6136    −                            0.78
512 × Opteron 6136     −                            0.65
256 × Opteron 6136     −                            0.55
128 × Opteron 6136     29.8                         0.31
64 × Opteron 6136      18.3                         0.17
32 × Opteron 6136      10.3                         0.08
12 × X5670             6.6                          0.07

^a For details on the hardware and software stack, see the text. A dash indicates lower speed than with fewer nodes.

A throughput of up to 135.1 ns/day for apo-myoglobin and up to 3.95 ns/day for nucleosome can be achieved with eight NVIDIA M2090 GPUs, which is a factor of 4 to 5 faster than the maximum throughput that can be achieved on a typical supercomputer, given that the CPU scaling plateaus long before it reaches the performance achieved by the GPU implementation. Even a single GPU is a factor of 2 to 3 faster than the CPU scaling limit.
Precision Model Performance. As mentioned above, the AMBER GPU implementation supports three different precision models. The DPDP model is logically equivalent to the traditional DP approach used on the CPU. However, the use of DP on GPUs can be more detrimental to performance than its use on CPUs. Therefore, the desire to achieve high performance prompts for use of SP where possible. It is critical, though, to use SP carefully in order not to detrimentally affect the accuracy of the MD. For this reason, we designed our hybrid precision SPDP model to achieve performance very close to that achievable with SP but in a manner that does not impact accuracy. While we recommend using the SPDP precision model, it is instructive to compare the performance of the different precision models available in the AMBER GPU implementation. As shown in Table 4, the single precision (SPSP) model achieves the highest performance as expected but, as highlighted in section 5, at the cost of accuracy. Our hybrid SPDP precision model, however, achieves a performance of greater than 75% of the SPSP precision model with accuracy comparable to that of the full double precision (DPDP) model, which is between 4 and 7 times slower.

Table 4. Throughput Timings (ns/day) for AMBER GB Simulations of Apo-Myoglobin (2492 atoms) with a Time Step of 2 fs Using the Serial GPU Version with Different Precision Models^a

precision model              SPSP    SPDP    DPDP
M2090 (6 GB)                 92.7    78.1    25.8
C2070 (6 GB)                 73.7    64.3    20.5
C1060 (4 GB)                 41.2    31.5    5.4
GTX580 (1.5 GB, PNY XLR8)    111.4   88.7    16.0

^a For details on the hardware and software stack, see the text.
5. VALIDATION
A critical and often overlooked aspect of any major change to a widely used scientific software package is a detailed validation of any and all new approximations made. In the case of the AMBER GPU implementation, the validation falls into two distinct categories. The first is a detailed testing of the code itself to ensure that it correctly simulates the existing systems and does not introduce any new bugs or contain critical logic errors. This is the key reason why we implemented a full double precision (DPDP) model within the GPU code. This precision model matches the CPU code to machine precision and thus can be used to check that the GPU code is indeed giving the correct answers. This allowed us to take the CPU-based regression tests that we have used for many years to validate an AMBER installation and run them using the GPU code and obtain answers to the limits of machine precision. More complex, however, is the second category for validation, which is the careful testing of our hybrid (SPDP) precision model. This is significantly more complex since it requires a careful evaluation of any subtle differences in the dynamics as well as ensemble properties from converged simulations run in DPDP and SPDP precision. In this section, we attempt to extensively validate our SPDP model, comparing a number of key observables across all three, DPDP, SPDP, and SPSP, precision models.
5.1. Single Point Forces. The first metric tested was the effect of the numerical precision model on the force calculations as compared to a reference precision model, in this case, the result from the CPU implementation. We are using the starting structures of the test systems described in section 4 above. The deviations in the forces are summarized in Table 5.

Table 5. Deviations of Forces (in kcal/(mol Å)) of the AMBER PMEMD GPU Implementation Using Different Precision Models As Compared to Reference Values Obtained with the CPU Implementation

precision model   TRPCage (304 atoms)   ubiquitin (1231 atoms)   apo-myoglobin (2492 atoms)   nucleosome (25095 atoms)
max deviation:
SPSP              3.0 × 10^−3           4.8 × 10^−3              4.2 × 10^−3                  2.7 × 10^−2
SPDP              5.6 × 10^−5           3.7 × 10^−4              1.6 × 10^−4                  1.1 × 10^−3
DPDP              1.1 × 10^−8           7.3 × 10^−8              3.4 × 10^−8                  8.0 × 10^−8
RMS deviation:
SPSP              5.0 × 10^−4           6.1 × 10^−4              4.1 × 10^−4                  1.5 × 10^−3
SPDP              7.0 × 10^−6           1.5 × 10^−5              8.1 × 10^−6                  3.0 × 10^−5
DPDP              1.5 × 10^−9           3.6 × 10^−9              2.6 × 10^−9                  3.2 × 10^−9

The DPDP model matches the reference forces very closely, with maximum deviations not exceeding 10^−7 kcal/(mol Å) and RMS deviations not exceeding 10^−8 kcal/(mol Å), even for systems as large as nucleosome. These deviations are entirely due to the different order of execution of the floating point operations in the CPU and GPU implementation. The DPDP GPU implementation of PMEMD will thus generate trajectories of precision equivalent to the CPU implementation. The forces obtained from the SPSP model, however, show very large deviations from the reference values, on the order of up to 10^−2 kcal/(mol Å) for the largest system studied. Such large deviations introduce significant numerical noise into the simulations, and it is reasonable to expect that this may render the results of MD simulations questionable. Calculating the force contributions in SP and accumulating them in DP, however, leads to an improvement of more than 1 order of magnitude in all cases for the SPDP model as compared to the SPSP model. In the following section, we will indeed show that the forces obtained from the SPSP model are not precise enough to conserve energy in biomolecular MD simulations and can lead to instabilities in long time scale MD simulations, while the SPDP model is sufficient for this purpose.
5.2. Energy Conservation. One of the most important gauges for judging the precision of MD software is its numerical stability, which is reflected in its ability to conserve the constants of motion, in particular the energy. To this end, we have performed constant energy MD simulations of the first three test systems described in section 4 above. We collected data for 100 ns (TRPCage), 50 ns (ubiquitin), and 20 ns (apo-myoglobin) after an initial equilibration for 1 ns at 300 K using Langevin dynamics. Center of mass motion was removed before starting the constant energy runs. In addition to simulations using a time step of 2 fs with bonds to hydrogen atoms constrained using the SHAKE algorithm with a relative geometrical tolerance of 10^−6, we have run simulations using time steps of 0.5 and 1.0 fs without constraints.
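The drift values reported below are expressed in kT per nanosecond per degree of freedom. One plausible way to obtain such a number from a stored total-energy time series is shown below purely as an illustration, since the fitting procedure is not spelled out in the text: take the least-squares slope of E(t) and divide by N_dof kT.

// Illustration only (the exact fitting procedure used for Table 6 is an
// assumption): least-squares slope of the total energy in kcal/mol/ns,
// divided by the number of degrees of freedom and by kT.
double energy_drift_per_dof(const double *energy, const double *time_ns,
                            int nframes, int ndof, double temperature)
{
    const double kB = 0.0019872041;        // Boltzmann constant, kcal/(mol K)
    double st = 0.0, se = 0.0, stt = 0.0, ste = 0.0;
    for (int i = 0; i < nframes; ++i) {
        st  += time_ns[i];
        se  += energy[i];
        stt += time_ns[i] * time_ns[i];
        ste += time_ns[i] * energy[i];
    }
    double slope = (nframes * ste - st * se) / (nframes * stt - st * st);
    return slope / (ndof * kB * temperature);   // kT per ns per dof
}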
The energy drifts along the trajectories are summarized in Table 6, while Figure 3 shows a plot of the total energy for the trajectories of TRPCage and ubiquitin for the different precision models at a time step of 0.5 fs.

Table 6. Energy Drifts Per Degree of Freedom (kT/ns/dof) from Simulations of 100 ns (TRPCage), 50 ns (Ubiquitin), and 20 ns (Apo-Myoglobin)^a

time step       0.5 fs      1.0 fs      2.0 fs
TRPCage (304 atoms):
CPU             0.000006    0.000066    0.000355
GPU (DPDP)      0.000012    0.000082    0.000382
GPU (SPDP)      0.000003    0.000070    0.000222
GPU (SPSP)      0.000184    0.000252    −
ubiquitin (1231 atoms):
CPU             0.000004    0.000011    −0.000216
GPU (DPDP)      0.000001    0.000006    −0.000247
GPU (SPDP)      0.000003    0.000030    −0.000165
GPU (SPSP)      0.001065    0.000305    −
apo-myoglobin (2492 atoms):
CPU             0.000012    0.000094    0.000416
GPU (DPDP)      −0.000004   0.000117    0.000290
GPU (SPDP)      0.000019    0.000185    0.000139
GPU (SPSP)      0.002230    0.000442*   −

^a The SHAKE algorithm to constrain bond lengths to hydrogen atoms was used for a time step of 2.0 fs; no constraints were used for smaller time steps. A dash indicates that the system heated up extremely during the simulation, to the point that it is meaningless to report an energy drift. An asterisk indicates that the energy drift increases dramatically for longer time scales.

The corresponding plot for apo-myoglobin is very similar to that for ubiquitin and can be found in the Supporting Information along with plots for the larger time steps. The plots underline that it is important to validate the precision models for trajectories that are long enough to uncover numerical instabilities. While the SPSP model seems to perform reasonably well for a trajectory of 1 ns length, in particular for smaller systems like TRPCage, it becomes apparent that the errors introduced lead to unacceptably large energy drifts in the long term. Apart from the magnitude in the