GPU-Based Conjugate Gradient Solver for Lattice QCD with Domain-Wall Fermions

arXiv:1101.0423v1 [hep-lat] 2 Jan 2011

Ting-Wai Chiu^{1,2}, Tung-Han Hsieh^3, Yao-Yuan Mao^{*,1}, Kenji Ogawa^1 (for the TWQCD Collaboration)

1 Department of Physics, and Center for Theoretical Sciences, National Taiwan University, Taipei 10617, Taiwan
2 Center for Quantum Science and Engineering, National Taiwan University, Taipei 10617, Taiwan
3 Research Center for Applied Sciences, Academia Sinica, Taipei 115, Taiwan

We present the first GPU-based conjugate gradient (CG) solver for lattice QCD with domain-wall fermions (DWF). It is well known that CG is the most time-consuming part in the Hybrid Monte Carlo simulation of unquenched lattice QCD, which becomes even more computationally demanding for lattice QCD with exact chiral symmetry. We have designed a CG solver for the general 5-dimensional DWF operator on the NVIDIA® CUDA™ architecture with mixed precision, using the defect correction as well as the reliable updates algorithms. We optimize our computation by even-odd preconditioning on the 4D space-time lattice, plus several innovative techniques for the CUDA kernels. For NVIDIA GeForce® GTX 285/480, our CG solver attains 180/233 Gflops (sustained).

The XXVIII International Symposium on Lattice Field Theory, Lattice 2010
June 14-19, 2010
Villasimius, Italy

* Speaker.
1. Introduction
Simulation of unquenched lattice QCD with exact chiral symmetry is a grand challenge among all sciences. Even for a modest $16^3 \times 32$ lattice with lattice spacing $a \sim 0.1$ fm, it often requires a supercomputer with peak computing power of more than 50 Teraflops (e.g., 10 racks of IBM BlueGene/L). Therefore, only 2-3 lattice QCD groups around the world could afford to perform the simulation of unquenched lattice QCD with the domain-wall fermion [1] or the overlap-Dirac fermion [2]. However, this scenario has been undergoing a dramatic change during the last 12 months. With the emergence of the low-cost and computationally powerful Graphics Processing Unit (GPU), plugging a graphics card with an NVIDIA GTX 285 (240 cores, one Teraflop peak) into a PC immediately turns the system into a powerful device, attaining a sustained 180 Gflops for the conjugate gradient (CG) with mixed precision.

Since 2009, the Taiwan Lattice QCD Collaboration (TWQCD) has been using a GPU cluster (currently consisting of 250 NVIDIA GPUs with 40 Teraflops (sustained)) to simulate unquenched lattice QCD with optimal domain-wall quarks [3, 4]. We have met the challenge of preserving the chiral symmetry to a high precision and sampling all topological sectors ergodically. For our recent physical results on two-flavor QCD, we refer the readers to [5] and [6].
In this paper, we present our design of the first CG solver for the general 5-dimensional DWF operator on the NVIDIA CUDA architecture with mixed precision, using the defect correction as well as the reliable updates algorithms. Our CG solver is optimized with even-odd preconditioning on the 4-dimensional space-time lattice, plus several innovative tuning techniques for the CUDA kernels. For NVIDIA GeForce GTX 285/480, our CG solver achieves 180/233 Gflops (sustained).
2. Conjugate Gradient Solver for Domain-Wall Fermions

2.1 Optimal Domain-Wall Fermions

Mathematically, for a given $N_s$ (the number of sites in the fifth dimension), the maximal chiral symmetry can be attained by the optimal domain-wall fermion (ODWF) [3] with the operator

$$[\mathcal{D}(m_q)]_{xx';ss'} = (\omega_s D_w + 1)_{xx'}\,\delta_{ss'} + (\sigma_s D_w - 1)_{xx'}\,L_{ss'}, \quad (\sigma_s = \omega_s). \qquad (2.1)$$

Here $D_w$ is the standard Wilson-Dirac operator plus a negative parameter $-m_0$ ($0 < m_0 < 2$),

$$(D_w)_{xx'} = -\frac{1}{2}\sum_{\mu}\left[(1-\gamma_\mu)\,U_\mu(x)\,\delta_{x+\hat\mu,x'} + (1+\gamma_\mu)\,U_\mu^\dagger(x')\,\delta_{x-\hat\mu,x'}\right] + (d - m_0), \qquad (2.2)$$

where $U_\mu(x)$ denotes the link variable, $d$ is the dimension of the space-time ($d = 4$ for QCD),

$$L = P_+ L_+ + P_- L_-, \qquad P_\pm = (1 \pm \gamma_5)/2, \qquad (2.3)$$

and

$$(L_+)_{ss'} = \begin{cases} \delta_{s-1,s'}, & 1 < s \le N_s \\ -(m_q/2m_0)\,\delta_{N_s,s'}, & s = 1 \end{cases}, \qquad L_- = (L_+)^{T}. \qquad (2.4)$$

The weights $\{\omega_s\}$ along the fifth dimension are fixed according to the formula derived in [3] such that the maximal chiral symmetry is attained. In general, for other DWF with non-maximal chiral symmetry, the weights $\{\sigma_s\}$ and $\{\omega_s\}$ have different values, e.g., for the conventional (Shamir) DWF, $\sigma_s = 0$, $\omega_s = 1$, $\forall s$; and for the Borici DWF [7], $\sigma_s = \omega_s = 1$, $\forall s$.
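To make the structure of Eq. (2.4) concrete, here is a small worked example (ours, not in the original text) for $N_s = 4$:

$$L_+ = \begin{pmatrix} 0 & 0 & 0 & -m_q/2m_0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \qquad L_- = (L_+)^T,$$

i.e., $L_+$ shifts a field by one site along the fifth dimension, with the two boundaries ($s = 1$ and $s = N_s$) coupled through the quark mass $m_q$.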
2.2 Even-Odd Preconditioning

Since $D_w$ commutes with $(\omega)_{ss'} \equiv \omega_s \delta_{ss'}$ and $(\sigma)_{ss'} = \sigma_s \delta_{ss'}$, Eq. (2.1) becomes

$$\mathcal{D}(m_q) = D_w(\omega + \sigma L) + (1 - L). \qquad (2.5)$$

Separating the even and the odd sites on the 4D space-time lattice, Eq. (2.5) can be written as

$$\mathcal{D}(m_q) = \begin{pmatrix} d - m_0 & D_w^{EO} \\ D_w^{OE} & d - m_0 \end{pmatrix}(\omega + \sigma L) + (1 - L) = \begin{pmatrix} X & D_w^{EO} Y \\ D_w^{OE} Y & X \end{pmatrix}, \qquad (2.6)$$

where

$$X \equiv (d - m_0)\,\omega(1 + cL) + (1 - L), \qquad Y \equiv \omega(1 + cL), \qquad (c)_{ss'} \equiv (\sigma_s/\omega_s)\,\delta_{ss'}. \qquad (2.7)$$

Now we further rewrite it in a more symmetric form by defining

$$M_5 \equiv \sqrt{\omega}^{-1} Y X^{-1} \sqrt{\omega} = \left[(d - m_0) + \sqrt{\omega}^{-1}(1 - L)(1 + cL)^{-1}\sqrt{\omega}^{-1}\right]^{-1}, \qquad (2.8)$$

and

$$S_1 \equiv \sqrt{\omega}^{-1} Y X^{-1} = M_5 \sqrt{\omega}^{-1}, \qquad S_2 \equiv Y^{-1}\sqrt{\omega}; \qquad (2.9)$$

then Eq. (2.6) becomes

$$\mathcal{D}(m_q) = S_1^{-1}\begin{pmatrix} 1 & M_5 D_w^{EO} \\ M_5 D_w^{OE} & 1 \end{pmatrix} S_2^{-1} = S_1^{-1}\begin{pmatrix} 1 & 0 \\ M_5 D_w^{OE} & 1 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & C \end{pmatrix}\begin{pmatrix} 1 & M_5 D_w^{EO} \\ 0 & 1 \end{pmatrix} S_2^{-1}, \qquad (2.10)$$

$$C \equiv 1 - M_5 D_w^{OE} M_5 D_w^{EO}. \qquad (2.11)$$

Obviously, the most time-consuming task in the HMC is to solve the linear system $CC^\dagger |x\rangle = |b\rangle$ by the conjugate gradient (CG), namely, in the computation of the fermion force in the molecular dynamics. In this work, we implement the entire conjugate gradient inside the NVIDIA GPU, which can be used for the HMC as well as for computing the valence quark propagators.
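To spell out why solving $CC^\dagger|x\rangle = |b\rangle$ suffices (a standard Schur-complement step which we add here for completeness; it is only implicit above), multiply both sides of $\mathcal{D}(m_q)\,x = b$ in the factorized form (2.10) by the inverses of the outer factors:

$$\begin{pmatrix} 1 & 0 \\ 0 & C \end{pmatrix} y = b', \qquad y \equiv \begin{pmatrix} 1 & M_5 D_w^{EO} \\ 0 & 1 \end{pmatrix} S_2^{-1}\,x, \qquad b' \equiv \begin{pmatrix} 1 & 0 \\ -M_5 D_w^{OE} & 1 \end{pmatrix} S_1\, b.$$

The even-site equation $y_E = b'_E$ and the triangular factors are applied directly; only the odd-site equation $C\,y_O = b'_O$ needs an iterative solver, and it is cast into Hermitian positive-definite form by solving $CC^\dagger z = b'_O$ and setting $y_O = C^\dagger z$.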
2.3 Algorithm

The conjugate gradient (CG) method [8] is a widely-used numerical algorithm for iteratively solving a linear system $Ax = b$ to a certain precision $\varepsilon$, where $A$ is a positive-definite Hermitian matrix. With the CG algorithm (see Algorithm 1), the problem is turned into a task dominated by matrix-vector multiplication. In this work, we utilize CUDA to implement the matrix-vector multiplications with the 5D domain-wall fermion operator (2.10) for the CG on NVIDIA GPUs. On the GPU, the single-precision operations are several times faster than the double-precision ones; thus it is advantageous to use the mixed-precision CG. In the so-called defect correction algorithm, one solves $x$ in the low precision, and updates the solution $\hat{x}$ and the residue $\hat{r}$ in the high precision (see Algorithm 2, where the hatted symbols represent variables in the high precision). In this fashion, most of the floating-point operations are in the low precision, which is advantageous for the GPU. The defect correction algorithm is mathematically sound, and it always works in practice. However, its seeming drawback is that one has to rebuild the Krylov space every time the CG is restarted in the low precision.
Algorithm 1  Conjugate Gradient
  x_0 := initial guess
  r_0 := b - A x_0
  p_0 := r_0
  k := 0
  while |r_k| > ε |b| do
    α_k := (r_k, r_k) / (p_k, A p_k)
    r_{k+1} := r_k - α_k A p_k
    β_{k+1} := (r_{k+1}, r_{k+1}) / (r_k, r_k)
    x_{k+1} := x_k + α_k p_k
    p_{k+1} := r_{k+1} + β_{k+1} p_k
    k := k + 1
  end while
Algorithm 2  Mixed-Precision Conjugate Gradient (Defect Correction)
  x̂ := initial guess
  r̂ := b̂ - Â x̂
  while |r̂| > ε̂ |b̂| do
    r := r̂
    p := r̂
    x := 0
    use Algorithm 1 to solve x = A^{-1} r in the low precision, to a precision ε
    x̂ := x̂ + x
    r̂ := b̂ - Â x̂
  end while
On the other hand, if one does not reset the low-precision $p$ vector inside the while loop of Algorithm 2 (i.e., skipping the step $p := \hat{r}$ except at the first iteration), the "warm-up" time spent re-building the Krylov space can be reduced. This so-called reliable updates algorithm [9, 10] runs faster than the defect correction. Although the reliable updates in this fashion may not converge in all cases, due to the non-orthogonality of $p$ and $r$, in practice it works in most cases. We have implemented both algorithms in our CG solver, and it automatically switches to the defect correction if the reliable updates does not converge in the first place.
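To make the control flow of Algorithm 2 concrete, the following is a minimal, self-contained sketch of a mixed-precision defect-correction loop, written in the C++ host style of a CUDA program. It is our illustration, not TWQCD's production code: a diagonal matrix stands in for $CC^\dagger$ and the inner CG runs on the host, whereas in the real solver every matrix-vector product, reduction, and vector update is a device kernel.

// Minimal defect-correction sketch (our illustration; all names hypothetical).
// Outer loop in double precision; inner CG (Algorithm 1) in single precision.
// A = diag(a) stands in for CC^dagger.
#include <cstdio>
#include <cmath>
#include <vector>

const int N = 1024;

// Inner CG in single precision: solve A x = r to relative precision eps.
void cg_low(const std::vector<float>& a, const std::vector<float>& r,
            std::vector<float>& x, float eps)
{
    std::vector<float> res(r), p(r), Ap(N);
    x.assign(N, 0.0f);
    float rr = 0.0f;
    for (int i = 0; i < N; ++i) rr += res[i] * res[i];
    const float target = eps * eps * rr;
    while (rr > target) {
        for (int i = 0; i < N; ++i) Ap[i] = a[i] * p[i];   // A p (a device kernel in practice)
        float pAp = 0.0f;
        for (int i = 0; i < N; ++i) pAp += p[i] * Ap[i];
        const float alpha = rr / pAp;
        float rr_new = 0.0f;
        for (int i = 0; i < N; ++i) {
            x[i]   += alpha * p[i];
            res[i] -= alpha * Ap[i];
            rr_new += res[i] * res[i];
        }
        const float beta = rr_new / rr;
        for (int i = 0; i < N; ++i) p[i] = res[i] + beta * p[i];
        rr = rr_new;
    }
}

int main()
{
    std::vector<double> a(N), b(N), xhat(N, 0.0), rhat(N);
    for (int i = 0; i < N; ++i) { a[i] = 1.0 + i % 7; b[i] = 1.0; }

    double bb = 0.0;
    for (int i = 0; i < N; ++i) { rhat[i] = b[i]; bb += b[i] * b[i]; }

    double rr = bb;
    while (rr > 1e-24 * bb) {                              // outer, high-precision loop
        std::vector<float> af(N), rf(N), xf;
        for (int i = 0; i < N; ++i) { af[i] = float(a[i]); rf[i] = float(rhat[i]); }
        cg_low(af, rf, xf, 1e-3f);                         // defect equation in low precision
        rr = 0.0;
        for (int i = 0; i < N; ++i) {                      // high-precision update of x^ and r^
            xhat[i] += double(xf[i]);
            rhat[i]  = b[i] - a[i] * xhat[i];
            rr      += rhat[i] * rhat[i];
        }
    }
    std::printf("final |r|/|b| = %.3e\n", std::sqrt(rr / bb));
    return 0;
}

The reliable-updates variant would correspond to keeping the low-precision p across calls to cg_low instead of re-initializing it from the new residue each time.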
3. CUDA Kernels and Optimization

The CUDA architecture developed by NVIDIA enables us to do parallel computations on NVIDIA's GPUs. (For more detailed programming models and strategies, see the "CUDA Programming Guide for CUDA Toolkit" [11].)

In CUDA, a thread is the smallest unit to execute a kernel, and a certain number of threads form a block. Each thread is assigned a 3-dimensional index, and each block a 2-dimensional index. Inside a block, a warp of threads executes the kernel concurrently. Each thread has its own register memory, and threads in the same block share a shared memory. The space of register and shared memory is very limited, but they have the fastest access speed. The global memory on the device can be accessed by any thread, but it has the slowest bandwidth.

To implement the mixed-precision CG for ODWF with CUDA, we perform all matrix-vector multiplications, vector reductions (inner products), and vector additions/subtractions on the GPU (device), while the CPU (host) is used for flow control, memory management, and device control.

The CUDA kernels in our CG solver can be divided into five categories. We discuss each category and its optimization in the following subsections.
3.1 Vector Index Conversion

These kernels are used to change the indexing schemes between the device (GPU) and the host (CPU). To store the Dirac spinors in a one-dimensional array, we need to map the multi-dimensional indices to the 1D array index. One needs $4 \times 3 \times 2 = 24$ real numbers to store one Dirac spinor at each site of the 5D lattice. On the CPU, this color-spinor index $c$, which runs from 0 to 23, is the inner-most (fastest-running) index, followed by the fifth-dimension index $s$, and then the $x, y, z, t$ indices, where $t$ is the outer-most (slowest-running) index. If $i_{host}$ denotes the one-dimensional array index of the CPU scheme, then we have

$$i_{host} = c + s \cdot 24 + x \cdot 24 N_s + \cdots + t \cdot 24 N_x N_y N_z N_s. \qquad (3.1)$$

However, for computation on the GPU, we assign each thread a five-dimensional site index. This implies that adjacent threads have consecutive $s$ indices. Thus we want to arrange the data such that optimal coalescence is attained when loading the vector from device global memory to the registers and/or the shared memory of the threads. Since the GPU provides vector data types such as float4 and double2, which allow us to move 4 single-precision numbers (or 2 double-precision numbers) at one time, a simple way to map to the one-dimensional array index on the GPU (for single precision) is

$$i_{dev} = c \bmod 4 + s \cdot 4 + \lfloor c/4 \rfloor \cdot 4 N_s + x \cdot 24 N_s + \cdots + t \cdot 24 N_x N_y N_z N_s, \qquad (3.2)$$

and similarly for the double precision. So every time we transfer data between the host and the device, we convert the index accordingly.
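Both index maps are easy to state in code. Below is a host-side sketch (ours; the function names and example extents are illustrative) of Eq. (3.1) and of the single-precision device layout of Eq. (3.2):

// Index maps of Eqs. (3.1) and (3.2); an illustrative sketch.
// Example extents for the 16^3 x 32 lattice with Ns = 16 used in Table 1.
const int Ns = 16, Nx = 16, Ny = 16, Nz = 16, Nt = 32;

// Host ordering: color-spinor index c (0..23) fastest, then s, x, y, z, t.
inline int idx_host(int c, int s, int x, int y, int z, int t) {
    return c + 24 * (s + Ns * (x + Nx * (y + Ny * (z + Nz * t))));
}

// Device ordering (single precision): 4 reals are packed per float4, so
// c mod 4 is fastest, then s, then c/4; adjacent threads (consecutive s)
// then read consecutive float4's, giving coalesced global-memory loads.
inline int idx_dev(int c, int s, int x, int y, int z, int t) {
    return (c % 4) + 4 * (s + Ns * ((c / 4) + 6 * (x + Nx * (y + Ny * (z + Nz * t)))));
}

The host-device transfer then amounts to a gather/scatter between these two orderings, performed by the index-conversion kernels.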
3.2 Matrix-Vector Multiplication for $D_w^{OE}$ ($D_w^{EO}$)

$D_w^{OE}$ ($D_w^{EO}$) is similar to the usual Wilson-Dirac operator $D_w$ without the mass term,

$$\left[D_w^{OE}(D_w^{EO})\right]_{xx'} = -\frac{1}{2}\sum_{\mu}\left[(1-\gamma_\mu)\,U_\mu(x)\,\delta_{x+a\hat\mu,x'} + (1+\gamma_\mu)\,U_\mu^\dagger(x')\,\delta_{x-a\hat\mu,x'}\right]. \qquad (3.3)$$

From this expression we see that the multiplication of $D_w^{OE}$ ($D_w^{EO}$) with a vector involves the link variables $U_\mu(x)$ and the $\gamma$-matrices. We have used the following tricks for optimization.

Firstly, since the $\gamma$-matrices in Eq. (3.3) appear in the combinations $(1 \pm \gamma_\mu)$, the left-handed and the right-handed Dirac components are related to each other. Also, since the link variables do not have Dirac indices, we can just multiply $U_\mu(x)$ with the left-handed components, and then restore the right-handed components (see the sketch at the end of this subsection).

Secondly, $U_\mu(x)$ has no fifth-dimension dependence, so threads having the same $x$ but different $s$ can share the same $U_\mu(x)$. So we put the link variables in the shared memory.

Thirdly, because the GPU computation is memory-bandwidth bound, one must try to reuse the data. For example, in the hopping terms ($\delta_{x \pm a\hat\mu,x'}$) of $D_w$, all neighboring sites of $x$ are involved in the calculation. If we assign each $x$ to one thread, then there must be overlapping data loading for neighboring sites. To reduce this overlapping data transfer, we assign each $(x, y, z)$ to one thread, with a loop in the $t$-direction. Then the neighboring data in the $t$-direction can be reused, and the efficiency is enhanced.

Besides the above tuning techniques, we also expand small loops, and use the texture memory for caching data; here textures are used for loading the vectors and link variables. We use Python [12] to expand small loops, and to generate the set of kernels for the $D_w$ multiplication.
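As an illustration of the first trick (our sketch, not the collaboration's generated kernels, and with the overall factor $-1/2$ and the data-layout details omitted), consider the forward hopping term in the $t$-direction in a chiral basis where $\gamma_t$ has the 2×2 block form $\left(\begin{smallmatrix} 0 & 1 \\ 1 & 0 \end{smallmatrix}\right)$. Then $(1-\gamma_t)\psi = (h, -h)$ with $h = \psi_{upper} - \psi_{lower}$, so only two of the four spin components have to be multiplied by the SU(3) link matrix:

// Spin-projection sketch for the (1 - gamma_t) U_t(x) psi(x + a t̂) term,
// assuming a chiral basis with gamma_t = [[0,1],[1,0]] (2x2 blocks).
// Compile with nvcc; float2 holds one complex number.
#include <cuda_runtime.h>

struct su3 { float2 e[3][3]; };                // one SU(3) link matrix

__device__ float2 cmul(float2 a, float2 b) {   // complex multiplication
    return make_float2(a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x);
}

__device__ void mat_vec(const su3& U, const float2 v[3], float2 out[3]) {
    for (int i = 0; i < 3; ++i) {
        out[i] = make_float2(0.f, 0.f);
        for (int j = 0; j < 3; ++j) {
            float2 t = cmul(U.e[i][j], v[j]);
            out[i].x += t.x; out[i].y += t.y;
        }
    }
}

// psi: 4 spin x 3 color components of the neighbor spinor; out: accumulator.
// The overall -1/2 factor of Eq. (3.3) is omitted here.
__device__ void hop_t_forward(const su3& U, const float2 psi[4][3], float2 out[4][3]) {
    float2 h[2][3], Uh[2][3];
    for (int a = 0; a < 2; ++a)                // project: h = psi_upper - psi_lower
        for (int c = 0; c < 3; ++c)
            h[a][c] = make_float2(psi[a][c].x - psi[a + 2][c].x,
                                  psi[a][c].y - psi[a + 2][c].y);
    for (int a = 0; a < 2; ++a)                // multiply only 2 spin components by U
        mat_vec(U, h[a], Uh[a]);
    for (int a = 0; a < 2; ++a)                // reconstruct: upper += Uh, lower -= Uh
        for (int c = 0; c < 3; ++c) {
            out[a][c].x     += Uh[a][c].x;  out[a][c].y     += Uh[a][c].y;
            out[a + 2][c].x -= Uh[a][c].x;  out[a + 2][c].y -= Uh[a][c].y;
        }
}

The backward term uses $(1+\gamma_t)$ with $h = \psi_{upper} + \psi_{lower}$ and $U^\dagger$; the other directions differ only in the projection pattern.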
3.3 Matrix-Vector Multiplication for $M_5$

The matrix $M_5$ is given in Eq. (2.8). One can see that $M_5$ is block-diagonal in the chiral basis, and it depends on neither the space-time nor the color indices. In fact, it can be divided into two constant matrices in the fifth dimension, i.e., the left-handed and the right-handed ones. So the multiplication of $M_5$ with a vector can be regarded as $u_s = \sum_{s'} (M_5)_{ss'} v_{s'}$. Here we use the shared memory to store the source vector ($v$). Since $M_5$ only takes $2N_s^2$ real numbers, we can put $M_5$ into the registers of each thread (with different $s$ and $x, y, z$). Again, a loop in $t$ is inserted to reuse the $M_5$ data in the registers. Also, we use Python to generate these kernels.
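A stripped-down version of such a kernel might look as follows. This is our sketch with a simplified layout (spin-color index suppressed, one chirality, hypothetical names); the generated kernels additionally handle the full spinor layout of Eq. (3.2):

// Sketch of u_s = sum_{s'} (M5)_{ss'} v_{s'}: one block per (x,y,z), one
// thread per s, with a loop over t to reuse the M5 rows held in registers.
#include <cuda_runtime.h>

__global__ void mult_M5(const float* __restrict__ M5,   // Ns x Ns, row-major
                        const float* __restrict__ v,    // v[(t*Nxyz + xyz)*Ns + s]
                        float* __restrict__ u,
                        int Ns, int Nxyz, int Nt) {
    extern __shared__ float vs[];             // one fifth-dimension column of v
    int s   = threadIdx.x;                    // this thread's fifth-dimension index
    int xyz = blockIdx.x;                     // this block's spatial site (x,y,z)

    float m[32];                              // this thread's row of M5 (assumes Ns <= 32)
    for (int sp = 0; sp < Ns; ++sp) m[sp] = M5[s * Ns + sp];

    for (int t = 0; t < Nt; ++t) {            // loop over t reuses the M5 registers
        int base = (t * Nxyz + xyz) * Ns;
        vs[s] = v[base + s];                  // stage the source column in shared memory
        __syncthreads();
        float acc = 0.f;
        for (int sp = 0; sp < Ns; ++sp) acc += m[sp] * vs[sp];
        u[base + s] = acc;
        __syncthreads();                      // before vs is overwritten next iteration
    }
}

// Launch example: one block per (x,y,z), Ns threads, Ns floats of shared memory:
// mult_M5<<<Nxyz, Ns, Ns * sizeof(float)>>>(d_M5, d_v, d_u, Ns, Nxyz, Nt);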
3.4 Vector Reduction

To calculate the norm of a vector, we use the well-known parallel reduction with the shared memory. However, due to the limitation on the number of threads per block, it is inevitable to use the global memory when the size of the vector becomes very large. Our solution is to perform the block reduction in prior kernels, i.e., to sum up vector elements (already stored in registers/shared memory) within each block. Then these partial sums are added with a parallel reduction.
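For reference, a textbook version of the shared-memory block reduction described here (a generic sketch, not the solver's exact kernel) is:

// Generic shared-memory block reduction: each block writes one partial sum
// to global memory; a second pass (or one final block) reduces the partials.
// Assumes blockDim.x is a power of two.
__global__ void block_reduce(const float* __restrict__ in,
                             float* __restrict__ out, int n) {
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x * 2 + tid;

    // Each thread loads (up to) two elements, fusing the first addition.
    float x = 0.f;
    if (i < n)              x  = in[i];
    if (i + blockDim.x < n) x += in[i + blockDim.x];
    sdata[tid] = x;
    __syncthreads();

    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];  // one partial sum per block
}

In our scheme the first pass is fused into the preceding kernel, which already holds the elements in registers/shared memory, so only the per-block partial sums ever touch the global memory.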
3.5 Vector Addition and Subtraction

We can combine a simple addition/subtraction with other kernels in which one of the vectors has already been loaded. For example, to multiply $C \equiv 1 - M_5 D_w^{OE} M_5 D_w^{EO}$ with a vector, we can combine the last subtraction with the last $M_5$ multiplication.
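In terms of the $M_5$ sketch in Section 3.3, this fusion is one extra input and one changed store (again our illustration, not the production kernel): the kernel also reads the vector $w$ that $C$ acts on and writes $u = w - M_5 v$ directly, saving a separate pass over the global memory.

// Fused "subtract after M5" sketch: u = w - M5 * v in a single kernel.
__global__ void mult_M5_fused_sub(const float* __restrict__ M5,
                                  const float* __restrict__ v,
                                  const float* __restrict__ w,
                                  float* __restrict__ u,
                                  int Ns, int Nxyz, int Nt) {
    extern __shared__ float vs[];
    int s = threadIdx.x, xyz = blockIdx.x;
    float m[32];                              // assumes Ns <= 32, as before
    for (int sp = 0; sp < Ns; ++sp) m[sp] = M5[s * Ns + sp];
    for (int t = 0; t < Nt; ++t) {
        int base = (t * Nxyz + xyz) * Ns;
        vs[s] = v[base + s];
        __syncthreads();
        float acc = 0.f;
        for (int sp = 0; sp < Ns; ++sp) acc += m[sp] * vs[sp];
        u[base + s] = w[base + s] - acc;      // fused subtraction
        __syncthreads();
    }
}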
4. Performance

We present some benchmarks of our CG solver, using the NVIDIA GeForce GTX 285, GeForce GTX 480, Tesla™ C1060, and Tesla C2050. Note that our code has not yet been well-tuned for the Fermi architecture (GTX 480 and C2050). From Table 1, we see that the bottleneck of our program is in the single-precision $D_w$ matrix-vector multiplication. Due to the mixed-precision CG, the time used in the double-precision operations is almost negligible. For the Fermi architecture, due to the larger L1 cache, there is a significant improvement in the single-precision $D_w$ matrix-vector multiplication, and also in the double-precision $M_5$ matrix-vector multiplication.

Table 1: Benchmark of our CG solver for DWF on a $16^3 \times 32 \times 16$ lattice (numbers in units of Gflops).

            D_w (single)   M_5 (single)   D_w (double)   M_5 (double)   CG (mixed)
GTX 285          177            346             33             69           181
GTX 480          248            331             32            116           233
C1060            128            290             29             61           132
C2050            160            239             22            100           156
5. Summary

We have implemented an efficient GPU-based CG solver for generalized domain-wall fermions in lattice QCD. Our CUDA kernels are tuned with several innovative techniques. On NVIDIA GeForce GTX 285/480, our CG solver achieves 180/233 Gflops (sustained). This efficient CG solver constitutes the most crucial part of TWQCD's HMC code for the simulation of unquenched lattice QCD with the optimal domain-wall fermion.
Acknowledgments

This work is supported in part by the National Science Council (Nos. NSC96-2112-M-002-020-MY3, NSC99-2112-M-002-012-MY3, NSC96-2112-M-001-017-MY3, NSC99-2112-M-001-014-MY3, NSC99-2119-M-002-001) and NTU-CQSE (Nos. 99R80869, 99R80873).
References

[1] D. B. Kaplan, Phys. Lett. B 288, 342 (1992)
[2] H. Neuberger, Phys. Lett. B 417, 141 (1998); R. Narayanan and H. Neuberger, Nucl. Phys. B 443, 305 (1995)
[3] T. W. Chiu, Phys. Rev. Lett. 90, 071601 (2003); Nucl. Phys. Proc. Suppl. 129, 135 (2004)
[4] T. W. Chiu et al. [TWQCD Collaboration], PoS LAT2009, 034 (2009) [arXiv:0911.5029 [hep-lat]]
[5] T. W. Chiu et al. [TWQCD Collaboration], PoS LAT2010, 099 (2010)
[6] T. H. Hsieh et al. [TWQCD Collaboration], PoS LAT2010, 085 (2010)
[7] A. Borici, Nucl. Phys. Proc. Suppl. 83, 771 (2000) [arXiv:hep-lat/9909057]
[8] M. R. Hestenes and E. Stiefel, Journal of Research of the National Bureau of Standards 49, 6 (1952)
[9] G. L. G. Sleijpen and H. A. van der Vorst, Computing 56, 141-164 (1996)
[10] M. A. Clark, R. Babich, K. Barros, R. C. Brower and C. Rebbi, Comput. Phys. Commun. 181, 1517 (2010) [arXiv:0911.3191 [hep-lat]]
[11] http://developer.nvidia.com/object/gpucomputing.html
[12] http://www.python.org