Table Of ContentThisspaceisreservedfortheProcediaheader,donotuseit
6
1
0 A Novel Implementation of QuickHull Algorithm
2
on the GPU
y
a
M
JiayinZhang1, GangMei1∗, NengxiongXu1, and KunyangZhao1
0
SchoolofEngineeringandTechnology,ChinaUniversityofGeosciences,100083,Beijing,China
2
{zhangjy,gang.mei,xunengxiong,kunyang}@cugb.edu.cn
]
G
C
Abstract
.
s WepresentanovelGPU-acceleratedimplementationoftheQuickHullalgorihtmforcalculatingconvex
c hulls of planarpointsets. We also describea practicalsolution to demonstratehow to efficiently im-
[
plementatypicalDivide-and-ConqueralgorithmontheGPU.Wehighlyutilizetheparallelprimitives
3 providedbythelibraryThrustsuchastheparallelsegmentedscanforbetterefficiencyandsimplicity.
v Toevaluatetheperformanceofourimplementation,wecarryoutfourgroupsofexperimentaltestsus-
6
ing two groupsof pointsets in two modeson the GPU K20c. Experimentalresults indicatethat: our
0
implementationcanachievethespeedupsofupto10.98xoverthestate-of-artCPU-basedconvexhull
7
4 implementationQhull[16]. Inaddition,ourimplementationcanfindtheconvexhullof20Mpointsin
0 about0.2seconds.
.
1
Keywords: GPU;Parallelization;ConvexHull;QuickHull;Divide-and-Conquer
0
5
1
: 1 Introduction
v
i
X
Thecalculatingoftheconvexhullofasetofplanarpointsistofindtheconvexpolygonthatenclosesall
r thepoints. Severalclassicalgorithmshavebeendevelopedsince1970,includingtheGrahamscan[8],
a
Giftwrapping[9],Incrementalmethod[11],Divide-and-Conquer[15],Monotonechain[1],andQuick-
Hull [2]. Several recent efforts have also been conductedto develop efficient convexhull algorithms
[24,12].
The finding of convex hulls of large sets of points is in general computationally expensive. An
effectivestrategyfordealingwiththisproblemistocalculateconvexhullsinparallel. Inrecentyears,
toimprovethecomputationalefficiencyofcalculatingconvexhulls,severalworthfulcontributionshave
beenmadeto re-designandimplementsequentialconvexhullalgorithmsinparallelby exploitingthe
power of massively computing on the GPU [18, 10, 19, 20, 21, 22, 23, 6, 7]. Most of these GPU-
acceleratedimplementationsaredevelopedbaseduponthefamousQuickHullalgorithm[2].
Forexample,Srikanth,Kothapalli,GovindarajuluandNarayanan[18]firstparallelizedtheQuick-
Hullalgorithmtoacceleratethecalculatingof2Dconvexhulls;andthenSrungarapu,Reddy,Kothapalli
∗Correspondingauthor:[email protected];[email protected]
1
QuickHullAlgorithmontheGPU... J.Zhang,G.Mei,N.XuandK.Zhao
and Narayanan[19] improvedthe abovework and achieved better efficiency. Tzeng and Owens [22]
presentedaframeworkforacceleratingthecomputingofconvexhullintheDivide-and-Conquerfashion
bytakingadvantageofQuickHull.Similarly,byalsoutilizingtheQuickHullapproach,Stein,Gevaand
El-Sana[20]presentedanovelGPU-acceleratedimplementationof3Dconvexhullalgorithm.
Inthis paper,we presenta novelGPU-acceleratedimplementationofthe famousQuickHullalgo-
rithm for computing the convex hulls of planar point sets. We also describe a practical solution to
demonstratehowtoimplementatypicalDivide-and-ConqueralgorithmontheGPU.Ourimplementa-
tion is quite similar to the one introducedby Tzeng and Owens [22]; butthere are several significant
differencesbetweenourimplementationandthatofTzengandOwens. Ourimplementationcanachieve
aspeedupofupto10.98xoverastandardsequentialCPUimplementation,andcanfindtheconvexhull
asetof20Mpointsinabout0.2seconds.
The rest of the paper is organized as follows. Section 2 introduces several background concepts
behind our implementation. Section 3 describes the basic ideas behind our implementation. Section
4 further presentssome implementationdetails. Then Section 5 givesseveralgroupsof experimental
results,whileSection6discussestheresult. Finally,Section7concludesthiswork.
2 Segment and Segmented Scan
Segmentsarecontiguouspartitionsofthedatawhicharemaintainedbysegmentflags[4,17]. Thereare
typicallytwoformsforrepresentingsegments: thefirstoneistouseasetofheadflags; andtheother
istouseasetofkeys;seeFigure1. Aheadflagmarksthebeginningofasegment(calledthesegment
head).Akeyindicatestheindexofthesegmentthateachelement/valueinagivenarraybelongingto.
Segmentedscangeneralizesthescanprimitivebyallowingscansonarbitrarysegments(“partitions”)
oftheinputvector[4]. Toimplementsegmentedscaninparallel,Sengupta,Harris,ZhangandOwens
[17]introducedthesegmentedscanprimitivetotheGPU.AndcurrentlyboththelibrariesCUDPP[5]
andThrust[3]bothprovideefficientGPU-acceleratedsegmentedscanprimitives.
ThrustisaC++templatelibraryforCUDAbasedontheStandardTemplateLibrary(STL).Thrust
allows users to implement high performance parallel applications with minimal programming effort
through a high-level interface that is fully interoperable with CUDA C [14]. Thrust provides a rich
collection of data structuresand data parallel primitivessuch as scan, sort, and reduce, which can be
composedtogethertoimplementcomplexalgorithmswithconcise,readablesourcecode.
Figure1: Segmentsandtworepresentationformsofsegments
Segment 0 Segment 1 Segment 2 Segment 3
Value 10 56 22 35 12 89 17 87 11 45 30 19 29 40 41 37 27 20 13 21
Head 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
Key 0 0 0 0 1 1 1 2 2 2 2 2 2 2 3 3 3 3 3 3
Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
First point 0 4 7 14
2
QuickHullAlgorithmontheGPU... J.Zhang,G.Mei,N.XuandK.Zhao
3 Basic Ideas behind Our Implementation
Themostimportantideabehindourimplementationistodirectlyoperatethedataintheinputarraysthat
areoriginallyallocatedtostorethecoordinatesofinputpoints,ratherthanintheadditionallyallocated
arraysorsplittingtheinputdataintoseparatearrays.
TheQuickHullalgorithmisaDivide-and-Conquermethod,whichtendstodividetheinputdataset
intosubsetsandthenhandlesthese subsetsrecursively. OntheGPU, aneffectivestrategyistodivide
the inputdata set into subsets, butdo notstore them in separatedarrayswith differentsizes. Instead,
all the data of the subsets are still stored in the input data array, but the data of each subset is stored
intoaSegment(i.e.,aconsecutivepiece/partitionofdata)[4]. Operationscarriedoutforeachsubsetis
exactlytheoperationsforeachsegment[22]. Weadoptthisstrategytodevelopourimplementation.
Figure2:Procedureofthe2DCUDAQuickHullontheGPU(withoutpreprocessing)
Procedure: 2D Quickhull on the GPU
Input: a set of input points pt_in
Output: convex hull ch_out
First Split
1: Use parallel reduction to find the leftmost point Pminx and the rightmost point Pmaxx
2: Determine the positions of the rest points against the line PminxPmaxx
Assign a flag value to indicate the position: flag = 1 when below; flag = 0 when above
3: Use parallel partition to split the points into two segments according to the flag values:
the lower subset Slower and the upper subset Supper
4: Use parallel sorting to sort the Slower in x-ascending and Supper in x-descending
Recursive Step
Repeat
for each segment (Pfirst, Plast) represented by two points Pfirst and Plast do
5: Find the farthest point Pfar from the line Pfirst Plast
6: Divide segment (Pfirst, Plast) into two new segments (Pfirst, Pfar) and (Pfar, Plast)
7: Update all segments (including head flags, keys, first points)
8: Detect interior points by determining the positions
9: Assign each point a state flag depending on its position to indicate the state:
1: current non-interior points; 0: determined interior points
10: Use parallel stable_partition to gather all interior points according to the state flags,
then remove all interior points
11: Update only the first points of all segments
Until there are no interior points can be found
12: Output the remaining points in pt_in as the extreme points of ch_out
TheprocedureofourimplementationispresentedinFigure2. Inourimplementation,aftersplitting
theinputsetofpointsintothelowerandtheuppersubsets,wesortthetwosubsetsseparatelyaccording
to the x-coordinates. After this sorting, either the lower or the upper subset of sorted points can be
consideredasaMonotoneChain[1];inaddition,boththeabovechainscanbefurtherconsideredastwo
halvesofageneralpolygon.Ifwedetectandremovethoseverticesofthegeneralpolygonthathavethe
interioranglesgreaterthan180degrees,thenwecanobtainthedesiredconvexhull. Theaboveideaof
firstsortingandthenremovingnon-extremepoints/concaveverticeswasintroducedin[1]and[13].
Therefore,the basicidea behindourimplementationisto “Find-and-Remove”. Inthestep offirst
split, we divide the input points and then sort them to virtually form a general polygon. In the sub-
sequent recursive step, we recursively first find those non-extreme points, and then remove them to
guaranteethatalltheremainingpointsarecompletelyextremepointsoftheexpectedconvexhull.
3
QuickHullAlgorithmontheGPU... J.Zhang,G.Mei,N.XuandK.Zhao
4 Implementation Details
4.1 DataStorage andData Layout
Weallocateseveralarraysonthedevicesidetostorethecoordinatesofplanarpoints,informationabout
segments,andotherrequiredvaluessuchasdistances;seeTable1.
Table1:Allocatedarraysforstoringdataonthedevice
Array Usage
float x[n] xcoordinates
float y[n] ycoordinates
float dist[n] Distances
int head[n] Indicatorofthefirstpointofeachsegment
(1:Headpoint;0:Notaheadpoint)
int keys[n] Indexofthesegmentthateachpointbelongsto
int first pts[n] Indexofthefirstpointofeachsegment
int flag[n] Indicatewhetherapointisanextremepointoraninteriorpoint
(1:Potentialextremepoint;0:Determinedinteriorpoint)
4.2 The Preprocessing Procedure
BeforeperformingtheQuickHullalgorithmontheGPU,wefirstcarryoutapreprocessingprocedureto
filtertheinputpoints. Theobjectiveofthispreprocessingprocedureistoreducethenumberofpoints
bydiscardingthosepointsthatarenotneededforconsiderationinthesubsequentstageofcalculating
thedesiredconvexhull.
Weusetheparallelreductiontofindtheextremepointswithminormaxxorycoordinate. Inmore
details, we adopt the thrust::minmax element(x.begin(), x.end()) to find the left-
most and the rightmost points, and similarly use the thrust::minmax element(y.begin(),
y.end())toobtainthetopmostandthebottommostpoints. Thesefourextremepointsarethenused
toformaconvexquadrilateral.
We also design a simple CUDA kernel to check each pointto determine whetherit locates inside
thequadrilateral.Inthekernel,eachthreadisresponsiblefordeterminingthepositionofonlyonepoint
Pi, i.e., whether or not a point falls into the formed convexquadrilateral. If does, the corresponding
indicatorvalueflag[i]willbesetto0,otherwise,thevalueflag[i]isstillkeptas1.
4.3 The FirstSplit
The firstsplitof theQuichHullalgorithmis todividethe setofinputpointsinto two subsets, i.e., the
lowerandtheuppersubsets,usingthelinesegmentLformedbytheleftmostandtherightmostpoints.
ThosepointsthatlocatebelowtheLaregroupedintothelowersubset,whiletheonesdistributedabove
theLarecontainedintheuppersubset.
We developanotherquite simplekernelto performthe abovesplitprocedure. Inthis kernel, each
threadtakestheresponsibilitytodeterminethepositionofonlyonepointwithrespecttothelinesegment
L. Inthisstep,wetemporarilyusethevaluesintflag[n]toindicatethepositions: ifthepointPi
locatesbelowtheL,inotherwords,ifPi belongstothelowersubset,thenthecorrespondingindicator
valueflag[i]willbesetto1,otherwise0.
4
QuickHullAlgorithmontheGPU... J.Zhang,G.Mei,N.XuandK.Zhao
Afterdeterminingthepositionsofallpoints,itisneededtogatherthepointsbelongingtothesame
subsetsuchastheloweronetogetheraccordingtotheindicatorvaluesintflag[n].Werealizedthis
procedureby simply using the functionthrust::partition(). Those pointswith the indicator
value1,i.e.,thelowerpoints,willbeplacedintothefirstconsecutivehalfoftheinputarray(asegment
of points), while the upper points will be grouped into another consecutive half (another segment of
points). Insubsequentsteps,operationswillbeperformedinthesegmentsofpoints.
4.4 The RecursiveProcedure
4.4.1 FindingtheFarthestPoint
The first step in the recursiveprocedureis to find the farthestpointsin each segment, which includes
tworemarkableissues. Thefirstistocalculatethedistanceforallpointsinparallel;andtheotheristo
findthosepointswiththefarthestdistancesforallsegmentsinparallel.
Calculatingthe Distance. The calculationof the distancefroma pointto a line is quite straight-
forward. However,inourimplementation,itisneededto calculatethedistancesfromdifferentpoints
to differentlines simultaneously in parallel. This calculation is not so easy to implementin practice.
This is because that: (1) for each segment, it is needed to compute the distance from each of those
pointsbelongingtothissegmenttothelineformedbythefirstpointsandthelastpoint;(2)foranytwo
segments,theirfirstpointandlastpointsaredifferent.
Inourimplementation,foreachpointPi,weusethevaluefirst pt[i]torecordtheindexofthe
firstpointofthatsegmentitbelongsto. Sincesegmentsarestoredconsecutively,thefirstpointofthe
(j+1)th segmentisexactlythelastpointofthejth segmentexceptforthelastsegment. Notethatthe
lastpointofthelastsegmentisthepointP0. Therefore,itiseasytoobtaintheindexofthelastpointfor
eachsegment,andthedistancefromthepointPitothelineformedbythefirstpointandthelastpoint.
FindtheFarthestPoint. Aftercalculatingthedistancesforthosepointsindifferentsegments,itis
neededtofindthefarthestpointineachsegment. Thefindingofthefarthestpointsforallsegmentsin
parallelisnotso easy. Thisis becausethereare morethan two segmentsthatare neededto findtheir
farthestpoints. Theirfarthestpointsareprivate,segment-specific. Inthiscase, itisunabletoperform
a global parallel reduction, but is able to employ a segmented parallel reduction to find the greatest
distanceforeachsegmentofpoints.
ThesegmentedparallelreductionisdesignedinThrusttomakeaparallelreductionformorethan
one segment of points. It can be used to find the min or max values in several segments in parallel.
We employ the parallel primitive thrust::reduce by key() in our implementation to find the
farthestpoints/maximumdistancesforallsegmentsinparallel.
4.4.2 TheFirstRoundofUpdatingSegments
Afterfindingthefarthestpoints,eachsegmentisthentypicallydividedintotwosmallersubsegments
usingthe farthestpoint. Thismeansthattheoldsegmentsarereplacedwith newsegments. Tocreate
newsegments,thefollowinginformationofsegmentsisneededtobeupdated:
(1)Headflags
Theheadflagsofthefarthestpointsareneededtobemodifiedfrom0to1,whichmeanseachofthe
farthestpointsbecomesthefirstpointofanewsegmentandadeterminedextremepointsofthedesired
convexhull. Theheadflagsofotherpointsarekeptunchanged.
(2)Keys
Theupdatingofkeyscanbeveryeasilyrealizedbyperformingaglobalinclusivescanforthehead
flags. Forthatineachsegmentonlytheheadflagofthefirstpointis1,thesumoftheheadflagscanbe
consideredastheindexofthesegment(i.e.,thekeys).Noticeably,thesumoftheheadflagsisone-based
5
QuickHullAlgorithmontheGPU... J.Zhang,G.Mei,N.XuandK.Zhao
rather than zero-based(i.e., starting from 1 rather 0). To make the indicesbecome much easier to be
used,wefurthermodifythekeysfromone-basedtozero-basedbyperformingaparallelsubtracting.
(3)Indicesoffirstpoints
Afterupdatingtheheadflagsandkeysofeachsegment,thecorrespondingindicesofthefirstpoints
arenolongervalid,andthusneededtobeupdated. Beforeupdatingtheindicesofthefirstpoints,we
firstassignaglobalindexforeachoftheremainingpoints,thencheckeachpointwhetheritisthefirst
pointaccordingtotheheadflags. IftheheadflagofapointPi is1,thenthispointmustbeafirstpoint
ofasegmentanditsindexisexactlyi.
4.4.3 DiscardingInteriorPoints
Thediscardingofinteriorpointsistofirstcheckwhetherornotapointlocatesinsidethetriangleformed
by the firstpointof the segment(denotedasA), the last pointofthe segment(denotedas B), andthe
farthestpointinthissegment(denotedasC).
Let△ACBdenotethetriangle,thedeterminingofthepoints’positionswithrespecttothetriangle
△ACBistocheckwhetherthosepointsinthissegmentlocateontherightsideofthedirectedlineAC
andthedirectedlineCB. Ifdoes,thenitisnotaninteriorpoint,anditscorrespondingindicatorvalue
flag[i] is set to 1; otherwise, it is an interior point and the value flag[i] must modified to 0.
Similarly, for each pointin the segment(C,B), it is onlyneededto checkwhether it falls onthe right
sidethedirectedlineCB.
Afterdeterminingallinteriorpointsin thisrecursion,we employa parallelpartitioningprocedure
to gather all interior points together according to the indicator values int flag[n]; see a simple
illustrationin Figure3. Noticeably,to maintainthe relativeorderof inputpoints, we usethe function
thrust::stable partition()ratherthanthe functionthrust::partition(). Afterthe
partitioning,wemakeanoperationresize()toremovealltheinteriorpointsfoundinthisrecursion.
Figure3:Partitioningaccordingtoflags
Flag 1 1 0 1 1 1 0 1 0 1 1 0 1 1 1 1 0 0 1 1
Value 10 56 22 35 12 89 17 87 11 45 30 19 29 40 41 37 27 20 13 21
Value 10 56 35 12 89 87 45 30 29 40 41 37 13 21 22 17 11 19 27 20
Flag 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0
4.4.4 TheSecondRoundofUpdatingSegments
Thisroundofupdatingthesegmentsisnearlythesametothefirstroundofupdating. Thereasonwhy
thisroundofupdatingisneededtobeperformedisthat:afterremovingsomeinteriorpoints,thepoints
belongingtosomesegmentsareremoved. Inthiscase,theglobalindicesofalltheremainingpointare
notconsecutiveandthusneedtoberearranged;seethefirstroundofupdatingformoredetailsaboutthe
aboveupdating.Noticeably,theheadflagsandthekeysfortheremainingpointsarestillcorrect,anddo
notneedtobeupdated.Onlytheindexofthefirstpointofeachsegmentneedstobeupdated.
6
QuickHullAlgorithmontheGPU... J.Zhang,G.Mei,N.XuandK.Zhao
5 Results
WeperformourexperimentaltestsusingtheNVIDIATeslaK20cgraphicscardwith4GBmemoryand
CUDAv5.5.TheCPUexperimentsareperformedonWindows7SP1withanIntelE5-2650(2.60GHz)
and 96GB of RAM memory. Two groupsof test data is employedin two modes to evaluate the per-
formance of our implementation. The first group of test data includes 8 sets of points that randomly
distributedintheunitsquare. Theothergroupoftestdataisderivedfrompubliclyavailable3Dmesh
modelsbyprojectingallverticesofeachmeshmodelintotheXYplane.Thesemeshmodelsaredirectly
obtainedfromtheStanford3DScanningRepositoryandtheGITLargeGeometryModelsArchive. In
addition,wecarryouttheexperimentaltestsintwomodes:(a)Mode1:inthismode,thepreprocessing
procedureisemployed,(b)Mode2: inthismode,thepreprocessingprocedureisnotused.
We test the efficiency of our implementationusing the points that are randomlydistributed in the
unit square and the points derived from 3D mesh models, and then compare the efficiency with that
of the Qhulllibrary; see Tables2 and3. Theexperimentalresultsshowthat: ourimplementationcan
achievethespeedupsofupto10.98xand8.30xintheMode1andMode2,respectively. Inaddition,it
costsabout0.2secondstocomputetheconvexhullof20Mpointsinthebestcase.
Table2:Comparisonofrunningtime(/ms)forpointsdistributedinasquareonK20c
Ourimplementation Speedup
Size Qhull
Mode1 Mode2 Mode1 Mode2
100K 16 37.2 61.7 0.43 0.26
200K 32 36.4 65.7 0.88 0.49
500K 62 41.7 69.2 1.49 0.90
1M 109 44.8 73.9 2.43 1.47
2M 234 55.9 86.2 4.19 2.71
5M 561 77.9 118.5 7.20 4.73
10M 1029 124.7 161.7 8.25 6.36
20M 2262 206.0 272.5 10.98 8.30
Table3: Comparisonofrunningtime(/ms)forpointsderivedfrom3DmodelsonK20c
Ourimplementation Speedup
3DModel Size Qhull
Mode1 Mode2 Mode1 Mode2
Armadillo 172K 16 47.5 61.7 0.34 0.26
Angel 237K 47 50.5 62.6 0.93 0.75
SkeletonHand 327K 32 49.5 61.8 0.65 0.52
Dragon 437K 62 55.9 67.4 1.11 0.92
HappyBuddha 543K 63 56.8 76.7 1.11 0.82
TurbineBlade 882K 125 64.3 72.3 1.94 1.73
VellumManuscript 2M 219 63.2 91.5 3.47 2.39
AsianDragon 3M 359 78.4 129.8 4.58 2.77
ThaiStatue 5M 515 84.4 142.6 6.10 3.61
Lucy 14M 1404 141.3 223.3 9.94 6.29
7
QuickHullAlgorithmontheGPU... J.Zhang,G.Mei,N.XuandK.Zhao
6 Discussion
6.1 Comparison
ThebasicideasbehindourimplementationissimilartothosebehindtheimplementationofTzengand
Owens[22]. Thefirstofthesameideasisthat:weperformtheDivide-and-Conqueroperationsdirectly
intheinputarrays(i.e.,theinputsetsofpoints),ratherthanonadditionallyallocatedarrays.Thesecond
isthat: theDivide-and-ConquerproceduresofQuickHullalgorithmarerealizedbycreating,updating,
orremovingSegments. Thethirdsimilarfeatureisthat: bothoftheimplementationsaredevelopedby
stronglyexploitingthedata-parallelprimitives,parallel(global)scanandparallelsegmentedscan.
However,wehaveourownideas;andthereareseveralsignificantdifferences.Thefirstdifferenceis
that:afterdividingtheinputsetofpointsintothelowerandtheuppersubsets,wefurthersorttheabove
two subsets separately according to the x-coordinates, and maintain the relative order of those sorted
points unchanged in the subsequent procedure of recursively removing interior points. In contrast,
TzengandOwens[22]donotsortthesubsetsofpointsorkeeptherelativeorderofpoints.
The second difference is the creating and removing of segments: Tzeng and Owens create new
segments using the farthest point by permuting the points located in the old segment, while we also
createnewsegmentsusingthefarthestpointbutwedonotpermutepoints. Whenremovingsegments,
weatleastretainthefirstpoint(i.e.,theheadpoint)ofeachsegmentforthatthispointisdefinitelyan
extremepointofthedesiredconvexhull,whileintheimplementationofTzengandOwensasegment
isprobablycompletelyremoved.Thisdifferenceisduetothedifferentschemesofupdatingsegments.
Anotherdifferenceisthat:weadoptthelibraryThrust[3]fortheuseofseveralefficientdata-parallel
primitivessuchasparallelscan,segmentedscan,reduction,andsorting,whilein[22]TzengandOwens
developtheirimplementationbystronglyexploitingthelibraryCUDPP[5]. Thereasonwhywechoose
to use thelibraryThrustratherthanCUDPP isthat: ThrusthasbeenintegratedinCUDA toolkit, and
canbemucheasiertobeusedinpractice.
6.2 PerformanceImpactofPreprocessing
We haveobservedthat: ourimplementationexecutingintheMode1ismuchfasterthanthatinMode
2; see Figure 4. This behavior is probablydue to the facts that: (1) more than 50% input points can
bediscardedinthispreprocessing;(2)theperformancebenefitfromthediscardingofinteriorpointsis
morethantheperformancepenaltyleadbythediscarding.
Besidestheabovementionedimprovementincomputationalefficiency,anotherbenefitofadopting
thepreprocessingprocedureisthat: thememoryusageonthedevicesideismuchless. Thisisbecause
that:afterthepreprocessingonlylessthan50%inputpointsremain.Therefore,inthesubsequentproce-
dureofcomputingtheconvexhull,itisonlyneededtoallocatemuchlessmemoryonthedevicesidefor
thearrayssuchfloatx[n],y[n],dist[n]andint keys[n],first pts[n],flag[n].
6.3 PerformanceofEach sub-procedure
Therearethreemainsub-proceduresinourimplementationwhenadoptingthepreprocessing,i.e.,the
preprocessingprocedure(pre-step),thesplittingofpointsintotwosubsets(1ststep),andtherecursive
procedureoffindingtheexpectedconvexhull(2ndstep). Tofindthepotentialperformancebottleneck,
wehaveinvestigatedthecomputationalefficiencyofthesethreesub-procedures;seethecomputational
efficiency on K20c presented in Figure 5. We have found that in most cases: (1) the most computa-
tionally expensive step is the 2nd step; (2) the most computationallyinexpensiveone is the pre-step.
Therefore, the potential performance bottleneck of our implementation is probably the 2nd step, and
needstobeoptimizedinfurtherwork.
8
QuickHullAlgorithmontheGPU... J.Zhang,G.Mei,N.XuandK.Zhao
12 12
Mode 1 Mode 2 Mode 1 Mode 2
10
10
8
8 p
u
eedup 6 Speed 46
p
S
4 2
2 0
0
100K 200K 500K 1M 2M 5M 10M 20M
Size 3D Model
(a) Pointsdistributedinunitsquare (b) Pointsderivedfrom3Dmeshmodels
Figure4: ComparisonofspeedupsintwomodesonK20c
100 200
2nd step 1st step Pre-step 2nd step 1st step
80 160
ms) ms)
e (/ 60 e (/ 120
m m
(cid:127) (cid:127)
ng 40 ng 80
ni ni
n n
u u
R 20 R 40
0 0
3D Model 3D Model
(a) TestforpointsderivedfrommeshmodelsinMode1 (b) TestforpointsderivedfrommeshmodelsinMode2
Figure5: Computationalefficiencyofsub-proceduresonK20c
7 Conclusion
We havepresenteda novelimplementationof the two-dimensionalQuickHullalgorithmon the GPU.
Inourimplementation,wehavetransformedtheDivide-and-Conquerproceduresintotheoperationsfor
segments directly in the inputarrays. We have strongly utilized several efficient primitives including
parallel sort, scan, segmented scan, and partitioning provided by the library Thrust. We have also
evaluatedtheperformanceofourimplementationusing: (1)pointsrandomlydistributedinunitsquare
and(2)pointsderivedfrom3Dmeshmodelswithorwithoutemployingthepreprocessingprocedure.
We havefoundthatourimplementationcanachievea speedupofupto10.98xovertheQhulllibrary.
We havealsoobservedthatitcostabout0.2stofindtheconvexhullof20Mpoints. We hopethatour
workcanhelpdevelopefficientimplementationsofotherDivide-and-ConqueralgorithmsontheGPU.
Acknowledgments The authors are grateful to the anonymousreferee for helpful comments that
improvedthispaper. ThisresearchwassupportedbytheNaturalScienceFoundationofChina(Grant
Nos. 40602037and40872183).
9
QuickHullAlgorithmontheGPU... J.Zhang,G.Mei,N.XuandK.Zhao
References
[1] AlexMAndrew. Another efficient algorithmfor convex hullsintwodimensions. Information Processing
Letters,9(5):216–219,1979.
[2] CBradfordBarber,DavidPDobkin,andHannuHuhdanpaa. Thequickhullalgorithmforconvexhulls. ACM
TransactionsonMathematicalSoftware(TOMS),22(4):469–483,1996.
[3] NathanBellandJaredHoberock. Thrust: Aproductivity-orientedlibraryforcuda. GPUComputingGems,
7,2011.
[4] GuyEBlelloch. Vectormodelsfordata-parallelcomputing,volume356. MITpressCambridge,1990.
[5] CUDPP. http://cudpp.github.io/,2014.
[6] Mingcen Gao, Thanh-Tung Cao, Ashwin Nanjappa, Tiow-Seng Tan, and Zhiyong Huang. ghull: A gpu
algorithmfor3dconvexhull. ACMTransactionsonMathematicalSoftware(TOMS),40(1):3,2013.
[7] MingcenGao,Thanh-TungCao,Tiow-SengTan,andZhiyongHuang. Flip-flop:convexhullconstructionvia
star-shapedpolyhedronin3d.InProceedingsoftheACMSIGGRAPHSymposiumonInteractive3DGraphics
andGames,pages45–54.ACM,2013.
[8] RonaldL.Graham. Anefficientalgorithfordeterminingtheconvexhullofafiniteplanarset. Information
processingletters,1(4):132–133,1972.
[9] Ray A Jarvis. On the identification of the convex hull of a finite set of points in the plane. Information
ProcessingLetters,2(1):18–21,1973.
[10] TomaszJurkiewiczandPiotrDanilewski. Efficientquicksort and2dconvexhullforcuda, andmsimdasa
realisticmodelofmassivelyparallelcomputations,2011.
[11] Michael Kallay. The complexity of incremental convex hull algorithms in r¡ sup¿ d¡/sup¿. Information
ProcessingLetters,19(4):197,1984.
[12] RunzongLiu,BinFang,YuanYanTang,JingWen,andJiyeQian.Afastconvexhullalgorithmwithmaximum
inscribedcircleaffinetransformation. Neurocomputing,77(1):212–221,2012.
[13] AvrahamAMelkman. On-lineconstructionoftheconvexhullofasimplepolyline. InformationProcessing
Letters,25(1):11–12,1987.
[14] NVIDIA. Thrustquickstartguide,2014.
[15] Franco P. Preparataand SeJune Hong. Convex hullsof finitesets of points intwo and threedimensions.
CommunicationsoftheACM,20(2):87–93,1977.
[16] Qhull. www.qhull.org,2014.
[17] ShubhabrataSengupta,MarkHarris,YaoZhang,andJohnDOwens. Scanprimitivesforgpucomputing. In
GraphicsHardware,volume2007,pages97–106,2007.
[18] DSrikanth, KishoreKothapalli, RGovindarajulu, andPNarayanan. Parallelizingtwodimensional convex
hullonnvidiagpuandcellbe. InInternational conferenceonhighperformancecomputing(HiPC),pages
1–5,2009.
[19] Srikanth Srungarapu, Durga Prasad Reddy, Kishore Kothapalli, and PJ Narayanan. Fast two dimensional
convexhullonthegpu. InAdvancedInformationNetworkingandApplications(WAINA),pages7–12,2011.
[20] AyalStein,EranGeva,andJihadEl-Sana. Cudahull: Fastparallel3dconvexhullonthegpu. Computers&
Graphics,36(4):265–271,2012.
[21] Min Tang, Jie-yi Zhao, Ruo-feng Tong, and Dinesh Manocha. Gpu accelerated convex hull computation.
Computers&Graphics,36(5):498–506,2012.
[22] Stanley Tzeng and John D Owens. Finding convex hulls using quickhull on the gpu. arXiv preprint
arXiv:1201.2936,2012.
[23] Jeffrey M Whiteand Kevin A Wortman. Divide-and-conquer 3d convex hulls on the gpu. arXivpreprint
arXiv:1205.1171,2012.
[24] Changyuan Xing, Zhongyang Xiong, Yufang Zhang, Xuegang Wu, Jingpei Dan, andTingping Zhang. An
efficientconvexhullalgorithmusingaffinetransformationinplanarpointset. ArabianJournalforScience
andEngineering,pages1–9,2014.
10