Modeling and Analyzing Systems with
Redundancy
Kristen Gardner
CMU-CS-17-112
May 2017
School of Computer Science
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213
Thesis Committee:
Mor Harchol-Balter, Chair
Guy Blelloch
Anupam Gupta
Alan Scheller-Wolf
Rhonda Righter, UC Berkeley
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy.
Copyright © 2017 Kristen Gardner
This research was sponsored by the National Science Foundation under grant numbers CMM-1538204, CCF-1629444,
and DGE-1252522, Intel ISTC-CC, the Siebel Scholarship Foundation, the Google Women Techmakers Scholarship,
and Microsoft.
The views and conclusions contained in this document are those of the author and should not be interpreted as
representing the official policies, either expressed or implied, of any sponsoring institution, the U.S. government or
any other entity.
Keywords: redundancy, replication, task assignment, dispatching, scheduling, queueing
theory, stochastic processes, Markov models, RIQ, Redundancy-d, LRF, PF
To my family.
Abstract
Reducing latency is a primary concern in computer systems. As cloud computing
and resource sharing become more prevalent, the problem of how to reduce latency
becomes more challenging because there is a high degree of variability in server
speeds. Recent computer systems research has shown that the same job can take 12
times or even 27 times longer to run on one machine than another, due to varying
background load, garbage collection, network contention, and other factors. This
server variability is transient and unpredictable, making it hard to know how long
a job will take to run on any given server, and therefore how best to dispatch and
schedule jobs.
An increasingly popular strategy for combating server variability is redundancy.
The idea is to create multiple copies of the same job, dispatch these copies to different
servers, and wait for the first copy to complete service. A great deal of empirical
computer systems research has demonstrated the benefits of redundancy: using
redundancy can yield up to a 50% reduction in mean response time. Unfortunately,
there is very little theoretical work analyzing performance in systems with redundancy.
This thesis presents the first exact analysis of response time in systems with
redundancy. We begin in the Independent Runtimes (IR) model, in which a job’s
service times (runtimes) are assumed to be independent across servers. Here we
derive exact expressions for the distribution of response time in a certain set of
class-based redundancy systems. We also propose two new scheduling policies, Least
Redundant First (LRF) and Primaries First (PF), and prove that LRF minimizes
overall system response time, while PF is fair across classes of jobs with different
redundancy degrees.
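As a minimal illustration of the IR assumption (not the exact analysis developed in this thesis), the sketch below draws each copy's runtime independently and takes the minimum, which is the time until the first copy completes. The exponential runtime distribution, the function names, and the parameters `rate`, `samples`, and `seed` are assumptions made for illustration only. The sketch considers a single job in isolation and ignores queueing entirely.

```python
import random


def ir_service_time(rng, d, rate=1.0):
    """Service time of one job with d copies under the IR assumption.

    Each copy's runtime is an independent Exponential(rate) draw (an
    illustrative choice); the job is done when the first copy finishes.
    """
    return min(rng.expovariate(rate) for _ in range(d))


def mean_ir_service_time(d, samples=20000, seed=0):
    """Monte Carlo estimate of the mean service time with d copies."""
    rng = random.Random(seed)
    total = sum(ir_service_time(rng, d) for _ in range(samples))
    return total / samples


if __name__ == "__main__":
    # With independent exponential runtimes, the minimum of d copies has
    # mean 1 / (d * rate), so the estimate should shrink as d grows.
    for d in (1, 2, 4):
        print(d, round(mean_ir_service_time(d), 3))
```

Because the draws are independent, every additional copy can only shorten the time to first completion in this sketch.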
While the IR model is appropriate in certain settings, in others it does not make
sense because the independence assumption eliminates any notion of an “inherent job
size.” The IR model leads to the conclusion that more redundancy is always better,
which often is not true in practice. Therefore we propose the S&X model, which is
the first model to decouple a job’s inherent size (X) from the server slowdown (S).
This model is important because, unlike prior models, it allows a job’s runtimes to be
correlated across servers. The S&X model makes it evident that redundancy does not
always help: in fact, too much redundancy can lead to instability. To overcome this,
we propose a new dispatching policy, Redundant-to-Idle-Queue, which is provably
stable in the S&X model, while offering substantial response time improvements
compared to systems without redundancy.
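The decoupling of size and slowdown can be sketched in a few lines. The sketch assumes that a copy's runtime on a server is the product of the job's inherent size X and that server's slowdown S; the abstract does not state this form, so treat it as an assumption. It also assumes that all copies start at the same moment and are cancelled when the first one completes. The function names are hypothetical.

```python
def sx_copy_runtimes(size, slowdowns):
    """Runtime of each copy: inherent size X scaled by that server's slowdown S.

    The product form S * X is an assumption of this sketch.
    """
    return [s * size for s in slowdowns]


def sx_job_runtime(size, slowdowns):
    """The job finishes when its fastest copy finishes."""
    return min(sx_copy_runtimes(size, slowdowns))


def server_time_used(size, slowdowns):
    """Total server time consumed by one job.

    Assumes every copy starts at the same moment and is cancelled as soon
    as the first copy completes, so each of the d servers is busy for the
    job's runtime.
    """
    return len(slowdowns) * sx_job_runtime(size, slowdowns)


if __name__ == "__main__":
    size = 2.0                # inherent job size X (illustrative value)
    slowdowns = [1.5, 3.0]    # slowdown S on each chosen server (illustrative)
    print(sx_copy_runtimes(size, slowdowns))
    print(sx_job_runtime(size, slowdowns))
    print(server_time_used(size, slowdowns))
```

Since every copy shares the same X, the runtimes are correlated across servers: a large job is slow everywhere. Under the assumptions above, each extra copy ties up another server for the job's full runtime, which gives some intuition for why too much redundancy can overload the system.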
Acknowledgments
So many people have positively influenced my intellectual, professional, and
personal growth during my time as a graduate student. First and foremost, I want to
thank Mor Harchol-Balter, who has been an exceptional advisor for the past five years.
Mor’s enthusiasm for research is unparalleled; I owe a great deal of my intellectual
development to her constant willingness to schedule another meeting, try a new proof
approach, and when all else fails find a new problem to study. I am equally grateful
to Mor for her commitment to promoting my professional development, both by
always looking out for opportunities for me to give talks and apply for awards, and
by offering wide-ranging advice on almost every topic imaginable.
Alan Scheller-Wolf has effectively been a second advisor to me. In addition to
always having a new research idea to explore and providing valuable feedback on
my writing, Alan has been endlessly supportive of my career goals and has been
an outstanding role model for maintaining balance in an academic career. Thanks
to Rhonda Righter for being a wonderful host during my visit to Berkeley and for
continuing to be my collaborator and mentor. I am particularly grateful to Rhonda
for her persistence when trying to prove tricky results, especially when the proof
turns out to require Poisson arrivals after all. I also would like to thank the rest of my
thesis committee, Guy Blelloch and Anupam Gupta, for their thoughtful questions
and feedback on my work.
Over the course of my graduate career, I have been fortunate to work with a
number of collaborators who became my co-authors on the work presented in this
thesis: Sherwin Doroudi, Esa Hyytiä, Benny van Houdt, Mark Velednitsky, and Sam
Zbarsky. Thanks also to Evan Cavallo, Jan-Pieter Dorsman, Brandon Lieberthal, John
Wright, Danny Zhu, and Timmy Zhu for the many hours of discussion and many lines
of code they contributed to my work. Thanks to Sem Borst for hosting my visit to
Bell Labs in summer 2013; I very much enjoyed our collaboration and learned a great
deal from Sem during that summer.
Outside of research, a great many people in the School of Computer Science have
played an important role in making my time at CMU the positive experience that it
has been. Women@SCS has been an important source of community for me, and I
would like to thank Carol Frieze and Mary Widom for everything that they do for
the organization and for supporting and encouraging me throughout my time as a
student. Thanks also to the wonderful administrative staff in the Computer Science
Department, in particular Deb Cavlovich, Nancy Conway, and Catherine Copetas, for
keeping the department running and handling every problem and question that arises,
no matter how bizarre.
Finally, I have truly enjoyed my years at CMU, and for this I credit my “moral
support committee”: my parents and sister, Wendy, Tom, and Allie Gardner; and all
the friends who have shared my experiences here. Thanks for musicals and knitting,
for kayaking and skating, for humoring my love of Harry Potter and chocolate cake,
for Hanabi, and for all those hours of Avalon. You have made CMU a home for me.
Contents
1 Introduction
1.1 System Model
1.2 Policy Design and Theoretical Modeling Assumptions
1.3 Thesis Statement
1.4 Contributions
1.5 Outline
2 Related Work
2.1 Applications of Redundancy
2.2 Analyzing Redundancy Systems
2.3 Non-redundancy Systems
3 First Exact Analysis
3.1 Introduction
3.2 Model and Limiting Distribution
3.3 The N Model
3.4 The W Model
3.5 The M Model
3.6 Scale
3.7 Relaxing Assumptions
3.8 Discussion and Conclusion
4 Scheduling in Redundancy Systems
4.1 Introduction
4.2 Model
4.3 First-Come First-Served with Redundancy (FCFS-R)
4.4 Least Redundant First (LRF)
4.5 Primaries First (PF)
4.6 Discussion and Conclusion
5 Redundancy-d: Scaling to Large Systems
5.1 Introduction
5.2 Model
5.3 Markov Chain Analysis
5.4 Large System Limit Analysis
5.5 Power of d Choices
5.6 Discussion and Conclusion
6 S&X: A Better Model for Redundancy
6.1 Introduction
6.2 The S&X Model
6.3 Redundancy-d in the S&X Model
6.4 Redundant-to-Idle-Queue
6.5 Results: d ≪ k
6.6 Results: High d
6.7 Improving Redundancy Policies
6.8 Cancel-on-Start
6.9 Discussion and Conclusion
7 Conclusion
7.1 Lessons Learned about Redundancy
7.2 Open Problems
Bibliography