2022 IEEE/ACM 44th International Conference on Software Engineering: Companion Proceedings (ICSE-Companion)
Is GitHub Copilot a Substitute for Human Pair-programming?
An Empirical Study

Saki Imai
[email protected]
Colby College
Waterville, Maine, USA

ABSTRACT

This empirical study investigates the effectiveness of pair programming with GitHub Copilot in comparison to human pair-programming. Through an experiment with 21 participants, we focus on code productivity and code quality. For the experimental design, a participant was given a project to code under three conditions presented in a randomized order. The conditions are pair-programming with Copilot, human pair-programming as a driver, and human pair-programming as a navigator. The code generated in the three trials was analyzed to determine how many lines of code on average were added in each condition and how many lines of code on average were removed in the subsequent stage. The former measures the productivity of each condition while the latter measures the quality of the produced code. The results suggest that although Copilot increases productivity as measured by lines of code added, the quality of the code produced is inferior, as more lines of code were deleted in the subsequent trial.

CCS CONCEPTS

• Software and its engineering → Development frameworks and environments; • Human-centered computing → Collaborative and social computing systems and tools.

KEYWORDS

GitHub, Copilot, Software Development, AI

ACM Reference Format:
Saki Imai. 2022. Is GitHub Copilot a Substitute for Human Pair-programming? An Empirical Study. In 44th International Conference on Software Engineering Companion (ICSE '22 Companion), May 21–29, 2022, Pittsburgh, PA, USA. ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/3510454.3522684

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ICSE '22 Companion, May 21–29, 2022, Pittsburgh, PA, USA
© 2022 Association for Computing Machinery.
ACM ISBN 978-1-6654-9598-1/22/05...$15.00
https://doi.org/10.1145/3510454.3522684

1 RESEARCH PROBLEM AND MOTIVATION

GitHub Copilot is a software development tool that offers code generation of lines, code chunks, or even entire programs based on existing code and comments [10]. Copilot is marketed as a substitute for pair-programming, a software development practice where two programmers collaboratively write a single piece of code.

If GitHub Copilot could produce advantages equivalent to those found in human pair-programming, then adopting the practice of pair-programming with GitHub Copilot would lead to more productive and higher quality software development without incurring the additional cost of adding a second programmer. While Vinit Shahdeo, a software engineer at Postman, said Copilot is “going to increase developer's efficiency by reducing development time and suggesting better alternatives”, technical blogger Ray Villalobos states that it is hard to get a useful result and that he needs to retype comments to get a productive piece of code [2]. Although there are claims that these AI tools make software development more productive and that they could even substitute for human pair-programmers, we have not seen an empirical study verifying whether AI tools in software development are more productive and give higher quality code. In this paper, we focus on the issue of productivity and code quality when using GitHub Copilot in software development. We designed a dedicated empirical experiment to compare AI with human participants in a natural software development environment. Through code analysis, we aim to answer our two central research questions, which focus on measuring productivity and code quality with GitHub Copilot.

2 BACKGROUND AND RELATED WORK

We recognize two major themes in the previous work in this field. The first is the use of AI in software development. Many studies have shown that AI assists with software development. For instance, one study that used a transformer-based model reported accuracy of up to 69% in predicting tokens when code tokens were masked [4]. Another study using large language models reported that AI could repair 100% of handcrafted security bugs in addition to 58% of historical bugs in open-source projects [8]. Moreover, a trained GPT language model has been shown to solve 70.2% of problems with 100 samples per problem [3], and is also capable of repairing bugs in code [9]. One study predicted defects with 87% accuracy, decreased inspection effort by 72%, and reduced post-release defects by 44% [11].

The second theme focuses on the study of software development environments, where empirical experimentation on how people write code gives us insights into how to enhance these tools and possibly discover the best practices of software development [6]. There have been studies on how professional developers comprehend software to understand how software development should be done, such as how programmers refactor while validating other programmers [7], and how implementation of task context for the Eclipse development environment improved the productivity of programmers [5]. We recognize that the use of AI tools in software development has not been studied empirically with a dedicated experiment.
Authorized licensed use limited to: UNIVERSIDADE DE SAO PAULO. Downloaded on July 01,2022 at 01:08:33 UTC from IEEE Xplore. Restrictions apply.

3 APPROACH AND UNIQUENESS

In this research, we aim to study GitHub Copilot empirically in a natural software development environment (VS Code IDE). Hence, the research questions to be addressed in this study are as follows: (RQ1) Is there an advantage in productivity while using GitHub Copilot as compared to a human pair programmer? (RQ2) What is the quality of code written with Copilot in comparison to human pair programmers?

In pair programming, two programmers collaboratively work on the same code (typically on the same computer). The programmers periodically switch between two roles, driver and navigator. The driver controls the mouse and keyboard and writes code while the navigator observes the driver's work and thinks critically about defects, structural issues, and alternative solutions, while looking at the larger picture [1].

Using GitHub Copilot as a second programmer, we compare the code produced when a participant pair programs with a human programmer versus with Copilot. Twenty-one participants who had taken at least one programming course worked on developing a text-based minesweeper game in Python. None of the participants had implemented this game before, and the participants familiarized themselves with the rules by playing the game prior to the development task. The development task was done under three conditions: pair programming with Copilot, pair programming with another human experimenter as a driver, and pair programming with another human experimenter as a navigator. The time allocated was 20 minutes for Copilot, 10 minutes as a driver, and 10 minutes as a navigator (20 minutes total with a human pair). The order of these conditions was randomized to prevent the experiment effect. During the experiment, eye movement was recorded to measure the difference between having Copilot as a collaborator in comparison to a human programmer.

The analysis of the produced code is done using the ndiff function from difflib¹. This is used to compare the number of lines added to the code and the number of lines deleted from the code after each trial, normalized by the duration of the trial.

4 RESULTS

To answer our research questions, code productivity in RQ1 is assessed by comparing the number of lines added to the code, and code quality in RQ2 is analyzed by comparing the number of lines deleted in the subsequent trial. Deletion is taken as an indication of low-quality code.

The result of RQ1 is shown in Figure 1, where we can see that the Copilot condition produced the highest maximum and mean additions to lines of code. The maximum number of lines written in the trial with Copilot was 43, while the counts for the driver and navigator conditions were 27 and 33, respectively. The minimum number of lines added was 9.5 for Copilot and 6 for both driver and navigator. These results suggest higher productivity during pair-programming with Copilot versus human pair-programmers.

Figure 2: Number of lines of code deleted in a trial subsequent to three different conditions.

To answer RQ2, we counted the number of deleted lines in the following trial and normalized the count by the trial duration. For this, the line counts for the last condition were excluded since there was no subsequent trial in which low-quality code could be removed. The maximum number of lines deleted after the Copilot trial was 42, while the lines of code deleted after the driver and navigator trials were lower, at 31 and 10, respectively. Figure 2 also shows that the deleted line count in the following trial was higher for Copilot than for the other two conditions. Hence, our result suggests that the code generated with Copilot has, on average, lower quality than that produced by human pair-programmers.
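
The line counting described in Sections 3 and 4 can be sketched as follows. This is a minimal illustration built on `difflib.ndiff`, not the study's actual analysis script: the function names `count_changes` and `per_trial_rates`, the snapshot-per-trial layout, and the choice to normalize deletions by the duration of the trial in which they occur are all assumptions made for the example.

```python
import difflib


def count_changes(before: str, after: str) -> tuple[int, int]:
    """Count lines added and deleted between two snapshots of the code."""
    added = deleted = 0
    for line in difflib.ndiff(before.splitlines(), after.splitlines()):
        if line.startswith("+ "):    # line present only in `after`
            added += 1
        elif line.startswith("- "):  # line present only in `before`
            deleted += 1
        # "  " (unchanged) and "? " (intraline hint) lines are ignored
    return added, deleted


def per_trial_rates(snapshots: list[str], minutes: list[float]) -> list[dict]:
    """Compute per-minute rates for each trial (hypothetical helper).

    snapshots[0] is the starting code and snapshots[i + 1] is the code
    after trial i; minutes[i] is the duration of trial i.
    """
    changes = [count_changes(a, b) for a, b in zip(snapshots, snapshots[1:])]
    rates = []
    for i, (added, _) in enumerate(changes):
        has_next = i + 1 < len(changes)
        rates.append({
            # Productivity proxy: lines added during this trial.
            "added_per_min": added / minutes[i],
            # Quality proxy: lines deleted in the following trial.
            # Assumption: normalized by the following trial's duration.
            # The last trial has no successor, so it is excluded (None).
            "deleted_next_per_min": (
                changes[i + 1][1] / minutes[i + 1] if has_next else None
            ),
        })
    return rates
```

Note that `ndiff` reports a modified line as one deletion plus one addition, so under this sketch an edited line contributes to both counts.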

5 CONTRIBUTIONS

Our results suggest that although programming with Copilot helps generate more lines of code than human pair-programming in the same period of time, the quality of code generated by Copilot appears to be lower. This result seems to suggest that pair-programming with Copilot does not match the profile of human pair-programming.

We are still in the process of collecting experiment data and analyzing the eye-tracking data that have been recorded throughout the experiment. With the eye-tracking data, we are trying to compare how programmers inspect code generated by AI with how they inspect code written by a human pair-programmer. Our hypothesis is that overconfidence in AI tools leads to less inspection of code generated by Copilot.

Figure 1: Number of lines added to a code under three different conditions.

¹ https://docs.python.org/3/library/difflib.html

REFERENCES

[1] Ritchie Schacher Adam Archer and Scott Will. [n.d.]. Program in Pairs. Retrieved December 31, 2021 from https://www.ibm.com/garage/method/practices/code/practice_pair_programming/
[2] Scott Carey. 2021. Developers react to GitHub Copilot. Retrieved December 31, 2021 from https://www.infoworld.com/article/3624688/developers-react-to-github-copilot.html
[3] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[4] Matteo Ciniselli, Nathan Cooper, Luca Pascarella, Antonio Mastropaolo, Emad Aghajani, Denys Poshyvanyk, Massimiliano Di Penta, and Gabriele Bavota. 2021. An Empirical Study on the Usage of Transformer Models for Code Completion. IEEE Transactions on Software Engineering (2021).
[5] Mik Kersten and Gail C Murphy. 2006. Using task context to improve programmer productivity. In Proceedings of the 14th ACM SIGSOFT international symposium on Foundations of software engineering. 1–11.
[6] Gail C Murphy, Mik Kersten, and Leah Findlater. 2006. How are Java software developers using the Eclipse IDE? IEEE software 23, 4 (2006), 76–83.
[7] Emerson Murphy-Hill, Chris Parnin, and Andrew P Black. 2011. How we refactor, and how we know it. IEEE Transactions on Software Engineering 38, 1 (2011), 5–18.
[8] Hammond Pearce, Benjamin Tan, Baleegh Ahmad, Ramesh Karri, and Brendan Dolan-Gavitt. 2021. Can OpenAI Codex and Other Large Language Models Help Us Fix Security Bugs? arXiv preprint arXiv:2112.02125 (2021).
[9] Julian Aron Prenner and Romain Robbes. 2021. Automatic Program Repair with OpenAI's Codex: Evaluating QuixBugs. arXiv preprint arXiv:2111.03922 (2021).
[10] Dominik Sobania, Martin Briesch, and Franz Rothlauf. 2021. Choose Your Programming Copilot: A Comparison of the Program Synthesis Performance of GitHub Copilot and Genetic Programming. arXiv preprint arXiv:2111.07875 (2021).
[11] Ayse Tosun, Ayse Bener, and Resat Kale. 2010. Ai-based software defect predictors: Applications and benefits in a case study. In Twenty-Second IAAI Conference.