Achieving Privacy in the Adversarial Multi-Armed Bandit

Aristide C. Y. Tossou
Chalmers University of Technology
Gothenburg, Sweden
[email protected]

Christos Dimitrakakis
University of Lille, France
Chalmers University of Technology, Sweden
Harvard University, USA
[email protected]
Abstract

In this paper, we improve the previously best known regret bound to achieve ε-differential privacy in oblivious adversarial bandits from O(T^{2/3}/ε) to O(√T ln T/ε). This is achieved by combining a Laplace Mechanism with EXP3. We show that though EXP3 is already differentially private, it leaks a linear amount of information in T. However, we can improve this privacy by relying on its intrinsic exponential mechanism for selecting actions. This allows us to reach O(√(ln T))-DP, with a regret of O(T^{2/3}) that holds against an adaptive adversary, an improvement from the best known of O(T^{3/4}). This is done by using an algorithm that runs EXP3 in a mini-batch loop. Finally, we run experiments that clearly demonstrate the validity of our theoretical analysis.

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

We consider multi-armed bandit problems in the adversarial setting, whereby an agent selects one from a number of alternatives (called arms) at each round and receives a gain that depends on its choice. The agent's goal is to maximize its total gain over time. There are two main settings for the bandit problem. In the stochastic one, the gains of each arm are generated i.i.d. by some unknown probability law. In the adversarial setting, which is the focus of this paper, the gains are generated adversarially. We are interested in finding algorithms with a total gain over T rounds not much smaller than that of an oracle with additional knowledge about the problem. In both settings, algorithms that achieve the optimal (problem-independent) regret bound of O(√T) are known (Auer, Cesa-Bianchi, and Fischer 2002; Burnetas and Katehakis 1996; Pandey and Olston 2006; Thompson 1933; Auer et al. 2003; Auer 2002; Agrawal and Goyal 2012).

This problem is a model for many applications where there is a need for trading off exploration and exploitation. This is so because, whenever we make a choice, we only observe the gain generated by that choice, and not the gains that we could have obtained otherwise. An example is clinical trials, where arms correspond to different treatments or tests, and the goal is to maximize the number of cured patients over time while being uncertain about the effects of treatments. Other problems, such as search engine advertisement and movie recommendations, can be formalized similarly (Pandey and Olston 2006).

Privacy can be a serious issue in the bandit setting (cf. (Jain, Kothari, and Thakurta 2012; Thakurta and Smith 2013; Mishra and Thakurta 2015; Zhao et al. 2014)). For example, in clinical trials, we may want to detect and publish results about the best drug without leaking sensitive information, such as the patient's health condition and genome. Differential privacy (Dwork 2006) formally bounds the amount of information that a third party can learn no matter their power or side information.

Differential privacy has been used before in the stochastic setting (Tossou and Dimitrakakis 2016; Mishra and Thakurta 2015; Jain, Kothari, and Thakurta 2012), where the authors obtain optimal algorithms up to logarithmic factors. In the adversarial setting, (Thakurta and Smith 2013) adapts an algorithm called Follow The Approximate Leader to make it private and obtains a regret bound of O(T^{2/3}). In this work, we show that a number of simple algorithms can satisfy privacy guarantees, while achieving nearly optimal regret (up to logarithmic factors) that scales naturally with the level of privacy desired.

Our work is also of independent interest for non-private multi-armed bandit algorithms, as they are competitive with the current state of the art against switching-cost adversaries (where we recover the optimal bound). Finally, we provide rigorous empirical results against a variety of adversaries.

The following section gives the main background and notations. Section 3.1 describes meta-algorithms that perturb the gain sequence to achieve privacy, while Section 3.2 explains how to leverage the privacy inherent in the EXP3 algorithm by modifying the way gains are used. Section 4 compares our algorithms with EXP3 in a variety of settings. The full proofs of all our main results are in the full version.

2 Preliminaries

2.1 The Multi-Armed Bandit problem

Formally, a bandit game is defined between an adversary and an agent as follows: there is a set of K arms A, and at each round t, the agent plays an arm I_t ∈ A. Given the choice I_t, the adversary grants the agent a gain g_{I_t,t} ∈ [0,1]. The agent only observes the gain of arm I_t, and not that of any other arms. The goal of this agent is to maximize its total gain after T rounds, Σ_{t=1}^{T} g_{I_t,t}.
A randomized bandit algorithm Λ : (A × [0,1])^* → D(A) maps every arm-gain history to a distribution over the next arm to take.

The nature of the adversary, and specifically how the gains are generated, determines the nature of the game. For the stochastic adversary (Thompson 1933; Auer, Cesa-Bianchi, and Fischer 2002), the gain obtained at round t is generated i.i.d. from a distribution P_{I_t}. The more general fully oblivious adversary (Audibert and Bubeck 2010) generates the gains independently at round t, but not necessarily identically, from a distribution P_{I_t,t}. Finally, we have the oblivious adversary (Auer et al. 2003), whose only constraint is to generate the gain g_{I_t,t} as a function of the current action I_t only, i.e. ignoring previous actions and gains.

While focusing on oblivious adversaries, we discovered that by targeting differential privacy we can also compete against the stronger m-bounded memory adaptive adversary (Cesa-Bianchi, Dekel, and Shamir 2013; Merhav et al. 2002; Dekel, Tewari, and Arora 2012), who can use up to the last m gains. The oblivious adversary is a special case with m = 0. Another special case of this adversary is the one with switching costs, who penalises the agent whenever he switches arms, by giving the lowest possible gain of 0 (here m = 1).

Regret. Relying on the cumulative gain of an agent to evaluate its performance can be misleading. Indeed, consider the case where an adversary gives a zero gain for all arms at every round. The cumulative gain of the agent would look bad, but no other agent could have done better. This is why one compares the gap between the agent's cumulative gain and the one obtained by some hypothetical agent, called an oracle, with additional information or computational power. This gap is called the regret.

There are also variants of the oracle considered in the literature. The most common variant is the fixed oracle, which always plays the best fixed arm in hindsight. The regret R against this oracle is:

  R = max_{i=1,...,K} Σ_{t=1}^{T} g_{i,t} − Σ_{t=1}^{T} g_{I_t,t}

In practice, we either prove a high-probability bound on R or a bound on the expected value E R with:

  E R = E[ max_{i=1,...,K} Σ_{t=1}^{T} g_{i,t} − Σ_{t=1}^{T} g_{I_t,t} ]

where the expectation is taken with respect to the random choices of both the agent and the adversary. There are other oracles, like the shifting oracle, but those are out of the scope of this paper.

EXP3. The Exponential-weight algorithm for Exploration and Exploitation (EXP3 (Auer et al. 2003)) achieves the optimal bound (up to logarithmic factors) of O(√(TK ln K)) for the weak regret (i.e. the expected regret compared to the fixed oracle) against an oblivious adversary. EXP3 simply maintains an estimate G̃_{i,t} of the cumulative gain of arm i up to round t, with G̃_{i,t} = Σ_{s=1}^{t} (g_{i,s}/p_{i,s}) 1{I_s = i}, where

  p_{i,t} = (1 − γ) exp((γ/K) G̃_{i,t}) / Σ_{j=1}^{K} exp((γ/K) G̃_{j,t}) + γ/K    (2.1)

with γ a well-defined constant. Finally, EXP3 plays one action randomly according to the probability distribution p_t = {p_{1,t}, ..., p_{K,t}}, with p_{i,t} as defined above.

2.2 Differential Privacy

The following definition (from (Tossou and Dimitrakakis 2016)) specifies what is meant when we call a bandit algorithm differentially private at a single round t:

Definition 2.1 (Single round (ε,δ)-differentially private bandit algorithm). A randomized bandit algorithm Λ is (ε,δ)-differentially private at round t, if for all sequences g_{1:t−1} and g′_{1:t−1} that differ in at most one round, we have for any action subset S ⊆ A:

  P_Λ(I_t ∈ S | g_{1:t−1}) ≤ δ + P_Λ(I_t ∈ S | g′_{1:t−1}) e^ε,    (2.2)

where P_Λ denotes the probability distribution specified by the algorithm and g_{1:t−1} = {g_1, ..., g_{t−1}}, with g_s the gains of all arms at round s. When δ = 0, the algorithm is said to be ε-differentially private.

The ε and δ parameters quantify the amount of privacy loss. Lower (ε,δ) indicate higher privacy, and consequently we will also refer to (ε,δ) as the privacy loss. Definition 2.1 means that the output of the bandit algorithm at round t is almost insensitive to any single change in the gains sequence. This implies that whether we remove a single round or replace its gains, the bandit algorithm will still play almost the same action. Assuming the gains at round t are linked to a user's private data (for example his cancer status or the advertisement he clicked), the definition preserves the privacy of that user against any third party looking at the output. This is the case because the choices or the participation of that user would almost not affect the output. Equation (2.2) specifies how much the output is affected by a single user.

We would like Definition 2.1 to hold for all rounds, so as to protect the privacy of all users. If it does for some (ε,δ), then we say the algorithm has per-round or instantaneous privacy loss (ε,δ). Such an algorithm also has a cumulative privacy loss of at most (ε′,δ′), with ε′ = εT and δ′ = δT after T steps. Our goal is to design bandit algorithms such that their cumulative privacy loss (ε′,δ′) is as low as possible while simultaneously achieving a very low regret. In practice, we would like ε′ and the regret to be sub-linear, while δ′ should be a very small quantity. Definition 2.2 formalizes clearly the meaning of this cumulative privacy loss and, for ease of presentation, we will omit the term "cumulative" when referring to it.

Definition 2.2 ((ε,δ)-differentially private bandit algorithm). A randomized bandit algorithm Λ is (ε,δ)-differentially private up to round t, if for all g_{1:t−1} and g′_{1:t−1} that differ in at most one round, we have for any action subset S ⊆ A^t:

  P_Λ(I_{1:t} ∈ S | g_{1:t−1}) ≤ δ + P_Λ(I_{1:t} ∈ S | g′_{1:t−1}) e^ε,    (2.3)

where P_Λ and g are as defined in Definition 2.1.
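As a concrete illustration, the sampling rule of eq. (2.1) together with the importance-weighted update of G̃ can be sketched as follows. This is a minimal sketch, not the authors' code; the `gains` callback is a stand-in for the adversary:

```python
import math
import random

def exp3(T, K, gains, gamma):
    """Minimal EXP3 sketch: exponential weights over the
    importance-weighted gain estimates G~, mixed with uniform
    exploration as in eq. (2.1)."""
    G = [0.0] * K                       # estimated cumulative gains G~_i
    total = 0.0
    for t in range(T):
        m = max(G)                      # subtract max for numerical stability
        w = [math.exp(gamma / K * (G[i] - m)) for i in range(K)]
        s = sum(w)
        p = [(1 - gamma) * w[i] / s + gamma / K for i in range(K)]
        arm = random.choices(range(K), weights=p)[0]
        g = gains(arm, t)               # only the played arm's gain is observed
        total += g
        G[arm] += g / p[arm]            # unbiased importance-weighted update
    return total
```

A standard choice for the constant, also used later by Algorithm 1, is γ = √(K ln K / ((e − 1)T)).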
Most of the time, we will refer to Definition 2.2; whenever we need to use Definition 2.1, this will be made explicit.

The simplest mechanism to achieve differential privacy for a function is to add Laplace noise of scale proportional to its sensitivity. The sensitivity is the maximum amount by which the value of the function can change if we change a single element in the input sequence. For example, if the input is a stream of numbers in [0,1] and the function their sum, we can add Laplace noise of scale 1/ε to each number and achieve ε-differential privacy with an error of O(√T/ε) in the sum. However, (Chan, Shi, and Song 2010) introduced the Hybrid Mechanism, which achieves ε-differential privacy with only poly-logarithmic error (with respect to the true sum). The idea is to group the stream of numbers in a binary tree and only add Laplace noise at the nodes of the tree.

As demonstrated above, the main challenge with differential privacy is thus to optimally trade off privacy and utility.

Notation. In this paper, i will be used as an index for an arbitrary arm in [1, K], while k will be used to indicate the optimal arm. I_t is the arm played by an agent at round t, and g_{i,t} is the gain of the i-th arm at round t. R_Λ(T) is the regret of algorithm Λ after T rounds; the index and T are dropped when clear from the context. Unless otherwise specified, the regret is defined for oblivious adversaries against the fixed oracle. We use "x ∼ P" to denote that x is generated from distribution P. Lap(λ) denotes the Laplace distribution with scale λ, while Bern(p) denotes the Bernoulli distribution with parameter p.

3 Algorithms and Analysis

3.1 DP-Λ-Lap: Differential privacy through additional noise

We start by showing that the obvious technique to achieve a given ε-differential privacy in adversarial bandits already beats the state of the art. The main idea is to use any base bandit algorithm Λ as input and add Laplace noise of scale 1/ε to each gain before Λ observes it. This technique gives ε-differential privacy, as the gains are bounded in [0,1] and the noises are added i.i.d. at each round.

However, bandit algorithms require bounded gains, while the noisy gains are not bounded. The trick is to ignore rounds where the noisy gain falls outside an interval of the form [−b, b+1]. We pick the threshold b such that, with high probability, the noisy gains will be inside the interval [−b, b+1]. More precisely, b can be chosen such that, with high probability, the number of rounds ignored is lower than the upper bound R_Λ on the regret of Λ. Given that in the standard bandit problem the gains are bounded in [0,1], the gains at accepted rounds are rescaled back to [0,1].

Theorem 3.2 shows that all these operations still preserve ε-DP, while Theorem 3.1 demonstrates that the upper bound on the expected regret of DP-Λ-Lap adds some small additional terms to R_Λ. To illustrate how small those additional terms are, we instantiate DP-Λ-Lap with the EXP3 algorithm. This leads to a mechanism called DP-EXP3-Lap, described in Algorithm 1. With a carefully chosen threshold b, Corollary 3.1 implies that the additional terms are such that the expected regret of DP-EXP3-Lap is O(√T ln T/ε), which is optimal in T up to some logarithmic factors. This result is a significant improvement over the best known bound so far of O(T^{2/3}/ε) from (Thakurta and Smith 2013) and simultaneously solves the challenge (whether or not one can get an ε-DP mechanism with optimal regret) posed by the authors.

Algorithm 1 DP-EXP3-Lap
  Let G̃_i = 0 for all arms, b = ln(T)/ε and γ = √(K ln K / ((e − 1)T))
  for each round t = 1, ..., T do
    Compute the probability distribution p over the arms, with p = (p_{1,t}, ..., p_{K,t}) and p_{i,t} as in eq. (2.1).
    Draw an arm I_t from the probability distribution p.
    Receive the reward g_{I_t,t}.
    Let the noisy gain be g′_{I_t,t} = g_{I_t,t} + N_{I_t,t}, with N_{I_t,t} ∼ Lap(1/ε).
    if g′_{I_t,t} ∈ [−b, b+1] then
      Scale g′_{I_t,t} to [0,1].
      Update the estimated cumulative gain of arm I_t: G̃_{I_t} = G̃_{I_t} + g′_{I_t,t}/p_{I_t,t}
    end if
  end for

Theorem 3.1. If DP-Λ-Lap is run with input a base bandit algorithm Λ, the noisy reward g′_{I_t,t} of the true reward g_{I_t,t} set to g′_{I_t,t} = g_{I_t,t} + N_{I_t,t} with N_{I_t,t} ∼ Lap(1/ε), the acceptance interval set to [−b, b+1], and the scaling of the rewards g′_{I_t,t} outside [0,1] done using g′_{I_t,t} = (g′_{I_t,t} + b)/(2b + 1); then the regret R_{DP-Λ-Lap} of DP-Λ-Lap satisfies:

  E R_{DP-Λ-Lap} ≤ E R_Λ^{scaled} + 2TK exp(−εb) + √(32T)/ε    (3.1)

where R_Λ^{scaled} is the upper bound on the regret of Λ when the rewards are scaled from [−b, b+1] to [0,1].

Proof Sketch. We observe that DP-Λ-Lap is an instance of Λ run with the noisy rewards g′ instead of g. This means R_Λ^{scaled} is an upper bound on the regret L on g′. Then, we derive a lower bound on L showing how close it is to R_{DP-Λ-Lap}. This allows us to conclude.

Corollary 3.1. If DP-Λ-Lap is run with EXP3 as its base algorithm and b = ln(T)/ε, then its expected regret E R_{DP-EXP3-Lap} satisfies

  E R_{DP-EXP3-Lap} ≤ (4 ln T/ε) √((e − 1)TK ln K) + 2K + √(32T)/ε

Proof. The proof comes by combining the regret of EXP3 (Auer et al. 2003) with Theorem 3.1.
Theorem 3.2. DP-Λ-Lap is ε-differentially private up to round T.

Proof Sketch. Combining the privacy of the Laplace Mechanism with the parallel composition (McSherry 2009) and post-processing theorems (Dwork and Roth 2013) concludes the proof.

3.2 Leveraging the inherent privacy of EXP3

On the differential privacy of EXP3. (Dwork and Roth 2013) shows that a variation of EXP3 for the full-information setting (where the agent observes the gain of all arms at any round, regardless of what he played) is already differentially private. Their results imply that one can achieve the optimal regret with only a sub-logarithmic privacy loss (O(√(128 log T))) after T rounds.

We start this section by showing a similar result for EXP3 in Theorem 3.3. Indeed, we show that EXP3 is already differentially private, but with a per-round privacy loss of 2 (assuming we want a sub-linear regret; see Theorem 3.3). Our results imply that EXP3 can achieve the optimal regret, albeit with a linear privacy loss of O(2T)-DP after T rounds. This is a huge gap compared with the full-information setting, and it underlines the significance of our result in Section 3.1, where we describe a concrete algorithm demonstrating that the optimal regret can be achieved with only a logarithmic privacy loss after T rounds.

Theorem 3.3. The EXP3 algorithm is

  min{ 2T,  T ln((K(1 − γ) + γ)/γ),  2(1 − γ)T + 2√(2T ln T) }

differentially private up to round T.

In practice, we also want EXP3 to have a sub-linear regret. This implies that γ ≪ 1, and EXP3 is then simply 2T-DP over T rounds.

Proof Sketch. The first two terms in the theorem come from the observation that EXP3 is a combination of two mechanisms: the Exponential Mechanism (McSherry and Talwar 2007) and a randomized response. The last term comes from the observation that with probability γ we enjoy a perfect 0-DP. Then, we use Chernoff to bound with high probability the number of times we suffer a non-zero privacy loss.

We will now show that the privacy of EXP3 itself may be improved without any additional noise, and with only a moderate impact on the regret.

On the privacy of an EXP3 wrapper algorithm. The previous paragraph leads to the conclusion that it is impossible to obtain a sub-linear privacy loss with a sub-linear regret while using the original EXP3. Here, we will prove that an existing technique already achieves this goal. The algorithm, which we call EXP3τ, is from (Dekel, Tewari, and Arora 2012). It groups the rounds into disjoint intervals of fixed size τ, where the j-th interval starts on round (j − 1)τ + 1 and ends on round jτ. At the beginning of interval j, EXP3τ receives an action from EXP3 and plays it for τ rounds. During that time, EXP3 does not observe any feedback. At the end of the interval, EXP3τ feeds EXP3 with a single gain, the average gain received during the interval.

Theorem 3.4, borrowed from (Dekel, Tewari, and Arora 2012), specifies the upper bound on the regret of EXP3τ. It is remarkable that this bound holds against the m-memory bounded adaptive adversary. While Theorem 3.5 shows the privacy loss enjoyed by this algorithm, one gets a better intuition of how good those results are from Corollaries 3.2 and 3.3. Indeed, we can observe that EXP3τ achieves a sub-logarithmic privacy loss of O(√(ln T)) with a regret of O(T^{2/3}) against a special case of the m-memory bounded adaptive adversary called the switching costs adversary, for which m = 1. This is the optimal regret bound (in the sense that there is a matching lower bound (Dekel et al. 2014)). This means that in some sense we are getting privacy for free against this adversary.

Theorem 3.4 (Regret of EXP3τ (Dekel, Tewari, and Arora 2012)). The expected regret of EXP3τ is upper bounded by:

  √(7TτK ln K) + Tm/τ + τ

against the m-memory bounded adaptive adversary, for any m < τ.

Theorem 3.5 (Privacy loss of EXP3τ). EXP3τ is (4T/τ³ + √(8 ln(1/δ′) T/τ³), δ′)-DP up to round T.

Proof. The sensitivity of each gain is now 1/τ, as we are using the average. Combined with Theorem 3.3, this means the per-round privacy loss is 2/τ. Given that EXP3 only observes T/τ rounds, using the advanced composition theorem (Dwork, Rothblum, and Vadhan 2010) (Theorem III.3) concludes the final privacy loss over T rounds.

Corollary 3.2. EXP3τ run with τ = (7K log K)^{−1/3} T^{1/3} is (ε, δ′)-differentially private up to round T, with δ′ = T^{−2} and ε = 28K ln K + √(112K ln K ln T). Its expected regret against the switching costs adversary is upper bounded by 2(7K ln K)^{1/3} T^{2/3} + (7K log K)^{−1/3} T^{1/3}.

Proof. The proof is immediate by replacing τ and δ′ in Theorems 3.4 and 3.5, and by the fact that for the switching costs adversary m = 1.

Corollary 3.3. EXP3τ run with τ = ((4Tε + 2T ln(1/δ))/ε²)^{1/3} is (ε, δ)-differentially private, and its expected regret against the switching costs adversary is upper bounded by:

  O( T^{2/3} √(K ln K) (√(ln(1/δ))/ε)^{1/3} )

4 Experiments

We tested DP-EXP3-Lap and EXP3τ, together with the non-private EXP3, against a few different adversaries. The privacy parameter ε of DP-EXP3-Lap is set as defined in Corollary 3.2. This is done so that the regret of DP-EXP3-Lap and EXP3τ are compared at the same privacy level.
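Before turning to the empirical results, the EXP3τ wrapper of Section 3.2 can be sketched as follows (a minimal sketch with a stand-in `reward` callback; the inner learner is the EXP3 rule of eq. (2.1), fed only the per-interval average gain):

```python
import math
import random

def exp3_tau(T, K, reward, tau):
    """EXP3_tau sketch: commit to one EXP3 action for tau
    consecutive rounds, then update EXP3 with the *average*
    gain of the interval. Averaging drops the sensitivity of
    each fed-back gain to 1/tau, which is what improves privacy."""
    n = T // tau                        # number of EXP3 interactions
    gamma = math.sqrt(K * math.log(K) / ((math.e - 1) * n))
    G = [0.0] * K
    total = 0.0
    for j in range(n):
        m = max(G)
        w = [math.exp(gamma / K * (G[i] - m)) for i in range(K)]
        s = sum(w)
        p = [(1 - gamma) * w[i] / s + gamma / K for i in range(K)]
        arm = random.choices(range(K), weights=p)[0]
        batch = [reward(arm, j * tau + k) for k in range(tau)]
        total += sum(batch)
        G[arm] += (sum(batch) / tau) / p[arm]   # single averaged feedback
    return total
```

Since the wrapped EXP3 interacts only T/τ times with gains of sensitivity 1/τ, the privacy loss of Theorem 3.5 follows by advanced composition.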
All the other parameters of DP-EXP3-Lap are taken as defined in Corollary 3.1, while the parameters of EXP3τ are taken as defined in Corollary 3.2.

For all experiments, the horizon is T = 2^{18} and the number of arms is K = 4. We performed 720 independent trials and reported the median-of-means estimator (used heavily in the streaming literature (Alon, Matias, and Szegedy 1996)) of the cumulative regret. It partitions the trials into a₀ equal groups and returns the median of the sample means of each group. Proposition 4.1 is a well-known result (also in (Hsu and Sabato 2013; Lerasle and Oliveira 2011)) giving the accuracy of this estimator. Its convergence is O(σ/√N), with exponential probability tails, even though the random variable x may have heavy tails. In comparison, the empirical mean cannot provide such a guarantee for any σ > 0 and confidence in [0, 1/(2e)] (Catoni 2012).

Proposition 4.1. Let x be a random variable with mean µ and variance σ² < ∞. Assume that we have N independent samples of x, and let µ̂ be the median-of-means computed using a₀ groups. With probability at least 1 − e^{−a₀/4.5}, µ̂ satisfies |µ̂ − µ| ≤ σ√(6a₀/N).

We set the number of groups to a₀ = 24, so that the confidence interval holds with probability at least 0.995.

We also reported the deviation of each algorithm using Gini's Mean Difference (GMD hereafter) (Gini and Pearson 1912). GMD computes the deviation as Σ_{j=1}^{N} (2j − N − 1) x_{(j)}, with x_{(j)} the j-th order statistic of the sample (that is, x_{(1)} ≤ x_{(2)} ≤ ... ≤ x_{(N)}). As shown in (Yitzhaki and others 2003; David 1968), the GMD provides a superior approximation of the true deviation than the standard one. To account for the fact that the cumulative regret of our algorithms might not follow a symmetric distribution, we computed the GMD separately for the values above and below the median-of-means.

At round t, we computed the cumulative regret against the fixed oracle who plays the best arm assuming that the end of the game is at t. The oracle uses the actual sequence of gains to decide its best arm. For a given trial, we make sure that all algorithms are playing the same game by generating the gains for all possible round-arm pairs before the game starts.

Deterministic adversary. As shown by (Audibert and Bubeck 2010), the expected regret of any agent against an oblivious adversary cannot be worse than that against the worst-case deterministic adversary. In this experiment, arm 2 is the best and gives 1 for every even round. To trick the players into picking the wrong arms, the first arm always gives 0.38, whereas the third gives 1 for every round multiple of 3. The remaining arms always give 0. As shown by the figure, this simple adversary is already powerful enough to make the algorithms attain their upper bound.

Stochastic adversary. This adversary draws the gains of the first arm i.i.d. from Bern(0.55), whereas all other gains are drawn i.i.d. from Bern(0.5).

Fully oblivious adversary. For the best arm k, it first draws a number p uniformly in [0.5, 0.5 + 2ε] and generates the gain g_{k,t} ∼ Bern(p). For all other arms, p is drawn from [0.5 − ε, 0.5 + ε]. This process is repeated at every round. In our experiments, ε = 0.05.

An oblivious adversary. This adversary is identical to the fully oblivious one for every round multiple of 200. Between two multiples of 200, the last gain of the arm is given.

The switching costs adversary. This adversary (defined at Figure 1 in (Dekel et al. 2014)) defines a stochastic process (including the simple Gaussian random walk as a special case) for generating the gains. It was used to prove that any algorithm against this adversary must incur a regret of O(T^{2/3}).

Discussion. Figure 1 shows our results against a variety of adversaries, with respect to a fixed oracle. Overall, the performance (in terms of regret) of DP-EXP3-Lap is very competitive against that of EXP3, while providing significantly better privacy. This means that DP-EXP3-Lap allows us to get privacy for free in the bandit setting against an adversary not more powerful than the oblivious one.

The performance of EXP3τ is worse than that of DP-EXP3-Lap against an oblivious adversary or one less powerful. However, the situation is completely reversed against the more powerful switching costs adversary. In that setting, EXP3τ outperforms both EXP3 and DP-EXP3-Lap, confirming the theoretical analysis. We can see EXP3τ as the algorithm providing us privacy for free against the switching costs adversary, and the adaptive m-bounded memory one in general.

5 Conclusion

We have provided the first results on differentially private adversarial multi-armed bandits, which are optimal up to logarithmic factors. One open question is how differential privacy affects regret in the full reinforcement learning problem. At this point in time, the only known results in the MDP setting obtain differentially private algorithms for Monte Carlo policy evaluation (Balle, Gomrokchi, and Precup 2016). While this implies that it is possible to obtain policy iteration algorithms, it is unclear how to extend this to the full online reinforcement learning problem.

Acknowledgements. This research was supported by the SNSF grants "Adaptive control with approximate Bayesian computation and differential privacy" and "Swiss Sense Synergy", by the Marie Curie Actions (REA 608743), the Future of Life Institute "Mechanism Design for AI Architectures" and the CNRS Specific Action on Security.
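For completeness, the two reporting estimators of Section 4 — the median-of-means of Proposition 4.1 and Gini's Mean Difference — can be sketched as follows (a minimal sketch, not the authors' code):

```python
import statistics

def median_of_means(xs, a0):
    """Partition the N samples into a0 equal groups and return
    the median of the group means (Proposition 4.1)."""
    n = len(xs) // a0
    means = [sum(xs[i * n:(i + 1) * n]) / n for i in range(a0)]
    return statistics.median(means)

def gini_mean_difference(xs):
    """Deviation as sum_j (2j - N - 1) x_(j) over the order
    statistics x_(1) <= ... <= x_(N), as used in Section 4.
    (The classical mean-difference normalizes this raw sum by
    the N(N-1)/2 pairs.)"""
    s = sorted(xs)
    N = len(s)
    return sum((2 * (j + 1) - N - 1) * s[j] for j in range(N))
```

As in the paper's setup, one would call `median_of_means` on the 720 per-trial cumulative regrets with a0 = 24 groups, and `gini_mean_difference` separately on the values above and below that estimate.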
[Figure 1: Regret and error bar against five different adversaries, with respect to the fixed oracle. Panels: (a) Deterministic, (b) Stochastic, (c) Fully Oblivious, (d) Oblivious, (e) Switching costs. Each panel plots the cumulative regret of EXP3, DP-EXP3-Lap and EXP3τ against the time step.]
References

[Agrawal and Goyal 2012] Agrawal, S., and Goyal, N. 2012. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012.

[Alon, Matias, and Szegedy 1996] Alon, N.; Matias, Y.; and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In 28th STOC, 20–29. ACM.

[Audibert and Bubeck 2010] Audibert, J.-Y., and Bubeck, S. 2010. Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res. 11:2785–2836.

[Auer et al. 2003] Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2003. The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.

[Auer, Cesa-Bianchi, and Fischer 2002] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite time analysis of the multiarmed bandit problem. Machine Learning 47(2/3):235–256.

[Auer 2002] Auer, P. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3:397–422.

[Balle, Gomrokchi, and Precup 2016] Balle, B.; Gomrokchi, M.; and Precup, D. 2016. Differentially private policy evaluation. In ICML 2016.

[Burnetas and Katehakis 1996] Burnetas, A. N., and Katehakis, M. N. 1996. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics 17(2):122–142.

[Catoni 2012] Catoni, O. 2012. Challenging the empirical mean and empirical variance: A deviation study. Annales de l'I.H.P. Probabilités et statistiques 48(4):1148–1185.

[Cesa-Bianchi, Dekel, and Shamir 2013] Cesa-Bianchi, N.; Dekel, O.; and Shamir, O. 2013. Online learning with switching costs and other adaptive adversaries. In NIPS, 1160–1168.

[Chan, Shi, and Song 2010] Chan, T. H.; Shi, E.; and Song, D. 2010. Private and continual release of statistics. In Automata, Languages and Programming. Springer. 405–417.

[David 1968] David, H. 1968. Miscellanea: Gini's mean difference rediscovered. Biometrika 55(3):573–575.

[Dekel et al. 2014] Dekel, O.; Ding, J.; Koren, T.; and Peres, Y. 2014. Bandits with switching costs: T^{2/3} regret. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, STOC '14, 459–467. New York, NY, USA: ACM.

[Dekel, Tewari, and Arora 2012] Dekel, O.; Tewari, A.; and Arora, R. 2012. Online bandit learning against an adaptive adversary: from regret to policy regret. In ICML. icml.cc/Omnipress.

[Dwork and Roth 2013] Dwork, C., and Roth, A. 2013. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9(3–4):211–407.

[Dwork, Rothblum, and Vadhan 2010] Dwork, C.; Rothblum, G. N.; and Vadhan, S. 2010. Boosting and differential privacy. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS '10, 51–60.

[Dwork 2006] Dwork, C. 2006. Differential privacy. In ICALP, 1–12. Springer.

[Gini and Pearson 1912] Gini, C., and Pearson, K. 1912. Variabilità e mutabilità: contributo allo studio delle distribuzioni e delle relazioni statistiche. Fascicolo 1. Tipografia di Paolo Cuppini.

[Hsu and Sabato 2013] Hsu, D., and Sabato, S. 2013. Loss minimization and parameter estimation with heavy tails. arXiv preprint arXiv:1307.1827.

[Jain, Kothari, and Thakurta 2012] Jain, P.; Kothari, P.; and Thakurta, A. 2012. Differentially private online learning. In Mannor, S.; Srebro, N.; and Williamson, R. C., eds., COLT 2012, volume 23, 24.1–24.34.

[Lerasle and Oliveira 2011] Lerasle, M., and Oliveira, R. I. 2011. Robust empirical mean estimators. arXiv preprint arXiv:1112.3914.

[McSherry and Talwar 2007] McSherry, F., and Talwar, K. 2007. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS '07, 94–103. Washington, DC, USA: IEEE Computer Society.

[McSherry 2009] McSherry, F. D. 2009. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, 19–30. New York, NY, USA: ACM.

[Merhav et al. 2002] Merhav, N.; Ordentlich, E.; Seroussi, G.; and Weinberger, M. J. 2002. On sequential strategies for loss functions with memory. IEEE Trans. Information Theory 48(7):1947–1958.

[Mishra and Thakurta 2015] Mishra, N., and Thakurta, A. 2015. (Nearly) optimal differentially private stochastic multi-arm bandits. Proceedings of the 31th UAI.

[Pandey and Olston 2006] Pandey, S., and Olston, C. 2006. Handling advertisements of unknown quality in search advertising. In Schölkopf, B.; Platt, J. C.; and Hoffman, T., eds., Twentieth NIPS, 1065–1072.

[Thakurta and Smith 2013] Thakurta, A. G., and Smith, A. D. 2013. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In NIPS, 2733–2741.

[Thompson 1933] Thompson, W. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3-4):285–294.

[Tossou and Dimitrakakis 2016] Tossou, A. C. Y., and Dimitrakakis, C. 2016. Algorithms for differentially private multi-armed bandits. In AAAI, 2087–2093. AAAI Press.

[Yitzhaki and others 2003] Yitzhaki, S., et al. 2003. Gini's mean difference: A superior measure of variability for non-normal distributions. Metron 61(2):285–316.

[Zhao et al. 2014] Zhao, J.; Jung, T.; Wang, Y.; and Li, X. 2014. Achieving differential privacy of data disclosure in the smart grid. In 2014 IEEE Conference on Computer Communications, INFOCOM 2014, 504–512.