Achieving Privacy in the Adversarial Multi-Armed Bandit

Aristide C. Y. Tossou
Chalmers University of Technology
Gothenburg, Sweden
[email protected]

Christos Dimitrakakis
University of Lille, France
Chalmers University of Technology, Sweden
Harvard University, USA
[email protected]
Abstract

In this paper, we improve the previously best known regret bound to achieve ε-differential privacy in oblivious adversarial bandits from O(T^{2/3}/ε) to O(√T ln T/ε). This is achieved by combining a Laplace Mechanism with EXP3. We show that though EXP3 is already differentially private, it leaks a linear amount of information in T. However, we can improve this privacy by relying on its intrinsic exponential mechanism for selecting actions. This allows us to reach O(√(ln T))-DP, with a regret of O(T^{2/3}) that holds against an adaptive adversary, an improvement from the best known of O(T^{3/4}). This is done by using an algorithm that runs EXP3 in a mini-batch loop. Finally, we run experiments that clearly demonstrate the validity of our theoretical analysis.

Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

We consider multi-armed bandit problems in the adversarial setting, whereby an agent selects one from a number of alternatives (called arms) at each round and receives a gain that depends on its choice. The agent's goal is to maximize its total gain over time. There are two main settings for the bandit problem. In the stochastic one, the gains of each arm are generated i.i.d. by some unknown probability law. In the adversarial setting, which is the focus of this paper, the gains are generated adversarially. We are interested in finding algorithms with a total gain over T rounds not much smaller than that of an oracle with additional knowledge about the problem. In both settings, algorithms that achieve the optimal (problem-independent) regret bound of O(√T) are known (Auer, Cesa-Bianchi, and Fischer 2002; Burnetas and Katehakis 1996; Pandey and Olston 2006; Thompson 1933; Auer et al. 2003; Auer 2002; Agrawal and Goyal 2012).

This problem is a model for many applications where there is a need for trading off exploration and exploitation. This is so because, whenever we make a choice, we only observe the gain generated by that choice, and not the gains that we could have obtained otherwise. An example is clinical trials, where arms correspond to different treatments or tests, and the goal is to maximize the number of cured patients over time while being uncertain about the effects of treatments. Other problems, such as search engine advertisement and movie recommendations, can be formalized similarly (Pandey and Olston 2006).

Privacy can be a serious issue in the bandit setting (cf. (Jain, Kothari, and Thakurta 2012; Thakurta and Smith 2013; Mishra and Thakurta 2015; Zhao et al. 2014)). For example, in clinical trials, we may want to detect and publish results about the best drug without leaking sensitive information, such as the patient's health condition and genome. Differential privacy (Dwork 2006) formally bounds the amount of information that a third party can learn no matter their power or side information.

Differential privacy has been used before in the stochastic setting (Tossou and Dimitrakakis 2016; Mishra and Thakurta 2015; Jain, Kothari, and Thakurta 2012), where the authors obtain optimal algorithms up to logarithmic factors. In the adversarial setting, (Thakurta and Smith 2013) adapts an algorithm called Follow The Approximate Leader to make it private and obtains a regret bound of O(T^{2/3}). In this work, we show that a number of simple algorithms can satisfy privacy guarantees, while achieving nearly optimal regret (up to logarithmic factors) that scales naturally with the level of privacy desired.

Our work is also of independent interest for non-private multi-armed bandit algorithms, as they are competitive with the current state of the art against switching-cost adversaries (where we recover the optimal bound). Finally, we provide rigorous empirical results against a variety of adversaries.

The following section gives the main background and notations. Section 3.1 describes meta-algorithms that perturb the gain sequence to achieve privacy, while Section 3.2 explains how to leverage the privacy inherent in the EXP3 algorithm by modifying the way gains are used. Section 4 compares our algorithms with EXP3 in a variety of settings. The full proofs of all our main results are in the full version.

2 Preliminaries

2.1 The Multi-Armed Bandit problem

Formally, a bandit game is defined between an adversary and an agent as follows: there is a set of K arms A, and at each round t, the agent plays an arm I_t ∈ A. Given the choice I_t, the adversary grants the agent a gain g_{I_t,t} ∈ [0,1]. The agent only observes the gain of arm I_t, and not that of any other arms. The goal of this agent is to maximize its total gain after T rounds, Σ_{t=1}^{T} g_{I_t,t}.
A randomized bandit algorithm Λ : (A × [0,1])^* → D(A) maps every arm-gain history to a distribution over the next arm to take.

The nature of the adversary, and specifically how the gains are generated, determines the nature of the game. For the stochastic adversary (Thompson 1933; Auer, Cesa-Bianchi, and Fischer 2002), the gain obtained at round t is generated i.i.d. from a distribution P_{I_t}. The more general fully oblivious adversary (Audibert and Bubeck 2010) generates the gains independently at round t, but not necessarily identically, from a distribution P_{I_t,t}. Finally, we have the oblivious adversary (Auer et al. 2003), whose only constraint is to generate the gain g_{I_t,t} as a function of the current action I_t only, i.e. ignoring previous actions and gains.

While focusing on oblivious adversaries, we discovered that by targeting differential privacy we can also compete against the stronger m-bounded memory adaptive adversary (Cesa-Bianchi, Dekel, and Shamir 2013; Merhav et al. 2002; Dekel, Tewari, and Arora 2012), who can use up to the last m gains. The oblivious adversary is a special case with m = 0. Another special case of this adversary is the one with switching costs, who penalises the agent whenever he switches arms, by giving the lowest possible gain of 0 (here m = 1).

Regret. Relying on the cumulative gain of an agent to evaluate its performance can be misleading. Indeed, consider the case where an adversary gives a zero gain for all arms at every round. The cumulative gain of the agent would look bad, but no other agent could have done better. This is why one compares the gap between the agent's cumulative gain and the one obtained by some hypothetical agent, called an oracle, with additional information or computational power. This gap is called the regret.

There are also variants of the oracle considered in the literature. The most common variant is the fixed oracle, which always plays the best fixed arm in hindsight. The regret R against this oracle is:

  R = max_{i=1,...,K} Σ_{t=1}^{T} g_{i,t} − Σ_{t=1}^{T} g_{I_t,t}

In practice, we either prove a high-probability bound on R or a bound on the expected value E R with:

  E R = E[ max_{i=1,...,K} Σ_{t=1}^{T} g_{i,t} − Σ_{t=1}^{T} g_{I_t,t} ]

where the expectation is taken with respect to the random choices of both the agent and the adversary. There are other oracles, like the shifting oracle, but those are out of the scope of this paper.

EXP3. The Exponential-weight algorithm for Exploration and Exploitation (EXP3 (Auer et al. 2003)) achieves the optimal bound (up to logarithmic factors) of O(√(TK ln K)) for the weak regret (i.e. the expected regret compared to the fixed oracle) against an oblivious adversary. EXP3 simply maintains an estimate G̃_{i,t} of the cumulative gain of arm i up to round t, with G̃_{i,t} = Σ_{s=1}^{t} (g_{i,s}/p_{i,s}) 1{I_s = i}, where

  p_{i,t} = (1 − γ) exp((γ/K) G̃_{i,t}) / Σ_{j=1}^{K} exp((γ/K) G̃_{j,t}) + γ/K    (2.1)

with γ a well-defined constant. Finally, EXP3 plays one action randomly according to the probability distribution p_t = {p_{1,t}, ..., p_{K,t}}, with p_{i,t} as defined above.

2.2 Differential Privacy

The following definition (from (Tossou and Dimitrakakis 2016)) specifies what is meant when we call a bandit algorithm differentially private at a single round t:

Definition 2.1 (Single round (ε,δ)-differentially private bandit algorithm). A randomized bandit algorithm Λ is (ε,δ)-differentially private at round t, if for all sequences g_{1:t−1} and g′_{1:t−1} that differ in at most one round, we have for any action subset S ⊆ A:

  P_Λ(I_t ∈ S | g_{1:t−1}) ≤ δ + P_Λ(I_t ∈ S | g′_{1:t−1}) e^ε,    (2.2)

where P_Λ denotes the probability distribution specified by the algorithm and g_{1:t−1} = {g_1, ..., g_{t−1}}, with g_s the gains of all arms at round s. When δ = 0, the algorithm is said to be ε-differentially private.

The ε and δ parameters quantify the amount of privacy loss. Lower (ε,δ) indicate higher privacy, and consequently we will also refer to (ε,δ) as the privacy loss. Definition 2.1 means that the output of the bandit algorithm at round t is almost insensitive to any single change in the gains sequence. This implies that whether we remove a single round or replace its gains, the bandit algorithm will still play almost the same action. Assuming the gains at round t are linked to a user's private data (for example his cancer status or the advertisement he clicked), the definition preserves the privacy of that user against any third party looking at the output. This is the case because the choices or the participation of that user would almost not affect the output. Equation (2.2) specifies how much the output is affected by a single user.

We would like Definition 2.1 to hold for all rounds, so as to protect the privacy of all users. If it does for some (ε,δ), then we say the algorithm has per-round or instantaneous privacy loss (ε,δ). Such an algorithm also has a cumulative privacy loss of at most (ε′,δ′), with ε′ = εT and δ′ = δT after T steps. Our goal is to design bandit algorithms such that their cumulative privacy loss (ε′,δ′) is as low as possible while simultaneously achieving a very low regret. In practice, we would like ε′ and the regret to be sub-linear, while δ′ should be a very small quantity. Definition 2.2 formalizes clearly the meaning of this cumulative privacy loss and, for ease of presentation, we will omit the term "cumulative" when referring to it.

Definition 2.2 ((ε,δ)-differentially private bandit algorithm). A randomized bandit algorithm Λ is (ε,δ)-differentially private up to round t, if for all g_{1:t−1} and g′_{1:t−1} that differ in at most one round, we have for any action subset S ⊆ A^t:

  P_Λ(I_{1:t} ∈ S | g_{1:t−1}) ≤ δ + P_Λ(I_{1:t} ∈ S | g′_{1:t−1}) e^ε,    (2.3)

where P_Λ and g are as defined in Definition 2.1.
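As a concrete illustration, the sampling rule of eq. (2.1) together with the importance-weighted update of G̃ can be sketched as follows. This is a minimal sketch, not the authors' code; the `gains` callback is a stand-in for the adversary:

```python
import math
import random

def exp3(T, K, gains, gamma):
    """Minimal EXP3 sketch: exponential weights over the
    importance-weighted gain estimates G~, mixed with uniform
    exploration as in eq. (2.1)."""
    G = [0.0] * K                       # estimated cumulative gains G~_i
    total = 0.0
    for t in range(T):
        m = max(G)                      # subtract max for numerical stability
        w = [math.exp(gamma / K * (G[i] - m)) for i in range(K)]
        s = sum(w)
        p = [(1 - gamma) * w[i] / s + gamma / K for i in range(K)]
        arm = random.choices(range(K), weights=p)[0]
        g = gains(arm, t)               # only the played arm's gain is observed
        total += g
        G[arm] += g / p[arm]            # unbiased importance-weighted update
    return total
```

A standard choice for the constant, also used later by Algorithm 1, is γ = √(K ln K / ((e − 1)T)).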
Most of the time, we will refer to Definition 2.2; whenever we need to use Definition 2.1, this will be made explicit.

The simplest mechanism to achieve differential privacy for a function is to add Laplace noise of scale proportional to its sensitivity. The sensitivity is the maximum amount by which the value of the function can change if we change a single element in the input sequence. For example, if the input is a stream of numbers in [0,1] and the function their sum, we can add Laplace noise of scale 1/ε to each number and achieve ε-differential privacy with an error of O(√T/ε) in the sum. However, (Chan, Shi, and Song 2010) introduced the Hybrid Mechanism, which achieves ε-differential privacy with only poly-logarithmic error (with respect to the true sum). The idea is to group the stream of numbers in a binary tree and only add Laplace noise at the nodes of the tree.

As demonstrated above, the main challenge with differential privacy is thus to optimally trade off privacy and utility.

Notation. In this paper, i will be used as an index for an arbitrary arm in [1, K], while k will be used to indicate the optimal arm. I_t is the arm played by an agent at round t, and g_{i,t} is the gain of the i-th arm at round t. R_Λ(T) is the regret of algorithm Λ after T rounds; the index and T are dropped when clear from the context. Unless otherwise specified, the regret is defined for oblivious adversaries against the fixed oracle. We use "x ∼ P" to denote that x is generated from distribution P. Lap(λ) denotes the Laplace distribution with scale λ, while Bern(p) denotes the Bernoulli distribution with parameter p.

3 Algorithms and Analysis

3.1 DP-Λ-Lap: Differential privacy through additional noise

We start by showing that the obvious technique to achieve a given ε-differential privacy in adversarial bandits already beats the state of the art. The main idea is to use any base bandit algorithm Λ as input and add Laplace noise of scale 1/ε to each gain before Λ observes it. This technique gives ε-differential privacy, as the gains are bounded in [0,1] and the noises are added i.i.d. at each round.

However, bandit algorithms require bounded gains, while the noisy gains are not bounded. The trick is to ignore rounds where the noisy gain falls outside an interval of the form [−b, b+1]. We pick the threshold b such that, with high probability, the noisy gains will be inside the interval [−b, b+1]. More precisely, b can be chosen such that, with high probability, the number of rounds ignored is lower than the upper bound R_Λ on the regret of Λ. Given that in the standard bandit problem the gains are bounded in [0,1], the gains at accepted rounds are rescaled back to [0,1].

Theorem 3.2 shows that all these operations still preserve ε-DP, while Theorem 3.1 demonstrates that the upper bound on the expected regret of DP-Λ-Lap adds some small additional terms to R_Λ. To illustrate how small those additional terms are, we instantiate DP-Λ-Lap with the EXP3 algorithm. This leads to a mechanism called DP-EXP3-Lap, described in Algorithm 1. With a carefully chosen threshold b, Corollary 3.1 implies that the additional terms are such that the expected regret of DP-EXP3-Lap is O(√T ln T/ε), which is optimal in T up to some logarithmic factors. This result is a significant improvement over the best known bound so far of O(T^{2/3}/ε) from (Thakurta and Smith 2013) and simultaneously solves the challenge (whether or not one can get an ε-DP mechanism with optimal regret) posed by the authors.

Algorithm 1 DP-EXP3-Lap
  Let G̃_i = 0 for all arms, b = ln(T)/ε and γ = √(K ln K / ((e − 1)T))
  for each round t = 1, ..., T do
    Compute the probability distribution p over the arms, with p = (p_{1,t}, ..., p_{K,t}) and p_{i,t} as in eq. (2.1).
    Draw an arm I_t from the probability distribution p.
    Receive the reward g_{I_t,t}.
    Let the noisy gain be g′_{I_t,t} = g_{I_t,t} + N_{I_t,t}, with N_{I_t,t} ∼ Lap(1/ε).
    if g′_{I_t,t} ∈ [−b, b+1] then
      Scale g′_{I_t,t} to [0,1].
      Update the estimated cumulative gain of arm I_t: G̃_{I_t} = G̃_{I_t} + g′_{I_t,t}/p_{I_t,t}
    end if
  end for

Theorem 3.1. If DP-Λ-Lap is run with input a base bandit algorithm Λ, the noisy reward g′_{I_t,t} of the true reward g_{I_t,t} set to g′_{I_t,t} = g_{I_t,t} + N_{I_t,t} with N_{I_t,t} ∼ Lap(1/ε), the acceptance interval set to [−b, b+1], and the scaling of the rewards g′_{I_t,t} outside [0,1] done using g′_{I_t,t} = (g′_{I_t,t} + b)/(2b + 1); then the regret R_{DP-Λ-Lap} of DP-Λ-Lap satisfies:

  E R_{DP-Λ-Lap} ≤ E R_Λ^{scaled} + 2TK exp(−εb) + √(32T)/ε    (3.1)

where R_Λ^{scaled} is the upper bound on the regret of Λ when the rewards are scaled from [−b, b+1] to [0,1].

Proof Sketch. We observe that DP-Λ-Lap is an instance of Λ run with the noisy rewards g′ instead of g. This means R_Λ^{scaled} is an upper bound on the regret L on g′. Then, we derive a lower bound on L showing how close it is to R_{DP-Λ-Lap}. This allows us to conclude.

Corollary 3.1. If DP-Λ-Lap is run with EXP3 as its base algorithm and b = ln(T)/ε, then its expected regret E R_{DP-EXP3-Lap} satisfies

  E R_{DP-EXP3-Lap} ≤ (4 ln T/ε) √((e − 1)TK ln K) + 2K + √(32T)/ε

Proof. The proof comes by combining the regret of EXP3 (Auer et al. 2003) with Theorem 3.1.
Theorem 3.2. DP-Λ-Lap is ε-differentially private up to round T.

Proof Sketch. Combining the privacy of the Laplace Mechanism with the parallel composition (McSherry 2009) and post-processing theorems (Dwork and Roth 2013) concludes the proof.

3.2 Leveraging the inherent privacy of EXP3

On the differential privacy of EXP3. (Dwork and Roth 2013) shows that a variation of EXP3 for the full-information setting (where the agent observes the gain of all arms at any round, regardless of what he played) is already differentially private. Their results imply that one can achieve the optimal regret with only a sub-logarithmic privacy loss (O(√(128 log T))) after T rounds.

We start this section by showing a similar result for EXP3 in Theorem 3.3. Indeed, we show that EXP3 is already differentially private, but with a per-round privacy loss of 2 (assuming we want a sub-linear regret; see Theorem 3.3). Our results imply that EXP3 can achieve the optimal regret, albeit with a linear privacy loss of O(2T)-DP after T rounds. This is a huge gap compared with the full-information setting, and it underlines the significance of our result in Section 3.1, where we describe a concrete algorithm demonstrating that the optimal regret can be achieved with only a logarithmic privacy loss after T rounds.

Theorem 3.3. The EXP3 algorithm is

  min{ 2T,  T ln((K(1 − γ) + γ)/γ),  2(1 − γ)T + 2√(2T ln T) }

differentially private up to round T.

In practice, we also want EXP3 to have a sub-linear regret. This implies that γ ≪ 1, and EXP3 is then simply 2T-DP over T rounds.

Proof Sketch. The first two terms in the theorem come from the observation that EXP3 is a combination of two mechanisms: the Exponential Mechanism (McSherry and Talwar 2007) and a randomized response. The last term comes from the observation that with probability γ we enjoy a perfect 0-DP. Then, we use Chernoff to bound with high probability the number of times we suffer a non-zero privacy loss.

We will now show that the privacy of EXP3 itself may be improved without any additional noise, and with only a moderate impact on the regret.

On the privacy of an EXP3 wrapper algorithm. The previous paragraph leads to the conclusion that it is impossible to obtain a sub-linear privacy loss with a sub-linear regret while using the original EXP3. Here, we will prove that an existing technique already achieves this goal. The algorithm, which we call EXP3τ, is from (Dekel, Tewari, and Arora 2012). It groups the rounds into disjoint intervals of fixed size τ, where the j-th interval starts on round (j − 1)τ + 1 and ends on round jτ. At the beginning of interval j, EXP3τ receives an action from EXP3 and plays it for τ rounds. During that time, EXP3 does not observe any feedback. At the end of the interval, EXP3τ feeds EXP3 with a single gain, the average gain received during the interval.

Theorem 3.4, borrowed from (Dekel, Tewari, and Arora 2012), specifies the upper bound on the regret of EXP3τ. It is remarkable that this bound holds against the m-memory bounded adaptive adversary. While Theorem 3.5 shows the privacy loss enjoyed by this algorithm, one gets a better intuition of how good those results are from Corollaries 3.2 and 3.3. Indeed, we can observe that EXP3τ achieves a sub-logarithmic privacy loss of O(√(ln T)) with a regret of O(T^{2/3}) against a special case of the m-memory bounded adaptive adversary called the switching costs adversary, for which m = 1. This is the optimal regret bound (in the sense that there is a matching lower bound (Dekel et al. 2014)). This means that in some sense we are getting privacy for free against this adversary.

Theorem 3.4 (Regret of EXP3τ (Dekel, Tewari, and Arora 2012)). The expected regret of EXP3τ is upper bounded by:

  √(7TτK ln K) + Tm/τ + τ

against the m-memory bounded adaptive adversary, for any m < τ.

Theorem 3.5 (Privacy loss of EXP3τ). EXP3τ is (4T/τ³ + √(8 ln(1/δ′) T/τ³), δ′)-DP up to round T.

Proof. The sensitivity of each gain is now 1/τ, as we are using the average. Combined with Theorem 3.3, this means the per-round privacy loss is 2/τ. Given that EXP3 only observes T/τ rounds, using the advanced composition theorem (Dwork, Rothblum, and Vadhan 2010) (Theorem III.3) concludes the final privacy loss over T rounds.

Corollary 3.2. EXP3τ run with τ = (7K log K)^{−1/3} T^{1/3} is (ε, δ′)-differentially private up to round T, with δ′ = T^{−2} and ε = 28K ln K + √(112K ln K ln T). Its expected regret against the switching costs adversary is upper bounded by 2(7K ln K)^{1/3} T^{2/3} + (7K log K)^{−1/3} T^{1/3}.

Proof. The proof is immediate by replacing τ and δ′ in Theorems 3.4 and 3.5, and by the fact that for the switching costs adversary m = 1.

Corollary 3.3. EXP3τ run with τ = ((4Tε + 2T ln(1/δ))/ε²)^{1/3} is (ε, δ)-differentially private, and its expected regret against the switching costs adversary is upper bounded by:

  O( T^{2/3} √(K ln K) (√(ln(1/δ))/ε)^{1/3} )

4 Experiments

We tested DP-EXP3-Lap and EXP3τ, together with the non-private EXP3, against a few different adversaries. The privacy parameter ε of DP-EXP3-Lap is set as defined in Corollary 3.2. This is done so that the regret of DP-EXP3-Lap and EXP3τ are compared at the same privacy level.
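Before turning to the empirical results, the EXP3τ wrapper of Section 3.2 can be sketched as follows (a minimal sketch with a stand-in `reward` callback; the inner learner is the EXP3 rule of eq. (2.1), fed only the per-interval average gain):

```python
import math
import random

def exp3_tau(T, K, reward, tau):
    """EXP3_tau sketch: commit to one EXP3 action for tau
    consecutive rounds, then update EXP3 with the *average*
    gain of the interval. Averaging drops the sensitivity of
    each fed-back gain to 1/tau, which is what improves privacy."""
    n = T // tau                        # number of EXP3 interactions
    gamma = math.sqrt(K * math.log(K) / ((math.e - 1) * n))
    G = [0.0] * K
    total = 0.0
    for j in range(n):
        m = max(G)
        w = [math.exp(gamma / K * (G[i] - m)) for i in range(K)]
        s = sum(w)
        p = [(1 - gamma) * w[i] / s + gamma / K for i in range(K)]
        arm = random.choices(range(K), weights=p)[0]
        batch = [reward(arm, j * tau + k) for k in range(tau)]
        total += sum(batch)
        G[arm] += (sum(batch) / tau) / p[arm]   # single averaged feedback
    return total
```

Since the wrapped EXP3 interacts only T/τ times with gains of sensitivity 1/τ, the privacy loss of Theorem 3.5 follows by advanced composition.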
All the other parameters of DP-EXP3-Lap are taken as defined in Corollary 3.1, while the parameters of EXP3τ are taken as defined in Corollary 3.2.

For all experiments, the horizon is T = 2^{18} and the number of arms is K = 4. We performed 720 independent trials and reported the median-of-means estimator (used heavily in the streaming literature (Alon, Matias, and Szegedy 1996)) of the cumulative regret. It partitions the trials into a₀ equal groups and returns the median of the sample means of each group. Proposition 4.1 is a well-known result (also in (Hsu and Sabato 2013; Lerasle and Oliveira 2011)) giving the accuracy of this estimator. Its convergence is O(σ/√N), with exponential probability tails, even though the random variable x may have heavy tails. In comparison, the empirical mean cannot provide such a guarantee for any σ > 0 and confidence in [0, 1/(2e)] (Catoni 2012).

Proposition 4.1. Let x be a random variable with mean µ and variance σ² < ∞. Assume that we have N independent samples of x, and let µ̂ be the median-of-means computed using a₀ groups. With probability at least 1 − e^{−a₀/4.5}, µ̂ satisfies |µ̂ − µ| ≤ σ√(6a₀/N).

We set the number of groups to a₀ = 24, so that the confidence interval holds with probability at least 0.995.

We also reported the deviation of each algorithm using Gini's Mean Difference (GMD hereafter) (Gini and Pearson 1912). GMD computes the deviation as Σ_{j=1}^{N} (2j − N − 1) x_{(j)}, with x_{(j)} the j-th order statistic of the sample (that is, x_{(1)} ≤ x_{(2)} ≤ ... ≤ x_{(N)}). As shown in (Yitzhaki and others 2003; David 1968), the GMD provides a superior approximation of the true deviation than the standard one. To account for the fact that the cumulative regret of our algorithms might not follow a symmetric distribution, we computed the GMD separately for the values above and below the median-of-means.

At round t, we computed the cumulative regret against the fixed oracle who plays the best arm assuming that the end of the game is at t. The oracle uses the actual sequence of gains to decide its best arm. For a given trial, we make sure that all algorithms are playing the same game by generating the gains for all possible round-arm pairs before the game starts.

Deterministic adversary. As shown by (Audibert and Bubeck 2010), the expected regret of any agent against an oblivious adversary cannot be worse than that against the worst-case deterministic adversary. In this experiment, arm 2 is the best and gives 1 for every even round. To trick the players into picking the wrong arms, the first arm always gives 0.38, whereas the third gives 1 for every round multiple of 3. The remaining arms always give 0. As shown by the figure, this simple adversary is already powerful enough to make the algorithms attain their upper bound.

Stochastic adversary. This adversary draws the gains of the first arm i.i.d. from Bern(0.55), whereas all other gains are drawn i.i.d. from Bern(0.5).

Fully oblivious adversary. For the best arm k, it first draws a number p uniformly in [0.5, 0.5 + 2ε] and generates the gain g_{k,t} ∼ Bern(p). For all other arms, p is drawn from [0.5 − ε, 0.5 + ε]. This process is repeated at every round. In our experiments, ε = 0.05.

An oblivious adversary. This adversary is identical to the fully oblivious one for every round multiple of 200. Between two multiples of 200, the last gain of the arm is given.

The switching costs adversary. This adversary (defined at Figure 1 in (Dekel et al. 2014)) defines a stochastic process (including the simple Gaussian random walk as a special case) for generating the gains. It was used to prove that any algorithm against this adversary must incur a regret of O(T^{2/3}).

Discussion. Figure 1 shows our results against a variety of adversaries, with respect to a fixed oracle. Overall, the performance (in terms of regret) of DP-EXP3-Lap is very competitive against that of EXP3, while providing significantly better privacy. This means that DP-EXP3-Lap allows us to get privacy for free in the bandit setting against an adversary not more powerful than the oblivious one.

The performance of EXP3τ is worse than that of DP-EXP3-Lap against an oblivious adversary or one less powerful. However, the situation is completely reversed against the more powerful switching costs adversary. In that setting, EXP3τ outperforms both EXP3 and DP-EXP3-Lap, confirming the theoretical analysis. We can see EXP3τ as the algorithm providing us privacy for free against the switching costs adversary, and the adaptive m-bounded memory one in general.

5 Conclusion

We have provided the first results on differentially private adversarial multi-armed bandits, which are optimal up to logarithmic factors. One open question is how differential privacy affects regret in the full reinforcement learning problem. At this point in time, the only known results in the MDP setting obtain differentially private algorithms for Monte Carlo policy evaluation (Balle, Gomrokchi, and Precup 2016). While this implies that it is possible to obtain policy iteration algorithms, it is unclear how to extend this to the full online reinforcement learning problem.

Acknowledgements. This research was supported by the SNSF grants "Adaptive control with approximate Bayesian computation and differential privacy" and "Swiss Sense Synergy", by the Marie Curie Actions (REA 608743), the Future of Life Institute "Mechanism Design for AI Architectures" and the CNRS Specific Action on Security.
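For completeness, the two reporting estimators of Section 4 — the median-of-means of Proposition 4.1 and Gini's Mean Difference — can be sketched as follows (a minimal sketch, not the authors' code):

```python
import statistics

def median_of_means(xs, a0):
    """Partition the N samples into a0 equal groups and return
    the median of the group means (Proposition 4.1)."""
    n = len(xs) // a0
    means = [sum(xs[i * n:(i + 1) * n]) / n for i in range(a0)]
    return statistics.median(means)

def gini_mean_difference(xs):
    """Deviation as sum_j (2j - N - 1) x_(j) over the order
    statistics x_(1) <= ... <= x_(N), as used in Section 4.
    (The classical mean-difference normalizes this raw sum by
    the N(N-1)/2 pairs.)"""
    s = sorted(xs)
    N = len(s)
    return sum((2 * (j + 1) - N - 1) * s[j] for j in range(N))
```

As in the paper's setup, one would call `median_of_means` on the 720 per-trial cumulative regrets with a0 = 24 groups, and `gini_mean_difference` separately on the values above and below that estimate.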
[Figure 1: Regret and error bar against five different adversaries, with respect to the fixed oracle. Panels: (a) Deterministic, (b) Stochastic, (c) Fully Oblivious, (d) Oblivious, (e) Switching costs. Each panel plots the cumulative regret of EXP3, DP-EXP3-Lap and EXP3τ against the time step.]
References

[Agrawal and Goyal 2012] Agrawal, S., and Goyal, N. 2012. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012.

[Alon, Matias, and Szegedy 1996] Alon, N.; Matias, Y.; and Szegedy, M. 1996. The space complexity of approximating the frequency moments. In 28th STOC, 20–29. ACM.

[Audibert and Bubeck 2010] Audibert, J.-Y., and Bubeck, S. 2010. Regret bounds and minimax policies under partial monitoring. J. Mach. Learn. Res. 11:2785–2836.

[Auer et al. 2003] Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2003. The nonstochastic multiarmed bandit problem. SIAM J. Comput. 32(1):48–77.

[Auer, Cesa-Bianchi, and Fischer 2002] Auer, P.; Cesa-Bianchi, N.; and Fischer, P. 2002. Finite time analysis of the multiarmed bandit problem. Machine Learning 47(2/3):235–256.

[Auer 2002] Auer, P. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3:397–422.

[Balle, Gomrokchi, and Precup 2016] Balle, B.; Gomrokchi, M.; and Precup, D. 2016. Differentially private policy evaluation. In ICML 2016.

[Burnetas and Katehakis 1996] Burnetas, A. N., and Katehakis, M. N. 1996. Optimal adaptive policies for sequential allocation problems. Advances in Applied Mathematics 17(2):122–142.

[Catoni 2012] Catoni, O. 2012. Challenging the empirical mean and empirical variance: A deviation study. Annales de l'I.H.P. Probabilités et statistiques 48(4):1148–1185.

[Cesa-Bianchi, Dekel, and Shamir 2013] Cesa-Bianchi, N.; Dekel, O.; and Shamir, O. 2013. Online learning with switching costs and other adaptive adversaries. In NIPS, 1160–1168.

[Chan, Shi, and Song 2010] Chan, T. H.; Shi, E.; and Song, D. 2010. Private and continual release of statistics. In Automata, Languages and Programming. Springer. 405–417.

[David 1968] David, H. 1968. Miscellanea: Gini's mean difference rediscovered. Biometrika 55(3):573–575.

[Dekel et al. 2014] Dekel, O.; Ding, J.; Koren, T.; and Peres, Y. 2014. Bandits with switching costs: T^{2/3} regret. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, STOC '14, 459–467. New York, NY, USA: ACM.

[Dekel, Tewari, and Arora 2012] Dekel, O.; Tewari, A.; and Arora, R. 2012. Online bandit learning against an adaptive adversary: from regret to policy regret. In ICML. icml.cc/Omnipress.

[Dwork and Roth 2013] Dwork, C., and Roth, A. 2013. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9(3–4):211–407.

[Dwork, Rothblum, and Vadhan 2010] Dwork, C.; Rothblum, G. N.; and Vadhan, S. 2010. Boosting and differential privacy. In Proceedings of the 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, FOCS '10, 51–60.

[Dwork 2006] Dwork, C. 2006. Differential privacy. In ICALP, 1–12. Springer.

[Gini and Pearson 1912] Gini, C., and Pearson, K. 1912. Variabilità e mutabilità: contributo allo studio delle distribuzioni e delle relazioni statistiche. Fascicolo 1. Tipografia di Paolo Cuppini.

[Hsu and Sabato 2013] Hsu, D., and Sabato, S. 2013. Loss minimization and parameter estimation with heavy tails. arXiv preprint arXiv:1307.1827.

[Jain, Kothari, and Thakurta 2012] Jain, P.; Kothari, P.; and Thakurta, A. 2012. Differentially private online learning. In Mannor, S.; Srebro, N.; and Williamson, R. C., eds., COLT 2012, volume 23, 24.1–24.34.

[Lerasle and Oliveira 2011] Lerasle, M., and Oliveira, R. I. 2011. Robust empirical mean estimators. arXiv preprint arXiv:1112.3914.

[McSherry and Talwar 2007] McSherry, F., and Talwar, K. 2007. Mechanism design via differential privacy. In Proceedings of the 48th Annual IEEE Symposium on Foundations of Computer Science, FOCS '07, 94–103. Washington, DC, USA: IEEE Computer Society.

[McSherry 2009] McSherry, F. D. 2009. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, 19–30. New York, NY, USA: ACM.

[Merhav et al. 2002] Merhav, N.; Ordentlich, E.; Seroussi, G.; and Weinberger, M. J. 2002. On sequential strategies for loss functions with memory. IEEE Trans. Information Theory 48(7):1947–1958.

[Mishra and Thakurta 2015] Mishra, N., and Thakurta, A. 2015. (Nearly) optimal differentially private stochastic multi-arm bandits. Proceedings of the 31th UAI.

[Pandey and Olston 2006] Pandey, S., and Olston, C. 2006. Handling advertisements of unknown quality in search advertising. In Schölkopf, B.; Platt, J. C.; and Hoffman, T., eds., Twentieth NIPS, 1065–1072.

[Thakurta and Smith 2013] Thakurta, A. G., and Smith, A. D. 2013. (Nearly) optimal algorithms for private online learning in full-information and bandit settings. In NIPS, 2733–2741.

[Thompson 1933] Thompson, W. 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25(3-4):285–294.

[Tossou and Dimitrakakis 2016] Tossou, A. C. Y., and Dimitrakakis, C. 2016. Algorithms for differentially private multi-armed bandits. In AAAI, 2087–2093. AAAI Press.

[Yitzhaki and others 2003] Yitzhaki, S., et al. 2003. Gini's mean difference: A superior measure of variability for non-normal distributions. Metron 61(2):285–316.

[Zhao et al. 2014] Zhao, J.; Jung, T.; Wang, Y.; and Li, X. 2014. Achieving differential privacy of data disclosure in the smart grid. In 2014 IEEE Conference on Computer Communications, INFOCOM 2014, 504–512.