An Online Actor Critic Algorithm and a Statistical
Decision Procedure for Personalizing Intervention
by
Huitian Lei
A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
(Statistics)
in the University of Michigan
2016
Doctoral Committee:
Professor Susan A. Murphy, co-Chair
Assistant Professor Ambuj Tewari, co-Chair
Associate Professor Lu Wang
Assistant Professor Shuheng Zhou
© Huitian Lei
2016
Dedication
To my mother
TABLE OF CONTENTS
Dedication . . . ii
List of Figures . . . v
List of Tables . . . vi
Abstract . . . x
Chapter
1 Introduction . . . 1
1.1 A Review on Adaptive Intervention and Just-in-time Adaptive Intervention . . . 3
1.2 A Review on Bandit and Contextual Bandit Algorithms . . . 5
2 Online Learning of Optimal Policy: Formulation, Algorithm and Theory . . . 10
2.1 Problem Formulation . . . 10
2.1.1 Modeling the Decision Making Problem as a Contextual Bandit Problem . . . 10
2.1.2 The Regularized Average Reward . . . 13
2.2 An Online Actor Critic Algorithm . . . 20
2.2.1 The Critic with a Linear Function Approximation . . . 21
2.2.2 The Actor and the Actor Critic Algorithm . . . 22
2.3 Asymptotic Theory of the Actor Critic Algorithm . . . 23
2.4 Small Sample Variance Estimation and Bootstrap Confidence Intervals . . . 28
2.4.1 Plug-in Variance Estimation and Wald Confidence Intervals . . . 29
2.4.2 Bootstrap Confidence Intervals . . . 35
2.5 Appendix . . . 37
3 Numerical Experiments . . . 43
3.1 I.I.D. Contexts . . . 47
3.2 AR(1) Context . . . 49
3.3 Context is Influenced by Previous Actions . . . 52
3.3.1 Learning Effect . . . 52
3.3.2 Burden Effect . . . 59
3.4 Appendix . . . 67
3.4.1 Learning Effect: Actor Critic Algorithm Uses λ∗ . . . 67
3.4.2 Learning Effect with Correlated S_2 and S_3: Actor Critic Algorithm Uses λ∗ . . . 69
3.4.3 Burden Effect: Actor Critic Algorithm Uses λ∗ . . . 70
4 A Multiple Decision Procedure for Personalizing Intervention . . . 73
4.1 Literature Review . . . 75
4.1.1 The Test of Qualitative Interaction . . . 75
4.1.2 Multiple Hypothesis Testing, Multiple Decision Theory . . . 77
4.2 The Decision Procedure and Controlling the Error Probabilities . . . 81
4.2.1 Notation and Assumptions . . . 81
4.2.2 The Decision Space . . . 81
4.2.3 Test Statistics . . . 82
4.2.4 The Two-stage Decision Procedure . . . 83
4.2.5 The Loss Function and Error Probabilities . . . 84
4.3 Choosing the Critical Values c_0 and c_1 . . . 85
4.4 Comparing with Alternative Methods . . . 86
Bibliography . . . 89
LIST OF FIGURES
2.1 Plug-in variance estimation as a function of µ̂_2 and µ̂_3: the x axis represents µ̂_{t,2}, the y axis represents µ̂_{t,3}, and the z axis represents the plug-in asymptotic variance of θ̂_0 with λ = 0.1 . . . 31
2.2 Wald confidence interval coverage for 1000 simulated datasets as a function of µ̂_3 and µ̂_2 at sample size 100 . . . 34
2.3 Wald confidence interval coverage in 1000 simulated datasets as a function of µ̂_3 and µ̂_2 at sample size 500 . . . 34
2.4 Histograms of the normalized distance √T(θ̂_i − θ∗_i)/√V̂_i for i = 0, 1 at sample size 100 . . . 35
3.1 Relative MSE vs AR coefficient η at sample size 200. Relative MSE is relative to the MSE at η = 0 . . . 51
3.2 Relative MSE vs AR coefficient η at sample size 500. Relative MSE is relative to the MSE at η = 0 . . . 52
3.3 Learning effect: box plots of regularized average cost at different levels of learning effect. Sample size is 200 . . . 57
3.4 Learning effect: box plots of regularized average cost at different levels of learning effect. Sample size is 500 . . . 57
3.5 Burden effect: box plots of regularized average cost at different levels of the burden effect at sample size 200 . . . 65
3.6 Burden effect: box plots of regularized average cost at different levels of the burden effect at sample size 500 . . . 65
LIST OF TABLES
2.1 Underestimation of the plug-in variance estimator and the Wald confidence intervals. The theoretical Wald CI is created based on the true asymptotic variance . . . 32
3.1 I.I.D. contexts: bias in estimating the optimal policy parameter. Bias = E(θ̂_t) − θ∗ . . . 47
3.2 I.I.D. contexts: MSE in estimating the optimal policy parameter . . . 47
3.3 I.I.D. contexts: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter . . . 48
3.4 I.I.D. contexts: coverage rates of Efron-type bootstrap confidence intervals for the optimal policy parameter. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 48
3.5 I.I.D. contexts with a lenient stochasticity constraint: bias in estimating the optimal policy parameter. Bias = E(θ̂_t) − θ∗ . . . 49
3.6 I.I.D. contexts with a lenient stochasticity constraint: MSE in estimating the optimal policy parameter . . . 49
3.7 I.I.D. contexts with a lenient stochasticity constraint: coverage rates of percentile-t bootstrap confidence intervals. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 49
3.8 AR(1) contexts: bias in estimating the optimal policy parameter. Bias = E(θ̂_t) − θ∗ . . . 50
3.9 AR(1) contexts: MSE in estimating the optimal policy parameter . . . 50
3.10 AR(1) contexts: coverage rates of percentile-t bootstrap confidence intervals. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 50
3.11 Learning effect: the optimal policy and the oracle lambda . . . 53
3.12 Learning effect: bias in estimating the optimal policy parameter while estimating λ online at sample size 200. Bias = E(θ̂_t) − θ∗ . . . 55
3.13 Learning effect: MSE in estimating the optimal policy parameter while estimating λ online at sample size 200 . . . 55
3.14 Learning effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 200. λ is estimated online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 55
3.15 Learning effect: bias in estimating the optimal policy parameter while estimating λ online at sample size 500. Bias = E(θ̂_t) − θ∗ . . . 55
3.16 Learning effect: MSE in estimating the optimal policy parameter while estimating λ online at sample size 500 . . . 56
3.17 Learning effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 500. λ is estimated online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 56
3.18 Learning effect: the myopic equilibrium policy . . . 58
3.19 Learning effect: bias in estimating the myopic equilibrium policy at sample size 200. Bias = E(θ̂_t) − θ∗∗ . . . 59
3.20 Learning effect: MSE in estimating the myopic equilibrium policy at sample size 200 . . . 59
3.21 Learning effect: bias in estimating the myopic equilibrium policy at sample size 500. Bias = E(θ̂_t) − θ∗∗ . . . 59
3.22 Learning effect: MSE in estimating the myopic equilibrium policy at sample size 500 . . . 59
3.23 Burden effect: the optimal policy and the oracle lambda . . . 61
3.24 Burden effect: bias in estimating the optimal policy parameter while estimating λ online at sample size 200. Bias = E(θ̂_t) − θ∗ . . . 62
3.25 Burden effect: MSE in estimating the optimal policy parameter while estimating λ online at sample size 200 . . . 62
3.26 Burden effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 200. λ is estimated online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 62
3.27 Burden effect: bias in estimating the optimal policy parameter while estimating λ online at sample size 500. Bias = E(θ̂_t) − θ∗ . . . 63
3.28 Burden effect: MSE in estimating the optimal policy parameter while estimating λ online at sample size 500 . . . 63
3.29 Burden effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 500. λ is estimated online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 63
3.30 Burden effect: the myopic equilibrium policy . . . 66
3.31 Burden effect: bias in estimating the myopic equilibrium policy at sample size 200. Bias = E(θ̂_t) − θ∗∗ . . . 66
3.32 Burden effect: MSE in estimating the myopic equilibrium policy at sample size 200 . . . 66
3.33 Burden effect: bias in estimating the myopic equilibrium policy at sample size 500. Bias = E(θ̂_t) − θ∗∗ . . . 67
3.34 Burden effect: MSE in estimating the myopic equilibrium policy at sample size 500 . . . 67
3.35 Learning effect: bias in estimating the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online. Bias = E(θ̂_t) − θ∗ . . . 67
3.36 Learning effect: MSE in estimating the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online . . . 68
3.37 Learning effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 68
3.38 Learning effect: bias in estimating the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online. Bias = E(θ̂_t) − θ∗ . . . 68
3.39 Learning effect: MSE in estimating the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online . . . 68
3.40 Learning effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 69
3.41 Learning effect with correlated S_2 and S_3: bias in estimating the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online. Bias = E(θ̂_t) − θ∗ . . . 69
3.42 Learning effect with correlated S_2 and S_3: MSE in estimating the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online . . . 69
3.43 Learning effect with correlated S_2 and S_3: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 69
3.44 Learning effect with correlated S_2 and S_3: bias in estimating the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online. Bias = E(θ̂_t) − θ∗ . . . 70
3.45 Learning effect with correlated S_2 and S_3: MSE in estimating the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online . . . 70
3.46 Learning effect with correlated S_2 and S_3: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 70
3.47 Burden effect: bias in estimating the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online. Bias = E(θ̂_t) − θ∗ . . . 70
3.48 Burden effect: MSE in estimating the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online . . . 71
3.49 Burden effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 71
3.50 Burden effect: bias in estimating the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online. Bias = E(θ̂_t) − θ∗ . . . 71
3.51 Burden effect: MSE in estimating the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online . . . 72
3.52 Burden effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 72
4.1 The decision space D . . . 82
4.2 The decision rule for the two-stage decision procedure for personalizing treatment . . . 84
4.3 The loss function . . . 85
4.4 The critical values c_0 and c_1 at α = 0.05 . . . 86