An Online Actor Critic Algorithm and a Statistical
Decision Procedure for Personalizing Intervention
by
Huitian Lei
A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
(Statistics)
in the University of Michigan
2016
Doctoral Committee:
Professor Susan A. Murphy, co-Chair
Assistant Professor Ambuj Tewari, co-Chair
Associate Professor Lu Wang
Assistant Professor Shuheng Zhou
© Huitian Lei
2016
Dedication
To my mother
TABLE OF CONTENTS
Dedication . . . ii
List of Figures . . . v
List of Tables . . . vi
Abstract . . . x
Chapter
1 Introduction . . . 1
1.1 A Review on Adaptive Intervention and Just-in-time Adaptive Intervention . . . 3
1.2 A Review on Bandit and Contextual Bandit Algorithms . . . 5
2 Online Learning of Optimal Policy: Formulation, Algorithm and Theory . . . 10
2.1 Problem Formulation . . . 10
2.1.1 Modeling the Decision Making Problem as a Contextual Bandit Problem . . . 10
2.1.2 The Regularized Average Reward . . . 13
2.2 An Online Actor Critic Algorithm . . . 20
2.2.1 The Critic with a Linear Function Approximation . . . 21
2.2.2 The Actor and the Actor Critic Algorithm . . . 22
2.3 Asymptotic Theory of the Actor Critic Algorithm . . . 23
2.4 Small Sample Variance Estimation and Bootstrap Confidence Intervals . . . 28
2.4.1 Plug-in Variance Estimation and Wald Confidence Intervals . . . 29
2.4.2 Bootstrap Confidence Intervals . . . 35
2.5 Appendix . . . 37
3 Numerical Experiments . . . 43
3.1 I.I.D. Contexts . . . 47
3.2 AR(1) Context . . . 49
3.3 Context is Influenced by Previous Actions . . . 52
3.3.1 Learning Effect . . . 52
3.3.2 Burden Effect . . . 59
3.4 Appendix . . . 67
3.4.1 Learning Effect: Actor Critic Algorithm Uses λ∗ . . . 67
3.4.2 Learning Effect with Correlated S_2 and S_3: Actor Critic Algorithm Uses λ∗ . . . 69
3.4.3 Burden Effect: Actor Critic Algorithm Uses λ∗ . . . 70
4 A Multiple Decision Procedure for Personalizing Intervention . . . 73
4.1 Literature Review . . . 75
4.1.1 The Test of Qualitative Interaction . . . 75
4.1.2 Multiple Hypothesis Testing, Multiple Decision Theory . . . 77
4.2 The Decision Procedure and Controlling the Error Probabilities . . . 81
4.2.1 Notation and Assumptions . . . 81
4.2.2 The Decision Space . . . 81
4.2.3 Test Statistics . . . 82
4.2.4 The Two-stage Decision Procedure . . . 83
4.2.5 The Loss Function and Error Probabilities . . . 84
4.3 Choosing the Critical Values c_0 and c_1 . . . 85
4.4 Comparing with Alternative Methods . . . 86
Bibliography . . . 89
LIST OF FIGURES
2.1 Plug-in variance estimation as a function of µ̂_2 and µ̂_3: the x axis represents µ̂_{t,2}, the y axis represents µ̂_{t,3}, and the z axis represents the plug-in asymptotic variance of θ̂_0 with λ = 0.1 . . . 31
2.2 Wald confidence interval coverage for 1000 simulated datasets as a function of µ̂_3 and µ̂_2 at sample size 100 . . . 34
2.3 Wald confidence interval coverage in 1000 simulated datasets as a function of µ̂_3 and µ̂_2 at sample size 500 . . . 34
2.4 Histograms of the normalized distance √T(θ̂_i − θ∗_i)/√V̂_i for i = 0, 1 at sample size 100 . . . 35
3.1 Relative MSE vs AR coefficient η at sample size 200. Relative MSE is relative to the MSE at η = 0 . . . 51
3.2 Relative MSE vs AR coefficient η at sample size 500. Relative MSE is relative to the MSE at η = 0 . . . 52
3.3 Learning effect: box plots of regularized average cost at different levels of learning effect. Sample size is 200 . . . 57
3.4 Learning effect: box plots of regularized average cost at different levels of learning effect. Sample size is 500 . . . 57
3.5 Burden effect: box plots of regularized average cost at different levels of the burden effect at sample size 200 . . . 65
3.6 Burden effect: box plots of regularized average cost at different levels of the burden effect at sample size 500 . . . 65
LIST OF TABLES
2.1 Underestimation of the plug-in variance estimator and the Wald confidence intervals. The theoretical Wald CI is created based on the true asymptotic variance . . . 32
3.1 I.I.D. contexts: bias in estimating the optimal policy parameter. Bias = E(θ̂_t) − θ∗ . . . 47
3.2 I.I.D. contexts: MSE in estimating the optimal policy parameter . . . 47
3.3 I.I.D. contexts: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter . . . 48
3.4 I.I.D. contexts: coverage rates of Efron-type bootstrap confidence intervals for the optimal policy parameter. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 48
3.5 I.I.D. contexts with a lenient stochasticity constraint: bias in estimating the optimal policy parameter. Bias = E(θ̂_t) − θ∗ . . . 49
3.6 I.I.D. contexts with a lenient stochasticity constraint: MSE in estimating the optimal policy parameter . . . 49
3.7 I.I.D. contexts with a lenient stochasticity constraint: coverage rates of percentile-t bootstrap confidence intervals. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 49
3.8 AR(1) contexts: bias in estimating the optimal policy parameter. Bias = E(θ̂_t) − θ∗ . . . 50
3.9 AR(1) contexts: MSE in estimating the optimal policy parameter . . . 50
3.10 AR(1) contexts: coverage rates of percentile-t bootstrap confidence intervals. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 50
3.11 Learning effect: the optimal policy and the oracle lambda . . . 53
3.12 Learning effect: bias in estimating the optimal policy parameter while estimating λ online at sample size 200. Bias = E(θ̂_t) − θ∗ . . . 55
3.13 Learning effect: MSE in estimating the optimal policy parameter while estimating λ online at sample size 200 . . . 55
3.14 Learning effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 200. λ is estimated online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 55
3.15 Learning effect: bias in estimating the optimal policy parameter while estimating λ online at sample size 500. Bias = E(θ̂_t) − θ∗ . . . 55
3.16 Learning effect: MSE in estimating the optimal policy parameter while estimating λ online at sample size 500 . . . 56
3.17 Learning effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 500. λ is estimated online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 56
3.18 Learning effect: the myopic equilibrium policy . . . 58
3.19 Learning effect: bias in estimating the myopic equilibrium policy at sample size 200. Bias = E(θ̂_t) − θ∗∗ . . . 59
3.20 Learning effect: MSE in estimating the myopic equilibrium policy at sample size 200 . . . 59
3.21 Learning effect: bias in estimating the myopic equilibrium policy at sample size 500. Bias = E(θ̂_t) − θ∗∗ . . . 59
3.22 Learning effect: MSE in estimating the myopic equilibrium policy at sample size 500 . . . 59
3.23 Burden effect: the optimal policy and the oracle lambda . . . 61
3.24 Burden effect: bias in estimating the optimal policy parameter while estimating λ online at sample size 200. Bias = E(θ̂_t) − θ∗ . . . 62
3.25 Burden effect: MSE in estimating the optimal policy parameter while estimating λ online at sample size 200 . . . 62
3.26 Burden effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 200. λ is estimated online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 62
3.27 Burden effect: bias in estimating the optimal policy parameter while estimating λ online at sample size 500. Bias = E(θ̂_t) − θ∗ . . . 63
3.28 Burden effect: MSE in estimating the optimal policy parameter while estimating λ online at sample size 500 . . . 63
3.29 Burden effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 500. λ is estimated online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 63
3.30 Burden effect: the myopic equilibrium policy . . . 66
3.31 Burden effect: bias in estimating the myopic equilibrium policy at sample size 200. Bias = E(θ̂_t) − θ∗∗ . . . 66
3.32 Burden effect: MSE in estimating the myopic equilibrium policy at sample size 200 . . . 66
3.33 Burden effect: bias in estimating the myopic equilibrium policy at sample size 500. Bias = E(θ̂_t) − θ∗∗ . . . 67
3.34 Burden effect: MSE in estimating the myopic equilibrium policy at sample size 500 . . . 67
3.35 Learning effect: bias in estimating the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online. Bias = E(θ̂_t) − θ∗ . . . 67
3.36 Learning effect: MSE in estimating the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online . . . 68
3.37 Learning effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 68
3.38 Learning effect: bias in estimating the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online. Bias = E(θ̂_t) − θ∗ . . . 68
3.39 Learning effect: MSE in estimating the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online . . . 68
3.40 Learning effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 69
3.41 Learning effect with correlated S_2 and S_3: bias in estimating the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online. Bias = E(θ̂_t) − θ∗ . . . 69
3.42 Learning effect with correlated S_2 and S_3: MSE in estimating the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online . . . 69
3.43 Learning effect with correlated S_2 and S_3: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 69
3.44 Learning effect with correlated S_2 and S_3: bias in estimating the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online. Bias = E(θ̂_t) − θ∗ . . . 70
3.45 Learning effect with correlated S_2 and S_3: MSE in estimating the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online . . . 70
3.46 Learning effect with correlated S_2 and S_3: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 70
3.47 Burden effect: bias in estimating the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online. Bias = E(θ̂_t) − θ∗ . . . 70
3.48 Burden effect: MSE in estimating the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online . . . 71
3.49 Burden effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 200. The algorithm uses λ∗ instead of learning λ online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 71
3.50 Burden effect: bias in estimating the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online. Bias = E(θ̂_t) − θ∗ . . . 71
3.51 Burden effect: MSE in estimating the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online . . . 72
3.52 Burden effect: coverage rates of percentile-t bootstrap confidence intervals for the optimal policy parameter at sample size 500. The algorithm uses λ∗ instead of learning λ online. Coverage rates significantly lower than 0.95 are marked with asterisks (*) . . . 72
4.1 The decision space D . . . 82
4.2 The decision rule for the two-stage decision procedure for personalizing treatment . . . 84
4.3 The loss function . . . 85
4.4 The critical values c_0 and c_1 at α = 0.05 . . . 86