Machine Learning: A Probabilistic Perspective
Solutions Manual
(Please do not make publicly available)
Kevin P. Murphy
The MIT Press
Cambridge, Massachusetts
London, England
Chapter 1
Introduction
1.1 Solutions
1.1.1 KNN classifier on shuffled MNIST data
We just have to insert the following piece of code.
Listing 1.1: Part of mnistShuffled1NNdemo
% ... load data
%% permute columns
D = 28*28;
setSeed(0); perm = randperm(D);    % same random permutation of the pixel columns ...
Xtrain = Xtrain(:, perm);          % ... applied to both train and test sets
Xtest = Xtest(:, perm);
% ... same as before
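The reason the accuracy is unchanged: the same permutation is applied to the training and test columns, and Euclidean distance only sums per-dimension squared differences, so every pairwise train-test distance, and hence every 1-NN prediction, is identical. Below is a minimal sketch of this invariance on random data (not MNIST; the inline sqdist helper is just for illustration):

% Permuting columns identically in train and test leaves pairwise distances unchanged.
rng(0);
Xtr = rand(5, 10); Xte = rand(3, 10);                                   % toy data
perm = randperm(10);
sqdist = @(A, B) bsxfun(@plus, sum(A.^2, 2), sum(B.^2, 2)') - 2*A*B';   % all pairwise squared distances
D1 = sqdist(Xte, Xtr);
D2 = sqdist(Xte(:, perm), Xtr(:, perm));
max(abs(D1(:) - D2(:)))                                                 % essentially zero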
1.1.2 Approximate KNN classifiers
According to John Chia, the following code will work.
Listing 1.2:
[result, ndists] = flann_search(Xtrain', Xtest', 1, ...
    struct('algorithm', 'kdtree', 'trees', 8, 'checks', 64));
errorRate = mean(ytrain(result) ~= ytest0)
He reports the following results on MNIST with 1NN.

            ntests = 1,000         ntests = 10,000
            Err      Time          Err       Time
Flann       4.8%     17 s          3.35%     17.2 s
Vanilla     3.8%     3.68 s        3.09%     28.36 s

So the approximate method is somewhat faster for large test sets, but is slightly less accurate.
1.1.3 CV for KNN
See Figure 1.1(b). The CV estimate is an overestimate of the test error, but has the right shape. Note, however, that the empirical test error is only based on 500 test points. A better comparison would use a much larger test set.
[Figure 1.1 appears here: two panels plotting misclassification rate against K; panel (b) is titled "5-fold cross validation, ntrain = 200" and shows train and test curves.]
Figure 1.1: (a) Misclassification rate vs K in a K-nearest neighbor classifier. On the left, where K is small, the model is complex and hence we overfit. On the right, where K is large, the model is simple and we underfit. Dotted blue line: training set (size 200). Solid red line: test set (size 500). (b) 5-fold cross validation estimate of test error. Figure generated by knnClassifyDemo.
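For reference, here is a minimal sketch of how such a 5-fold CV curve can be produced. It runs in base MATLAB on synthetic two-class data; the data-generating scheme and the brute-force majority-vote classifier are assumptions of this sketch, not the code behind the book's figure (which is knnClassifyDemo in pmtk3).

% 5-fold cross-validation estimate of KNN misclassification rate on toy 2-class data.
rng(0);
n = 200;
X = [randn(n/2, 2) + 1; randn(n/2, 2) - 1];      % two Gaussian blobs
y = [ones(n/2, 1); 2*ones(n/2, 1)];
Ks = 1:10:111;
fold = repmat(1:5, 1, n/5); fold = fold(randperm(n));
cvErr = zeros(size(Ks));
for ki = 1:numel(Ks)
  K = Ks(ki); err = zeros(1, 5);
  for f = 1:5
    tr = find(fold ~= f); te = find(fold == f);
    ypred = zeros(numel(te), 1);
    for i = 1:numel(te)
      d = sum(bsxfun(@minus, X(tr,:), X(te(i),:)).^2, 2);  % squared distances to training points
      [~, idx] = sort(d);
      ypred(i) = mode(y(tr(idx(1:K))));                    % majority vote among the K nearest
    end
    err(f) = mean(ypred ~= y(te));
  end
  cvErr(ki) = mean(err);
end
plot(Ks, cvErr, '-o'); xlabel('K'); ylabel('CV misclassification rate');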
Chapter 2
Probability
2.1 Solutions
2.1.1 Probabilities are sensitive to the form of the question that was used to generate the answer
1. The event space is shown below, where X is one child and Y the other.
X Y Prob.
G G 1/4
G B 1/4
B G 1/4
B B 1/4
Let N_g be the number of girls and N_b the number of boys. We have the constraint (side information) that N_b + N_g = 2 and 0 ≤ N_b, N_g ≤ 2. We are told N_b ≥ 1 and are asked to compute the probability of the event N_g = 1 (i.e., one child is a girl). By Bayes rule we have

p(N_g = 1 | N_b ≥ 1) = p(N_b ≥ 1 | N_g = 1) p(N_g = 1) / p(N_b ≥ 1)   (2.1)
                     = (1 × 1/2) / (3/4) = 2/3                          (2.2)
2. Let Y be the identity of the observed child and X be the identity of the other child. We want p(X = g | Y = b). By Bayes rule we have

p(X = g | Y = b) = p(Y = b | X = g) p(X = g) / p(Y = b)   (2.3)
                 = (1/2 × 1/2) / (1/2) = 1/2                (2.4)
Tom Minka (Minka 1998) has written the following about these results:

This seems like a paradox because it seems that in both cases we could condition on the fact that "at least one child is a boy." But that is not correct; you must condition on the event actually observed, not its logical implications. In the first case, the event was "He said yes to my question." In the second case, the event was "One child appeared in front of me." The generating distribution is different for the two events. Probabilities reflect the number of possible ways an event can happen, like the number of roads to a town. Logical implications are further down the road and may be reached in more ways, through different towns. The different number of ways changes the probability.
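Both answers can be confirmed by direct simulation. Below is a minimal MATLAB sketch; the uniform choice of which child is observed is my modelling assumption, matching the "one child appeared in front of me" story.

% Monte Carlo check of the two-children answers: 2/3 vs 1/2.
rng(0); n = 1e6;
kids = randi(2, n, 2);                        % 1 = girl, 2 = boy, two children per family
atLeastOneBoy = any(kids == 2, 2);
oneGirl = sum(kids == 1, 2) == 1;
p1 = mean(oneGirl(atLeastOneBoy))             % ~ 2/3: P(one child is a girl | at least one boy)
which = randi(2, n, 1);                       % a uniformly chosen child is observed
seen  = kids(sub2ind([n 2], (1:n)', which));
other = kids(sub2ind([n 2], (1:n)', 3 - which));
p2 = mean(other(seen == 2) == 1)              % ~ 1/2: P(other child is a girl | observed child is a boy)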
2.1.2 Legal reasoning
Let E be the evidence (the observed blood type), I be the event that the defendant is innocent, and G = ¬I be the event that the defendant is guilty.
1. The prosecutor is confusing p(E|I) with p(I|E). We are told that p(E|I) = 0.01 but the relevant quantity is p(I|E). By Bayes rule, this is

p(I|E) = p(E|I) p(I) / [p(E|I) p(I) + p(E|G) p(G)] = 0.01 p(I) / [0.01 p(I) + (1 − p(I))]   (2.5)

since p(E|G) = 1 and p(G) = 1 − p(I). So we cannot determine p(I|E) without knowing the prior probability p(I). So p(E|I) = p(I|E) only if p(G) = p(I) = 0.5, which is hardly a presumption of innocence.
To understand this more intuitively, consider the following isomorphic problem (from http://en.wikipedia.org/wiki/Prosecutor's_fallacy):

A big bowl is filled with a large but unknown number of balls. Some of the balls are made of wood, and some of them are made of plastic. Of the wooden balls, 100 are white; out of the plastic balls, 99 are red and only 1 is white. A ball is pulled out at random, and observed to be white.
Without knowledge of the relative proportions of wooden and plastic balls, we cannot tell how likely it is that the ball
is wooden. If the number of plastic balls is far larger than the number of wooden balls, for instance, then a white ball
pulled from the bowl at random is far more likely to be a white plastic ball than a white wooden ball — even though
white plastic balls are a minority of the whole set of plastic balls.
2. The defender is quoting p(G|E) while ignoring p(G). The prior odds are

p(G) / p(I) = 1/799,999   (2.6)

The posterior odds are

p(G|E) / p(I|E) = 1/7999   (2.7)

So the evidence has increased the odds of guilt by a factor of 100 (the likelihood ratio p(E|G)/p(E|I) = 1/0.01). This is clearly relevant, although perhaps still not enough to find the suspect guilty.
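A minimal MATLAB check of these odds, assuming (as the numbers above imply) a population of 800,000 with exactly one guilty person and a 1% chance that an innocent person matches the blood type:

% Prior and posterior odds of guilt from Bayes rule.
N = 800000;                 % population implied by the prior odds above
pG = 1/N; pI = 1 - pG;      % prior: one guilty person among N
pE_G = 1; pE_I = 0.01;      % certain match if guilty, 1% match rate if innocent
priorOdds = pG / pI                       % 1/799,999
postOdds  = (pE_G * pG) / (pE_I * pI)     % ~ 1/8000
postOdds / priorOdds                      % = pE_G / pE_I = 100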
2.1.3 Variance of a sum
We have

var[X + Y] = E[(X + Y)^2] − (E[X] + E[Y])^2                                     (2.8)
           = E[X^2 + Y^2 + 2XY] − (E[X]^2 + E[Y]^2 + 2 E[X] E[Y])               (2.9)
           = E[X^2] − E[X]^2 + E[Y^2] − E[Y]^2 + 2 E[XY] − 2 E[X] E[Y]          (2.10)
           = var[X] + var[Y] + 2 cov[X, Y]                                      (2.11)

If X and Y are independent, then cov[X, Y] = 0, so var[X + Y] = var[X] + var[Y].
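A quick numerical sanity check of (2.11) on sampled data (a minimal sketch; the particular way of generating correlated X and Y is arbitrary):

% Check var(X+Y) = var(X) + var(Y) + 2 cov(X,Y) on samples.
rng(0); n = 1e6;
X = randn(n, 1);
Y = 0.7 * X + randn(n, 1);                  % correlated with X
lhs = var(X + Y);
C = cov(X, Y);                              % 2x2 sample covariance matrix
rhs = var(X) + var(Y) + 2 * C(1, 2);
[lhs rhs]                                   % agree up to sampling error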
2.1.4 Bayes rule for medical diagnosis
Let T = 1 represent a positive test outcome, T = 0 represent a negative test outcome, D = 1 mean you have the disease, and D = 0 mean you don't have the disease. We are told

P(T = 1 | D = 1) = 0.99   (2.12)
P(T = 0 | D = 0) = 0.99   (2.13)
P(D = 1) = 0.0001         (2.14)

We are asked to compute P(D = 1 | T = 1), which we can do using Bayes' rule:

P(D = 1 | T = 1) = P(T = 1 | D = 1) P(D = 1) / [P(T = 1 | D = 1) P(D = 1) + P(T = 1 | D = 0) P(D = 0)]   (2.15)
                 = (0.99 × 0.0001) / (0.99 × 0.0001 + 0.01 × 0.9999)                                      (2.16)
                 = 0.009804                                                                                (2.17)

So although you are much more likely to have the disease (given that you have tested positive) than a random member of the population, you are still unlikely to have it.
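The same computation as a one-liner in MATLAB, with the numbers taken straight from (2.12)-(2.14):

% Posterior probability of disease given a positive test, via Bayes rule.
sens = 0.99; spec = 0.99; prior = 1e-4;
post = sens*prior / (sens*prior + (1 - spec)*(1 - prior))   % 0.0098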
2.1.5 The Monty Hall problem
Let H_i denote the hypothesis that the prize is behind door i. We make the following assumptions: the three hypotheses H_1, H_2 and H_3 are equiprobable a priori, i.e.,

P(H_1) = P(H_2) = P(H_3) = 1/3.   (2.18)
The datum we receive, after choosing door 1, is one of D = 3 and D = 2 (meaning door 3 or 2 is opened, respectively). We assume that these two possible outcomes have the following probabilities. If the prize is behind door 1 then the host has a free choice; in this case we assume that the host selects at random between D = 2 and D = 3. Otherwise the choice of the host is forced and the probabilities are 0 and 1.

P(D = 2 | H_1) = 1/2    P(D = 2 | H_2) = 0    P(D = 2 | H_3) = 1
P(D = 3 | H_1) = 1/2    P(D = 3 | H_2) = 1    P(D = 3 | H_3) = 0     (2.19)
Now, using Bayes theorem, we evaluate the posterior probabilities of the hypotheses:

P(H_i | D = 3) = P(D = 3 | H_i) P(H_i) / P(D = 3)   (2.20)

P(H_1 | D = 3) = (1/2)(1/3) / P(D = 3)    P(H_2 | D = 3) = (1)(1/3) / P(D = 3)    P(H_3 | D = 3) = (0)(1/3) / P(D = 3)   (2.21)

The denominator P(D = 3) is (1/2) because it is the normalizing constant for this posterior distribution. So

P(H_1 | D = 3) = 1/3    P(H_2 | D = 3) = 2/3    P(H_3 | D = 3) = 0.   (2.22)

So the contestant should switch to door 2 in order to have the biggest chance of getting the prize.
Many people find this outcome surprising. There are two ways to make it more intuitive. One is to play the game thirty times with a friend and keep track of the frequency with which switching gets the prize. Alternatively, you can perform a thought experiment in which the game is played with a million doors. The rules are now that the contestant chooses one door, then the game show host opens 999,998 doors in such a way as not to reveal the prize, leaving the contestant's selected door and one other door closed. The contestant may now stick or switch. Imagine the contestant confronted by a million doors, of which doors 1 and 234,598 have not been opened, door 1 having been the contestant's initial guess. Where do you think the prize is?
Another way to think about the problem is to use a directed graphical model of the form P → M ← F, where P indicates the location of the prize, F indicates your first choice, and M indicates which door Monty opens. Clearly P and F cause (determine) M. When we observe M, our belief about P changes because we have observed evidence about its child M.
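The 1/3 vs 2/3 split is also easy to reproduce by simulation. Below is a minimal MATLAB sketch; the encoding of the host's rules is my own, but follows the assumptions stated above.

% Simulate the Monty Hall game: compare sticking with switching.
rng(0); n = 1e5;
winsStick = 0; winsSwitch = 0;
for t = 1:n
  prize = randi(3); first = randi(3);
  openable = setdiff(1:3, [first prize]);     % host opens a door that is neither chosen nor winning
  opened = openable(randi(numel(openable)));
  switched = setdiff(1:3, [first opened]);    % the one remaining closed door
  winsStick = winsStick + (first == prize);
  winsSwitch = winsSwitch + (switched == prize);
end
[winsStick winsSwitch] / n    % ~ [1/3 2/3]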
2.1.6 Moments of a Bernoulli distribution
Mean:

E[X] = Σ_{x∈{0,1}} x p(x) = 0 · p(X = 0) + 1 · p(X = 1) = θ   (2.23)

Variance:

var[X] = E[(X − µ)^2] = Σ_{x∈{0,1}} p(x)(x − µ)^2   (2.24)
       = θ(1 − θ)^2 + (1 − θ)(0 − θ)^2              (2.25)
       = θ(1 + θ^2 − 2θ) + (1 − θ)θ^2               (2.26)
       = θ + θ^3 − 2θ^2 + θ^2 − θ^3                 (2.27)
       = θ − θ^2 = θ(1 − θ)                         (2.28)

Alternative proof:

E[X^2] = 0^2 p(x = 0) + 1^2 p(x = 1) = θ            (2.29)
var[X] = E[X^2] − E[X]^2 = θ − θ^2 = θ(1 − θ)       (2.30)
2.1.7 Conditional independence
1. Bayes' rule gives

P(H | E_1, E_2) = P(E_1, E_2 | H) P(H) / P(E_1, E_2)   (2.31)

Thus the information in (ii) is sufficient. In fact, we don't need P(E_1, E_2) because it is equal to the normalization constant (to enforce the sum-to-one constraint). (i) and (iii) are insufficient.

2. Now the equation simplifies to

P(H | E_1, E_2) = P(E_1 | H) P(E_2 | H) P(H) / P(E_1, E_2)   (2.32)

so (i) and (ii) are obviously sufficient. (iii) is also sufficient, because we can compute P(E_1, E_2) using normalization.
2.1.8 Pairwise independence does not imply mutual independence
We provide two counterexamples.

Let X_1 and X_2 be independent binary random variables, and X_3 = X_1 ⊕ X_2, where ⊕ is the XOR operator. We have p(X_3 | X_1, X_2) ≠ p(X_3), since X_3 can be deterministically calculated from X_1 and X_2. So the variables {X_1, X_2, X_3} are not mutually independent. However, we also have p(X_3 | X_1) = p(X_3), since without X_2, no information can be provided to X_3. So X_1 ⊥ X_3 and similarly X_2 ⊥ X_3. Hence {X_1, X_2, X_3} are pairwise independent.

Here is a different example. Let there be four balls in a bag, numbered 1 to 4. Suppose we draw one at random. Define 3 events as follows:

• X_1: ball 1 or 2 is drawn.
• X_2: ball 2 or 3 is drawn.
• X_3: ball 1 or 3 is drawn.

We have p(X_1) = p(X_2) = p(X_3) = 0.5. Also, p(X_1, X_2) = p(X_2, X_3) = p(X_1, X_3) = 0.25. Hence p(X_1, X_2) = p(X_1) p(X_2), and similarly for the other pairs. Hence the events are pairwise independent. However, p(X_1, X_2, X_3) = 0 ≠ 1/8 = p(X_1) p(X_2) p(X_3).
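A minimal enumeration check of the second counterexample in MATLAB (the indicator encoding of the three events is my own):

% Enumerate the four equally likely balls; check pairwise vs mutual independence.
balls = 1:4;
X1 = ismember(balls, [1 2]);   % event X1: ball 1 or 2
X2 = ismember(balls, [2 3]);   % event X2: ball 2 or 3
X3 = ismember(balls, [1 3]);   % event X3: ball 1 or 3
[mean(X1 & X2)  mean(X1)*mean(X2)]                   % both 0.25: pairwise independent
[mean(X1 & X2 & X3)  mean(X1)*mean(X2)*mean(X3)]     % 0 vs 0.125: not mutually independent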
2.1.9 Conditional independence iff joint factorizes
Independence ⇒ factorization. Let g(x, z) = p(x|z) and h(y, z) = p(y|z). If X ⊥ Y | Z then

p(x, y|z) = p(x|z) p(y|z) = g(x, z) h(y, z)   (2.33)

Factorization ⇒ independence. If p(x, y|z) = g(x, z) h(y, z) then

1 = Σ_{x,y} p(x, y|z) = Σ_{x,y} g(x, z) h(y, z) = [Σ_x g(x, z)] [Σ_y h(y, z)]   (2.34)
p(x|z) = Σ_y p(x, y|z) = Σ_y g(x, z) h(y, z) = g(x, z) Σ_y h(y, z)              (2.35)
p(y|z) = Σ_x p(x, y|z) = Σ_x g(x, z) h(y, z) = h(y, z) Σ_x g(x, z)              (2.36)

Multiplying (2.35) and (2.36) and using (2.34),

p(x|z) p(y|z) = g(x, z) h(y, z) [Σ_x g(x, z)] [Σ_y h(y, z)]   (2.37)
              = g(x, z) h(y, z) = p(x, y|z)                    (2.38)
2.1.10 Conditional independence
1. True, since

(X ⊥ W | Z, Y) ⇒ p(X | W, Z, Y) = p(X | Z, Y)   (2.39)
(X ⊥ Y | Z)    ⇒ p(X | Z, Y) = p(X | Z)          (2.40)
               ⇒ p(X | W, Z, Y) = p(X | Z)        (2.41)
               ⇒ (X ⊥ Y, W | Z)                   (2.42)

2. False. Consider the DAG in Figure 2.1. It encodes that (X ⊥ Y | Z) and (X ⊥ Y | W) but not (X ⊥ Y | Z, W).
Figure 2.1: A DGM.
2.1.11 Deriving the inverse gamma density
Let x ∼ Ga(a, b) and y = 1/x, so x = 1/y. We have

p_y(y) = p_x(x) |dx/dy|   (2.43)

where

dx/dy = −1/y^2 = −x^2   (2.44)

So

p_y(y) = x^2 (b^a / Γ(a)) x^{a−1} e^{−xb}                  (2.45)
       = (b^a / Γ(a)) x^{a+1} e^{−xb}                       (2.46)
       = (b^a / Γ(a)) y^{−(a+1)} e^{−b/y} = IG(y | a, b)    (2.47)
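A small numeric check of (2.47) in MATLAB (the values a = 3, b = 2 are arbitrary): the change-of-variables density p_x(1/y)/y^2 should match the IG density pointwise, and the latter should integrate to 1.

% Verify the inverse-gamma density by change of variables and normalization.
a = 3; b = 2;
gam = @(x) b^a / gamma(a) * x.^(a-1) .* exp(-b*x);       % Ga(x | a, b)
ig  = @(y) b^a / gamma(a) * y.^(-(a+1)) .* exp(-b./y);   % IG(y | a, b)
y = [0.3 1 2.5];
[ig(y); gam(1./y) ./ y.^2]        % two matching rows
integral(ig, 0, Inf)              % ~ 1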
2.1.12 Normalization constant for a 1D Gaussian
Following the first hint we have

Z^2 = ∫_0^{2π} ∫_0^∞ r exp(−r^2 / (2σ^2)) dr dθ            (2.48)
    = [∫_0^{2π} dθ] [∫_0^∞ r exp(−r^2 / (2σ^2)) dr]        (2.49)
    = (2π) I                                               (2.50)

where I is the inner integral

I = ∫_0^∞ r exp(−r^2 / (2σ^2)) dr   (2.51)

Following the second hint we have

I = −σ^2 ∫_0^∞ (−r/σ^2) e^{−r^2/(2σ^2)} dr   (2.52)
  = −σ^2 [e^{−r^2/(2σ^2)}]_0^∞                (2.53)
  = −σ^2 [0 − 1] = σ^2                        (2.54)

Hence

Z^2 = 2πσ^2      (2.55)
Z = σ √(2π)      (2.56)
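A one-line numeric confirmation in MATLAB (σ = 1.7 is an arbitrary choice):

% Check that the Gaussian normalization constant is sigma*sqrt(2*pi).
s = 1.7;
Z = integral(@(x) exp(-x.^2 / (2*s^2)), -Inf, Inf)
[Z, s*sqrt(2*pi)]     % agree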
2.1.13 Expressing mutual information in terms of entropies

I(X, Y) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]                          (2.57)
        = Σ_{x,y} p(x, y) log [ p(x|y) / p(x) ]                                  (2.58)
        = −Σ_{x,y} p(x, y) log p(x) + Σ_{x,y} p(x, y) log p(x|y)                 (2.59)
        = −Σ_x p(x) log p(x) − ( −Σ_{x,y} p(x, y) log p(x|y) )                   (2.60)
        = −Σ_x p(x) log p(x) − ( −Σ_y p(y) Σ_x p(x|y) log p(x|y) )               (2.61)
        = H(X) − H(X|Y)                                                          (2.62)

We can show I(X, Y) = H(Y) − H(Y|X) by symmetry.
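A minimal MATLAB check of (2.62) on a small joint distribution (the 2×2 table J is arbitrary):

% Verify I(X,Y) = H(X) - H(X|Y) on a toy joint distribution.
J = [0.3 0.2; 0.1 0.4];                              % p(x, y): rows index x, columns index y
px = sum(J, 2); py = sum(J, 1);
I = sum(sum(J .* log(J ./ (px * py))));              % mutual information (nats)
HX = -sum(px .* log(px));
HXgY = -sum(sum(J .* log(J ./ repmat(py, 2, 1))));   % H(X|Y) = -sum p(x,y) log p(x|y)
[I, HX - HXgY]                                       % agree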
2.1.14 Mutual information for correlated normals
The entropy is

h(X, Y) = (1/2) log[(2πe)^2 det Σ] = (1/2) log[(2πe)^2 σ^4 (1 − ρ^2)]   (2.63)

Since X and Y are individually normal with variance σ^2, we have

h(X) = h(Y) = (1/2) log[2πeσ^2]   (2.64)

Hence

I(X, Y) = h(X) + h(Y) − h(X, Y)                                       (2.65)
        = log[2πeσ^2] − (1/2) log[(2πe)^2 σ^4 (1 − ρ^2)]              (2.66)
        = (1/2) log[(2πeσ^2)^2] − (1/2) log[(2πeσ^2)^2 (1 − ρ^2)]     (2.67)
        = (1/2) log[1 / (1 − ρ^2)] = −(1/2) log[1 − ρ^2]              (2.68)

1. ρ = 1. In this case, X = Y, and I(X, Y) = ∞, which makes sense.
2. ρ = 0. In this case, X and Y are independent, and I(X, Y) = 0, which makes sense.
3. ρ = −1. In this case, X = −Y, and I(X, Y) = ∞, which again makes sense.
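A numerical check of (2.68) in MATLAB (a minimal sketch; σ = 1, ρ = 0.8, and the integration box are arbitrary choices):

% Check I(X,Y) = -0.5*log(1 - rho^2) for a bivariate normal with correlation rho.
s = 1; rho = 0.8; L = 8;
p  = @(x, y) exp(-(x.^2 - 2*rho*x.*y + y.^2) ./ (2*s^2*(1 - rho^2))) ./ (2*pi*s^2*sqrt(1 - rho^2));
px = @(x) exp(-x.^2 ./ (2*s^2)) ./ sqrt(2*pi*s^2);
I = integral2(@(x, y) p(x, y) .* log(p(x, y) ./ (px(x) .* px(y))), -L, L, -L, L)
-0.5 * log(1 - rho^2)       % matches (~ 0.51 nats)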
2.1.15 A measure of correlation (normalized mutual information)
1. We have

r = [H(X) − H(Y|X)] / H(X) = [H(Y) − H(Y|X)] / H(X) = I(X, Y) / H(X)   (2.69)

where the second step follows since H(X) = H(Y).