Truncation-free Hybrid Inference for DPMM
Arnim Bleier
Department of Computational Social Science
Leibniz Institute for the Social Sciences
Cologne, 50667, Germany
[email protected]
Abstract
Dirichlet process mixture models (DPMM) are a cornerstone of Bayesian non-parametrics. While these models free us from choosing the number of components a priori, computationally attractive variational inference often reintroduces the need to do so, via a truncation on the variational distribution. In this paper we present a truncation-free hybrid inference for DPMM, combining the advantages of sampling-based MCMC and variational methods. The proposed hybridization enables more efficient variational updates, while increasing model complexity only if needed. We evaluate the properties of the hybrid updates and their empirical performance in single- as well as mixed-membership models. Our method is easy to implement and performs favorably compared to existing schemas.
1 Background
To begin with, consider a model for data $\mathcal{X} = \{x_1, x_2, \ldots, x_N\}$ that is assumed to be generated by a mixture of simpler component models $F(\phi_k)$. Following the single-membership assumption of each data point $x_i \in \mathcal{X}$ being explained by a single component $\phi_{z_i}$ we have

$$\begin{aligned}
\theta &\sim \mathrm{Dir}(\alpha) \\
\phi_k &\sim H(\beta) \quad \forall k \in [1,K] \\
z_i &\sim \mathrm{Cat}(\theta) \quad \forall i \in [1,N] \\
x_i &\sim F(\phi_{z_i}) \quad \forall i \in [1,N],
\end{aligned} \tag{1}$$
v andarriveattheinfinite-dimensionalDPMMforK . WhilecollapsedGibbssampling(CGS)
i is commonly used to explore the unbounded latent s→pa∞ce of this model, consider as an alternative
X
thecollapsedvariationaldistribution
r
a N
(cid:89)
q(z)= Cat(z γ ,...,γ ), (2)
i i1 iK+1
|
i=1
similar to partially collapsed approximations [4], however with $\theta$ and $\phi$ integrated out. The updates to optimize the variational parameter $\gamma_i$ with regards to the true distribution $p$ are then, for each observation $i \in \{1,\ldots,N\}$,

$$\gamma_{ik} \propto \exp\big(\mathbb{E}[\log p(z_i = k \mid z_{\neg i})] + \mathbb{E}[\log p(x_i \mid x^{\neg i}_{z=k}, \beta)]\big). \tag{3}$$
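To make the generative process in Equation 1 concrete, a minimal sketch follows; the Gaussian choices for $F$ and $H$ are illustrative assumptions, as the model leaves both component families generic:

```python
import numpy as np

# Illustrative sampler for Equation 1; F and H are chosen Gaussian here
# purely for concreteness (the model itself leaves both generic).
rng = np.random.default_rng(0)
N, K, alpha, beta = 100, 5, 1.0, 10.0

theta = rng.dirichlet(np.full(K, alpha))  # theta ~ Dir(alpha)
phi = rng.normal(0.0, beta, size=K)       # phi_k ~ H(beta), here N(0, beta^2)
z = rng.choice(K, size=N, p=theta)        # z_i ~ Cat(theta)
x = rng.normal(phi[z], 1.0)               # x_i ~ F(phi_{z_i}), here N(phi_{z_i}, 1)
```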
While exact computations are hard, second-order Taylor expansions [4] or the use of zero-order information have been suggested for estimation [1]. Using only zero-order information the optimal settings of $\gamma_{ik}$ are

$$\gamma_{ik} \propto \begin{cases} \dfrac{n^{\neg i}_{k}}{n^{\neg i} + \alpha}\, p(x_i \mid x^{\neg i}_{z=k}, \beta) & \text{if } k \leq K \\[6pt] \dfrac{\alpha}{n^{\neg i} + \alpha}\, p(x_i \mid \beta) & \text{if } k = K+1, \end{cases} \tag{4}$$
Figure 1: Illustration of the hybrid updates. The unusable $(K+1)$-dimensional update is replaced by sampling an indicator $c$: if $c = 1$ the truncated $K$-dimensional variational update is used; if $c = 2$ a new dimension is instantiated, making the update truncation-free.
where we overload the notation $n^{\neg i}_k = \sum_{j=1,\, j \neq i}^{N} \gamma_{jk}$, as compared to the standard CGS, with the expected number of data points explained by the $k$th component. Besides the simplifying assumptions made, these updates would place probability mass on each component, $\gamma_{ik} \neq 0\ \forall k \in \{1,\ldots,K+1\}$, and introduce a new component in each step of the inference. A common way to address this problem is the use of fixed finite-dimensional variational approximations for DPMM [4]. Alternatively, Lin [5] as well as Wang et al. [7], amongst others, discuss methods for growing the truncation as part of the inference. Lin [5] introduces an additional parameter controlling the growth of the truncation. Wang et al. [7] estimate parameters for locally collapsed variational inference from traditional samples, losing valuable information in the updates.
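Returning to Equation 4, a minimal sketch of the zero-order update may help fix ideas. The helper `pred_density` is a hypothetical stand-in for the posterior predictive $p(x_i \mid x^{\neg i}_{z=k}, \beta)$ for $k \leq K$ and the prior predictive $p(x_i \mid \beta)$ for $k = K+1$; since the denominator $n^{\neg i} + \alpha$ is shared across cases, it cancels under normalization:

```python
import numpy as np

def cvb0_update(i, gamma, alpha, pred_density):
    """Zero-order update of Equation 4 for observation i (sketch).

    gamma        -- (N, K+1) array of current variational parameters
    pred_density -- hypothetical helper: pred_density(i, k) returns the
                    predictive density of x_i under component k, or the
                    prior predictive for k = K+1
    """
    n = gamma.sum(axis=0) - gamma[i]     # expected counts n_k^{-i}
    weights = np.append(n[:-1], alpha)   # n_k^{-i} if k <= K, alpha otherwise
    phi = weights * np.array([pred_density(i, k) for k in range(gamma.shape[1])])
    return phi / phi.sum()               # shared denominator cancels here
```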
We extend the idea of estimating variational parameters from samples with a method to construct
the samples more efficiently. In the remainder of this paper, we start with the construction of the
proposed truncation-free hybrid updates. After that we study the properties of these updates. We
concludewithashortevaluationaswellasadiscussionofthecurrentlimitationsanddirectionsfor
futurework.
2 Construction of Hybrid Updates
Our goal is to allow for a truncation-free variational posterior inference that keeps as much information as possible of the updates, while still being able to explore the unbounded latent space in a fashion similar to the Gibbs sampler. To reach that goal, we suggest replacing the unusable $K+1$ dimensional update of the variational parameter $\gamma_i$ with a hybrid update. The suggested update has either the form of a $K$ dimensional truncated variational parameter or a $K+1$ dimensional parameter instantiating a new component. Our hybridization depends only on local information and we will use the abbreviation $\varphi_k = \gamma_{ik}$ to refer, in this section, to the $k$th component of the $K+1$ dimensional probability vector in Equation 4.
Let $\xi$ be the two-dimensional parameter of a categorical distribution, with the first dimension

$$\xi_1 \propto \sum_{k=1}^{K} \varphi_k \tag{5}$$

being proportional to the sum of the explanatory power of the first $K$ components, and the second dimension

$$\xi_2 \propto \varphi_{K+1} \tag{6}$$

being proportional to what is explained by the yet uninstantiated components. Furthermore, let $\zeta_c$, with $c \in \{1,2\}$, be probability vectors representing, respectively, a truncated variational distribution and a Gibbs sample instantiating a new dimension, in vector notation

$$\zeta_{1k} \propto \begin{cases} \varphi_k & \text{if } k \leq K \\ 0 & \text{if } k = K+1 \end{cases} \qquad \zeta_{2k} \propto \begin{cases} 0 & \text{if } k \leq K \\ 1 & \text{if } k = K+1. \end{cases} \tag{7}$$
Figure 2: Predictive performance in single-membership models for the Associated Press dataset (test perplexity versus iterations for HCVB0, CGS, and TSBVB).
With this setup, we then sample a variable

$$c \sim \mathrm{Cat}(\xi) \tag{8}$$

from $\xi$ to indicate whether the truncated variational update or the probability vector instantiating a new component is selected,

$$\varphi^{\mathrm{HYB}}_k \propto \zeta_{ck}. \tag{9}$$
The probability vector $\varphi^{\mathrm{HYB}}$ is our hybrid update. The hybrid update replaces the original unusable $K+1$ dimensional variational update, $\gamma_i \leftarrow \varphi^{\mathrm{HYB}}$, using most of the time the efficient truncated $K$ dimensional variational distribution without introducing a new dimension in the update step, while introducing a new $K+1$th component only if needed, similar to a Gibbs sampler. For a graphical illustration of the construction of $\varphi^{\mathrm{HYB}}$, see Figure 1.
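A minimal sketch of the complete hybrid update (Equations 5-9) follows; the function name and the NumPy-based implementation are illustrative choices, not part of the paper:

```python
import numpy as np

def hybrid_update(phi, rng=None):
    """Hybrid update of Equations 5-9: replace the (K+1)-dimensional
    update phi by either its truncated first K dimensions (c = 1) or an
    indicator instantiating a new component (c = 2)."""
    rng = np.random.default_rng() if rng is None else rng
    phi = np.asarray(phi, dtype=float)
    phi = phi / phi.sum()                      # normalized gamma_i (Eq. 4)
    xi = np.array([phi[:-1].sum(), phi[-1]])   # Eqs. 5 and 6
    c = rng.choice(2, p=xi)                    # Eq. 8: c ~ Cat(xi)
    hyb = np.zeros_like(phi)
    if c == 0:                                 # paper's c = 1: truncated update
        hyb[:-1] = phi[:-1] / phi[:-1].sum()   # zeta_1 of Eq. 7
    else:                                      # paper's c = 2: new component
        hyb[-1] = 1.0                          # zeta_2 of Eq. 7
    return hyb                                 # Eq. 9: phi^HYB
```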
3 Properties of Hybrid Updates
This construction of the hybrid updates has a number of favorable properties. By definition of the parameter $\xi$ (Equations 5, 6), the event of sampling the second category, $c = 2$, has the same expectation and variance as sampling a new component in the Gibbs sampler:

$$\begin{aligned}
\mathbb{E}[\mathbb{1}[c=2]] &= \mathbb{E}[\mathbb{1}[z_i = K+1]] \\
\mathrm{Var}[\mathbb{1}[c=2]] &= \mathrm{Var}[\mathbb{1}[z_i = K+1]].
\end{aligned} \tag{10}$$

Together with Equations 7 and 9, we see that this carries over to the hybrid update itself,

$$\begin{aligned}
\mathbb{E}[\mathbb{1}[\varphi^{\mathrm{HYB}}_{K+1} = 1]] &= \mathbb{E}[\mathbb{1}[z_i = K+1]] \\
\mathrm{Var}[\mathbb{1}[\varphi^{\mathrm{HYB}}_{K+1} = 1]] &= \mathrm{Var}[\mathbb{1}[z_i = K+1]],
\end{aligned} \tag{11}$$

making it possible to introduce new components like in a Gibbs sampler. Even more, the preservation of expectation is not limited to the $K+1$th dimension itself: our hybrid updates $\varphi^{\mathrm{HYB}}$ preserve the expectation, with regards to $\varphi_k$, over all $K+1$ dimensions,

$$\mathbb{E}[\varphi^{\mathrm{HYB}}_k] = \mathbb{E}[\varphi_k] \quad \forall k \in [1,\ldots,K+1]. \tag{12}$$

Note that this is, with the exception of locally collapsed variational inference [7], generally not the case for variational updates in non-parametric models.
Moreover, the sum of the explanatory power of the existing $K$ dimensions will exceed the explanatory power of introducing a new dimension, $\mathbb{E}[\xi_1] > \mathbb{E}[\xi_2]$, for most data points, supporting the use of the more informative variational distribution $\zeta_1$ in most of the updates. Finally, the computations necessary for the hybrid update $\varphi^{\mathrm{HYB}}$ are easy to implement and readily available, at almost no additional computational cost, from the normalization terms of $\varphi$.
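As a quick empirical check of Equation 12, the Monte Carlo mean of many hybrid updates of a fixed $\varphi$ should recover $\varphi$ itself in every dimension, including the $K+1$th; a short usage example reusing the hybrid_update sketch from the previous section:

```python
import numpy as np

# Sanity check of Equation 12 (assumes hybrid_update as sketched above).
phi = np.array([0.5, 0.3, 0.15, 0.05])    # K = 3 existing plus one new dimension
mean = np.mean([hybrid_update(phi) for _ in range(100_000)], axis=0)
print(np.allclose(mean, phi, atol=1e-2))  # expected output: True
```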
Figure 3: Predictive performance in HDP-LDA models for the New York Times dataset (test perplexity versus documents seen, left, and wall-clock time in seconds, right, for HCSVB0, PCSVB0, SCTFVB, and SCVB0 with K = 40, 100, and 300).
4 Experiments and Discussion
This section concludes our paper with an early empirical evaluation of the proposed hybrid updates. For the evaluation we used two text datasets: (1) The Associated Press corpus, consisting of 2,250 documents, where we used a vocabulary of 10,932 distinct terms occurring over a total of 398k tokens. (2) The larger New York Times corpus, consisting of 1.8 million articles, from which we extracted 153 million tokens using a vocabulary of 77,928 distinct terms.

The Associated Press corpus was used for evaluating the proposed updates in the single-membership model together with a Dirichlet-Multinomial data model for the documents. In the experiments, we held out 20% of the documents as a test set $\mathcal{X}^{\mathrm{test}}$ and batch-trained on the remaining documents $\mathcal{X}^{\mathrm{train}}$. Next, we split each test document $x_i \in \mathcal{X}^{\mathrm{test}}$ in two parts, $x_i = (x^a_i, x^b_i)$, with $x^a_i$ consisting of 70% of the document for estimating the indicator variable, and computed the perplexity of the remaining 30%, $x^b_i$. We then compared the perplexity versus the number of iterations. Figure 2 displays the performance for hybrid updates in the zero-order collapsed variational setting (HCVB0), CGS, and truncated ($T = 40$) stick-breaking mean-field variational Bayes (TSBVB) [4].
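The document-completion evaluation can be sketched as follows; `estimate_mixture` is a hypothetical helper that infers the mixture proportions of a held-out document from its first 70%, and `component_word` stands in for the matrix of per-component word distributions:

```python
import numpy as np

def completion_perplexity(test_docs, component_word, estimate_mixture):
    """Document-completion perplexity (sketch): estimate proportions on
    the first 70% of each test document, score the remaining 30%."""
    log_lik, n_tokens = 0.0, 0
    for doc in test_docs:                    # doc: array of word ids
        split = int(0.7 * len(doc))
        xa, xb = doc[:split], doc[split:]
        theta = estimate_mixture(xa)         # hypothetical inference helper
        log_lik += np.log(theta @ component_word[:, xb]).sum()
        n_tokens += len(xb)
    return np.exp(-log_lik / n_tokens)
```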
The New York Times corpus was used for evaluating the hybrid updates in mixed-membership HDP-LDA models, with test-train splits similar to above. However, for this larger dataset we resorted to collapsed stochastic inference using minibatches of 60 documents. We compared our method to the truncation-free locally collapsed variational inference (SCTFVB) [7] and the finite-dimensional stochastic collapsed variational Bayesian inference for LDA (SCVB0) [3] using 40, 100 and 300 topics. We employed our hybrid updates in a setting similar to SCTFVB, however using a lower-bound approximation for the estimation of the stick-breaking weights [6, 2]. We used the same parameterizations for the update schedules in all inference schemas. Figure 3 displays the results, as a function of the number of documents processed (left) and wall-clock time in seconds (right), for the same runs. In the figure, HCSVB0 denotes the results with our hybrid updates. PCSVB0 denotes a finite-dimensional variational approximation otherwise identical to HCSVB0, but truncated at 300 topics.
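For reference, a minimal sketch of the stochastic blending of expected counts that such collapsed stochastic schemes rely on; the Robbins-Monro schedule and its parameters tau and kappa are illustrative assumptions, as the paper does not specify its update schedule:

```python
def stochastic_count_update(n_hat, batch_stats, t, n_docs, batch_size,
                            tau=64.0, kappa=0.7):
    """Blend running expected counts with rescaled minibatch statistics
    using a decaying Robbins-Monro step size (illustrative sketch)."""
    rho = (t + tau) ** -kappa        # step size at iteration t
    scale = n_docs / batch_size      # rescale minibatch to corpus size
    return (1.0 - rho) * n_hat + rho * scale * batch_stats
```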
In this work, we sought to combine the advantages of MCMC schemas and variational schemas in a single inference schema for Bayesian non-parametric models. We addressed this demand by presenting a novel type of hybridization that efficiently uses the full variational distribution while sampling for the introduction of new components. The proposed method is easy to implement and measurably improves the predictive performance over state-of-the-art methods for single- as well as mixed-membership models at little additional computational cost. The current limitations of the presented work are two-fold. While we have established some favorable properties of the updates and found predictive performance improvements, we rely on approximations and have only limited theoretical arguments legitimizing our approach. The other limitation of this work is its scope. Next to a more thorough experimental evaluation and further formalization, an adaption of the hybrid updates to Wang et al. [8]'s Chinese restaurant process based variational inference for the HDP could potentially be a promising direction for future work.
References

[1] Arthur Asuncion, Max Welling, Padhraic Smyth, and Yee Whye Teh. On smoothing and inference for topic models. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.
[2] Arnim Bleier. Practical collapsed stochastic variational inference for the HDP. In NIPS Workshop on Topic Models: Computation, Application, and Evaluation, 2013.
[3] James Foulds, Levi Boyles, Christopher DuBois, Padhraic Smyth, and Max Welling. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013.
[4] Kenichi Kurihara, Max Welling, and Yee Whye Teh. Collapsed variational Dirichlet process mixture models. In IJCAI, volume 7, 2007.
[5] Dahua Lin. Online learning of nonparametric mixture models via sequential variational approximation. In Advances in Neural Information Processing Systems, 2013.
[6] Issei Sato, Kenichi Kurihara, and Hiroshi Nakagawa. Practical collapsed variational Bayes inference for hierarchical Dirichlet process. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012.
[7] Chong Wang and David M. Blei. Truncation-free online variational inference for Bayesian nonparametric models. In Advances in Neural Information Processing Systems, 2012.
[8] Chong Wang, John William Paisley, and David M. Blei. Online variational inference for the hierarchical Dirichlet process. In AISTATS, 2011.