Table Of ContentA Collaborative Kalman Filter for Time-Evolving Dyadic Processes
San Gultekin John Paisley
Department of Electrical Engineering, Columbia University
Email: {sg3108, jpaisley}@columbia.edu
Abstract—WepresentthecollaborativeKalmanfilter(CKF), into consideration the patterns of all other measurements up
a dynamic model for collaborative filtering and related fac- until that point.
torization models. Using the matrix factorization approach to
An effective approach to collaborative filtering is via
5
collaborative filtering, the CKF accounts for time evolution
1 matrix factorization [7], [11]. These methods embed users
by modeling each low-dimensional latent embedding as a
0 multidimensional Brownian motion. Each observation is a and objects into a low-dimensional latent space and make
2 random variable whose distribution is parameterized by the predictions based upon the relevant dot products. Such
n dot product of the relevant Brownian motions at that moment methodshaveprovedverypowerfulsincetheyarerelatedto
a in time. This is naturally interpreted as a Kalman filter with thelow-rankmatrixcompletionproblem,whereinitisshown
J multiple interacting state space vectors. We also present a
that a sparsely sampled low-rank matrix can be exactly
method for learning a dynamically evolving drift parameter
2
for each location by modeling it as a geometric Brownian recoveredinidealcases,orverycloselyotherwise[2].Inthe
2
motion. We handle posterior intractability via a mean-field Bayesian setting,treating missing data is efficiently handled
] variational approximation, which also preserves tractability by the joint likelihood function, which allows for simple
L for downstream calculations in a manner similar to the closed form updates to the latent embeddings [12].
M Kalmanfilter.Weevaluatethemodelonseverallargedatasets, One common assumption in matrix factorization ap-
providing quantitative evaluation on the 10 million Movielens
. and 100 million Netflix datasets and qualitative evaluation on proaches is that the model is “static,” meaning that every
t
a a set of 39 million stock returns divided across roughly 6,500 user or object has a latent location that is fixed in time and
t companies from the years 1962–2014. themodellearnsthissinglepointusingalldataina“batch”
s
[ setting. Therefore, these models assume, e.g., that a user’s
taste in movies is unchanging, or a team’s sequence of wins
1 I. INTRODUCTION
and losses is a stationary process. Though performance of
v
4 Collaborative filtering is a general and effective method such models can be very good, this assumption is clearly
2 for making pairwise predictions based on historical data. It untrueandmodelsthataccountfordynamicinformationare
6 ismostfrequentlyassociatedwithrecommendationsystems, more appropriate in principle.
5
where the model learns user preference by considering In this paper we present a dynamic collaborative filter-
0
all previous user/object interactions. User rating systems ing model for temporally evolving data for applications
.
1
are widespread online. For example, Netflix lets users rate including recommendation systems, currency exchange rate
0
movies on a five star rating system, Amazon allows users tracking,student/questiontutoringsystemsandathleticcom-
5
1 to rate products and YouTube allows users to “like” or petition prediction, among others. The model is inspired by
: “dislike” videos on their website. In cases such as these, the matrix factorization approach to collaborative filtering,
v
a major objective is to recommend content to users in whereeachobjecthasalocationinalatentspace.However,
i
X an intelligent way. Collaborative filtering addresses this rather than fixing these locations, we allow them to drift
r problembyallowingthepreferencesofotheruserstoinform accordingtoamultidimensionalBrownianmotion[3].Since
a
the recommendations made for a user of interest [13]. we are motivated by real-time prediction, we only process
Though typically used for predicting user responses, the each observation in the data stream once. Therefore our
mainideabehindcollaborativefilteringcanbebroadenedto problem is ideally suited to large data sets. For inference
include other situations where the outcome of an event is we introduce a variational approximation [16] since the
predicted using information about the outcomes of all other posterior distribution of model variables given an event is
events in a dyadic setting. For example, in sporting events not in closed form; this allows for closed form updates
one might want to predict whether team A will beat team and preserves tractability for future calculations. The model
B by using the results of all other games up to that point. thus learns a time-evolving probability distribution on latent
Or one might wish to estimate the probability that batter user/object locations. From the perspective of a single loca-
X will get a hit against pitcher Y at a specific moment tion, the proposed model can be interpreted as a state-space
in time using all previous at-bats for every batter/pitcher modelwithdynamicssimilartoaKalmanfilter[17],andso
combination. In other problem domains, one might wish to wecallourapproachthecollaborativeKalmanfilter (CKF).
predictthestockpricesorexchangeratesinawaythattakes In addition, for problems such as stock modeling the
drift parameter of the latent Brownian motions cannot be A large number of collaborative filtering methods can
considered to be a fixed value because of the changing be viewed as variations and developments of this basic
volatility of the market. We therefore further develop the model. A potential drawback of such approaches is that
CKFbymodelingthedriftparameterofeachlatentlocation they do not use temporal information. A goal in Section III
with a geometric Brownian motion. This approach will re- will be to show how this model can be easily extended to
quireapproximationsfortractableinferencethatnevertheless model the temporal evolution of u and w using Brownian
i j
provide good and meaningful results as shown on a large motion. Since the resulting model has dynamics similar to
stock dataset. the Kalman filter, we briefly review this method next.
Weorganizethepaperasfollows:InSectionIIwereview
the basic matrix factorization approach to collaborative B. The Kalman filter
filtering and the Kalman filter for time evolving linear As mentioned, a drawback of many collaborative filter-
modeling.InSectionIIIwepresentthecollaborativeKalman ing techniques such as those that involve updates similar
filtering model. We derive a variational inference algorithm to Equation (1) is that no temporal dynamics are being
for this model in Section IV and show experimental results modeled. When viewing the locations of a user u or
i
in Section V. object w as being in a state space, the Kalman filter
j
naturally arises as the first place to look for developing a
II. BACKGROUND
dynamic representation for matrix factorization models. In
Before presenting the collaborative Kalman filter (CKF),
this section we provide a brief review of the Kalman filter
we review the basic matrix factorization approach to col-
[17], focusing on the equations and using parameterizations
laborative filtering and the Kalman filter for dynamic state
that are relevant to our CKF model.
space modeling. These two approaches form the basis for
The Kalman filter models a sequence of observed vectors
our model, which we present in Section III.
y ∈ Rp as linear functions of a sequence of latent state
n
A. Collaborative filtering with matrix factorization vectors w ∈ Rd with additive noise. These state vectors
n
evolve according to a first-order Markov process, where the
Generally speaking, the matrix factorization approach to
collaborativefilteringmodelsanincompleteM×N matrixZ current state equals the previous state plus additive noise.
asafunctionoftheproductoftwomatrices,aK×M matrix Assuming a Gaussian prior distribution on w0, then for n=
U and a K×N matrix W [7]. The columns of U (denoted 1,...,N and zero-mean noise, this is written as follows,
u ) represent the locations of each user in a latent space
i w |w ∼ N(w ,αI), (2)
and the columns of W (denoted w ) represent the locations n+1 n n
j y |w ∼ N(Aw ,σ2I). (3)
of each object in the same latent space. In the probabilistic n+1 n+1 n+1
approachtocollaborativefilteringusingmatrixfactorization,
Inference for the state evolution of w proceeds according
the observed values in Z have a distribution that depends
to the forward algorithm, which includes two steps: (i)
on the dot products of U and W, z ∼ p(uTw ). For
ij i j marginalizingthepreviousstatetoobtainapriordistribution
example, when Z is binary, p(·) could be the logistic or
on the current state, and (ii) calculating the posterior of the
probit link functions, when m-ary it could be the ordered
current state given an observation.1
probit model, and when real-valued it could be a univariate
Specifically, let the distribution of the current state be
Gaussian. After learning U and W based on the observed
w ∼ N(µ ,Σ ). The distribution of the next state w
data,predictionsofz fortheunseenpair(i,j)canbemade n n n n+1
ij is calculated by marginalizing the current state,
using the distribution function p(uTw ).
i j
(cid:90)
To see how matrix factorization can be interpreted as a
p(w ) = p(w |w )p(w )dw
collaborative filtering technique, consider the case where n+1 n+1 n n n
Rd
zij ∼N(uTi wj,σ2).Usingamaximumlikelihoodinference = N(wn+1|µn,αI+Σn). (4)
approach for U and W, the maximum likelihood update of
u is a least squares vector using the relevant w and z . If The drift parameter α controls the volatility of the state
i j ij
we let the set Ω contain the index values of objects that vectors and is a measure of drift in one unit of time. (In
ui
user i has rated, this update is this paper, we present a method for learning a dynamically
evolving value for α.)
(cid:16) (cid:17)−1(cid:16) (cid:17)
u = (cid:80) w wT (cid:80) z w . (1) After observing y , the posterior of w is
i j∈Ωui j j j∈Ωui ij j n+1 n+1
Therefore, the design matrix for ui uses the locations of all p(wn+1|yn+1) ∝ p(yn+1|wn+1)p(wn+1)
objects w rated by user i. The update for each w parallels
j j = N(w |µ ,Σ ), (5)
n+1 n+1 n+1
this update for u using the relevant user locations, and so
i
thelocationsofallotherusersindirectlyinfluencetheupdate
1Becauseweareonlyinterestedinreal-timepredictioninthispaper,we
for user i in this way. donotdiscussthebackwardalgorithmusedforKalmansmoothing.
where, after defining B =αI+Σ , time.Atanygiventimet,theoutputfordyad(i,j)usesthe
n n
dot product (cid:104)u[t],w[t](cid:105) as a parameter to a distribution
Σ = (ATA/σ2+B−1)−1, (6) i j
n+1 n as discussed in Section II-A.
µ = Σ (ATy/σ2+B−1µ ). (7)
n+1 n+1 n n In the following discussion of the model we focus on
FollowingEquation(4),wecanusethisposteriordistribu- one event (e.g., rating, trial, etc.), which can then be easily
tion on w to find the prior for the next state w . This generalizedtoaccountforeveryeventinawaysimilartothe
n+1 n+2
alsoindicateshowµ andΣ wereobtainedforstepn,and Kalman filter. We also present the model in the context of
n n
so only an initial Gaussian distribution on w is required to m-arypredictionusingorderedprobitregression[5].Forthis
0
allow for analytical calculations to be made for all n. problem, the real line R is partitioned into m regions with
The extension to continuous time requires a simple mod- each region denoting a class, such as star rating. Let Ik =
ification: For continuous-time Kalman filters, the variance (lk,rk]bethepartitionforclasskwherelk <rk,rk =lk+1,
of the drift from wn to wn+1 depends on the time between lk = rk−1 and k = 1,...,m. The output for pair (i,j) at
these two events. Calling this time difference ∆tn, we can timet,denotedzij[t]∈{1,...,m},isobtainedbydrawing
make the continuous-time extension by replacing αI with from a Gaussian distribution with the mean (cid:104)ui[t],wj[t](cid:105)
∆t αI. Therefore w is a Brownian motion [3]. and some preset variance σ2 and finding which partition it
n
falls within. Therefore, the model assumes an order relation
C. Related work
between the m classes, for example that a 4 star rating is
A similar model that addressed spatio-temporal infor- closer to a 5 star rating than a 1 star rating. (In the binary
mation was presented in [8]. A fixed-drift version of the special case, this order relation no longer matters.) We give
model presented here was originally presented at a machine a more detailed description of the model below.
learning workshop [10] and is most closely related to [15], Likelihood model: Let u[t] ∈ Rd represent the la-
i
[14]. Because the models share the same name, we discuss tent state vector for object i at time t and similarly for
some important differences before presenting our version, w[t] ∈ Rd. Using the partition I and the probit function,
j
and also to highlight our contributions: the probability that z [t]=k is
ij
• Our approach does not require Kalman smoothing.
(cid:90)
Therefore,wedonotstorethesequenceofstatevectors, P(z [t]=k|u ,w )= N(cid:0)y|(cid:104)u[t],w[t](cid:105),σ2(cid:1)dy.
which allows for our model to scale to massive data ij i j i j
sets.Themodelin[15],[14]isonlydesignedtohandle Ik (8)
small-scale data problems. For inference we introduce the latent variable yij[t] and
• Wederiveavariationalinferencealgorithmforlearning expand Equation (8) hierarchically,
time-evolving distributions on the latent state space
vectors akin to the Kalman filter. We also use a z [t]|y [t] = (cid:80) kI(y [t]∈I ), (9)
ij ij k ij k
geometric Brownian motion for learning a dynamic y [t]|u ,w ∼ N((cid:104)u[t],w[t](cid:105),σ2). (10)
ij i j i j
drift parameter. Our experiments will show that a
fully Bayesian approach on the state space improves
We observe that, unlike most collaborative filtering models,
predictive ability and learning a dynamic drift gives
there is no assumption that the pair (i,j) only interacts
significant information about the underlying structure
once, which has practical advantages. For example, teams
of the data set.
may play each other multiple times, and even in a user
• We derive a probit model for binary and ordinal obser- rating setting the user may change his or her rating at
vationssuchasmovieratingsinsteadofdirectlymodel-
different points in time, which contains valuable dynamic
ingalldatatypeswithGaussians.Wewillshowthatthis
information. In related problems such as stock modeling,
approach gives an improvement in performance when updated values arrive at regular intervals of time.2
compared with modeling, e.g., discrete star ratings as
Prior model: At time t we are interested in posterior
real-valued Gaussian random variables.
inference for state vectors u[t] and w[t]. This requires
i j
III. COLLABORATIVEKALMANFILTERING a prior model, which as we’ve discussed we take to be a
latent multidimensional Brownian motion. Let the duration
The motivation behind the collaborative Kalman filter
of time since the last event be ∆[t] for u and ∆[t] for
(CKF) is to extend the matrix factorization model described ui i wj
w . Also, as with the Kalman filter, assume that we have
in Section II-A to the dynamic setting. The continuous-time j
multivariate normal posterior distributions for u[t−∆[t]]
Kalman filtering framework discussed in Section II-B is a and w[t − ∆[t]], with t − ∆[t] being the tiime ofutihe
natural means for doing this. The CKF models each latent j wj ui
last observation for u and similarly for w . We write these
location of a user or object as moving in space according i j
to a Brownian motion. In our case, these correspond to the
sets of vectors {ui} and {wj}, which are now functions of 2Aswewillsee,inthiscasey isobservedandthelikelihoodis(10).
distributions as Again there is a drift parameter c that plays an important
roleandrequiresagoodsetting,butweobservethatdefining
u[t−∆[t]] ∼ N(µ(cid:48) [t−∆[t]],Σ(cid:48) [t−∆[t]]),
i ui ui ui ui ui αtobeanexponentiatedBrownianmotionhasanimportant
wj[t−∆[wt]j] ∼ N(µ(cid:48)wj[t−∆[wt]j],Σ(cid:48)wj[t−∆[wt]j]). (11) modeling purpose by allowing for volatility in time, and is
The posterior parameters µ(cid:48) and Σ(cid:48) are also dynamically not done simply to avoid parameter setting.
When α is a geometric Brownian motion, the transition
evolving and indexed by time. As is evident, we are assum-
from posterior to prior for u and w needs to be modified.
ing that we have independent posteriors of these variables. i j
In Equation (12) the constant value of α allowed for Σ[t]
Though this is not analytically true, our mean-field varia-
to be calculated by adding a time-scaled αI to the previous
tionalinferencealgorithmwillmakethisapproximationand
posterior. In this case, the update is modified for, e.g., u by
so we can proceed under this assumption.
Bythecontinuous-timeextensionofEquation(4),wecan performing the integration implied in Equation (12),
mtoaorgbitnaainliztheeupi[rito−r d∆is[uttri]i]buatinodnswoj[ntu−i a∆n[wdt]jw]jinatthtiemientte,rval Σu[t]=Σ(cid:48)u[t−∆[ut]]+I(cid:82)tt−∆[ut]ea[s]ds. (14)
Wewillderiveasimpleapproximationofthisthisstochastic
u[t] ∼ N(µ [t],Σ [t]),
i ui ui integral for inference.
w[t] ∼ N(µ [t],Σ [t]), (12)
j wj wj
IV. VARIATIONALINFERENCEFORTHECKF
where
Having defined the model prior, we next turn to posterior
µ[t]=µ(cid:48)[t−∆[t]], Σ[t]=Σ(cid:48)[t−∆[t]]+∆[t]αI,
inference. Since the CKF model is dynamic, we treat poste-
for the respective u and w . We note that we use an rior inference in the same way by learning a time-evolving
i j
posterior distribution on the underlying vectors {u } and
apostrophe to distinguish priors from posteriors. i
We see that we can interpret the CKF as a Kalman filter {wj}(andthehiddendatayij[t]whentheobservationsare
with multiple interacting state space vectors rather than a from a latent probit model). We also learn the Brownian
knownandfixeddesignmatrixAasinEquation(3).Wealso motioncontrollingthedriftofeachlatentstatevector,a[t],
recall from the Kalman filter that αI is the drift covariance using a point estimate. We therefore break the presentation
in one unit of time, and so the transition from posterior to of our inference algorithm into two parts: One dealing with
prior involves the addition of a small value to the diagonal u,w andy,andonedealingwitha.Wepresentanoverview
of the posterior covariance matrix. This is what allows for of the inference algorithm in Algorithm 1.
dynamic modeling; when α = 0, this is simply an online
A. An approximation for the hyperprior model
Bayesian algorithm and the posteriors will converge to a
Beforediscussingourinferencealgorithm,wefirstpresent
delta function at the mean.
Hyperprior model: The drift parameter α controls how an approximation to the stochastic integral in Equation
muchthestatevectorscanmoveinoneunitoftime.Whenα (14) that will lead to efficient updates of the Brownian
motion a[t]. To this end, we first introduce the following
becomesbigger,thestatevectorscanmovegreaterdistances
to better fit the variable y, also making these vectors more approximation,
forgetful of previous information. When α → 0 the model (cid:82)t ea[s]ds ≈ ea[t]∆[t]. (15)
does not forget any previous information and the state t−∆[t]
vectorssimplyconvergetoapointsincetheposteriordistri- We recall from the description of the model that a[t] can
bution becomes more concentrated at the mean. Therefore, be shared or vector-specific. That is, e.g., a[t] could be
α is clearly an important parameter that can impact model uniqueforeachu ,orsharedamong{u }(inwhichcasean
i i
performance. appropriate subscript notation for a could be added). From
WedeveloptheCKFmodelbyallowingαtodynamically the perspective of a generic multidimensional Brownian
changeintimeaswell.Thisprovidesameasureofvolatility motion, the generative structure for this approximation is
thatwillbeespeciallymeaningfulinmodelsforstockprices,
u[t] ∼ N(µ(cid:48)[t−∆ ],Σ(cid:48)[t−∆ ]+ea[t]∆ I),
as we will show. For notational convenience, we define the u u u u u
hyperprior for a shared α, but observe that extensions to a a[t] ∼ N(a[t−∆a],c∆a). (16)
u – and w –specific α is straightforward (which we discuss
i j That is, with this approximation we first draw the log drift
in more detail in Section IV).
value at time t, a[t], according to its underlying Brownian
WemodelαasageometricBrownianmotionbydefining
motion. We then approximate the drift parameter as being
α[t] = ea[t] and letting a[t] be a Brownian motion.
constant in the unobserved interval. Clearly, as the intervals
Therefore, as with the state vectors, if t−∆[t] is the last
a between observations become smaller, this approximation
observed time for a[t], the distribution of a[t] is
becomes better. For inference we will learn a point estimate
a[t]∼N(a[t−∆[t]],c∆[t]). (13) of the Brownian motion a[t].
a a
Algorithm 1 CKF parameter updates at time t C. Variational q distribution for the prior model
1: To update the parameters of dyad (i,j) at time t: AsindicatedinEquation(19),constructingthevariational
2: for iteration=1,...,T do objective function for the event at time t involves taking
3: Update q(yij[t]) as in Equation (24). the expectation of the log joint likelihood and adding the
4: Update q(ui[t]) as in (26). entropies of each q. This requires defining the distribution
5: Update q(wj[t]) by modifying (26) as noted in text. family for q(yij), q(ui) and q(wj). The optimal form for
6: Update aui[t] as in (27). each q distribution can be found by exponentiating the
78:: enUdpfodrate awi[t] by modifying (27) as noted in text. eoxfpiencteteredstlo[g16j]o:inqtil∝ikeelxihpo{oEdq−hio[llndipn(g·)o]}u.t the q distribution
9: Notes: The following changes can be made. Temporarily ignoring the time notation, at time t we can
• When y is observed, skip Step 3 and directly use y in find q(yij[t]) by calculating
Steps 4 and 5. q(y )∝exp{lnp(z |y )+E [lnp(y |u ,w )]}. (20)
ij ij ij q ij i j
•Whenmodelingashareda amongall{u},theupdate
u
is changed as noted in text, and similarly for w. Since p(z [t]|y [t])=I(y [t]∈I ), it follows that
ij ij ij zij[t]
q(y [t]) is defined on the interval I , which is the cell
ij zij[t]
corresponding to class z [t]. After taking the expectation,
ij
the optimal q distribution is found to be
B. A variational approximation
Focusing on the latent state-space variables first, the q(y [t])=TN (cid:0)y [t]|(cid:104)E u[t],E w[t](cid:105),σ2(cid:1).
ij Izij[t] ij q i q j
posterior of an event at time t is (21)
ThisisatruncatednormaldistributiononthesupportI
p(ui[t],wj[t],yij[t]|zij[t],aui,awj)∝ (17) withmeanparameter(cid:104)Equi[t],Eqwj[t](cid:105)(definedlaterz)iaj[ntd]
p(z [t]|y )p(y [t]|u ,w )p(u[t]|a )p(w[t]|a ) variance σ2.
ij ij ij i j i ui j wj We can follow the same procedure to find q(u[t]) for
i
Unlike the Kalman filter the full posterior of the CKF time t. Again ignoring the time component we have
is not analytically tractable. Typically when this occurs
q(u )∝exp{E [lnp(y |u ,w )]+lnp(u )}. (22)
with nonlinear Kalman filters sampling methods such as i q ij i j i
sequential Monte Carlo are employed as an approximation
The prior on u[t] is discussed below. Solving for this
i
[4]. This is not computationally feasible for our problem
distribution shows that q(u[t]) is a multivariate Gaussian,
i
since we are learning the evolution of a large collection of
latent state vectors. Instead we use the deterministic mean- q(u[t])=N(cid:0)u[t]|µ(cid:48) [t],Σ(cid:48) [t](cid:1), (23)
i i ui ui
field variational method for approximate posterior inference
withmeanandcovarianceparametergivenbelow.Symmetry
[16]. As we will show, this not only allows for tractable
results in the same q distribution family for w[t].
posterior calculations for the state vectors u, w and hidden j
datay,butalsoallowsforatransitionfrompriortoposterior D. A coordinate ascent algorithm for q(u), q(w) and q(y)
similar to the Kalman filter that retains tractability for all
We next present a coordinate ascent algorithm for updat-
downstream calculations.
ing the three q distributions that appear at time t.
We approximate the posterior of Equation (17) with the
following factorized distribution, Coordinate update of q(y): The analytical update for the
mean parameter m [t] of q(y [t]) is
ij ij
q(u[t],w[t],y [t])=q(u[t])q(w[t])q(y [t]).
i j ij i j ij
(18) m [t]=(cid:104)E u[t],E w[t](cid:105). (24)
ij q i q j
We derive the specific distributions in Section IV-C. This
The values of E u[t] and E w[t] are given below. The
q distribution approximates the posterior of each variable q i q j
expectationofy [t]dependsontheintervalinwhichitfalls
as being independent of all other variables, which has a ij
according to the truncated normal. Let l and r be
significantadvantageinthisdynamicsetting.Havingdefined zij[t] zij[t]
the left and right boundaries of this interval. Then defining
q, we seek to minimizes the KL divergence between q and
the true posterior at time t by equivalently maximizing the l −m [t] r −m [t]
α [t]= zij[t] ij , β [t]= zij[t] ij ,
variational objective function [16], ij σ ij σ
L =E [lnp(z [t],u[t],w[t],y [t])]−E [lnq]. (19) we have that
t q ij i j ij q
φ(α [t])−φ(β [t])
Since this is a non-convex function, a local maximum of Lt Eqyij[t]=mij[t]+σΦ(βij[t])−Φ(αij[t]), (25)
isfoundbyoptimizingtheparametersofeachqasdiscussed ij ij
in Section IV-D. whereφ(·)isthepdfandΦ(·)thecdfofastandardnormal.
Coordinate update of q(u): The updates for mean µ(cid:48) [t] F. Simplifications to the model and predictions of new data
ui
and covariance Σ(cid:48)ui[t] of q(ui[t]) are Whenyij[t]isobservedonlyaslightmodificationneeds
(cid:16) (cid:17)−1 to be made to the algorithm above. In this case q(y) is no
Σ(cid:48) = Σ−1+(µ(cid:48) µ(cid:48)T +Σ(cid:48) )/σ2 ,
ui ui wj wj wj longer necessary and in the update for each q(u) and q(w)
µ(cid:48) = Σ(cid:48) (cid:16)E [y ]µ(cid:48) /σ2+Σ−1µ (cid:17). (26) the term Eqyij[t] can be replaced with the observed value.
ui ui q ij wj ui ui Predictions using the CKF require an approximation of
All parameters above are functions evaluated at t. We recall an intractable integral. One approximation is to use Monte
tchoavtartihaencepriΣouri[mt]ean≈µΣu(cid:48)ui[i[tt]−=∆[utiµ]](cid:48)ui+[te−aui[∆t[]ut∆i][]uti]aIn,dwphriicohr aCraerliontemreesthteoddsi.nStaukpipnrgesissinυgijth:=e tEimq[eΦ(t(cid:104),uthi,ewejx(cid:105)p/eσc)t]atWioencwane
follows from the approximation discussed above. approximate υ using S samples from these distributions,
ij
Coordinate update of q(w): Because of symmetry, the S
updates for µ(cid:48)wj[t] and Σ(cid:48)wj[t] of q(wj[t]) are similar to υˆi(jS) = S1 (cid:88)Φ((cid:104)ui(s),wj(s)(cid:105)/σ), (28)
those for ui, but with the indices ui and wj switched on all s=1
means and covariances. We therefore omit these updates.
whereu(s) i∼idq(u[t])andw(s) i∼idq(w[t]).Thisgivesan
E. Inferring the geometric Brownian motion exp{a[t]} i i j j
unbiasedestimateoftheexpectation.Afasterapproximation
Wenextderiveanalgorithmforinferringthedriftprocess is to directly use the means from q for u[t] and w[t].
i j
a[t] used in the geometric Brownian motion. We derive a
pointestimateofthisvalueassumingindividualaforeachu V. EXPERIMENTS
i
andw .Toupdate,e.g.,a [t]weapproximatetherelevant We evaluate the performance of the collaborative Kalman
j ui
terms in L using a second order Taylor expansion about the filter on the following three data sets:
pointaui[t−∆[atu]i],whichisthelastinferredvalueforthis • The Netflix data set containing 100 million movie
process. We recall that the form of this approximation for a ratings from 1999 to 2006. The movies are rated from
general function f(x) is 1 to 5 and ratings are distributed across roughly 17K
f(x)≈f(x )+(x−x )f(cid:48)(x )+ 1(x−x )2f(cid:48)(cid:48)(x ), movies and 480K users.
0 0 0 2 0 0 • The MovieLens data set containing 10 million movie
with the solution at the point x=x0−f(cid:48)(cid:48)(x0)−1f(cid:48)(x0). In ratings from 1995 to 2009. The ratings are given on a
our case, f =Eq[lnp(ui[t],aui[t])]. half star scale from 0.5 to 5 and are distributed across
We can efficiently update a [t] using this Taylor ex-
ui 10K movies and 71K users.
pansion as follows: Let Σ(cid:48) [t − ∆[t]] = QΛQT be the
ui ui • Stock returns data measured at opening and closing
eigendecomposition of the posterior covariance of u at the
i times for 433 companies from the AMEX exchange,
previous observation. (This needs to be done only once for
2,774 companies from NASDAQ and 3,273 companies
the updates at this time point.) Also, let
from the NYSE for a total of 6,480 stocks and 39.1
v =(µ(cid:48) [t]−µ [t])TQ, M =QTΣ(cid:48) [t]Q. million total measurements from 1962–2014.3
ui ui ui
The model requires setting some important parameters: The
Since this depends on the updated posterior mean and
dimension d of u and w , the standard deviation of the
covariance of u , v and M should be recomputed after each i j
i probit model σ, the partition widths r −l , and the drift
updateofq(u ).Letλ bethedtheigenvalueofΣ(cid:48) [t−∆[t]] k k
i d ui ui parameter c for the geometric Brownian motion a[t]. We
and define
η = eaui[t]∆aui . discuss these below for the respective problems.
d λd+eaui[t]∆aui A. Movie rating prediction
ThenusingasecondorderTaylorexpansionoftheobjective We first present results on dynamic modeling of the
function at the previous point aui[t − ∆aui] gives the Netflix and MovieLens data sets. For these problems, we
following first and second derivatives with respect to aui, tried several different dimension settings (d = 10,20,30).
f(cid:48) = −a[t]−ca∆[ta−∆a] − 12(cid:80)dηd(1− λvdd2++eMa[td]∆,da), sWheareledarbnyedmaovcoiensvebrgyinsgetatiungshcare=d a0mosningceusewres adnod naowt
f(cid:48)(cid:48) = − 1 − 1(cid:80) η (1−η )
c∆a 2 d d d assume there to be fundamental shifts in the overall user or
+1(cid:80) η (1−2η ) vd2+Md,d . (27) movie landscape. We set σ to be the value that minimizes
2 d d d λd+ea[t]∆a the KL divergence between the probit and logistic link
We have removed the subscript dependency on ui above. functions, which we found to be approximately σ = 1.76.
We then use the closed form equation from the Taylor We randomly initialized the mean of the priors on each u
i
expansion to update aui[t]. The update for awj[t] has the and wj at time zero, and set the covariance equal to the
same form, but with the relevant variational parameters and
eigendecomposition replacing those above. 3Datadownloadedfromhttp://ichart.finance.yahoo.com/
TableI
User 15701 User 101 RMSERESULTSFORTHENETFLIX100MILLIONANDMOVIELENS10
MILLIONDATASETS.COMPARISONSSHOWANADVANTAGETO
MODELINGTHEDYNAMICINFORMATIONWITHINTHEDATA.
Number of ratings Number of ratings MCKodFel dS=ize10 N0.e8t5fl4i0x Mo0v.7i7eL26ens
d=20 0.8534 0.7654
d=30 0.8540 0.7635
Time (days) Time (days)
Online VB-EM d=10 0.8682 0.7855
Figure1. CumulativetotalofmoviesratedbytwousersfromtheNetflix d=20 0.8707 0.7805
data set. (left) This user rates large batches of movies in a few sittings.
d=30 0.8668 0.7786
Thoughthedynamicswon’tbecapturedbythemodel,westillcanperform
sequential inference for this user to make predictions. (right) A more Batch VB-EM d=10 0.8825 0.7996
incrementalratingpatternthathasdynamicvalue.
d=20 0.8688 0.7896
d=30 0.8638 0.7865
Batch MAP-EM d=10 0.9277 0.9133
6Number of ratings (10)x 5Number of ratings (10)x MBP3MFF dddd====33320000 0000....9999001114485732 0000....8899441147317233
Months Months
(a) Netflix (b) MovieLens to have a dynamic benefit, but we do for the second user.
However, as an online algorithm we note that the CKF still
Figure2. Histogramsofthetotalnumberofratingswithinamonthover
can make useful predictions in both cases; in the limiting
thecourseofeachdataset.
case that all users rate all movies in a given instant, the
CKF will simply reduce to the online zero-drift model. In
identity matrix. For the partition width, we set r −l =σ Figure 2 we show the monthly ratings histogram for both
k k
forNetflixandr −l =σ/2forMovieLens,whichaccounts data sets, which gives a sense of the dynamic information
k k
for the half vs. whole star rating system. We compare with from the movie perspective.
several algorithms: We use the RMSE as a performance measure, which we
1) OnlineVB-EM:Thenon-driftspecialcaseoftheCKF showfor allalgorithmsin TableI.For thebatchalgorithms,
with exp{a[t]}=0. werandomlyheldout5%ofthedatafortestingtocalculate
2) BatchVB-EM:Thebatchvariationalinferenceversion thisvalue.Weranmultipletrialsandfoundthatthestandard
of (1) that uses all ratings when updating u and w . deviation for these algorithms was small enough to omit.
i j
3) Batch MAP-EM: A version of probabilistic matrix (Wediscusstraining/testingfortheonlinealgorithmsbelow.)
factorization (PMF) [11] that uses the probit model We observe that our algorithm outperforms the other base-
as above. Point estimates of u and w are learned linealgorithms,andsodynamicmodelingofuserpreference
i j
and the hidden data y is inferred using EM. doesindeedgivenanimprovementinratingprediction.This
ij
4) BPMF: Bayesian PMF [12] without a probit model. is especially evident when comparing the CKF with Online
5) M3F: Mixed-membership matrix factorization [9]. VB, the only difference between these algorithms being the
The zero-drift version of this model is an online sequential introduction of a drift in the state space vectors.
method for Bayesian inference that is similar to other “big Wealsoobservetheimprovementofthevariationalframe-
data” extensions of the variational inference approach [1], work in general. For example, by comparing Batch VB
[6]. The difference between this and the batch model is that with PMF EM, we see that variational inference provides
we only process each rating once with the online algorithm, an improvement over a MAP EM implementation, which
while with batch inference we iterate over users and movies models a point estimate of u and w . Both methods use
i j
several times processing all data in a single iteration, as is a latent variable y in a probit function for the observed
ij
more typically done. rating,butthevariationalapproachmodelstheuncertaintyin
The ability of our model to exploit dynamic information thesestatespacevectors,andsoweseethatafullyBayesian
dependsonhowdynamicthedatais.Forexample,inFigure approach is helpful. We also found in our experiments that
1 we show the cumulative number of ratings for two users treating the rating z [t] as being generated from a probit
ij
as a function of time. With the first user, we don’t expect model and learning a latent y is important as well, which
ij
r
a
0.90 0.84 5 st 3
0.83
0.89
0.82 ar 2 BTahtemManatBrixegins
0.88 0.81 st FindingNemo
4
0.87 0.80 1 PGFuirgolphutFnCdiclhutiobognDay
0.79
r
a
0.86 0.78 st 0
3 AceVentura
(a) Netflix (b) MovieLens
−1
Figure 3. The RMSE as a function of number of ratings for a user and ar
movie. The (m,n) entry contains the RMSE calculated over user/movie st Anchorman
pairs where the user has rated at least 10(m−1) movies and the movie 2 −2 LizzieMcGuire
hasbeenratedatleast10(n−1)times.Thevalueshowninthelower-right
corneris200+ratingseach,whichweuseinTableI.
−3
58 60 62 64 66 68 70 72 74 76 78
Month
we observed by comparing PMF EM with the original PMF
Figure4. Anexampleofauserdriftovertimeasseenthroughpredicted
algorithm [11].4 ratingsforseveralmoviesfromtheNetflixdataset.They-axisisthelatent
Calculating the RMSE requires a different approach be- variablespace,whichwepartitionaccordingtostarratingasindicated.
tween the online and static models. To calculate the RMSE
for the two dynamic models—CKF and the online model—
we do not use the test set for the batch models, but instead observed no clear evolution in movie preference for a user.
makepredictionsofeveryratinginthedatasetbeforeusing We consider to be intuitively reasonable since we argue
the observed rating to update the model. This gives a more that few people fundamentally change their taste. However,
realistic setting for measuring performance of the online being able to find those individual users or movies that
models, especially for the CKF where we’re interested in do undergo a significant change can have an impact on
predicting the user’s rating at that time. Therefore, we must learningthelatentvectorsforallusersandmoviessincethe
choose which predictions to calculate the RMSE over, since model is collaborative, and so we argue that modeling time
clearly it will be bad for the first several ratings for a evolution can be valuable even when the actual percentage
particularuserormovie.Bylookingattheusersandmovies of dynamically changing behaviors is small.
that appear in the testing sets of the batch models we found
B. Stock Data
that for Netflix each user in the test set had an average of
613 ratings in the training set and each movie in the test Wenextpresentaqualitativeevaluationofourmodelona
set had 53,395 ratings in the training set. For MovieLens stockreturnsdataset.Forthisproblem,thevalueofy [t]is
ij
this was 447 per user and 7,157 per movie, meaning that in observed since it is treated as the continuous-valued stock
both cases a substantial amount of data was used to learn price, so we can model it directly as discussed in section
locations in training before making predictions in testing. IV-F. We set the latent dimension d = 5, but observed
Therefore, to calculate our RMSE of the online models we similar results for higher values. We learned stock-specific
disregardthepredictionifeithertheuserormoviehasunder drift Brownian motions a [t] and set c = 5 × 10−2,
ui u
200 previous ratings. This arguably still puts the RMSE for which we found to be a good setting through parameter
our model at a disadvantage, but we noticed that the value tuning since the learned a [t] were not too smooth, but
ui
didn’t change much with values larger than 200. We show still stable. We also observe that, for this problem, there is
the RMSE as a function of this number in Figure 3. only one state vector corresponding to w, which we refer
In Figure 4 we show the dynamics of an individual user to as a “state-of-the-world” (SOW) vector. Therefore, this
by plotting the predicted rating for a set of movies as a problem falls more within the traditional Kalman filtering
function of time. We can see an evolving preference in setting discussed in Section II-B. For the SOW vector, we
this plot–for example a strong interest in Batman Begins set c =0 to learn a converging value of a [t], which we
w w
that then slightly decreases with time, while interest in note was a [t]→−11.7. For the noise standard deviation
w
The Matrix begins to increase toward the end. Also, while we set σ =0.01, which enforce that the state space vectors
there’s some interest in Anchorman at the beginning, the track y [t] closely. (We again note that j = 1 for this
ij
user then quickly becomes uninterested. In most cases, we problem.) We plot the number of active stocks by year in
Figure 5 where we see that as time goes by, the number of
4We omit these PMF results in the table, which we note gave RMSE
stocks actively being traded increases significantly.
resultsoveroneascanbeseenintheoriginalpaper.Wealsonotethatthe
PMFalgorithmisthenon-dynamicversionof[15]. We first assess the tracking ability of our model. The
6000
140 Chevron
ks5000 120 BP
c
sto4000 100
of 80
er 3000
b 60
m2000
nu 40
1000
1980 1985 1990 1995 2000 2005 2010
0 (a) Stockprice
1965 1970 1975 1980 1985 1990 1995 2000 2005 2010
year
Figure5. Thenumberofactivelytradedstocksasafunctionoftime. −2
−3
Walmart Pfizer
−4
−5
−6
1982 1987 1992 1997 2002 2007 2012
(b) Brownianmotionvolatilityparameteraui[t].
Figure 7. (a) The historical stock price for two oil companies, BP
and Chevron. (b) The log drift Brownian motion (aui[t]) indicating the
volatilityofeachstock.Weseethatthoughthestockpricesaredifferent,
the volatility of both oil companies share the same shape since they are
Coca-Cola Microsoft closelylinkedinthemarket.
linked in the market. For example in the early 80’s, late
90’s and late 2000’s, oil prices were particularly volatile.
This is modeled by the increase of the geometric Brownian
motion exp{a [t]}, which captures the fact that the state
Figure 6. A histogram of the tracking errors of the CKF for four ui
vectors are moving around significantly in the latent space
stocksusingthemeanofeachq distribution(inlog2 scale).Thetracking
performance is very accurate using a 5 dimensional latent space (order during this time to rapidly adjust to stock prices.
10−3).Theseresultaretypicalofallhistograms.
We also consider the stock volatility across different
market sectors. In Figure 8 we show the historical stock
ability to accurately track the stock prices indicates that prices for five companies along the top row and their
the latent structure being learned is capturing meaningful respective a [t] below them on the second row. Three of
ui
information about the data set. We show these results as the companies are in the steel market, while the other two
errorhistogramsonlog2scaleinFigure6forfourcompanies are from different markets (pharmaceutical and beverage).
(tracking was very accurate and could not be distinguished We again see that the learned Brownian motion captures a
visually from the true signal) and mention that these results similarvolatilityforthesteelcompanies.Ineachcase,there
arerepresentativeofalltrackingperformances.Weseefrom is significant volatility associated with the 2008 financial
these plots that the prediction errors of the stock prices are crisisduetothedecreaseinnewconstruction.Thisvolatility
small, on the order of 10−3, and so we can conclude that isnotfoundwiththepharmaceuticalorbeveragecompanies.
our five dimensional state space representation for ui and w However,wedoseeamajorspikeinvolatilityforCoca-Cola
is sufficient to capture all degrees of freedom in time across around1985,whichwasaresultoftheirunsuccessful“New
the 6,480 stocks. Coke” experiment. These observations are confirmed by the
We next look more closely at the Brownian motions respective stock prices. However, we note that the level of
a [t] learned by the model to capture the volatility of the volatility is not only associated with large changes in stock
ui
stock prices. In Figure 7(a) we show the historical stock value. In the Pfizer example, we see that the volatility is
price for two oil companies, BP and Chevron. Below this consistently high for the first half of its stock life, and then
in Figure 7(b) we show the respective functions a learned decreases significantly for the second half, which is due to
ui
for these stocks. We see that the volatility of both of these thesignificantlydifferentlevelsofvolatilitybeforeandafter
oil companies share similar shapes since they are closely the year 2000.
Posco (steel) Schnitzer (steel) US Steel Pfizer Coca-Cola
128000 110100 128000 114600 140
160 90 160 120
140 80 140 120
120 70 120 100 100
100 60 100 80
80 50 80 60 80
246000 12340000 246000 2400 4600
1999 2004 2009 1998 2003 2008 2013 01991 1996 2001 2006 2011 0 1982 1987 1992 1997 2002 2007 2012 19621967197219771982198719921997200220072012
Posco(steel) Schnitzer(steel) USSteel Pfizer Coca-Cola
−1 −1
−1 −1
−2 −2 −2 −2
−2 −3 −3 −3 −3
−3 −4 −4 −4 −4
−4 −5 −−56 −−65 −5
−5 −6
−6 −7 −7
−6 −7
1997 2002 2007 2012 1997 2002 2007 2012 1992 1997 2002 2007 2012 1982 1987 1992 1997 2002 2007 2012 19621967197219771982198719921997200220072012
Figure8. (Toprow)Thehistoricalstockpricesforthreesteelcompanies,onepharmaceuticalandonebeveragecompany.(Bottomrow)Thecorresponding
log drift Brownian motions (aui[t]) for the respective stocks from the top row. We see that the three steel companies shared high volatility during the
periodofthe2008financialcrises,butcompaniesinotherareassuchasthepharmaceuticalandbeverageindustrywerenotsimilarlyaffected.
VI. CONCLUSION [2] E. J. Cande`s and B. Recht. Exact matrix completion via
convex optimization. Foundations of Computational Mathe-
matics, 9(6):717–772, 2009.
We have presented a collaborative Kalman filter for dy-
namic collaborative filtering. The model uses the matrix [3] E. C¸inlar. Probability and Stochastics. Springer, 2011.
factorizationapproach,whichembedsusersandobjectsina
[4] A.Doucet,S.Godsill,andC.Andrieu. OnsequentialMonte
latent space. However, unlike most related models in which
CarlosamplingmethodsforBayesianfiltering. Statisticsand
these locations are fixed in time, we allowed them to drift
computing, 10(3):197–208, 2000.
according to a Brownian motion. This allows for dynamic
evolution of, e.g., user preference. We built on this basic [5] W. Greene. Econometric Analysis. Prentice Hall, 2011.
model by presenting a method for learning a time-evolving
[6] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley.
drift parameter for the state space vectors using a geometric Stochastic variational inference. The Journal of Machine
Brownian motion. Learning Research, 14(1):1303–1347, 2013.
Many other applications could potentially benefit from
[7] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization
thisframework.Forexample,wecouldpredictresultsofbat- techniques for recommender systems. Computer, 42(8):30–
ter/pitchereventsovertimeinbaseballorwinnersofsporting 37, 2009.
competitions. Such a dynamic model would clearly be an
[8] Z. Lu, D. Agarwal, and I. S. Dhillon. A spatio-temporal
advantage since player or team performance can experience
approach to collaborative filtering. In ACM Conference on
streaks and slumps. Another potential use is in automated
Recommender Systems, 2009.
tutoring systems, where a student is asked a sequence of
questions (e.g., math questions or language flashcards). In [9] L. Mackey, D. Weiss, and M. I. Jordan. Mixed membership
matrixfactorization. InInternationalConferenceonMachine
a collaborative filtering environment we would wish to
learning, 2010.
estimate the probability that a student answers a question
correctly. Clearly as the student learns these probabilities [10] J. Paisley, S. Gerrish, and D. Blei. Dynamic modeling with
will evolve. Having estimates for such a probability would thecollaborativeKalmanfilter. InNYAS5thAnnualMachine
Learning Symposium, 2010.
allow for intelligent tutoring, where questions are selected
that are not too easy or too hard and are able to help with [11] R. Salakhutdinov and A. Mnih. Probabilistic matrix factor-
the learning process. ization. InAdvancesinNeuralInformationProcessing,2007.
[12] R.SalakhutdinovandA.Mnih. Bayesianprobabilisticmatrix
REFERENCES factorization using Markov chain Monte Carlo. In Interna-
tional Conference on Machine learning, 2008.
[1] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and [13] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Item-
M. Jordan. Streaming variational bayes. In Advances in based collaborative filtering recommendation algorithms. In
Neural Information Processing Systems, 2013. International Conference on World Wide Web, 2001.