Table Of ContentEigenGP: Gaussian Process Models with Adaptive Eigenfunctions
HaoPeng YuanQi
DepartmentsofComputerScience DepartmentsofCSandStatistics
PurdueUniversity PurdueUniversity
WestLafayette,IN47906 WestLafayette,IN47906
[email protected] [email protected]
Abstract
5 mussen[2005].
1 Among all sparse GP regression methods, a state-of-the-
0
Gaussianprocesses(GPs)provideanonparametricrepresen- art approach is to represent a function as a sparse finite lin-
2
tation of functions. However, classical GP inference suffers ear combination of pairs of trigonometric basis functions, a
ul from high computational cost for big data. In this paper, we sine and a cosine for each spectral point; thus this approach
J proposeanewBayesianapproach,EigenGP,thatlearnsboth is called sparse spectrum Gaussian process (SSGP) [La´zaro-
3 basis dictionary elements—eigenfunctions of a GP prior— Gredillaetal.,2010]. SSGPintegratesoutbothweightsand
1 andpriorprecisionsinasparsefinitemodel. Itiswellknown phasesofthetrigonometricfunctionsandlearnsallhyperpa-
that,amongallorthogonalbasisfunctions,eigenfunctionscan rametersofthemodel(frequenciesandamplitudes)bymax-
]
G providethemostcompactrepresentation. Unlikeothersparse imizing the marginal likelihood. Using global trigonomet-
Bayesian finite models where the basis function has a fixed ric functions as basis functions, SSGP has the capability of
L
form,oureigenfunctionsliveinareproducingkernelHilbert approximating any stationary Gaussian process model and
.
s space as a finite linear combination of kernel functions. We been shown to outperform alternative sparse GP methods—
c
learnthedictionaryelements—eigenfunctions—andtheprior including fully independent training conditional (FITC) ap-
[
precisions over these elements as well as all the other hy- proximation [Snelson and Ghahramani, 2006]—on bench-
3
perparameters from data by maximizing the model marginal markdatasets. AnotherpopularsparseBayesianfinitelinear
v
likelihood. We explore computational linear algebra to sim- modelistherelevancevectormachine(RVM)[Tipping,2000;
2
6 plifythegradientcomputationsignificantly. Ourexperimen- FaulandTipping,2001]. Ituseskernelexpansionsovertrain-
3 tal results demonstrate improved predictive performance of ingsamplesasbasisfunctionsandselectsthebasisfunctions
0 EigenGPoveralternativesparseGPmethodsaswellasrele- by automatic relevance determination [MacKay, 1992; Faul
.
1 vancevectormachines. andTipping,2001].
0 Inthispaper,weproposeanewsparseBayesianapproach,
4 EigenGP, that learns both functional dictionary elements—
1 1 Introduction
eigenfunctions—and prior precisions in a finite linear model
:
v representation of a GP. It is well known that, among all or-
i
X Gaussianprocesses(GPs)arepowerfulnonparametricBayes- thogonal basis functions, eigenfunctions provide the most
ian models with numerous applications in machine learning compact representation. Unlike SSGP or RVMs where the
r
a andstatistics. GPinference, however, iscostly. Trainingthe basis function has a fixed form, our eigenfunctions live in a
exact GP regression model with N samples is expensive: it reproducing kernel Hilbert space (RKHS) as a finite linear
takes an O(N2) space cost and an O(N3) time cost. To ad- combination of kernel functions with their weights learned
dressthisissue,avarietyofapproximatesparseGPinference from data. We further marginalize out weights over eigen-
approacheshavebeendeveloped[WilliamsandSeeger,2001; functions and estimate all hyperparameters—including basis
Csato´ and Opper, 2002; Snelson and Ghahramani, 2006; points for eigenfunctions, lengthscales, and precision of the
La´zaro-Gredillaetal.,2010;WilliamsandBarber,1998;Tit- weightprior—bymaximizingthemodelmarginallikelihood
sias, 2009; Qi et al., 2010; Higdon, 2002; Cressie and Jo- (also known as evidence). To do so, we explore computa-
hannesson, 2008]—for example, using the Nystro¨m method tionallinearalgebraandgreatlysimplifythegradientcompu-
to approximate covariance matrices [Williams and Seeger, tation for optimization (thus our optimization method is to-
2001], optimizing a variational bound on the marginal like- tallydifferentfromRVMoptimizationmethods). Asaresult
lihood [Titsias, 2009] or grounding the GP on a small set of of this optimization, our eigenfunctions are data dependent
(blurred)basispoints[SnelsonandGhahramani,2006;Qiet and make EigenGP capable of accurately modeling nonsta-
al., 2010]. An elegant unifying view for various sparse GP tionary data. Furthermore, by adding an additional kernel
regression models is given by Quin˜onero-Candela and Ras- term in our model, we can turn the finite model into an in-
1
finite model to model the prediction uncertainty better—that andSchervish,2004]. ConstructinggeneralnonstationaryGP
is,itcangivenonzeropredictionvariancewhenatestsample modelsremainsachallengingtask.
isfarfromthetrainingsamples. For regression, we use a Gaussian likelihood function
EigenGP is computationally efficient. It takes O(NM) p(yi|f) = N(yi|f(xi),σ2), where σ2 is the variance of
spaceandO(NM2)timefortrainingonwithM basisfunc- the observation noise. Given the Gaussian process prior
tions,whichissameasSSGPandmoreefficientthanRVMs overf andthedatalikelihood, theexactposteriorprocessis
(as RVMs learn weights over N, not M, basis functions.). p(f|D,y) ∝ GP(f|0,k)(cid:81)Ni=1p(yi|f). Although the poste-
Similar to FITC and SSGP, EigenGP focuses on predictive riorprocessforGPregressionhasananalyticalform,weneed
accuracy at low computational cost, rather than on faithfully tostoreandinvertanN byN matrix,whichhasthecompu-
converging towards the full GP as the number of basis func- tational complexity O(N3), rendering GP unfeasible for big
tionsgrows. (Forthelattercase,pleaseseetheapproach[Yan dataanalytics.
andQi,2010]thatexplicitlyminimizestheKLdivergencebe-
tweenexactandapproximateGPposteriorprocesses.)
3 Model of EigenGP
The rest of the paper is organized as follows. Section 2
describes the background of GPs. Section 3 presents the
Toenablefastinferenceandobtainanonstationarycovariance
EigenGP model and an illustrative example. Section 4 out-
function,ournewmodelEigenGPprojectstheGPpriorinan
lines the marginal likelihood maximization for learning dic-
eigensubspace. Specifically,wesetthelatentfunction
tionaryelementsandtheotherhyperparameters. InSection5,
we discuss related work. Section 6 shows regression results M
(cid:88)
on multiple benchmark regression datasets, demonstrating f(x)= α φj(x) (1)
j
improved performance of EigenGP over Nystro¨m [Williams j=1
andSeeger,2001],RVM,FITC,andSSGP.
where M (cid:28) N and {φj(x)} are eigenfunctions of the GP
prior. We assign a Gaussian prior over α = [α ,...,α ],
1 M
2 Background of Gaussian Processes α ∼ N(0,diag(w)),sothatf followsaGPpriorwithzero
meanandthefollowingcovariancefunction
WedenoteN independentandidenticallydistributedsamples M
as D = {(x1,y1),...,(xn,yn)}N, where xi is a D dimen- k˜(x,x(cid:48))=(cid:88)wjφj(x)φj(x(cid:48)). (2)
sionalinput(i.e.,explanatoryvariables)andyiisascalarout- j=1
put(i.e.,aresponse),whichweassumeisthenoisyrealization
ofalatentfunctionf atx . To compute the eigenfunctions {φj(x)}, we can use the
i
Galerkin projection to approximate them by Hermite poly-
A Gaussian process places a prior distribution over the
latent function f. Its projection f at {x }N defines a nomials [Marzouk and Najm, 2009]. For high dimensional
x i i=1
joint Gaussian distribution p(f ) = N(f|m0,K), where, problems, however, this approach requires a tensor product
x
without any prior preference, the mean m0 are set to 0 ofunivariateHermitepolynomialsthatdramaticallyincreases
and the covariance function k(x ,x ) ≡ K(x ,x ) en- thenumberofparameters.
i j i j
To avoid this problem, we apply the Nystro¨m method
codes the prior notion of smoothness. A popular choice
[Williams and Seeger, 2001] that allows us to obtain an ap-
is the anisotropic squared exponential covariance function:
k(x,x(cid:48)) = a exp(cid:0)−(x−x(cid:48))Tdiag(η)(x−x(cid:48))(cid:1), where proximationtotheeigenfunctionsinahighdimensionalspace
0
the hyperparameters include the signal variance a and the efficiently. Specifically, given inducing inputs (i.e. basis
0
lengthscales η = {η }D , controlling how fast the covari- points)B=[b1,...,bM],wereplace
d d=1
ancedecayswiththedistancebetweeninputs. Usingthisco- (cid:90)
variancefunction,wecanpruneinputdimensionsbyshrink- k(x,x(cid:48))φj(x)p(x)dx=λjφj(x(cid:48)) (3)
ing the corresponding lengthscales based on the data (when
η = 0, the d-th dimension becomes totally irrelevant tothe byitsMonteCarloapproximation
d
covariance function value). This pruning is known as Auto-
M
1 (cid:88)
maticRelevanceDetermination(ARD)andthereforethisco- k(x,b )φj(b )≈λ φj(x) (4)
M i i j
variance is also called the ARD squared exponential. Note
i=1
that the covariance function value remains the same when
(x(cid:48) −x) is the same – regardless where x(cid:48) and x are. This Then,byevaluatingthisequationatBsothatwecanestimate
thevaluesofφj(b )andλ ,weobtainthej-theigenfunction
thusleadstoastationaryGPmodel. Fornonstationarydata, i j
φj(x)asfollows
however,astationaryGPmodelisamisfit. Althoughnonsta-
√
tionary GP models have been developed and applied to real M
worldapplications,theyareoftenlimitedtolow-dimensional φj(x)= λ(M)k(x)u(jM) =k(x)uj (5)
problems, such as applications in spatial statistics [Paciorek j
2
wherek(x) (cid:44) [k(x,b1),...,k(x,bM)],λ(jM) andu(jM) are Basispointslearned Basispointsestimatedby
the j-th eigenvalue and eigenvector of the covariance func- fromK-means evidencemaximization
√
tion evaluated at B, and u = Mu(M)/λ(M). Note that, 2 2
j j j
forsimplicity,wehavechosenthenumberoftheinducingin- ns 1 1
o
puts to be the same as the number of the eigenfunctions; in cti 0 0
di y y
practice, we can use more inducing inputs while computing Pre −1 −1
onlythetopM eigenvectors. Asshownin(5),oureigenfunc- −2 −2
tion lives in a RKHS as a linear combination of the kernel −3 −3
0 2 4 6 0 2 4 6
functionsevaluatedatBwithweightsuj. 1 x 1 x
Inserting(5)into(1)weobtain ons 0.5 0.5
cti
M M n 0 0
f(x)=(cid:88)αj(cid:88)uijk(x,bi) (6) enfuy−0.5 y−0.5
g
j=1 i=1 Ei −1 −1
This equation reveals a two-layer structure of EigenGP. The −1.5 −1.5
0 2 4 6 0 2 4 6
x x
firstlayerlinearlycombinesmultiplekernelfunctionstogen-
erate each eigenfunction φj. The second layer takes these
eigenfunctionsasthebasisfunctionstogeneratethefunction
Figure 1: Illustrationof the effect of optimizing basispoints
valuef. Notethatf isaBayesianlinearcombinationof{φj}
(i.e.,inducinginputs). Inthefirstrow,thepinkdotsrepresent
wheretheweightsαareintegratedouttoavoidoverfitting.
data points, the blue and red solid curves correspond to the
All the model hyperparameters are learned from data.
predictive mean of EigenGP and ± two standard deviations
Specifically, for the first layer, to learn the eigenfunctions
around the mean, the black curve corresponds to the predic-
{φ }, we estimate the inducing inputs B and the kernel hy-
i tivemeanofthefullGP,andtheblackcrossesdenotethebasis
perparameters(suchaslengthscalesηfortheARDkernel)by
points. Inthesecondrow,thecurvesinvariouscolorsrepre-
maximizing the model marginal likelihood. For the second
sentthefiveeigenfunctionsofEigenGP.
layer,wemarginalizeoutαtoavoidoverfittingandmaximize
themodelmarginallikelihoodtolearnthehyperparameterw
oftheprior. 3.1 IllustrativeExample
With the estimated hyperparameters, the prior over f is
nonstationarybecauseitscovariancefunctionin(2)variesat Forthisexample,weconsideratoydatasetusedbytheFITC
different regions of x. This comes at no surprise since the algorithm [Snelson and Ghahramani, 2006]. It contains 200
eigenfunctionsaretiedwithp(x)in(3). Thisnonstationarity one-dimensional training samples. We use 5 basis points
reflectsthefactthatourmodelisadaptivetothedistribution (M = 5) and choose the ARD kernel. We then compare
oftheexplanatoryvariablesx. thebasisfunctions{φj}andthecorrespondingpredictivedis-
Notethattorecoverthefulluncertaintycapturedbytheker- tributions in two cases. For the first case, we use the kernel
nelfunctionk,wecanaddthefollowingtermintothekernel widthηlearnedfromthefullGPmodelasthekernelwidthfor
functionofEigenGP: EigenGP,andapplyK-meanstosetthebasispointsBasclus-
δ(x−x(cid:48))(k(x,x(cid:48))−k˜(x,x(cid:48))) (7) tercenters. TheideaofusingK-meanstosetthebasispoints
has been suggested by Zhang et al. [2008] to minimize an
whereδ(a) = 1ifandonlyifa = 0. Comparedtotheorig- error bound for the Nystro¨m approximation. For the second
inal EigenGP model, which has a finite degree of freedom, case, we optimize η, B and w by maximizing the marginal
thismodifiedmodelhastheinfinitenumberofbasisfunctions likelihoodofEigenGP.
(assuming k has an infinite number of basis functions as the The results are shown in Figure 1. The first row demon-
ARDkernel). Thus,thismodelcanaccuratelymodeltheun- strates that, by optimizing the hyperparameters, EigenGP
certainty of a test point even when it is far from the training achieves the predictions very close to what the full GP
set.Wederivetheoptimizationupdatesofallthehyperparam- achieves—butusingonly5basisfunctions. Incontrast,when
etersforboththeoriginalandmodifiedEigenGPmodels. But the basis points are set to the cluster centers by K-means,
according to our experiments, the modified model does not EigenGP leads to the prediction significantly different from
improvethepredictionaccuracyovertheoriginalEigenGP(it thatofthefullGPandfailstocapturethedatatrend, inpar-
even reduces the accuracy sometimes.). Therefore, we will ticular,forx∈(3,5). ThesecondrowofFigure1showsthat
focus on the original EigenGP model in our presentation for K-meanssetsthebasispointsalmostevenlyspacedony,and
itssimplicity. Beforewepresentdetailsabouthyperparame- accordingly the five eigenfunctions are smooth global basis
teroptimization,letusfirstlookatanillustrativeexampleon functionswhoseshapesarenotdirectlylinkedtothefunction
the effect of hyperparameter optimization, in particular, the they fit. Evidence maximization, by contrast, sets the basis
optimizationoftheinducinginputsB. pointsunevenlyspacedtogeneratethebasisfunctionswhose
3
shapesaremorelocalizedandadaptivetothefunctiontheyfit; NotethatwecancomputeK C−1 efficientlyvialow-rank
BX N
for example, the eigenfunction represented by the red curve updates. Also,R11T(S11T)canbeimplementedefficiently
wellmatchesthedataontheright. byfirstsummingoverthecolumnsofS(R)andthencopying
it multiple times—without any multiplication operation. To
4 Learning Hyperparameters obtain tr(C−N1yydTBC−N1dCN), we simply replace C−N1 in (12)
and(13)byC−1yyTC−1.
N N
Inthissectionwedescribehowweoptimizeallthehyperpa- Using similar derivations, we can obtain the derivatives
rameters, denoted by θ, which include B in the covariance withrespecttothelengthscaleη,a andσ2respectively.
0
function(2),andallthekernelhyperparameters(e.g.,a and
0 Tocomputethederivativewithrespecttow,wecanusethe
η). Tooptimizeθ,wemaximizethemarginallikelihood(i.e. formulatr(Pdiag(w)Q)=1T(QT◦P)wtoobtainthetwo
evidence)basedonaconjugateNewtonmethod1. Weexplore
tracetermsin(9)asfollows:
twostrategiesforevidencemaximization. Thefirstoneisse-
quential optimization, which first fixes w while updating all
theotherhyperparameters,andthenoptimizeswwhilefixing tr(C−N1dCN) =1T(Φ◦(C−1Φ)) (14)
theotherhyperparameters. Thesecondstrategyistooptimize dw N
all the hyperparameters jointly. Here we skip the details for tr(C−1yyTC−1dC )
themorecomplicatedjointoptimizationanddescribethekey N dw N N =1T(Φ◦(C−N1yyTC−N1Φ)) (15)
gradient formula for sequential optimization. The key com-
putationinouroptimizationisforthelogmarginallikelihood
anditsgradient: Foreithersequentialorjointoptimization,theoverallcom-
putational complexity is O(max(M2,D)N) where D is the
lnp(y|θ)= −12ln|CN|− 21yTC−N1y− N2 ln(2π), (8) datadimension.
dlnp(y|θ)= −21(cid:2)tr(C−N1dCN)−tr(C−N1yyTCN−1dCN)(cid:3)(9)
where C = K˜ + σ2I, K˜ = Φdiag(w)ΦT, and Φ =
N
{φm(xn)} is an N by M matrix. Because the rank of K˜ is 5 Related Work
M (cid:28) N,wecancomputeln|C |andC−1 efficientlywith
N N
thecostofO(M2N)viathematrixinversionanddeterminant
Ourworkiscloselyrelatedtotheseminalworkby[Williams
lemmas. Even with the use of the matrix inverse lemma for
and Seeger, 2001], but they differ in multiple aspects. First,
thelow-rankcomputation,anaivecalculationwouldbevery
we define a valid probabilistic model based on an eigen-
costly. Weapplyidentitiesfromcomputationallinearalgebra
decomposition of the GP prior. By contrast, the previous
[Minka,2001;deLeeuw,2007]tosimplifytheneededcom-
approach [Williams and Seeger, 2001] aims at a low-rank
putationdramatically.
TocomputethederivativewithrespecttoB,wefirstnotice approximation to the finite covariance/kernel matrix used in
that,whenwisfixed,wehave GPtraining—fromanumericalapproximationperspective—
and its predictive distribution is not well-formed in a prob-
C =K˜ +σ2I=K K −1K +σ2I (10) abilistic framework (e.g., it may give a negative variance
N XB BB BX
of the predictive distribution.). Second, while the Nystro¨m
whereKXB isthecross-covariancematrixbetweenthetrain- method simply uses the first few eigenvectors, we maximize
ingdataXandtheinducinginputsB,andKBB isthecovari- the model marginal likelihood to adjust their weights in the
ancematrixonB. covariancefunction. Third,exploringtheclusteringproperty
FortheARDsquaredexponentialkernel,utilizingthefol- of the eigenfunctions of the Gaussian kernel, our approach
lowingidentities, tr(PTQ) = vec(P)Tvec(Q)andvec(P◦ canconductsemi-supervisedlearning,whilethepreviousone
Q)=diag(vec(P))vec(Q),wherevec(·)vectorizesamatrix
cannot. Thesemi-supervisedlearningcapabilityofEigenGP
intoacolumnvector,and◦representstheHadamardproduct,
is investigated in another paper of us. Fourth, the Nystro¨m
wecanderivethederivativeofthefirsttracetermin(9)
method lacks a principled way to learn model hyperparam-
tr(C−1dC ) eters including the kernel width and the basis points while
N N =4RXTdiag(η)−4(R11T)◦(BTdiag(η))
dB EigenGPdoesnot.
−4SBTdiag(η)+4(S11T)◦(BTdiag(η)) (11) Our work is also related to methods that use kernel prin-
ciplecomponentanalysis(PCA)tospeedupkernelmachines
where1isacolumnvectorofallones,and
[Hoegaertsetal.,2005].However,forthesemethodsitcanbe
R = (K −1K C−1)◦K (12) difficult—ifnotimpossible—tolearnimportanthyperparam-
BB BX N BX eters including kernel width for each dimension and induc-
S = (K −1K C−1K K −1)◦K (13)
BB BX N XB BB BB inginputs(notasubsetofthetrainingsamples). Bycontrast,
1We use the code from http://www.gaussianprocess.org/ EigenGPlearnsallthesehyperparametersfromdatabasedon
gpml/code/matlab/doc gradientsofthemodelmarginallikelihood.
4
6 Experimental Results (a)Nystro¨m (b)SSGP
4 2
In this section, we compare EigenGP2 and alternative meth- 2 1
odsonsyntheticandrealbenchmarkdatasets. Thealternative 0
y 0 y
methods include the sparse GP methods—FITC, SSGP, and −1
theNystro¨mmethod—aswellasRVMs.Weimplementedthe −2
−2
Nystro¨mmethodourselvesanddownloadedthesoftwareim-
−4 −3
plementationsfortheothermethodsfromtheirauthors’web- −2 0 2 4 6 8 −2 0 2 4 6 8
x x
sites. ForRVMs,weusedthefastfixedpointalgorithm[Faul (c)FITC (d)EigenGP
2 2
andTipping,2001].WeusedtheARDkernelforallthemeth-
odsexceptRVMs(sincetheydonotestimatethelengthscales 1 1
inthiskernel)andoptimizedallthehyperparametersviaevi- 0 0
y y
dencemaximization. ForRVMs,wechosethesquaredexpo- −1 −1
nentialkernelwiththesamelengthscaleforallthedimensions
−2 −2
andapplieda10-foldcross-validationonthetrainingdatato
−3 −3
selectthelengthscale. Onlargerealdata,weusedthevalues −2 0 2 4 6 8 −2 0 2 4 6 8
x x
of η, a , and σ2 learned from the full GP on a subset that
0
was1/10ofthetrainingdatatoinitializeallthemethodsex-
cept RVMs. For the rest configurations, we used the default Figure 2: Predictions of four sparse GP methods. Pink dots
setting of the downloaded software packages. For our own representtrainingdata;bluecurvesarepredictivemeans;red
model,wedenotetheversionswithsequentialandjointopti- curvesaretwostandarddeviationsaboveandbelowthemean
mizationasEigenGPandEigenGP*,respectively.Toevaluate curves,andtheblackcrossesindicatetheinducinginputs.
thetestperformanceofeachmethod,wemeasuretheNormal-
izedMeanSquareError(NMSE)andtheMeanNegativeLog
Probability(MNLP),definedas: average results from 10 runs are reported in Table 1 (dataset
1). We have the results from two versions of the Nystro¨m
NMSE= (cid:80)i(yi−µi)2/(cid:80)i(µi−y¯)2 (16) method. Forthefirstversion,thekernelwidthislearnedfrom
MNLP= 1 (cid:80) [(yi−µi)2+lnσ2+ln2π] (17) the full GP and the basis locations are chosen by K-means
2N i σi i as before; for the second version, denoted as Nystro¨m*, its
where y , µ and σ2 are the response value, the predictive hyperparametersarelearnedbyevidencemaximization. Note
i i i
meanandvarianceforthei-thtestpointrespectively,andy¯is thattheevidencemaximizationalgorithmfortheNystro¨map-
theaverageresponsevalueofthetrainingdata. proximationisnoveltoo—developedbyusforthecompara-
tiveanalysis.Table1showsthatbothEigenGPandEigenGP*
approximate the mean of the full GP model more accurately
6.1 ApproximationQualityonSyntheticData
thantheothermethods,inparticular,severalordersofmagni-
As in Section 3.1, we use the synthetic data from the FITC tudebetterthantheNystro¨mmethod.
paperforthecomparativestudy. Toletallthemethodshave Furthermore, we add the difference term (7) into the ker-
thesamecomputationalcomplexity,wesetthenumberofin- nel function and denote this version of our algorithm as
ducinginputsM = 7. TheresultsaresummarizedinFigure EigenGP+. Itgivesbetterpredictivevariancewhenfarfrom
2. FortheNystro¨mmethod,weusedthekernelwidthlearned thetrainingdatabutitspredictivemeanisslightlyworsethan
fromthefullGPandappliedK-meanstochoosethebasislo- the version without this term (7); the NMSE and MNLP of
cations[Zhangetal.,2008].Figure2(a)showsthatitdoesnot EigenGP+ are0.014±0.001and−0.081±0.004. Thus,on
fitwell. Figure2(b)demonstratesthatthepredictionofSSGP theotherdatasets,weonlyusetheversionswithoutthisterm
oscillatesoutsidetherangeofthetrainingsamples,probably (EigenGP and EigenGP∗) for their simplicity and effective-
duetothefactthatthesinusoidalcomponentsareglobaland ness. Wealsoexaminetheperformanceofallthesemethods
span the whole data range (increasing the number of basis withahighercomputationalcomplexity. Specifically,weset
functionswouldimproveSSGP’spredictiveperformance,but M = 10. Again, both versions of the Nystro¨m method give
increase the computational cost.). As shown by Figure 2(c), poor predictive distributions. And SSGP still leads to extra
FITCfailstocapturetheturnsofthedataaccuratelyforxnear wavy patterns outside the training data. FITC, EigenGP and
4whileEigenGPcan. EigenGP+ give good predictions. Again, EigenGP+ gives
Using the full GP predictive mean as the label for x ∈ better predictive variance when far from the training data ,
[−1,7] (we do not have the true y values in the test data), butwithasimilarpredictivemeanasEigenGP.
we compute the NMSE and MNLP of all the methods. The
Finally, we compare the RVM with EigenGP* on this
2The implementation is available at: https://github.com/ dataset.WhiletheRVMgivesNMSE=0.048in2.0seconds,
hao-peng/EigenGP EigenGP* achieves NMSE = 0.039±0.017 in 0.33±0.04
5
dictive mean on the right is smaller than the true one, prob-
Table1: NMSEandMNLPonsyntheticdata
ably affected by the small dynamic range of the data on the
left. Figure 3(b) shows that the predictive mean of FITC at
NMSE
Method dataset1 dataset2 x∈(2,3)haslowerfrequencyandsmalleramplitudethanthe
truefunction—perhapsinfluencedbythelow-frequencypart
Nystro¨m 39±18 1526±769
ontheleftx ∈ (0,2). Actuallybecauseofthelow-frequency
Nystro¨m* 2.41±0.53 2721±370
part, FITC learns a large kernel width η; the average kernel
FITC 0.02±0.005 0.50±0.04
widthlearnedoverthe10runsis207.75. Thislargevalueaf-
SSGP 0.54±0.01 0.22±0.05
fectsthequalityoflearnedbasispoints(e.g.,lackingofbasis
EigenGP 0.006±0.001 0.06±0.02
pointsforthehighfrequencyregionontheright).Bycontrast,
EigenGP∗ 0.009±0.002 0.06±0.02
using the same initial kernel width as FITC, EigenGP learns
asuitablekernelwidth—onaverage,η =0.07—andprovides
MNLP
goodpredictionsasshowninFigure3(c).
Method dataset1 dataset2
Nystro¨m 645±56 2561±1617
6.3 Accuracyvs. TimeonRealData
Nystro¨m* 7.39±1.66 40±5
FITC −0.07±0.01 0.88±0.05
SSGP 1.22±0.03 0.87±0.09
g 0.28 0.8
EigenGP −0.33±0.00 0.40±0.07 usin 0.26 RFIVTMC 0.72 FSISTGCP
EigenGP∗ −0.31±0.01 0.44±0.07 Ho SSGP EigenGP
aliforniaNMSE00..2224 EEiiggeennGGPP* MNLP 00..5664 EigenGP*
C 0.2 0.48
second with M = 30 (EigenGP performs similarly), both (a)
0.18 0.4
fasterandmoreaccurate. 0 1000 2000 3000 0 1000 2000 3000
0.65 Seconds 1.3 Seconds
RVM FITC
0.6 FITC 1.2 SSGP
6.2 PredictionQualityonNonstationaryData SSGP EigenGP
S
PPTMSE0.55 EEiiggeennGGPP* NLP 1.1 EigenGP*
We then compare all the sparse GP methods on an one- PN 0.5 M 1
b)
dimensionalnonstationarysyntheticdatasetwith200training (
0.45 0.9
and 500 test samples. The underlying function is f(x) =
xsin(x3)wherex ∈ (0,3)andthestandarddeviationofthe 0.40 2000 4000 6000 0.80 2000 4000 6000
0.15 Seconds 0.2 Seconds
whitenoiseis0.5. Thisfunctionisnonstationaryinthesense RVM FITC
that its frequency and amplitude increase when x increases m 0.12 FITC −0.08 SSGP
m SSGP EigenGP
fnruommb0e.rWofebraasnidsopmoilnytsge(nfuenractteiodntsh)etdoabtae1104tifmoresalalntdhesectotmhe- TelecoNMSE00..0069 EEiiggeennGGPP* MNLP−−00..6346 EigenGP*
e
petitive methods. Using the true function value as the label, ol
P 0.03 −0.92
wecomputethemeansandthestandarderrorsofNMSEand c)
(
MNLPasinTable1(dataset2). FortheNystro¨mmethod,the 00 1000 2000 3000 −1.20 1000 2000 3000
Seconds Seconds
marginallikelihoodoptimizationleadstomuchsmallererror
NMSE NMLP
than the K-means based approach. However, both of them
farepoorlywhencomparedwiththealternativemethods. Ta-
ble1alsoshowsthatEigenGPandEigenGP∗ achieveastrik- Figure 4: NMSE and MNLP vs. training time. Each meth-
ing ∼25,000 fold error reduction compared with Nystro¨m*, od (except RVMs) has five results associated with M =
and a ∼10-fold error reduction compared with the second 25,50,100,200,400, respectively. In (c), the fifth result of
bestmethod,SSGP.RVMsgaveNMSE0.0111±0.0004with FITCisoutoftherange;theactualtrainingtimeis3485sec-
1.4±0.05seconds, averagedover10runs, whiletheresults onds,theNMSE0.033,andtheMNLP−1.27. Thevaluesof
ofEigenGP*withM =50areNMSE0.0110±0.0006with MNLPforRVMsare1.30,1.25and0.95,respectively.
0.89±0.1042seconds(EigenGPgivessimilarresults).
We further illustrate the predictive mean and standard de- To evaluate the trade-off between prediction accuracy and
viation on a typical run in Figure 3. As shown in Fig- computationalcost,weusethreelargerealdatasets. Thefirst
ure 3(a), the predictive mean of SSGP contains reasonable dataset is California Housing [Pace and Barry, 1997]. We
high frequency components for x ∈ (2,3) but, as a station- randomly split the 8 dimensional data into 10,000 training
ary GP model, these high frequency components give extra and 10,640 test points. The second dataset is Physicochem-
wavy patterns in the left region of x. In addition, the pre- ical Properties of Protein Tertiary Structures (PPPTS) which
6
(a)SSGP (b)FITC (c)EigenGP
4 4 4
2 2 2
y y y
0 0 0
−2 −2 −2
−4 −4 −4
0 1 2 3 0 1 2 3 0 1 2 3
x x x
Figure3:Predictionsonnonstationarydata.Thepinkdotscorrespondtonoisydataaroundthetruefunctionf(x)=xsin(x3),
represented by the green curves. The blue and red solid curves correspond to the predictive means and ± two standard
deviationsaroundthemeans. TheblackcrossesnearthebottomrepresenttheestimatedbasispointsforFITCandEigenGP.
can be obtained from Lichman [2013]. We randomly split learningbyeitherusingstochasticgradientdescenttoupdate
the 9 dimensional data into 20,000 training and 25,730 test the weights of the eigenfunctions or applying the online VB
points. The third dataset is Pole Telecomm that was used in ideaforGPs[Hensmanetal.,2013].
La´zaro-Gredillaetal.[2010]. Itcontains10,000trainingand
5000 test samples, each of which has 26 features. We set
M = 25,50,100,200,400, and the maximum number of it- Acknowledgments
erationsinoptimizationtobe100forallmethods.
The NMSE, MNLP and the training time of these meth- ThisworkwassupportedbyNSFECCS-0941043,NSFCA-
ods are shown in Figure 4. In addition, we ran the REERawardIIS-1054903, andtheCenterforScienceofIn-
Nystro¨m method based on the marginal likelihood maxi- formation, an NSF Science and Technology Center, under
mization, which is better than using K-means to set the ba- grantagreementCCF-0939370.
sis points. Again, the Nystro¨m method performed orders
of magnitude worse than the other methods: with M =
25,50,100,200,400, on California Housing, the Nystro¨m References
method uses 146, 183, 230, 359 and 751 seconds for train-
ing, respectively, and gives the NMSE 917, 120, 103, 317, NoelCressieandGardarJohannesson. Fixedrankkrigingfor
and95; onPPPTS,thetrainingtimesare258,304,415,853 very large spatial data sets. Journal of the Royal Statisti-
and2001secondsandtheNMSEsare1.7×105, 6.4×104, calSociety:SeriesB(StatisticalMethodology),70(1):209–
1.3×104,8.1×103,and8.1×103;andonPoleTelecomm, 226,February2008.
the training times are 179, 205, 267, 478, and 959 seconds
andtheNMSEsare2.3×104,5.8×103,3.8×103,4.5×102 LehelCsato´ andManfredOpper. SparseonlineGaussianpro-
and 84. The MNLPs are consistently large, and are omitted cesses. NeuralComputation,14:641–668,March2002.
hereforsimplicity.
JandeLeeuw. Derivativesofgeneralizedeigensystemswith
ForRVMs,weincludecross-validationinitstrainingtime
applications. In Department of Statistics Papers. Depart-
because choosing an appropriate kernel width is crucial for
mentofStatistics,UCLA,UCLA,2007.
RVM. Since RVM learns the number of basis functions au-
tomaticallyfromthedata,inFigure4itshowsasingleresult
Anita C. Faul and Michael E. Tipping. Analysis of sparse
foreachdataset.EigenGPachievesthelowestpredictionerror
Bayesianlearning.InAdvancesinNeuralInformationPro-
usingshortertime. ComparedwithEigenGPbasedonthese-
cessingSystems14.MITPress,2001.
quential optimization, EigenGP∗ achieves similar errors, but
takeslongerbecausethejointoptimizationismoreexpensive.
JamesHensman, Nicolo` Fusi, andNeilD.Lawrence. Gaus-
sianprocessesforbigdata. Inthe29thConferenceonUn-
certainty in Articial Intelligence (UAI 2013), pages 282–
7 Conclusions
290,2013.
Inthispaperwehavepresentedasimpleyeteffectivesparse DaveHigdon. Spaceandspace-timemodelingusingprocess
Gaussianprocessmethod,EigenGP,andappliedittoregres- convolutions. InCliveW.Anderson,VicBarnett,PhilipC.
sion. DespiteitssimilaritytotheNystro¨mmethod,EigenGP Chatwin, and Abdel H. El-Shaarawi, editors, Quantitative
can improve its prediction quality by several orders of mag- methods for current environmental issues, pages 37–56.
nitude. EigenGP can be easily extended to conduct online SpringerVerlag,2002.
7
Luc Hoegaerts, Johan A. K. Suykens, Joos Vandewalle, and Christopher K. I. Williams and Matthias Seeger. Using the
Bart De Moor. Subset based least squares subspace re- Nystro¨mmethodtospeedupkernelmachines.InAdvances
gressioninRKHS. Neurocomputing,63:293–323,January in Neural Information Processing Systems 13, volume 13.
2005. MITPress,2001.
MiguelLa´zaro-Gredilla,JoaquinQuin˜onero-Candela,CarlE. Feng Yan and Yuan Qi. Sparse Gaussian process regression
Rasmussen, and An´ıbal R. Figueiras-Vidal. Sparse spec- via (cid:96)1 penalization. In Proceedings of 27th International
trum Gaussian process regression. Journal of Machine ConferenceonMachineLearning,pages1183–1190,2010.
LearningResearch,11:1865–1881,2010.
Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved
MosheLichman. UCImachinelearningrepository,2013. Nystro¨m low-rank approximation and error analysis. In
Proceedings of the 25th International Conference on Ma-
DavidJ.C.MacKay. Bayesianinterpolation. NeuralCompu- chineLearning,pages1232–1239.ACM,2008.
tation,4(3):415–447,1992.
Youssef M. Marzouk and Habib N. Najm. Dimensionality
reduction and polynomial chaos acceleration of Bayesian
inference in inverse problems. Journal of Computational
Physics,228(6):1862–1902,2009.
Thomas P. Minka. Old and new matrix algebra useful for
statistics. Technicalreport,MITMediaLab,2001.
R.KelleyPaceandRonaldBarry. Sparsespatialautoregres-
sions. Statistics and Probability Letters, 33(3):291–297,
1997.
Christopher J. Paciorek and Mark J. Schervish. Nonstation-
ary covariance functions for Gaussian process regression.
InAdvancesinNeuralInformationProcessingSystems16.
MITPress,2004.
Yuan Qi, Ahmed H. Abdel-Gawad, and Thomas P. Minka.
Sparse-posterior Gaussian processes for general likeli-
hoods. In Proceedings of the 26th Conference on Uncer-
taintyinArtificialIntelligence,2010.
Joaquin Quin˜onero-Candela and Carl E. Rasmussen. A uni-
fyingviewofsparseapproximateGaussianprocessregres-
sion. Journal of Machine Learning Research, 6:1935–
1959,122005.
Edward Snelson and Zoubin Ghahramani. Sparse Gaussian
processesusingpseudo-inputs. InAdvancesinNeuralIn-
formationProcessingSystems18.MITpress,2006.
Michael E. Tipping. The relevance vector machine. In Ad-
vancesinNeuralInformationProcessingSystems12.MIT
Press,2000.
MichalisK.Titsias.Variationallearningofinducingvariables
in sparse Gaussian processes. In The 12th International
Conference on Artificial Intelligence and Statistics, pages
567–574,2009.
ChristopherK.I.WilliamsandDavidBarber. Bayesianclas-
sificationwithGaussianprocesses. IEEETransactionson
Pattern Analysis and Machine Intelligence, 20(12):1342–
1351,1998.
8