Learning multilayer perceptrons efficiently

C. Bunzmann, M. Biehl, R. Urbanczik
Institut für theoretische Physik
Universität Würzburg
Am Hubland
D-97074 Würzburg
Germany
A learning algorithm for multilayer perceptrons is presented which is based on finding the principal components of a correlation matrix computed from the example inputs and their target outputs. For large networks our procedure needs far fewer examples to achieve good generalization than traditional on-line algorithms.

PACS: 84.35.+i, 89.80.+h, 64.60.Cn
Multilayer neural networks achieve the ability to approximate any reasonable function arbitrarily well [1] by interconnecting sufficiently many architecturally identical processing elements, the perceptrons. This replication of identical elements invariably leads to symmetries in the multilayer architecture which need to be broken during training to achieve good generalization [2]. In the dynamics of gradient based training procedures the symmetries give rise to suboptimal fixed points and slow convergence. In particular this holds for stochastic gradient descent methods, which to date have been the method of choice for training large networks because, at each time step, the update of the network parameters is based on the presentation of only a single training example and hence computationally cheap. Such stochastic gradient descent methods have been intensively studied in the framework of on-line learning, where each of the training examples is used just once by the network [3]. This is attractive because there is no need to store the entire set of training examples, and the approach also works for nonstationary problems. However, in the on-line framework the number of required training examples is coupled to the slow temporal evolution caused by the symmetries, and thus unreasonably large training sets are necessary. In fact the ratio of the number of examples needed for good generalization to the number of free parameters in the network diverges with the network size [4]. While there have been investigations into optimizing the on-line dynamics [5-7], these have not led to practical algorithms since the optimized procedures assume that the symmetry breaking provided by the initial conditions is macroscopic and known to the learning algorithm.
In this letter we present a learning algorithm which has many of the attractive features of the traditional on-line procedures but yields good generalization using a much smaller number of training examples. Further, an exact analysis of the algorithm's performance shows that the ratio of required examples to the number of free parameters in the network does stay finite in the thermodynamic limit.

The multilayer architecture we analyze is a committee machine with K hidden units defined by

$$\tau(\xi) = g\!\left(K^{-1/2}\sum_{i=1}^{K} h\bigl(B_i^{T}\xi\bigr)\right), \qquad (1)$$

where ξ ∈ ℝ^N is the input and the B_i ∈ ℝ^N are the unknown parameter vectors. The goal of learning is to estimate these parameter vectors, which we shall refer to as teacher vectors, from a training set of P examples (ξ^µ, τ(ξ^µ)) of the input/output relationship. We shall initially focus on regression problems and later sketch the modifications to our procedure needed for classification tasks. For regression, the output function g is usually assumed invertible, and this easily reduces to the linear case by applying the inverse function to the target outputs. So in this case we simply assume g(x) = x. For brevity we also assume that the B_i are orthonormal.
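As an illustration of this setup, the following minimal numpy sketch generates such a teacher and a training set under the assumptions just stated (h = erf, g(x) = x, orthonormal B_i, independent N(0,1) input components); the helper names are ours and purely illustrative.

```python
# Teacher committee machine (1) with h = erf and g(x) = x, orthonormal teacher
# vectors B_i and i.i.d. N(0,1) inputs.  All names (make_teacher, ...) are ours.
import numpy as np
from scipy.special import erf

def make_teacher(N, K, rng):
    """K orthonormal teacher vectors B_i, returned as the rows of a K x N array."""
    Q, _ = np.linalg.qr(rng.standard_normal((N, K)))
    return Q.T

def teacher_output(B, xi):
    """tau(xi) = K^{-1/2} sum_i erf(B_i^T xi), for a single input or a batch."""
    return erf(xi @ B.T).sum(axis=-1) / np.sqrt(B.shape[0])

rng = np.random.default_rng(0)
N, K, alpha = 400, 3, 20                # alpha = P / (K N) examples per parameter
P = alpha * K * N
B = make_teacher(N, K, rng)
xi = rng.standard_normal((P, N))        # training inputs
tau = teacher_output(B, xi)             # regression targets
```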
In its simplest form our procedure can be seen as generalizing Hebbian learning. There the parameter vector J̄^P of a simple perceptron approximating the target function τ(ξ) is obtained as the average

$$\bar{J}^{P} = P^{-1}\sum_{\mu=1}^{P}\tau(\xi^{\mu})\,\xi^{\mu}. \qquad (2)$$

Our basic observation is that when τ(ξ) is given by a multilayer perceptron it is important to not only consider the empirical mean of the distribution of τ(ξ^µ)ξ^µ as in plain Hebbian learning but also its correlation matrix. In particular some of the eigenvectors of the correlation matrix correspond to the parameter vectors B_i and thus the supervised learning task is mapped onto the well understood problem of principal component analysis. While this mapping by itself does not solve the original problem, it does provide a crucial reduction in its dimensionality, such that the remaining problem becomes almost trivial in the thermodynamic limit N → ∞.

We shall consider the general correlation matrix

$$C_P = P^{-1}\sum_{\mu=1}^{P} F\bigl(\tau(\xi^{\mu})\bigr)\,\xi^{\mu}\xi^{\mu T}, \qquad (3)$$
where the simple choice F(x) = x² for the weight function F yields the analogy to Hebbian learning. Assuming the components of the input vectors ξ^µ to be independent Gaussian random variables with zero mean and unit variance, it is straightforward to analyze the spectrum of C_P in the limit P → ∞. The spectrum then only has three eigenvalues λ_0, λ̄ and λ_Δ. The degeneracy of λ_0 is N − K and any vector orthogonal to all teacher vectors B_i is an eigenvector. The eigenvalue λ̄ has the single eigenvector B̄ = K^{-1/2} Σ_{i=1}^K B_i; this eigenvector is of little interest since for large P one also has B̄ ∝ J̄^P, and it is thus simpler to use Hebb's rule (2) to estimate B̄. The important eigenspace is the one of λ_Δ since it is spanned by the K−1 vectors B_1 − B_j (j = 2,...,K). The three eigenvalues can be written as averages over independent, zero mean, unit variance, Gaussian random variables y_i. For instance one obtains λ_Δ = ½⟨F(τ_y)(y_1 − y_2)²⟩_y where τ_y = K^{-1/2} Σ_{i=1}^K h(y_i). The activation function h is sigmoidal (odd, monotonic and bounded), and when stating specific numerical values we will always assume h(y) = erf(y). Then for the choice F(x) = x² one finds the ordering λ̄ > λ_0 > λ_Δ and λ_0 − λ_Δ = 8/(3πK) · (1 − 1/√5).
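A minimal sketch of this construction, continuing the snippet above and assuming the simple choice F(x) = x²; as discussed below, the relevant estimate is then given by the eigenvectors belonging to the K−1 smallest eigenvalues.

```python
# First stage for the regression case, continuing the snippet above and taking
# the simple weight function F(x) = x^2.  Since lambda_Delta is then the
# smallest eigenvalue, the K-1 eigenvectors with smallest eigenvalues estimate
# the space spanned by the difference vectors B_1 - B_j.
F = tau**2                                     # weights F(tau(xi^mu))
J_hebb = (tau[:, None] * xi).mean(axis=0)      # Hebbian vector, Eq. (2)
C = (F[:, None] * xi).T @ xi / P               # correlation matrix, Eq. (3)

evals, evecs = np.linalg.eigh(C)               # eigenvalues in ascending order
Delta = evecs[:, :K - 1]                       # estimate of span{B_1 - B_j}
```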
For a finite number P of training examples the degeneracy in the spectrum is broken by random fluctuations. But a computation of the orthonormal eigenvectors Δ_j^P corresponding to the K−1 smallest eigenvalues of C_P nevertheless yields an estimate of the space spanned by the difference vectors B_1 − B_j. To measure the success of approximating this space, we introduce the overlap

$$\rho = (K-1)^{-1/2}\,\mathrm{Tr}\bigl(\Delta^{T}BB^{T}\Delta\bigr)^{1/2}, \qquad (4)$$

where B is the matrix (B_1,...,B_K) of the teacher vectors and Δ = (Δ_1^P,...,Δ_{K−1}^P). This is a sensible measure because ρ is invariant with respect to the choice of an orthonormal basis of the space spanned by the Δ_j^P and since it attains its maximal value of 1 iff all Δ_j^P lie in the space spanned by the B_i. Simulations showing the overlap ρ as a function of the number of examples per free parameter, α = P/(KN), are depicted in Fig. 1. They indicate a second order phase transition from zero to positive ρ at a critical value α_c. The evolution of ρ can be calculated exactly in the thermodynamic limit N → ∞. But since up to now we are only estimating the space spanned by the B_i, instead of the vectors themselves, we address this problem first and defer the calculation of ρ.
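In code, the overlap (4) reduces to an eigenvalue computation on a small (K−1)×(K−1) matrix; the sketch below reuses the arrays from the snippets above.

```python
# Overlap (4) between the estimated space and the teacher space, reusing
# Delta (N x (K-1)) and B (K x N, rows B_i) from the sketches above.
# Tr[(Delta^T B B^T Delta)^{1/2}] equals the sum of the square roots of the
# eigenvalues of the small symmetric matrix M.
M = Delta.T @ (B.T @ B) @ Delta
w = np.clip(np.linalg.eigvalsh(M), 0.0, None)   # clip tiny negative round-off
rho = np.sqrt(w).sum() / np.sqrt(K - 1)
print(f"overlap rho = {rho:.3f}")               # approaches 1 well above alpha_c
```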
Using the above eigenspace procedure, the B_i can be approximated by linear combinations of the Δ_j^P and J̄^P. Thus the original KN dimensional problem is reduced to K² dimensions and this is much easier for large N. To exploit the dimensionality reduction, we write the possible linear combinations as Δ̃Γ where Δ̃ = (J̄^P, Δ_1^P,...,Δ_{K−1}^P), Γ is a K by K matrix of tunable parameters, and we define a student network σ_Γ in the restricted space via σ_Γ(ξ) = g(K^{-1/2} Σ_{i=1}^K h((Δ̃Γ)_i^T ξ)).

We want to choose Γ to minimize the generalization error ǫ_g = ½⟨(τ(ξ) − σ_Γ(ξ))²⟩_ξ and to this end P̂ additional examples (ξ̂^ν, τ(ξ̂^ν)), ν = 1,...,P̂, are used. To simplify the theoretical analysis, the additional examples should be picked independently of the examples used to obtain Δ̃. We then apply the standard on-line gradient descent procedure Γ^{ν+1} = Γ^ν − η ∇_Γ (τ(ξ̂^ν) − σ_{Γ^ν}(ξ̂^ν))². However, by choosing a scaling of the learning rate η such that η ≪ 1 for large N, the stochasticity drops out of the procedure [8] and in the restricted space it performs gradient descent in ǫ_g. Further we can scale P̂ such that ηP̂ ≫ 1 and then Γ^{P̂} will be a minimum of ǫ_g for large N. Finally both scaling conditions can be satisfied while observing P̂ ≪ N, so that the required number of additional examples is negligible compared to the size of the first training set. Note that thanks to the reduced dimensionality the details of the scaling of η and P̂ only affect finite size effects and not the performance in the thermodynamic limit. Simulations combining the two stages of our algorithm (Fig. 1) show a steady decay in ǫ_g.
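A sketch of this second stage for the regression case (g(x) = x, h = erf), reusing the arrays from the snippets above; the values of η and P̂ are finite-size demonstration choices rather than the scalings of the thermodynamic limit.

```python
# Sketch of the second stage (regression, g(x) = x, h = erf), reusing B, K, N,
# J_hebb, Delta, teacher_output, erf and rng from the sketches above.
Dtilde = np.column_stack([J_hebb, Delta])      # N x K restricted basis (J^P, Delta_1..Delta_{K-1})
Gamma = 0.1 * rng.standard_normal((K, K))      # K x K tunable parameters
eta, P_hat = 0.05, 4000                        # demo values only

for _ in range(P_hat):
    x = rng.standard_normal(N)                 # fresh example (xi_hat, tau(xi_hat))
    t = teacher_output(B, x)
    a = x @ Dtilde @ Gamma                     # hidden fields (Delta~ Gamma)_i^T x
    s = erf(a).sum() / np.sqrt(K)              # student output sigma_Gamma(x)
    # gradient of (t - s)^2 with respect to Gamma
    grad = -2.0 * (t - s) / np.sqrt(K) * np.outer(Dtilde.T @ x,
                                                  2.0 / np.sqrt(np.pi) * np.exp(-a**2))
    Gamma -= eta * grad
```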
We now turn to the theoretical calculation of ρ in the important first stage of the algorithm. The smallest eigenvalue of C_P can be found by minimizing J^T C_P J under the constraint |J| = 1. Hence we consider the partition function Z = ∫ dJ exp(−βP J^T C_P J) where the integration is over the N-dimensional unit sphere. For large N the typical properties of the minimization problem are found by calculating the training set average ⟨ln Z⟩ and taking the limit β → ∞. Using replicas, one introduces the K dimensional order parameter R, the typical overlap B^T J of the teacher vectors with a vector J drawn from the Gibbs distribution Z^{-1} exp(−βP J^T C_P J), and further the replica symmetric scalar order parameter q which is the overlap J_1^T J_2 of two vectors drawn from this distribution. For P > N the correlation matrix C_P is nonsingular, q approaches 1 with increasing β, and the relevant scaling is χ = β(1 − q) = O(1). Then for the quenched average of ln Z one finds in the limit β → ∞

$$\frac{\langle\ln Z\rangle}{\beta N} = \underset{R,\chi}{\mathrm{extr}}\; R^{T}A(\alpha,\chi)R + a(\alpha,\chi). \qquad (5)$$

Here a(α,χ) = −αK⟨G_χ(τ_y)⟩_y + 1/(2χ), the K by K matrix A(α,χ) has entries

$$A_{jk}(\alpha,\chi) = -\alpha K\bigl\langle G_{\chi}(\tau_y)\,(y_j y_k-\delta_{jk})\bigr\rangle_y - \frac{\delta_{jk}}{2\chi}, \qquad (6)$$

and G_χ(τ_y) = F(τ_y)/(1 + 2χF(τ_y)). Since Eq. (5) is quadratic in R the extremal problem can only have a solution R ≠ 0 if the matrix A is singular. From the symmetries one easily obtains that A has just two eigenvalues. The first can be written as A_11 − A_12; it is the relevant eigenvalue in our case [9] and its eigenspace is spanned by the vectors e_1 − e_j (j = 2,...,K), where e_1,...,e_K denote the standard basis of ℝ^K. This degeneracy shows that the difference between the K−1 smallest eigenvalues of the correlation matrix vanishes for large N. So the simple procedure of analyzing the properties of the single vector J minimizing J^T C_P J in fact yields the properties of the K−1 eigenvectors of C_P with smallest eigenvalues in the thermodynamic limit.

Due to the degeneracy, we can reparametrize (5) setting R = ρ(e_1 − e_j)/√2 and obtain an extremal problem with only two variables ρ and χ. Note that ρ is indeed the parameter introduced in the analysis of the numerical simulations. Its evolution is now obtained by solving (5) and this confirms the continuous phase transition found in the simulations from ρ = 0 to positive ρ at a critical value α_c. For K = 2 one finds α_c = 4.49 and for K = 3 the result is α_c = 8.70. As shown in Fig. 1, beyond the phase transition there is excellent agreement between the theoretical prediction and the simulations.

To obtain generic results for large K note that the contributions of y_1 and y_2 to the target output τ_y will be small in this limit. Decomposing τ_y as τ_y = τ*_y + δ_y/√K where δ_y = h(y_1) + h(y_2), and expanding G_χ(τ_y) up to second order for large K simplifies Eq. (5) to:

$$\frac{\langle\ln Z\rangle}{\beta N} = \underset{\rho,\chi}{\mathrm{extr}}\;-\frac{\alpha}{4}\rho^{2}\bigl\langle G''_{\chi}(\tau^{*}_{y})\bigr\rangle_{y}\bigl\langle \delta_{y}^{2}\bigl((y_1-y_2)^2-2\bigr)\bigr\rangle_{y} - \alpha K\bigl\langle G_{\chi}(\tau_y)\bigr\rangle_{y} + \frac{1-\rho^{2}}{2\chi}. \qquad (7)$$

On the one hand, applying the central limit theorem, the multiple integrals ⟨G″_χ(τ*_y)⟩_y and ⟨G_χ(τ_y)⟩_y can now be replaced by a single average; on the other hand the structure of (7) shows that χ approaches zero with increasing K. These observations yield the very simple result that ρ² = 1 − α_c/α for large K and α > α_c. The value of α_c is obtained as

$$\alpha_c = \frac{4K\,\bigl\langle F^{2}(\mu z)\bigr\rangle_{z}}{\bigl\langle F''(\mu z)\bigr\rangle_{z}^{2}\,\bigl(\langle z^{2}h^{2}(z)\rangle_{z}-\mu^{2}-2\langle z\,h(z)\rangle_{z}^{2}\bigr)^{2}}, \qquad (8)$$

where µ² = ⟨h²(z)⟩_z, and the distribution of z is Gaussian with zero mean and unit variance. It is now straightforward to derive the optimal choice of F from Eq. (8) by noting that in the denominator ⟨F″(µz)⟩_z = µ^{-2}⟨(z² − 1)F(µz)⟩_z. Applying the Cauchy-Schwarz inequality to ⟨F²(µz)⟩_z/⟨F″(µz)⟩_z² yields that the optimal choice is F(x) = x² − µ². For this choice the eigenvalue λ_0 of the correlation matrix C_P vanishes for large P. So the optimal F maximizes the signal to noise ratio between the eigenspace of difference vectors we want to estimate and the orthogonal space.
For the activation function h(x) = erf(x) one finds that α_c = 1.96K when F is optimal, whereas the simple choice F(x) = x² yields a critical α which is one and a half times higher. Simulation results for K = 7 plotted in Fig. 2 show that the large K theory provides a reasonable approximation already for networks with quite few hidden units.
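The value 1.96K can be checked directly from the expression (8) by Gauss–Hermite quadrature; the short verification sketch below assumes h = erf and the optimal F(x) = x² − µ², for which F″ = 2.

```python
# Quadrature check of the large-K threshold (8) for h = erf and the optimal
# weight function F(x) = x^2 - mu^2 (so F'' = 2); this is our own verification
# sketch, not code from the Letter.
import numpy as np
from scipy.special import erf

z, w = np.polynomial.hermite_e.hermegauss(80)  # nodes/weights for weight e^{-z^2/2}
w /= np.sqrt(2.0 * np.pi)                      # now sum(w * g(z)) = <g(z)>, z ~ N(0,1)
avg = lambda g: float(np.sum(w * g))

mu2 = avg(erf(z)**2)                           # mu^2 = <h^2(z)>
num = 4.0 * avg((mu2 * z**2 - mu2)**2)         # 4 <F^2(mu z)> with F(x) = x^2 - mu^2
den = 2.0**2 * (avg(z**2 * erf(z)**2) - mu2 - 2.0 * avg(z * erf(z))**2)**2
print(num / den)                               # alpha_c / K, approximately 1.96
```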
To round off the analysis of the regression problem, we obtain a theoretical prediction for the generalization error achieved by combining the two stages of our procedure. A simple calculation shows that the overlap r of the Hebbian vector J̄^P with B̄, r = B̄^T J̄^P/|J̄^P|, for large N satisfies r = (1 + 3 arcsin(2/3)/(2αK))^{-1/2}. Further, using the explicit expression for ǫ_g given in [3] and the values r and ρ, we can calculate the minimum of ǫ_g in the restricted space and find a theoretical prediction for the generalization error obtained by the second stage. This prediction is compared to the simulations in Figs. 1 and 2.
We next consider classification problems, that is we assume that the output function of the network given by Eq. (1) is g(x) = sgn(x). Then, since the output is binary, λ_0 = λ̄ holds for any choice of the weight function F, and our procedure cannot immediately be applied. However, it is possible to gain information about the target rule by not just considering its output τ^µ but by comparing the output to the value B̄^Tξ^µ which would have been obtained by the linearized network, g(x) = h(x) = x. This is feasible since, even for classification, B̄ can be estimated by the Hebbian vector J̄^P defined by (2). We are thus led to consider more general correlation matrices of the form

$$C_P = P^{-1}\sum_{\mu=1}^{P} F\bigl(\tau(\xi^{\mu}),\,\bar{J}^{\mu-1\,T}\xi^{\mu}\bigr)\,\xi^{\mu}\xi^{\mu T}.$$

A reasonable way of choosing F is to focus on inputs ξ^µ where the target output has a different sign than the output of the linearized network. So we consider F(x,y) = Θ(−xy) − µ, where Θ is the Heaviside step function.

In the large P limit, the matrix C_P has the same eigenspaces as in the case of regression and the three eigenvalues will in general be different. For the activation function we chose h(x) = sgn(x) to compare our results with the findings for on-line Gibbs learning [10], to our knowledge the only algorithm which has been simulated in any detail for classification tasks with a connected committee machine. For this h one finds λ_Δ > λ_0 > λ̄, and for large K to leading order λ_Δ − λ_0 = √(2π − 4)/(π²K). Numerical simulations of the procedure are shown in Fig. 3 using µ = π^{-1} arccos√(2/π). Motivated by our findings in the case of regression, this choice of µ yields λ_0 = 0 for large K. While the training sets needed to achieve good generalization are much larger than for regression, they are, for exactly the same architecture, approximately 20 times smaller than for on-line Gibbs learning [10] already for N = 150.
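A sketch of this classification variant, with h = g = sgn, the weight function F(x,y) = Θ(−xy) − µ and the running Hebbian sum standing in for J̄^{µ−1} (only its sign enters); the sizes, names and the value of α are illustrative.

```python
# Sketch of the classification variant: h = g = sgn, weight function
# F(x, y) = Theta(-x y) - mu with the running Hebbian vector as second argument.
import numpy as np

rng = np.random.default_rng(1)
N, K, alpha = 150, 7, 300                      # Fig. 3 explores alpha up to ~600
P = alpha * K * N
mu = np.arccos(np.sqrt(2.0 / np.pi)) / np.pi   # makes lambda_0 vanish for large K

B = np.linalg.qr(rng.standard_normal((N, K)))[0].T   # orthonormal teacher rows
C = np.zeros((N, N))
J = np.zeros(N)                                # running (unnormalized) Hebbian sum
for _ in range(P):                             # slow pure-Python loop, kept for clarity
    x = rng.standard_normal(N)
    tau = np.sign(np.sign(x @ B.T).sum())      # committee output with h = g = sgn
    weight = float(tau * (J @ x) < 0.0) - mu   # Theta(-tau * J^T x) - mu; only sign of J^T x matters
    C += weight * np.outer(x, x)
    J += tau * x
C /= P

# Here lambda_Delta > lambda_0 > lambda_bar, so the difference space sits at the
# K-1 LARGEST eigenvalues of C.
evals, evecs = np.linalg.eigh(C)
Delta = evecs[:, -(K - 1):]
```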
Throughout this Letter we have assumed that the inputs are drawn from an isotropic distribution. Typically, in practical applications, the input data itself will have some structure, and in this case the inputs have to be whitened before applying our procedure. To this end one computes the correlations of the inputs by D_P = P^{-1} Σ_{µ=1}^P ξ^µ ξ^{µT}. Then instead of just considering C_P, one finds the extremal eigenvectors Δ̃_j of the transformed matrix R^{-1T} C_P R^{-1}, where the square matrix R is the Cholesky factorization of D_P, R^T R = D_P. Then an estimate for the space spanned by the difference vectors is obtained by transforming back, setting Δ_j = R^{-1} Δ̃_j. Numerical simulations for F(x) = x² show that whitening noticeably improves the performance of our algorithm even when the inputs are picked from an isotropic distribution. This is due to the fact that the spectrum of the empirical correlation matrix D_P has a finite width even when it converges to a delta peak with increasing P.
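The whitening step might be sketched as follows; numpy's Cholesky routine returns a lower-triangular factor, so its transpose plays the role of R with R^T R = D_P (the sketch assumes P > N so that D_P is positive definite, and the function name is ours).

```python
# Sketch of the whitening step: Cholesky-factor the input correlations D_P,
# diagonalize the transformed matrix R^{-1 T} C_P R^{-1}, and map the
# eigenvectors back via Delta_j = R^{-1} Delta~_j (illustrative only).
import numpy as np

def whitened_eigenspace(xi, C, k):
    """xi: P x N inputs, C: weighted correlation matrix (3), k: number of vectors."""
    D = xi.T @ xi / xi.shape[0]              # empirical input correlations D_P
    R = np.linalg.cholesky(D).T              # upper-triangular R with R^T R = D_P
    Rinv = np.linalg.inv(R)
    M = Rinv.T @ C @ Rinv                    # R^{-1 T} C_P R^{-1}
    evals, evecs = np.linalg.eigh(M)
    Delta_tilde = evecs[:, :k]               # k smallest eigenvalues (regression case)
    return Rinv @ Delta_tilde                # transform back: Delta_j = R^{-1} Delta~_j
```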
In summary, we have presented the first learning algorithm for a realistic multilayer network which can be exactly analyzed in the thermodynamic limit and yields good generalization already for a finite ratio of training examples to free parameters. This contrasts with the behavior of traditional on-line procedures [3,4] for target functions such as the ones considered in this Letter. As long as there are on the order of N on-line steps, the dynamics is exactly described by deterministic differential equations for a set of self-averaging order parameters [12] in the limit of large N. However, even when tuned by choosing a good learning rate, the dynamics is stuck in a badly generalizing plateau on this time scale unless some information about the target function is built into the initial conditions. For the realistic case of randomly chosen initial values, the symmetries are eventually broken due to small initial fluctuations and the accumulation of fluctuations during the dynamics, but only when the number of on-line steps is much larger than N. This symmetry breaking is not adequately described by the deterministic differential equations. When the training set is sampled without replacement, this divergence of the time scale means that large numbers of examples are needed. But even in the recently analyzed scenario of sampling with replacement [13], the above theoretical problems remain and are compounded since the analysis requires involved techniques such as dynamical replica theory and quite a few approximations already on the time scale N.

Hence, we believe that the findings in this Letter open new avenues for research and that in particular the performance of the algorithm for classification tasks deserves a more detailed analysis. But the perhaps most intriguing aspect of our procedure is that it does not really assume that the number of hidden units in the architecture is known a priori. This is highlighted by the inset of Fig. 2 showing that the number of hidden units can be determined by inspecting the eigenvalue spectrum of the correlation matrix. So our findings also provide a novel perspective on the problem of model selection.

The work of two of us (C.B. and R.U.) was supported by the Deutsche Forschungsgemeinschaft.

[1] G. Cybenko. Math. Control Signals and Systems, 2:303, 1989.
[2] H. Schwarze and J. Hertz. Europhys. Lett., 21:785, 1993.
[3] D. Saad and S. Solla. Phys. Rev. Lett., 74:4337, 1995.
[4] M. Biehl, P. Riegler, and C. Wöhler. J. Phys. A, 29:4769, 1996.
[5] D. Saad and M. Rattray. Phys. Rev. Lett., 79:2578, 1997.
[6] R. Vicente and N. Caticha. J. Phys. A, 30:L599, 1997.
[7] M. Rattray, D. Saad, and S. Amari. Phys. Rev. Lett., 81:5461, 1998.
[8] S. Amari. IEEE Transactions on Electronic Computers, 16(3):299, 1967.
[9] The eigenvector of the second eigenvalue A_11 + (K−1)A_12 has the form R_i = const, corresponding to the maximal eigenvalue of C_P.
[10] J. Kim and H. Sompolinsky. Phys. Rev. E, 58:2348, 1998.
[11] R. Urbanczik. Europhys. Lett., 35:553, 1996.
[12] G. Reents and R. Urbanczik. Phys. Rev. Lett., 80:5445, 1998.
[13] A. C. C. Coolen, D. Saad, and Y. Xiong. Europhys. Lett., 51:691, 2000.

FIG. 1. Results for K = 3. The evolution of ρ for N = 400 (○) and N = 1600 (●). The right axis relates to the values of ǫ_g found when the two stages of our procedure are combined, N = 400 (□) and N = 1600 (■). For ǫ_g the value of α refers to the number of examples in both training sets, α = (P + P̂)/KN. The full lines show the theoretical prediction for the thermodynamic limit. Where not shown, errorbars are smaller than the symbol size.
FIG. 2. Results for K = 7 using the weight function F(x) = x² − µ². The numerical simulations for N = 2000 are compared to the theoretical curves found in the large K limit. Where not shown, errorbars are smaller than the symbol size. The inset shows a histogram of the 200 smallest eigenvalues of C_P for a single training set at α = 22. A gap separates the 6 smallest eigenvalues from the rest of the spectrum. The range of the shown eigenvalues is [−0.1, −0.07].
FIG. 3. Numerical simulations of the classification problem for K = 7, N = 150 showing the values of ρ (○) and ǫ_g (□). The errorbars for the ρ values are smaller than the symbol size. Here ǫ_g is the probability of misclassifying a new input. Training in the restricted space during the second stage of our procedure uses the variant of the perceptron learning rule described in [11] for tree committee machines.