Learning with Kernels

submitted by
Diplom-Physiker
Alexander Johannes Smola

to the Department of Computer Science (Fachbereich 13) of the
Technische Universität Berlin
in fulfillment of the requirements for the academic degree of
Doktor der Naturwissenschaften
(Dr. rer. nat.)

Doctoral committee:
Chair: Prof. Dr. A. Biedl
Reviewer: Prof. Dr. S. Jähnichen
Reviewer: Prof. Dr. J. Shawe-Taylor

Date of the scientific defense: November 1998

Berlin 1998
D 83
Foreword
The present thesis can take its place among the numerous doctoral theses and other publications that are currently revolutionizing the area of machine learning. The author's basic concern is with kernel-based methods and in particular Support Vector algorithms for regression estimation for the solution of inverse, often ill-posed problems. However, Alexander Smola's thesis stands out from many of the other publications in this field. This is due in part to the author's profound theoretical penetration of his subject-matter, but also and in particular to the wealth of detailed results he has included.
Especially neat and of particular relevance are the algorithmic extensions of Support Vector Machines, which can be combined as building blocks, thus markedly improving the Support Vector method. Of substantial interest is also the very elegant unsupervised method for nonlinear feature extraction, which applies the kernel-based approach to classical Principal Component Analysis (kernel PCA). And although only designed to illustrate the theoretical results, the practical applications the author gives us from the areas of high-energy physics and time-series analysis are highly convincing.
In many respects the thesis is groundbreaking, and it is likely to soon become a frequently cited work, both for its numerous innovative applications in the field of statistical machine learning and for improving our theoretical understanding of Support Vector Machines.
Stefan Jähnichen, Professor, Technische Universität Berlin
Director, GMD Berlin
Alex Smola's thesis has branched out in at least five novel directions broadly based around kernel learning machines: analysis of cost functions, relations to regularization networks, optimization algorithms, extensions to unsupervised learning including regularized principal manifolds, entropy numbers for linear operators and applications to bounding covering numbers. I will highlight some of the significant contributions made in each of these areas.
Cost Functions
This section presents a very neat, coherent view of cost functions and their effect on the overall algorithmics of kernel regression. In particular, it is shown how using a general convex cost function still allows the problem to be cast as a convex programming problem solvable via the dual. Experiments show that choosing the right cost function can improve performance. The section goes on to describe a very useful approach to choosing the ε for the ε-insensitive loss measure, based on traditional statistical methods. Further refinements arising from this approach give a new algorithm termed ν-SV regression.
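For reference, and in standard notation rather than as a quotation from the thesis, the ε-insensitive loss penalizes only deviations larger than ε, while ν-SV regression turns ε itself into a variable of the optimization problem:

    |y - f(x)|_\varepsilon := \max\{0,\; |y - f(x)| - \varepsilon\}

    \text{minimize}\quad \frac{1}{2}\|w\|^2
      + C\Bigl(\nu\varepsilon + \frac{1}{\ell}\sum_{i=1}^{\ell}(\xi_i + \xi_i^*)\Bigr)

Roughly speaking, ν then acts as an upper bound on the fraction of points with nonzero slack and a lower bound on the fraction of support vectors; the exact scaling of the constant C varies between presentations.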
Kernels and Regularization
The chapter covers the relation between kernels used in Support Vector Machines and Regularization Networks. This connection is a very valuable contribution to understanding the operation of SV machines and in particular their generalization properties. The analysis of particular kernels and experiments showing the effects of their regularization properties are very illuminating. Higher dimensional input spaces are considered, and the case of dot product kernels is studied in some detail. This leads to the introduction of Conditionally Positive Definite Kernels and semiparametric estimation, both of which are new in the context of SV machines.
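A compact way to state the connection, again in standard notation and not as a literal excerpt, is that SV regression can be read as minimization of a regularized risk functional

    R_{\text{reg}}[f] = \frac{1}{\ell}\sum_{i=1}^{\ell} c\bigl(x_i, y_i, f(x_i)\bigr)
      + \frac{\lambda}{2}\,\|Pf\|^2

where P is a regularization operator. If the kernel k is chosen as a Green's function of P*P, the minimizer admits a kernel expansion f(x) = \sum_i \alpha_i k(x_i, x) + b, which is exactly the functional form produced by both SV machines and regularization networks.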
Optimization Algorithms
This section takes interior point methods and implements them for SV regression and classification. By taking into account the specifics of the problem, efficiency savings have been made. The consideration then turns to subset selection to handle large data sets. This introduces, among other techniques, SMO, or sequential minimal optimization. This approach is generalized to the regression case and proves to be an extremely efficient method.
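To make concrete what these solvers compute, the standard dual quadratic program of ε-SV regression reads as follows (conventions for the constant C differ slightly between authors; this is not a literal excerpt from the thesis):

    \text{maximize}\quad
      -\frac{1}{2}\sum_{i,j=1}^{\ell}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\,k(x_i, x_j)
      - \varepsilon\sum_{i=1}^{\ell}(\alpha_i + \alpha_i^*)
      + \sum_{i=1}^{\ell} y_i(\alpha_i - \alpha_i^*)

    \text{subject to}\quad \sum_{i=1}^{\ell}(\alpha_i - \alpha_i^*) = 0,
      \qquad 0 \le \alpha_i, \alpha_i^* \le C

Interior point codes solve this problem together with its primal to high precision, subset selection methods optimize over a working set of variables at a time, and SMO takes this to the extreme by updating the smallest admissible working sets analytically.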
Unsupervised Learning
The extension to Kernel PCA is a nice idea which appears to work well in practice. The further work on Regularized Principal Manifolds is very novel and opens up a number of interesting problems and techniques.
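As an illustration of the kernel PCA computation itself, here is a minimal NumPy sketch with a Gaussian kernel; the function name, parameters, and kernel choice are illustrative assumptions, not code from the thesis. One centers the kernel matrix in feature space, diagonalizes it, and projects onto the leading eigenvectors.

    import numpy as np

    def kernel_pca(X, n_components=2, gamma=1.0):
        """Project the rows of X onto the first kernel principal components."""
        # Gaussian (RBF) kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2)
        sq = np.sum(X ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        K = np.exp(-gamma * d2)

        # Center the kernel matrix in feature space
        n = K.shape[0]
        one_n = np.full((n, n), 1.0 / n)
        K_c = K - one_n @ K - K @ one_n + one_n @ K @ one_n

        # Eigendecomposition (eigh returns eigenvalues in ascending order)
        eigvals, eigvecs = np.linalg.eigh(K_c)
        eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

        # Normalize so that the feature-space principal axes have unit length
        alphas = eigvecs[:, :n_components] / np.sqrt(np.maximum(eigvals[:n_components], 1e-12))

        # Projections of the training points onto the principal components
        return K_c @ alphas

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 3))
        print(kernel_pca(X, n_components=2, gamma=0.5).shape)  # (100, 2)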
Entropy numbers, kernels and operators
The estimation of covering numbers via techniques from operator theory is another major contribution to the state-of-the-art. Many new results are presented; among others, generalization bounds for Regularized Principal Manifolds are given.
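For readers unfamiliar with the terminology (these are the standard definitions, not quotations from the thesis): the covering number N(ε, F) of a class F is the smallest number of ε-balls, in a suitable metric, needed to cover F, and entropy numbers invert this relation,

    \varepsilon_n(F) := \inf\{\varepsilon > 0 : N(\varepsilon, F) \le 2^{\,n-1}\}

For a bounded linear operator T the entropy numbers are those of the image of the unit ball under T, which is what makes results from operator theory applicable to bounding the covering numbers of kernel hypothesis classes.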
John Shawe-Taylor, Professor, Royal Holloway, University of London
To my parents
Abstract
Support Vector (SV) Machines combine several techniques from statistics, machine learning and neural networks. One of the most important ingredients is the concept of kernels, i.e. transforming linear algorithms into nonlinear ones via a map into feature spaces. The present work focuses on the following issues:
Extensions of Support Vector Machines.
Extensions of kernel methods to other algorithms such as unsupervised learning.
Capacity bounds which are particularly well suited for kernel methods.
After a brief introduction to SV regression it is shown how the classical ε-insensitive loss function can be replaced by other cost functions while keeping the original advantages or adding other features such as automatic parameter adaptation. Moreover, the connection between kernels and regularization is pointed out. A theoretical analysis of several common kernels follows, and criteria to check Mercer's condition more easily are presented. Further modifications lead to semiparametric models and greedy approximation schemes. Next, three different types of optimization algorithms, namely interior point codes, subset selection algorithms, and sequential minimal optimization (including pseudocode), are presented. The primal-dual framework is used as an analytic tool in this context.
Unsupervised learning is an extension of kernel methods to new problems. Besides Kernel PCA one can use the regularization approach to obtain more general feature extractors. A second approach leads to regularized quantization functionals which allow a smooth transition between the Generative Topographic Map and Principal Curves.
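The regularized quantization functional alluded to here can be sketched, in schematic notation that is not a quotation from the thesis, as follows: for a map f from a low dimensional index set Z into input space one minimizes

    R_{\text{reg}}[f] = \frac{1}{\ell}\sum_{i=1}^{\ell}\,\min_{z \in Z}\,\|x_i - f(z)\|^2 + \lambda\,Q[f]

where Q[f] is a regularizer on the manifold parametrized by f. Different choices of Z, Q and λ then yield vector quantization, principal-curve-like, and GTM-like models as special or limiting cases.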
The second part deals with uniform convergence bounds for the algorithms and concepts presented so far. It starts with a brief self-contained overview and an introduction to functional analytic tools which play a crucial role in this problem. By viewing the class of kernel expansions as an image of a linear operator one may give bounds on the generalization ability of kernel expansions even when standard concepts like the VC dimension fail or give too conservative estimates.
In particular it is shown that it is possible to compute the covering numbers of the given hypothesis classes directly instead of taking the detour via the VC dimension. Applications of the new tools to SV machines, convex combinations of hypotheses (i.e. boosting and sparse coding), greedy approximation schemes, and principal curves conclude the presentation.
Keywords: Support Vectors, Regression, Kernel Expansions, Regularization, Statistical Learning Theory, Uniform Convergence.
Zusammenfassung (German abstract)

Support Vector (SV) machines combine various techniques from statistics, machine learning and neural networks. A key role falls to kernels, i.e. the concept of making linear algorithms nonlinear by means of a map into feature spaces. The dissertation addresses the following aspects:
Extensions of the Support Vector algorithm.
Extensions and applications of kernel-based methods to other algorithms such as unsupervised learning.
Bounds on the generalization ability that are specifically tailored to kernel-based methods.
After a brief introduction to SV regression it is shown how the ε-insensitive cost function can be replaced by other functions while the advantages of the original algorithm are preserved or new properties, such as automatic parameter adaptation, are added.
Furthermore, the connection between kernels and regularization is pointed out. A theoretical analysis of several frequently used kernels follows, together with criteria for checking Mercer's condition more easily. Further modifications lead to semiparametric models as well as "greedy" approximation schemes. Finally, three optimization algorithms are presented, namely the interior point method, subset selection algorithms, and sequential minimal optimization. The primal-dual concept of optimization serves as the analytic tool here; pseudocode is also provided in this context.
Unsupervised learning is an application of kernel-based methods to new problems. Besides kernel PCA, the regularization concept can be used to obtain more general feature extractors. A second approach leads to a smooth transition between the Generative Topographic Map (GTM) and principal curves.
The second part of the dissertation is concerned with bounds on uniform convergence for the algorithms and concepts presented so far. To this end, a brief overview of existing techniques for capacity control and of functional analysis is given first. The latter plays a decisive role, since the class of kernel expansions can be viewed as the image under a linear operator, which makes bounds on the generalization ability possible even in those cases where classical approaches such as the VC dimension fail or give too conservative bounds.
In particular it is shown that it is possible to compute the covering numbers of a given hypothesis class directly, without taking the detour via the computation of the VC dimension. The new methods find applications in Support Vector machines, convex combinations of hypotheses (e.g. boosting and sparse coding), "greedy" approximation schemes, and principal curves.
Keywords: Support Vectors, Regression, Kernel Expansions, Regularization, Statistical Learning Theory, Uniform Convergence.
Preface
The goal of this thesis is to give a self-contained overview of Support Vector Machines and similar kernel-based methods, mainly for Regression Estimation. It is, in this sense, complementary to Bernhard Schölkopf's work on Pattern Recognition. Yet it also contains new insights into capacity control which can be applied to classification problems as well.
It is probably best to view this work as a technical description of a toolset, namely the building blocks of a Support Vector Machine. The first part describes its basic machinery and the possible add-ons that can be used for modifying it, just like the leaflet one would get from a car dealer with a choice of all the 'extras' available. In this respect the second part could be regarded as a list of operating instructions, namely how to effectively carry out capacity control for a class of systems of the SV type.
How to read this Thesis
I tried to organize this work in both a self-contained and a modular manner. Where necessary, proofs have been moved into the appendix of the corresponding chapters and can be omitted if the reader is willing to accept some results on faith. Some fundamental results, however, are derived in the main body if they are needed to understand the further line of reasoning.
How not to read this Thesis
This is the work of a physicist who decided to do applied statistics, ended up in a computer science department, and sometimes had engineering applications or functional analysis in mind. Hence it provides a mixture of techniques and concepts from several domains, sufficient to annoy many readers, due to the lack of mathematical rigor (from a mathematician's point of view), the sometimes rather theoretical reasoning and some technical proofs (from a practitioner's point of view), the almost complete lack of any connection with physics, or some algorithms that work but have not (yet) been proven to be optimal or to terminate in a finite number of steps, etc. However, I tried to split the nuisance equally among the disciplines.
Acknowledgements
I would like to thank the researchers at the Neuro group of GMD FIRST, with whom I had the pleasure of carrying out research in an excellent environment. In particular, the discussions with Peter Behr, Thilo Frieß, Jens Kohlmorgen, Steven Lemm, Sebastian Mika, Takashi Onoda, Petra Philips, Gunnar Rätsch, Andras Ziehe and many others were often inspiring. Besides that, the system administrators Roger Holst and Andreas Schulz helped in many ways.
The second lab to be thanked is the machine learning group at ANU, Canberra. Peter Bartlett, Jon Baxter, Shai Ben-David, and Lew Mason helped me in getting a deeper understanding of Statistical Learning Theory. The research visit at ANU was also one of the most productive periods of the past two years.
Next to mention are two departments at AT&T and Bell Laboratories, with Leon Bottou, Chris Burges, Yann LeCun, Patrick Haffner, Craig Nohl, Patrice Simard, and Charles Stenard. Much of the work would not have been possible if I had not had the chance to learn about Support Vectors in their, then, joint research facility in Holmdel, USA. I am particularly grateful to Vladimir Vapnik in this respect. His "hands on" approach to statistics was a reliable guide to interesting problems.
Moreover, I had the fortune to discuss with and get help from people like Bernd Carl, Nello Cristianini, Leo van Hemmen, Adam Krzyzak, David MacKay, Heidrun Mündlein, Noboru Murata, Manfred Opper, John Shawe-Taylor, Mark Stitson, Sara Solla, Grace Wahba, and Jason Weston. Chris Burges, André Elisseeff, Ralf Herbrich, Olvi Mangasarian, Klaus-Robert Müller, Bernhard Schölkopf, John Shawe-Taylor, Anja Westerhoff and Robert Williamson gave helpful comments on the thesis and found many errors.
Special thanks go to the three people with whom most of the results of this thesis were obtained, namely Klaus-Robert Müller, Bernhard Schölkopf, and Robert Williamson: thanks to Klaus in particular for always reminding me of the necessity that theoretical advances have to be backed by experimental evidence, for advice and discussions about learning theory and neural networks, and for allowing me to focus on research, quite undisturbed by administrative chores; thanks to Bernhard for many discussions which were a major source of ideas and for ensuring that the proofs stayed theoretically sound; thanks to Bob for teaching me many things about statistical learning theory and functional analysis, and for many valuable discussions.
Close collaboration is only possible if complemented by friendship. Many of these researchers, in particular Bernhard, became good friends, far beyond the level of scientific cooperation.
Finally, I would like to thank Stefan Jähnichen. As head of GMD and dean of the TU Berlin computer science department, he provided the research environment in which new ideas could be developed. His wise advice and guidance in scientific matters and academic issues were very helpful.
This work was made possible through funding by ARPA, grants of the DFG (JA 379/71 and JA 379/51), funding from the Australian Research Council, travel grants from the NIPS foundation and NEuroNet, and the support of NeuroCOLT 2.