Density Ratio Estimation
in Machine Learning
MASASHI SUGIYAMA
Tokyo Institute of Technology
TAIJI SUZUKI
The University of Tokyo
TAKAFUMI KANAMORI
Nagoya University
Cambridge University Press
Cambridge, New York, Melbourne, Madrid, Cape Town,
Singapore, São Paulo, Delhi, Mexico City
Cambridge University Press
32 Avenue of the Americas, New York, NY 10013-2473, USA
www.cambridge.org
Information on this title: www.cambridge.org/9780521190176
© Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori 2012
First published 2012
Printed in the United States of America
A catalog record for this publication is available from the British Library.
Library of Congress Cataloging in Publication data is available
ISBN 978-0-521-19017-6 Hardback
Contents
Foreword
Preface
Part I Density-Ratio Approach to Machine Learning
1 Introduction
1.1 Machine Learning
1.2 Density-Ratio Approach to Machine Learning
1.3 Algorithms of Density-Ratio Estimation
1.4 Theoretical Aspects of Density-Ratio Estimation
1.5 Organization of this Book at a Glance
Part II Methods of Density-Ratio Estimation
2 Density Estimation
2.1 Basic Framework
2.2 Parametric Approach
2.3 Non-Parametric Approach
2.4 Numerical Examples
2.5 Remarks
3 Moment Matching
3.1 Basic Framework
3.2 Finite-Order Approach
3.3 Infinite-Order Approach: KMM
3.4 Numerical Examples
3.5 Remarks
4 Probabilistic Classification
4.1 Basic Framework
4.2 Logistic Regression
4.3 Least-Squares Probabilistic Classifier
4.4 Support Vector Machine
4.5 Model Selection by Cross-Validation
4.6 Numerical Examples
4.7 Remarks
5 Density Fitting
5.1 Basic Framework
5.2 Implementations of KLIEP
5.3 Model Selection by Cross-Validation
5.4 Numerical Examples
5.5 Remarks
6 Density-Ratio Fitting
6.1 Basic Framework
6.2 Implementation of LSIF
6.3 Model Selection by Cross-Validation
6.4 Numerical Examples
6.5 Remarks
7 Unified Framework
7.1 Basic Framework
7.2 Existing Methods as Density-Ratio Fitting
7.3 Interpretation of Density-Ratio Fitting
7.4 Power Divergence for Robust Density-Ratio Estimation
7.5 Remarks
8 Direct Density-Ratio Estimation with Dimensionality Reduction
8.1 Discriminant Analysis Approach
8.2 Divergence Maximization Approach
8.3 Numerical Examples
8.4 Remarks
Part III Applications of Density Ratios in Machine Learning
9 Importance Sampling
9.1 Covariate Shift Adaptation
9.2 Multi-Task Learning
10 Distribution Comparison
10.1 Inlier-Based Outlier Detection
10.2 Two-Sample Test
11 Mutual Information Estimation
11.1 Density-Ratio Methods of Mutual Information Estimation
11.2 Sufficient Dimension Reduction
11.3 Independent Component Analysis
12 Conditional Probability Estimation
12.1 Conditional Density Estimation
12.2 Probabilistic Classification
Part IV Theoretical Analysis of Density-Ratio Estimation
13 Parametric Convergence Analysis
13.1 Density-Ratio Fitting under Kullback–Leibler Divergence
13.2 Density-Ratio Fitting under Squared Distance
13.3 Optimality of Logistic Regression
13.4 Accuracy Comparison
13.5 Remarks
14 Non-Parametric Convergence Analysis
14.1 Mathematical Preliminaries
14.2 Non-Parametric Convergence Analysis of KLIEP
14.3 Convergence Analysis of KuLSIF
14.4 Remarks
15 Parametric Two-Sample Test
15.1 Introduction
15.2 Estimation of Density Ratios
15.3 Estimation of ASC Divergence
15.4 Optimal Estimator of ASC Divergence
15.5 Two-Sample Test Based on ASC Divergence Estimation
15.6 Numerical Studies
15.7 Remarks
16 Non-Parametric Numerical Stability Analysis
16.1 Preliminaries
16.2 Relation between KuLSIF and KMM
16.3 Condition Number Analysis
16.4 Optimality of KuLSIF
16.5 Numerical Examples
16.6 Remarks
Part V Conclusions
17 Conclusions and Future Directions
List of Symbols and Abbreviations
References
Index
Foreword
Estimating probability distributions is widely viewed as a central question in
machine learning. The whole enterprise of probabilistic modeling using probabilistic
graphical models is generally addressed by learning marginal and conditional
probability distributions. Classification and regression – starting with Fisher’s
fundamental contributions – are similarly viewed as problems of estimating
conditional densities.
The present book introduces an exciting alternative perspective – namely, that
virtually all problems in machine learning can be formulated and solved as problems
of estimating density ratios – the ratios of two probability densities. This
book provides a comprehensive review of the elegant line of research undertaken
by the authors and their collaborators over the last decade. It reviews existing
work on density-ratio estimation and derives a variety of algorithms for directly
estimating density ratios. It then shows how these novel algorithms can address
not only standard machine learning problems – such as classification, regression,
and feature selection – but also a variety of other important problems such as
learning under a covariate shift, multi-task learning, outlier detection, sufficient
dimensionality reduction, and independent component analysis.
At each point this book carefully defines the problems at hand, reviews existing
work, derives novel methods, and reports on numerical experiments that validate
the effectiveness and superiority of the new methods. A particularly impressive
aspect of the work is that implementations of most of the methods are available
for download from the authors’ web pages.
The last part of the book is devoted to mathematical analyses of the methods.
This includes not only an analysis for the case where the assumptions underlying
the algorithms hold, but also situations in which the models are misspecified.
Careful study of these results will not only provide fundamental insights into the
problems and algorithms but will also provide the reader with an introduction to
many valuable analytic tools.
In summary, this is a definitive treatment of the topic of density-ratio estimation.
It reflects the authors’ careful thinking and sustained research efforts. Researchers
and students alike will find it an important source of ideas and techniques. There is
no doubt that this book will change the way people think about machine learning
and stimulate many new directions for research.
Thomas G. Dietterich
School of Electrical Engineering
Oregon State University, Corvallis, OR, USA
Preface
Machine learning is aimed at developing systems that learn. The mathematical
foundation of machine learning and its real-world applications have been extensively
explored in the last decades. Various tasks of machine learning, such as
regression and classification, typically can be solved by estimating probability
distributions behind data. However, estimating probability distributions is one of
the most difficult problems in statistical data analysis, and thus solving machine
learning tasks without going through distribution estimation is a key challenge in
modern machine learning.
So far, various algorithms have been developed that do not involve distribution
estimation but solve target machine learning tasks directly. The support vector
machine is a successful example that follows this line – it does not estimate
data-generating distributions but directly obtains the class-decision boundary that is
sufficient for classification. However, developing such an excellent algorithm for
each of the machine learning tasks could be highly costly and difficult.
To overcome these limitations of current machine learning research, we introduce
and develop a novel paradigm called density-ratio estimation – instead of
probability distributions, the ratio of probability densities is estimated for statistical
data processing. The density-ratio approach covers various machine learning
tasks, for example, non-stationarity adaptation, multi-task learning, outlier detection,
two-sample tests, feature selection, dimensionality reduction, independent
component analysis, causal inference, conditional density estimation, and probabilistic
classification. Thus, density-ratio estimation is a versatile tool for machine
learning. This book is aimed at introducing the mathematical foundation, practical
algorithms, and applications of density-ratio estimation.
Most of the contents of this book are based on the journal and conference papers
we have published in the last couple of years. We acknowledge our collaborators for
their fruitful discussions: Hirotaka Hachiya, Shohei Hido, Yasuyuki Ihara, Hisashi
Kashima, Motoaki Kawanabe, Manabu Kimura, Masakazu Matsugu, Shin-ichi
Nakajima, Klaus-Robert Müller, Jun Sese, Jaak Simm, Ichiro Takeuchi, Masafumi
Takimoto, Yuta Tsuboi, Kazuya Ueki, Paul von Bünau, Gordon Wichern, and
Makoto Yamada.
[Photograph: Picture taken in Nagano, Japan, in the summer of 2009. From left to right, Taiji Suzuki,
Masashi Sugiyama, and Takafumi Kanamori.]
Finally, we thank the Ministry of Education, Culture, Sports, Science and
Technology; the Alexander von Humboldt Foundation; the Okawa Foundation;
Microsoft Institute for Japanese Academic Research Collaboration Collaborative
Research Project; IBM Faculty Award; Mathematisches Forschungsinstitut
Oberwolfach Research-in-Pairs Program; the Asian Office of Aerospace Research
and Development; Support Center for Advanced Telecommunications Technology
Research Foundation; and the Japan Science and Technology Agency for their
financial support.
Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori
1
Introduction
The goal of machine learning is to extract useful information hidden in data
(Hastie et al., 2001; Schölkopf and Smola, 2002; Bishop, 2006). This chapter
gives a brief overview of the machine learning field and introduces the focus
of this book – density-ratio methods. In Section 1.1, the fundamental
machine learning frameworks of supervised learning, unsupervised learning, and
reinforcement learning are briefly reviewed. Then we show examples of machine
learning problems to which density-ratio methods can be applied in Section 1.2
and briefly review methods of density-ratio estimation in Section 1.3. A brief
overview of theoretical aspects of density-ratio estimation is given in Section 1.4.
Finally, the organization of this book is described in Section 1.5.
1.1 Machine Learning
Depending on the type of data and the purpose of the analysis, machine learning
tasks can be classified into three categories:
Supervised learning: An input–output relation is learned from input–output
samples.
Unsupervised learning: Some interesting “structure” is found from input-only
samples.
Reinforcement learning: A decision-making policy is learned from reward
samples.
In this section we briefly review each of these tasks.
1.1.1 Supervised Learning
In the supervised learning scenario, data samples take the form of input–output
pairs and the goal is to infer the input–output relation behind the data. Typical
examples of supervised learning problems are regression and classification
(Figure 1.1):