PRIVILEGED MULTI-LABEL LEARNING
Shan You¹, Chang Xu², Yunhe Wang¹, Chao Xu¹ & Dacheng Tao³
¹Key Lab. of Machine Perception (MOE), Cooperative Medianet Innovation Center,
School of EECS, Peking University, P.R. China
²School of Software, University of Technology Sydney
³School of Information Technologies, University of Sydney
ABSTRACT
This paper presents privileged multi-label learning (PrML) to explore and exploit the relationship between labels in multi-label learning problems. We suggest that each individual label can not only be implicitly connected with the other labels via the low-rank constraint over label predictors, but its performance on examples can also receive explicit comments from the other labels, which together act as an Oracle teacher. We generate a privileged label feature for each example and its individual label, and then integrate it into the framework of low-rank based multi-label learning. The proposed algorithm can therefore comprehensively explore and exploit label relationships by inheriting all the merits of privileged information and low-rank constraints. We show that PrML can be efficiently solved by a dual coordinate descent algorithm using an iterative optimization strategy with cheap updates. Experiments on benchmark datasets show that through privileged label features the performance can be significantly improved, and that PrML is superior to several competing methods in most cases.
1 INTRODUCTION

Different from single-label classification, multi-label learning (MLL) allows each example to own multiple and non-exclusive labels. For instance, when posting a photo taken in the scene of the Rio Olympics on Instagram, Twitter or Facebook, we may simultaneously include hashtags such as #RioOlympics, #athletes, #medals and #flags. Or a related news article can be simultaneously annotated as "Sports", "Politics" and "Brazil". Multi-label learning aims to accurately allocate a group of labels to unseen examples with the knowledge harvested from the training data, and it has been widely used in many applications, such as document categorization Yang et al. (2009); Li et al. (2015), image/video classification and annotation Yang et al. (2016); Wang et al. (2016); Bappy et al. (2016), gene function classification Cesa-Bianchi et al. (2012) and image retrieval Ranjan et al. (2015).
The most straightforward approach is 1-vs-all or Binary Relevance (BR) Tsoumakas et al. (2010), which decomposes multi-label learning into a set of independent binary classification tasks. However, by neglecting label relationships, it achieves only passable performance. A number of methods have thus been developed to further improve performance by taking label relationships into consideration, such as label ranking Fürnkranz et al. (2008), chains of binary classification Read et al. (2011), ensembles of multi-class classification Tsoumakas et al. (2011) and label-specific features Zhang & Wu (2015). Recently, embedding-based methods have emerged as a mainstream solution to the multi-label learning problem. These approaches assume that the label matrix is low-rank, and adopt different manipulations to embed the original label vectors, such as compressed sensing Hsu et al. (2009), principal component analysis Tai & Lin (2012), canonical correlation analysis Zhang & Schneider (2011), landmark selection Balasubramanian & Lebanon (2012) and manifold deduction Bhatia et al. (2015); Hou et al. (2016).
Most low-rank based multi-label learning algorithms exploit label relationships in the hypothesis space. The hypotheses of different labels interact with each other under the low-rank constraint, which is an implicit use of label relationships. By contrast, multiple labels can help each other in a more explicit way, where the hypothesis of a label is not only evaluated by the label itself, but can also be assessed by the other labels. More specifically in multi-label learning, for the label hypothesis at hand, the other labels can together act as an Oracle teacher to provide some comments on its performance, which is then beneficial for updating the learner. Multiple labels of examples can only be accessed in the training stage instead of the testing stage, and hence Oracle teachers only exist in the training stage. This privileged setting has been studied in the LUPI (learning using privileged information) paradigm Vapnik et al. (2009); Vapnik & Vashist (2009); Vapnik & Izmailov (2015), and it has been reported that appropriate privileged information can boost the performance in ranking Sharmanska et al. (2013), metric learning Fouad et al. (2013), classification Pechyony & Vapnik (2010) and visual recognition Motiian et al. (2016).
In this paper, we bridge connections between labels through privileged label information and formulate an effective privileged multi-label learning (PrML) method. For each label, each example's privileged label feature can be generated from the other labels. It is then able to provide additional guidance on the learning of this label, given the underlying connections between labels. By integrating the privileged information into low-rank based multi-label learning, each label predictor learned from the resulting model not only interacts with the other labels via their predictors, but also receives explicit comments from these labels. An iterative optimization strategy is employed to solve PrML, and we theoretically show that each subproblem can be solved by a dual coordinate descent algorithm with the guarantee of the solution's uniqueness. Experimental results demonstrate the significance of exploiting the privileged label features and the effectiveness of the proposed algorithm.
2 PROBLEM FORMULATION

In this section we elaborate the intrinsic privileged information in multi-label learning and formulate the corresponding privileged multi-label learning (PrML).
We first introduce the multi-label learning (MLL) problem and its frequent notations. Given $n$ training points, we denote the whole dataset as $\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^d$ is the input feature vector and $y_i \in \mathcal{Y} \subseteq \{-1,1\}^L$ is the corresponding label vector with label size $L$. Let $X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{d\times n}$ be the data matrix and $Y = [y_1, y_2, \dots, y_n] \in \{-1,1\}^{L\times n}$ be the label matrix. Specifically, $Y_{ij} = 1$ if and only if the $i$-th label is assigned to the example $x_j$, and $Y_{ij} = -1$ otherwise. Given the dataset $\mathcal{D}$, multi-label learning is formulated as learning a mapping function $f : \mathbb{R}^d \to \{-1,1\}^L$ that can accurately predict labels for unseen test points.
2.1 LOW-RANK MULTI-LABEL EMBEDDING

A straightforward manner of parameterizing the decision function is to use linear classifiers, i.e. $f(x) = Z^T x = [z_1, \dots, z_L]^T x$ with $Z \in \mathbb{R}^{d\times L}$. Note that the linear form actually incorporates the bias term by augmenting an additional 1 to the feature vector $x$. The Binary Relevance (BR) method Tsoumakas et al. (2010) decomposes multi-label learning into a set of single-label learning problems. The binary classifier for each label can be obtained by the widely-used SVM method:
$$\min_{z_i=[z_i^*;\,b_i],\,\xi_i}\ \frac{1}{2}\|z_i^*\|_2^2 + C\sum_{j=1}^n \xi_{ij} \qquad \text{s.t.}\ \ Y_{ij}(\langle z_i^*, x_j\rangle + b_i) \ge 1 - \xi_{ij},\quad \xi_{ij} \ge 0,\ \forall j = 1,\dots,n, \tag{1}$$

where $\xi_i = [\xi_{i1}, \dots, \xi_{in}]^T$ is the slack variable and $\langle\cdot,\cdot\rangle$ is the inner product between two vectors or matrices. The predictors $\{z_1, \dots, z_L\}$ of different labels are thus solved independently, without considering relationships between labels, which limits the classification performance of the BR method.
Some labels can be closely connected and often occur together on examples, and thus the label matrix is often supposed to be low-rank, which in turn implies the low rank of the label predictor matrix $Z = [z_1, \dots, z_L]$. Taking the rank of $Z$ to be $k$, which is smaller than $d$ and $L$, we are able to employ two smaller matrices to approximate $Z$, i.e. $Z = D^T W$. Here $D \in \mathbb{R}^{k\times d}$ can be seen as a dictionary of hypotheses in the latent space $\mathbb{R}^k$, while each $w_i$ in $W = [w_1, \dots, w_L] \in \mathbb{R}^{k\times L}$ is the coefficient vector generating the predictor of the $i$-th label from the hypothesis dictionary $D$. Each classifier $z_i$ is represented as $z_i = D^T w_i$ ($i = 1,2,\dots,L$) and Problem (1) can be extended into:
$$\min_{D,W,\xi}\ \frac{1}{2}\Big(\|D\|_F^2 + \sum_{i=1}^L \|w_i\|_2^2\Big) + C\sum_{i=1}^L\sum_{j=1}^n \xi_{ij} \qquad \text{s.t.}\ \ Y_{ij}\langle D^T w_i, x_j\rangle \ge 1 - \xi_{ij},\quad \xi_{ij} \ge 0,\ \forall i = 1,\dots,L;\ j = 1,\dots,n, \tag{2}$$

where $\xi = [\xi_1, \dots, \xi_L]^T$. Thus in Eq. (2) the classifiers $z_i$ of all labels are drawn from an identical low-dimensional subspace, i.e. the row space of $D$. Then, using block coordinate descent, either $D$ or $W$ can be solved within the empirical risk minimization (ERM) framework by turning it into a hinge loss minimization problem.
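To make the parameter structure concrete, the following minimal numpy sketch (variable names and sizes are illustrative, not from the paper) shows how a low-rank predictor $Z = D^T W$ scores all labels at once; it illustrates only the factorized form of Eq. (2), not its training.

```python
import numpy as np

d, L, k, n = 100, 20, 8, 50          # feature dim, #labels, latent dim, #examples
rng = np.random.default_rng(0)

D = rng.standard_normal((k, d))      # dictionary of hypotheses in the latent space R^k
W = rng.standard_normal((k, L))      # columns are the coefficient vectors w_i
X = rng.standard_normal((d, n))      # data matrix, one example per column

# Each label predictor is z_i = D^T w_i, so Z = D^T W has rank at most k.
Z = D.T @ W                          # (d, L)
scores = Z.T @ X                     # (L, n) label scores for all examples
Y_pred = np.sign(scores)             # predictions in {-1, +1}
```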
2.2 PRIVILEGED INFORMATION IN MULTI-LABEL LEARNING

The slack variable $\xi_{ij}$ in Eq. (2) indicates the prediction error of the $j$-th example on the $i$-th label. In fact, it depicts the error-tolerant ability of a model, and is directly related to the optimal classifier and its classification performance. From a different point of view, slack variables can be regarded as comments of some Oracle teacher on the performance of predictors on each example. In the multi-label context, the hypothesis of each label is not only evaluated by itself, but also assessed by the other labels. Thus the other labels can be seen as its Oracle teacher, who will provide some comments during this label's learning. Note that these label values are known a priori only during training; when we get down to learning the $i$-th label's predictor, we actually know the values of the other labels for each training point $x_j$. Therefore, we can formulate the other label values as privileged information (or hidden information) of each example. Let

$$\tilde{y}_{i,j} = y_j, \ \text{with the } i\text{-th element set to } 0. \tag{3}$$
We call $\tilde{y}_{i,j}$ the training point $x_j$'s privileged label feature on the $i$-th label. It can be seen that the privileged label space is constructed straightforwardly from the original label space. These privileged label features can thus be regarded as an explicit way to connect all labels. In addition, note that the valid dimension (removing the 0) of $\tilde{y}_{i,j}$ is $L-1$, since we take the other $L-1$ label values as the privileged label features. Moreover, not all the other labels have a positive impact on the learning of a given label Sun et al. (2014), and thus it is appropriate to strategically select some key labels to formulate the privileged label features. We will discuss this in the Experiment section.
Since for each label the other labels serve as the Oracle teacher via the privileged label feature $\tilde{y}_{i,j}$ on each example, the comments on slack variables can be modelled as a linear function Vapnik & Vashist (2009),

$$\xi_{ij}(\tilde{y}_{i,j}; \tilde{w}_i) = \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle. \tag{4}$$

The function $\xi_{ij}(\tilde{y}_{i,j}; \tilde{w}_i)$ is thus called the correcting function with respect to the $i$-th label, where $\tilde{w}_i$ is the parameter vector. As shown in Eq. (4), the privileged comments $\tilde{y}_{i,j}$ directly correct the values of slack variables as prior knowledge or additional information. Integrating privileged features as in Eq. (4) into the SVM yields the popular SVM+ method Vapnik & Vashist (2009), which has been shown to improve the convergence rate and the performance.
Integrating the proposed privileged label features into the low-rank parameter structure as in Eqs. (2) and (4), we formulate a new multi-label learning model, privileged multi-label learning (PrML), by casting it into the SVM+-based LUPI paradigm:

$$\min_{D,W,\tilde{W}}\ \frac{1}{2}\|D\|_F^2 + \frac{1}{2}\sum_{i=1}^L\big(\gamma_1\|w_i\|_2^2 + \gamma_2\|\tilde{w}_i\|_2^2\big) + C\sum_{i=1}^L\sum_{j=1}^n \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle \qquad \text{s.t.}\ \ Y_{ij}\langle D^T w_i, x_j\rangle \ge 1 - \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle,\quad \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle \ge 0,\ \forall i = 1,\dots,L;\ j = 1,\dots,n, \tag{5}$$

where $\tilde{W} = [\tilde{w}_1, \dots, \tilde{w}_L]$. In particular, we absorb the bias term to obtain a compact variant of the original SVM+, because it turns out to have a simpler form in the dual space and can be solved more efficiently. In this way, the training data within multi-label learning is actually in the triplet fashion, i.e. $(x_i, y_i, \tilde{Y}_i)$, $i = 1,\dots,n$, where $\tilde{Y}_i = [\tilde{y}_{1,i}, \dots, \tilde{y}_{L,i}]$ is the privileged label feature matrix for each label.
Remark. When $W = I$, i.e. the low-dimensional projection is the identity, the proposed PrML degenerates into a simpler BR-style model (we call it privileged Binary Relevance, PrBR), where the whole model decomposes into $L$ independent binary models. However, every single model is still combined with the comments from privileged information, so it may still be superior to BR.
3 OPTIMIZATION

In this section, we present how to solve the proposed privileged multi-label learning model in Eq. (5). The whole model of Eq. (5) is not convex due to the multiplication $D^T w_i$ in the constraints. However, each subproblem with fixed $D$ or $W$ is convex, and thus can be solved by various efficient convex solvers. Note that $\langle D^T w_i, x_j\rangle$ has two equivalent forms, i.e. $\langle w_i, D x_j\rangle$ and $\langle D, w_i x_j^T\rangle$, and thus the correcting function can be coupled with $D$ or $W$ without damaging the convexity of either subproblem. In this way, Eq. (5) can be solved using an alternating iteration strategy, i.e. by iteratively conducting the following two steps: optimizing $W$ and the privileged variable $\tilde{W}$ with fixed $D$, and updating $D$ and the privileged variable $\tilde{W}$ with fixed $W$. Both subproblems are related to SVM+, so their dual problems are quadratic programs (QP). In the following, we elaborate the solving process in real implementations.
3.1 OPTIMIZING $W$, $\tilde{W}$ WITH FIXED $D$

Fixing $D$, Eq. (5) can be decomposed into $L$ independent binary classification problems, each of which regards the variable pair $(w_i, \tilde{w}_i)$. Parallel techniques or multi-core computation can thus be employed to speed up the training process. Specifically, the optimization problem with respect to $(w_i, \tilde{w}_i)$ is

$$\min_{w_i,\tilde{w}_i}\ \frac{1}{2}\big(\gamma_1\|w_i\|_2^2 + \gamma_2\|\tilde{w}_i\|_2^2\big) + C\sum_{j=1}^n \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle \qquad \text{s.t.}\ \ Y_{ij}\langle w_i, D x_j\rangle \ge 1 - \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle,\quad \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle \ge 0,\ \forall j = 1,\dots,n, \tag{6}$$
and its dual form is (see supplementary materials)

$$\max_{\alpha,\beta}\ -\frac{1}{2}(\alpha\circ y_i^*)^T K_D (\alpha\circ y_i^*) + \mathbf{1}^T\alpha - \frac{1}{2\gamma}(\alpha+\beta-C\mathbf{1})^T \tilde{K}_i (\alpha+\beta-C\mathbf{1}) \tag{7}$$

with the parameter updates $\gamma \leftarrow \gamma_2/\gamma_1$, $C \leftarrow C/\gamma_1$ and the constraints $\alpha \succeq 0$, $\beta \succeq 0$, i.e. $\alpha_j \ge 0$, $\beta_j \ge 0$, $\forall j \in [1:n]$. Moreover, $y_i^* = [Y_{i1}, Y_{i2}, \dots, Y_{in}]^T$ is the label-wise vector for the $i$-th label, and $\circ$ is the Hadamard (element-wise) product of two vectors or matrices. $K_D \in \mathbb{R}^{n\times n}$ is the inner product (kernel) matrix of the $D$-based features, with $K_D(j,q) = \langle D x_j, D x_q\rangle$. $\tilde{K}_i$ is the inner product (kernel) matrix of the privileged label features with respect to the $i$-th label, where $\tilde{K}_i(j,q) = \langle \tilde{y}_{i,j}, \tilde{y}_{i,q}\rangle$. $\mathbf{1}$ is the vector with all ones.
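Both kernel matrices in Eq. (7) are plain Gram matrices and can be formed directly. A sketch, assuming dense numpy arrays `D` ($k \times d$), `X` ($d \times n$) and `Yt_i` ($L \times n$) holding the privileged label features $\tilde{y}_{i,j}$ of the $i$-th label as columns:

```python
import numpy as np

def kernel_matrices(D, X, Yt_i):
    """K_D(j,q) = <D x_j, D x_q> and K~_i(j,q) = <y~_{i,j}, y~_{i,q}>."""
    DX = D @ X                  # (k, n) embedded features D x_j
    K_D = DX.T @ DX             # (n, n)
    K_i = Yt_i.T @ Yt_i         # (n, n); Yt_i has its i-th row zeroed
    return K_D, K_i
```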
Pechyony et al. (2010) proposed an SMO-style algorithm (gSMO) for the SVM+ problem. However, because of the bias term, the Lagrange multipliers are tangled together in the dual problem, which leads to a more complicated constraint set

$$\{(\alpha,\beta)\,|\,\alpha^T y_i^* = 0,\ \mathbf{1}^T(\alpha+\beta-C\mathbf{1}) = 0,\ \alpha \succeq 0,\ \beta \succeq 0\}$$

than $\{(\alpha,\beta)\,|\,\alpha \succeq 0,\ \beta \succeq 0\}$ in our PrML. Hence, by absorbing the bias term, Eq. (6) produces a more compact dual problem with only non-negativity constraints. A coordinate descent (CD)¹ algorithm can be applied to solve the dual problem, and a closed-form solution can be obtained in each iteration step Li et al. (2016). After solving Eq. (7), according to the Karush-Kuhn-Tucker (KKT) conditions, the optimal solution of the primal problem (6) can be expressed by the Lagrange multipliers:

$$w_i = \sum_{j=1}^n \alpha_j Y_{ij} D x_j \tag{8}$$

$$\tilde{w}_i = \frac{1}{\gamma}\sum_{j=1}^n (\alpha_j + \beta_j - C)\,\tilde{y}_{i,j} \tag{9}$$
¹We optimize an equivalent "min" problem instead of the original "max" one.
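For concreteness, below is a minimal numpy sketch of dual coordinate descent on this "min" form of Eq. (7) for a single label, maintaining the primal variables of Eqs. (8) and (9) so that each coordinate gradient costs only two inner products. The update pattern is analogous to Algorithm 1 below restricted to one label; the epoch count and random sweep order are our simplifying assumptions, not the authors' exact implementation.

```python
import numpy as np

def svm_plus_cd(DX, y, Yt, gamma, C, epochs=50):
    """Dual CD for the bias-absorbed SVM+ subproblem (6)/(7) of one label.

    DX:    (k, n) embedded features D x_j as columns.
    y:     (n,)   labels Y_ij in {-1, +1} for this label.
    Yt:    (L, n) privileged label features y~_{i,j} as columns (nonzero norms assumed).
    gamma: gamma_2 / gamma_1 and C: C / gamma_1, per the updates stated after Eq. (7).
    """
    k, n = DX.shape
    alpha, beta = np.zeros(n), np.zeros(n)
    # Primal variables kept in sync with Eqs. (8) and (9).
    w = np.zeros(k)                                 # w_i = sum_j alpha_j Y_ij D x_j
    w_t = -(C / gamma) * Yt.sum(axis=1)             # w~_i at alpha = beta = 0
    # Diagonal second derivatives of the dual objective.
    q_dx = np.einsum('ij,ij->j', DX, DX)            # ||D x_j||^2
    q_yt = np.einsum('ij,ij->j', Yt, Yt) / gamma    # ||y~_{i,j}||^2 / gamma
    for _ in range(epochs):
        for j in np.random.permutation(n):
            # Coordinate step on alpha_j (projected Newton step).
            g = y[j] * (w @ DX[:, j]) - 1.0 + w_t @ Yt[:, j]
            delta = max(-alpha[j], -g / (q_dx[j] + q_yt[j]))
            alpha[j] += delta
            w += delta * y[j] * DX[:, j]
            w_t += (delta / gamma) * Yt[:, j]
            # Coordinate step on beta_j.
            g = w_t @ Yt[:, j]
            delta = max(-beta[j], -g / q_yt[j])
            beta[j] += delta
            w_t += (delta / gamma) * Yt[:, j]
    return w, w_t, alpha, beta
```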
3.2 OPTIMIZING $D$, $\tilde{W}$ WITH FIXED $W$

Given a fixed coefficient matrix $W$, we update and learn the linear transformation $D$ with the help of the comments provided by the privileged information. Thus problem (5) for $(D, \tilde{W})$ reduces to

$$\min_{D,\tilde{W}}\ \frac{1}{2}\|D\|_F^2 + \frac{\gamma_2}{2}\sum_{i=1}^L \|\tilde{w}_i\|_2^2 + C\sum_{i=1}^L\sum_{j=1}^n \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle \qquad \text{s.t.}\ \ Y_{ij}\langle D, w_i x_j^T\rangle \ge 1 - \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle,\quad \langle \tilde{w}_i, \tilde{y}_{i,j}\rangle \ge 0,\ \forall i = 1,\dots,L;\ j = 1,\dots,n. \tag{10}$$
Eq. (10) has $Ln$ constraints, each of which can be indexed with a two-dimensional subscript $[i,j]$. The Lagrange multipliers of Eq. (10) are thus two-dimensional as well. To make the dual problem of Eq. (10) consistent with Eq. (7), we define a bijection $\phi : [1:L]\times[1:n] \to [1:Ln]$ as the row-based vectorization index mapping, i.e. $\phi([i,j]) = (i-1)n + j$. In a nutshell, we arrange the constraints (and the multipliers) according to the order of row-based vectorization. In this way, the corresponding dual problem of Eq. (10) is formulated as (see details in the supplementary materials)

$$\max_{\alpha\succeq 0,\,\beta\succeq 0}\ -\frac{1}{2}(\alpha\circ y^*)^T K_W (\alpha\circ y^*) + \mathbf{1}^T\alpha - \frac{1}{2\gamma_2}(\alpha+\beta-C\mathbf{1})^T \tilde{K} (\alpha+\beta-C\mathbf{1}) \tag{11}$$

where $y^* = [y_1^*; y_2^*; \dots; y_L^*]$ is the row-based vectorization of $Y$ and $\tilde{K} = \mathrm{diag}(\tilde{K}_1, \tilde{K}_2, \dots, \tilde{K}_L)$ is a block diagonal matrix, which corresponds to the kernel matrix of privileged label features. $K_W$ is the kernel matrix of the input features, with every element $K_W(s,t) = \langle G_{\phi^{-1}(s)}, G_{\phi^{-1}(t)}\rangle$, where $G_{ij} = w_i x_j^T$. Based on the KKT conditions, $(D, \tilde{W})$ can be constructed from $(\alpha, \beta)$:

$$D = \sum_{s=1}^{Ln} \alpha_s y_s^* G_{\phi^{-1}(s)} \tag{12}$$

$$\tilde{w}_i = \frac{1}{\gamma_2}\sum_{j=1}^n \big(\alpha_{\phi([i,j])} + \beta_{\phi([i,j])} - C\big)\,\tilde{y}_{i,j} \tag{13}$$
In this way, Eq. (11) has an optimization form identical to Eq. (7), so we can also turn it over to the fast CD method Li et al. (2016). However, due to the subscript index mapping, directly using the method proposed in Li et al. (2016) is very expensive. Considering that the privileged kernel matrix $\tilde{K}$ is block sparse, we can further speed up the calculation. Details of the modified version of the dual CD algorithm for solving Eq. (11) are presented in Algorithm 1. Also note that one primary merit of this algorithm is that it is free of calculating the whole kernel matrix; instead, we only need its diagonal elements, as in line 2 of Algorithm 1.
3.3 FRAMEWORK OF PRML

Our proposed privileged multi-label learning is summarized in Algorithm 2. As indicated in Algorithm 2, both $D$ and $W$ are updated with the help of comments from the privileged information. Note that the primal variables and dual variables are connected via the KKT conditions, and thus in real applications lines 5-6 and 8-9 in Algorithm 2 can be implemented iteratively. Since each subproblem is actually a linear SVM+ optimization solved by the CD method, its convergence is consistent with that of the dual CD algorithm for linear SVM Hsieh et al. (2008). Due to the cheap updates, Hsieh et al. (2008); Li et al. (2016) empirically showed that it can be much faster than SMO-style methods and many other convex solvers when $d$ (the number of features) is large. Moreover, the independence of labels in Problem (6) enables the use of parallel techniques and multi-core computation to accommodate a large $L$ (number of labels). As for a large $n$ (number of examples) (also a large $L$ for Problem (10)), we can use the mini-batch CD method Takac et al. (2015), where each time a batch of examples is selected and the CD updates are applied to them in parallel, i.e. lines 5-17 can be implemented in parallel. Also, recently Chiang et al. (2016) designed a framework for parallel CD and achieved significant speed-ups even when $d$ and $n$ are very large. Thus, our model can scale with $d$, $L$ and $n$. In addition, the solution of each subproblem is also unique, as Theorem 1 states.

Theorem 1. The solution to problem (6) or (10) is unique for any $\gamma_1 > 0$, $\gamma_2 > 0$, $C > 0$.
Algorithm 1 A dual coordinate descent algorithm for solving Eq. (11)

Input: Training data: feature matrix $X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{d\times n}$, label matrix $Y = [y_1, y_2, \dots, y_n] \in \{-1,1\}^{L\times n}$. Privileged label features: $\tilde{y}_{i,j}$ of the $i$-th label and $j$-th example. Low-dimensional embedding projection: $W = [w_1, \dots, w_L] \in \mathbb{R}^{k\times L}$. Learning parameters: $\gamma, C > 0$.
1: Define the bijection $\phi : [1:L]\times[1:n] \to [1:Ln]$ as the row-based vectorization index mapping, i.e. $\phi([i,j]) = (i-1)n + j$. Denote its inverse as $\phi^{-1}$.
2: Construct $Q \in \mathbb{R}^{2Ln}$ for each $s \in [1:2Ln]$: for $1 \le s \le Ln$, $[i,j] \leftarrow \phi^{-1}(s)$, $Q_s = \|w_i\|_2^2\|x_j\|_2^2 + \frac{1}{\gamma}\|\tilde{y}_{i,j}\|_2^2$; for $Ln+1 \le s \le 2Ln$, $[i,j] \leftarrow \phi^{-1}(s-Ln)$, $Q_s = \frac{1}{\gamma}\|\tilde{y}_{i,j}\|_2^2$.
3: Initialization: $\eta = [\alpha; \beta] \leftarrow 0$, $D \leftarrow 0$ and $\tilde{w}_i \leftarrow -\frac{C}{\gamma}\sum_{j=1}^n \tilde{y}_{i,j}$ for each $i \in [1:L]$.
4: while not converged do
5:   randomly pick an index $s$
6:   if $1 \le s \le Ln$ then
7:     $[i,j] \leftarrow \phi^{-1}(s)$
8:     $\nabla_s \leftarrow Y_{ij} w_i^T D x_j - 1 + \tilde{w}_i^T \tilde{y}_{i,j}$
9:     $\delta \leftarrow \max\{-\eta_s, -\nabla_s/Q_s\}$
10:    $D \leftarrow D + \delta Y_{ij} w_i x_j^T$
11:  else if $Ln+1 \le s \le 2Ln$ then
12:    $[i,j] \leftarrow \phi^{-1}(s-Ln)$
13:    $\nabla_s \leftarrow \tilde{w}_i^T \tilde{y}_{i,j}$
14:    $\delta \leftarrow \max\{-\eta_s, -\nabla_s/Q_s\}$
15:  end if
16:  $\tilde{w}_i \leftarrow \tilde{w}_i + (\delta/\gamma)\tilde{y}_{i,j}$
17:  $\eta_s \leftarrow \eta_s + \delta$
18: end while
Output: Dictionary $D$ and correcting functions $\{\tilde{w}_i\}_{i=1}^L$.
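The index bookkeeping in Algorithm 1 is easy to get wrong, so here is a short sketch of the bijection $\phi$ and of line 2, computing the diagonal $Q$ without forming any full kernel matrix (only the block-diagonal structure of $\tilde{K}$ and the rank-one structure of the $G_{ij}$ are used); the array names are ours.

```python
import numpy as np

def phi(i, j, n):          # [1:L] x [1:n] -> [1:Ln], 1-indexed as in the paper
    return (i - 1) * n + j

def phi_inv(s, n):         # inverse of the row-based vectorization mapping
    return (s - 1) // n + 1, (s - 1) % n + 1

def diag_Q(W, X, Yt_sq_norms, gamma):
    """Q_s for s in [1:2Ln], as in line 2 of Algorithm 1.

    W: (k, L); X: (d, n); Yt_sq_norms[i-1, j-1] = ||y~_{i,j}||_2^2.
    Uses ||G_{ij}||_F^2 = ||w_i||^2 ||x_j||^2, so no kernel matrix is built.
    """
    w_norms = (W * W).sum(axis=0)                             # ||w_i||^2, shape (L,)
    x_norms = (X * X).sum(axis=0)                             # ||x_j||^2, shape (n,)
    first = np.outer(w_norms, x_norms) + Yt_sq_norms / gamma  # s in [1, Ln]
    second = Yt_sq_norms / gamma                              # s in [Ln+1, 2Ln]
    return np.concatenate([first.ravel(), second.ravel()])   # row-based order
```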
Algorithm 2 Privileged Multi-label Learning (PrML)

Input: Training data: feature matrix $X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{d\times n}$, label matrix $Y = [y_1, y_2, \dots, y_n] \in \{-1,1\}^{L\times n}$. Learning parameters: $\gamma_1, \gamma_2, C \ge 0$.
1: Construct the privileged label features for each label and each training point, e.g. as in Eq. (3).
2: Initialize $D$.
3: while not converged do
4:   for each $i \in [1:L]$ do
5:     $[\alpha, \beta] \leftarrow$ solving Eq. (7)
6:     update $w_i, \tilde{w}_i$ according to Eq. (8) and Eq. (9)
7:   end for
8:   $[\alpha, \beta] \leftarrow$ solving Eq. (11)
9:   update $D, \tilde{W}$ according to Eq. (12) and Eq. (13)
10: end while
Output: A linear multi-label classifier $Z = D^T W$, together with a correcting function $\tilde{W}$ w.r.t. the $L$ labels.
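Putting the pieces together, the outer loop of Algorithm 2 simply alternates the two convex subproblems. The sketch below reuses `svm_plus_cd` from the earlier sketch and assumes a hypothetical `solve_D_step` implementing Algorithm 1; following the convergence discussion below, the change in $\tilde{W}$ is used as the stopping barometer.

```python
import numpy as np

def prml_train(X, Y, Yt, k, gamma1, gamma2, C, max_outer=20, tol=1e-4):
    """Alternating optimization of Eq. (5); a sketch, not the authors' code.

    X: (d, n) features; Y: (L, n) labels in {-1, +1};
    Yt[i]: (L, n) privileged label features of the (i+1)-th label as columns.
    """
    d, n = X.shape
    L = Y.shape[0]
    rng = np.random.default_rng(0)
    D = rng.standard_normal((k, d)) / np.sqrt(d)   # line 2: initialize D
    W = np.zeros((k, L))
    W_tilde = np.zeros((L, L))                     # column i holds w~_i
    for _ in range(max_outer):
        W_tilde_old = W_tilde.copy()
        DX = D @ X
        for i in range(L):                         # lines 4-7: per-label SVM+ steps
            w, w_t, _, _ = svm_plus_cd(DX, Y[i], Yt[i],
                                       gamma2 / gamma1, C / gamma1)
            W[:, i], W_tilde[:, i] = w, w_t
        # lines 8-9: D-step via Algorithm 1 (solve_D_step is hypothetical here).
        D, W_tilde = solve_D_step(X, Y, Yt, W, gamma2, C)
        if np.linalg.norm(W_tilde - W_tilde_old) < tol:
            break                                  # W~ as the convergence barometer
    return D, W, W_tilde
```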
Proof skeleton. Both Eq. (6) and Eq. (10) can be cast into an identical SVM+ optimization, with an objective function of the form $F = \frac{1}{2}\|w\|_2^2 + \frac{\gamma}{2}\|\tilde{w}\|_2^2 + C\sum_{j=1}^n \langle \tilde{w}, \tilde{x}_j\rangle$ and a closed convex feasible solution set. Denote $u := (w; \tilde{w})$ and assume two optimal solutions $u_1, u_2$; then $F(u_1) = F(u_2)$. Let $u_t = (1-t)u_1 + t u_2$, $t \in [0,1]$; then $F(u_t) \le (1-t)F(u_1) + tF(u_2) = F(u_1)$, and thus $F(u_t) = F(u_1) = F(u_2)$ for all $t \in [0,1]$, which implies that $g(t) = F(u_t) - F(u_1) \equiv 0$, $\forall t \in [0,1]$. Moreover, we have

$$g(t) = \frac{t^2}{2}\|w_2 - w_1\|_2^2 + t\langle w_2 - w_1, w_1\rangle + \frac{\gamma t^2}{2}\|\tilde{w}_2 - \tilde{w}_1\|_2^2 + \gamma t\langle \tilde{w}_2 - \tilde{w}_1, \tilde{w}_1\rangle + tC\sum_{j=1}^n \langle \tilde{w}_2 - \tilde{w}_1, \tilde{x}_j\rangle,$$

$$g''(t) = \|w_2 - w_1\|_2^2 + \gamma\|\tilde{w}_2 - \tilde{w}_1\|_2^2 = 0.$$

For $\gamma > 0$ we then have $w_1 = w_2$ and $\tilde{w}_1 = \tilde{w}_2$.
Table 1: Data statistics. $n$ is the total number of examples; $d$ and $L$ are the numbers of features and labels, respectively; $\bar{L}$ and Den($L$) are the average number of positive labels per instance and the label density, respectively; 'Type' means the feature type.

Dataset     n      d     L     L̄      Den(L)  Type
enron       1702   1001  53    3.378  0.064   nominal
yeast       2417   103   14    4.237  0.303   numeric
corel5k     5000   499   374   3.522  0.009   nominal
bibtex      7395   1836  159   2.402  0.015   nominal
eurlex      19348  5000  3993  5.310  0.001   nominal
mediamill   43907  120   101   4.376  0.043   numeric
Figure 1: Some of the image annotation results of the proposed PrML on the benchmark corel5k dataset. Tags: red, ground-truth; blue, correct predictions; black, wrong predictions.
The proof of Theorem 1 mainly relies on the strict convexity of the objective function in either Eq. (6) or Eq. (10); concrete details are given in the supplementary materials. In this way, the correcting function $\tilde{W}$ serves as a bridge to channel $D$ and $W$, and the convergence of $\tilde{W}$ implies the convergence of $D$ and $W$. Thus we can take $\tilde{W}$ as the barometer of the whole algorithm's convergence.
4 EXPERIMENTAL RESULTS

In this section, we conduct various experiments on benchmark datasets to validate the effectiveness of using the intrinsic privileged information for multi-label learning. In addition, we also investigate the performance and superiority of the proposed PrML model compared to recent competing multi-label methods.
4.1 EXPERIMENT CONFIGURATION

Datasets. We select six benchmark multi-label datasets: enron, yeast, corel5k, bibtex, eurlex and mediamill. Specially, we consider the cases where $d$ (eurlex), $L$ (eurlex) and $n$ (corel5k, bibtex, eurlex & mediamill) are large, respectively. Also note that enron, corel5k, bibtex and eurlex have sparse features. See Table 1 for the details of these datasets.
Comparison approaches.
1). BR (binary relevance) Tsoumakas et al. (2010). An SVM is trained with respect to each label.
2). ECC (ensembles of classifier chains) Read et al. (2011). It turns MLL into a series of binary classification problems.
3). RAKEL (random k-labelsets) Tsoumakas et al. (2011). It transforms MLL into an ensemble of multi-class classification problems.
4). LEML (low-rank empirical risk minimization for multi-label learning) Yu et al. (2014). It is a low-rank embedding approach cast into the ERM framework.
Figure 2: Classification results of PrML (blue solid line) and LEML (red dashed line) on corel5k (50% for training & 50% for testing) w.r.t. different dimensions of the privileged label features. Panels: (a) Hamming loss ↓; (b) One-error ↓; (c) Coverage ↓; (d) Ranking loss ↓; (e) Average precision ↑; (f) Macro-averaging AUC ↑.
5). ML2 (multi-label manifold learning) Hou et al. (2016). It is a recent multi-label learning method based on the manifold assumption in the label space.
Evaluation Metrics. We use six prevalent metrics to evaluate the performance of all methods: Hamming loss, One-error, Coverage, Ranking loss, Average precision (Aver precision) and Macro-averaging AUC (MacAUC). Note that all evaluation metrics have the value range [0,1]. In addition, for the first four metrics, smaller values indicate better classification performance, and we use ↓ to index this positive logic. On the contrary, for the last two metrics larger values represent better performance, indexed by ↑.
4.2 ALGORITHM ANALYSIS

Performance visualization. First we analyze our proposed PrML in a global sense: we select the benchmark image dataset corel5k and visualize the image annotation results to directly examine how PrML functions. For the learning process, we randomly selected 50% of the examples without repetition as the training set and the rest as the testing set. In our experiment, parameter $\gamma_1$ is set to 1; $\gamma_2$ and $C$ are in the ranges $0.1 \sim 2.0$ and $10^{-3} \sim 20$ respectively, and are determined using cross validation on a part of the training points. The embedding dimension $k$ is set to $k = \lceil 0.9L\rceil$, where $\lceil r\rceil$ is the smallest integer no less than $r$. Some of the annotation results are presented in Figure 1, where the left tags are the ground-truth and the right ones are the top five predicted tags.
As shown in Figure 1, we can safely conclude that the proposed PrML performs well on the image annotation tasks; it predicts the semantic labels correctly in most cases. Note that although in some cases the predicted labels are not in the ground-truth, they are essentially related in a semantic sense. For example, the "swimmers" in image (b) is much more natural when it comes with the "pool". Moreover, we can also see that PrML makes supplementary predictions to describe images, enriching the corresponding content. For example, the "grass" in image (c) and "buildings" in image (d) are objects missing from the ground-truth labels.
Validation of privileged label features. We then validate the effectiveness of the proposed privileged information for multi-label learning. As discussed previously, the privileged label features serve as guidance or comments from an Oracle teacher to connect the learning of all the labels. For the sake of fairness, we simply implement the validation experiments with LEML (without privileged label features) and PrML (with privileged label features). Note that our proposed privileged label features are composed of the values of labels; however, not all labels have prominent connections in multi-label learning Sun et al. (2014). Thus we selectively construct the privileged label features with respect to each label.
Particularly, we just use the K-nearest neighbor rule to form the pool per label. For each label, only labels in its label pool, instead of the whole label set, are reckoned to provide mutual guidance during its learning. In our implementation, we simply utilize the Hamming distance to accomplish the K-nearest neighbor search on the dataset corel5k.
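A sketch of this pool construction, assuming the pools are computed from the training label matrix `Y` ($L \times n$, entries in $\{-1,+1\}$) and that each label keeps its $K$ nearest labels under Hamming distance; the pool size $K$ and the helper name are ours:

```python
import numpy as np

def label_pools(Y, K):
    """For each label, pick the K nearest other labels by Hamming distance."""
    L, n = Y.shape
    # For +/-1 rows: #disagreements = (n - <y_a, y_b>) / 2.
    hamming = (n - Y @ Y.T) / 2.0               # (L, L) pairwise Hamming distances
    np.fill_diagonal(hamming, np.inf)           # exclude the label itself
    return np.argsort(hamming, axis=1)[:, :K]   # (L, K) pool indices per label
```

The privileged label feature of label $i$ would then keep only the entries indexed by its pool, rather than all the other $L-1$ labels.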
The experimental setting is the same as before, and both algorithms share the same embedding dimension $k$. Moreover, we carry out independent tests ten times and the average results are shown in Figure 2.
As shown in Figure 2, we have the following two observations. (a) PrML is clearly superior to LEML when we select enough labels as privileged label features, e.g. more than 350 labels on the corel5k dataset. Since their only difference lies in the usage of the privileged information, we can conclude that the guidance from the privileged information, i.e. the proposed privileged label features, can significantly improve the performance of multi-label learning. (b) With more labels involved in the privileged label features, the performance of PrML keeps improving at a steady speed, and when the dimension of the privileged label features is large enough, the performance tends to stabilize on the whole.
The number of labels is directly related to the complexity of the correcting function, which is defined as a linear function. Thus few labels might induce a low function complexity, so that the correcting function cannot determine the optimal slack variables. In this way, the fault-tolerant capacity would be crippled, and the performance can even be worse than LEML. For example, when the dimension of the privileged labels is less than 250 on corel5k, the Hamming loss, One-error, Coverage and Ranking loss of PrML are much larger than those of LEML. In contrast, too many labels might introduce unnecessary guidance, and the extra labels thus make no contribution to further improvement of the classification performance. For instance, the performance with 365 labels involved in the privileged label features is on par with that of all the other (373) labels in Hamming loss, One-error, Ranking loss and Average precision. Moreover, in real applications, it is still a safe choice to involve all the other labels in the privileged information.
4.3 PERFORMANCE COMPARISON

Now we formally analyze the performance of the proposed privileged multi-label learning (PrML) in comparison with popular state-of-the-art methods. For each dataset, we randomly selected 50% of the examples without repetition as the training set and the rest for testing. For the results' credibility, the dataset division process is implemented ten times independently and we record the corresponding results of each trial. Parameters $\gamma_1$, $\gamma_2$ and $C$ are determined in the same manner as before. As for the low embedding dimension $k$, following the wisdom of Yu et al. (2014), we choose $k$ from $\{\lceil 0.8L\rceil, \lceil 0.85L\rceil, \lceil 0.9L\rceil, \lceil 0.95L\rceil\}$, determined by cross validation using a part of the training points. In particular, we also include PrBR (privileged information + BR) to further investigate the proposed privileged information. The detailed results are reported in Table 2.
From Table 2, we can see that the proposed PrML is comparable to the state-of-the-art ML2 method, and significantly surpasses the other competing multi-label methods. Concretely, across all evaluation metrics and datasets, PrML ranks first in 52.8% of the cases and in the first two in all cases; even in the second place, PrML's performance is close to the top one. Comparing BR & PrBR, and LEML & PrML, we can safely infer that the privileged information plays an important role in enhancing the classification performance of multi-label predictors. Besides, in all 36 cases, PrML wins 34 cases against PrBR and plays a tie twice, in Ranking loss on enron and Hamming loss on corel5k respectively, which implies that the low-rank structure in PrML has a positive impact in further improving the multi-label performance. Therefore, we can see that PrML inherits the merits of both the low-rank parameter structure and the privileged label information. In addition, PrML and LEML tend to perform better on datasets with more labels (>100). This might be because the low-rank assumption is more sensible when the number of labels is considerably large.
Table 2: Average predictive performance (mean ± std. deviation) over ten independent trials for various multi-label learning methods. In each trial, 50% of the examples are randomly selected without repetition as the training set and the rest as the testing set.
dataset    method  Hamming loss↓  One-error↓    Coverage↓     Ranking loss↓  Aver precision↑  MacAUC↑

enron      BR      0.060±0.001    0.498±0.012   0.595±0.010   0.308±0.007    0.449±0.011      0.579±0.007
           ECC     0.056±0.001    0.293±0.008   0.349±0.014   0.133±0.004    0.651±0.006      0.646±0.008
           RAKEL   0.058±0.001    0.412±0.016   0.523±0.008   0.241±0.005    0.539±0.006      0.596±0.007
           LEML    0.049±0.002    0.320±0.004   0.276±0.005   0.117±0.006    0.661±0.004      0.625±0.007
           ML2     0.051±0.001    0.258±0.090   0.256±0.017   0.090±0.012    0.681±0.053      0.714±0.021
           PrBR    0.053±0.001    0.342±0.010   0.238±0.006   0.088±0.003    0.618±0.004      0.638±0.005
           PrML    0.050±0.001    0.288±0.005   0.221±0.005   0.088±0.006    0.685±0.005      0.674±0.004

yeast      BR      0.201±0.003    0.256±0.008   0.641±0.005   0.315±0.005    0.672±0.005      0.565±0.003
           ECC     0.207±0.003    0.244±0.009   0.464±0.005   0.186±0.003    0.752±0.006      0.646±0.003
           RAKEL   0.202±0.003    0.251±0.008   0.558±0.006   0.245±0.004    0.720±0.005      0.614±0.003
           LEML    0.201±0.004    0.224±0.003   0.480±0.005   0.174±0.004    0.751±0.006      0.642±0.004
           ML2     0.196±0.003    0.228±0.009   0.454±0.004   0.168±0.003    0.765±0.005      0.702±0.007
           PrBR    0.227±0.004    0.237±0.006   0.487±0.005   0.204±0.003    0.719±0.005      0.623±0.004
           PrML    0.201±0.003    0.214±0.005   0.459±0.004   0.165±0.003    0.771±0.003      0.685±0.003

corel5k    BR      0.012±0.001    0.849±0.008   0.898±0.003   0.655±0.004    0.101±0.003      0.518±0.001
           ECC     0.015±0.001    0.699±0.006   0.562±0.007   0.292±0.003    0.264±0.003      0.568±0.003
           RAKEL   0.012±0.001    0.819±0.010   0.886±0.004   0.627±0.004    0.122±0.004      0.521±0.001
           LEML    0.010±0.001    0.683±0.006   0.273±0.008   0.125±0.003    0.268±0.005      0.622±0.006
           ML2     0.010±0.001    0.647±0.007   0.372±0.006   0.163±0.003    0.297±0.002      0.667±0.007
           PrBR    0.010±0.001    0.740±0.007   0.367±0.005   0.165±0.004    0.227±0.004      0.560±0.005
           PrML    0.010±0.001    0.675±0.003   0.266±0.007   0.118±0.003    0.282±0.005      0.651±0.004

bibtex     BR      0.015±0.001    0.559±0.004   0.461±0.006   0.303±0.004    0.363±0.004      0.624±0.002
           ECC     0.017±0.001    0.404±0.003   0.327±0.008   0.192±0.003    0.515±0.004      0.763±0.003
           RAKEL   0.015±0.001    0.506±0.005   0.443±0.006   0.286±0.003    0.399±0.004      0.641±0.002
           LEML    0.013±0.001    0.394±0.004   0.144±0.002   0.082±0.003    0.534±0.002      0.757±0.003
           ML2     0.013±0.001    0.365±0.004   0.128±0.003   0.067±0.002    0.596±0.004      0.911±0.002
           PrBR    0.014±0.001    0.426±0.004   0.178±0.010   0.096±0.005    0.529±0.009      0.702±0.003
           PrML    0.012±0.001    0.367±0.003   0.131±0.007   0.066±0.003    0.571±0.004      0.819±0.005

eurlex     BR      0.018±0.004    0.537±0.002   0.322±0.008   0.186±0.009    0.388±0.005      0.689±0.007
           ECC     0.011±0.003    0.492±0.003   0.298±0.004   0.155±0.006    0.458±0.004      0.787±0.009
           RAKEL   0.009±0.004    0.496±0.007   0.277±0.009   0.161±0.001    0.417±0.010      0.822±0.005
           LEML    0.003±0.002    0.447±0.005   0.233±0.003   0.103±0.010    0.488±0.006      0.821±0.014
           ML2     0.001±0.001    0.320±0.001   0.171±0.003   0.045±0.007    0.497±0.003      0.885±0.003
           PrBR    0.007±0.008    0.484±0.003   0.229±0.009   0.108±0.009    0.455±0.003      0.793±0.008
           PrML    0.001±0.002    0.299±0.003   0.192±0.008   0.057±0.002    0.526±0.009      0.892±0.004

mediamill  BR      0.031±0.001    0.200±0.003   0.575±0.003   0.230±0.001    0.502±0.002      0.510±0.001
           ECC     0.035±0.001    0.150±0.005   0.467±0.009   0.179±0.008    0.597±0.014      0.524±0.001
           RAKEL   0.031±0.001    0.181±0.002   0.560±0.002   0.222±0.001    0.521±0.001      0.513±0.001
           LEML    0.030±0.001    0.126±0.003   0.184±0.007   0.084±0.004    0.720±0.007      0.699±0.010
           ML2     0.035±0.002    0.231±0.004   0.278±0.003   0.121±0.003    0.647±0.002      0.847±0.003
           PrBR    0.031±0.001    0.147±0.005   0.255±0.003   0.092±0.002    0.648±0.003      0.641±0.004
           PrML    0.029±0.002    0.130±0.002   0.172±0.004   0.055±0.006    0.726±0.002      0.727±0.008
5 CONCLUSION

In this paper, we investigate the intrinsic privileged information to connect labels in multi-label learning. Tactfully, we regard the label values as the privileged label features. This strategy indicates that for each label's learning, the other labels of each example may serve as its Oracle comments on the learning of this label. Without requiring additional data, we propose to actively construct privileged label features directly from the label space. Then we integrate the privileged information with the low-rank hypotheses $Z = D^T W$ in multi-label learning, and formulate privileged multi-label learning (PrML) as a result. During the optimization, both the dictionary $D$ and the coefficient matrix $W$ receive the comments from the privileged information. Experimental results show that with this privileged information, the classification performance can be significantly improved. Thus we can also take the privileged label features as a way to boost the classification performance of low-rank based models.
As for future work, our proposed PrML can be easily extended into a kernel version to cope with nonlinearity in the parameter space. Besides, using an SVM-style $L_2$-hinge loss might further improve the training efficiency Xu et al. (2016). Theoretical guarantees will also be investigated.