Fitting Spectral Decay with the k-Support Norm

Andrew M. McDonald (1), Massimiliano Pontil (1,2), Dimitris Stamos (1)

(1) Department of Computer Science
University College London
Gower Street, London WC1E 6BT, UK
email: {a.mcdonald,d.stamos.12}@ucl.ac.uk

(2) Istituto Italiano di Tecnologia
Via Morego 30, 16163 Genova, Italy

January 5, 2016

Abstract
The spectral k-support norm enjoys good estimation properties in low rank matrix learning problems, empirically outperforming the trace norm. Its unit ball is the convex hull of rank k matrices with unit Frobenius norm. In this paper we generalize the norm to the spectral (k,p)-support norm, whose additional parameter p can be used to tailor the norm to the decay of the spectrum of the underlying model. We characterize the unit ball and we explicitly compute the norm. We further provide a conditional gradient method to solve regularization problems with the norm, and we derive an efficient algorithm to compute the Euclidean projection on the unit ball in the case $p = \infty$. In numerical experiments, we show that allowing p to vary significantly improves performance over the spectral k-support norm on various matrix completion benchmarks, and better captures the spectral decay of the underlying model.

Keywords. k-support norm, orthogonally invariant norms, matrix completion, multitask learning, proximal point algorithms.
1 Introduction
The problem of learning a sparse vector or a low rank matrix has generated much interest in recent years. A popular approach is to use convex regularizers which encourage sparsity, and a number of these have been studied with applications including image denoising, collaborative filtering and multitask learning, see for example [Buehlmann and van der Geer 2011, Wainwright 2014] and references therein.
Recently, the k-support norm was proposed by [Argyriou et al. 2012], motivated as a tight relaxation of the set of k-sparse vectors of unit Euclidean norm. The authors argue that as a regularizer for sparse vector estimation, the norm empirically outperforms the Lasso [Tibshirani 1996] and Elastic Net [Zou and Hastie 2005] penalties. Statistical bounds on the Gaussian width of the k-support norm have been provided by [Chatterjee et al. 2014]. The k-support norm has also been extended to the matrix setting. By applying the norm to the vector of singular values of a matrix, [McDonald et al. 2014] obtain the orthogonally invariant spectral k-support norm, reporting state of the art performance on matrix completion benchmarks.
Motivated by the performance of the k-support norm in sparse vector and matrix learning problems, in this paper we study a natural generalization obtained by considering the $\ell_p$-norms (for $p \in [1,\infty]$) in place of the Euclidean norm. These allow a further degree of freedom when fitting a model to the underlying data. We denote the ensuing norm the (k,p)-support norm. As we demonstrate in numerical experiments, using $p = 2$ is not necessarily the best choice in all instances. By tuning the value of p the model can incorporate prior information regarding the singular values. When prior knowledge is lacking, the parameter can be chosen by validation, hence the model can adapt to a variety of decay patterns of the singular values. An interesting property of the norm is that it interpolates between the $\ell_1$-norm (for $k = 1$) and the $\ell_p$-norm (for $k = d$). It follows that by varying both k and p, the norm allows one to learn sparse vectors which exhibit different patterns of decay in the non-zero elements. In particular, when $p = \infty$ the norm prefers vectors which are constant.
A main goal of the paper is to study the proposed norm in matrix learning problems. The (k,p)-support norm is a symmetric gauge function, hence it induces the orthogonally invariant spectral (k,p)-support norm. This interpolates between the trace norm (for $k = 1$) and the Schatten p-norms (for $k = d$), and its unit ball has a simple geometric interpretation as the convex hull of matrices of rank no greater than k and Schatten p-norm no greater than one. This suggests that the new norm favors low rank structure, and the effect of varying p allows different patterns of decay in the spectrum. In the special case of $p = \infty$, the (k,p)-support norm is the dual of the Ky-Fan k-norm [Bhatia 1997] and it encourages a flat spectrum when used as a regularizer.
The main contributions of the paper are: i) we propose the (k,p)-support norm as an extension of the k-support norm and we characterize in particular the unit ball of the induced orthogonally invariant matrix norm (Section 3); ii) we show that the norm can be computed efficiently and we discuss the role of the parameter p (Section 4); iii) we outline a conditional gradient method to solve the associated regularization problem for both vector and matrix problems (Section 5), and in the special case $p = \infty$ we provide an $O(d \log d)$ computation of the projection operator (Section 5.1); finally, iv) we present numerical experiments on matrix completion benchmarks which demonstrate that the proposed norm offers significant improvement over previous methods, and we discuss the effect of the parameter p (Section 6). The appendix contains derivations of results which are sketched in, or omitted from, the main body of the paper.
Notation. We use $\mathbb{N}_n$ for the set of integers from 1 up to and including $n$. We let $\mathbb{R}^d$ be the $d$-dimensional real vector space, whose elements are denoted by lower case letters. For any vector $w \in \mathbb{R}^d$, its support is defined as $\mathrm{supp}(w) = \{i \in \mathbb{N}_d : w_i \neq 0\}$, and its cardinality is defined as $\mathrm{card}(w) = |\mathrm{supp}(w)|$. We let $\mathbb{R}^{d \times m}$ be the space of $d \times m$ real matrices. We denote the rank of a matrix $W$ as $\mathrm{rank}(W)$. We let $\sigma(W) \in \mathbb{R}^r$ be the vector formed by the singular values of $W$, where $r = \min(d,m)$, and where we assume that the singular values are ordered nonincreasingly, that is, $\sigma_1(W) \geq \cdots \geq \sigma_r(W) \geq 0$. For $p \in [1,\infty)$ the $\ell_p$-norm of a vector $w \in \mathbb{R}^d$ is defined as $\|w\|_p = (\sum_{i=1}^d |w_i|^p)^{1/p}$, and $\|w\|_\infty = \max_{i=1}^d |w_i|$. Given a norm $\|\cdot\|$ on $\mathbb{R}^d$ or $\mathbb{R}^{d \times m}$, $\|\cdot\|_*$ denotes the corresponding dual norm, defined by $\|u\|_* = \sup\{\langle u, w \rangle : \|w\| \leq 1\}$. The convex hull of a subset $S$ of a vector space is denoted $\mathrm{co}(S)$.
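To make the notation concrete, the following is a minimal NumPy sketch of these primitives; the function names are ours, chosen for illustration only.

```python
import numpy as np

def supp(w):
    # Support of w: indices of the nonzero components.
    return np.flatnonzero(w)

def card(w):
    # Cardinality: number of nonzero components of w.
    return supp(w).size

def sigma(W):
    # Singular values of W in nonincreasing order
    # (np.linalg.svd already returns them sorted this way).
    return np.linalg.svd(W, compute_uv=False)

w = np.array([0.0, 3.0, -1.0, 0.0])
print(supp(w), card(w))        # [1 2] 2
print(sigma(2.0 * np.eye(3)))  # [2. 2. 2.]
```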
2 Background and Previous Work
For every $k \in \mathbb{N}_d$, the k-support norm $\|\cdot\|_{(k)}$ is defined as the norm whose unit ball is given by

$$\mathrm{co}\big\{w \in \mathbb{R}^d : \mathrm{card}(w) \leq k,\ \|w\|_2 \leq 1\big\}, \qquad (2.1)$$

that is, the convex hull of the set of vectors of cardinality at most k and $\ell_2$-norm no greater than one [Argyriou et al. 2012]. We readily see that for $k = 1$ and $k = d$ we recover the unit ball of the $\ell_1$ and $\ell_2$-norms, respectively.
The k-support norm of a vector $w \in \mathbb{R}^d$ can be expressed as an infimal convolution [Rockafellar 1970, p. 34],

$$\|w\|_{(k)} = \inf_{(v_g)} \Big\{ \sum_{g \in \mathcal{G}_k} \|v_g\|_2 \;:\; \sum_{g \in \mathcal{G}_k} v_g = w \Big\}, \qquad (2.2)$$

where $\mathcal{G}_k$ is the collection of all subsets of $\mathbb{N}_d$ containing at most k elements and the infimum is over all vectors $v_g \in \mathbb{R}^d$ such that $\mathrm{supp}(v_g) \subseteq g$, for $g \in \mathcal{G}_k$. Equation (2.2) highlights that the k-support norm is a special case of the group lasso with overlap [Jacob et al. 2009], where the cardinality of the support sets is at most k. This expression suggests that when used as a regularizer, the norm encourages vectors w to be a sum of a limited number of vectors with small support. Due to the variational form of (2.2), computing the norm is not straightforward; however, [Argyriou et al. 2012] note that the dual norm has a simple form, namely it is the $\ell_2$-norm of the k largest components,
$$\|u\|_{(k),*} = \sqrt{\sum_{i=1}^{k} \big(|u|^\downarrow_i\big)^2}, \qquad u \in \mathbb{R}^d, \qquad (2.3)$$

where $|u|^\downarrow$ is the vector obtained from u by reordering its components so that they are nonincreasing in absolute value. Note also from equation (2.3) that for $k = 1$ and $k = d$, the dual norm is equal to the $\ell_\infty$-norm and $\ell_2$-norm, respectively, which agrees with our earlier observation regarding the primal norm.
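As an illustration of (2.3), the dual norm only requires sorting, as in the following NumPy sketch (the function name is ours):

```python
import numpy as np

def k_support_dual(u, k):
    # Dual k-support norm (2.3): l2-norm of the k largest
    # components of u in absolute value.
    top_k = np.sort(np.abs(u))[::-1][:k]
    return np.sqrt(np.sum(top_k ** 2))

u = np.array([3.0, -4.0, 1.0, 0.5])
print(k_support_dual(u, 1))  # 4.0, the l-infinity norm (k = 1)
print(k_support_dual(u, 4))  # ~5.12, the l2-norm (k = d)
```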
A related problem which has been studied in recent years is learning a matrix from a set of linear measurements, in which the underlying matrix is assumed to have sparse spectrum (low rank). The trace norm, the $\ell_1$-norm of the singular values of a matrix, has been shown to perform well in this setting, see e.g. [Argyriou et al. 2008, Jaggi and Sulovsky 2010]. Recall that a norm $\|\cdot\|$ on $\mathbb{R}^{d \times m}$ is called orthogonally invariant if $\|W\| = \|UWV\|$, for any orthogonal matrices $U \in \mathbb{R}^{d \times d}$ and $V \in \mathbb{R}^{m \times m}$. A classical result by von Neumann establishes that a norm is orthogonally invariant if and only if it is of the form $\|W\| = g(\sigma(W))$, where $\sigma(W)$ is the vector formed by the singular values of W in nonincreasing order, and g is a symmetric gauge function [Von Neumann 1937]. In other words, g is a norm which is invariant under permutations and sign changes of the vector components, that is $g(w) = g(Pw) = g(Jw)$, where P is any permutation matrix and J is diagonal with entries equal to $\pm 1$ [Horn and Johnson 1991, p. 438].
Examples of symmetric gauge functions are the $\ell_p$-norms for $p \in [1,\infty]$, and the corresponding orthogonally invariant norms are called the Schatten p-norms [Horn and Johnson 1991, p. 441]. In particular, those include the trace norm and Frobenius norm for $p = 1$ and $p = 2$, respectively. Regularization with Schatten p-norms has been previously studied by [Argyriou et al. 2007] and a statistical analysis has been performed by [Rohde and Tsybakov 2011]. As the set $\mathcal{G}_k$ includes all subsets of size k, expression (2.2) for the k-support norm reveals that it is a symmetric gauge function. [McDonald et al. 2014] use this fact to introduce the spectral k-support norm for matrices, by defining $\|W\|_{(k)} = \|\sigma(W)\|_{(k)}$, for $W \in \mathbb{R}^{d \times m}$, and report state of the art performance on matrix completion benchmarks.
3 The (k,p)-Support Norm
In this section we introduce the (k,p)-support norm as a natural extension of the k-support norm. This follows by applying the $\ell_p$-norm, rather than the Euclidean norm, in the infimal convolution definition of the norm.
Definition 1. Let $k \in \mathbb{N}_d$ and $p \in [1,\infty]$. The (k,p)-support norm of a vector $w \in \mathbb{R}^d$ is defined as

$$\|w\|_{(k,p)} = \inf_{(v_g)} \Big\{ \sum_{g \in \mathcal{G}_k} \|v_g\|_p \;:\; \sum_{g \in \mathcal{G}_k} v_g = w \Big\}, \qquad (3.1)$$

where the infimum is over all vectors $v_g \in \mathbb{R}^d$ such that $\mathrm{supp}(v_g) \subseteq g$, for $g \in \mathcal{G}_k$.
Let us note that the norm is well defined. Indeed, positivity, homogeneity and non-degeneracy are immediate. To prove the triangle inequality, let $w, w' \in \mathbb{R}^d$. For any $\epsilon > 0$ there exist $\{v_g\}$ and $\{v'_g\}$ such that $w = \sum_g v_g$, $w' = \sum_g v'_g$, $\sum_g \|v_g\|_p \leq \|w\|_{(k,p)} + \epsilon/2$, and $\sum_g \|v'_g\|_p \leq \|w'\|_{(k,p)} + \epsilon/2$. As $\sum_g v_g + \sum_g v'_g = w + w'$, we have

$$\|w + w'\|_{(k,p)} \leq \sum_g \|v_g\|_p + \sum_g \|v'_g\|_p \leq \|w\|_{(k,p)} + \|w'\|_{(k,p)} + \epsilon,$$

and the result follows by letting $\epsilon$ tend to zero.
Note that, since a convex set is equivalent to the convex hull of its extreme points, Definition 1 implies that the unit ball of the (k,p)-support norm, denoted by $C^p_k$, is given by the convex hull of the set of vectors with cardinality no greater than k and $\ell_p$-norm no greater than 1, that is,

$$C^p_k = \mathrm{co}\big\{w \in \mathbb{R}^d : \mathrm{card}(w) \leq k,\ \|w\|_p \leq 1\big\}. \qquad (3.2)$$
Definition 1 gives the norm as the solution of a variational problem. Its explicit computation is not straightforward in the general case; however, for $p = 1$ the unit ball (3.2) does not depend on k and is always equal to the $\ell_1$ unit ball. Thus, the (k,1)-support norm is always equal to the $\ell_1$-norm, and we do not consider this case further in this section. Similarly, for $k = 1$ we recover the $\ell_1$-norm for all values of p. For $p = \infty$, from the definition of the dual norm it is not difficult to show that $\|\cdot\|_{(k,\infty)} = \max\{\|\cdot\|_\infty,\ \|\cdot\|_1/k\}$. We return to this in Section 4 when we describe how to compute the norm for all values of p.
Note further that in Equation (3.1), as p tends to $\infty$, the $\ell_p$-norm of each $v_g$ is increasingly dominated by the largest component of $v_g$. As the variational formulation tries to identify vectors $v_g$ with small aggregate $\ell_p$-norm, this suggests that higher values of p encourage each $v_g$ to tend to a vector whose k entries are equal. In this manner, varying p allows us to adjust the degree to which the components of vector w can be clustered into (possibly overlapping) groups of size k.
As in the case of the k-support norm, the dual (k,p)-support norm has a simple expression. Recall that the dual norm of a vector $u \in \mathbb{R}^d$ is defined by the optimization problem

$$\|u\|_{(k,p),*} = \max\big\{\langle u, w \rangle : \|w\|_{(k,p)} = 1\big\}. \qquad (3.3)$$
Proposition 2. If $p \in (1,\infty]$ then the dual (k,p)-support norm is given by

$$\|u\|_{(k,p),*} = \Big(\sum_{i \in I_k} |u_i|^q\Big)^{\frac{1}{q}}, \qquad u \in \mathbb{R}^d,$$

where $q = p/(p-1)$ and $I_k \subset \mathbb{N}_d$ is the set of indices of the k largest components of u in absolute value. Furthermore, if $p \in (1,\infty)$ and $u \in \mathbb{R}^d \setminus \{0\}$ then the maximum in (3.3) is attained for

$$w_i = \begin{cases} \mathrm{sign}(u_i)\left(\dfrac{|u_i|}{\|u\|_{(k,p),*}}\right)^{\frac{1}{p-1}} & \text{if } i \in I_k, \\ 0 & \text{otherwise.} \end{cases} \qquad (3.4)$$

If $p = \infty$ the maximum is attained for

$$w_i = \begin{cases} \mathrm{sign}(u_i) & \text{if } i \in I_k,\ u_i \neq 0, \\ \lambda_i \in [-1,1] & \text{if } i \in I_k,\ u_i = 0, \\ 0 & \text{otherwise.} \end{cases}$$
Note that for $p = 2$ we recover the dual of the k-support norm in (2.3).
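The following NumPy sketch evaluates the dual norm of Proposition 2 and, for finite p, the maximizer (3.4); the function names are ours, and the final line simply checks numerically that the attained value $\langle u, w \rangle$ matches the dual norm.

```python
import numpy as np

def kp_support_dual(u, k, p):
    # Dual (k,p)-support norm: l_q-norm of the k largest
    # components of u in absolute value, with q = p / (p - 1).
    a = np.sort(np.abs(u))[::-1][:k]
    if np.isinf(p):
        return a.sum()  # q = 1: sum of the k largest |u_i|
    q = p / (p - 1.0)
    return (a ** q).sum() ** (1.0 / q)

def dual_maximizer(u, k, p):
    # Maximizer (3.4) of <u, w> over the unit ball of the
    # (k,p)-support norm, for p in (1, infinity) and u != 0.
    w = np.zeros_like(u)
    I_k = np.argsort(np.abs(u))[::-1][:k]  # indices of the k largest |u_i|
    scale = kp_support_dual(u, k, p)
    w[I_k] = np.sign(u[I_k]) * (np.abs(u[I_k]) / scale) ** (1.0 / (p - 1.0))
    return w

u = np.array([3.0, -4.0, 1.0, 0.5])
w = dual_maximizer(u, k=2, p=3.0)
print(np.dot(u, w), kp_support_dual(u, 2, 3.0))  # both ~5.58
```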
3.1 The Spectral (k,p)-Support Norm
From Definition 1 it is clear that the (k,p)-support norm is a symmetric gauge function. This follows since $\mathcal{G}_k$ contains all groups of cardinality k and the $\ell_p$-norms only involve absolute values of the components. Hence we can define the spectral (k,p)-support norm as

$$\|W\|_{(k,p)} = \|\sigma(W)\|_{(k,p)}, \qquad W \in \mathbb{R}^{d \times m}.$$

Since the dual of any orthogonally invariant norm is given by $\|\cdot\|_* = \|\sigma(\cdot)\|_*$, see e.g. [Lewis 1995], we conclude that the dual spectral (k,p)-support norm is given by

$$\|Z\|_{(k,p),*} = \|\sigma(Z)\|_{(k,p),*}, \qquad Z \in \mathbb{R}^{d \times m}.$$
The next result characterizes the unit ball of the spectral (k,p)-support norm. Due to the relationship between an orthogonally invariant norm and its corresponding symmetric gauge function, we see that the cardinality constraint for vectors generalizes in a natural manner to the rank operator for matrices.

Proposition 3. The unit ball of the spectral (k,p)-support norm is the convex hull of the set of matrices of rank at most k and Schatten p-norm no greater than one.
In particular, if $p = \infty$, the dual vector norm is given, for $u \in \mathbb{R}^d$, by $\|u\|_{(k,\infty),*} = \sum_{i=1}^k |u|^\downarrow_i$. Hence, for any $Z \in \mathbb{R}^{d \times m}$, the dual spectral norm is given by $\|Z\|_{(k,\infty),*} = \sum_{i=1}^k \sigma_i(Z)$, that is, the sum of the k largest singular values, which is also known as the Ky-Fan k-norm, see e.g. [Bhatia 1997].
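For instance, the dual spectral norm at $p = \infty$ reduces to a partial sum of singular values, as in this short sketch (naming is ours):

```python
import numpy as np

def ky_fan_k(Z, k):
    # Dual spectral (k,infinity)-support norm: sum of the
    # k largest singular values of Z (the Ky-Fan k-norm).
    s = np.linalg.svd(Z, compute_uv=False)  # nonincreasing order
    return s[:k].sum()

Z = np.diag([3.0, 2.0, 1.0])
print(ky_fan_k(Z, 1))  # 3.0, the spectral (operator) norm
print(ky_fan_k(Z, 3))  # 6.0, the trace norm of Z
```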
4 Computing the Norm
In this section we compute the norm, illustrating how it interpolates between the $\ell_1$ and $\ell_p$-norms.
Theorem 4. Let $p \in (1,\infty)$. For every $w \in \mathbb{R}^d$ and $k \leq d$, it holds that

$$\|w\|_{(k,p)} = \left[\sum_{i=1}^{\ell} \big(|w|^\downarrow_i\big)^p + \left(\frac{\sum_{i=\ell+1}^{d} |w|^\downarrow_i}{\sqrt[q]{k-\ell}}\right)^p\right]^{\frac{1}{p}}, \qquad (4.1)$$

where $\frac{1}{p} + \frac{1}{q} = 1$; for $k = d$ we set $\ell = d$, and otherwise $\ell$ is the largest integer in $\{0,\ldots,k-1\}$ satisfying

$$(k-\ell)\,|w|^\downarrow_\ell \geq \sum_{i=\ell+1}^{d} |w|^\downarrow_i. \qquad (4.2)$$

Furthermore, the norm can be computed in $O(d \log d)$ time.
Proof. Note first that in (4.1), when $\ell = 0$ we understand the first term in the right hand side to be zero, and when $\ell = d$ we understand the second term to be zero.

We need to compute

$$\|w\|_{(k,p)} = \max\Big\{\sum_{i=1}^{d} u_i w_i : \|u\|_{(k,p),*} \leq 1\Big\},$$

where the dual norm $\|\cdot\|_{(k,p),*}$ is described in Proposition 2. Let $z_i = |w|^\downarrow_i$. The problem is then equivalent to

$$\max\Big\{\sum_{i=1}^{d} z_i u_i : \sum_{i=1}^{k} u_i^q \leq 1,\ u_1 \geq \cdots \geq u_d\Big\}. \qquad (4.3)$$

This further simplifies to the k-dimensional problem

$$\max\Big\{\sum_{i=1}^{k-1} u_i z_i + u_k \sum_{i=k}^{d} z_i : \sum_{i=1}^{k} u_i^q \leq 1,\ u_1 \geq \cdots \geq u_k\Big\}.$$
Note that when $k = d$, the solution is given by the dual of the $\ell_q$-norm, that is, the $\ell_p$-norm. For the remainder of the proof we assume that $k < d$. We can now attempt to use Hölder's inequality, which states that for all vectors x such that $\|x\|_q = 1$, $\langle x, y \rangle \leq \|y\|_p$, and the inequality is tight if and only if

$$x_i = \left(\frac{|y_i|}{\|y\|_p}\right)^{p-1} \mathrm{sign}(y_i).$$

We use it for the vector $y = \big(z_1, \ldots, z_{k-1}, \sum_{i=k}^{d} z_i\big)$. The components of the maximizer u satisfy $u_i = \left(\frac{z_i}{M_{k-1}}\right)^{p-1}$ if $i \leq k-1$, and

$$u_k = \left(\frac{\sum_{i=k}^{d} z_i}{M_{k-1}}\right)^{p-1},$$
where for every $\ell \in \{0,\ldots,k-1\}$, $M_\ell$ denotes the right hand side in equation (4.1). We then need to verify that the ordering constraints are satisfied. This requires that

$$(z_{k-1})^{p-1} \geq \Big(\sum_{i=k}^{d} z_i\Big)^{p-1},$$

which is equivalent to inequality (4.2) for $\ell = k-1$. If this inequality is true we are done; otherwise we set $u_k = u_{k-1}$ and solve the smaller problem

$$\max\Big\{\sum_{i=1}^{k-2} u_i z_i + u_{k-1}\sum_{i=k-1}^{d} z_i \;:\; \sum_{i=1}^{k-2} u_i^q + 2u_{k-1}^q \leq 1,\ u_1 \geq \cdots \geq u_{k-1}\Big\}.$$
We use again Hölder's inequality and keep the result if the ordering constraints are fulfilled. Continuing in this way, the generic problem we need to solve is

$$\max\Big\{\sum_{i=1}^{\ell} u_i z_i + u_{\ell+1}\sum_{i=\ell+1}^{d} z_i \;:\; \sum_{i=1}^{\ell} u_i^q + (k-\ell)u_{\ell+1}^q \leq 1,\ u_1 \geq \cdots \geq u_{\ell+1}\Big\},$$

where $\ell \in \{0,\ldots,k-1\}$. Without the ordering constraints the maximum, $M_\ell$, is obtained by the change of variable $u_{\ell+1} \mapsto (k-\ell)^{\frac{1}{q}} u_{\ell+1}$ followed by applying Hölder's inequality. A direct computation provides that the maximizer is $u_i = \left(\frac{z_i}{M_\ell}\right)^{p-1}$ if $i \leq \ell$, and

$$(k-\ell)^{\frac{1}{q}}\, u_{\ell+1} = \left(\frac{\sum_{i=\ell+1}^{d} z_i}{(k-\ell)^{\frac{1}{q}} M_\ell}\right)^{p-1}.$$
Using the relationship $\frac{1}{p} + \frac{1}{q} = 1$, we can rewrite this as

$$u_{\ell+1} = \left(\frac{\sum_{i=\ell+1}^{d} z_i}{(k-\ell)\, M_\ell}\right)^{p-1}.$$

Hence, the ordering constraints are satisfied if

$$z_\ell^{p-1} \geq \left(\frac{\sum_{i=\ell+1}^{d} z_i}{k-\ell}\right)^{p-1},$$

which is equivalent to (4.2). Finally, note that $M_\ell$ is a nondecreasing function of $\ell$. This is because the problem with a smaller value of $\ell$ is more constrained: namely, it solves (4.3) with the additional constraints $u_{\ell+1} = \cdots = u_d$. Moreover, if the constraint (4.2) holds for some value $\ell \in \{0,\ldots,k-1\}$ then it also holds for any smaller value of $\ell$; hence we maximize the objective by choosing the largest such $\ell$.

The computational complexity stems from the monotonicity of $M_\ell$ with respect to $\ell$, which allows us to identify the critical value of $\ell$ using binary search.
Note that for $k = d$ we recover the $\ell_p$-norm, and for $p = 2$ we recover the result in [Argyriou et al. 2012, McDonald et al. 2014]; however, our proof technique is different from theirs.
Remark 5 (Computation of the norm for $p \in \{1,\infty\}$). Since the norm $\|\cdot\|_{(k,p)}$ computed above for $p \in (1,\infty)$ is continuous in p, the special cases $p = 1$ and $p = \infty$ can be derived by a limiting argument. We readily see that for $p = 1$ the norm does not depend on k and is always equal to the $\ell_1$-norm, in agreement with our observation in the previous section. For $p = \infty$ we obtain that $\|w\|_{(k,\infty)} = \max(\|w\|_\infty,\ \|w\|_1/k)$.
5 Optimization
In this section, we describe how to solve regularization problems using the vector and matrix (k,p)-support norms. We consider the constrained optimization problem

$$\min\big\{f(w) : \|w\|_{(k,p)} \leq \alpha\big\}, \qquad (5.1)$$
Algorithm 1 Frank-Wolfe.
  Choose $w^{(0)}$ such that $\|w^{(0)}\|_{(k,p)} \leq \alpha$
  for $t = 0, \ldots, T$ do
    Compute $g := \nabla f(w^{(t)})$
    Compute $s := \mathrm{argmin}\big\{\langle s, g \rangle : \|s\|_{(k,p)} \leq \alpha\big\}$
    Update $w^{(t+1)} := (1-\gamma)w^{(t)} + \gamma s$, for $\gamma := \frac{2}{t+2}$
  end for
where w is in $\mathbb{R}^d$ or $\mathbb{R}^{d \times m}$, $\alpha > 0$ is a regularization parameter and the error function f is assumed to be convex and continuously differentiable. For example, in linear regression a valid choice is the square error, $f(w) = \|Xw - y\|_2^2$, where X is a matrix of observations and y a vector of response variables. Constrained problems of the form (5.1) are also referred to as Ivanov regularization in the inverse problems literature [Ivanov et al. 1978].
A convenient tool to solve problem (5.1) is provided by the Frank-Wolfe method [Frank and Wolfe 1956], see also [Jaggi 2013] for a recent account. The method is outlined in Algorithm 1, and it has worst case convergence rate $O(1/T)$. The key step of the algorithm is to solve the subproblem

$$\mathrm{argmin}\big\{\langle s, g \rangle : \|s\|_{(k,p)} \leq \alpha\big\}, \qquad (5.2)$$

where $g = \nabla f(w^{(t)})$, that is, the gradient of the objective function at the t-th iteration. This problem involves computing a subgradient of the dual norm at g. It can be solved exactly and efficiently as a consequence of Proposition 2. We discuss here the vector case and postpone the discussion of the matrix case to Section 5.2. By symmetry of the $\ell_p$-norm, problem (5.2) can be solved in the same manner as the maximum in Proposition 2, and the solution is given by $s_i = -\alpha w_i$, where $w_i$ is given by (3.4). Specifically, letting $I_k \subset \mathbb{N}_d$ be the set of indices of the k largest components of g in absolute value, for $p \in (1,\infty)$ we have

$$s_i = \begin{cases} -\alpha\,\mathrm{sign}(g_i)\left(\dfrac{|g_i|}{\|g\|_{(k,p),*}}\right)^{\frac{1}{p-1}} & \text{if } i \in I_k, \\ 0 & \text{if } i \notin I_k, \end{cases} \qquad (5.3)$$

and, for $p = \infty$, we choose the subgradient

$$s_i = \begin{cases} -\alpha\,\mathrm{sign}(g_i) & \text{if } i \in I_k,\ g_i \neq 0, \\ 0 & \text{otherwise.} \end{cases} \qquad (5.4)$$
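Putting the pieces together, the following is a minimal NumPy sketch of Algorithm 1 for the square error $f(w) = \|Xw - y\|_2^2$ with the vector (k,p)-support constraint and $p \in (1,\infty)$, using (5.3) for the linear subproblem. The problem data and all names are illustrative, not from the paper.

```python
import numpy as np

def fw_linear_minimizer(g, k, p, alpha):
    # Solve argmin { <s, g> : ||s||_{(k,p)} <= alpha } via (5.3);
    # assumes g != 0.
    q = p / (p - 1.0)
    idx = np.argsort(np.abs(g))[::-1][:k]            # I_k: top-k |g_i|
    dual = (np.abs(g[idx]) ** q).sum() ** (1.0 / q)  # ||g||_{(k,p),*}
    s = np.zeros_like(g)
    s[idx] = -alpha * np.sign(g[idx]) * (np.abs(g[idx]) / dual) ** (1.0 / (p - 1.0))
    return s

def frank_wolfe(X, y, k, p, alpha, T=500):
    # Algorithm 1 for f(w) = ||Xw - y||_2^2 s.t. ||w||_{(k,p)} <= alpha.
    w = np.zeros(X.shape[1])               # feasible starting point
    for t in range(T):
        g = 2.0 * X.T @ (X @ w - y)        # gradient of the square error
        s = fw_linear_minimizer(g, k, p, alpha)
        gamma = 2.0 / (t + 2.0)
        w = (1.0 - gamma) * w + gamma * s
    return w

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
w_true = np.zeros(20); w_true[:3] = 1.0    # sparse ground truth
y = X @ w_true + 0.01 * rng.standard_normal(50)
w_hat = frank_wolfe(X, y, k=3, p=2.0, alpha=np.sqrt(3.0))
print(np.round(w_hat[:5], 2))              # first three entries near 1
```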
5.1 Projection Operator
An alternative method to solve (5.1) in the vector case is to consider the equivalent problem

$$\min\Big\{ f(w) + \delta_{\{\|\cdot\|_{(k,p)} \leq \alpha\}}(w) \;:\; w \in \mathbb{R}^d \Big\}, \qquad (5.5)$$