Statistical Science
2012, Vol. 27, No. 4, 447–449
DOI: 10.1214/12-STS409
© Institute of Mathematical Statistics, 2012

Introduction to the Special Issue on
Sparsity and Regularization Methods

Jon Wellner and Tong Zhang

Jon Wellner is Professor of Statistics and Biostatistics, Department of Statistics, University of Washington, Seattle, Washington 98112-2801, USA (e-mail: [email protected]). Tong Zhang is Professor of Statistics, Department of Statistics, Rutgers University, New Jersey, USA (e-mail: [email protected]).

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2012, Vol. 27, No. 4, 447–449. This reprint differs from the original in pagination and typographic detail.
1. INTRODUCTION

Traditional statistical inference considers relatively small data sets, and the corresponding theoretical analysis focuses on the asymptotic behavior of a statistical estimator as the number of samples approaches infinity. However, many data sets encountered in modern applications have dimensionality significantly larger than the number of training data available, and for such problems the classical statistical tools become inadequate. In order to analyze high-dimensional data, new statistical methodology and the corresponding theory have to be developed.

In the past decade, sparse modeling and the corresponding use of sparse regularization methods have emerged as a major technique for handling high-dimensional data. While the data dimensionality is high, the basic assumption in this approach is that the actual estimator is sparse, in the sense that only a small number of its components are nonzero. On the practical side, the sparsity phenomenon has been ubiquitously observed in applications, including signal recovery, genomics, computer vision, etc. On the theoretical side, this assumption makes it possible to overcome the problem of estimating more parameters than the number of observations, which is impossible to deal with in the classical setting.

There are a number of challenges, including developing new theories for high-dimensional statistical estimation as well as new formulations and computational procedures. Related problems have received a lot of attention in various research fields, including applied mathematics, signal processing, machine learning, statistics and optimization. Rapid advances have been made in recent years. In view of the growing research activities and their practical importance, we have organized this special issue of Statistical Science with the goal of providing overviews of several topics in modern sparsity analysis and associated regularization methods. Our hope is that general readers will get a broad idea of the field as well as current research directions.

2. SPARSE MODELING AND REGULARIZATION

One of the central problems in statistics is linear regression, where we consider an n × p design matrix X and an n-dimensional response vector Y ∈ ℝⁿ, so that

(1) Y = Xβ̄ + ε,

where β̄ ∈ ℝᵖ is the true regression coefficient vector and ε ∈ ℝⁿ is a noise vector. In the case n < p, this problem is ill-posed because the number of parameters exceeds the number of observations. This ill-posedness can be resolved by imposing a sparsity constraint: that is, by assuming that ‖β̄‖₀ ≤ s for some s, where the ℓ₀-norm of β̄ is defined as ‖β̄‖₀ = |supp(β̄)| and the support set of β̄ is defined as supp(β̄) := {j : β̄ⱼ ≠ 0}. If s ≪ n, then the effective number of parameters in (1) is smaller than the number of observations.
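To fix ideas, the following is a minimal simulation of model (1) with an s-sparse β̄; the dimensions, sparsity level and noise scale are illustrative choices of ours, not values taken from the articles in this issue.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, s = 50, 200, 5                     # n < p: ill-posed without sparsity

    # s-sparse true coefficient vector: supp(beta_bar) has s elements
    beta_bar = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    beta_bar[support] = rng.normal(size=s)

    # Model (1): Y = X beta_bar + eps
    X = rng.normal(size=(n, p))
    Y = X @ beta_bar + 0.1 * rng.normal(size=n)

    # The ell_0 "norm" is the support size: here s = 5, far smaller than n = 50
    print(np.count_nonzero(beta_bar))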
The sparsity assumption may be viewed in terms of the classical model selection problem, where models are indexed by the set of nonzero coefficients. Classical model selection criteria such as AIC, BIC or Cp [1, 7, 11] naturally lead to the so-called ℓ₀ regularization estimator:

(2) β̂(ℓ₀) = argmin_{β∈ℝᵖ} [ (1/n)‖Xβ − Y‖₂² + λ‖β‖₀ ].

The main difference between the modern ℓ₀ analysis in high-dimensional statistics and the classical model selection methods is the choice of λ: the modern analysis requires a larger λ than that considered in the classical model selection setting, because it is necessary to compensate for the effect of considering many models in the high-dimensional setting. The analysis of ℓ₀ regularization in the high-dimensional setting (e.g., [15] in this issue) employs different techniques, and the results obtained are also different from the classical literature.
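For intuition only: when p is tiny, (2) can be minimized exactly by enumerating supports, fitting least squares on each, and adding the penalty λ|S|. The sketch below is our own illustration of this brute-force search; its 2^p cost is precisely why the convex relaxation discussed next is needed.

    import numpy as np
    from itertools import combinations

    def l0_estimator(X, Y, lam):
        """Exact minimizer of (1/n)||X b - Y||_2^2 + lam * ||b||_0,
        by exhaustive search over all 2^p supports (tiny p only)."""
        n, p = X.shape
        best_obj, best_beta = np.mean(Y ** 2), np.zeros(p)   # empty support
        for k in range(1, p + 1):
            for S in combinations(range(p), k):
                cols = list(S)
                b, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
                obj = np.mean((Y - X[:, cols] @ b) ** 2) + lam * k
                if obj < best_obj:
                    best_obj = obj
                    best_beta = np.zeros(p)
                    best_beta[cols] = b
        return best_beta

Already at p = 30 this loop visits about 10⁹ candidate supports, which makes the computational difficulty of (2) concrete.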
The ℓ₀ regularization formulation leads to a nonconvex optimization problem that is difficult to solve computationally. On the other hand, an important requirement for modern high-dimensional problems is to design computationally efficient and statistically effective algorithms. Therefore, the main focus of the existing literature is on convex relaxation methods that use ℓ₁ regularization (Lasso) to replace sparsity constraints:

(3) β̂(ℓ₁) = argmin_{β∈ℝᵖ} [ (1/n)‖Xβ − Y‖₂² + λ‖β‖₁ ].

This method is referred to as the Lasso [12] in the literature, and its theoretical properties have been intensively studied. Since the formulation is regarded as an approximation of (2), a key question is how good this approximation is, and how good the estimator β̂(ℓ₁) is for estimating β̄.
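In practice (3) is solved by off-the-shelf coordinate-descent code; the sketch below uses scikit-learn. One caveat: scikit-learn's Lasso minimizes (1/(2n))‖Y − Xβ‖₂² + α‖β‖₁, so α = λ/2 corresponds to formulation (3); the value of λ here is an arbitrary illustration.

    import numpy as np
    from sklearn.linear_model import Lasso

    # Simulated data as in the sketch following model (1)
    rng = np.random.default_rng(0)
    n, p, s = 50, 200, 5
    beta_bar = np.zeros(p)
    beta_bar[rng.choice(p, size=s, replace=False)] = rng.normal(size=s)
    X = rng.normal(size=(n, p))
    Y = X @ beta_bar + 0.1 * rng.normal(size=n)

    # alpha = lam / 2 because of scikit-learn's 1/(2n) loss scaling
    lam = 0.1
    beta_hat = Lasso(alpha=lam / 2, fit_intercept=False).fit(X, Y).coef_

    # Compare the estimated support with supp(beta_bar)
    print(np.flatnonzero(beta_hat))
    print(np.flatnonzero(beta_bar))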
Many extensions of the Lasso have appeared in the literature for more complex problems. One example is the group Lasso [14], which assumes that variables are selected in groups. Another extension is the estimation of graphical models, where one can employ the Lasso to estimate unknown graphical model structures [3, 8]. A third example is matrix regularization, where the concept of sparsity can be replaced by the concept of low-rankness, and sparsity constraints become low-rank constraints. Of special interest is the so-called matrix completion problem, where we want to recover a matrix from a few observations of the matrix entries. This problem is encountered in recommender system applications (e.g., a person who buys a book at amazon.com will be recommended other books purchased by users with similar interests), and low-rank matrix factorization is one of the main techniques for this problem. Similar to sparsity regularization, using low-rank regularization leads to nonconvex formulations and, thus, it is natural to consider the convex relaxation, which is referred to as trace-norm (or nuclear norm) regularization. The theoretical properties and numerical algorithms for trace-norm regularization methods have received considerable attention.
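A standard computational tool for trace-norm regularization is singular-value soft-thresholding, the proximal operator of the trace norm; iterating it on the observed entries yields a basic "soft-impute"-style completion scheme. The sketch below is a bare-bones illustration under our own setup (fixed threshold, fixed iteration count), not the specific algorithms studied in this issue.

    import numpy as np

    def svd_soft_threshold(M, tau):
        """Prox of tau * (trace norm): shrink singular values toward zero by tau."""
        U, svals, Vt = np.linalg.svd(M, full_matrices=False)
        return U @ np.diag(np.maximum(svals - tau, 0.0)) @ Vt

    def complete_matrix(Y_obs, mask, tau, n_iter=200):
        """Soft-impute-style iteration: fill missing entries with the current
        low-rank estimate, then apply singular-value soft-thresholding."""
        Z = np.where(mask, Y_obs, 0.0)
        for _ in range(n_iter):
            Z = svd_soft_threshold(np.where(mask, Y_obs, Z), tau)
        return Z

    # Example: recover a rank-2 matrix from roughly half of its entries
    rng = np.random.default_rng(0)
    A = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 30))
    mask = rng.random(A.shape) < 0.5
    A_hat = complete_matrix(np.where(mask, A, 0.0), mask, tau=0.5)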
3. ARTICLES IN THIS ISSUE

The eight articles in this issue present general overviews of the state of the art in a number of different topics concerning sparsity analysis and regularization methods. Moreover, many articles go beyond the current state of the art in various ways. Therefore, these articles not only give some high-level ideas about the current topics, but will also be valuable for experts working in the field.

• Bach, Jenatton, Mairal and Obozinski (Structured sparsity through convex optimization, [2]) study convex relaxations based on structured norms incorporating further structural prior knowledge. An extension of the standard ℓ₀ sparsity model that has received a lot of attention in recent years is structured sparsity. The basic idea is that not all sparsity patterns for supp(β̄) are equally likely. A simple example is group sparsity, where nonzero coefficients occur together in predefined groups. More complex structured sparsity models have been investigated in recent years. Although the paper by Bach et al. focuses on the convex optimization approach, they also give an extensive survey of recent developments, including the use of submodular set functions.

• van de Geer and Müller (Quasi-likelihood and/or robust estimation in high dimensions, [13]) extend ℓ₁ regularization methods to generalized linear models. This involves consideration of loss functions beyond the usual least-squares loss and, in particular, loss functions arising via quasi-likelihoods.

• Huang, Breheny and Ma (A selective review of group selection in high-dimensional models, [5]) provide a detailed review of the most important special case of structured sparsity, namely, group sparsity. Their review covers both convex relaxation (the group Lasso) and approaches based on nonconvex group penalties.

• Giraud, Huet and Verzelen (High-dimensional regression with unknown variance, [4]) address issues in high-dimensional regression estimation connected with lack of knowledge of the error variance. In the standard Lasso formulation (3), the regularization parameter λ is considered a tuning parameter that needs to be chosen proportionally to the standard deviation σ of the noise vector. A natural question is whether it is possible to estimate σ automatically instead of leaving λ as a tuning parameter. This problem has received much attention, and a number of developments have been made in recent years. This paper reviews and compares several approaches to this problem.

• Lafferty, Liu and Wasserman (Sparse nonparametric graphical models, [6]) discuss another
important topic in sparsity analysis, the graphical model estimation problem. While much of the current work assumes that the data come from a multivariate Gaussian distribution, this paper goes beyond the standard practice. The authors outline a number of possible approaches and introduce more flexible models for the problem. They also describe some of their recent work and future research directions.

• Negahban, Ravikumar, Wainwright and Yu (A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers, [9]) provide a unified treatment of existing approaches to sparse regularization. The paper extends the standard sparse recovery analysis of ℓ₁ regularized least squares regression problems by introducing a general concept of restricted strong convexity. This allows the authors to study more general formulations with different convex loss functions and a class of "decomposable" regularization conditions.

• Rigollet and Tsybakov (Sparse estimation by exponential weighting, [10]) present a thorough analysis of oracle inequalities in the context of model averaging procedures, a class of methods which has its origin in the Bayesian literature. Model averaging is in general more stable than model selection. For example, in the scenario that two models are very similar and only one is correct, model selection forces us to choose one of the models even if we are not certain which model is true. A model averaging procedure, on the other hand, does not force us to choose one of the two models, but only to take an average of the two. This is beneficial when several of the models are similar and we cannot tell which is the correct one. The modern analysis of model averaging procedures leads to oracle inequalities that are sharper than the corresponding oracle inequalities for model selection methods such as the Lasso. The authors give an extensive discussion of such oracle inequalities using an exponentially weighted model averaging procedure (a toy version of this weighting is sketched after this list). Such procedures have advantages over model selection when the underlying models are correlated and when the model class is misspecified.

• Zhang and Zhang (A general theory of concave regularization for high-dimensional sparse estimation problems, [15]) focus on nonconvex penalties and study a variety of issues related to such penalties. Although the natural formulation of a sparsity constraint is ℓ₀ regularization, due to its computational difficulty most of the recent literature focuses on the simpler ℓ₁ regularization method (Lasso) that approximates ℓ₀ regularization. However, it is also known that ℓ₁ regularization is not a very good approximation to ℓ₀ regularization. This leads to the study of nonconvex penalties. The nonconvex formulations are both harder to analyze statistically and harder to handle computationally, and some fundamental understanding of high-dimensional nonconvex procedures has only started to emerge recently. Nevertheless, some basic questions have remained unanswered: for example, the properties of the global solution of nonconvex formulations, and whether it is possible to compute the globally optimal solution efficiently under suitable conditions. The authors go a considerable distance toward providing a general theory that answers some of these fundamental questions.
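As promised in the discussion of [10], here is a toy version of exponential weighting (our own caricature; the procedures and temperature choices analyzed by Rigollet and Tsybakov are far more refined). Least-squares fits over small supports are averaged with weights decaying exponentially in their residual sum of squares, with a prior that penalizes larger supports; all constants below are illustrative assumptions.

    import numpy as np
    from itertools import combinations

    def exp_weighted_aggregate(X, Y, sigma2, max_size=2, prior_c=2.0):
        """Average least-squares fits over all supports S with |S| <= max_size,
        weighting each by exp(-RSS_S / (4 sigma2)) times a prior p^(-c |S|)
        (illustrative temperature and prior)."""
        n, p = X.shape
        supports = [list(S) for k in range(1, max_size + 1)
                    for S in combinations(range(p), k)]
        betas, rss, sizes = [], [], []
        for cols in supports:
            b, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
            beta_full = np.zeros(p)
            beta_full[cols] = b
            betas.append(beta_full)
            rss.append(np.sum((Y - X[:, cols] @ b) ** 2))
            sizes.append(len(cols))
        rss, sizes = np.array(rss), np.array(sizes)
        # Shift log-weights for numerical stability before exponentiating
        log_w = -(rss - rss.min()) / (4.0 * sigma2) - prior_c * sizes * np.log(p)
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        return (w[:, None] * np.array(betas)).sum(axis=0)

Unlike model selection, the aggregate never commits to a single support, which is the stability property discussed above.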
REFERENCES

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (Tsahkadsor, 1971) 267–281. Akadémiai Kiadó, Budapest. MR0483125
[2] Bach, F., Jenatton, R., Mairal, J. and Obozinski, G. (2012). Structured sparsity through convex optimization. Statist. Sci. 27 450–468.
[3] Banerjee, O., El Ghaoui, L. and d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res. 9 485–516. MR2417243
[4] Giraud, C., Huet, S. and Verzelen, N. (2012). High-dimensional regression with unknown variance. Statist. Sci. 27 500–518.
[5] Huang, J., Breheny, P. and Ma, S. (2012). A selective review of group selection in high-dimensional models. Statist. Sci. 27 481–499.
[6] Lafferty, J., Liu, H. and Wasserman, L. (2012). Sparse nonparametric graphical models. Statist. Sci. 27 519–537.
[7] Mallows, C. L. (1973). Some comments on Cp. Technometrics 15 661–675.
[8] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34 1436–1462. MR2278363
[9] Negahban, S., Ravikumar, P., Wainwright, M. J. and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statist. Sci. 27 538–557.
[10] Rigollet, P. and Tsybakov, A. B. (2012). Sparse estimation by exponential weighting. Statist. Sci. 27 558–575.
[11] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464. MR0468014
[12] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 267–288. MR1379242
[13] van de Geer, S. and Müller, P. (2012). Quasi-likelihood and/or robust estimation in high dimensions. Statist. Sci. 27 469–480.
[14] Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B Stat. Methodol. 68 49–67. MR2212574
[15] Zhang, C.-H. and Zhang, T. (2012). A general theory of concave regularization for high-dimensional sparse estimation problems. Statist. Sci. 27 576–593.