Kernelized LRR on Grassmann Manifolds for
Subspace Clustering
Boyue Wang, Yongli Hu, Member, IEEE, Junbin Gao, Yanfeng Sun, Member, IEEE, and Baocai Yin
Abstract—Low rank representation (LRR) has recently attracted great interest due to its pleasing efficacy in exploring low-dimensional subspace structures embedded in data. One of its successful applications is subspace clustering, in which data are clustered according to the subspaces they belong to. In this paper, at a higher level, we intend to cluster subspaces into classes of subspaces. This is naturally described as a clustering problem on Grassmann manifold. The novelty of this paper is to generalize LRR on Euclidean space onto an LRR model on Grassmann manifold in a uniform kernelized framework. The new method has many applications in data analysis for computer vision tasks. The proposed models have been evaluated on a number of practical data analysis applications, and the experimental results show that they outperform a number of state-of-the-art subspace clustering methods.

Index Terms—Low Rank Representation, Subspace Clustering, Grassmann Manifold, Kernelized Method

• Boyue Wang, Yongli Hu, Yanfeng Sun and Baocai Yin are with the Beijing Municipal Key Lab of Multimedia and Intelligent Software Technology, College of Metropolitan Transportation, Beijing University of Technology, Beijing 100124, China. E-mail: [email protected], {huyongli,yfsun,ybc}@bjut.edu.cn
• Junbin Gao is with the Discipline of Business Analytics, The University of Sydney Business School, The University of Sydney, Camperdown, NSW 2006, Australia. E-mail: [email protected]

1 INTRODUCTION

In recent years, subspace clustering or segmentation has attracted great interest in image analysis, computer vision, pattern recognition and signal processing [1], [2]. The basic idea of subspace clustering is that most data have intrinsic subspace structures and can be regarded as samples of a union of multiple subspaces. The main goal of subspace clustering is therefore to group data into different clusters such that the data points in each cluster come from a single subspace. To investigate and represent the underlying subspace structure, many subspace methods have been proposed, such as the conventional iterative methods [3], the statistical methods [4]–[7], the factorization-based algebraic approaches [8], [9], and the spectral clustering-based methods [2], [10]–[13]. These methods have been successfully applied in many applications, such as image representation [14], motion segmentation [8], face classification [15] and saliency detection [13].

Among all the subspace clustering methods mentioned above, the spectral clustering methods based on an affinity matrix are considered to have good prospects [2]. An affinity matrix is first learned from the given data, and the final clustering result is then obtained by a spectral clustering algorithm such as Normalized Cuts (NCut) [16], or simply by K-means. The key ingredient in a spectral clustering method is to construct a proper affinity matrix for the data. In the typical method, Sparse Subspace Clustering (SSC) [2], one assumes that the data of the subspaces are independent and are sparsely represented under the so-called ℓ1 Subspace Detection Property [17], in which the within-class affinities are sparse and the between-class affinities are all zeros. It has been proved that, under certain conditions, the multiple subspace structures can be exactly recovered via ℓp (p ≤ 1) minimization [18].

Most of the current sparse subspace methods mainly focus on an independent sparse representation for the data objects. However, the relations among data objects, or the underlying global structure of the subspaces that generate the subsets of data to be grouped, are usually not well considered, although these intrinsic properties are very important for clustering applications. Some researchers therefore explore these intrinsic properties and relations among data objects and revise the sparse representation model to capture them by introducing extra constraints, such as the low rank constraint [11], the data Laplace consistence regularization [19] and the data sequential property [20]. Among these constraints, holistic constraints such as the low rank or nuclear norm ‖·‖_* are proposed in favour of structural sparsity. The Low Rank Representation (LRR) model [11] is one representative. The LRR model tries to reveal the latent sparse property embedded in a data set in high dimensional space. It has been proved that, when the high-dimensional data set actually comes from a union of several low dimensional subspaces, the LRR model can reveal this structure through subspace clustering [11].

Although most current subspace clustering methods show good performance in various applications, the similarity among data objects is measured in the original data domain. For example, the current LRR method is based on the principle of data self-representation, and the representation error is measured in terms of a Euclidean-like distance.
However, this hypothesis may not always hold for high-dimensional data in practice, where corrupted data may not reside nicely in a linear space. In fact, it has been proved that many high-dimensional data are embedded in low dimensional manifolds. For example, human face images are considered as samples from a non-linear submanifold [21]. Generally, manifolds can be considered as low dimensional smooth "surfaces" embedded in a higher dimensional Euclidean space; at each point, a manifold is locally similar to Euclidean space. To effectively cluster such high dimensional data, it is desirable to reveal the nonlinear manifold structure underlying them and to obtain a proper representation for the data objects.

There are two types of manifold related learning tasks. In so-called manifold learning, one has to respect the local geometry existing in the data, but the manifold itself is unknown to the learner. The classic representative algorithms for manifold learning include LLE (Locally Linear Embedding) [22], ISOMAP [23], LLP (Locally Linear Projection) [24], LE (Laplacian Embedding) [25], LTSA (Local Tangent Space Alignment) [26] and TKE (Twin Kernel Embedding) [27].

In the other type of learning tasks, we clearly know the manifolds where the data come from. For example, in image analysis, people usually use covariance matrices of features as a region descriptor [28]. In this case, such a descriptor is a point on the manifold of symmetric positive definite matrices. More generally in computer vision, it is common to collect data on a known manifold. For example, it is common practice to use a subspace to represent a set of images [29], while such a subspace is actually a point on the Grassmann manifold [30]. Thus an image set is regarded as a point from the known Grassmann manifold. This type of task, which incorporates manifold properties in learning, is called learning on manifolds. There are three major strategies in dealing with learning tasks on manifolds.

1) Intrinsic Strategy: The ideal but hardest strategy is to intrinsically perform learning tasks on manifolds based on their intrinsic geometry. Very few existing approaches adopt this strategy.
2) Extrinsic Strategy: The second strategy is to implement a learning algorithm within the tangent spaces of manifolds, where all the linear relations can be exploited. In fact, this is a first order approximation to the intrinsic strategy, and most approaches fall in this category.
3) Embedding Strategy: The third strategy is to embed a manifold into a "larger" Euclidean space by an appropriate mapping, such as a kernel method, and to implement any learning algorithm in this "flattened" embedding space. For a practical learning task, however, how to incorporate the properties of the known manifold into the design of the kernel mapping is still a challenging problem.

In this paper, we are concerned with the points on a particular known manifold, the Grassmann manifold. We explore the LRR model for clustering a set of data points on Grassmann manifold by adopting the aforementioned third strategy. In fact, Grassmann manifold has the nice property that it can be embedded into the linear space of symmetric matrices [31], [32]. In this way, all the abstract points (subspaces) on Grassmann manifold can be embedded into a Euclidean space where the classic LRR model can be applied. An LRR model can then be constructed in the embedding space, where the error measure is simply taken as the Euclidean metric of that space. The main idea of our method is illustrated in Fig. 1.

Fig. 1. An overview of our proposed LRR on Grassmann manifolds. Three steps are involved in the proposed model: (a) the points on Grassmann manifold are mapped onto symmetric matrices; (b) the LRR model is formulated in the symmetric matrix space; (c) the coefficients in the LRR model are used by NCut for clustering.

The contributions of this work are listed as follows:
• Constructing an extended LRR model on Grassmann manifold based on our prior work in [33];
• Giving the solutions and practical algorithms to the problems of the extended Grassmann LRR model under different noise models, particularly those defined by the Frobenius norm and the ℓ2/ℓ1 norm;
• Presenting a new kernelized LRR model on Grassmann manifold.

The rest of the paper is organized as follows.
In Section 2, we review some related works. In Section 3, the proposed LRR on Grassmann Manifold (GLRR) is described and the solutions to the GLRR models under different noise assumptions are given in detail. In Section 4, we introduce a general framework for the LRR model on Grassmann manifold from the kernelization point of view. In Section 5, the performance of the proposed methods is evaluated on several public databases. Finally, conclusions and suggestions for future work are provided in Section 6.

2 RELATED WORKS

In this section, we briefly review the existing sparse subspace clustering methods, including the classic Sparse Subspace Clustering (SSC) and the Low Rank Representation (LRR), and then summarize the properties of Grassmann manifold that are related to the work presented in this paper.

2.1 Sparse Subspace Clustering (SSC)

Given a set of data drawn from a union of unknown subspaces, the task of subspace clustering is to find the number of subspaces, their dimensions and bases, and then to segment the data set according to the subspaces. In recent years, sparse representation has been applied to subspace clustering, and the resulting Sparse Subspace Clustering (SSC) aims to find the sparse representation of the data set using ℓ1 regularization [2]. The general SSC can be formulated as follows:

\min_{E,Z} \|E\|_{\ell} + \lambda\|Z\|_{1} \quad \text{s.t.}\; Y = DZ + E,\ \mathrm{diag}(Z) = 0,   (1)

where Y ∈ R^{d×N} is a set of N signals of dimension d, Z is the corresponding sparse representation of Y under the dictionary D, and E represents the error between the signals and their reconstructed values, measured by the norm ‖·‖_ℓ: ℓ = 2 (or ℓ = F, the Frobenius norm) to deal with Gaussian noise, ℓ = 1 (the Laplacian norm) to deal with random gross corruptions, or ℓ = ℓ2/ℓ1 to deal with sample-specific corruptions. Finally, λ > 0 is a penalty parameter balancing the sparse term and the reconstruction error.

In the above sparse model, it is critical to use an appropriate dictionary D to represent the signals. Generally, a dictionary can be learned from training data by one of the many dictionary learning methods, such as K-SVD [34]. However, dictionary learning is usually time-consuming and so has to be done offline. Many researchers therefore adopt a simple and direct alternative and use the original signals themselves as the dictionary, which is known as the self-expressiveness property [2]: each data point in a union of subspaces can be efficiently reconstructed by a linear combination of the other points in the dataset. More specifically, every point in the dataset can be represented as a sparse linear combination of other points from the same subspace. Mathematically, this sparse formulation is written as

\min_{E,Z} \|E\|_{\ell} + \lambda\|Z\|_{1} \quad \text{s.t.}\; Y = YZ + E,\ \mathrm{diag}(Z) = 0.   (2)

From the sparse representation matrix Z, an affinity matrix can be constructed; one commonly used form is (|Z| + |Z^T|)/2. This affinity matrix is interpreted as a graph upon which a clustering algorithm such as NCut is applied for the final segmentation. This is the typical approach used in modern subspace clustering techniques.
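To make the pipeline concrete, the following sketch builds the symmetric affinity (|Z| + |Z^T|)/2 from a learned representation matrix Z and feeds it to a spectral clustering step. It is only an illustration of the procedure described above, written in Python/NumPy; the function name is ours, and scikit-learn's SpectralClustering is used here as a stand-in for NCut.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def segment_from_representation(Z, n_clusters):
    """Turn a (sparse or low-rank) representation matrix Z into cluster labels.

    The symmetric affinity W = (|Z| + |Z^T|) / 2 is interpreted as a graph and
    partitioned by spectral clustering (an NCut-style step)."""
    W = (np.abs(Z) + np.abs(Z.T)) / 2.0
    model = SpectralClustering(n_clusters=n_clusters, affinity='precomputed')
    return model.fit_predict(W)
```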
2.2 Low-Rank Representation (LRR)

LRR can be regarded as a special type of sparse representation in which, rather than computing the sparse representation of each data point individually, the global structure of the data is computed collectively through the low rank representation of the whole set of data points. The low rank measurement has long been utilized in matrix completion from corrupted or missing data [35], [36]. Specifically for clustering applications, it has been proved that, when a high-dimensional data set is actually composed of data from a union of several low dimensional subspaces, the LRR model can reveal the subspace structure underlying the data samples [11]. It has also been proved that LRR performs well against the main challenges in subspace clustering, such as data corrupted by noise or outliers, the absence of prior knowledge of the subspace parameters, and the lack of theoretical guarantees for the optimality of clustering methods [11], [13], [37].

The general LRR model can be formulated as the following optimization problem:

\min_{E,Z} \|E\|_{\ell}^{2} + \lambda\|Z\|_{*} \quad \text{s.t.}\; Y = YZ + E,   (3)

where Z is the low rank representation of the data set Y by itself. Here the low rank constraint is achieved by approximating the rank with the nuclear norm ‖·‖_*, which is defined as the sum of the singular values of a matrix and is the lower envelope of the rank function of matrices [11].

Although the current LRR method performs well in subspace clustering, it relies on the Euclidean distance for measuring the similarity of the raw data. However, this measurement is not suitable for high-dimensional data with an embedded low dimensional manifold structure. To characterize the local geometry of data on an unknown manifold, the LapLRR methods [19], [38] use the graph Laplacian matrix derived from the data objects as a regularization term for the LRR model to represent the nonlinear structure of high dimensional data, while the reconstruction error of the revised model is still computed in Euclidean space.
2.3 Grassmann Manifold

In recent years, Grassmann manifold has attracted great interest in the computer vision research community. Although Grassmann manifold itself is quite abstract, it can be well represented as a matrix quotient manifold, and its Riemannian geometry has been investigated for algorithmic computation [30].

Grassmann manifold G(p,d) [30] is the space of all p-dimensional linear subspaces of R^d for 0 ≤ p ≤ d. A point on Grassmann manifold is a p-dimensional subspace of R^d, which can be represented by any orthonormal basis X = [x_1, x_2, ..., x_p] ∈ R^{d×p}. The chosen orthonormal basis is called a representative of the subspace S = span(X). Grassmann manifold G(p,d) is in one-to-one correspondence with a quotient manifold of the Stiefel manifold on R^{d×p}, see [30].

Grassmann manifold has the nice property that it can be embedded into the space of symmetric matrices via a projection embedding, i.e. we can embed Grassmann manifold G(p,d) into the space of d×d symmetric positive semi-definite matrices Sym_+(d) by the following mapping, see [32],

\Pi : G(p,d) \to \mathrm{Sym}_{+}(d), \quad \Pi(X) = XX^{T}.   (4)

The embedding Π(X) is a diffeomorphism [30] (a one-to-one continuous and differentiable mapping with a continuous and differentiable inverse). It is then reasonable to replace the distance on Grassmann manifold with the following distance defined on the symmetric matrix space under this mapping,

d_{g}(X_{1}, X_{2}) = \|\Pi(X_{1}) - \Pi(X_{2})\|_{F}.   (5)
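As a minimal illustration of the embedding (4) and the distance (5), the following Python/NumPy sketch maps an orthonormal basis to its symmetric matrix representative and measures the embedding distance between two subspaces; the function names are ours.

```python
import numpy as np

def embed(X):
    """Projection embedding Pi(X) = X X^T of eq. (4); X is a d x p orthonormal basis."""
    return X @ X.T

def grassmann_distance(X1, X2):
    """Embedding distance d_g(X1, X2) = ||X1 X1^T - X2 X2^T||_F of eq. (5)."""
    return np.linalg.norm(embed(X1) - embed(X2), 'fro')
```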
This property has been used in subspace analysis, learning and representation [39]–[41]. Sparse coding and dictionary learning within the space of symmetric positive definite matrices have been investigated by kernelizing methods [32]. For clustering applications, the mean shift method was discussed on Stiefel and Grassmann manifolds in [42]. Recently, a new version of the K-means method, constructed by a statistical modeling approach, was proposed to cluster Grassmann points [43]. These works try to extend clustering methods from Euclidean space to more practical situations on nonlinear spaces. Along this direction, we further explore subspace clustering problems on Grassmann manifold and try to establish a novel and feasible LRR model on Grassmann manifold.

3 LRR ON GRASSMANN MANIFOLDS

3.1 LRR on Grassmann Manifolds

In the current LRR model (3), the data reconstruction error is generally computed in the original data domain. For example, the common form of the reconstruction error is the Frobenius norm, i.e. the error term can be chosen as follows,

\|E\|_{F}^{2} = \|Y - YZ\|_{F}^{2} = \sum_{i=1}^{N} \Big\|y_{i} - \sum_{j=1}^{N} z_{ji} y_{j}\Big\|_{F}^{2},   (6)

where the data matrix Y = [y_1, y_2, ..., y_N] ∈ R^{D×N}.

As mentioned above, many high dimensional data have intrinsic manifold structures. To extend an LRR model to manifold-valued data, two issues have to be resolved: (i) the model error should be measured in terms of the manifold geometry, and (ii) the linear relationship has to be re-interpreted, because the linear relation defined by Y = YZ + E in (3) is no longer valid on a manifold.

In the extrinsic strategy mentioned in Section 1, one gets around this difficulty by using the Log map of a manifold to lift points (data) on the manifold onto the tangent space at a data point. This idea has been applied for clustering and dimensionality reduction on manifolds in [44], [45] and recently for LRR on Stiefel and SPD manifolds [46], [47].

In this paper, instead of using the Log map tool, we extend the LRR model onto Grassmann manifold by using the Embedding Strategy. Given a set of Grassmann points {X_1, X_2, ..., X_N} on Grassmann manifold G(p,d), we mimic the classical LRR defined in (3) and (6) as follows

\min_{Z} \sum_{i=1}^{N} \|X_{i} \ominus (\uplus_{j=1}^{N} X_{j} \odot z_{ji})\|_{G} + \lambda\|Z\|_{*},   (7)

where ⊖, ⊎ and ⊙ are only dummy operators to be specified soon, and ‖X_i ⊖ (⊎_{j=1}^{N} X_j ⊙ z_{ji})‖_G measures the error between the point X_i and its "reconstruction" ⊎_{j=1}^{N} X_j ⊙ z_{ji}. Thus, to obtain an LRR model on Grassmann manifold, we have to define a proper distance and proper operators for the manifold.

Based on the property (4) of Grassmann manifold, we have an easy way to replace the manifold distance in the LRR model on Grassmann manifold with the distance of the embedded space, as follows,

\|X_{i} \ominus (\uplus_{j=1}^{N} X_{j} \odot z_{ji})\|_{G} = d_{g}\big(X_{i}, \uplus_{j=1}^{N} X_{j} \odot z_{ji}\big).

This error measure not only avoids using the Log map operator but also admits simple computation with the F-norm.

Additionally, the mapping (4) maps a Grassmann point to a point in the d×d symmetric positive semi-definite matrix space Sym_+(d), in which there is a linear combination operation if the coefficients are restricted to be positive.
So it is intuitive to replace the Grassmann points with their mapped points to implement the combination in (7), i.e.

\uplus_{j=1}^{N} X_{j} \odot z_{ji} = \sum_{j=1}^{N} z_{ji}\,(X_{j}X_{j}^{T}), \quad \text{for } i = 1, 2, \ldots, N.

Furthermore, we stack all the symmetric matrices X_i X_i^T as the frontal slices of a 3rd order tensor X, i.e., X(:,:,i) = X_i X_i^T; then all the N linear relations above can be simply written as X ×_3 Z, where ×_3 denotes the mode-3 multiplication of a tensor and a matrix, see [48]. Thus the self-representation in (3) can be represented by

X = X \times_{3} Z + E,

where E is the error tensor. The representation is illustrated in Fig. 2.

Fig. 2. The GLRR Model. After mapping the points on Grassmann manifold, the tensor X, with each slice being a symmetric matrix, can be represented by a linear combination of itself. The element z_{ij} of Z represents the similarity between slices i and j.
trated in Fig. 2. for (10).
Inthefollowingsubsections,wegivetwoLRRmodels Intuitively, the tensor calculation can be converted
on Grassmann manifold for two types of noise cases. to matrix operation by tensorial matricization, see [48].
For example, we can matricize the tensor X ∈ Rd×d×N
3.1.1 LRRonGrassmannManifoldwithGaussianNoise in mode-3 and obtain a matrix X ∈ RN×(d∗d) of N
(3)
(GLRR-F)[33] data points (in rows). So it seems that the problem
For the completion of this paper, we include our has been solved using the method of the standard LRR
prior work reported in the conference paper [33]. This model. However, as the dimension d ∗ d is often too
LRR model on Grassmann manifold, based on the error large in practical problems, the existing LRR algorithm
measurement defined in (5), is defined as follows, could break down. To avoid this matter, we carefully
analyze the representation of the construction tensor
min(cid:107)E(cid:107)2 +λ(cid:107)Z(cid:107)
F ∗ error terms and convert the optimization problems to
E,Z (8)
s.t.X =X × Z+E. its equivalent and readily solvable optimization model.
3
In the following two subsections, we will give the detail
The Frobenius norm here is adopted because of the of these solutions.
assumption that the model fits to Gaussian noise. We
call this model the Frobenius norm constrained GLRR 3.2.1 Algorithm for the Frobenius Norm Constrained
(GLRR-F). In this case, the error term in (8) is GLRRModel
N We follow the notation used in [33]. By using variable
(cid:88)
(cid:107)E(cid:107)2 = (cid:107)E(:,:,i)(cid:107)2, (9) elimination, we can convert problem (8) into the follow-
F F
i=1 ing problem
where E(:,:,i) = XiXiT − (cid:80)N zij(XjXjT) is the i-th slice mZin(cid:107)X −X ×3Z(cid:107)2F +λ(cid:107)Z(cid:107)∗. (12)
j=1
of E, which is the error between the symmetric ma- We note that (XTX ) has a small dimension p×p which
j i
trix XiXiT and its reconstruction of linear combination is easy to handle. Denote
N
(cid:80) zji(XjXjT). ∆ij =tr(cid:2)(XjTXi)(XiTXj)(cid:3), (13)
j=1
and the N ×N symmetric matrix
3.1.2 LRR on Grassmann Manifold with (cid:96) /(cid:96) Noise
2 1
(GLRR-21) ∆=[∆ij]. (14)
When there exist outliers in the data set, the Gaussian Then we have the following Lemma.
noisemodelisnolongerafavoredchoice.Therefore,we
Lemma 1. Given a set of matrices {X ,X ,...,
propose using the so-called (cid:107)·(cid:107) noise model, which 1 2
(cid:96)2/(cid:96)1 X } s.t. X ∈ Rd×p and XTX = I, if ∆ = [∆ ] ∈
is used to cope with signal oriented gross errors in LRR RNN×N withielement ∆ = tri(cid:2)(Xi TX )(XTX )(cid:3), thijeni,jthe
clustering applications [11]. Similar to the above GLRR- ij j i i j
matrix ∆ is semi-positive definite.
F model, we formulate the (cid:107)·(cid:107) norm constrained
(cid:96)2/(cid:96)1
GLRR model (GLRR-21) as follows, Proof: Please refer to [33].
From Lemma 1, we have the eigenvector decomposi-
min(cid:107)E(cid:107) +λ(cid:107)Z(cid:107)
E,Z (cid:96)2/(cid:96)1 ∗ (10) tion for ∆ defined by ∆=UDUT, where UTU =I and
s.t. X =X ×3Z+E, D = diag(σi) with nonnegative eigenvalues σi. Denote
Denote the square root of Δ by Δ^{1/2} = U D^{1/2} U^T; then it is not hard to prove that problem (12) is equivalent to the following problem

\min_{Z} \|Z\Delta^{\frac{1}{2}} - \Delta^{\frac{1}{2}}\|_{F}^{2} + \lambda\|Z\|_{*}.   (15)

Finally we have

Theorem 2. Given Δ = UDU^T as defined above, the solution to (15) is given by

Z^{*} = U D_{\lambda} U^{T},

where D_λ is a diagonal matrix with its i-th diagonal element defined by

D_{\lambda}(i,i) = 1 - \frac{\lambda}{\sigma_{i}} \ \text{if } \sigma_{i} > \lambda, \quad \text{and } 0 \ \text{otherwise}.

Proof: Please refer to the proof of Lemma 1 in [49].

According to Theorem 2, the main cost of solving the LRR on Grassmann manifold problem (8) is (i) the computation of the symmetric matrix Δ and (ii) an SVD of Δ. This is a significant improvement over the algorithm presented in [33].
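The whole GLRR-F solver therefore amounts to a few lines of linear algebra. The sketch below, in Python/NumPy, follows (13), (14) and Theorem 2; the function name is ours, and an eigen-decomposition is used in place of an SVD since Δ is symmetric positive semi-definite (Lemma 1).

```python
import numpy as np

def glrr_f(Xs, lam):
    """Closed-form GLRR-F solution (Theorem 2).

    Xs  : list of N orthonormal bases X_i of shape (d, p), one Grassmann point each.
    lam : the balancing parameter lambda.
    Returns the low-rank representation Z* = U D_lambda U^T."""
    N = len(Xs)
    Delta = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            M = Xs[j].T @ Xs[i]            # p x p, cheap to form
            Delta[i, j] = np.sum(M * M)     # tr[(X_j^T X_i)(X_i^T X_j)] = ||X_j^T X_i||_F^2
    sigma, U = np.linalg.eigh(Delta)        # Delta is symmetric PSD (Lemma 1)
    d = np.zeros_like(sigma)
    mask = sigma > lam
    d[mask] = 1.0 - lam / sigma[mask]       # spectral shrinkage of Theorem 2
    return (U * d) @ U.T
```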
3.2.2 Algorithm for the ℓ2/ℓ1 Norm Constrained GLRR Model

Now we turn to the GLRR-21 problem (10). Because of the ℓ2/ℓ1 norm in the error term, the objective function is not differentiable, although it is convex. We propose using the alternating direction method (ADM) to solve this problem.

Firstly, we construct the following augmented Lagrangian function:

L(E, Z, \xi) = \|E\|_{\ell_{2}/\ell_{1}} + \lambda\|Z\|_{*} + \langle \xi, X - X \times_{3} Z - E \rangle + \frac{\mu}{2}\|X - X \times_{3} Z - E\|_{F}^{2},   (16)

where ⟨·,·⟩ is the standard inner product of two tensors of the same order, ξ is the Lagrange multiplier, and μ is the penalty parameter.

Specifically, the ADM iteration for minimizing (16) goes as follows:

E^{k+1} = \arg\min_{E} L(E, Z^{k}, \xi^{k}) = \arg\min_{E} \|E\|_{\ell_{2}/\ell_{1}} + \langle \xi^{k}, X - X \times_{3} Z^{k} - E \rangle + \frac{\mu^{k}}{2}\|X - X \times_{3} Z^{k} - E\|_{F}^{2},   (17)

Z^{k+1} = \arg\min_{Z} L(E^{k+1}, Z, \xi^{k}) = \arg\min_{Z} \lambda\|Z\|_{*} + \langle \xi^{k}, X - X \times_{3} Z - E^{k+1} \rangle + \frac{\mu^{k}}{2}\|X - X \times_{3} Z - E^{k+1}\|_{F}^{2},   (18)

\xi^{k+1} = \xi^{k} + \mu^{k}\,[X - X \times_{3} Z^{k+1} - E^{k+1}],   (19)

where we have used an adaptive parameter μ^k. The adaptive rule will be specified later in Algorithm 1.

The above ADM is appealing only if we can find closed form solutions to the subproblems (17) and (18).

Consider problem (17) first. Denote C^k = X − X ×_3 Z^k and, for any 3-order tensor A, use A(i) as shorthand for the i-th slice A(:,:,i) along the 3-mode. Then we observe that (17) is separable in terms of the matrix variables E(i) as follows:

E^{k+1}(i) = \arg\min_{E(i)} \|E(i)\|_{F} + \langle \xi^{k}(i), C^{k}(i) - E(i) \rangle + \frac{\mu^{k}}{2}\|C^{k}(i) - E(i)\|_{F}^{2} = \arg\min_{E(i)} \|E(i)\|_{F} + \frac{\mu^{k}}{2}\Big\|C^{k}(i) - E(i) + \frac{1}{\mu^{k}}\xi^{k}(i)\Big\|_{F}^{2}.   (20)

From [11], we know that the problem in (20) has a closed form solution, given by

E^{k+1}(i) = 0 \ \text{if } M < \frac{1}{\mu^{k}}, \quad E^{k+1}(i) = \Big(1 - \frac{1}{M\mu^{k}}\Big)\Big(C^{k}(i) + \frac{1}{\mu^{k}}\xi^{k}(i)\Big) \ \text{otherwise},   (21)

where M = \|C^{k}(i) + \frac{1}{\mu^{k}}\xi^{k}(i)\|_{F}.

Now consider problem (18). Denote

f(Z) = \langle \xi^{k}, X - X \times_{3} Z - E^{k+1} \rangle + \frac{\mu^{k}}{2}\|X - X \times_{3} Z - E^{k+1}\|_{F}^{2};

then problem (18) becomes

Z^{k+1} = \arg\min_{Z} \lambda\|Z\|_{*} + f(Z).   (22)

We adopt a linearization method to solve the above problem.

For this purpose, we first utilize the matrices in each slice to compute the tensor operations in the definition of f(Z). For the i-th slice of the first term in f(Z), we have

\langle \xi^{k}(i), X_{i}X_{i}^{T} - \sum_{j=1}^{N} z_{ji} X_{j}X_{j}^{T} - E^{k+1}(i) \rangle = -\sum_{j=1}^{N} z_{ji}\,\mathrm{tr}(\xi^{k}(i)^{T} X_{j}X_{j}^{T}) + \mathrm{tr}\big(\xi^{k}(i)^{T}(X_{i}X_{i}^{T} - E^{k+1}(i))\big).

Define a new matrix by

\Phi^{k} = \big[\mathrm{tr}(\xi^{k}(i)^{T} X_{j}X_{j}^{T})\big]_{i,j};

then the first term in f(Z) has the following representation:

\langle \xi^{k}, X - X \times_{3} Z - E^{k+1} \rangle = -\mathrm{tr}(\Phi^{k} Z^{T}) + \mathrm{const}.   (23)
For the i-th slice of the second term of f(Z), we have

\|X_{i}X_{i}^{T} - \sum_{j=1}^{N} z_{ji} X_{j}X_{j}^{T} - E^{k+1}(i)\|_{F}^{2}
= \mathrm{tr}\big((X_{i}X_{i}^{T})^{T} X_{i}X_{i}^{T}\big) + \mathrm{tr}\big(E^{k+1}(i)^{T} E^{k+1}(i)\big)
+ \sum_{j_{1}=1}^{N}\sum_{j_{2}=1}^{N} z_{j_{1}i}\, z_{j_{2}i}\, \mathrm{tr}\big((X_{j_{1}}X_{j_{1}}^{T})^{T}(X_{j_{2}}X_{j_{2}}^{T})\big)
- 2\,\mathrm{tr}\big((X_{i}X_{i}^{T})^{T} E^{k+1}(i)\big)
- 2\sum_{j=1}^{N} z_{ji}\, \mathrm{tr}\big((X_{j}X_{j}^{T})^{T}(X_{i}X_{i}^{T} - E^{k+1}(i))\big).

Denoting a matrix by

\Psi^{k} = \big[\mathrm{tr}(E^{k+1}(i)^{T} X_{j}X_{j}^{T})\big]_{i,j}

and noting (14), we have

\|X - X \times_{3} Z - E^{k+1}\|_{F}^{2} = \mathrm{tr}(Z\Delta Z^{T}) - 2\,\mathrm{tr}\big((\Delta - \Psi^{k})Z\big) + \mathrm{const}.   (24)

Combining (23) and (24), we have

f(Z) = \frac{\mu^{k}}{2}\mathrm{tr}(Z\Delta Z^{T}) - \mu^{k}\,\mathrm{tr}\Big(\big(\Delta - \Psi^{k} + \frac{1}{\mu^{k}}\Phi^{k}\big)Z\Big) + \mathrm{const}.

Thus we have

\partial f(Z) = \mu^{k} Z\Delta - \mu^{k}\Big(\Delta - \Psi^{k} + \frac{1}{\mu^{k}}\Phi^{k}\Big)^{T}.

Finally, we can use the following linearized proximity approximation to replace (22):

Z^{k+1} = \arg\min_{Z} \lambda\|Z\|_{*} + \langle \partial f(Z^{k}), Z - Z^{k} \rangle + \frac{\eta\mu^{k}}{2}\|Z - Z^{k}\|_{F}^{2} = \arg\min_{Z} \lambda\|Z\|_{*} + \frac{\eta\mu^{k}}{2}\Big\|Z - Z^{k} + \frac{\partial f(Z^{k})}{\eta\mu^{k}}\Big\|_{F}^{2},   (25)

with a constant η > ‖X‖², where ‖X‖ is the matrix norm of the mode-3 matricization of the tensor X. The new problem (25) has a closed form solution, see [50],

Z^{k+1} = U_{z}\, S_{\frac{\lambda}{\eta\mu^{k}}}(\Sigma_{z})\, V_{z}^{T},   (26)

where U_z Σ_z V_z^T is the SVD of Z^k − \frac{\partial f(Z^{k})}{\eta\mu^{k}} and S_τ(·) is the Singular Value Thresholding (SVT) operator defined by

S_{\tau}(\Sigma) = \mathrm{diag}\big(\mathrm{sgn}(\Sigma_{ii})\,\max(|\Sigma_{ii}| - \tau,\, 0)\big).

Finally, the procedure for solving the ℓ2/ℓ1 norm constrained GLRR problem (10) is summarized in Algorithm 1. For the self-completeness of the paper, we quote the convergence analysis for Algorithm 1 from [51] without proof.

Theorem 3. If μ^k is non-decreasing and upper bounded and η > ‖X‖², then the sequence {(Z^k, E^k, ξ^k)} generated by Algorithm 1 converges to a KKT point of problem (10).

Algorithm 1 Low-Rank Representation on Grassmann Manifold with ℓ2/ℓ1 Norm Constraint.
Input: The Grassmann sample set {X_i}_{i=1}^N, X_i ∈ G(p,d), the cluster number k and the balancing parameter λ.
Output: The low-rank representation Z.
1: Initialize: Z^0 = 0, E^0 = ξ^0 = 0, ρ_0 = 1.9, η > ‖X‖², μ^0 = 0.01, μ_max = 10^{10}, ε_1 = 10^{-4} and ε_2 = 10^{-4};
2: Prepare Δ according to (13);
3: while not converged do
4:   Update E^{k+1} according to (21);
5:   Update Z^{k+1} according to (26);
6:   Update ξ^{k+1} according to (19);
7:   Update μ^{k+1} by the rule μ^{k+1} ← min{ρ^k μ^k, μ_max}, where ρ^k = ρ_0 if (μ^k/‖X‖) max{√η ‖Z^{k+1} − Z^k‖_F, ‖E^{k+1} − E^k‖_F} ≤ ε_2, and ρ^k = 1 otherwise;
8:   Check the convergence conditions: ‖X − X ×_3 Z^{k+1} − E^{k+1}‖/‖X‖ ≤ ε_1 and (μ^k/‖X‖) max{√η ‖Z^{k+1} − Z^k‖_F, ‖E^{k+1} − E^k‖_F} ≤ ε_2;
9: end while
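Both ADM sub-steps thus reduce to simple proximal operations. The following Python/NumPy sketch shows the two building blocks (function names are ours): the slice-wise Frobenius shrinkage that realizes the E-update (21), and the singular value thresholding used in the Z-update (26).

```python
import numpy as np

def prox_frobenius(B, tau):
    """argmin_E ||E||_F + 1/(2*tau) * ||E - B||_F^2: shrink the whole slice B by tau.
    With B = C^k(i) + xi^k(i)/mu^k and tau = 1/mu^k this is exactly the update (21)."""
    nrm = np.linalg.norm(B, 'fro')
    if nrm <= tau:
        return np.zeros_like(B)
    return (1.0 - tau / nrm) * B

def svt(A, tau):
    """Singular Value Thresholding S_tau applied to A, as used in the Z-update (26)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```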
4 KERNELIZED LRR ON GRASSMANN MANIFOLD

4.1 Kernels on Grassmann Manifold

In this section, we consider the kernelization of the GLRR-F model. In fact, the LRR model on Grassmann manifold (8) can be regarded as a kernelized LRR with the kernel feature mapping Π defined by (4). It is not surprising that Δ is positive semi-definite, as it serves as a kernel matrix. It is therefore natural to further generalize GLRR-F based on kernel functions on Grassmann manifold.

A number of Grassmann kernel functions have been proposed in recent years in the computer vision and machine learning communities, see [31], [41], [52], [53]. For simplicity, we focus on the following kernels:

1. Projection Kernel: This kernel is defined in [41]. For any two Grassmann points X_i and X_j, the kernel value is

k_{p}(X_{i}, X_{j}) = \|X_{i}^{T}X_{j}\|_{F}^{2} = \mathrm{tr}\big((X_{i}X_{i}^{T})^{T}(X_{j}X_{j}^{T})\big).

The feature mapping of this kernel is actually the mapping defined in (4).

2. Canonical Correlation Kernel: Referring to [41], this kernel is based on the cosine values of the so-called principal angles between two subspaces, defined as follows
\cos(\theta_{m}) = \max_{u_{m} \in \mathrm{span}(X_{i})}\ \max_{v_{m} \in \mathrm{span}(X_{j})} u_{m}^{T} v_{m},
\text{s.t.}\ \|u_{m}\|_{2} = \|v_{m}\|_{2} = 1;\ \ u_{m}^{T}u_{k} = 0,\ k = 1, 2, \ldots, m-1;\ \ v_{m}^{T}v_{l} = 0,\ l = 1, 2, \ldots, m-1.

We can use the largest canonical correlation value (the cosine of the first principal angle) as the kernel value, as done in [54], i.e.,

k_{cc}(X_{i}, X_{j}) = \max_{x_{i} \in \mathrm{span}(X_{i})}\ \max_{x_{j} \in \mathrm{span}(X_{j})} \frac{x_{i}^{T}x_{j}}{\|x_{i}\|_{2}\|x_{j}\|_{2}}.

The cosines of the principal angles of two subspaces can be calculated by the SVD, as discussed in [55] (see Theorem 2.1 there). Consider two subspaces span(X_i) and span(X_j) as two Grassmann points, where X_i and X_j are given bases. If we take the SVD

X_{i}^{T}X_{j} = U\Sigma V^{T},

then the values on the diagonal of Σ are the cosine values of all the principal angles. The kernel k_cc(X_i, X_j) uses partial information regarding the two subspaces. To increase its performance in our LRR, in this paper we use the sum of all the diagonal values of Σ as the kernel value between X_i and X_j. We still call this revised version the canonical correlation kernel.
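Both kernels can be read off the singular values of X_i^T X_j. The following Python/NumPy sketch (function names are ours) computes the projection kernel and the revised canonical correlation kernel for two orthonormal bases.

```python
import numpy as np

def projection_kernel(Xi, Xj):
    """k_p(X_i, X_j) = ||X_i^T X_j||_F^2 (the projection kernel of [41])."""
    return np.linalg.norm(Xi.T @ Xj, 'fro') ** 2

def canonical_correlation_kernel(Xi, Xj):
    """Revised canonical correlation kernel: the sum of the cosines of all principal
    angles, i.e. the sum of the singular values of X_i^T X_j."""
    s = np.linalg.svd(Xi.T @ Xj, compute_uv=False)
    return s.sum()
```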
4.2 Kernelized LRR on Grassmann Manifold

Let k be any kernel function on Grassmann manifold. According to kernel theory [56], there exists a feature mapping φ such that

\phi : G(p,n) \to F,

where F is the relevant feature space under the given kernel k.

Given a set of points {X_1, X_2, ..., X_N} on Grassmann manifold G(p,n), we define the following LRR model

\min_{Z} \|\phi(X) - \phi(X)Z\|_{F}^{2} + \lambda\|Z\|_{*}.   (27)

We call the above model the Kernelized LRR on Grassmann manifold, denoted by KGLRR, and write KGLRR-cc and KGLRR-p for k = k_cc and k = k_p, respectively.

However, for KGLRR-p, the above model (27) becomes the LRR model (12). Denote by K the N×N kernel matrix over all the data points X_i. Using a derivation similar to that in [33], we can prove that the model (27) is equivalent to

\min_{Z} -2\,\mathrm{tr}(KZ) + \mathrm{tr}(ZKZ^{T}) + \lambda\|Z\|_{*},

which is equivalent to

\min_{Z} \|ZK^{\frac{1}{2}} - K^{\frac{1}{2}}\|_{F}^{2} + \lambda\|Z\|_{*},   (28)

where K^{1/2} is the square root matrix of the kernel matrix K. So the kernelized model KGLRR-p is similar to the GLRR-F model in Section 3.

It has been proved that using multiple kernel functions improves performance in many application scenarios [56], owing to the virtues of different kernel functions for complex data. So, in practice, we can employ different kernel functions to implement the model in (27), and we can even adopt a combined kernel function. For example, in our experiments, we use a combination of the above two kernel functions, k_cc and k_p, as follows,

k_{ccp}(X_{i}, X_{j}) = \alpha\, k_{cc}(X_{i}, X_{j}) + (1-\alpha)\, k_{p}(X_{i}, X_{j}),

where 0 < α < 1 is a hand-assigned combination coefficient. We denote the kernelized LRR model with k = k_ccp by KGLRR-ccp.

4.3 Algorithm for KGLRR

It is straightforward to use Theorem 2 to solve (28). For the sake of convenience, we present the algorithm below.

Let us take the eigenvector decomposition of the kernel matrix K,

K = UDU^{T},

where D = diag(σ_1, σ_2, ..., σ_N) is the diagonal matrix of all the eigenvalues. Then the solution to (28) is given by

Z^{*} = U D_{\lambda} U^{T},

where D_λ is the diagonal matrix with elements defined by

D_{\lambda}(i,i) = 1 - \frac{\lambda}{\sigma_{i}} \ \text{if } \sigma_{i} > \lambda, \quad \text{and } 0 \ \text{otherwise}.

This algorithm is valid for any kernel function on Grassmann manifold.
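Putting the pieces together, the KGLRR-ccp solver is a direct extension of the GLRR-F solver: build the kernel matrix with the combined kernel and apply the same spectral shrinkage. The sketch below is ours and assumes Python/NumPy; the default α = 0.5 is only a placeholder for the hand-assigned coefficient.

```python
import numpy as np

def kglrr_ccp(Xs, lam, alpha=0.5):
    """Closed-form KGLRR-ccp solution to (28) with the combined kernel k_ccp.

    Xs    : list of N orthonormal bases X_i of shape (d, p).
    lam   : the balancing parameter lambda.
    alpha : combination coefficient in (0, 1), hand-assigned in practice."""
    N = len(Xs)
    K = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            s = np.linalg.svd(Xs[i].T @ Xs[j], compute_uv=False)  # cosines of principal angles
            # k_cc = sum of singular values, k_p = ||X_i^T X_j||_F^2 = sum of squared singular values
            K[i, j] = alpha * s.sum() + (1.0 - alpha) * np.sum(s ** 2)
    sigma, U = np.linalg.eigh(K)            # K is symmetric
    d = np.zeros_like(sigma)
    mask = sigma > lam
    d[mask] = 1.0 - lam / sigma[mask]       # shrinkage as in Theorem 2
    return (U * d) @ U.T
```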
5 EXPERIMENTS

To evaluate the performance of the proposed methods, GLRR-21, GLRR-F/KGLRR-p and KGLRR-ccp, we select various public datasets of different types to conduct clustering experiments. These datasets are challenging for clustering applications. We divide them into four types:
• Face or expression image sets, including the Extended Yale B face dataset (http://vision.ucsd.edu/content/yale-face-database) and the BU-3DEF expression dataset (http://www.cs.binghamton.edu/~lijun/Research/3DFE/3DFE Analysis.html).
• Large scale object image sets, including the Caltech 101 dataset (http://www.vision.caltech.edu/feifeili/Datasets.htm) and the ImageNet 2012 dataset (http://www.image-net.org/download-images).
• Human action datasets, including the Ballet dataset (https://www.cs.sfu.ca/research/groups/VML/semilatent/) and the SKIG dataset (http://lshao.staff.shef.ac.uk/data/SheffieldKinectGesture.htm).
• Traffic scene video clip sets, including the Highway Traffic Dataset (http://www.svcl.ucsd.edu/projects/traffic/) and a traffic road dataset we collected.
The proposed methods are compared with the benchmark spectral clustering methods, Sparse Subspace Clustering (SSC) [2] and Low-Rank Representation (LRR) [11], and with several state-of-the-art clustering methods concerned with manifolds, including Statistical Computations on Grassmann and Stiefel Manifolds (SCGSM) [43], Sparse Manifold Clustering and Embedding (SMCE) [57] and Latent Space Sparse Subspace Clustering (LS3C) [58]. In the sequel, we first describe the experiment setting, and then report and analyze the clustering results on these datasets.

5.1 Experiment Setting

Our GLRR model is designed to cluster Grassmann points, which are subspaces rather than raw object/signal vectors (points). Thus, before running the main components of GLRR and the spectral clustering algorithm (here we use the NCut algorithm), we must form subspaces from the raw signals. Generally, a subspace can be represented by an orthonormal basis, so we utilize the samples drawn from the same subspace to construct its orthonormal basis. Similar to the work in [41], [59], we simply adopt the SVD to construct subspace bases. Concretely, given a set of images, denoted by {Y_i}_{i=1}^P with each Y_i of size m×n pixels, we construct a matrix Γ = [vec(Y_1), vec(Y_2), ..., vec(Y_P)] of size (m·n)×P by vectorizing all the images. Then Γ is decomposed by the SVD as Γ = UΣV. We pick the first p (≤ P) singular vectors X = [u_1, u_2, ..., u_p] of U to represent the entire image set as a point X on the Grassmann manifold G(p, m·n).
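The construction of a Grassmann point from an image set is summarized in the following Python/NumPy sketch; the function name is ours.

```python
import numpy as np

def grassmann_point(images, p):
    """Represent an image set as a point on G(p, m*n), following Section 5.1.

    images : list of P grayscale images, each an m x n array.
    p      : subspace dimension (p <= P).
    Returns the (m*n) x p orthonormal basis X of the image set."""
    Gamma = np.column_stack([img.reshape(-1) for img in images])   # (m*n) x P
    U, _, _ = np.linalg.svd(Gamma, full_matrices=False)
    return U[:, :p]
```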
The setting of the model parameters affects the performance of the proposed methods. λ is the most important penalty parameter, balancing the error term and the low-rank term. Empirically, the best value of λ varies widely between applications and has to be chosen from a large range of values to obtain good performance in a particular application. From our experiments, we have observed that the best λ decreases as the cluster number increases. Additionally, λ should be smaller when the noise level in the data is lower and larger when the noise level is higher. These observations are useful for selecting a proper λ value for different datasets. The error tolerance ε is also an important parameter: it controls the termination condition by bounding the allowed reconstruction error. We experimentally seek a proper value of ε so that the iteration stops at an appropriate level of reconstruction error.

Both SSC and LRR demand inputs in vector form, so the subspace form of the points on Grassmann manifold cannot be used directly. To compare our methods with SSC and LRR, we therefore vectorize each image set to construct their inputs, i.e. we "vectorize" a set of images into one long vector by stacking all the vectors of the raw data in the image set in a particular order, e.g. the frame order. However, in most of the experiments we cannot simply take these long vectors, because of their high dimensionality for larger image sets. In this case, we apply PCA to reduce the vectors to a low dimension that equals either the subspace dimension of the Grassmann manifold or the number of PCA components retaining 95% of the variance energy. The PCA projected vectors are then taken as inputs for SSC and LRR.

In our experiments, the performance of the different algorithms is measured by the following clustering accuracy:

\text{Accuracy} = \frac{\text{number of correctly classified points}}{\text{total number of points}} \times 100\%.
The error tolerance ε is also an important parameter the basis of subspace corresponding to each image set.
in controlling the terminal condition, which bounds the The dimension of subspace p = 4. Thus the Grassmann
allowed reconstructed error. We experimentally seek a point X ∈ G(4,400) in this experiment. For SSC and
proper value of ε to make the iteration process stop at LRR methods, the original vector of an image set has
an appropriate level of reconstructed error. dimension of 20×20×4=1600 or 20×20×8=3200 for
For both SSC and LRR methods, which demand the P =4and8,respectively.Here,wereducethedimension
vector form of inputs, the subspace form of points to 146 by PCA.
Fig. 3. Some samples from the Extended Yale B dataset (the first two rows) and the BU-3DEF dataset (the last two rows).

The experiment results are shown in Table 1. Most results with P = 8 are clearly better than those with P = 4; in fact, the larger P is, the better the performance. With more images in a set, the impact of outlier images, such as dark faces or unusual expressions, is reduced. However, a larger P may also introduce more variance to be fitted by the subspace. Compared with the other manifold based methods, SCGSM, SMCE and LS3C, the excellent performance of our methods is due to the incorporation of the low rank constraint on the similarity matrix Z. Finally, we also note that the performance of LRR and SSC is much worse than that of all the manifold based methods, which demonstrates that incorporating manifold properties can also improve clustering performance.

2) BU-3DEF dataset

The BU-3DEF dataset collects 3D face models and face images from 100 subjects of different genders, races and ages. Each subject has 25 face images: one neutral image and six expressions (happiness, disgust, fear, anger, surprise and sadness), each at four levels of intensity. In our experiment, we use these expression images for clustering. They are normalized and centered at a fixed size of 24×20. Fig. 3 shows some samples of the BU-3DEF dataset.

For each expression, we randomly select P = 6 images to construct an image set and in total obtain 100 image sets. Then a Grassmann point X ∈ G(4,400) is created for each image set by using the SVD. There are C = 6 classes of expressions for clustering. For SSC and LRR, the original vectors with dimension 24×20×6 = 2880 are reduced to dimension 274 by PCA.

Table 2 presents the experiment results, which show that all the methods perform poorly on this challenging dataset. Analyzing this dataset reveals that some images of different expressions from one person have only very little distinction, while some images of the same expression from different persons have a strong distinction, which makes it difficult to find a common feature to represent the same expression across different persons. It is therefore not surprising that all the methods perform badly on this dataset. Yet the manifold based methods are superior to the other methods, and in particular our methods produce a clustering accuracy at least 4 percent better than the other methods.

5.3 Clustering on Large Scale Object Image Sets

Large scale datasets are challenging for clustering methods. When the cluster number of the data increases, the performance of many state-of-the-art clustering methods drops dramatically. In this set of experiments, our intention is to test our methods on two large scale object image sets, the Caltech 101 dataset and the ImageNet 2012 dataset.

1) Caltech 101 dataset

The Caltech 101 dataset contains pictures of objects from 101 categories, and each category has about 40 to 800 images. The objects are generally centered in the images. All images are converted to gray scale and rescaled to a size of 20×20. Fig. 4 shows some samples of the Caltech 101 dataset.

In each category, we randomly select P = 4 images to construct an image set. The image sets are then converted to Grassmann points X ∈ G(4,400), i.e. p = 4. For both SSC and LRR, the subspace vectors with dimension 20×20×4 = 1600 are reduced to 249 by PCA.

To evaluate the robustness of our proposed methods with large cluster numbers, we test the cases of C = 10, 20, 30, 40 and 50. Table 3 shows the experiment results for the different methods. As can be seen from this table, in all cases our methods outperform the other state-of-the-art methods by at least 10 percent and remain stable as the number of clusters increases. It also shows that, when the number of clusters is higher, GLRR-F is slightly better than KGLRR-ccp. It is worth noting that our methods are also robust to the complex backgrounds contained in the Caltech 101 images.

2) ImageNet 2012 dataset

The ImageNet 2012 dataset is a widely used object database for image retrieval, which contains more than 1.2 million object images from over 1000 categories. This database is more difficult for clustering due to the large number of categories. In addition, most objects are small and not centered in the images, and many objects have severe occlusions. We extract the region of interest from each image using the bounding boxes defined by image-net.org and resize the region to 20×20. Some samples of ImageNet are shown in Fig. 4.

Many categories in this dataset are quite similar to each other, and the orientation and scale of the objects change largely across images, so it is hard for an algorithm to achieve high accuracy for classification or clustering. To test the performance of our methods in the presence of these variations, we set only a mild value of C = 10 in this experiment, i.e. we randomly select 10 classes of objects. In each class, Grassmann points on G(2,400) are constructed by using the SVD for image sets, each of which