Pose Invariant Embedding for Deep Person Re-identification

Liang Zheng†, Yujia Huang‡, Huchuan Lu§, Yi Yang†
†University of Technology Sydney  ‡CMU  §Dalian University of Technology
{liangzheng06,yee.i.yang}@gmail.com  [email protected]  [email protected]
Abstract

Pedestrian misalignment, which mainly arises from detector errors and pose variations, is a critical problem for a robust person re-identification (re-ID) system. With bad alignment, the background noise will significantly compromise the feature learning and matching process. To address this problem, this paper introduces the pose invariant embedding (PIE) as a pedestrian descriptor. First, in order to align pedestrians to a standard pose, the PoseBox structure is introduced, which is generated through pose estimation followed by affine transformations. Second, to reduce the impact of pose estimation errors and information loss during PoseBox construction, we design a PoseBox fusion (PBF) CNN architecture that takes the original image, the PoseBox, and the pose estimation confidence as input. The proposed PIE descriptor is thus defined as the fully connected layer of the PBF network for the retrieval task. Experiments are conducted on the Market-1501, CUHK03, and VIPeR datasets. We show that PoseBox alone yields decent re-ID accuracy, and that when integrated in the PBF network, the learned PIE descriptor produces competitive performance compared with the state-of-the-art approaches.

Figure 1. Examples of misalignment correction by PoseBox. Row 1: original bounding boxes with detection errors/occlusions. Every consecutive two boxes correspond to the same person. Row 2: corresponding PoseBoxes. We observe that misalignment can be corrected to some extent.

1. Introduction

This paper studies the task of person re-identification (re-ID). Given a probe (person of interest) and a gallery, we aim to find in the gallery all the images containing the same person as the probe. We focus on the identification problem, a retrieval task in which each probe has at least one ground truth in the gallery [42]. A number of factors affect the re-ID accuracy, such as detection/tracking errors and variations in illumination, pose, and viewpoint.

A critical influencing factor on re-ID accuracy is the misalignment of pedestrians, which can be attributed to two causes. First, pedestrians naturally take on various poses, as shown in Fig. 1. Pose variations imply that the position of the body parts within the bounding box is not predictable. For example, it is possible that one's hands reach above the head, or that one is riding a bicycle instead of being upright. The second cause of misalignment is detection error. As illustrated in the second row of Fig. 1, detection errors may lead to severe vertical misalignment.

When pedestrians are poorly aligned, the re-ID accuracy can be compromised. For example, a common practice in re-ID is to partition the bounding box into horizontal stripes [20, 42, 1, 21]. This method works under the assumption of slight vertical misalignment. But when vertical misalignment does happen, as in the cases in Row 2 of Fig. 1, one's head will be matched to the background of a misaligned image. So horizontal stripes may be less effective when severe misalignment happens. In another example, under various pedestrian poses, the background may be incorrectly weighted by the feature extractors and thus affect the following matching accuracy.

To our knowledge, two previous works [8, 7] from the same group explicitly consider the misalignment problem. In both works, the pictorial structure (PS) is used, which shares a similar motivation and construction process with PoseBox, and the retrieval process mainly relies on matching the normalized body parts. While the idea of constructing normalized poses is similar, our work locates body joints using a state-of-the-art CNN based pose estimator, and the components of PoseBox are different from PS, as evidenced by large-scale evaluations. Another difference of our work is the matching procedure.
While [8, 7] do not discuss the pose estimation errors that prevalently exist in real-world datasets, we show that these errors make rigid feature learning/matching with only the PoseBox yield inferior results to the original image, and that the three-stream PoseBox fusion network effectively alleviates this problem.

Figure 2. Information loss and pose estimation errors that occur during PoseBox construction. Row 1: important pedestrian details (highlighted in red bounding boxes) may be missing in the PoseBox. Row 2: pose estimation errors deteriorate the quality of PoseBoxes. For each image pair, the original image and its PoseBox are on the left and right, respectively.

Considering the above-mentioned problems and the limitations of previous methods, this paper proposes the pose invariant embedding (PIE) as a robust visual descriptor. Two steps are involved. First, we construct a PoseBox for each pedestrian bounding box. PoseBox depicts a pedestrian with a standardized upright stance. Carefully designed with the help of pose estimators [34], PoseBox aims to produce well-aligned pedestrian images so that the learned feature can find the same person under intensive pose changes. Trained alone using a standard CNN architecture [37, 41, 44], we show that PoseBox yields very decent re-ID accuracy.

Second, to reduce the impact of information loss and pose estimation errors (Fig. 2) during PoseBox construction, we build a PoseBox fusion (PBF) CNN model with three streams as input: the PoseBox, the original image, and the pose estimation confidence. PBF achieves a globally optimized tradeoff between the original image and the PoseBox. PIE is thus defined as the FC activations of the PBF network. On several benchmark datasets, we show that the joint training procedure yields competitive re-ID accuracy to the state of the art. To summarize, this paper has three contributions.

• Minor contribution: the PoseBox is proposed, which shares a similar nature with a previous work [8]. It enables well-aligned pedestrian matching, and yields satisfying re-ID performance when used alone.

• Major contribution: the pose invariant embedding (PIE) is proposed as a part of the PoseBox Fusion (PBF) network. PBF fuses the original image, the PoseBox, and the pose estimation confidence, thus providing a fallback mechanism when pose estimation fails.

• Using PIE, we report competitive re-ID accuracy on the Market-1501, CUHK03, and VIPeR datasets.

2. Related Work

Pose estimation. The pose estimation research has shifted from traditional approaches [8, 7] to deep learning following the pioneering work "DeepPose" [30]. Some recent methods employ multi-scale features and study mechanisms on how to combine them [29, 26]. It is also effective to inject spatial relationships between body joints by regularizing the unary scores and pairwise comparisons [11, 27]. This paper adopts the convolutional pose machines (CPM) [34], a state-of-the-art pose estimator with multiple stages and successive pose predictions.

Deep learning for re-ID. Due to its superior performance, deep learning based methods have been dominating the re-ID community in the past two years. In the two earlier works [20, 39], the siamese model, which takes two images as input, is used. In later works, this model is improved in various ways, such as injecting more sophisticated spatial constraints [1, 6], modeling the sequential properties of body parts using LSTM [32], and mining discriminative matching parts for different image pairs [31]. It is pointed out in [43] that the siamese model only uses weak re-ID labels (two images being of the same person or not), and it is suggested that an identification model which fully uses the strong re-ID labels is superior. Several previous works adopt the identification model [37, 36, 41]. In [41], the video frames are used as training samples of each person class, and in [37], effective neurons are discovered for each training domain and a new dropout strategy is proposed. The architecture proposed in [36] is more similar to the PBF model in our work. In [36], hand-crafted low-level features are concatenated after a fully connected (FC) layer which is connected to the softmax layer. Our network is similar to [36] in that confidence scores of pose estimation are concatenated with the other two FC layers. It departs from [36] in that our network takes three streams as input, two of which are raw images.

Poses for re-ID. Although pose changes have been mentioned by many previous works as an influencing factor on re-ID, only a handful of reports can be found discussing the connection between them. Farenzena et al. [12] propose to detect the symmetrical axis of different body parts and extract features following the pose variation. In [35], rough estimates of the upper-body orientation are provided by the HOG detector, and the upper body is then rendered into the texture of an articulated 3D model. Bak et al. [3] further classify each person into three pose types: front, back, and side. A similar idea is exploited in [9], where four pose types are used. Both works [3, 9] apply view-point specific distance metrics according to different testing pose pairs. The closest works to PoseBox are [8, 7], which construct the pictorial structure (PS), a similar concept to PoseBox. They use traditional pose estimators and hand-crafted descriptors that are inferior to CNN features by a large margin. Our work employs a full set of stronger techniques, and designs a more effective CNN structure, as evidenced by the competitive re-ID accuracy on large-scale datasets.
3. Proposed Method

3.1. PoseBox Construction

The construction of PoseBox has two steps, i.e., pose estimation and PoseBox projection.

Pose estimation. This paper adopts the off-the-shelf model of the convolutional pose machines (CPM) [34]. In a nutshell, CPM is a sequential convolutional architecture that enforces intermediate supervision to prevent vanishing gradients. A set of 14 body joints are detected, i.e., head, neck, left and right shoulders, left and right elbows, left and right wrists, left and right hips, left and right knees, and left and right ankles, as shown in the second column of Fig. 3.

Figure 3. PoseBox construction. Given an input image, the pedestrian pose is estimated by CPM [34]. Ten body parts can then be discovered through the body joints. Three types of PoseBoxes are built from the body parts. PoseBox 1: torso + legs; PoseBox 2: PoseBox 1 + arms; PoseBox 3: PoseBox 2 + head.

Body part discovery and affine projection. From the detected joints, 10 body parts can be depicted (the third column of Fig. 3). The parts include the head, the torso, the upper and lower arms (left and right), and the upper and lower legs (left and right), which almost cover the whole body. These quadrilateral parts are projected to rectangles using affine transformations.

In more detail, the head is defined by the joints of the head and neck, and we manually set the width of each head box to 2/3 of its height (from head to neck). An upper arm is confined by the shoulder and elbow joints, and the lower arm by the elbow and wrist joints. The width of the arm boxes is set to 20 pixels. Similarly, the upper and lower legs are defined by the hip and knee joints, and the knee and ankle joints, respectively. Their widths are both 30 pixels. The torso is confined by four body joints, i.e., the two shoulders and the two hips, so we simply draw a quadrangle for the torso. Due to pose estimation errors, the affine transformation may encounter singular values. So in practice, we add some small random disturbance when the pose estimation confidence of a body part is below a threshold (set to 0.4).

Three types of PoseBoxes. In several previous works discussing the performance of different parts, a common observation is that the torso and legs make the largest contributions [8, 1, 6]. This is expected because the most distinguishing features exist in the upper-body and lower-body clothes. Based on the existing observations, this paper builds three types of PoseBoxes as described below.

• PoseBox 1. It consists of the torso and two legs. A leg is comprised of the upper and the lower legs. PoseBox 1 includes the two most important body parts and is a baseline for the other two PoseBox types.

• PoseBox 2. Based on PoseBox 1, we further add the left and right arms. An arm includes the upper and lower arm sub-modules. In our experiments we show that PoseBox 2 is superior to PoseBox 1 due to the enriched information brought by the arms.

• PoseBox 3. On the basis of PoseBox 2, we put the head box on top of the torso box. It is shown in [8] that the inclusion of the head brings a marginal performance increase. In our case, we find that PoseBox 3 is slightly inferior to PoseBox 2, probably because of the frequent head/neck estimation errors.

Remarks. The advantage of PoseBox is two-fold. First, the pose variations can be corrected. Second, background noise can be largely removed.

PoseBox is also limited in two aspects. First, pose estimation errors often happen, leading to imprecisely detected joints. Second, PoseBox is designed manually, so it is not guaranteed to be optimal in terms of information loss or re-ID accuracy. We address these two problems by a fusion scheme to be introduced in Section 3.3. For the second problem, specifically, we note that we construct PoseBoxes manually because current re-ID datasets do not provide ground truth poses, without which it is not trivial to design an end-to-end learning method to automatically generate normalized poses.
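To make the projection step above concrete, the following is a minimal Python sketch of how one PoseBox part could be built. The inputs are hypothetical: joints maps CPM joint names to (x, y) pixel coordinates and conf maps them to confidence scores. The 0.4 confidence threshold, the random disturbance, and the use of affine transformations follow the text; the canvas size and layout are our own illustrative choices, not the authors' specification.

import numpy as np
import cv2

def affine_to_rect(img, quad, out_w, out_h):
    # Map the four corners of a quadrilateral body part onto an out_w x out_h
    # rectangle with a least-squares affine fit (the paper projects
    # quadrilateral parts to rectangles with affine transformations).
    src = np.asarray(quad, dtype=np.float32)                     # 4 x 2 corners
    dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
    src_h = np.hstack([src, np.ones((4, 1), dtype=np.float32)])  # homogeneous coords
    A, _, _, _ = np.linalg.lstsq(src_h, dst, rcond=None)         # 3 x 2 solution
    return cv2.warpAffine(img, A.T, (out_w, out_h))

def torso_quad(joints, conf, thr=0.4, jitter=2.0):
    # Torso quadrangle from the two shoulders and the two hips. When a joint's
    # confidence is below the threshold, a small random disturbance keeps the
    # affine fit from becoming singular, as described in the text.
    pts = []
    for name in ("r_shoulder", "l_shoulder", "l_hip", "r_hip"):
        p = np.array(joints[name], dtype=np.float32)
        if conf[name] < thr:
            p = p + np.random.uniform(-jitter, jitter, size=2)
        pts.append(p)
    return np.stack(pts)

def build_posebox1(img, joints, conf):
    # PoseBox 1: torso + legs pasted into a fixed upright canvas. The canvas
    # layout here is illustrative; leg boxes would be added analogously from
    # the hip/knee/ankle joints using 30-pixel-wide quadrilaterals.
    canvas = np.zeros((128, 64, 3), dtype=img.dtype)
    canvas[0:64, :] = affine_to_rect(img, torso_quad(joints, conf), 64, 64)
    return canvas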
3.2. Baselines

This paper constructs two baselines based on the original pedestrian image and the PoseBox, respectively. According to the results in the recent survey [43], the identification model [19] outperforms the verification model [1, 20] significantly on the Market-1501 dataset [42]: the former makes full use of the re-ID labels, i.e., the identity of each bounding box, while the latter only uses weak labels, i.e., whether two boxes belong to the same person. So in this paper we adopt the identification CNN model (Fig. 4). Specifically, this paper uses the standard AlexNet [19] and ResNet-50 [15] architectures. We refer readers to the respective papers for detailed network descriptions.

Figure 4. The baseline identification CNN model used in this paper. The AlexNet [19] or ResNet-50 [15] with softmax loss is used. The FC activations are extracted for Euclidean-distance testing.

During training, we employ the default parameter settings, except that the last FC layer is edited to have the same number of neurons as the number of distinct IDs in the training set. During testing, given an input image resized to 224×224, we extract the FC7/FC8 activations for AlexNet, and the Pool5/FC activations for ResNet-50. After ℓ2 normalization, we use the Euclidean distance to perform person retrieval in the testing set. With respect to the input image type, two baselines are used in this paper (a short sketch of this train/test recipe is given after the two baselines):

• Baseline 1: the original image (resized to 224×224) is used as input to the CNN during training and testing.

• Baseline 2: the PoseBox (resized to 224×224) is used as input to the CNN during training and testing. Note that only one PoseBox type is used each time.
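Below is a brief sketch of the baseline recipe above, written with PyTorch/torchvision for illustration only (the paper itself uses Caffe); the data loading and fine-tuning loop are omitted, and num_ids stands for the number of training identities.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

num_ids = 751                                    # distinct IDs in the training set
net = models.resnet50()
net.fc = nn.Linear(net.fc.in_features, num_ids)  # last FC layer resized to the number of IDs
# ... fine-tune `net` with a softmax (cross-entropy) loss on the training identities ...

def extract_pool5(net, images):
    # 2,048-dim Pool5 features: run the backbone without its classifier,
    # then apply l2 normalization as described above.
    backbone = nn.Sequential(*list(net.children())[:-1])
    with torch.no_grad():
        feats = backbone(images).flatten(1)
    return F.normalize(feats, p=2, dim=1)

def rank_gallery(query_feat, gallery_feats):
    # Sort gallery entries by Euclidean distance to the query feature.
    d = torch.cdist(query_feat.unsqueeze(0), gallery_feats).squeeze(0)
    return torch.argsort(d)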
3.3. The PoseBox Fusion (PBF) Network

Motivation. During PoseBox construction, pose estimation errors and information loss may happen, leading to compromised quality of the PoseBox (see Fig. 2). On the one hand, pose estimation errors often happen, as we use an off-the-shelf pose estimator (which is usually the case in practical usage). As illustrated in Fig. 5 and Fig. 1, pose estimation may fail when the detections have missing parts or the pedestrian images are of low resolution. On the other hand, when cropping human parts from a bounding box, it is inevitable that important details are missed out, such as bags and umbrellas (Fig. 2). The failure in constructing high-quality PoseBoxes and the information loss during part cropping may result in compromised results for baseline 2. This is confirmed in the experiment that baseline 1 yields superior re-ID accuracy to baseline 2.

For the first problem, i.e., the pose estimation errors, we can mostly foretell the quality of pose estimation by resorting to the confidence scores (examples can be seen in Fig. 5). Under high estimation confidence, we envision fine quality of the generated PoseBox. But when the pose estimation confidence scores are low for some body parts, it may be expected that the constructed PoseBox has poor quality. For the second problem, the missing visual cues can be rescued by re-introducing the original image, so that the discriminative details are captured by the deep network.

Figure 5. Examples of pose estimation errors and the confidence scores. Upper: four pedestrian bounding boxes named (a), (b), (c), and (d), and their pose estimation results. Lower: pose estimation confidence scores of the four images. A confidence vector consists of 14 numbers corresponding to the 14 body joints. We highlight the correctly detected joints in green.

Network. Given the above considerations, this paper proposes a three-stream PoseBox Fusion (PBF) network which takes the original image, the PoseBox, and the confidence vector as input (see Fig. 6). To leverage the ImageNet [10] pre-trained models, the two image inputs, i.e., the original image and the PoseBox, are resized to 256×256 (then cropped randomly to 227×227) for AlexNet [19] and to 224×224 for ResNet-50 [15]. The third input, i.e., the pose estimation confidence scores, is a 14-dim vector, in which each entry falls within the range [0, 1].

The two image inputs are fed to two CNNs of the same structure. Due to the content differences between the original image and its PoseBox, the two streams of convolutional layers do not share weights, although they are initialized from the same seed model. The FC6 and FC7 layers are connected to these convolutional layers. For the confidence vector, we add a small FC layer which projects the 14-dim vector to a 14-dim FC vector. We concatenate the three inputs at the FC7 layer, which is further fully connected to FC8. The sum of the three Softmax losses is used for loss computation. When ResNet-50 [15] is used instead of AlexNet, Fig. 6 does not have the FC6 layers, and the FC7 and FC8 layers are known as Pool5 and FC.

In Fig. 6, as denoted by the green bounding boxes, the pose invariant embedding (PIE) can either be the concatenated FC7 activations (4,096+4,096+14 = 8,206-dim) or its next fully connected layer (751-dim and 1,160-dim for Market-1501 and CUHK03, respectively). For AlexNet, we denote the two PIE descriptors as PIE(A, FC7) and PIE(A, FC8), respectively; for ResNet-50, they are termed PIE(R, Pool5) and PIE(R, FC), respectively.
Figure 6. Illustration of the PoseBox Fusion (PBF) network using AlexNet. The network inputs, i.e., the original image, its PoseBox, and the pose estimation confidence, are highlighted in bold. The former two undergo convolutional layers before the fully connected (FC) layers. The confidence vector undergoes one FC layer, before the three FC7 layers are concatenated and fully connected to FC8. SoftMax loss is used. Two alternatives of the PIE descriptor are highlighted by green boxes. For AlexNet and Market-1501, PIE(A, FC7) is 8,206-dim and PIE(A, FC8) is 751-dim; for ResNet-50, there would be no FC6, and PIE(R, Pool5) is 4,110-dim and PIE(R, FC) is 751-dim.
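As an illustration of the three-stream design in Fig. 6, below is a rough PyTorch-style sketch of the PBF forward pass with the AlexNet dimensions from the text (n1 = 4,096, n2 = 14, n3 = number of identities). It is our own re-implementation sketch, not the authors' Caffe model; the backbone wrappers and layer handling are assumptions.

import torch
import torch.nn as nn
import torchvision.models as models

class PBF(nn.Module):
    def __init__(self, num_ids=751, n1=4096, n2=14):
        super().__init__()
        # Two streams of the same structure that do not share weights.
        self.img_stream = models.alexnet()
        self.pb_stream = models.alexnet()
        # Keep AlexNet up to FC7 by replacing its 1000-way classifier.
        self.img_stream.classifier[-1] = nn.Identity()
        self.pb_stream.classifier[-1] = nn.Identity()
        # Small FC layer projecting the 14-dim confidence vector to 14-dim.
        self.conf_fc = nn.Linear(14, n2)
        # One FC8 classifier per stream plus one on the fused FC7 vector.
        self.fc8_img = nn.Linear(n1, num_ids)
        self.fc8_pb = nn.Linear(n1, num_ids)
        self.fc8_fused = nn.Linear(n1 + n1 + n2, num_ids)

    def forward(self, img, posebox, conf):
        f_img = self.img_stream(img)       # FC7 activations of the original image
        f_pb = self.pb_stream(posebox)     # FC7 activations of the PoseBox
        f_conf = self.conf_fc(conf)        # projected confidence vector
        pie_fc7 = torch.cat([f_img, f_pb, f_conf], dim=1)  # 8,206-dim PIE(A, FC7)
        logits = (self.fc8_img(f_img),
                  self.fc8_pb(f_pb),
                  self.fc8_fused(pie_fc7))  # fused logits correspond to PIE(A, FC8)
        return pie_fc7, logits

def pbf_loss(logits, labels):
    # Sum of the three softmax (cross-entropy) losses, as described in the text.
    ce = nn.CrossEntropyLoss()
    return sum(ce(l, labels) for l in logits)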
During training, batches of the input triplets (the original image, its PoseBox, and the confidence vector) are fed into PBF, and the sum of the three losses is back-propagated to the convolutional layers. The ImageNet pretrained model initializes both the original image and PoseBox streams.

During testing, given the three inputs of an image, we extract PIE as the descriptor. Note that we apply ReLU on the extracted embeddings, which produces superior results according to our preliminary experiment. Then the Euclidean distance is used to calculate the similarity between the probe and gallery images, before a sorted rank list is produced.

PBF has three advantages. First, the confidence vector is an indicator of whether the PoseBox is reliable. This improves the learning ability of PBF as a static embedding network, so that a global tradeoff between the PoseBox and the original image can be found. Second, the original image not only enables a fallback mechanism when pose estimation fails, but also retains the pedestrian details that may be lost during PoseBox construction but are useful in discriminating identities. Third, the PoseBox provides important complementary cues to the original image. Using the correctly predicted joints, pedestrian matching can be more accurate with the well-aligned images. The influence of detection errors and pose variations can thus be reduced.

4. Experiment

4.1. Dataset

This paper uses three datasets for evaluation, i.e., VIPeR [14], CUHK03 [20], and Market-1501 [42]. The VIPeR dataset contains 632 identities, each having 2 images captured by 2 cameras. It is evenly divided into training and testing sets, each consisting of 316 IDs and 632 images. We perform 10 random train/test splits and calculate the averaged accuracy. The CUHK03 dataset contains 1,360 identities and 13,164 images. Each person is observed by 2 cameras, and on average there are 4.8 images under each camera. We adopt the single-shot mode and evaluate this dataset under 20 random train/test splits. The Market-1501 dataset features 1,501 IDs, 19,732 gallery images and 12,936 training images captured by 6 cameras. Both CUHK03 and Market-1501 are produced by the DPM detector [13]. The Cumulative Matching Characteristics (CMC) curve is used for all three datasets, which encodes the probability that the query person is found within the top n ranks in the rank list. For Market-1501 and CUHK03, we additionally employ the mean Average Precision (mAP), which considers both the precision and recall of the retrieval process [42]. The evaluation toolbox provided by the Market-1501 authors is used.

4.2. Experimental Setups

Our experiments directly employ the off-the-shelf convolutional pose machines (CPM), a multi-stage CNN model trained on the MPII human pose dataset [2]. Default settings are used, with input images resized to 384 × 192. For the PBF network, we replace the convolutional layers with those from either AlexNet [19] or ResNet-50 [15]. When AlexNet is used, n1 = 4,096, n2 = 14, n3 = 751. When ResNet-50 is used, PBF does not have the FC6 layer, and the FC7 layer is denoted by Pool5: n1 = 2,048, n3 = 751. We train the PBF network for 36 epochs.
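For reference, below is a compact numpy sketch of the CMC and mAP protocol described in Section 4.1, under simplifying assumptions: dist is a query-by-gallery distance matrix and any gallery image sharing the query identity counts as a match (the camera and junk filtering applied by the official Market-1501 toolbox is omitted here).

import numpy as np

def cmc_map(dist, query_ids, gallery_ids, max_rank=20):
    num_q = dist.shape[0]
    cmc = np.zeros(max_rank)
    aps = []
    valid = 0
    for i in range(num_q):
        order = np.argsort(dist[i])                    # ascending distance
        matches = (gallery_ids[order] == query_ids[i]).astype(int)
        if matches.sum() == 0:
            continue                                   # no ground truth for this query
        valid += 1
        # CMC: a query scores 1 at rank k if a true match appears in the top k.
        first_hit = int(np.argmax(matches))
        if first_hit < max_rank:
            cmc[first_hit:] += 1
        # AP: average of the precision values at each true-match position.
        hit_pos = np.where(matches == 1)[0]
        precisions = (np.arange(len(hit_pos)) + 1) / (hit_pos + 1)
        aps.append(precisions.mean())
    return cmc / valid, float(np.mean(aps))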
Table 1. Comparison of the proposed method with various baselines. PoseBox 2 is employed here. Baseline 1: training using the original image. Baseline 2: training using the PoseBox. PIE: the proposed pose invariant embedding. A: AlexNet. R: ResNet-50.

Methods | dim | Market-1501 (r1 / r5 / r10 / r20 / mAP) | CUHK03 (r1 / r5 / r10 / r20) | Market-1501→VIPeR (r1 / r5 / r10 / r20)
Baseline 1 (A, FC7) | 4,096 | 55.49 / 76.28 / 83.55 / 88.98 / 32.36 | 57.15 / 83.50 / 90.85 / 95.70 | 17.44 / 31.84 / 41.04 / 51.36
Baseline 1 (A, FC8) | 751 | 53.65 / 75.48 / 82.93 / 88.51 / 31.60 | 58.80 / 85.80 / 91.90 / 96.25 | 17.15 / 32.06 / 41.68 / 51.55
Baseline 1 (R, Pool5) | 2,048 | 73.02 / 87.44 / 91.24 / 94.70 / 47.62 | 51.60 / 79.60 / 87.70 / 95.00 | 23.42 / 42.31 / 51.96 / 63.80
Baseline 1 (R, FC) | 751 | 70.58 / 84.95 / 90.02 / 93.53 / 45.84 | 54.80 / 84.20 / 91.70 / 97.60 | 15.85 / 28.80 / 37.41 / 47.85
Baseline 2 (A, FC7) | 4,096 | 52.22 / 71.53 / 78.95 / 85.04 / 28.95 | 39.90 / 71.40 / 82.30 / 90.00 | 17.28 / 32.59 / 42.25 / 55.09
Baseline 2 (A, FC8) | 751 | 51.10 / 72.24 / 79.48 / 85.60 / 29.91 | 42.30 / 75.05 / 84.35 / 92.00 | 16.04 / 33.45 / 42.66 / 54.97
Baseline 2 (R, Pool5) | 2,048 | 64.49 / 79.48 / 85.07 / 88.95 / 38.16 | 36.90 / 68.40 / 78.70 / 86.70 | 21.11 / 37.18 / 45.89 / 54.34
Baseline 2 (R, FC) | 751 | 62.20 / 78.36 / 83.76 / 88.84 / 37.91 | 41.70 / 72.70 / 84.20 / 92.50 | 15.57 / 26.68 / 33.54 / 41.71
PIE (A, FC7) | 8,206 | 64.61 / 82.07 / 87.83 / 91.75 / 38.95 | 59.80 / 85.35 / 91.85 / 95.85 | 21.77 / 38.04 / 46.61 / 56.61
PIE (A, FC8) | 751 | 65.68 / 82.51 / 87.89 / 91.63 / 41.12 | 62.40 / 88.00 / 93.70 / 96.50 | 18.10 / 31.20 / 38.92 / 49.40
PIE (R, Pool5) | 4,108 | 78.65 / 90.26 / 93.59 / 95.69 / 53.87 | 57.10 / 84.60 / 91.40 / 96.20 | 27.44 / 43.01 / 50.82 / 60.22
PIE (R, FC) | 751 | 75.12 / 88.27 / 92.28 / 94.77 / 51.57 | 61.50 / 89.30 / 94.50 / 97.60 | 23.80 / 37.88 / 47.31 / 56.55
The initial learning rate is set to 0.01 and is reduced by a factor of 10 every 6 epochs. We run the deep learning experiments on a GTX 1080 under the Caffe framework [16], with batch sizes of 32 and 16 for AlexNet and ResNet-50, respectively. For both CNN models, it takes 6-7 hours for the training process to converge on the Market-1501 dataset.

We train PIE on Market-1501 and CUHK03, respectively, which have relatively large data volumes. We also test the generalization ability of PIE on some smaller datasets such as VIPeR. That is, we only extract features using the model pre-trained on Market-1501, and then learn some distance metric on the small datasets.

4.3. Evaluation

Baselines. We first evaluate the two re-ID baselines described in Section 3.2. The results on the three datasets are shown in Table 1. Two major conclusions can be drawn.

First, we observe that very competitive performance can be achieved by baseline 1, i.e., training with the original image. Specifically, on Market-1501, we achieve rank-1 accuracy of 55.49% and 73.02% using AlexNet and ResNet-50, respectively. These numbers are consistent with those reported in [43]. Moreover, we find that FC7 (Pool5) is superior to FC8 (FC) on Market-1501 but the situation reverses on CUHK03. We speculate that the CNN model is trained to be more specific to the Market-1501 training set due to its larger data volume, so retrieval on Market-1501 is more of a transfer task than CUHK03. This is also observed in transferring ImageNet models to other recognition tasks [28].

Second, compared with baseline 1, we can see that baseline 2 is to some extent inferior. On the Market-1501 dataset, for example, the results obtained by baseline 2 are 3.3% and 8.9% lower using AlexNet and ResNet-50, respectively. The performance drop is expected due to the pose estimation errors and information loss mentioned in Section 3.3. Since this paper only employs an off-the-shelf pose estimator, we speculate that in the future the PoseBox baseline can be improved by re-training pose estimation using newly labeled data on the re-ID datasets.

Figure 7. Comparison with various feature combinations on the Market-1501 dataset. ResNet-50 [15] is used. Kissme [18] is used for distance metric learning. "EU": Euclidean distance. "PIE(Pool5,img)" and "PIE(Pool5,pb)" denote the 2,048-dim sub-vectors of the full 4,108-dim PIE(Pool5) vector, corresponding to the image and PoseBox streams of PBF, respectively. "FC(img)" and "FC(pb)" are the 751-dim FC vectors of the image and PoseBox streams of PBF, respectively. "B1" and "B2" represent baselines 1 and 2, respectively, using the 2,048-dim Pool5 features.

The effectiveness of PIE. We test PIE on the re-ID benchmarks, and present the results in Table 1 and Fig. 7. Comparing with baseline 1 and baseline 2, we observe clearly that PIE yields higher re-ID accuracy. On Market-1501, for example, when using AlexNet and the FC7 descriptor, our method exceeds the two baselines by +5.5% and +8.8% in rank-1 accuracy, respectively. With ResNet-50, the improvement becomes slightly smaller, but still arrives at +5.0% and +6.8%, respectively. Specifically, rank-1 accuracy and mAP on Market-1501 arrive at 78.65% and 53.87%, respectively. On CUHK03 and VIPeR, consistent improvement over the baselines can also be observed.
Figure 8. Re-ID accuracy of the three types of PoseBoxes. Results of both the baseline and PIE are presented on the Market-1501 dataset.

Figure 9. Ablation studies on Market-1501. From the "full" model, we remove one component at a time. The removed components include the PoseBox, the original image, the confidence vector, and the two losses of the PoseBox and the original image.

Moreover, Figure 7 shows that Kissme [18] marginally improves the accuracy, proving that the PIE descriptor is well-learned. The concatenation of the Pool5 features of baselines 1 and 2 coupled with Kissme produces lower accuracy compared with "PIE(Pool5)+kissme", illustrating that the PBF network learns more effective embeddings than learning separately. We also find that the 2,048-dim "PIE(Pool5,img)+EU" and "PIE(Pool5,pb)+EU" outperform the corresponding baselines 1 and 2. This suggests that PBF improves the baseline performance probably through the backpropagation of the fused loss.

Comparison of the three types of PoseBoxes. In Section 3.1, three types of PoseBoxes are defined. Their comparison results on Market-1501 are shown in Fig. 8. Our observation is two-fold.

First, PoseBox 2 is superior to PoseBox 1. On the Market-1501 dataset, PoseBox 2 improves the rank-1 accuracy by xx% over PoseBox 1. The inclusion of arms therefore increases the discriminative ability of the system. Since the upper arm typically shares the same color/texture with the torso, we speculate that it is the long/short sleeves that enhance the descriptors. Second, PoseBox 2 has better performance than PoseBox 3 as well. For PoseBox 3, the integration of the head introduces more noise due to the unstable head detection, which deteriorates the overall system performance. Nevertheless, we find in Fig. 8 that the gap between different PoseBoxes decreases after being integrated in PBF. This is because the combination with the original image reduces the impact of estimation errors and information loss, a contribution mentioned in Section 1.

Ablation experiment. To evaluate the effectiveness of different components of PBF, ablation experiments are conducted on the Market-1501 dataset. We remove one component from the full system at a time, including the PoseBox, the original image, the confidence vector, and the two losses of the PoseBox and original image streams. The CMC curves are drawn in Fig. 9, from which three conclusions can be drawn.

First, when the confidence vector or the two losses are removed, the remaining system is inferior to the full model, but displays similar accuracy. The performance drop is approximately 1% in rank-1 accuracy. It illustrates that these two components are important regularization terms. The confidence vector informs the system of the reliability of the PoseBox, thus facilitating the learning process. The two identification losses provide additional supervision to prevent the performance degradation of the two individual streams. Second, after the removal of the stream of the original image ("-img"), the performance drops significantly but still remains superior to baseline 2. Therefore, the original image stream is very important, as it reduces re-ID failures that likely result from pose estimation errors. Third, when the PoseBox stream is cut off ("-PoseBox"), the network is inferior to the full model, but is better than baseline 1. This validates the indispensability of PoseBox, and suggests that the confidence vector improves baseline 1.

Comparison with the state-of-the-art methods. On Market-1501, we compare PIE with the state-of-the-art methods in Table 2. It is clear that our method outperforms these latest results by a large margin. Specifically, we achieve rank-1 accuracy = 77.97% and mAP = 52.76% using the single query mode. To our knowledge, we have set a new state of the art on the Market-1501 dataset.

On CUHK03, comparisons are presented in Table 3. When metric learning is not used, our results are competitive in rank-1 accuracy with recent methods such as [31], but are superior in rank-5, 10, 20, and mAP. When Kissme [18] is employed, we report higher results: rank-1 = 67.10% and mAP = 71.32%, which exceed the current state of the art. We note that in [17], very high results are reported on the hand-drawn subset but no results can be found on the detected set. We also note that metric learning yields smaller improvements on Market-1501 than CUHK03, because the PBF network is better trained on Market-1501 due to its richer annotations.
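Since Kissme [18] is used repeatedly above for metric learning on PIE features, the following is a minimal numpy sketch of the KISSME metric as we understand it; x1 and x2 are arrays of paired features, same marks identity-matched pairs, and in practice the features are usually PCA-reduced first to keep the covariances well conditioned. This is an illustrative re-implementation, not the code used in the paper.

import numpy as np

def kissme(x1, x2, same):
    # Difference vectors of similar and dissimilar pairs.
    d = x1 - x2
    cov_s = np.cov(d[same], rowvar=False)
    cov_d = np.cov(d[~same], rowvar=False)
    M = np.linalg.inv(cov_s) - np.linalg.inv(cov_d)
    # Project M onto the PSD cone so it defines a valid (pseudo-)metric.
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.clip(w, 0, None)) @ V.T

def kissme_dist(M, a, b):
    diff = a - b
    return float(diff @ M @ diff)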
Table 2. Comparison with state of the art on Market-1501.

Methods | rank-1 | rank-5 | rank-10 | rank-20 | mAP
BoW+Kissme [42] | 44.42 | 63.90 | 72.18 | 78.95 | 20.76
WARCA [17] | 45.16 | 68.12 | 76 | 84 | -
Temp. Adapt. [24] | 47.92 | - | - | - | 22.31
SCSP [4] | 51.90 | - | - | - | 26.35
Null Space [40] | 55.43 | - | - | - | 29.87
LSTM Siamese [32] | 61.6 | - | - | - | 35.3
Gated Siamese [31] | 65.88 | - | - | - | 39.55
PIE (Alex) | 65.68 | 82.51 | 87.89 | 91.63 | 41.12
PIE (Res50) | 78.65 | 90.26 | 93.59 | 95.69 | 53.87
+Kissme | 79.33 | 90.76 | 94.41 | 96.52 | 55.95

Table 3. Comparison with state of the art on CUHK03 (detected).

Methods | rank-1 | rank-5 | rank-10 | rank-20 | mAP
BoW+HS [42] | 24.30 | - | - | - | -
Improved CNN [1] | 44.96 | 76.01 | 83.47 | 93.15 | -
XQDA [21] | 46.25 | 78.90 | 88.55 | 94.25 | -
SI-CI [33] | 52.2 | 74.3 | 92.3 | - | -
Null Space [40] | 54.70 | 84.75 | 94.80 | 95.20 | -
LSTM Siamese [32] | 57.3 | 80.1 | 88.3 | - | 46.3
MLAPG [22] | 57.96 | 87.09 | 94.74 | 98.00 | -
Gated Siamese [31] | 61.8 | 80.9 | 88.3 | - | 51.25
PIE (Alex) | 62.60 | 87.05 | 92.50 | 96.30 | 67.91
PIE (Res50) | 61.50 | 89.30 | 94.50 | 97.60 | 67.21
+Kissme | 67.10 | 92.20 | 96.60 | 98.10 | 71.32

Table 4. Comparison with state of the art on VIPeR. The top 6 rows are unsupervised; the bottom rows use supervision.

Methods | rank-1 | rank-5 | rank-10 | rank-20
GOG [25] | 21.14 | 40.34 | 53.29 | 67.21
Enhanced Deep [36] | 15.47 | 34.53 | 43.99 | 55.41
SDALF [23] | 19.87 | 38.89 | 49.37 | 65.73
gBiCov [23] | 17.01 | 33.67 | 46.84 | 58.72
BOW [42] | 21.74 | - | - | 60.85
PIE | 27.44 | 43.01 | 50.82 | 60.22
XQDA [21] | 40.00 | 67.40 | 80.51 | 91.08
MLAPG [22] | 40.73 | - | 82.34 | 92.37
WARCA [17] | 40.22 | 68.16 | 80.70 | 91.14
Null Space [40] | 42.28 | 71.46 | 82.94 | 92.06
SI-CI [33] | 35.8 | 67.4 | 83.5 | -
SCSP [4] | 53.54 | 82.59 | 91.49 | 96.65
Mirror [5] | 42.97 | 75.82 | 87.28 | 94.84
Enhanced [36]+Mirror [5] | 34.87 | 66.68 | 79.30 | 90.38
LSTM Siamese [32] | 42.4 | 68.7 | 79.4 | -
Gated Siamese [31] | 37.8 | 66.9 | 77.4 | -
PIE+Mirror [5]+MFA [38] | 43.29 | 69.40 | 80.41 | 89.94
Fusion+MFA | 54.49 | 84.43 | 92.18 | 96.87

On VIPeR, we extract features using the off-the-shelf PIE model trained on Market-1501, and the comparison is shown in Table 4. We first compare PIE (using the Euclidean distance) with the latest unsupervised methods, e.g., the Gaussian of Gaussian (GOG) [25] and the Bag-of-Words (BOW) [42] descriptors. We use the available code provided by the authors. We observe that PIE exceeds the competing methods in the rank-1, 5, and 10 accuracies. Then, compared with supervised works without feature fusion, our method (coupled with Mirror Representation [5] and MFA [38]) has decent results. We further fuse the PIE descriptor with the pre-computed transferred deep descriptors [36] and the LOMO descriptor [21]. We employ the mirror representation [5] and the MFA distance metric coupled with the Chi Square kernel. The fused system achieves a new state of the art on the VIPeR dataset with rank-1 accuracy = 54.49%.

Two groups of sample re-ID results are shown in Fig. 10. In the first query, for example, the cyan clothes in the background lead to the misjudgement of the foreground characteristics, so that some pedestrians with local green/blue colors incorrectly receive top ranks. Using PIE, the foreground can be effectively cropped, leading to more accurate pedestrian matching.

Figure 10. Sample re-ID results on the Market-1501 dataset. For each query placed on the left, the three rows correspond to the rank lists of baseline 1, baseline 2, and PIE, respectively. Green bounding boxes denote correctly retrieved images.

5. Conclusion

This paper explicitly addresses the pedestrian misalignment problem in person re-identification. We propose the pose invariant embedding (PIE) as a pedestrian descriptor. We first construct the PoseBox with the 14 body joints detected by the convolutional pose machines [34]. PoseBox helps correct the pose variations caused by camera views, person motions and detector errors, and enables well-aligned pedestrian matching. PIE is then learned through the PoseBox fusion (PBF) network, in which the original image is fused with the PoseBox and the pose estimation confidence. PBF reduces the impact of pose estimation errors and detail loss during PoseBox construction. We show that PoseBox yields fair accuracy when used alone and that PIE produces competitive accuracy compared with the state of the art.
References

[1] E. Ahmed, M. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3908–3916, 2015.
[2] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 3686–3693. IEEE, 2014.
[3] S. Bak, F. Martins, and F. Bremond. Person re-identification by pose priors. In SPIE/IS&T Electronic Imaging, pages 93990H–93990H. International Society for Optics and Photonics, 2015.
[4] D. Chen, Z. Yuan, B. Chen, and N. Zheng. Similarity learning with spatial constraints for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1268–1277, 2016.
[5] Y.-C. Chen, W.-S. Zheng, and J. Lai. Mirror representation for modeling view-specific transform in person re-identification. In Proc. IJCAI, pages 3402–3408. Citeseer, 2015.
[6] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1335–1344, 2016.
[7] D. S. Cheng and M. Cristani. Person re-identification by articulated appearance matching. In Person Re-Identification, pages 139–160. Springer, 2014.
[8] D. S. Cheng, M. Cristani, M. Stoppa, L. Bazzani, and V. Murino. Custom pictorial structures for re-identification. In British Machine Vision Conference, 2011.
[9] Y.-J. Cho and K.-J. Yoon. Improving person re-identification via pose-aware multi-shot matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1354–1362, 2016.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[11] X. Fan, K. Zheng, Y. Lin, and S. Wang. Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1347–1355, 2015.
[12] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2360–2367. IEEE, 2010.
[13] B. Fernando, E. Fromont, D. Muselet, and M. Sebban. Discriminative feature fusion for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3434–3441, 2012.
[14] D. Gray, S. Brennan, and H. Tao. Evaluating appearance models for recognition, reacquisition, and tracking. In Proc. IEEE International Workshop on Performance Evaluation for Tracking and Surveillance (PETS), volume 3. Citeseer, 2007.
[15] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, pages 675–678. ACM, 2014.
[17] C. Jose and F. Fleuret. Scalable metric learning via weighted approximate rank component analysis. In European Conference on Computer Vision, 2016.
[18] M. Köstinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof. Large scale metric learning from equivalence constraints. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2288–2295, 2012.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[20] W. Li, R. Zhao, T. Xiao, and X. Wang. DeepReID: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 152–159, 2014.
[21] S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2197–2206, 2015.
[22] S. Liao and S. Z. Li. Efficient PSD constrained asymmetric metric learning for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pages 3685–3693, 2015.
[23] B. Ma, Y. Su, and F. Jurie. BiCov: a novel image representation for person re-identification and face verification. In British Machine Vision Conference, page 11, 2012.
[24] N. Martinel, A. Das, C. Micheloni, and A. K. Roy-Chowdhury. Temporal model adaptation for person re-identification. In European Conference on Computer Vision, 2016.
[25] T. Matsukawa, T. Okabe, E. Suzuki, and Y. Sato. Hierarchical Gaussian descriptor for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1363–1372, 2016.
[26] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, 2016.
[27] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[28] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 806–813, 2014.
[29] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler. Joint training of a convolutional network and a graphical model for human pose estimation. In Advances in Neural Information Processing Systems, pages 1799–1807, 2014.
[30] A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1653–1660, 2014.
[31] R. R. Varior, M. Haloi, and G. Wang. Gated siamese convolutional neural network architecture for human re-identification. In European Conference on Computer Vision, 2016.
[32] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang. A siamese long short-term memory architecture for human re-identification. In European Conference on Computer Vision, 2016.
[33] F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang. Joint learning of single-image and cross-image representations for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[34] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. arXiv preprint arXiv:1602.00134, 2016.
[35] C. Weinrich, M. Volkhardt, and H.-M. Gross. Appearance-based 3D upper-body pose estimation and person re-identification on mobile robots. In 2013 IEEE International Conference on Systems, Man, and Cybernetics, pages 4384–4390. IEEE, 2013.
[36] S. Wu, Y.-C. Chen, X. Li, A.-C. Wu, J.-J. You, and W.-S. Zheng. An enhanced deep feature representation for person re-identification. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–8. IEEE, 2016.
[37] T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep feature representations with domain guided dropout for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[38] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin. Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1):40–51, 2007.
[39] D. Yi, Z. Lei, S. Liao, S. Z. Li, et al. Deep metric learning for person re-identification. In ICPR, volume 2014, pages 34–39, 2014.
[40] L. Zhang, T. Xiang, and S. Gong. Learning a discriminative null space for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[41] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian. MARS: A video benchmark for large-scale person re-identification. In European Conference on Computer Vision, pages 868–884. Springer, 2016.
[42] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, pages 1116–1124, 2015.
[43] L. Zheng, Y. Yang, and A. G. Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
[44] L. Zheng, H. Zhang, S. Sun, M. Chandraker, and Q. Tian. Person re-identification in the wild. arXiv preprint arXiv:1604.02531, 2016.