Published as a conference paper at ICLR 2017

LR-GAN: LAYERED RECURSIVE GENERATIVE ADVERSARIAL NETWORKS FOR IMAGE GENERATION

Jianwei Yang∗ (Virginia Tech, Blacksburg, VA), Anitha Kannan (Facebook AI Research, Menlo Park, CA), Dhruv Batra∗ and Devi Parikh∗ (Georgia Institute of Technology, Atlanta, GA)
[email protected], [email protected], {dbatra, parikh}@gatech.edu
ABSTRACT

We present LR-GAN: an adversarial image generation model which takes scene structure and context into account. Unlike previous generative adversarial networks (GANs), the proposed GAN learns to generate image background and foregrounds separately and recursively, and stitch the foregrounds on the background in a contextually relevant manner to produce a complete natural image. For each foreground, the model learns to generate its appearance, shape and pose. The whole model is unsupervised, and is trained in an end-to-end manner with gradient descent methods. The experiments demonstrate that LR-GAN can generate more natural images with objects that are more human recognizable than DCGAN.

1 INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014) have shown significant promise as generative models for natural images. A flurry of recent work has proposed improvements over the original GAN work for image generation (Radford et al., 2015; Denton et al., 2015; Salimans et al., 2016; Chen et al., 2016; Zhu et al., 2016; Zhao et al., 2016), multi-stage image generation including part-based models (Im et al., 2016; Kwak & Zhang, 2016), image generation conditioned on input text or attributes (Mansimov et al., 2015; Reed et al., 2016b;a), image generation based on 3D structure (Wang & Gupta, 2016), and even video generation (Vondrick et al., 2016).

While the holistic ‘gist’ of images generated by these approaches is beginning to look natural, there is clearly a long way to go. For instance, the foreground objects in these images tend to be deformed, blended into the background, and not look realistic or recognizable.

One fundamental limitation of these methods is that they attempt to generate images without taking into account that images are 2D projections of a 3D visual world, which has a lot of structure in it. This manifests as structure in the 2D images that capture this world. One example of this structure is that images tend to have a background, and foreground objects are placed in this background in contextually relevant ways.
We develop a GAN model that explicitly encodes this structure. Our proposed model generates images in a recursive fashion: it first generates a background, and then, conditioned on the background, generates a foreground along with a shape (mask) and a pose (affine transformation) that together define how the background and foreground should be composed to obtain a complete image. Conditioned on this composite image, a second foreground and an associated shape and pose are generated, and so on. As a byproduct in the course of recursive image generation, our approach generates object-shape foreground-background masks in a completely unsupervised way, without access to any object masks for training. Note that decomposing a scene into foreground-background layers is a classical ill-posed problem in computer vision. By explicitly factorizing appearance and transformation, LR-GAN encodes a natural prior about images: the same foreground can be ‘pasted’ onto different backgrounds, under different affine transformations. According to our experiments, the absence of these priors results in degenerate foreground-background decompositions, and thus also degenerate final composite images.
∗Work was done while visiting Facebook AI Research.
Figure 1: Generation results of our model on CUB-200 (Welinder et al., 2010). It generates images in two timesteps. At the first timestep, it generates background images; at the second timestep, it generates foreground images, masks and transformations. Then, they are composed to obtain the final images. From top left to bottom right (row major), the blocks are real images, generated background images, foreground images, foreground masks, carved foreground images, carved and transformed foreground images, final composite images, and their nearest neighbor real images in the training set. Note that the model is trained in a completely unsupervised manner.
We mainly evaluate our approach on four datasets: MNIST-ONE (one digit) and MNIST-TWO (two digits) synthesized from MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky & Hinton, 2009) and CUB-200 (Welinder et al., 2010). We show qualitatively (via samples) and quantitatively (via evaluation metrics and human studies on Amazon Mechanical Turk) that LR-GAN generates images that globally look natural and contain clear background and object structures that are realistic and recognizable by humans as semantic entities. An experimental snapshot on CUB-200 is shown in Fig. 1. We also find that LR-GAN generates foreground objects that are contextually relevant to the backgrounds (e.g., horses on grass, airplanes in skies, ships in water, cars on streets, etc.). For quantitative comparison, besides existing metrics in the literature, we propose two new quantitative metrics to evaluate the quality of generated images. The proposed metrics are derived from sufficient conditions for the closeness between the generated image distribution and the real image distribution, and thus supplement existing metrics.
2 RELATED WORK
Early work in parametric texture synthesis was based on a set of hand-crafted features (Portilla & Simoncelli, 2000). Recent improvements in image generation using deep neural networks mainly fall into one of two stochastic models: variational autoencoders (VAEs) (Kingma et al., 2016) and generative adversarial networks (GANs) (Goodfellow et al., 2014). VAEs pair a top-down probabilistic generative network with a bottom-up recognition network for amortized probabilistic inference. The two networks are jointly trained to maximize a variational lower bound on the data likelihood. GANs consist of a generator and a discriminator in a minimax game, with the generator aiming to fool the discriminator with its samples and the latter aiming to not get fooled.
Sequential models have been pivotal for improved image generation using variational autoencoders: DRAW (Gregor et al., 2015) uses attention-based recurrence, conditioning on the canvas drawn so far. In Eslami et al. (2016), a recurrent generative model that draws one object at a time onto the canvas was used as the decoder in a VAE. These methods are yet to show scalability to natural images. Early compelling results using GANs used sequential coarse-to-fine multiscale generation and class-conditioning (Denton et al., 2015). Since then, improved training schemes (Salimans et al., 2016) and better convolutional structure (Radford et al., 2015) have improved the generation results using GANs. PixelRNN (van den Oord et al., 2016) has also recently been proposed to sequentially generate one pixel at a time, along the two spatial dimensions.
In this paper, we combine the merits of sequential generation with the flexibility of GANs. Our model for sequential generation imbibes a recursive structure that more naturally mimics image composition by inferring three components: appearance, shape, and pose. One closely related work combining recursive structure with GANs is that of Im et al. (2016), but it does not explicitly model object composition and follows a similar paradigm as Gregor et al. (2015). Another closely related work is that of Kwak & Zhang (2016), which combines recursive structure and alpha blending. However, our work differs in three main ways: (1) We explicitly use a generator for modeling the foreground poses. That provides a significant advantage for natural images, as shown by our ablation studies; (2) Our shape generator is separate from the appearance generator. This factored representation allows more flexibility in the generated scenes; (3) Our recursive framework generates subsequent objects conditioned on the current and previous hidden vectors, and the previously generated object. This allows for explicit contextual modeling among generated elements in the scene. See Fig. 17 for contextually relevant foregrounds generated for the same background, or Fig. 6 for meaningful placement of two MNIST digits relative to each other.
Models that provide supervision to image generation using conditioning variables have also been proposed: Style/Structure GANs (Wang & Gupta, 2016) learn separate generative models for style and structure that are then composed to obtain final images. In Reed et al. (2016a), GAN-based image generation is conditioned on text and the region in the image where the text manifests, specified during training via keypoints or bounding boxes. While not the focus of our work, the model proposed in this paper can be easily extended to take into account these forms of supervision.
3 PRELIMINARIES
3.1 GENERATIVE ADVERSARIAL NETWORKS
Generative Adversarial Networks (GANs) consist of a generator G and a discriminator D that are simultaneously trained with competing goals: the generator G is trained to generate samples that can ‘fool’ the discriminator D, while the discriminator is trained to classify its inputs as either real (coming from the training dataset) or fake (coming from the samples of G). This competition leads to a minimax formulation with a value function:

min_{θ_G} max_{θ_D} ( E_{x∼p_data(x)}[log(D(x; θ_D))] + E_{z∼p_z(z)}[log(1 − D(G(z; θ_G); θ_D))] ),   (1)

where z is a random vector drawn from a standard multivariate Gaussian or a uniform distribution p_z(z), G(z; θ_G) maps z to the data space, and D(x) is the probability that x is real as estimated by D. The advantage of the GAN formulation is that it lacks an explicit loss function and instead uses the discriminator to optimize the generative model. The discriminator, in turn, only cares whether the sample it receives is on the data manifold, and not whether it exactly matches a particular training example (as opposed to losses such as MSE). Hence, the discriminator provides a gradient signal only when the generated samples do not lie on the data manifold, so that the generator can readjust its parameters accordingly. This form of training enables learning the data manifold of the training set and not just optimizing to reconstruct the dataset, as in autoencoders and their variants.
While the GAN framework is largely agnostic to the choice of G and D, it is clear that generative models with the ‘right’ inductive biases will be more effective in learning from the gradient information (Denton et al., 2015; Im et al., 2016; Gregor et al., 2015; Reed et al., 2016a; Yan et al., 2015). With this motivation, we propose a generator that models image generation via a recurrent process – in each timestep of the recurrence, an object with its own appearance and shape is generated and warped according to a generated pose to compose an image in layers.
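To make the minimax objective in Eq. 1 concrete, the sketch below shows one alternating update of D and G, as it is typically implemented with stochastic gradient methods. This is a generic illustration in PyTorch (the released LR-GAN code builds on the Torch dcgan.torch codebase), not the paper's exact training loop; the module names netG, netD, the latent dimension nz, and the assumption that netD ends with a sigmoid are placeholders.

```python
import torch
import torch.nn.functional as F

def gan_step(netG, netD, optG, optD, x_real, nz=100):
    """One alternating update of the minimax game in Eq. 1 (illustrative sketch)."""
    b = x_real.size(0)
    # Latent vectors; reshape if the generator expects (b, nz, 1, 1).
    z = torch.randn(b, nz, device=x_real.device)

    # Discriminator: maximize log D(x) + log(1 - D(G(z))).
    optD.zero_grad()
    d_real = netD(x_real)                      # D(x; theta_D), assumed in (0, 1)
    d_fake = netD(netG(z).detach())            # D(G(z); theta_D), G frozen
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    loss_d.backward()
    optD.step()

    # Generator: minimize log(1 - D(G(z))); in practice maximize log D(G(z)).
    optG.zero_grad()
    d_fake = netD(netG(z))
    loss_g = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    loss_g.backward()
    optG.step()
    return loss_d.item(), loss_g.item()
```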
3.2 LAYERED STRUCTURE OF IMAGE
An image taken of our 3D world typically contains a layered structure. One way of representing an image layer is by its appearance and shape. As an example, an image x with two layers, foreground f and background b, may be factorized as:

x = f ⊙ m + b ⊙ (1 − m),   (2)

where m is the mask depicting the shapes of the image layers, and ⊙ is the elementwise multiplication operator. Some existing methods assume access to the shape of the object either during training (Isola & Liu, 2013) or at both train and test time (Reed et al., 2016a; Yan et al., 2015). Representing images in a layered structure is even more straightforward for video with moving objects (Darrell & Pentland, 1991; Wang & Adelson, 1994; Kannan et al., 2005). Vondrick et al. (2016) generate videos by separately generating a fixed background and moving foregrounds. A similar way of generating a single image can be found in Kwak & Zhang (2016).
Another way is to model the layered structure with object appearance and pose as:

x = ST(f, a) + b,   (3)

where f and b are the foreground and background, respectively; a is the affine transformation; and ST is the spatial transformation operator. Several works fall into this group (Roux et al., 2011; Huang & Murphy, 2015; Eslami et al., 2016). In Huang & Murphy (2015), images are decomposed into layers of objects with specific poses in a variational autoencoder framework, while the number of objects (i.e., layers) is adaptively estimated in Eslami et al. (2016). In contrast with these works, LR-GAN uses a layered composition in which the foreground layers simultaneously model all three dominant factors of variation: appearance f, shape m and pose a. We elaborate on this in the following section.
4 LAYERED RECURSIVE GAN (LR-GAN)
The basic structure of LR-GAN is similar to GAN: it consists of a discriminator and a generator that are simultaneously trained using the minimax formulation of GAN, as described in §3.1. The key innovation of our work is the layered recursive generator, which is what we describe in this section. The generator in LR-GAN is recursive in that the image is constructed recursively using a recurrent network, and layered in that each recursive step composes an object layer that is ‘pasted’ onto the image generated so far. The object layer at timestep t is parameterized by three constituents – a ‘canonical’ appearance f_t, a shape (or mask) m_t, and a pose (or affine transformation) a_t for warping the object before pasting it in the image composition.
Fig. 2 shows the architecture of LR-GAN with the generator unrolled for generating the background x_0 (= x_b) and foregrounds x_1 and x_2. At each timestep t, the generator composes the next image x_t via the following recursive computation:

x_t = ST(m_t, a_t) ⊙ ST(f_t, a_t) + (1 − ST(m_t, a_t)) ⊙ x_{t−1},   ∀t ∈ [1, T]   (4)

where the first term is the affine-transformed mask applied to the affine-transformed appearance, and the second term pastes it onto the image composed so far. ST(·, a_t) is a spatial transformation operator that outputs the affine-transformed version of its argument, with a_t indicating the parameters of the affine transformation.
Since our proposed model has an explicit transformation variable a_t that is used to warp the object, it can learn a canonical object representation that can be re-used to generate scenes where the object occurs as mere transformations of it, such as different scales or rotations. By factorizing appearance, shape and pose, the object generator can focus on separately capturing regularities in these three factors that constitute an object. We will demonstrate in our experiments (Sections 5.5 and 5.6) that removing these factorizations from the model leads to it spending capacity on variability that may not solely be about the object.
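The following sketch implements one recursive composition step of Eq. 4: warp the generated appearance f_t and mask m_t with the affine parameters a_t, then blend them onto the previous canvas x_{t−1}. It is an illustrative PyTorch re-implementation under our own assumptions about tensor shapes (the original code is in Torch), not the authors' exact code.

```python
import torch
import torch.nn.functional as F

def compose_step(x_prev, f_t, m_t, a_t):
    """Eq. 4: x_t = ST(m_t, a_t) * ST(f_t, a_t) + (1 - ST(m_t, a_t)) * x_{t-1}.
    x_prev, f_t: (N, C, H, W); m_t: (N, 1, H, W) with values in (0, 1); a_t: (N, 2, 3) affine params."""
    grid = F.affine_grid(a_t, f_t.size(), align_corners=False)
    f_hat = F.grid_sample(f_t, grid, align_corners=False)   # warped appearance
    m_hat = F.grid_sample(m_t, grid, align_corners=False)   # warped mask; zeros outside the source extent
    return m_hat * f_hat + (1.0 - m_hat) * x_prev
```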
4.1 DETAILS OF GENERATOR ARCHITECTURE
Fig. 2 shows our LR-GAN architecture in detail – we use different shapes to indicate different kinds of layers (convolutional, fractional convolutional, (non)linear, etc.), as indicated by the legend. Our model consists of two main pieces – a background generator G_b and a foreground generator G_f. G_b and G_f do not share parameters with each other. The G_b computation happens only once, while G_f is recurrent over time, i.e., all object generators share the same parameters. In the following, we introduce each module and the connections between them.
Figure 2: LR-GAN architecture unfolded to three timesteps. It mainly consists of one background generator, one foreground generator, temporal connections and one discriminator. The meaning of each component (convolutional layers, fractional convolutional layers, (non)linear and embedding layers, spatial sampler, compositor, LSTM) is explained in the legend.

Temporal Connections. LR-GAN has two kinds of temporal connections – informally speaking, one on ‘top’ and one on ‘bottom’. The ‘top’ connections perform the act of sequentially ‘pasting’ object layers (Eqn. 4). The ‘bottom’ connections are constructed by an LSTM on the noise vectors z_0, z_1, z_2. Intuitively, this noise-vector LSTM provides information to the foreground generator about what else has been generated in the past. Besides, when generating multiple objects, we use a pooling layer P_f^c and a fully-connected layer E_f^c to extract information from the previously generated object response map. In this way, the model is able to ‘see’ previously generated objects.
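A minimal sketch of the ‘bottom’ temporal connection described above: an LSTM run over the noise vectors z_0, z_1, z_2, whose hidden state at each step conditions the corresponding generator. The layer sizes and the use of LSTMCell are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class NoiseLSTM(nn.Module):
    """Runs an LSTM over per-timestep noise vectors; the first hidden state feeds
    the background generator and the later ones feed the recurrent foreground generator."""
    def __init__(self, nz=100, nh=100):
        super().__init__()
        self.cell = nn.LSTMCell(nz, nh)
        self.nh = nh

    def forward(self, z_list):
        n = z_list[0].size(0)
        h = z_list[0].new_zeros(n, self.nh)
        c = z_list[0].new_zeros(n, self.nh)
        hiddens = []
        for z in z_list:              # z_0, z_1, z_2, ...
            h, c = self.cell(z, (h, c))
            hiddens.append(h)
        return hiddens                # hiddens[0] -> G_b, hiddens[t >= 1] -> G_f
```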
Background Generator. The background generator G_b is purposely kept simple. It takes the hidden state h_l^0 of the noise-vector LSTM as input and passes it through a number of fractional convolutional layers (also called ‘deconvolution’ layers in some papers) to generate an image at its end. The output of the background generator x_b is used as the canvas for the subsequently generated foregrounds.
Foreground Generator. The foreground generator G_f is used to generate an object with appearance and shape. Correspondingly, G_f consists of three sub-modules: G_f^c, a common ‘trunk’ whose outputs are shared by G_f^i and G_f^m; G_f^i, which generates the foreground appearance f_t; and G_f^m, which generates the mask m_t for the foreground. All three sub-modules consist of one or more fractional convolutional layers combined with batch normalization and nonlinear layers. The generated foreground appearance and mask have the same spatial size as the background. The top of G_f^m is a sigmoid layer in order to generate a one-channel mask whose values range in (0, 1).
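To illustrate the factored foreground generator, the sketch below shows a common deconvolutional trunk G_f^c whose features are shared by an appearance head G_f^i and a single-channel sigmoid mask head G_f^m (the background generator is a similar but narrower deconvolutional stack). The channel widths and the 32×32 output resolution are assumptions for illustration, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ForegroundGenerator(nn.Module):
    """Common trunk G_f^c shared by appearance head G_f^i and mask head G_f^m (sketch)."""
    def __init__(self, nh=100, ngf=64, nc=3):
        super().__init__()
        self.trunk = nn.Sequential(                      # G_f^c: hidden vector -> 8x8 feature map
            nn.ConvTranspose2d(nh, ngf * 4, 4, 1, 0, bias=False),
            nn.BatchNorm2d(ngf * 4), nn.ReLU(True),
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf * 2), nn.ReLU(True),
        )
        self.appearance = nn.Sequential(                 # G_f^i: 8x8 -> 32x32 RGB appearance f_t
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf), nn.ReLU(True),
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias=False), nn.Tanh(),
        )
        self.mask = nn.Sequential(                       # G_f^m: 8x8 -> 32x32 one-channel mask m_t in (0, 1)
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias=False),
            nn.BatchNorm2d(ngf), nn.ReLU(True),
            nn.ConvTranspose2d(ngf, 1, 4, 2, 1, bias=False), nn.Sigmoid(),
        )

    def forward(self, h):
        feat = self.trunk(h.view(h.size(0), -1, 1, 1))
        return self.appearance(feat), self.mask(feat)
```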
Spatial Transformer. To spatially transform foreground objects, we need to estimate the transformation matrix. As in Jaderberg et al. (2015), we predict the affine transformation matrix with a linear layer T_f that has a six-dimensional output. Based on the predicted transformation matrix, we use a grid generator G_g to generate the corresponding sampling coordinates in the input for each location at the output. The generated foreground appearance and mask share the same transformation matrix, and thus the same sampling grid. Given the grid, the sampler S simultaneously samples f_t and m_t to obtain f̂_t and m̂_t, respectively. Different from Jaderberg et al. (2015), our sampler normally performs downsampling, since the foreground typically has a smaller size than the background. Pixels in f̂_t and m̂_t that fall outside the extent of f_t and m_t are set to zero. Finally, f̂_t and m̂_t are sent to the compositor C, which combines the canvas x_{t−1} and f̂_t through layered composition with blending weights given by m̂_t (Eqn. 4).
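A sketch of the transformation head T_f described above: a single linear layer mapping the hidden vector to the six affine parameters, reshaped to a 2×3 matrix that is shared by the appearance and the mask (and can then be fed to the compose_step sketch above). Initializing the bias to the identity transform is a common practice from the spatial transformer literature, assumed here for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

class AffinePredictor(nn.Module):
    """T_f: predicts 6 affine parameters per object from the hidden vector h (sketch)."""
    def __init__(self, nh=100):
        super().__init__()
        self.fc = nn.Linear(nh, 6)
        self.fc.weight.data.zero_()
        # Bias initialized to the identity transform [[1, 0, 0], [0, 1, 0]] (assumption).
        self.fc.bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, h):
        return self.fc(h).view(-1, 2, 3)   # a_t, shared by f_t and m_t
```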
Pseudo-code for our approach and detailed model configurations are provided in the Appendix.
4.2 NEW EVALUATION METRICS
Several metrics have been proposed to evaluate GANs, such as the Gaussian Parzen window (Goodfellow et al., 2014), the Generative Adversarial Metric (GAM) (Im et al., 2016) and the Inception Score (Salimans et al., 2016). The common goal is to measure the similarity between the generated data distribution P_g(x) = G(z; θ_G) and the real data distribution P(x). Most recently, the Inception Score has been used in several works (Salimans et al., 2016; Zhao et al., 2016). However, it is an asymmetric metric and can easily be fooled by generating centers of data modes.
In addition to these metrics, we present two new metrics based on the following intuition – a sufficient (but not necessary) condition for closeness of P_g(x) and P(x) is closeness of P_g(x|y) and P(x|y), i.e., the distributions of generated data and real data conditioned on all possible variables of interest y, e.g., category label. One way to obtain this variable of interest y is via human annotation. Specifically, given the data sampled from P_g(x) and P(x), we ask people to label the category of the samples according to some rules. Note that such human annotation is often easier than comparing samples from the two distributions (e.g., because there is no 1:1 correspondence between samples to conduct forced-choice tests).
After the annotations, we need to verify whether the two distributions are similar in each category. Clearly, directly comparing the distributions P_g(x|y) and P(x|y) may be as difficult as comparing P_g(x) and P(x). Fortunately, we can use Bayes' rule and alternatively compare P_g(y|x) and P(y|x), which is a much easier task. In this case, we can simply train a discriminative model on the samples from P_g(x) and P(x) together with the human annotations about the categories of these samples. With a slight abuse of notation, we use P_g(y|x) and P(y|x) to denote probability outputs from these two classifiers (trained on generated samples vs. trained on real samples). We can then use these two classifiers to compute the following two evaluation metrics:
Adversarial Accuracy: computes the classification accuracies achieved by these two classifiers on a validation set, which can be the training set or another set of real images sampled from P(x). If P_g(x) is close to P(x), we expect to see similar accuracies.
Adversarial Divergence: computes the KL divergence between P_g(y|x) and P(y|x). The lower the adversarial divergence, the closer the two distributions are. The lower bound for this metric is exactly zero, which means P_g(y|x) = P(y|x) for all samples in the validation set.
As discussed above, we need human effort to label the real and generated samples. Fortunately, we can simplify this further. Based on the labels given on the training data, we split the training data into categories, and train one generator for each category. With all these generators, we generate samples of all categories. This strategy is used in our experiments on the datasets with labels given.
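Both metrics can be computed directly from the outputs of the two classifiers on a held-out set of labeled real images. A minimal sketch is shown below; clf_real and clf_gen stand for the classifiers trained on real and generated samples respectively and are assumed to return class probabilities, and the KL direction shown is one reasonable reading of the definition above.

```python
import torch

@torch.no_grad()
def adversarial_metrics(clf_real, clf_gen, x_val, y_val, eps=1e-8):
    """Adversarial Accuracy and Adversarial Divergence on a labeled validation set (sketch).
    clf_real / clf_gen: classifiers trained on real / generated samples, returning class probabilities."""
    p_real = clf_real(x_val)                              # P(y|x)
    p_gen = clf_gen(x_val)                                # P_g(y|x)

    acc_real = (p_real.argmax(1) == y_val).float().mean().item()
    acc_gen = (p_gen.argmax(1) == y_val).float().mean().item()

    # KL( P(y|x) || P_g(y|x) ), averaged over validation samples.
    kl = (p_real * (torch.log(p_real + eps) - torch.log(p_gen + eps))).sum(1).mean().item()
    return {"adv_acc_real": acc_real, "adv_acc_gen": acc_gen, "adv_divergence": kl}
```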
5 EXPERIMENT
We conduct qualitative and quantitative evaluations on three datasets: 1) MNIST (LeCun et al., 1998); 2) CIFAR-10 (Krizhevsky & Hinton, 2009); 3) CUB-200 (Welinder et al., 2010). To add variability to the MNIST images, we randomly scale (factor of 0.8 to 1.2) and rotate (−π/4 to π/4) the digits and then stitch them onto 48×48 uniform backgrounds with a random grayscale value in [0, 200]. Images are then rescaled back to 32×32. Each image thus has a different background grayscale value and a different transformed digit as foreground. We call this synthesized dataset MNIST-ONE (single digit on a gray background). We also synthesize a dataset MNIST-TWO containing two digits on a grayscale background. We randomly select two images of digits and perform similar transformations as described above, and put one on the left and the other on the right side of a 78×78 gray background. We resize the whole image to 64×64.
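The MNIST-ONE synthesis described above can be reproduced with a few lines of image processing. The sketch below follows the stated recipe (random scale in [0.8, 1.2], rotation in [−π/4, π/4], a 48×48 background with a random gray value in [0, 200], then rescaling to 32×32); the specific PIL calls and the centered paste position are our own illustrative choices.

```python
import numpy as np
from PIL import Image

def make_mnist_one(digit_28x28, rng=np.random):
    """Synthesize one MNIST-ONE image from a 28x28 uint8 digit array (sketch)."""
    digit = Image.fromarray(digit_28x28)
    scale = rng.uniform(0.8, 1.2)
    angle = rng.uniform(-45, 45)                      # degrees, i.e. [-pi/4, pi/4]
    size = max(1, int(round(28 * scale)))
    digit = digit.resize((size, size), Image.BILINEAR).rotate(angle, expand=True)

    gray = int(rng.uniform(0, 200))                   # random background gray value
    canvas = Image.new("L", (48, 48), color=gray)
    x = (48 - digit.width) // 2
    y = (48 - digit.height) // 2
    canvas.paste(digit, (x, y), mask=digit)           # digit intensities used as the paste mask
    return canvas.resize((32, 32), Image.BILINEAR)
```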
We develop LR-GAN based on open source code¹. We assume the number of objects is known. Therefore, for MNIST-ONE, MNIST-TWO, CIFAR-10, and CUB-200, our model has two, three, two, and two timesteps, respectively. Since the size of a foreground object should be smaller than that of the canvas, we set the minimal allowed scale² in the affine transformation to 1.2 for all datasets except MNIST-TWO, for which it is set to 2 (objects are smaller in MNIST-TWO). In LR-GAN, the background generator and foreground generator have similar architectures. One difference is that the number of channels in the background generator is half of that in the foreground generator.

¹ https://github.com/soumith/dcgan.torch
² Scale corresponds to the size of the target canvas with respect to the object – the larger the scale, the larger the canvas, and the smaller the relative size of the object in the canvas. A scale of 1 means the same size as the canvas.

Figure 3: Generated images on CIFAR-10 based on our model.

Figure 4: Generated images on CUB-200 based on our model.
We compare our results to those of DCGAN (Radford et al., 2015). Note that LR-GAN without the LSTM at the first timestep corresponds exactly to DCGAN. This allows us to run controlled experiments. In both the generator and the discriminator, all the (fractional) convolutional layers have 4×4 filters with stride 2. As a result, the number of layers in the generator and discriminator automatically adapts to the size of the training images. Please see the Appendix (Section 6.2) for details about the configurations. We use three metrics for quantitative evaluation: Inception Score (Salimans et al., 2016) and the proposed Adversarial Accuracy and Adversarial Divergence. Note that we report two versions of Inception Score. One is based on the pre-trained Inception net, and the other is based on a classifier pre-trained on the target datasets.
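For reference, both versions of the Inception Score use the same formula from Salimans et al. (2016), exp(E_x KL(p(y|x) || p(y))), differing only in which classifier supplies p(y|x). A minimal sketch, assuming probs is an (N, K) array of class probabilities for N generated images:

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp( E_x KL( p(y|x) || p(y) ) ) from an (N, K) matrix of class probabilities (sketch)."""
    p_y = probs.mean(axis=0, keepdims=True)                     # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))
```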
5.1 QUALITATIVE RESULTS
In Fig. 3 and 4, we show the generated samples for CIFAR-10 and CUB-200, respectively. MNIST results are shown in the next subsection. As we can see from the images, the compositional nature of our model results in images that are free of blending artifacts between backgrounds and foregrounds. For CIFAR-10, we can see horses and cars with clear shapes. For CUB-200, the bird shapes tend to be even sharper.
5.2 MNIST-ONE AND MNIST-TWO
We now report the results on MNIST-ONE and MNIST-TWO. Fig. 5 shows the generation results of our model on MNIST-ONE. As we can see, our model generates the background and the foreground in separate timesteps, and can disentangle the foreground digits from the background nearly perfectly. Though the initial values of the mask are randomly distributed in the range (0, 1), after training, the masks are nearly binary and accurately carve out the digits from the generated foreground. More results on MNIST-ONE (including human studies) can be found in the Appendix (Section 6.3).
Fig. 6 shows the generation results for MNIST-TWO. Similarly, the model is also able to generate the background and the two foreground objects separately. The foreground generator tends to generate a single digit at each timestep. Meanwhile, it captures the context information from the previous timesteps. When the first digit is placed on the left side, the second one tends to be placed on the right side, and vice versa.
Figure 5: Generation results of our model on MNIST-ONE. From left to right, the image blocks are real images, generated background images, generated foreground images, generated masks and final composite images, respectively.
Figure 6: Generation results of our model on MNIST-TWO. From top left to bottom right (row major), the image blocks are real images, generated background images, foreground images and masks at the second timestep, composite images at the second timestep, generated foreground images and masks at the third timestep, and the final composite images, respectively.
5.3 CUB-200
We study the effectiveness of our model trained on the CUB-200 bird dataset. In Fig. 1, we have shown a random set of generated images, along with the intermediate generation results of the model. While being completely unsupervised, the model, for a large fraction of the samples, is able to successfully disentangle the foreground and the background. This is evident from the generated bird-like masks.

Figure 7: Matched pairs of generated images based on DCGAN and LR-GAN, respectively. The odd columns are generated by DCGAN, and the even columns are generated by LR-GAN. These are paired according to the perfect matching computed with the Hungarian algorithm.

Figure 8: Qualitative comparison on CIFAR-10. The top three rows are images generated by DCGAN; the bottom three rows are by LR-GAN. From left to right, the blocks display generated images with increasing quality level as determined by human studies.
We conduct a comparative study based on Amazon Mechanical Turk (AMT) between DCGAN and LR-GAN to quantify the relative visual quality of the generated images. We first generated 1000 samples from both models. Then, we performed perfect matching between the two image sets using the Hungarian algorithm on L2 distance in pixel space. This resulted in 1000 image pairs. Some exemplar pairs are shown in Fig. 7. For each image pair, 9 judges were asked to choose the one that is more realistic. Based on majority voting, we find that our generated images are selected 68.4% of the time, compared with 31.6% for DCGAN. This demonstrates that our model generates more realistic images than DCGAN. We attribute this difference to our model's ability to generate the foreground separately from the background, enabling stronger edge cues.
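The pairing procedure used for the AMT study – a perfect matching between the two sample sets under L2 distance in pixel space – can be computed with the Hungarian algorithm as sketched below. This is an illustrative SciPy-based implementation of the described procedure, not the authors' code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_samples(imgs_dcgan, imgs_lrgan):
    """Perfect matching between two equally sized image sets (N, H, W, C) by pixel-space L2 distance."""
    a = imgs_dcgan.reshape(len(imgs_dcgan), -1).astype(np.float64)
    b = imgs_lrgan.reshape(len(imgs_lrgan), -1).astype(np.float64)
    # Pairwise squared L2 distances via ||a||^2 + ||b||^2 - 2 a.b
    cost = (a * a).sum(1)[:, None] + (b * b).sum(1)[None, :] - 2.0 * a @ b.T
    rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))    # index pairs (dcgan_i, lrgan_j)
```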
5.4 CIFAR-10
We now qualitatively and quantitatively evaluate our model on CIFAR-10, which contains multiple object categories and various backgrounds.
Comparison of image generation quality: We conduct AMT studies to compare the fidelity of image generation. Towards this goal, we generate 1000 images from DCGAN and LR-GAN, respectively. We ask 5 judges to label each image as belonging to one of the 10 categories, as ‘non-recognizable’, or as ‘recognizable but not belonging to the listed categories’. We then assign each image a quality level between [0, 5] that captures the number of judges that agree with the majority choice. Fig. 8 shows the images generated by both approaches, ordered by increasing quality level. We merge images at quality levels 0 (all judges said non-recognizable) and 1 together, and similarly images at levels 4 and 5. Visually, the samples generated by our model have clearer boundaries and object structures. We also computed the fraction of non-recognizable images: our model had a 10% absolute drop in non-recognizability rate (67.3% for ours vs. 77.7% for DCGAN). For reference, 11.4% of real CIFAR images were categorized as non-recognizable. Fig. 9 shows more generated (intermediate) results of our model.
Quantitative evaluation of generators: We evaluate the generators based on three metrics: 1) Inception Score; 2) Adversarial Accuracy; 3) Adversarial Divergence. To obtain a classifier model for evaluation, we remove the top layer of the discriminator used in our model, and then append two fully connected layers on top of it. We train this classifier using the training samples of CIFAR-10 based on the annotations. Following Salimans et al. (2016), we generated 50,000 images based on DCGAN and LR-GAN, respectively. We compute two types of Inception Scores. The standard Inception Score is based on the Inception net as in Salimans et al. (2016), and the contextual Inception Score is based on our trained classifier model. To distinguish them, we denote the standard one as ‘Inception Score†’ and the contextual one as ‘Inception Score††’. To obtain the Adversarial Accuracy and Adversarial Divergence scores, we train one generator on each of the 10 categories for DCGAN and LR-GAN, respectively. Then, we use these generators to generate samples of different categories. Given these generated samples, we train the classifiers for DCGAN and LR-GAN separately. Along with the classifier trained on the real samples, we compute the Adversarial Accuracy

Table 1: Quantitative comparison between DCGAN and LR-GAN on CIFAR-10.

Training Data              Real Images     DCGAN           Ours
Inception Score†           11.18 ± 0.18    6.64 ± 0.14     7.17 ± 0.07
Inception Score††          7.23 ± 0.09     5.69 ± 0.07     6.11 ± 0.06
Adversarial Accuracy       83.33 ± 0.08    37.81 ± 0.02    44.22 ± 0.08
Adversarial Divergence     0               7.58 ± 0.04     5.57 ± 0.06

† Evaluated using the pre-trained Inception net, as in Salimans et al. (2016).
†† Evaluated using the supervisedly trained classifier based on the discriminator in LR-GAN.

Figure 9: Generation results of our model on CIFAR-10. From left to right, the blocks are: generated background images, foreground images, foreground masks, foreground images carved out by masks, carved foregrounds after spatial transformation, final composite images and nearest neighbor training images to the generated images.

Figure 10: Category-specific generation results of our model on the CIFAR-10 categories of horse, frog, and cat (top to bottom). The blocks from left to right are: generated background images, foreground images, foreground masks, foreground images carved out by masks, carved foregrounds after spatial transformation and final composite images.