Improved Texture Networks: Maximizing Quality and Diversity in
Feed-forward Stylization and Texture Synthesis

Dmitry Ulyanov                                      Andrea Vedaldi
Skolkovo Institute of Science and Technology & Yandex    University of Oxford
[email protected]                                  [email protected]

Victor Lempitsky
Skolkovo Institute of Science and Technology
[email protected]

Abstract
The recent work of Gatys et al., who characterized the style of an image by the statistics of convolutional neural network filters, ignited a renewed interest in the texture generation and image stylization problems. While their image generation technique uses a slow optimization process, recently several authors have proposed to learn generator neural networks that can produce similar outputs in one quick forward pass. While generator networks are promising, they are still inferior in visual quality and diversity compared to generation-by-optimization. In this work, we advance them in two significant ways. First, we introduce an instance normalization module to replace batch normalization with significant improvements to the quality of image stylization. Second, we improve diversity by introducing a new learning formulation that encourages generators to sample unbiasedly from the Julesz texture ensemble, which is the equivalence class of all images characterized by certain filter responses. Together, these two improvements take feed-forward texture synthesis and image stylization much closer to the quality of generation-via-optimization, while retaining the speed advantage.

Figure 1: (a) Panda. (b) (c) (d) Style. (e) (f). Which panda stylization seems the best to you? Definitely not the variant (b), which has been produced by a state-of-the-art algorithm among methods that take no longer than a second. The (e) picture took several minutes to generate using an optimization process, but the quality is worth it, isn't it? We would be particularly happy if you chose one from the rightmost two examples, which are computed with our new method that aspires to combine the quality of the optimization-based method and the speed of the fast one. Moreover, our method is able to produce diverse stylizations using a single network.

1. Introduction
The recent work of Gatys et al. [4, 5], which used deep neural networks for texture synthesis and image stylization to a great effect, has created a surge of interest in this area. Following an earlier work by Portilla and Simoncelli [15], they generate an image by matching the second order moments of the response of certain filters applied to a reference texture image. The innovation of Gatys et al. is to use non-linear convolutional neural network filters for this purpose. Despite the excellent results, however, the matching process is based on local optimization, and generally requires a considerable amount of time (tens of seconds to minutes) in order to generate a single texture or stylized image.

In order to address this shortcoming, Ulyanov et al. [19] and Johnson et al. [8] suggested to replace the optimization process with feed-forward generative convolutional networks. In particular, [19] introduced texture networks

The source code is available at https://github.com/DmitryUlyanov/texture_nets
to generate textures of a certain kind, as in [4], or to apply a certain texture style to an arbitrary image, as in [5]. Once trained, such texture networks operate in a feed-forward manner, three orders of magnitude faster than the optimization methods of [4, 5].

The price to pay for such speed is a reduced performance. For texture synthesis, the neural network of [19] generates good-quality samples, but these are not as diverse as the ones obtained from the iterative optimization method of [4]. For image stylization, the feed-forward results of [19, 8] are qualitatively and quantitatively worse than iterative optimization. In this work, we address both limitations by means of two contributions, both of which extend beyond the applications considered in this paper.

Our first contribution (section 4) is an architectural change that significantly improves the generator networks. The change is the introduction of an instance-normalization layer which, particularly for the stylization problem, greatly improves the performance of the deep network generators. This advance significantly reduces the gap in stylization quality between the feed-forward models and the original iterative optimization method of Gatys et al., both quantitatively and qualitatively.

Our second contribution (section 3) addresses the limited diversity of the samples generated by texture networks. In order to do so, we introduce a new formulation that learns generators that uniformly sample the Julesz ensemble [20]. The latter is the equivalence class of images that match certain filter statistics. Uniformly sampling this set guarantees diverse results, but traditionally doing so required slow Monte Carlo methods [20]; Portilla and Simoncelli, and hence Gatys et al., cannot sample from this set, but only find individual points in it, and possibly just one point. Our formulation minimizes the Kullback-Leibler divergence between the generated distribution and a quasi-uniform distribution on the Julesz ensemble. The learning objective decomposes into a loss term similar to Gatys et al. minus the entropy of the generated texture samples, which we estimate in a differentiable manner using a non-parametric estimator [12].

We validate our contributions by means of extensive quantitative and qualitative experiments, including comparing the feed-forward results with the gold-standard optimization-based ones (section 5). We show that, combined, these ideas dramatically improve the quality of feed-forward texture synthesis and image stylization, bringing them to a level comparable to the optimization-based approaches.

2. Background and related work

Julesz ensemble. Informally, a texture is a family of visual patterns, such as checkerboards or slabs of concrete, that share certain local statistical regularities. The concept was first studied by Julesz [9], who suggested that the visual system discriminates between different textures based on the average responses of certain image filters.

The work of [20] formalized Julesz' ideas by introducing the concept of Julesz ensemble. There, an image is a real function x : Ω → R³ defined on a discrete lattice Ω = {1, ..., H} × {1, ..., W}, and a texture is a distribution p(x) over such images. The local statistics of an image are captured by a bank of (non-linear) filters F_l : X × Ω → R, l = 1, ..., L, where F_l(x, u) denotes the response of filter F_l at location u on image x. The image x is characterized by the spatial average of the filter responses μ_l(x) = Σ_{u∈Ω} F_l(x, u) / |Ω|. The image is perceived as a particular texture if these responses match certain characteristic values μ̄_l. Formally, given the loss function

L(x) = Σ_{l=1}^{L} (μ_l(x) − μ̄_l)²,       (1)

the Julesz ensemble is the set of all texture images

T_ε = { x ∈ X : L(x) ≤ ε }

that approximately satisfy such constraints. Since all textures in the Julesz ensemble are perceptually equivalent, it is natural to require the texture distribution p(x) to be uniform over this set. In practice, it is more convenient to consider the exponential distribution

p(x) = e^{−L(x)/T} / ∫ e^{−L(y)/T} dy,       (2)

where T > 0 is a temperature parameter. This choice is motivated as follows [20]: since statistics are computed from spatial averages of filter responses, one can show that, in the limit of infinitely large lattices, the distribution p(x) is zero outside the Julesz ensemble and uniform inside. In this manner, eq. (2) can be thought of as a uniform distribution over images that have certain characteristic filter responses μ̄ = (μ̄_1, ..., μ̄_L).

Note also that the texture is completely described by the filter bank F = (F_1, ..., F_L) and their characteristic responses μ̄. As discussed below, the filter bank is generally fixed, so in this framework different textures are given by different characteristics μ̄.

Generation-by-minimization. For any interesting choice of the filter bank F, sampling from eq. (2) is rather challenging and classically addressed by Monte Carlo methods [20]. In order to make this framework more practical, Portilla and Simoncelli [16] proposed instead to heuristically sample from the Julesz ensemble by the optimization process

x* = argmin_{x ∈ X} L(x).       (3)
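To make eqs. (1)-(3) concrete, here is a minimal NumPy sketch of the Julesz statistics and loss. The three toy filters are purely illustrative stand-ins for the filter bank F_l, not the deep filters the paper uses later.

```python
import numpy as np

def spatial_averages(x, filters):
    """mu_l(x) = sum_u F_l(x, u) / |Omega|: spatially averaged filter responses."""
    return np.array([f(x).mean() for f in filters])

def julesz_loss(x, mu_bar, filters):
    """L(x) = sum_l (mu_l(x) - mu_bar_l)^2, as in eq. (1)."""
    mu = spatial_averages(x, filters)
    return float(((mu - mu_bar) ** 2).sum())

# Toy filter bank: two linear per-channel filters and one non-linear one.
filters = [
    lambda x: x[:, :, 0],                                           # red response
    lambda x: x[:, :, 1],                                           # green response
    lambda x: np.abs(np.diff(x, axis=0, prepend=0.0)).sum(axis=2),  # vertical gradient magnitude
]

rng = np.random.default_rng(0)
reference = rng.random((32, 32, 3))            # reference texture image
mu_bar = spatial_averages(reference, filters)  # characteristic values mu_bar_l

# The reference itself lies in the Julesz ensemble T_eps for any eps >= 0 ...
assert julesz_loss(reference, mu_bar, filters) == 0.0
# ... while an image with different statistics incurs a positive loss.
assert julesz_loss(0.5 * reference, mu_bar, filters) > 0.0
```

Minimizing `julesz_loss` over `x` by gradient descent is exactly the heuristic sampling process of eq. (3).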
If this optimization problem can be solved, the minimizer x* is by definition a texture image. However, there is no reason why this process should generate fair samples from the distribution p(x). In fact, the only reason why eq. (3) may not simply always return the same image is that the optimization algorithm is randomly initialized, the loss function is highly non-convex, and the search is local. Only because of this may eq. (3) land on different samples x* on different runs.

Deep filter banks. Constructing a Julesz ensemble requires choosing a filter bank F. Originally, researchers considered the obvious candidates: Gaussian derivative filters, Gabor filters, wavelets, histograms, and similar [20, 16, 21]. More recently, the work of Gatys et al. [4, 5] demonstrated that much superior filters are automatically learned by deep convolutional neural networks (CNNs) even when trained for apparently unrelated problems, such as image classification. In this paper, in particular, we choose for L(x) the style loss proposed by [4]. The latter is the distance between the empirical correlation matrices of deep filter responses in a CNN.¹

Stylization. The texture generation method of Gatys et al. [4] can be considered as a direct extension of the texture generation-by-minimization technique (3) of Portilla and Simoncelli [16]. Later, Gatys et al. [5] demonstrated that the same technique can be used to generate an image that mixes the statistics of two other images, one used as a texture template and one used as a content template. Content is captured by introducing a second loss L_cont.(x, x_0) that compares the responses of deep CNN filters extracted from the generated image x and a content image x_0. Minimizing the combined loss L(x) + α L_cont.(x, x_0) yields impressive artistic images, where a texture μ̄, defining the artistic style, is fused with the content image x_0.

Feed-forward generator networks. For all its simplicity and efficiency compared to Markov sampling techniques, generation-by-optimization (3) is still relatively slow, and certainly too slow for real-time applications. Therefore, in the past few months several authors [8, 19] have proposed to learn generator neural networks g(z) that can directly map random noise samples z ∼ p_z = N(0, I) to a local minimizer of eq. (3). Learning the neural network g amounts to minimizing the objective

g* = argmin_g E_{p_z} L(g(z)).       (4)

¹Note that such matrices are obtained by averaging local non-linear filters: these are the outer products of filters in a certain layer of the neural network. Hence, the style loss of Gatys et al. is in the same form as eq. (1).

While this approach works well in practice, it shares the same important limitation as the original work of Portilla and Simoncelli: there is no guarantee that samples generated by g* would be fair samples of the texture distribution (2). In practice, as we show in the paper, such samples tend in fact to be not diverse enough.

Both [8, 19] have also shown that similar generator networks work for stylization as well. In this case, the generator g(x_0, z) is a function of the content image x_0 and of the random noise z. The network g is learned to minimize the sum of the texture loss and the content loss:

g* = argmin_g E_{p_{x_0}, p_z} [ L(g(x_0, z)) + α L_cont.(g(x_0, z), x_0) ].       (5)

Alternative neural generator methods. There are many other techniques for image generation using deep neural networks.

The Julesz distribution is closely related to the FRAME maximum entropy model of [21], as well as to the concept of Maximum Mean Discrepancy (MMD) introduced in [7]. Both FRAME and MMD make the observation that a probability distribution p(x) can be described by the expected values μ_α = E_{x∼p(x)}[φ_α(x)] of a sufficiently rich set of statistics φ_α(x). Building on these ideas, [14, 3] construct generator neural networks g with the goal of minimizing the discrepancy between the statistics averaged over a batch of generated images, Σ_{i=1}^{N} φ_α(g(z_i))/N, and the statistics averaged over a training set, Σ_{i=1}^{M} φ_α(x_i)/M. The resulting networks g are called Moment Matching Networks (MMN).

An important alternative methodology is based on the concept of Generative Adversarial Networks (GAN; [6]). This approach trains, together with the generator network g(z), a second adversarial network f(·) that attempts to distinguish between generated samples g(z), z ∼ N(0, I), and real samples x ∼ p_data(x). The adversarial model f can be used as a measure of quality of the generated samples and used to learn a better generator g. GANs are powerful but notoriously difficult to train. A lot of research has recently focused on improving GAN or extending it. For instance, LAPGAN [2] combines GAN with a Laplacian pyramid and DCGAN [17] optimizes GAN for large datasets.

3. Julesz generator networks

This section describes our first contribution, namely a method to learn networks that draw samples from the Julesz ensemble modelling a texture (section 2), which is an intractable problem usually addressed by slow Monte Carlo methods [21, 20]. Generation-by-optimization, popularized by Portilla and Simoncelli and Gatys et al., is faster, but can only find one point in the ensemble, not sample from it,
with scarce sample diversity, particularly when used to train feed-forward generator networks [8, 19].

Here, we propose a new formulation that allows us to train generator networks that sample the Julesz ensemble, generating images with high visual fidelity as well as high diversity.

A generator network [6] maps an i.i.d. noise vector z ∼ N(0, I) to an image x = g(z) in such a way that x is ideally a sample from the desired distribution p(x). Such generators have been adopted for texture synthesis in [19], but without guarantees that the learned generator g(z) would indeed sample a particular distribution.

Here, we would like to sample from the Gibbs distribution (2) defined over the Julesz ensemble. This distribution can be written compactly as p(x) = Z⁻¹ e^{−L(x)/T}, where Z = ∫ e^{−L(x)/T} dx is an intractable normalization constant.

Denote by q(x) the distribution induced by a generator network g. The goal is to make the target distribution p and the generator distribution q as close as possible by minimizing their Kullback-Leibler (KL) divergence:

KL(q‖p) = ∫ q(x) ln [ q(x) Z / e^{−L(x)/T} ] dx
        = (1/T) E_{x∼q(x)} L(x) + E_{x∼q(x)} ln q(x) + ln Z       (6)
        = (1/T) E_{x∼q(x)} L(x) − H(q) + const.

Hence, the KL divergence is the sum of the expected value of the style loss L and the negative entropy of the generated distribution q.

The first term can be estimated by taking the expectation over generated samples:

E_{x∼q(x)} L(x) = E_{z∼N(0,I)} L(g(z)).       (7)

This is similar to the reparametrization trick of [11] and is also used in [8, 19] to construct their learning objectives.

The second term, the negative entropy, is harder to estimate accurately, but simple estimators exist. One which is particularly appealing in our scenario is the Kozachenko-Leonenko estimator [12]. This estimator considers a batch of N samples x_1, ..., x_N ∼ q(x). Then, for each sample x_i, it computes the distance ρ_i to its nearest neighbour in the batch:

ρ_i = min_{j ≠ i} ‖x_i − x_j‖.       (8)

The distances ρ_i can be used to approximate the entropy as follows:

H(q) ≈ (D/N) Σ_{i=1}^{N} ln ρ_i + const.       (9)

where D = 3WH is the number of components of the images x ∈ R^{3×W×H}.

An energy term similar to (6) was recently proposed in [10] for improving the diversity of a generator network in an adversarial learning scheme. While the idea is superficially similar, the application (sampling the Julesz ensemble) and instantiation (the way the entropy term is implemented) are very different.

Learning objective. We are now ready to define an objective function E(g) to learn the generator network g. This is given by substituting the expected loss (7) and the entropy estimator (9), computed over a batch of N generated images, in the KL divergence (6):

E(g) = (1/N) Σ_{i=1}^{N} [ (1/T) L(g(z_i)) − λ ln min_{j≠i} ‖g(z_i) − g(z_j)‖ ]       (10)

The batch itself is obtained by drawing N samples z_1, ..., z_N ∼ N(0, I) from the noise distribution of the generator. The first term in eq. (10) measures how close the generated images g(z_i) are to the Julesz ensemble. The second term quantifies the lack of diversity in the batch by mutually comparing the generated images.

Learning. The loss function (10) is in a form that allows optimization by means of Stochastic Gradient Descent (SGD). The algorithm samples a batch z_1, ..., z_N at a time and then descends the gradient:

(1/N) Σ_{i=1}^{N} [ (dL/dx^⊤)(dg(z_i)/dθ^⊤) − (λ/ρ_i) (g(z_i) − g(z_{j_i*}))^⊤ ( dg(z_i)/dθ^⊤ − dg(z_{j_i*})/dθ^⊤ ) ]       (11)

where θ is the vector of parameters of the neural network g, the tensor image x has been implicitly vectorized, and j_i* is the index of the nearest neighbour of image i in the batch.

4. Stylization with instance normalization

The work of [19] showed that it is possible to learn high-quality texture networks g(z) that generate images in a Julesz ensemble. They also showed that it is possible to learn good quality stylization networks g(x_0, z) that apply the style of a fixed texture to an arbitrary content image x_0.

Nevertheless, the stylization problem was found to be harder than the texture generation one. For the stylization task, they found that learning the model from too many example content images x_0, say more than 16, yielded poorer qualitative results than using a smaller number of such examples. Some of the most significant errors appeared along the border of the generated images, probably due to padding and other boundary effects in the generator network. We conjectured that these are symptoms of a learning problem too difficult for their choice of neural network architecture.
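Before turning to the architecture, the diversity machinery of section 3 can be made concrete. The following NumPy sketch implements the nearest-neighbour distances of eq. (8), the (constant-free) entropy estimate of eq. (9), and the batch objective of eq. (10) for a generic texture loss; it is a toy illustration with assumed inputs, not the paper's training code.

```python
import numpy as np

def nn_distances(batch):
    """rho_i = min_{j != i} ||x_i - x_j||, eq. (8), over a batch of flattened images."""
    X = batch.reshape(len(batch), -1)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    np.fill_diagonal(D, np.inf)                                 # exclude j = i
    return D.min(axis=1)

def entropy_estimate(batch):
    """Kozachenko-Leonenko estimate, eq. (9), dropping the additive constant:
    H(q) ~ (D/N) sum_i ln rho_i, with D the number of components per image."""
    rho = nn_distances(batch)
    return batch[0].size / len(batch) * np.log(rho).sum()

def batch_objective(batch, texture_loss, T=10.0, lam=1.0):
    """E(g) over one batch, eq. (10): mean over i of (1/T) L(x_i) - lam ln rho_i."""
    rho = nn_distances(batch)
    losses = np.array([texture_loss(x) for x in batch])
    return float(np.mean(losses / T - lam * np.log(rho)))

rng = np.random.default_rng(0)
samples = rng.normal(size=(8, 3, 16, 16))  # a batch of N = 8 toy "images"

# Shrinking the batch towards its mean reduces every rho_i, so the entropy
# estimate drops: a collapsed generator is penalized by the -lam ln rho_i term.
assert entropy_estimate(0.01 * samples) < entropy_estimate(samples)
```

In the actual objective the batch entries are the generated images g(z_i), and backpropagating through both terms yields the gradient of eq. (11).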
A simple observation that may make learning simpler is that the result of stylization should not, in general, depend on the contrast of the content image but rather should match the contrast of the texture that is being applied to it. Thus, the generator network should discard contrast information in the content image x_0. We argue that learning to discard contrast information by using standard CNN building blocks is unnecessarily difficult, and is best done by adding a suitable layer to the architecture.

To see why, let x ∈ R^{N×C×W×H} be an input tensor containing a batch of N images. Let x_{nijk} denote its nijk-th element, where k and j span spatial dimensions, i is the feature channel (i.e. the color channel if the tensor is an RGB image), and n is the index of the image in the batch. Then, contrast normalization is given by:

y_{nijk} = (x_{nijk} − μ_{ni}) / sqrt(σ_{ni}² + ε),
μ_{ni} = (1/(HW)) Σ_{l=1}^{W} Σ_{m=1}^{H} x_{nilm},       (12)
σ_{ni}² = (1/(HW)) Σ_{l=1}^{W} Σ_{m=1}^{H} (x_{nilm} − μ_{ni})².

It is unclear how such a function could be implemented as a sequence of standard operators such as ReLU and convolution.

On the other hand, the generator network of [19] does contain normalization layers, and precisely batch normalization (BN) ones. The key difference between eq. (12) and batch normalization is that the latter applies the normalization to a whole batch of images instead of single ones:

y_{nijk} = (x_{nijk} − μ_i) / sqrt(σ_i² + ε),
μ_i = (1/(HWN)) Σ_{n=1}^{N} Σ_{l=1}^{W} Σ_{m=1}^{H} x_{nilm},       (13)
σ_i² = (1/(HWN)) Σ_{n=1}^{N} Σ_{l=1}^{W} Σ_{m=1}^{H} (x_{nilm} − μ_i)².

We argue that, for the purpose of stylization, the normalization operator of eq. (12) is preferable as it can normalize each individual content image x_0.

While some authors call the layer of eq. (12) contrast normalization, here we refer to it as instance normalization (IN) since we use it as a drop-in replacement for batch normalization operating on individual instances instead of the batch as a whole. Note in particular that this means that instance normalization is applied throughout the architecture, not just at the input image; fig. 2 shows the benefit of doing so.

Figure 2: Comparison of normalization techniques in image stylization. From left to right: BN, cross-channel LRN at the first layer, IN at the first layer, IN throughout.

Figure 3: (a) Learning objective as a function of SGD iterations for StyleNet IN and BN (feed-forward history). (b) Direct optimization of the Gatys et al. loss for this example image, starting from the result of StyleNet IN and BN (finetuning history). (c) Content. (d, g) Result of StyleNet with instance (d) and batch (g) normalization. (f) Style. (e, h) Result of finetuning with the Gatys et al. energy.

Another similarity with BN is that each IN layer is followed by a scaling and bias operator ⊙ x + b. A difference is that the IN layer is applied at test time as well, unchanged, whereas BN is usually switched to use accumulated mean and variance instead of computing them over the batch.

IN appears to be similar to the layer normalization method introduced in [1] for recurrent networks, although it is not clear how they handle spatial data. Like theirs, IN is a generic layer, so we tested it in classification problems as well. In such cases, it still works surprisingly well, but not as well as batch normalization (e.g. AlexNet [13] IN has 2-3% worse top-1 accuracy on ILSVRC [18] than AlexNet BN).

Figure 4: Stylization results obtained by applying different textures (rightmost column) to different content images (leftmost column). Three methods are compared: StyleNet IN, StyleNet BN, and iterative optimization.

5. Experiments

In this section, after discussing the technical details of the method, we evaluate our new texture network architectures using instance normalization, and then investigate the ability of the new formulation to learn diverse generators.

5.1. Technical details

Network architecture. Among the two generator network architectures proposed previously in [19, 8], we choose the residual architecture from [8] for all our style transfer experiments. We also experimented with the architecture from [19] and observed a similar improvement with our method, but use the one from [8] for convenience. We call it StyleNet, with a postfix BN if it is equipped with batch normalization or IN for instance normalization.

For texture synthesis we compare two architectures: the multiscale fully-convolutional architecture from [19] (TextureNetV1) and the one we design to have a very large receptive field (TextureNetV2). TextureNetV2 takes a noise vector of size 256 and first transforms it with two fully-connected layers. The output is then reshaped to a 4 × 4 image and repeatedly upsampled with fractionally-strided convolutions similar to [17]. More details can be found in the supplementary material.

Figure 5: The textures generated by the high-capacity TextureNetV2 without diversity term (λ = 0 in eq. (10)) are nearly identical. The low-capacity TextureNetV1 of [19] achieves diversity, but has sometimes poor results. TextureNetV2 with diversity is the best of both worlds.

Weight parameters. In practice, for the case of λ > 0, the entropy loss and texture loss in eq. (10) should be weighted properly. As only the value of Tλ is important for optimization, we assume λ = 1 and choose T from the set of three values (5, 10, 20) for texture synthesis (we pick the highest value among those not leading to artifacts; see our discussion below). We fix T = 10000 for style transfer experiments. For texture synthesis, similarly to [19], we found it useful to normalize the gradient of the texture loss as it passes back through the VGG-19 network. This allows rapid convergence for stochastic optimization but implicitly alters the objective function and requires the temperature to be adjusted. We observe that for textures with flat lighting a high entropy weight results in brightness variations over the image (fig. 7). We hypothesize this issue can be solved if either a more clever distance for entropy estimation is used or an image prior is introduced.

5.2. Effect of instance normalization

In order to evaluate the impact of replacing batch normalization with instance normalization, we consider first the problem of stylization, where the goal is to learn a generator x = g(x_0, z) that applies a certain texture style to the content image x_0 using noise z as "random seed". We set λ = 0, for which the generator is most likely to discard the noise.

The StyleNet IN and StyleNet BN are compared in fig. 3. Panel fig. 3.a shows the training objective (5) of the networks as a function of the SGD training iteration. The objective function is the same, but StyleNet IN converges much faster, suggesting that it can solve the stylization problem more easily. This is confirmed by the stark difference in the qualitative results in panels (d) and (g). Since the StyleNets are trained to minimize in one shot the same objective as the iterative optimization of Gatys et al., they can be used to initialize the latter algorithm. Panel (b) shows the result of applying the Gatys et al. optimization starting from their random initialization and the output of the two StyleNets. Clearly both networks start much closer to an optimum than random noise, and IN closer than BN.
The difference is qualitatively large: panels (e) and (h) show the change in the StyleNets output after finetuning by iterative optimization of the loss, which has a small effect for the IN variant, and a much larger one for the BN one.

Similar results apply in general. Other examples are shown in fig. 4, where the IN variant is far superior to BN and much closer to the results obtained by the much slower iterative method of Gatys et al. StyleNets are trained on images of a fixed size, but since they are convolutional, they can be applied to arbitrary sizes. In the figure, the top three images are processed at 512 × 512 resolution and the bottom two at 1024 × 1024. In general, we found that higher resolution images yield visually better stylization results.

While instance normalization works much better than batch normalization for stylization, for texture synthesis the two normalization methods perform equally well. This is consistent with our intuition that IN helps in normalizing the information coming from the content image x_0, which is highly variable, whereas it is not important to normalize the texture information, as each model learns only one texture style.

Figure 6: The StyleNetV2 g(x_0, z), trained with diversity λ > 0, generates substantially different stylizations for different values of the input noise z. In this case, the lack of stylization diversity is visible in uniform regions such as the sky.

Figure 7: Negative examples. If the diversity term λ is too high for the learned style, the generator tends to generate artifacts in which brightness is changed locally (spotting) instead of (or as well as) changing the structure.

5.3. Effect of the diversity term

Having validated the IN-based architecture, we evaluate now the effect of the entropy-based diversity term in the objective function (10).

The experiment in fig. 5 starts by considering the problem of texture generation. We compare the new high-capacity TextureNetV2 and the low-capacity TextureNetV1 texture synthesis networks. The low-capacity model is the same as [19]. This network was used there in order to force the network to learn a non-trivial dependency on the input noise, thus generating diverse outputs even though the learning objective of [19], which is the same as eq. (10) with diversity coefficient λ = 0, tends to suppress diversity. The results in fig. 5 are indeed diverse, but sometimes of low quality. This should be contrasted with TextureNetV2, the high-capacity model: its visual fidelity is much higher, but, by using the same objective function as [19], the network learns to generate a single image, as expected. TextureNetV2 with the new diversity-inducing objective (λ > 0) is the best of both worlds, being both high-quality and diverse.

The experiment in fig. 6 assesses the effect of the diversity term in the stylization problem. The results are similar to the ones for texture synthesis, and the diversity term effectively encourages the network to learn to produce different results based on the input noise.

One difficulty with texture and stylization networks is that the entropy loss weight λ must be tuned for each learned texture model. Choosing λ too small may fail to learn a diverse generator, and setting it too high may create artifacts, as shown in fig. 7.

6. Summary

This paper advances feed-forward texture synthesis and stylization networks in two significant ways. It introduces instance normalization, an architectural change that makes training stylization networks easier and allows the training process to achieve much lower loss levels. It also introduces a new learning formulation for training generator networks to sample uniformly from the Julesz ensemble, thus explicitly encouraging diversity in the generated outputs. We show that both improvements lead to noticeable improvements of the generated stylized images and textures, while keeping the generation run times intact.

References

[1] L. J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
[2] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, pages 1486-1494, 2015.
[3] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, pages 258-267. AUAI Press, 2015.
[4] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems, NIPS, pages 262-270, 2015.
[5] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. CoRR, abs/1508.06576, 2015.
[6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2672-2680, 2014.
[7] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, NIPS, pages 513-520, 2006.
[8] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II, pages 694-711, 2016.
[9] B. Julesz. Textons, the elements of texture perception, and their interactions. Nature, 290(5802):91-97, 1981.
[10] T. Kim and Y. Bengio. Deep directed generative models with energy-based probability estimation. arXiv preprint arXiv:1606.03439, 2016.
[11] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. CoRR, abs/1312.6114, 2013.
[12] L. F. Kozachenko and N. N. Leonenko. Sample estimate of the entropy of a random vector. Probl. Inf. Transm., 23(1-2):95-101, 1987.
[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1106-1114, 2012.
[14] Y. Li, K. Swersky, and R. S. Zemel. Generative moment matching networks. In Proc. International Conference on Machine Learning, ICML, pages 1718-1727, 2015.
[15] J. Portilla and E. P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. IJCV, 40(1):49-70, 2000.
[16] J. Portilla and E. P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. IJCV, 2000.
[17] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
[18] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211-252, 2015.
[19] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 1349-1357, 2016.
[20] S. C. Zhu, X. W. Liu, and Y. N. Wu. Exploring texture ensembles by efficient Markov chain Monte Carlo - toward a "trichromacy" theory of texture. PAMI, 2000.
[21] S. C. Zhu, Y. Wu, and D. Mumford. Filters, random fields and maximum entropy (FRAME): Towards a unified theory for texture modeling. IJCV, 27(2), 1998.