Under review as a conference paper at ICLR 2017
THE INCREDIBLE SHRINKING NEURAL NETWORK:
NEW PERSPECTIVES ON LEARNING REPRESENTATIONS THROUGH THE LENS OF PRUNING
Nikolas Wolfe, Aditya Sharma & Bhiksha Raj
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{nwolfe, bhiksha}@cs.cmu.edu, [email protected]
Lukas Drude
Universität Paderborn
[email protected]

arXiv:1701.04465v1 [cs.NE] 16 Jan 2017

ABSTRACT

How much can pruning algorithms teach us about the fundamentals of learning representations in neural networks? A lot, it turns out. Neural network model compression has become a topic of great interest in recent years, and many different techniques have been proposed to address this problem. In general, this is motivated by the idea that smaller models typically lead to better generalization. At the same time, the decision of what to prune and when to prune necessarily forces us to confront our assumptions about how neural networks actually learn to represent patterns in data. In this work we set out to test several long-held hypotheses about neural network learning representations and numerical approaches to pruning. To accomplish this we first reviewed the historical literature and derived a novel algorithm to prune whole neurons (as opposed to the traditional method of pruning weights) from optimally trained networks using a second-order Taylor method. We then set about testing the performance of our algorithm and analyzing the quality of the decisions it made. As a baseline for comparison we used a first-order Taylor method based on the Skeletonization algorithm and an exhaustive brute-force serial pruning algorithm. Our proposed algorithm worked well compared to a first-order method, but not nearly as well as the brute-force method. Our error analysis led us to question the validity of many widely-held assumptions behind pruning algorithms in general and the trade-offs we often make in the interest of reducing computational complexity. We discovered that there is a straightforward way, however expensive, to serially prune 40-70% of the neurons in a trained network with minimal effect on the learning representation and without any re-training.
1 INTRODUCTION
In this work we propose and evaluate a novel algorithm for pruning whole neurons from a trained neural network without any re-training and examine its performance compared to two simpler methods. We then analyze the kinds of errors made by our algorithm and use this as a stepping off point to launch an investigation into the fundamental nature of learning representations in neural networks. Our results corroborate an insightful though largely forgotten observation by Mozer & Smolensky (1989a) concerning the nature of neural network learning. This observation is best summarized in a quotation from Segee & Carter (1991) on the notion of fault-tolerance in multilayer perceptron networks:
Contrary to the belief widely held, multilayer networks are not inherently fault tolerant. In fact, the loss of a single weight is frequently sufficient to completely disrupt a learned function approximation. Furthermore, having a large number of weights does not seem to improve fault tolerance. [Emphasis added]
Essentially, Mozer & Smolensky (1989b) observed that during training neural networks do not distribute the learning representation evenly or equitably across hidden units. What actually happens is that a few elite neurons learn an approximation of the input-output function, and the remaining units must learn a complex interdependence function which cancels out their respective influence on the network output. Furthermore, assuming enough units exist to learn the function in question, increasing the number of parameters does not increase the richness or robustness of the learned approximation, but rather simply increases the likelihood of overfitting and the number of noisy parameters to be canceled during training. This is evinced by the fact that in many cases, multiple neurons can be removed from a network with no re-training and with negligible impact on the quality of the output approximation. In other words, there are few bipartisan units in a trained network. A unit is typically either part of the (possibly overfit) input-output function approximation, or it is part of an elaborate noise cancellation task force. Assuming this is the case, most of the compute-time spent training a neural network is likely occupied by this arguably wasteful procedure of silencing superfluous parameters, and pruning can be viewed as a necessary procedure to "trim the fat."
We observed copious evidence of this phenomenon in our experiments, and this is the motivation behind our decision to evaluate the pruning algorithms in this study on the simple criterion of their ability to trim neurons without any re-training. If we were to employ re-training as part of our evaluation criteria, we would arguably not be evaluating the quality of our algorithm's pruning decisions per se but rather the ability of back-propagation trained networks to recover from faults caused by non-ideal pruning decisions, as suggested by the conclusions of Segee & Carter (1991) and Mozer & Smolensky (1989a). Moreover, as Fahlman & Lebiere (1989) discuss, due to the "herd effect" and "moving target" phenomena in back-propagation learning, the remaining units in a network will simply shift course to account for whatever error signal is re-introduced as a result of a bad pruning decision or network fault. So long as there are enough critical parameters to learn the function in question, a network can typically recover from faults with additional training. This limits the conclusions we can draw about the quality of our pruning criteria when we employ re-training.
In terms of removing units without re-training, what we discovered is that predicting the behavior of a network when a unit is to be pruned is very difficult, and most of the approximation techniques put forth in existing pruning algorithms do not fare well at all when compared to a brute-force search. To begin our discussion of how we arrived at our algorithm and set up our experiments, we review the existing literature.
2 LITERATURE REVIEW
Pruning algorithms, as comprehensively surveyed by Reed (1993), are a useful set of heuristics designed to identify and remove elements from a neural network which are either redundant or do not significantly contribute to the output of the network. This is motivated by the observed tendency of neural networks to overfit to the idiosyncrasies of their training data given too many trainable parameters or too few input patterns from which to generalize, as stated by Chauvin (1990).
Network architecture design and hyperparameter selection are inherently difficult tasks typically approached using a few well-known rules of thumb, e.g. various weight initialization procedures, choosing the width and number of layers, different activation functions, learning rates, momentum, etc. Some of this "black art" appears unavoidable. For problems which cannot be solved using linear threshold units alone, Baum & Haussler (1989) demonstrate that there is no way to precisely determine the appropriate size of a neural network a priori given any random set of training instances. Using too few neurons seems to inhibit learning, and so in practice it is common to attempt to over-parameterize networks initially using a large number of hidden units and weights, and then prune or compress them afterwards if necessary. Of course, as the old saying goes, there's more than one way to skin a neural network.
2.1 NON-PRUNING BASED GENERALIZATION & COMPRESSION TECHNIQUES

The generalization behavior of neural networks has been well studied, and apart from pruning algorithms many heuristics have been used to avoid overfitting, such as dropout (Srivastava et al. (2014)), maxout (Goodfellow et al. (2013)), and cascade correlation (Fahlman & Lebiere (1989)), among others. Of course, while cascade correlation specifically tries to construct minimal networks, many techniques to improve network generalization do not explicitly attempt to reduce the total number of parameters or the memory footprint of a trained network per se.
Model compression often has benefits with respect to generalization performance and the portability of neural networks to operate in memory-constrained or embedded environments. Without explicitly removing parameters from the network, weight quantization allows for a reduction in the number of bytes used to represent each weight parameter, as investigated by Balzer et al. (1991), Dundar & Rose (1994), and Hoehfeld & Fahlman (1992).

A recently proposed method for compressing recurrent neural networks (Prabhavalkar et al. (2016)) uses the singular values of a trained weight matrix as basis vectors from which to derive a compressed hidden layer. Øland & Raj (2015) successfully implemented network compression through weight quantization with an encoding step, while others such as Han et al. (2016) have tried to expand on this by adding weight-pruning as a preceding step to quantization and encoding.
In summary, we can say that there are many different ways to improve network generalization by altering the training procedure, the objective error function, or by using compressed representations of the network parameters. But these are not, strictly speaking, examples of techniques to reduce the number of parameters in a network. For this we must employ some form of pruning criteria.
2.2 PRUNING TECHNIQUES

If we wanted to continually shrink a neural network down to minimum size, the most straightforward brute-force way to do it is to individually switch each element off and measure the increase in total error on the training set. We then pick the element which has the least impact on the total error, and remove it. Rinse and repeat. This is extremely computationally expensive, given a reasonably large neural network and training set. Alternatively, we might accomplish this using any number of much faster off-the-shelf pruning algorithms, such as Skeletonization (Mozer & Smolensky (1989a)), Optimal Brain Damage (LeCun et al. (1989)), or later variants such as Optimal Brain Surgeon (Hassibi & Stork (1993)). In fact, we borrow much of our inspiration from these algorithms, with one major variation: instead of pruning individual weights, we prune entire neurons, thereby eliminating all of their incoming and outgoing weight parameters in one go, resulting in more memory saved, faster.
The algorithm developed for this paper is targeted at reducing the total number of neurons in a trained network, which is one way of reducing its computational memory footprint. This is often a desirable criterion to minimize in the case of resource-constrained or embedded devices, and it also allows us to probe the limitations of pruning down to the very last essential network elements. In terms of generalization as well, we can measure the error of the network on the test set as each element is sequentially removed from the network. With an oracle pruning algorithm, what we expect to observe is that the output of the network remains stable as the first few superfluous neurons are removed, and as we start to bite into the more crucial members of the function approximation, the error should start to rise dramatically. In this paper, the brute-force approach described at the beginning of this section serves as a proxy for an oracle pruning algorithm.
One reason to choose to rank and prune individual neurons as opposed to weights is that there are far fewer elements to consider. Furthermore, the removal of a single weight from a large network is a drop in the bucket in terms of reducing a network's core memory footprint. If we want to reduce the size of a network as efficiently as possible, we argue that pruning neurons instead of weights is more efficient computationally as well as practically in terms of quickly reaching a hypothetical target reduction in memory consumption. This approach also offers downstream applications a realistic expectation of the minimal increase in error resulting from the removal of a specified percentage of neurons. Such trade-offs are unavoidable, but performance impacts can be limited if a principled approach is used to find the best candidate neurons for removal.
It is well known that too many free parameters in a neural network can lead to overfitting. Regardless of the number of weights used in a given network, as Segee & Carter (1991) assert, the representation of a learned function approximation is almost never evenly distributed over the hidden units, and thus the removal of any single hidden unit at random can actually result in a network fault. Mozer & Smolensky (1989b) argue that only a subset of the hidden units in a neural network actually latch onto the invariant or generalizing properties of the training inputs, and the rest learn to either mutually cancel each other's influence or begin overfitting to the noise in the data. We leverage this idea in the current work to rank all neurons in pre-trained networks based on their effective contributions to the overall performance. We then remove the unnecessary neurons to reduce the network's footprint. Through our experiments we not only concretely validate the theory put forth by Mozer & Smolensky (1989b) but we also successfully build on it to prune networks to 40 to 60% of their original size without any major loss in performance.
3 PRUNING NEURONS TO SHRINK NEURAL NETWORKS
As discussed in Section 1, our aim is to leverage the highly non-uniform distribution of the learning representation in pre-trained neural networks to eliminate redundant neurons, without focusing on individual weight parameters. Taking this approach enables us to remove all the weights (incoming and outgoing) associated with a non-contributing neuron at once. We would like to note here that in an ideal scenario, based on the neuron interdependency theory put forward by Mozer & Smolensky (1989a), one would evaluate all possible combinations of neurons to remove (one at a time, two at a time, three at a time and so forth) to find the optimal subset of neurons to keep. This is computationally unacceptable, and so we will only focus on removing one neuron at a time and explore more "greedy" algorithms to do this in a more efficient manner.
The general approach taken to prune an optimally trained neural network here is to create a ranked list of all the neurons in the network based on one of the three proposed ranking criteria: a brute-force approximation, a linear approximation, and a quadratic approximation of the neuron's impact on the output of the network. We then test the effects of removing neurons on the accuracy and error of the network. All the algorithms and methods presented here are easily parallelizable as well.
One last thing to note here before moving forward is that the methods discussed in this section involve some non-trivial derivations which are beyond the scope of this paper. We are more focused on analyzing the implications of these methods on our understanding of neural network learning representations. However, a complete step-by-step derivation and proof of all the results presented is provided in the Supplementary Material as an Appendix.
3.1 BRUTE FORCE REMOVAL APPROACH

This is perhaps the most naive yet the most accurate method for pruning the network. It is also the slowest and hence possibly unusable on large-scale neural networks with thousands of neurons. This method explicitly evaluates each neuron in the network. The idea is to manually check the effect of every single neuron on the output. This is done by running a forward propagation on the validation set K times (where K is the total number of neurons in the network), turning off exactly one neuron each time (keeping all other neurons active) and noting down the change in error. Turning a neuron off can be achieved by simply setting its output to 0. This results in all the outgoing weights from that neuron being turned off. This change in error is then used to generate the ranked list.
3.2 TAYLOR SERIES REPRESENTATION OF ERROR

Let us denote the total error from the optimally trained neural network for any given validation dataset by E. E can be seen as a function of O, where O is the output of any general neuron in the network. This error can be approximated at a particular neuron's output (say O_k) by using the 2nd order Taylor Series as

\hat{E}(O) \approx E(O_k) + (O - O_k) \cdot \left.\frac{\partial E}{\partial O}\right|_{O_k} + 0.5 \cdot (O - O_k)^2 \cdot \left.\frac{\partial^2 E}{\partial O^2}\right|_{O_k},    (1)

When a neuron is pruned, its output O becomes 0. Replacing O by O_k in equation 1 shows us that the error is approximated perfectly by equation 1 at O_k. So:
\Delta E_k = \hat{E}(0) - \hat{E}(O_k) = -O_k \cdot \left.\frac{\partial E}{\partial O}\right|_{O_k} + 0.5 \cdot O_k^2 \cdot \left.\frac{\partial^2 E}{\partial O^2}\right|_{O_k},    (2)

where \Delta E_k is the change in the total error of the network when exactly one neuron (k) is turned off. Most of the terms in this equation are fairly easy to compute, as we have O_k already from the activations of the hidden units, and we already compute \partial E/\partial O |_{O_k} for each training instance during backpropagation. The \partial^2 E/\partial O^2 |_{O_k} terms are a little more difficult to compute. This is derived in the appendix and summarized in the sections below.
3.2.1 LINEAR APPROXIMATION APPROACH

We can use equation 2 to get the linear error approximation of the change in error due to the kth neuron being turned off and represent it as \Delta E_k^1 as follows:

\Delta E_k^1 = -O_k \cdot \left.\frac{\partial E}{\partial O}\right|_{O_k}    (3)

The derivative term above is the first-order gradient, which represents the change in error with respect to the output of a given neuron. This term can be collected during back-propagation. As we shall see further in this section, linear approximations are not reliable indicators of change in error, but they provide us with an interesting basis for comparison with the other methods discussed in this paper.
3.2.2 QUADRATIC APPROXIMATION APPROACH

As above, we can use equation 2 to get the quadratic error approximation of the change in error due to the kth neuron being turned off and represent it as \Delta E_k^2 as follows:

\Delta E_k^2 = -O_k \cdot \left.\frac{\partial E}{\partial O}\right|_{O_k} + 0.5 \cdot O_k^2 \cdot \left.\frac{\partial^2 E}{\partial O^2}\right|_{O_k}    (4)

The additional second-order gradient term appearing above represents the quadratic change in error with respect to the output of a given neuron. This term can be generated by performing back-propagation using second order derivatives. Collecting these quadratic gradients involves some non-trivial mathematics, the entire step-by-step derivation procedure of which is provided in the Supplementary Material as an Appendix.
3.3 PROPOSED PRUNING ALGORITHM

Figure 1 shows a random error function plotted against the output of any given neuron. Note that this figure is for illustration purposes only. The error function is minimized at a particular value of the neuron output, as can be seen in the figure. The process of training a neural network is essentially the process of finding these minimizing output values for all the neurons in the network. Pruning this particular neuron (which translates to getting a zero output from it) will result in a change in the total overall error. This change in error is represented by the distance between the original minimum error (shown by the dashed line) and the top red arrow. This neuron is clearly a bad candidate for removal since removing it will result in a huge error increase.

The straight red line in the figure represents the first-order approximation of the error using the Taylor Series as described before, while the parabola represents a second-order approximation. It can be clearly seen that the second-order approximation is a much better estimate of the change in error.
One thing to note here is that in some cases thresholding is required when trying to approximate the error using the 2nd order Taylor Series expansion. These cases might arise when the parabolic approximation undergoes a steep slope change. To take such cases into account, mean and median thresholding were employed, where any change above a certain threshold was assigned a mean or median value respectively.
Figure 1: The intuition behind 1st & 2nd order neuron pruning decisions
Two pruning algorithms are proposed here. They differ in the way the neurons are ranked, but both of them use \Delta E_k, the approximation of the change in error, as the basis for the ranking. \Delta E_k can be calculated using the Brute Force method, or one of the two Taylor Series approximations discussed previously.

The first step in both algorithms is to decide a stopping criterion. This can vary depending on the application, but some intuitive stopping criteria are: maximum number of neurons to remove, percentage scaling needed, maximum allowable accuracy drop, etc.
3.3.1 ALGORITHM I: SINGLE OVERALL RANKING

The complete algorithm is shown in Algorithm 1. The idea here is to generate a single ranked list based on the values of \Delta E_k. This involves a single pass of second-order back-propagation (without weight updates) to collect the gradients for each neuron. The neurons from this rank-list (with the lowest values of \Delta E_k) are then pruned according to the stopping criterion decided. We note here that this algorithm is intentionally naive and is used for comparison only.
Data: optimally trained network, training set
Result: a pruned network
initialize and define stopping criterion;
perform forward propagation over the training set;
perform second-order back-propagation without updating weights and collect linear and quadratic gradients;
rank the neurons based on \Delta E_k;
while stopping criterion is not met do
    remove the last-ranked neuron;
end
Algorithm 1: Single Overall Ranking
3.3.2 ALGORITHM II: ITERATIVE RE-RANKING

In this greedy variation of the algorithm (Algorithm 2), after each neuron removal, the remaining network undergoes a single forward and backward pass of second-order back-propagation (without weight updates) and the rank list is formed again. Hence, each removal involves a new pass through the network. This method is computationally more expensive but takes into account the dependencies the neurons might have on one another, which would lead to a change in error contribution every time a dependent neuron is removed.
Data: optimally trained network, training set
Result: a pruned network
initialize and define stopping criterion;
while stopping criterion is not met do
    perform forward propagation over the training set;
    perform second-order back-propagation without updating weights and collect linear and quadratic gradients;
    rank the remaining neurons based on \Delta E_k;
    remove the worst neuron based on the ranking;
end
Algorithm 2: Iterative Re-Ranking
4 EXPERIMENTAL RESULTS
4.1 EXAMPLE REGRESSION PROBLEM
This problem serves as a quick example to demonstrate many of the phenomena described in previous sections. We trained two networks to learn the cosine function, with one input and one output. This is a task which requires no more than 11 sigmoid neurons to solve entirely, and in this case we don't care about overfitting because the cosine function has a precise definition. Furthermore, the cosine function is a good toy example because it is a smooth continuous function and, as demonstrated by Nielsen (2015), if we were to tinker directly with the weights and bias parameters of the network, we could allocate individual units within the network to be responsible for constrained ranges of inputs, similar to a basis spline function with many control points. This would distribute the learned function approximation evenly across all hidden units, and thus we have presented the network with a problem in which it could productively use as many hidden units as we give it. In this case, a pruning algorithm would observe a fairly consistent increase in error after the removal of each successive unit. In practice, however, regardless of the number of experimental trials, this is not what happens. The network will always use 10-11 hidden units and leave the rest to cancel each other's influence.
Figure 2: Degradation in squared error after pruning a two-layer network trained to compute the cosine function (Left Network: 2 layers, 10 neurons each, 1 output, logistic sigmoid activation, starting test accuracy: 0.9999993; Right Network: 2 layers, 50 neurons each, 1 output, logistic sigmoid activation, starting test accuracy: 0.9999996)
Figure 2 shows two graphs. Both graphs demonstrate the use of the iterative re-ranking algorithm and the comparative performance of the brute-force pruning method (in blue), the first order method (in green), and the second order method (in red). The graph on the left shows the performance of these algorithms starting from a network with two layers of 10 neurons (20 total), and the graph on the right shows a network with two layers of 50 neurons (100 total).
In the left graph, we see that the brute-force method shows a graceful degradation, and the error only begins to rise sharply after 50% of the total neurons have been removed. The error is basically constant up to that point. In the first and second order methods, we see evidence of poor decision making in the sense that both made mistakes early on, which disrupted the output function approximation. The first order method made a large error early on, though we see after a few more neurons were removed this error was corrected somewhat (though it only got worse from there). This is direct evidence of the lack of fault tolerance in a trained neural network. This phenomenon is even more starkly demonstrated in the second order method. After making a few poor neuron removal decisions in a row, the error signal rose sharply, and then went back to zero after the 6th neuron was removed. This is due to the fact that the neurons it chose to remove were trained to cancel each others' influence within a localized part of the network. After the entire group was eliminated, the approximation returned to normal. This can only happen if the output function approximation is not evenly distributed over the hidden units in a trained network.
This phenomenon is even more starkly demonstrated in the graph on the right. Here we see the first order method got "lucky" in the beginning and made decent decisions up to about the 40th removed neuron. The second order method had a small error in the beginning which it recovered from gracefully, and proceeded to pass the 50 neuron point before finally beginning to unravel. The brute force method, in sharp contrast, shows little to no increase in error at all until 90% of the neurons in the network have been obliterated. Clearly first and second order methods have some value in that they do not make completely arbitrary choices, but the brute force method is far better at this task.

This also demonstrates the sharp dualism in neuron roles within a trained network. These networks were trained to near-perfect precision and each pruning method was applied without any re-training of any kind. Clearly, in the case of the brute force or oracle method, up to 90% of the network can be completely extirpated before the output approximation even begins to show any signs of degradation. This would be impossible if the learning representation were evenly or equitably distributed. Note, for example, that the degradation point in both cases is approximately the same. This example is not a real-world application of course, but it brings into very clear focus the kind of phenomena we will discuss in the following sections.
4.2 RESULTS ON MNIST DATASET

For all the results presented in this section, the MNIST database of Handwritten Digits by LeCun & Cortes (2010) was used. It is worth noting that due to the time taken by the brute force algorithm we used a 5000-image subset of the MNIST database, in which we normalized the pixel values between 0 and 1.0 and compressed the image sizes to 20x20 rather than 28x28, so the starting test accuracy reported here appears higher than those reported by LeCun et al. We do not believe that this affects the interpretation of the presented results because the basic learning problem does not change with a larger dataset or input dimension.
4.3 PRUNINGA1-LAYERNETWORK
Thenetworkarchitectureinthiscaseconsistedof1layer,100neurons,10outputs,logisticsigmoid
activations,andastartingtestaccuracyof0.998.
4.3.1 SINGLE OVERALL RANKING ALGORITHM

We first present the results for a single-layer neural network in Figure 3, using the Single Overall Ranking algorithm (Algorithm 1) as proposed in Section 3. (We again note that this algorithm is intentionally naive and is used for comparison only. Its performance should be expected to be poor.) After training, each neuron is assigned its permanent ranking based on the three criteria discussed previously: a brute force "ground truth" ranking, and two approximations of this ranking using first and second order Taylor estimations of the change in network output error resulting from the removal of each neuron.
Figure 3: Degradation in squared error (left) and classification accuracy (right) after pruning a single-layer network using the Single Overall Ranking algorithm (Network: 1 layer, 100 neurons, 10 outputs, logistic sigmoid activation, starting test accuracy: 0.998)

An interesting observation here is that with only a single layer, no criterion for ranking the neurons in the network (brute force or the two Taylor Series variants) using Algorithm 1 emerges superior, indicating that the 1st and 2nd order Taylor Series methods are actually reasonable approximations of the brute force method under certain conditions. Of course, this method is still quite bad in terms of the rate of degradation of the classification accuracy, and in practice we would likely follow Algorithm 2, which takes into account Mozer & Smolensky (1989a)'s observations stated in the Related Work section. The purpose of the present investigation, however, is to demonstrate how much of a trained network can theoretically be removed without altering the network's learned parameters in any way.
4.3.2 ITERATIVE RE-RANKING ALGORITHM

Figure 4: Degradation in squared error (left) and classification accuracy (right) after pruning a single-layer network using the iterative re-ranking algorithm (Network: 1 layer, 100 neurons, 10 outputs, logistic sigmoid activation, starting test accuracy: 0.998)
In Figure 4 we present our results using Algorithm 2 (the iterative re-ranking algorithm), in which all remaining neurons are re-ranked after each successive neuron is switched off. We compute the same brute force rankings and Taylor series approximations of error deltas over the remaining active neurons in the network after each pruning decision. This is intended to account for the effects of cancelling interactions between neurons.

There are two key observations here. Using the brute force ranking criterion, almost 60% of the neurons in the network can be pruned away without any major loss in performance. The other noteworthy observation is that the 2nd order Taylor Series approximation of the error performs consistently better than its 1st order version in most situations, though Figure 21 is a poignant counter-example.
4.3.3 VISUALIZATION OF ERROR SURFACE & PRUNING DECISIONS

As explained in Section 3, these graphs are a visualization of the error surface of the network output with respect to the neurons chosen for removal using each of the three ranking criteria, represented in intervals of 10 neurons. In each graph, the error surface of the network output is displayed in log space (left) and in real space (right) with respect to each candidate neuron chosen for removal. We create these plots during the pruning exercise by picking a neuron to switch off, and then multiplying its output by a scalar gain value α which is adjusted from 0.0 to 10.0 with a step size of 0.001. When the value of α is 1.0, this represents the unperturbed neuron output learned during training. Between 0.0 and 1.0, we are graphing the literal effect of turning the neuron off (α = 0), and when α > 1.0 we are simulating a boosting of the neuron's influence in the network, i.e. inflating the value of its outgoing weight parameters.
We graph the effect of boosting the neuron's output to demonstrate that for certain neurons in the network, even doubling, tripling, or quadrupling the scalar output of the neuron has no effect on the overall error of the network, indicating the remarkable degree to which the network has learned to ignore the value of certain parameters. In other cases, we can get a sense of the sensitivity of the network's output to the value of a given neuron when the curve rises steeply after the red 1.0 line. This indicates that the learned values of the parameters emanating from a given neuron are relatively important, and this is why we should ideally see sharper upticks in the curves for the later-removed neurons in the network, that is, when the neurons crucial to the learning representation start to be picked off. Some very interesting observations can be made in each of these graphs.
Remember that lower is better in terms of the height of the curve, and minimal (or negative) horizontal change between the vertical red line at 1.0 (neuron on, α = 1.0) and 0.0 (neuron off, α = 0.0) is indicative of a good candidate neuron to prune, i.e. there will be minimal effect on the network output when the neuron is removed.
4.3.4 VISUALIZATION OF BRUTE FORCE PRUNING DECISIONS

Figure 5: Error surface of the network output in log space (left) and real space (right) with respect to each candidate neuron chosen for removal using the brute force criterion (Network: 1 layer, 100 neurons, 10 outputs, logistic sigmoid activation, starting test accuracy: 0.998)
In Figure 5, we notice how low to the floor and flat most of the curves are. It is not until the 90th removed neuron that we see a higher curve with a more convex shape (clearly a more sensitive, influential piece of the network).
4.3.5 VISUALIZATION OF 1ST ORDER APPROXIMATION PRUNING DECISIONS

It can be seen in Figure 6 that most choices seem to have flat or negatively sloped curves, indicating that the first order approximation seems to be pretty good, but examining the brute force choices shows they could be better.
4.3.6 VISUALIZATION OF 2ND ORDER APPROXIMATION PRUNING DECISIONS

The method in Figure 7 looks similar to the brute force method choices, though clearly not as good (they're more spread out). Notice the difference in convexity between the 2nd and 1st order method