Data Mining and Knowledge Discovery, 1, 317–328 (1997)
© 1997 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.
Methodological Note

On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach

STEVEN L. SALZBERG    [email protected]
Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA

Editor: Usama Fayyad

Received December 3, 1996; Revised July 3, 1997; Accepted July 15, 1997
Abstract. An important component of many data mining projects is finding a good classification algorithm, a process that requires very careful thought about experimental design. If not done very carefully, comparative studies of classification and other types of algorithms can easily result in statistically invalid conclusions. This is especially true when one is using data mining techniques to analyze very large databases, which inevitably contain some statistically unlikely data. This paper describes several phenomena that can, if ignored, invalidate an experimental comparison. These phenomena and the conclusions that follow apply not only to classification, but to computational experiments in almost any aspect of data mining. The paper also discusses why comparative analysis is more important in evaluating some types of algorithms than for others, and provides some suggestions about how to avoid the pitfalls suffered by many experimental studies.

Keywords: classification, comparative studies, statistical methods
1. Introduction
Data mining researchers often use classifiers to identify important classes of objects within a data repository. Classification is particularly useful when a database contains examples that can be used as the basis for future decision making; e.g., for assessing credit risks, for medical diagnosis, or for scientific data analysis. Researchers have a range of different types of classification algorithms at their disposal, including nearest neighbor methods, decision tree induction, error back propagation, reinforcement learning, and rule learning. Over the years, many variations of these algorithms have been developed and many studies have been produced comparing their effectiveness on different data sets, both real and artificial. The productiveness of classification research in the past means that researchers today confront a problem in using those algorithms, namely: how does one choose which algorithm to use for a new problem? This paper addresses the methodology that one can use to answer this question, and discusses how it has been addressed in the classification community. It also discusses some of the pitfalls that confront anyone trying to answer this question, and demonstrates how misleading results can easily follow from a lack of attention to methodology. Below, I will use examples from the machine learning community which illustrate how careful one must be when using fast computational methods to mine a large database. These examples show that when one repeatedly searches a large database with powerful algorithms, it is all too easy to "find" a phenomenon or pattern that looks impressive, even when there is nothing to discover.
It is natural for experimental researchers to want to use real data to validate their systems. A culture has evolved in the machine learning community that now insists on a convincing evaluation of new ideas, which very often takes the form of experimental testing. This is a healthy development and it represents an important step in the maturation of the field. One indication of this maturation is the creation and maintenance of the UC Irvine repository of machine learning databases (Murphy, 1995), which now contains over 100 data sets that have appeared in published work. This repository makes it very easy for machine learning researchers to compare new algorithms to previous work. The data mining field, although a newer area of research, is already evolving a methodology of its own to compare the effectiveness of different algorithms on large databases. Large public databases are becoming increasingly popular in many areas of science and technology, bringing with them great opportunities, but also technical dangers. As we will see below, one must be very careful in the design of an experimental study using publicly available databases.
Although the development and maintenance of data repositories has in general been positive, some research on classification algorithms has relied too heavily on the UCI repository and other shared data sets, and has consequently produced comparative studies whose results are at best confusing. To be more precise: it has become commonplace to take two or more classifiers and compare them on a random selection of data sets from the UCI repository. Any differences in classification accuracy that reach statistical significance (more on that below) are provided as supporting evidence of important differences between the algorithms. As argued below, many such comparisons are statistically invalid. The message to the data mining community is that one must be exceedingly careful when using powerful algorithms to extract information from large databases, because traditional statistical methods were not designed for this process. Below I give some examples of how to modify traditional statistics before using them in computational evaluations.
2. Comparing algorithms
Empirical validation is clearly essential to the process of designing and implementing new algorithms, and the criticisms below are not intended to discourage empirical work. Classification research, which is a component of data mining as well as a subfield of machine learning, has always had a need for very specific, focused studies that compare algorithms carefully. The evidence to date is that good evaluations are not done nearly enough; for example, Prechelt (1996) recently surveyed nearly 200 experimental papers on neural network learning algorithms and found most of them to have serious experimental deficiencies. His survey found that a strikingly high percentage of new algorithms (29%) were not evaluated on any real problem at all, and that very few (only 8%) were compared to more than one alternative on real data. In a survey by Flexer (1996) of experimental neural network papers, only 3 out of 43 studies in leading journals used a separate data set for parameter tuning, which leaves open the possibility that many of the reported results were overly optimistic.
Classification research comes in a variety of forms: it lays out new algorithms and demonstrates their feasibility, or it describes creative new algorithms which may not (at first) require rigorous experimental validation. It is important that work designed to be primarily comparative does not undertake to criticize work that was intended to introduce
creative new ideas or to demonstrate feasibility on an important domain. This only serves to suppress creative work (if the new algorithm does not perform well) and encourages people instead to focus on narrow studies that make incremental changes to previous work. On the other hand, if a new method outperforms an established one on some important tasks, then this result would be worth reporting because it might yield important insights about both algorithms. Perhaps most important, comparative work should be done in a statistically acceptable framework. Work intended to demonstrate feasibility, in contrast to purely comparative work, might not always need statistical comparison measures to be convincing.
2.1. Data repositories
In conducting comparative studies, classification researchers and other data miners must be careful not to rely too heavily on stored repositories of data (such as the UCI repository) as their source of problems, because it is difficult to produce major new results using well-studied and widely shared data. For example, Fisher's iris data has been around for 60 years and has been used in hundreds (maybe thousands) of studies. The NetTalk dataset of English pronunciation data (introduced by Sejnowski and Rosenberg (1987)) has been used in numerous experiments, as has the protein secondary structure data (introduced by Qian and Sejnowski (1988)), to cite just two examples. Holte (1993) collected results on 16 different datasets, and found as many as 75 different published accuracy figures for some of them. Any new experiments on these and other UCI datasets run the risk of finding "significant" results that are no more than statistical accidents, as explained in Section 3.2. Note that the repository still serves many useful functions; among other things, it allows someone with a new algorithmic idea to test its plausibility on known problems. However, it is a mistake to conclude, if some differences do show up, that a new method is "significantly" better on these data sets. It is very hard, sometimes impossible, to make such an argument in a convincing and statistically correct way.
2.2. Comparative studies and proper methodology
The comparative study, whether it involves classification algorithms or other data extraction techniques, does not usually propose an entirely new method; most often it proposes changes to one or more known algorithms, and uses comparisons to show where and how the changes will improve performance. Although these studies may appear superficially to be quite easy to do, in fact it requires considerable skill to succeed both at improving known algorithms and at designing the experiments. Here I focus on the design of experiments, which has been the subject of little concern in the machine learning community until recently (with some exceptions, such as Kibler and Langley (1988) and Cohen and Jensen (1997)). Included in the comparative study category are papers that neither introduce a new algorithm nor improve an old one; instead, they consider one or more known algorithms and conduct experiments on known data sets. They may also include variations on known algorithms. The goal of these papers is ostensibly to highlight the strengths and weaknesses of the algorithms being compared. Although the goal is worthwhile, the approach taken by such papers is sometimes not valid, for reasons to be explained below.
3. Statistical validity: a tutorial
Statistics offers many tests that are designed to measure the significance of any difference between two or more "treatments." These tests can be adapted for use in comparisons of classifiers, but the adaptation must be done carefully, because the statistics were not designed with computational experiments in mind. For example, in one recent machine learning study, fourteen different variations on several classification algorithms were compared on eleven datasets. This is not unusual; many other recent studies report comparisons of similar numbers of algorithms and datasets.¹ All 154 of the variations in this study were compared to a default classifier, and differences were reported as significant if a two-tailed, paired t-test produced a p-value less than 0.05.² This particular significance level was not nearly stringent enough, however: if you do 154 experiments, then you have 154 chances to be significant, so the expected number of "significant" results at the 0.05 level is 154 × 0.05 = 7.7. Obviously this is not what one wants. In order to get results that are truly significant at the 0.05 level, you need to set a much more stringent requirement. Statisticians have been aware of this problem for a very long time; it is known as the multiplicity effect. At least two recent papers have focused their attention on how classification researchers might address this effect (Gascuel and Caraux, 1992; Feelders and Verkooijen, 1995).
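To make the multiplicity effect concrete, here is a small simulation (purely illustrative, not from the study cited above; it assumes NumPy and SciPy are available, and the number of paired observations per test is hypothetical). It runs 154 paired t-tests on accuracy figures that differ only by noise and counts how many come out "significant" at the 0.05 level.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests = 154     # number of comparisons, as in the study described above
n_points = 10     # paired accuracy estimates per comparison (hypothetical)

false_positives = 0
for _ in range(n_tests):
    # Two "algorithms" whose accuracies differ only by random noise.
    a = rng.normal(0.80, 0.05, n_points)
    b = a + rng.normal(0.00, 0.05, n_points)
    _, p = stats.ttest_rel(a, b)          # two-tailed paired t-test
    false_positives += int(p < 0.05)

print(false_positives)   # on average about 154 * 0.05 = 7.7 spurious "differences"

Even though no true difference exists anywhere in this simulation, several of the 154 tests will typically report one.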
In particular, let α be the probability that, if no differences exist among our algorithms, we will make at least one mistake; i.e., we will find at least one significant difference. Thus α is the fraction of the time in which we (the experimenters) make an error. For each of our tests (i.e., each experiment), let the nominal significance level be α*. Then the chance of making the right conclusion for one experiment is 1 − α*.

If we conduct n independent experiments, the chance of getting them all right is then (1 − α*)^n. (Note that this is true only if all the experiments are independent; when they are not, tighter bounds can be computed. If a set of different algorithms are compared on the same test data, then the tests are clearly not independent. In fact, a comparison that draws the training and test sets from the same sample will not be independent either.) Suppose that in fact no real differences exist among the algorithms being tested; then the chance that we will not get all the conclusions right, in other words, the chance that we will make at least one mistake, is

α = 1 − (1 − α*)^n
Suppose for example we set our nominal significance level α* for each experiment to 0.05. Then the probability of making at least one mistake in our 154 experiments is α = 1 − (1 − 0.05)^154 = 0.9996. Clearly, a 99.96% chance of drawing an incorrect conclusion is not what we want! Again, we are assuming that no true differences exist; i.e., this is a conditional probability. (More precisely, there is a 99.96% chance that at least one of the results will incorrectly reach "significance" at the 0.05 level.)

In order to obtain results significant at the 0.05 level with 154 tests, we need to set 1 − (1 − α*)^154 ≤ 0.05, which gives α* ≤ 0.0003. This criterion is over 150 times more stringent than the original α* ≤ 0.05 criterion.
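The arithmetic above is easy to check directly; the following lines (plain Python, nothing specific to the study) evaluate the family-wise error rate for α* = 0.05 and the per-test level needed to hold it at 0.05, along with the simpler Bonferroni bound.

n = 154
alpha_star = 0.05
print(1 - (1 - alpha_star) ** n)   # ~0.9996: nearly certain to make at least one mistake
print(1 - (1 - 0.05) ** (1 / n))   # ~0.000333: per-test level that keeps alpha at 0.05
print(0.05 / n)                    # ~0.000325: the simpler Bonferroni bound 0.05/154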
The above argument is still very rough, because it assumes that all the experiments are independent of one another. When this assumption is correct, one can make the adjustment of significance described above, which is well known in the statistics community as the
Bonferroni adjustment. When experiments use identical training and/or test sets, the tests are clearly not independent. The use of the wrong p-value makes it even more likely that some experiments will find significance where none exists. Nonetheless, many researchers proceed to use a simple t-test to compare multiple algorithms on multiple datasets from the UCI repository; see, e.g., Wettschereck and Dietterich (1995). Although easy to conduct, the t-test is simply the wrong test for such an experimental design. The t-test assumes that the test sets for each "treatment" (each algorithm) are independent. When two algorithms are compared on the same data set, then obviously the test sets are not independent, since they will share some of the same examples, assuming the training and test sets are created by random partitioning, which is the standard practice. This problem is widespread in comparative machine learning studies. (One of the authors of the study cited above has written recently that the paired t-test has "a high probability of Type I error ... and should never be used" (Dietterich, 1996).) It is worth noting here that even statisticians have difficulty agreeing on the correct framework for hypothesis testing in complex experimental designs. For example, the whole framework of using alpha levels and p-values has been questioned when more than two hypotheses are under consideration (Raftery, 1995).
3.1. Alternative statistical tests
One obvious problem with the experimental design cited above is that it only considers overall accuracy on a test set. But when using a common test set to compare two algorithms, a comparison must consider four numbers: the number of examples that Algorithm A got right and B got wrong (A > B), the number that B got right and A got wrong (B > A), the number that both algorithms got right, and the number that both got wrong. If one has just two algorithms to compare, then a simple but much improved (over the t-test) way to compare them is to compare the percentage of times A > B versus B > A, and throw out the ties. One can then use a simple binomial test for the comparison, with the Bonferroni adjustment for multiple tests. An alternative is to use random, distinct samples of the data to test each algorithm, and to use an analysis of variance (ANOVA) to compare the results.
A simple example shows how a binomial test can be used to compare two algorithms. As above, measure each algorithm's answer on a series of test examples. Let n be the number of examples for which the algorithms produce different output. We must assume this series of tests is independent; i.e., we are observing a set of Bernoulli trials. (This assumption is valid if the test data is a randomly drawn sample from the population.) Let s (successes) be the number of times A > B, and f (failures) be the number of times B > A. If the two algorithms perform equally well, then the expected value E(s) = 0.5n = E(f). Suppose that s > f, so it looks like A is better than B. We would like to calculate

P(s ≥ observed value | p(success) = 0.5)

which is the probability that A "wins" over B at least as many times as observed in the experiment. Typically, the reported p-value is double this value because a two-sided test is used. This can be easily computed using the binomial distribution, which gives the probability of s successes in n trials as

n! / (s! (n − s)!) · p^s q^(n−s)
If we expect no differences between the algorithms, then p = q = 0.5. Suppose that we had n = 50 examples for which the algorithms differed, and in s = 35 cases algorithm A was correct while B was wrong. Then we can compute the probability of this result as

Σ (s = 35 to 50) n! / (s! (n − s)!) · (0.5)^n = 0.0032

Thus it is highly unlikely that the algorithms have the same accuracy; we can reject the null hypothesis with high confidence. (Note: the computation here uses the binomial distribution, which is exact. Another, nearly identical form of this test is known as McNemar's test (Everitt, 1977), which uses the χ² distribution. The statistic used for the McNemar test is (|s − f| − 1)² / (s + f), which is simpler to compute.) If we make this into a two-sided test, we must double the probability to 0.0064, but we can still reject the null hypothesis. If we had observed s = 30, then the probability would rise to 0.1012 (for the one-sided test), or just over 10%. In this case we might say that we cannot reject the null hypothesis; in other words, the algorithms may in fact be equally accurate for this population.
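The numbers in this example can be reproduced with a few lines of Python (assuming SciPy is available); the exact tail probabilities come out close to the values quoted above, and the McNemar statistic for s = 35 is about 7.22.

from scipy.stats import binom

n = 50
print(binom.sf(34, n, 0.5))   # P(s >= 35) ~ 0.0033, close to the 0.0032 quoted above
print(binom.sf(29, n, 0.5))   # P(s >= 30) ~ 0.101, the one-sided value for s = 30

s, f = 35, 15
print((abs(s - f) - 1) ** 2 / (s + f))   # McNemar statistic ~ 7.22, compared to chi-squared (1 d.f.)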
The above is just an example, and is not meant to cover all comparative studies. The method applies as well to classifiers as to other data mining methods that attempt to extract patterns from a database. However, the binomial test is a relatively weak test that does not handle quantitative differences between algorithms, nor does it handle more than two algorithms, nor does it consider the frequency of agreement between two algorithms. If N is the number of agreements and N ≫ n, then it can be argued that our belief that the algorithms are doing the same thing should increase regardless of the pattern of disagreement. As pointed out by Feelders and Verkooijen (1995), finding the proper statistical procedure to compare two or more classification algorithms can be quite difficult, and requires more than an introductory level knowledge of statistics. A good general reference for experimental design is Cochran and Cox (1957), and descriptions of ANOVA and experimental design can be found in introductory texts such as Hildebrand (1986).
Jensen (1991, 1995) discusses a framework for experimental comparison of classifiers and addresses significance testing, and Cohen and Jensen (1997) discuss some specific ways to remove optimistic statistical bias from such experiments. One important innovation they discuss is randomization testing, in which one derives a reference distribution as follows. For each trial, the data set is copied and class labels are replaced with random class labels. Then an algorithm is used to find the most accurate classifier it can, using the same methodology that is used with the original data. Any estimate of accuracy greater than random for the copied data reflects the bias in the methodology, and this reference distribution can then be used to adjust the estimates on the real data.
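The sketch below shows one way such a reference distribution might be built. It is an illustration rather than Cohen and Jensen's own code; it assumes scikit-learn is available, and the decision tree learner, the 10-fold evaluation, and the use of label permutation (which preserves class frequencies) are stand-ins for whatever learner, evaluation methodology, and random relabeling scheme a particular study would actually use.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def reference_distribution(X, y, n_trials=100, seed=0):
    """Accuracies obtained when there is nothing to discover."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        y_random = rng.permutation(y)     # random labels: no real signal remains
        model = DecisionTreeClassifier()
        # Apply the *same* methodology used on the real data (here, 10-fold CV).
        scores.append(cross_val_score(model, X, y_random, cv=10).mean())
    return np.array(scores)

# The accuracy measured on the real labels, e.g.
#   cross_val_score(DecisionTreeClassifier(), X, y, cv=10).mean()
# can then be compared against this reference distribution.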
3.2. Community experiments: cautionary notes
In fact, the problem in the machine learning community is worse than stated above, because many people are sharing a small repository of datasets and repeatedly using those same datasets for experiments. Thus there is a substantial danger that published results, even when using strict significance criteria and the appropriate significance tests, will be mere accidents of chance. The problem is as follows. Suppose that 100 different people are
studying the effects of algorithms A and B, trying to determine which one is better. Suppose that in fact both have the same mean accuracy (on some very large population of datasets), although the algorithms vary randomly in their performance on specific datasets. Now, if 100 people are studying the effect of algorithms A and B, we would expect that five of them will get results that are statistically significant at the p ≤ 0.05 level, and one will get significance at the 0.01 level! (Actually, the picture is a bit more complicated, since this assumes the 100 experiments are independent.) Clearly in this case these results are due to chance, but if the 100 people are working separately, the ones who get significant results will publish, while the others will simply move on to other experiments. Denton (1985) made similar observations about how the reviewing and publication process can skew results. Although the data mining community is much broader than the classification community, it is likely that benchmark databases will emerge, and that different researchers will test their mining techniques on them. The experience of the machine learning community should serve as a cautionary tale.
In other communities (e.g., testing new drugs), experimenters try to deal with this phenomenon of "community experiments" by duplicating results. Proper duplication requires drawing a new random sample from the population and repeating the study. However, this is not what happens with benchmark databases, which normally are static. If someone wants to duplicate results, they can only re-run the algorithms with the same parameters on the same data, and of course the results will be the same. Using a different partitioning of the data into training and test sets does not help; duplication can only work if new data becomes available.
3.3. Repeated tuning
Another very substantial problem with reporting significance results based on computational experiments is something that is often left unstated: many researchers "tune" their algorithms repeatedly in order to make them perform optimally on at least some datasets. At the very least, when a system is being developed, the developer spends a great deal of time determining what parameters it should have and what the optimal values should be. For example, the back propagation algorithm has a learning rate and a momentum term that greatly affect learning, and the architecture of a neural net (which has many degrees of freedom) has an enormous effect on learning. Equally important for many problems is the representation of the data, which may vary from one study to the next even when the same basic data set is used. For example, numeric values are sometimes converted to a discrete set of intervals, especially when using decision tree algorithms (Fayyad and Irani, 1993).

Whenever tuning takes place, every adjustment should really be considered a separate experiment. For example, if one attempts 10 different combinations of parameters, then significance levels (p-values) would have to be, e.g., 0.005 in order to obtain levels comparable to a single experiment using a level of 0.05. (This assumes, unrealistically, that the experiments are independent.) But few if any experimenters keep careful count of how many adjustments they consider. (Kibler and Langley (1988) and Aha (1992) suggest, as an alternative, that the parameter settings themselves be studied as independent variables, and that their effects be measured on artificial data.) A greater problem occurs when one uses an algorithm that has been used before: that algorithm may already have been tuned
on public databases. Therefore one cannot even know how much tuning took place, and any attempt to claim significant results on known data is simply not valid.

Fortunately, one can perform virtually unlimited tuning without corrupting the validity of the results. The solution is to use cross-validation³ entirely within the training set itself. The recommended procedure is to reserve a portion of the training set as a "tuning" set, and to repeatedly test the algorithm and adjust its parameters on the tuning set. When the parameters appear to be at their optimal settings, accuracy can finally be measured on the test data.
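A minimal sketch of this procedure follows; it assumes scikit-learn, and the nearest-neighbor learner, the candidate parameter values, and the 25% tuning split are arbitrary choices for illustration rather than recommendations from the paper.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def tune_and_evaluate(X_train, y_train, X_test, y_test, k_values=(1, 3, 5, 7)):
    # Reserve a portion of the training set for tuning; the test set is never touched.
    X_fit, X_tune, y_fit, y_tune = train_test_split(
        X_train, y_train, test_size=0.25, random_state=0)

    best_k, best_acc = None, -1.0
    for k in k_values:   # each parameter setting tried is, in effect, an experiment
        acc = KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_fit).score(X_tune, y_tune)
        if acc > best_acc:
            best_k, best_acc = k, acc

    # Retrain with the chosen parameters on the full training set,
    # then measure accuracy once on the held-out test data.
    final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    return best_k, final.score(X_test, y_test)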
Thus, when doing comparative evaluations, everything that is done to modify or prepare the algorithms must be done in advance of seeing the test data. This point will seem obvious to many experimental researchers, but the fact is that papers are still appearing in which this methodology is not followed. In the survey by Flexer (1996), only 3 out of 43 experimental papers in leading neural network journals used a separate data set for parameter tuning; the remaining 40 papers either did not explain how they adjusted parameters or else did their adjustments after using the test set. Thus this point is worth stating explicitly: any tweaking or modifying of code, any decisions about experimental design, and any other parameters that may affect performance must be fixed before the experimenter has the test data. Otherwise the conclusions might not even be valid for the data being used, much less any larger population of problems.
3.4. Generalizing results
A common methodological approach in recent comparative studies is to pick several datasets from the UCI repository (or some other public source of data) and perform a series of experiments, measuring classification accuracy, learning rates, and perhaps other criteria. Whether or not the choice of datasets is random, the use of such experiments to make more general statements about classification algorithms is not necessarily valid.
In fact, if one draws samples from a population to conduct an experiment, any inferences one makes can only be applied to the original population, which in this case means the UCI repository itself. It is not valid to make general statements about other datasets. The only way this might be valid would be if the UCI repository were known to represent a larger population of classification problems. In fact, though, as argued persuasively by Holte (1993) and others, the UCI repository is a very limited sample of problems, many of which are quite easy for a classifier. (Many of them may represent the same concept class; for example, many might be almost linearly separable, as suggested by the strong performance of the perceptron algorithm in one well-known comparative study (Shavlik, Mooney and Towell, 1991).) Thus the evidence is strong that results on the UCI datasets do not apply to all classification problems, and the repository is not an unbiased "sample" of classification problems.
This is not by any means to say that the UCI repository should not exist. The repository serves several important functions. Having a public repository keeps the community honest, in the sense that any published results can be checked. A second function is that any new idea can be "vetted" against the repository as a way of checking it for plausibility. If it works well, it might be worth further investigation. If not, of course it may still be a good idea, but the creator will have to think up other means to demonstrate its feasibility. The dangers
of repeated re-use of the same data should spur the research community to continually enlarge the repository, so that over time it will become more representative of classification problems in general. In addition, the use of artificial data should always be considered as a way to test more precisely the strengths and weaknesses of a new algorithm. The data mining community must likewise be vigilant not to come to rely too heavily on any standard set of databases.
There is a second, related problem with making broad statements based on results from a common repository of data. The problem is that someone can write an algorithm that works well on some of these datasets specifically; i.e., the algorithm is designed with the datasets in mind. If researchers become familiar with the datasets, they are likely to be biased when developing new algorithms, even if they do not consciously attempt to tailor their algorithms to the data. In other words, people will be inclined to consider problems such as missing data or outliers because they know these are problems represented in the public datasets. However, these problems may not be important for different datasets. It is an unavoidable fact that anyone familiar with the data may be biased.
3.5. A recommended approach
Although it is probably impossible to describe a methodology that will serve as a "recipe" for all comparisons of computational methods, here is an approach that will allow the designer of a new algorithm to establish the new algorithm's comparative merits.
1. Choose other algorithms to include in the comparison. Make sure to include the algorithm that is most similar to the new algorithm.

2. Choose a benchmark data set that illustrates the strengths of the new algorithm. For example, if the algorithm is supposed to handle large attribute spaces, choose a data set with a large number of attributes.

3. Divide the data set into k subsets for cross validation. A typical experiment uses k = 10, though other values may be chosen depending on the data set size. For a small data set, it may be better to choose a larger k, since this leaves more examples in the training set.

4. Run a cross-validation as follows.

(A) For each of the k subsets of the data set D, create a training set T consisting of D minus that held-out subset.

(B) Divide each training set into two smaller subsets, T1 and T2. T1 will be used for training, and T2 for tuning. The purpose of T2 is to allow the experimenter to adjust all the parameters of his algorithm. This methodology also forces the experimenter to be more explicit about what those parameters are.

(C) Once the parameters are optimized, re-run training on the larger set T.

(D) Finally, measure accuracy on the held-out subset.

(E) Overall accuracy is averaged across all k partitions. These k values also give an estimate of the variance of the algorithms.

5. To compare algorithms, use the binomial test described in Section 3.1, or the McNemar variant of that test.
(It may be tempting, because it is easy to do with powerful computers, to run many cross-validations on the same data set, and report each cross-validation as a single trial. However,
this would not produce valid statistics, because the trials in such a design are highly interdependent.) The above procedure applies to a single data set. If an experimenter wishes to compare algorithms on multiple data sets, then the significance levels used should be made more stringent via the Bonferroni adjustment. This is a conservative adjustment that will tend to miss significance in some cases, but it enforces a high standard for reporting that one algorithm is better than another. A code sketch of the procedure above appears below.
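The following sketch is an illustration of the recommended procedure, not code from the paper; it assumes scikit-learn and SciPy (version 1.7 or later, for binomtest) are available, the function and parameter names are invented, and it pools the disagreement counts across the k folds before applying the two-sided binomial test, which is one natural reading of step 5. Each training routine passed in is expected to do its own parameter tuning on a split of its training data only (for example, in the style of the tuning sketch in Section 3.3) and then to refit on the full training set.

import numpy as np
from scipy.stats import binomtest
from sklearn.model_selection import StratifiedKFold

def compare(train_a, train_b, X, y, k=10, seed=0):
    """train_a and train_b are callables (X_train, y_train) -> fitted model.
    X and y are NumPy arrays; the test fold is never used for tuning."""
    wins_a = wins_b = 0
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        model_a = train_a(X[train_idx], y[train_idx])
        model_b = train_b(X[train_idx], y[train_idx])
        correct_a = model_a.predict(X[test_idx]) == y[test_idx]
        correct_b = model_b.predict(X[test_idx]) == y[test_idx]
        wins_a += int(np.sum(correct_a & ~correct_b))   # A right, B wrong
        wins_b += int(np.sum(correct_b & ~correct_a))   # B right, A wrong
    # Ties (both right or both wrong) are thrown out; two-sided binomial test
    # on the remaining disagreements, with p = 0.5 under the null hypothesis.
    n = wins_a + wins_b
    p_value = binomtest(wins_a, n, 0.5).pvalue if n > 0 else 1.0
    return wins_a, wins_b, p_value

When several data sets are compared in this way, the resulting p-values should be judged against the more stringent Bonferroni-adjusted threshold discussed above rather than against 0.05.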
4. Conclusion
As some researchers have recently observed, no single classification algorithm is the best for all problems. The same observation can be made for data mining: no single technique is likely to work best on all databases. Recent theoretical work has shown that, with certain assumptions, no classifier is always better than another one (Wolpert, 1992). However, experimental science is concerned with data that occurs in the real world, and it is not clear that these theoretical limitations are relevant. Comparative studies typically include at least one new algorithm and several known methods; these studies must be very careful about their methods and their claims. The point of this paper is not to discourage empirical comparisons; clearly, comparisons and benchmarks are an important, central component of experimental computer science, and can be very powerful if done correctly. Rather, this paper's goal is to help experimental researchers steer clear of problems in designing a comparative study.
Acknowledgments
Thanks to Simon Kasif and Eric Brill for helpful comments and discussions. Thanks to Alan Salzberg for help with the statistics and for other comments. This work was supported in part by the National Science Foundation under Grants IRI-9223591 and IRI-9530462, and by the National Institutes of Health under Grant K01-HG00022-01. The views expressed here are the author's own, and do not necessarily represent the views of those acknowledged above, the National Science Foundation, or the National Institutes of Health.
Notes
1. This study was never published; a similar study is (Kohavi and Sommerfield, 1995), which used 22 datasets and 4 algorithms.

2. A p-value is simply the probability that a result occurred by chance. Thus a p-value of 0.05 indicates that there was a 5% probability that the observed results were merely random variation of some kind, and do not indicate a true difference between the treatments. For a description of how to perform a paired t-test, see Hildebrand (1986) or another introductory statistics text.

3. Cross-validation refers to a widely used experimental testing procedure. The idea is to break a data set up into k disjoint subsets of approximately equal size. Then one performs k experiments, where for each experiment one of the k subsets is removed, the system is trained on the remaining data, and then the trained system is tested on the held-out subset. At the end of this k-fold cross-validation, every example has been used in a test set exactly once. This procedure has the advantage that all the test sets are independent. However, the training sets are clearly not independent, since they overlap each other substantially. The limiting case is to set k = n, where n is the size of the entire data set; this form of cross-validation is sometimes called "leave one out."