Xiaochun Wang
Xiali Wang
Mitch Wilkes
New Developments in Unsupervised Outlier Detection
Algorithms and Applications
Xiaochun Wang
School of Software Engineering
Xi’an Jiaotong University
Xi’an, Shaanxi, China

Xiali Wang
School of Information Engineering
Chang’an University
Xi’an, Shaanxi, China

Mitch Wilkes
Department of Electrical Engineering and Computer Science
Vanderbilt University
Nashville, TN, USA
ISBN 978-981-15-9518-9    ISBN 978-981-15-9519-6 (eBook)
https://doi.org/10.1007/978-981-15-9519-6
Jointly published with Xi’an Jiaotong University Press
The print edition is not for sale in China (Mainland). Customers from China (Mainland) please order the
print book from: Xi’an Jiaotong University Press.
© Xi’an Jiaotong University Press 2021
This work is subject to copyright. All rights are reserved by the Publishers, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publishers, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publishers nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publishers remain neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721,
Singapore
Foreword
As an active research topic in data mining, outlier detection aims to discover
observations in a dataset that deviate so much from the other observations as to arouse
suspicion that they were generated by a different mechanism, and it is of utmost
importance in many application domains. Unsupervised outlier detection plays a crucial
role in outlier detection research and poses enormous theoretical and applied
challenges to advanced data mining technology using unsupervised learning
techniques. This monograph addresses unsupervised outlier detection in the local setting
of the k-nearest neighborhood. Unlike traditional distribution-based outlier detection
techniques, k-nearest neighbor-based outlier detection approaches, typified by
distance-based and density-based outlier detection methods, have become more and more
popular. However, these methods are very sensitive to the value of k and may produce
different rankings for the top outliers, and doubts exist in general as to
whether they work well for high-dimensional datasets. To partially circumvent
these problems, the algorithms proposed for unsupervised outlier detection
in the current research combine k-nearest neighbor-based outlier detection methods
and genetic clustering algorithms.
Distance-based outliers and density-based outliers denote two different kinds of
definitions for outlier detection algorithms. Distance-based outlier detection methods
can identify more globally oriented outliers, while density-based outlier detection
methods can identify more locally distributed outliers. In this book, several new
global outlier factors and new local outlier factors are proposed, and efficient
and effective outlier detection algorithms are developed upon them that are
easy to implement and can provide performance competitive with existing solutions.
Having been exploited in outlier detection research for years, distance-based and
density-based outlier detection methods work in principle by calculating the k-nearest
neighbors of each data point, computing an outlier score for each point, ranking all the
objects according to their scores, and finally returning the data points with the largest
scores as outliers. However, there is no reason to assume that the top-ranked points are
in fact outliers. To take this aspect into account, several outlier indicators are
introduced to judge whether distance-based and density-based outliers exist or not. In
this way, outliers can be not only detected but also discriminated from boundary points.
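The generic pipeline described above (find the k-nearest neighbors, score, rank, return the top points) can be sketched as follows. This is only a minimal illustration, not one of the algorithms developed in this book: the choice of the mean k-nearest-neighbor distance as the score, the function names `knn_outlier_scores` and `top_outliers`, and the toy data are all assumptions made for the example.

```python
import math

def knn_outlier_scores(points, k):
    """Score each point by its mean distance to its k nearest neighbors."""
    if not 1 <= k < len(points):
        raise ValueError("k must satisfy 1 <= k < len(points)")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point (brute force, O(n^2) overall).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

def top_outliers(points, k, n):
    """Return the indices of the n points with the largest scores, highest first."""
    scores = knn_outlier_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

# Hypothetical toy data: a tight cluster plus one distant point at index 5.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05), (5.0, 5.0)]
print(top_outliers(data, k=2, n=1))  # prints [5]
```

Note that the ranking step returns n points whether or not the data contain any genuine outliers, which is the limitation the outlier indicators are meant to address.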
It is generally agreed that learning, either supervised or unsupervised, can provide
the best possible specification of known classes and offer inference for outlier
detection by a dissimilarity threshold from the nominal feature space. Novel object
detection can take a step further by investigating whether these outliers form new dense
clusters in both the feature space and the image space. By defining a novel object to
be a pattern group that has not been seen before in the feature space and the image
space, a nonconventional approach is proposed for multiple novel object detection
applications.
Time series often contain outliers and structural changes. These unexpected events
are of the utmost importance in fraud detection, as they may pinpoint suspicious
activities. The presence of such unusual activities can easily mislead conventional time
series analysis and yield erroneous conclusions. Traditionally, time series data are
first divided into small chunks, and k-nearest neighbor-based outlier detection approaches
are then applied for monitoring behavior over time. However, time series data are
very large, so they cannot be scanned multiple times. Further,
as they are produced continuously, new data keep arriving. To cope with the speed at
which they arrive, a simple statistical parameter-based anomaly method is proposed for
environmental time series data fraud detection applications.
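To make the single-scan constraint concrete, the sketch below keeps a running mean and standard deviation and flags readings that lie far from the running mean. The foreword does not specify the statistical method used in the book, so everything here is an assumption chosen for illustration: Welford's one-pass update, the names `RunningStats` and `flag_anomalies`, the threshold of three standard deviations, the warm-up length, and the sample readings.

```python
import math

class RunningStats:
    """One-pass mean and variance using Welford's update."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; zero until at least two values are seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

def flag_anomalies(stream, threshold=3.0, warmup=5):
    """Yield (index, value) for readings far from the running mean."""
    stats = RunningStats()
    for i, x in enumerate(stream):
        if stats.n >= warmup:
            s = stats.std()
            if s > 0 and abs(x - stats.mean) > threshold * s:
                yield i, x
                continue  # keep flagged readings out of the running statistics
        stats.update(x)

# Hypothetical sensor readings with one abrupt jump at index 6.
readings = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 25.0, 10.1]
print(list(flag_anomalies(readings)))  # prints [(6, 25.0)]
```

Each reading is processed once and then discarded, so memory use stays constant no matter how long the stream runs.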
The chapters cover such topics as distance-based outlier detection, density-based
outlier detection, clustering-based outlier detection, and the applications of these
techniques to boundary point detection, novel object detection, and fraud
detection in environmental time series data. Overall, the book features a perspective on
bridging the gap between k-nearest neighbor-based outlier detection and
clustering-based outlier detection, laying the groundwork for future advances in unsupervised
outlier detection research. I hope that New Developments in Unsupervised Outlier
Detection: Algorithms and Applications will serve as an invaluable reference for outlier
detection researchers for years to come.
Xi’an, China    Xubang Shen
May 2020    Chinese National Academician
Preface
Data mining represents a complex of technologies that are rooted in many disciplines
(mathematics, statistics, computer science, physics, engineering, biology, etc.) and
that have diverse applications in a large variety of domains (business, healthcare,
science and engineering, etc.). Basically, data mining can be seen as the science of
exploring large datasets to extract implicit, previously unknown, and potentially
useful information. Recently, outlier detection as a research area in data mining has
advanced dramatically. A multitude of data mining techniques have been developed
that have had an impact on unsupervised outlier detection. Our aim in writing this book is
to provide a friendly and comprehensive guide for those interested in exploring this
fascinating domain. In other words, the purpose of this book is to provide easy access
to the recent contributions to unsupervised outlier detection theory and to assess its
impact on the field and its implications for theory and practice. It is also intended to
be used as an introductory text for advanced undergraduate-level or graduate-level
courses in computer science, engineering, or other fields. In this regard, the book is
intended to be largely self-contained, although it is assumed that the potential reader
has a good knowledge of mathematics, statistics, and computer science.
The book is organized as follows. The first part of this book reviews
the state-of-the-art unsupervised techniques used in outlier detection. The material
presented in the second part of this book is an extended version of several selected
conference articles and represents some of the most recent important advancements
in the field of unsupervised outlier detection. In the third part of this book, outlier
detection techniques are applied to practical problems. More specifically, the first
part consists of two chapters. In Chap. 1, an overview of the book chapters and a
summary of contributions are presented. First, the research issues on unsupervised
outlier detection are explained. An overview of the book then follows. Finally,
contributions are highlighted. In Chap. 2, some well-known unsupervised outlier
detection techniques and models are reviewed. This chapter begins with an overview
of some of the many facets of outlier analysis. Then, it investigates some standard
outlier detection approaches. Finally, the problem of evaluating the performance of
different outlier detection models is discussed. The second part consists of five
chapters, which add to the ever-growing list of unsupervised outlier detection models.
In Chap. 3, a divisive hierarchical clustering algorithm is explored as a solution for
fast distance-based outlier detection problems. In Chap. 4, a new k-nearest neighbor
centroid-based outlier detection method is proposed for both distance-based and
density-based outlier detection tasks. In Chap. 5, we present a new fast minimum
spanning tree-inspired algorithm for outlier detection tasks. In Chap. 6, an efficient
spectral clustering-based outlier detection algorithm is proposed to extract
information from data in such a way that distribution-based outlier detection techniques can
be employed for multi-dimensional data. In Chap. 7, an outlier indicator is proposed
to enhance outlier detection, making the selection of appropriate parameters less
difficult but more meaningful. The performance evaluated on some standard datasets
demonstrates the effectiveness and efficiency of these methods. The third part of this
book is concerned with the applications of outlier detection techniques in real-life
problems. Following the techniques discussed in the second part, we devote one
chapter each to a boundary point detection problem (Chap. 8), a novel object
detection problem (Chap. 9), and a time series fraud detection problem
(Chap. 10). An extensive bibliography is
included, which is intended to provide the reader with useful information covering
all the topics approached in this book.
Last, but certainly not least, it is our hope that graduate students, young and senior
researchers, and professionals from both academia and industry will find the book
useful for understanding and reviewing current approaches in unsupervised outlier
detection research.
Xi’an, China    Xiaochun Wang
Xi’an, China    Xiali Wang
Nashville, USA    Mitch Wilkes
June 2020
Acknowledgements
First and foremost, the authors would like to thank the National Natural Science
Foundation of China for its valuable support of this work under award 61473220 and
the Natural Science Foundation of Shaanxi Province, China, for its valuable support of
this work under award 2020JM-046. Without this support, this work would not have
been possible.
The authors gratefully acknowledge the contribution of many people. First of all,
they would like to take this opportunity to acknowledge the work of the graduate
students of the School of Software Engineering at Xi’an Jiaotong University, Yiqin
Chen, Yongqiang Ma, Yuan Wang, and Jia Li, for their diligence and quality work
throughout these projects. More specifically, Y. Chen developed a k-nearest neighbor
centroid-based outlier detection algorithm and applied it to boundary point detection.
Y. Ma developed a mini MST-based outlier detection algorithm. Y. Wang proposed
a spectral clustering-based outlier detection algorithm. J. Li carried out all the
outlier detection experiments for spectral clustering-based outlier detection on real
multi-dimensional datasets. The authors would also like to thank Yuan Bao of Xi’an
Jiaotong University Press for her timely suggestions and encouragement with the
preparation of the manuscript.
Finally, the authors wish to express their deep gratitude to their families for their
assistance in many ways for the successful completion of this book.
Contents
Part I Introduction
1 Overview and Contributions
1.1 Introduction
1.2 Research Issues on Unsupervised Outlier Detection
1.3 Overview of the Book
1.4 Contributions
1.5 Conclusions
2 Developments in Unsupervised Outlier Detection Research
2.1 Introduction
2.1.1 A Brief Overview of the Early Developments in Outlier Analysis
2.2 Some Standard Unsupervised Outlier Detection Approaches
2.2.1 Probabilistic Model-Based Outlier Detection Approach
2.2.2 Clustering-Based Outlier Detection Approaches
2.2.3 Distance-Based Outlier Detection Approaches
2.2.4 Density-Based Outlier Detection Approaches
2.2.5 Outlier Detection for Time Series
2.3 Performance Evaluation Metrics of Outlier Detection Approaches
2.3.1 Precision, Recall and Rank Power
2.4 Conclusions
References