TECHNISCHE UNIVERSITÄT MÜNCHEN
Chair of Computer Science XX
From Adversarial Learning to Reliable and
Scalable Learning
Han Xiao
Complete copy of the dissertation approved by the Faculty of Informatics of the Technische Universität München for the award of the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.).
Chair: Univ.-Prof. Dr. Helmut Seidl
Examiners of the dissertation: 1. Univ.-Prof. Dr. Claudia Eckert
2. Univ.-Prof. Dr. Daniel Cremers
The dissertation was submitted to the Technische Universität München on 14.08.2014 and accepted by the Faculty of Informatics on 02.03.2015.
Abstract
Machine learning is now considered a vital tool for data analysis and automatic decision making in many modern enterprise systems. However, there is an emerging threat that adversaries can mislead the decisions of a learning algorithm by introducing security faults into the system. Previous security research did not closely examine the vulnerabilities of learning algorithms to adversarial manipulation. Understanding these threats is the only way to build robust learning algorithms for security-sensitive applications. This dissertation is organized in three parts, which contribute new results in adversarial, reliable, and scalable machine learning, respectively.
The first part of this dissertation studies how machine learning algorithms behave in the presence of an adversary. In particular, I provide analyses of the exploratory attack on convex-inducing classifiers and the causative attack on support vector machines. The analyses build on tools from convex geometry and optimization theory. Using real-world data, I demonstrate the devastating impact of the attack algorithms on a newsletter classifier and a face recognition system.
The second part focuses on developing reliable learning algorithms that are resilient to adversarial noise. I consider the problem of learning from multiple observers, in which each instance is associated with multiple but unreliable labels. To solve this problem, I develop a hierarchical Gaussian process model and treat the ground-truth label as a latent variable. The parameters of the model can be estimated efficiently by maximizing the posterior. The successful application of my method to the task of aesthetic score assessment should be of great interest to practitioners.
The third part concentrates on developing scalable online learning algorithms for security applications. I propose three systematic approaches for learning from large-scale data streams. The first method employs a set of Gaussian process models to perform real-time online regression. The second method is based on a variant of the second-order perceptron and predicts the upcoming label in a sequence. The last method provides a novel distributed learning framework for client-server settings. It can learn from partially labeled data while minimizing the communication cost over the network.
Summary
Machine learning is nowadays an essential tool for data analysis and automatic decision making in many modern enterprise systems. However, this also gives rise to new kinds of attack vectors. It is particularly critical that attackers can deliberately mislead the learning algorithm by exploiting security vulnerabilities. Surprisingly, such attacks have hardly been studied in existing research. To develop secure learning algorithms, however, a thorough understanding of these forms of attack is necessary.
This dissertation is organized in three parts. The first part investigates how learning algorithms behave under targeted manipulation by an attacker. Building on the knowledge gained, the second part develops reliable learning algorithms that are immune to the attacker's manipulations. Finally, the third part presents scalable online learning algorithms for security applications.
Acknowledgments
First and foremost, I would like to thank my advisor, Professor Claudia Eckert, for the encouragement, guidance, and support she has offered me throughout my graduate career. Moreover, the freedom Claudia gave me allowed me to pursue my own research interests. She shared my excitement when I had accomplishments and offered me encouragement when I was frustrated. Claudia made innumerable contributions to my development as a researcher and to my ambition to be a data scientist.
I would like to thank Professor Shou-De Lin for inviting me to a six-month research visit at National Taiwan University. Shou-De, with his kindness and invaluable experience, guided me to finish Chapter 11. I would also like to thank Phillip B. Gibbons for his suggestions, insights, and revisions on Chapter 11. It is Shou-De and Phillip's dedication that made this chapter possible.
I would particularly like to thank Huang Xiao for his hard work and critical contributions to Chapter 6, Chapter 7 and Chapter 8. He is an extraordinary collaborator and friend. For his critical suggestions on Chapter 4, I would like to thank Thomas Stibor. I would also like to thank Professor Ping Luo for his useful feedback on Chapter 9 and Chapter 11. I would like to thank Professor Takehisa Yairi for the inspiring discussion on Chapter 7, and Nan Li for her feedback on Chapter 9. I thank Ruei-Bin Wang for his comments on Chapter 11. I thank Yu-Rong Tao for her collaboration and persistent hard work on some experiments in Chapter 9 and Chapter 10.
Many others have helped me over my graduate career. I cannot thank all these individuals enough for their support, but I would like to call attention to a few. I would like to thank Xin-Chang Liu and Cheetah Lin for being good friends who were always willing to listen and provided useful advice. In addition, I would like to thank Petra Lorenz and Alexander Lüdtke for their help and assistance of all kinds in my life and research career in Germany.
Finally, I offer my regards and blessings to all of those, especially my parents, who supported me in any respect during the completion of my dissertation. Without them, this work would not have been possible.
I gratefully acknowledge the support of my sponsors. This work was supported in part by HIVE (Hypervisorbasierte Innovative VErfahren zur Anomalieerkennung mit Hardwareunterstützung), which receives support from the German Federal Ministry of Education and Research under grant FKZ16BY1200D; and in part by the National Science Council, National Taiwan University, and Intel Corporation under grants NSC102-2911-I-002-001 and NTU103R7501. I would also like to thank the China Scholarship Council for granting me the award for outstanding students abroad.
Publications
[1] Han Xiao and Claudia Eckert. Efficient Online Sequence Prediction with Side Information. IEEE International Conference on Data Mining, 2013.
[2] Han Xiao and Claudia Eckert. Lazy Gaussian Process Committee for Real-Time Online Regression. AAAI Conference on Artificial Intelligence, 2013.
[3] Han Xiao, Huang Xiao and Claudia Eckert. Learning from Multiple Observers with Unknown Expertise. Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2013.
[4] Han Xiao, Huang Xiao and Claudia Eckert. Adversarial Label Flips Attack on Support Vector Machines. European Conference on Artificial Intelligence, 2012.
[5] Han Xiao and Thomas Stibor. Evasion Attack on Multi-Class Linear Classifier. Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2012.
[6] Han Xiao and Thomas Stibor. Supervised Topic Transition Model for Detecting Malicious System Call Sequences. SIGKDD workshop: Knowledge Discovery, Modeling and Simulation, 2011. (Best paper award)
[7] Han Xiao and Thomas Stibor. Toward Artificial Synesthesia: Linking Pictures and Sounds via Words. NIPS workshop: Next Generation Computer Vision Challenges, 2010.
[8] Han Xiao and Thomas Stibor. Efficient Collapsed Gibbs Sampling For Latent Dirichlet Allocation. Asian Conference on Machine Learning, 2010.