Springer Handbooks of Computational Statistics
Wolfgang Karl Härdle
Henry Horng-Shing Lu
Xiaotong Shen Editors
Handbook of Big Data Analytics
Springer Handbooks of Computational Statistics
Series editors
James E. Gentle
Wolfgang K. Härdle
Yuichi Mori
More information about this series at http://www.springer.com/series/7286
Editors
Wolfgang Karl Härdle
Ladislaus von Bortkiewicz Chair of Statistics
C.A.S.E. Center for Applied Statistics & Economics
Humboldt-Universität zu Berlin
Berlin, Germany
Henry Horng-Shing Lu
Institute of Statistics
National Chiao Tung University
Hsinchu, Taiwan
Xiaotong Shen
School of Statistics
University of Minnesota
Minneapolis, USA
ISSN 2197-9790 ISSN 2197-9804 (electronic)
Springer Handbooks of Computational Statistics
ISBN 978-3-319-18283-4 ISBN 978-3-319-18284-1 (eBook)
https://doi.org/10.1007/978-3-319-18284-1
Library of Congress Control Number: 2018948165
© Springer International Publishing AG, part of Springer Nature 2018
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, express or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
Printed on acid-free paper
This Springer imprint is published by the registered company Springer International Publishing AG part
of Springer Nature.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
A tremendous growth of high-throughput techniques leads to huge data collections
that are accumulating at an exponential rate with high volume, velocity, and
variety. This creates the challenges of big data analytics for statistical science.
These challenges demand creative innovation of statistical methods and smart
computational Quantlets, macros, and programs to capture the often genuinely sparse
informational content of huge unstructured data. In particular, the development of
analytic methodologies must take into account the co-design of hardware and
software to handle the massive data corpus so that veracity can be achieved
and the inherent value can be revealed.
This volume of the Handbook of Computational Statistics series collects twenty-one
chapters to provide advice and guidance on the fast developments in big
data analytics. It covers a wide spectrum of methodologies and applications
that provide a general overview of this new and exciting field of computational
statistics. The chapters present topics related to mathematics and statistics for
high-dimensional problems, nonlinear structures, sufficient dimension reduction,
spatiotemporal dependence, functional data analysis, graphic modeling, variational
Bayes, compressive sensing, density functional theory, and supervised and
semi-supervised learning. The applications include business intelligence, finance, image
analysis, compressive sensing, climate change, text mining, neuroscience, and data
visualization in very large dimensions. Many of the methods that we present are
reproducible in R, MATLAB, or Python. Details of the Quantlets are
found at http://www.quantlet.com/.
We would like to acknowledge the dedicated work of all the contributing authors,
reviewers, and members of the editorial office of Springer, including Alice Blanck,
Frank Holzwarth, Jessica Fäcks, and the related members. Finally, we also thank
our families and friends for their great support throughout the long editing process.
Berlin, Germany Wolfgang Karl Härdle
Hsinchu, Taiwan Henry Horng-Shing Lu
Minneapolis, USA Xiaotong Shen
July 4, 2017
Contents
Part I Overview
1 Statistics, Statisticians, and the Internet of Things
John M. Jordan and Dennis K. J. Lin
2 Cognitive Data Analysis for Big Data
Jing Shyr, Jane Chu, and Mike Woods
Part II Methodology
3 Statistical Leveraging Methods in Big Data
Xinlian Zhang, Rui Xie, and Ping Ma
4 Scattered Data and Aggregated Inference
Xiaoming Huo, Cheng Huang, and Xuelei Sherry Ni
5 Nonparametric Methods for Big Data Analytics
Hao Helen Zhang
6 Finding Patterns in Time Series
James E. Gentle and Seunghye J. Wilson
7 Variational Bayes for Hierarchical Mixture Models
Muting Wan, James G. Booth, and Martin T. Wells
8 Hypothesis Testing for High-Dimensional Data
Wei Biao Wu, Zhipeng Lou, and Yuefeng Han
9 High-Dimensional Classification
Hui Zou
10 Analysis of High-Dimensional Regression Models Using Orthogonal Greedy Algorithms
Hsiang-Ling Hsu, Ching-Kang Ing, and Tze Leung Lai
11 Semi-supervised Smoothing for Large Data Problems
Mark Vere Culp, Kenneth Joseph Ryan, and George Michailidis
12 Inverse Modeling: A Strategy to Cope with Non-linearity
Qian Lin, Yang Li, and Jun S. Liu
13 Sufficient Dimension Reduction for Tensor Data
Yiwen Liu, Xin Xing, and Wenxuan Zhong
14 Compressive Sensing and Sparse Coding
Kevin Chen and H. T. Kung
15 Bridging Density Functional Theory and Big Data Analytics with Applications
Chien-Chang Chen, Hung-Hui Juan, Meng-Yuan Tsai, and Henry Horng-Shing Lu
Part III Software
16 Q3-D3-LSA: D3.js and Generalized Vector Space Models for Statistical Computing
Lukas Borke and Wolfgang K. Härdle
17 A Tutorial on Libra: R Package for the Linearized Bregman Algorithm in High-Dimensional Statistics
Jiechao Xiong, Feng Ruan, and Yuan Yao
Part IV Application
18 Functional Data Analysis for Big Data: A Case Study on California Temperature Trends
Pantelis Zenon Hadjipantelis and Hans-Georg Müller
19 Bayesian Spatiotemporal Modeling for Detecting Neuronal Activation via Functional Magnetic Resonance Imaging
Martin Bezener, Lynn E. Eberly, John Hughes, Galin Jones, and Donald R. Musgrove
20 Construction of Tight Frames on Graphs and Application to Denoising
Franziska Göbel, Gilles Blanchard, and Ulrike von Luxburg
21 Beta-Boosted Ensemble for Big Credit Scoring Data
Maciej Zięba and Wolfgang Karl Härdle
Part I
Overview
Chapter 1
Statistics, Statisticians, and the Internet of Things
John M. Jordan and Dennis K. J. Lin
Abstract Within the overall rubric of big data, one emerging subset holds particular
promise, peril, and attraction. Machine-generated traffic from sensors, data logs, and
the like, transmitted using Internet practices and principles, is being referred to as
the “Internet of Things” (IoT). Understanding, handling, and analyzing this type of
data will stretch existing tools and techniques, thus providing a proving ground for
other disciplines to adopt and adapt new methods and concepts. In particular, new
tools will be needed to analyze data in motion rather than data at rest, and there
are consequences of having constant or near-constant readings from the
ground-truth phenomenon as opposed to numbers at a remove from their origin. Both
machine learning and traditional statistical approaches will coevolve rapidly given
the economic forces, national security implications, and wide public benefit of this
new area of investigation. At the same time, data practitioners will be exposed to the
possibility of privacy breaches, accidents causing bodily harm, and other concrete
consequences of getting things wrong in theory and/or practice. We contend that the
physical instantiation of data practice in the IoT means that statisticians and other
practitioners may well be seeing the origins of a post-big data era insofar as the
traditional abstractions of numbers from ground truth are attenuated and in some
cases erased entirely.
Keywords Machine traffic · Internet of Things · Sensors · Machine learning ·
Statistical approaches to big data
J. M. Jordan
Department of Supply Chain & Information Systems, Smeal College of Business, Pennsylvania
State University, University Park, PA, USA
e-mail: [email protected]
D. K. J. Lin (✉)
Department of Statistics, Eberly College of Science, Pennsylvania State University, University
Park, PA, USA
e-mail: [email protected]
© Springer International Publishing AG, part of Springer Nature 2018
W. K. Härdle et al. (eds.), Handbook of Big Data Analytics, Springer Handbooks
of Computational Statistics, https://doi.org/10.1007/978-3-319-18284-1_1
Description: This essential guide to a broad spectrum of big data analytics in cross-disciplinary applications focuses on the statistical prospects offered by recent developments in this field. To do so, it covers statistical methods for high-dimensional problems, algorithmic designs, computation tools, analysis